Element-wise vector addition microbenchmarks.

Posted on March 4, 2018 by Fūzetsu

At work we have found a need for a function

vsum :: [Vector Double] -> Vector Double

This was a fairly hot function, so making it quick would be nice. I had written a small benchmark in the project at the time, but it was difficult to measure anything with everything else going on, and we only had a limited data set. In the benchmarks I use a non-empty list of vectors instead (which is always fine for us: it removes a runtime check and a failure case). There is also an assumption that all the vectors are of uniform length.
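Concretely, every benchmarked variant shares the shape below; the body shown here is the simplest one, FoldZip, reproduced from the appendix:

import Data.Foldable (foldl')
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.Vector.Generic as VG

vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v :| vs) = foldl' (VG.zipWith (+)) v vs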

Today I finally broke through and decided to just isolate the function and write a benchmark suite. I have published the code on GitHub. I don’t plan on putting it on Hackage as it’s neither a library nor a useful executable.

What I don’t do in the repository is discuss the results I got, so that’s what this post is for. I will be referring to functions defined in the repository; I have copied them at the bottom of the post for quick reference.

Single vector, many elements

The input sample is a single vector with 1 million elements.

FoldZip, RecurseZip and RecurseZipWithN all perform about the same: after all, their only job here is to return the input vector.

The ST family of implementations suffers. It’s pretty easy to guess why; let’s look at one of the implementations:

vsum (v :| vs) = ST.runST $ do
  vec <- VG.thaw v
  let vlen = VG.length v
  forM_ vs $ \v ->
    let go n | n == vlen = pure ()
        go n = do
          VG.unsafeModify vec (+ VG.unsafeIndex v n) n
          go (n + 1)
    in go 0
  VG.unsafeFreeze vec

We just thawed the vector (which copies it) when it was not necessary: for a single-vector input we should simply yield the vector. I think we could fix this in two ways:

  1. Add a vsum (v :| []) = v case. This handles the awkward edge case directly; see the sketch right after this list.
  2. Use modify. This will only copy the vector when needed. In this case it won’t be needed: we perform no destructive updates, so we should get off scot-free at low cost.
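For reference, here is option 1 grafted onto the UncheckedStFromBack body; this is essentially the code that reappears as the overall winner in the conclusion:

vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v0 :| []) = v0 -- bail out: nothing to add, nothing to copy
vsum (v0 :| vs) = ST.runST $ do
  vec <- VG.thaw v0
  let vlen = VG.length v0
  forM_ vs $ \v ->
    let go 0 = pure ()
        go n = do
          let n1 = n - 1
          VG.unsafeModify vec (+ VG.unsafeIndex v n1) n1
          go n1
    in go vlen
  VG.unsafeFreeze vec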

Let me add both of these and compare the results. As I’m lazy, I will only add them for the UncheckedStFromBack variant: I don’t want to add 8 new modules, and as the later results show, this is the “superior” version anyway.

benchmarking Data.Vector.Unboxed/1x1000000/FoldZip
time                 9.012 ns   (8.963 ns .. 9.072 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 7.168 ns   (6.909 ns .. 7.418 ns)
std dev              785.4 ps   (658.9 ps .. 962.8 ps)
variance introduced by outliers: 93% (severely inflated)

benchmarking Data.Vector.Unboxed/1x1000000/RecurseZip
time                 9.251 ns   (9.203 ns .. 9.317 ns)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 7.325 ns   (7.030 ns .. 7.613 ns)
std dev              785.3 ps   (638.0 ps .. 961.4 ps)
variance introduced by outliers: 93% (severely inflated)

benchmarking Data.Vector.Unboxed/1x1000000/RecurseZipWithN
time                 8.712 ns   (8.554 ns .. 8.997 ns)
                     0.996 R²   (0.990 R² .. 1.000 R²)
mean                 6.846 ns   (6.525 ns .. 7.243 ns)
std dev              964.1 ps   (717.2 ps .. 1.391 ns)
variance introduced by outliers: 96% (severely inflated)

benchmarking Data.Vector.Unboxed/1x1000000/UncheckedStFromBack
time                 514.6 μs   (507.5 μs .. 523.4 μs)
                     0.997 R²   (0.992 R² .. 0.999 R²)
mean                 412.5 μs   (393.9 μs .. 431.3 μs)
std dev              46.70 μs   (37.85 μs .. 58.42 μs)
variance introduced by outliers: 81% (severely inflated)

benchmarking Data.Vector.Unboxed/1x1000000/UncheckedStFromBackModify
time                 1.064 ms   (991.1 μs .. 1.143 ms)
                     0.979 R²   (0.970 R² .. 0.999 R²)
mean                 796.9 μs   (750.5 μs .. 858.1 μs)
std dev              145.4 μs   (104.6 μs .. 192.7 μs)
variance introduced by outliers: 91% (severely inflated)

benchmarking Data.Vector.Unboxed/1x1000000/UncheckedStFromBackBailEmpty
time                 11.01 ns   (10.68 ns .. 11.33 ns)
                     0.990 R²   (0.984 R² .. 0.994 R²)
mean                 9.270 ns   (8.907 ns .. 9.716 ns)
std dev              1.129 ns   (926.8 ps .. 1.456 ns)
variance introduced by outliers: 94% (severely inflated)

Well, the modify version was disappointing: it’s worse than the original by a factor of two. I suppose this shows I don’t understand modify and will need to study what it actually does. Here’s the implementation I chose for it:

vsum (v0 :| vs) = VG.modify (\vec -> forM_ vs $ \v ->
  let go 0 = pure ()
      go n = do
        let n1 = n - 1
        VG.unsafeModify vec (+ VG.unsafeIndex v n1) n1
        go n1
  in go vlen) v0
  where
    vlen = VG.length v0
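For what it’s worth, my hedged reading of the vector documentation is that modify can only elide the copy when fusion proves the input vector isn’t shared; v0 here arrives from outside the function, so a clone presumably happens regardless. For reference, the signature in Data.Vector.Generic is:

modify :: Vector v a => (forall s. Mutable v s a -> ST s ()) -> v a -> v a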

The good news is that bailing out on an empty tail helps a lot, and the check only costs an extra 2 ns. I think that’s worth it if the alternative is paying 500 µs on a hit. It of course depends on the use case: if you’re the only consumer of the function and you know you’ll always have multiple vectors, maybe you want to save the 2 ns.

If you’re interested in the numbers, see the html report for this case or the textual output.

Many vectors, single element

The input sample is 1 million vectors with a single element each.

Some weirdness starts here: RecurseZipWithN is by far the fastest for boxed vectors. It is, however, by far the slowest for storable and unboxed vectors.

I wonder if the size of the function is hindering some optimisations. Let’s add a new variant: one that still uses zipWith6 but, when fewer than six vectors remain, folds over the rest of the input.

vsum (v0 :| v1 : v2 : v3 : v4 : v5 : vs) =
  vsum (VG.zipWith6 (\e0 e1 e2 e3 e4 e5 -> e0 + e1 + e2 + e3 + e4 + e5) v0 v1 v2 v3 v4 v5 :| vs)
vsum (v0 :| vs) = foldl' (VG.zipWith (+)) v0 vs

Slightly better for the boxed case and even worse for the storable and unboxed cases. We could try zipWith3 instead, but I suspect it would not help much.

The UncheckedStFromBackBailEmpty variant from the previous section actually performs even better than the original, though only marginally. The modify variant is comparable to the original, so that’s hopeful at least. The ST variants are by far the fastest for this case, with the bail-empty one the best, though not by a lot. All are acceptable.

If you’re interested in the numbers, see the html report for this case or the textual output.

Many vectors, many elements

The input sample is 5000 vectors with 5000 elements each, and the results are pretty much the same as for the previous case. Or so I would have wanted to say, but then:

benchmarking Data.Vector.Unboxed/5000x5000/UncheckedStFromBackModify
time                 23.68 ms   (23.60 ms .. 23.76 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
vector-sum-benchmarks-bench: ./Data/Vector/Generic.hs:245 ((!)): index out of bounds (-9223372036854775808,1000)
CallStack (from HasCallStack):
  error, called at ./Data/Vector/Internal/Check.hs:87:5 in vector-0.12.0.1-JlawpRjIcMJIYPJVsWriIA:Data.Vector.Internal.Check
vector-sum-benchmarks-bench: thread blocked indefinitely in an MVar operation
Benchmark vector-sum-benchmarks-bench: ERROR

Interesting. I don’t want to sit through another minutes-long run (boxed vectors take 7 seconds per iteration on 5000x5000, compared to 20 ms for unboxed and 50 ms for storable), so let’s drop down to 1000x1000. I will definitely have to keep that crash in mind. I wonder why it crashed there and not for storable or boxed vectors; very bizarre. I’m actually guessing it might be a bug in criterion, considering the weird index (an overflow?) and that it came up halfway through its printing. That seems to be exactly it: here’s the (fixed) criterion issue, fixed two months ago at that. I could update criterion through stack.yaml, but I don’t want to invalidate previous results in the middle of the experiment. Let’s go to 1000x1000.

The results are very similar to the many vectors, single element case, just scaled differently. The zipWithN solutions are still the fastest for boxed vectors (a factor of 2x better than the others) and still completely awful for storable and unboxed vectors (a factor of 20-30x worse than the others).

If you’re interested in the numbers, see the html report for this case or the textual output.

200 vectors, 200 elements

Out of curiosity I also ran with a smaller data set of 200x200. With this amount of data the zipWithN versions are actually worse even for boxed vectors. The usual ST bail-empty version is the fastest: well, on par with UncheckedStFromFront, which seems to have taken the lead, but only by a small margin. It seems that there’s a point for boxed vectors where the zipWithN functions stop being the worst and become the best. Perhaps it’s all the other functions that degrade with larger inputs.

For unboxed and storable vectors the story is the usual: ST variants win with unchecked ST from back versions coming out on top.

If you’re interested in the numbers, see the html report for this case or the textual output.

General remarks

Some comments regarding the overall process

Data generation

The data generation is nothing exotic:

-- | Creates data suitable for various vsum implementations.
createData
  :: VG.Vector v Double
  => Int -- ^ Number of vectors
  -> Int -- ^ Length of vectors
  -> NonEmpty (v Double)
createData vs l = Data.List.NonEmpty.fromList $
  Prelude.map (VG.generate l . mult) [1 .. vs]
  where
    mult :: Int -> Int -> Double
    mult x y = fromIntegral (x * y)
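As a usage sketch (assuming the VU alias for Data.Vector.Unboxed, matching the appendix), the single-vector case from earlier would be built like this:

-- 1 vector of 1000000 elements, as in the first benchmark group:
singleBig :: NonEmpty (VU.Vector Double)
singleBig = createData 1 1000000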

As we’re using criterion (and weigh for allocations), this is evaluated to normal form before being passed into the benchmark.

Number gathering

It would take too long to run the full set of benchmarks every time I wanted to make a change. I started with the functions listed below in the appendix as the baseline and only ran the full set of criterion benchmarks once. The number of benchmark cases is 3 * length sizes * length vsumFunctions; this grows fairly fast, and the boxed vector benchmarks in particular take a long time.
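To put a number on that: with the seven baseline implementations from the appendix and, say, four input sizes, that is already 3 * 4 * 7 = 84 criterion cases (the size count here is illustrative).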

Therefore, once I had a baseline, I re-ran benchmarks only for the cases I was interested in for this blog post. These are the numbers you see above. You can see the original full-run numbers if you wish: text version and html version (WARNING: the report is rather large and will potentially hang your browser. It’s not very useful to look at either, because every time is plotted on the same diagram, and boxed vector operations on large inputs completely dwarf the rest, making it useless as a visual aid). You’ll note that the ST functions dealing with the singleton case are missing, as is the zipWith6-then-fold variant of zipWithN: these functions were introduced during this blog post.

Memory usage

There are also allocation numbers available. These are complete, as they only need a single iteration per case; you can see them here. The numbers are fairly uniform. The zipWithN cases seem to have a little less live data sticking around, but pay with a significant increase in the number of garbage collections. All other differences between the allocation numbers are fairly negligible.

Contrived test cases

We’re benchmarking the speed of the functions in isolation. They don’t live alongside other code, and they don’t benefit from any optimisations they might get when inlined into a real codebase. As always, take these numbers with a grain of salt: they are at best like-for-like comparisons, and in real code performance may change due to fusion &c. Write benchmarks for your application before changing your code based on microbenchmarks here (or anywhere).

Different vector types

Boxed vectors seem to behave very differently performance-wise from storable and unboxed vectors, and it’s unclear to me why. Of course I expect them to be much slower, but it makes me curious why they don’t scale the same way as the other vector types. Why are the ST functions worse than the zipWithN variants on larger inputs for boxed vectors? This may be an investigation for another day.

Specialisation and inlining

I had to explicitly specialise the functions, otherwise GHC worked with the unspecialised generic versions and the results were quite a lot different. This is standard practice anyway, so remember to do it in your code too. Remember to add INLINABLE as well, since SPECIALISE pragmas turn off inlining by default. In real code we may instead benefit from INLINE over SPECIALISE. For more information on these things, check out this excellent post on GHC optimisation and fusion.
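Concretely, every implementation in the appendix carries this pragma set:

{-# INLINABLE vsum #-}
{-# SPECIALISE vsum :: NonEmpty (V.Vector Double) -> V.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VS.Vector Double) -> VS.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VU.Vector Double) -> VU.Vector Double #-}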

ST traversal from the back and from the front

Some of the ST functions presented have two variants: one that traverses the vectors from the back and one from the front. The reason I measure both is that it does seem to make a noticeable difference, with traversal from the back usually winning slightly. I tried this because I was told fairly recently that checking whether something is zero can be faster than comparing it to another number, as there are special instructions for that. I don’t know whether that’s what’s happening here, or even whether the claim holds, but the back traversal does seem faster.
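To make the difference concrete, here is the same counting loop written both ways, stripped of the vector machinery (purely illustrative; both sum 0 to k - 1):

-- Back: the guard is a pattern match against the constant 0; the
-- decrement itself may already set the flags the guard needs.
sumBack :: Int -> Int
sumBack k = go k 0
  where
    go 0 acc = acc
    go n acc = go (n - 1) (acc + (n - 1))

-- Front: the guard compares against the loop-invariant bound k.
sumFront :: Int -> Int
sumFront k = go 0 0
  where
    go n acc | n == k = acc
    go n acc = go (n + 1) (acc + n)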

Non-exhaustive examples

There are multiple other ways we could write vsum. A simple example: instead of explicit indexing in the ST traversals, just use imapM_. If I remember correctly this optimises to the same loop as the forward traversal, but of course then you’re relying on optimisations. It would be very good to add more cases. Another example is the aforementioned zipWithN-then-fold for other values of N (such as the zipWith3 variant suggested earlier); you never know. Mixing existing examples is also possible: we could combine bail-empty and modify, always bailing on an empty tail (so we never pay the price of a pointless modify, which we found to be quite bad) while still receiving the benefits of modify in the common case, if there are any. Both ideas are sketched below.
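Here are hedged sketches of both of those suggestions, following the appendix’s module aliases; neither has been benchmarked:

-- BailEmpty + Modify: bail on singleton input so the pointless modify
-- is never paid for; otherwise let VG.modify decide whether to copy.
vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v0 :| []) = v0
vsum (v0 :| vs) = VG.modify (\vec -> forM_ vs $ \v ->
  let go 0 = pure ()
      go n = do
        let n1 = n - 1
        VG.unsafeModify vec (+ VG.unsafeIndex v n1) n1
        go n1
  in go (VG.length v0)) v0

-- imapM_ instead of manual indexing; this leans on vector to fuse the
-- traversal into a tight forward loop.
vsum' :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum' (v0 :| vs) = ST.runST $ do
  vec <- VG.thaw v0
  forM_ vs $ \v -> flip VG.imapM_ v $ \i x -> VG.unsafeModify vec (+ x) i
  VG.unsafeFreeze vec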

Additions encouraged and most welcome.

hmatrix, repa, accelerate, parallel, …

None of the fancy parallel libraries were used. Not even the non-fancy ones. Mostly because I was lazy, but in large part because those usually only play nicely in certain scenarios. In benchmarks on the real codebase, the vsum we had was faster only up to the point at which hmatrix started winning out, for rather large vectors. I want to avoid benchmarking these libraries poorly, in a setting where data conversions &c. would completely skew the data; often the rest of your program already deals in those formats.

RTS, hardware

No fancy RTS flags were passed, only -N. This means the parallel GC was on, &c. Everything was on fair ground, however, so that’s fine.

The numbers presented here were gathered on an i7 6770k rig with 64GB of RAM, running NixOS.

Conclusion

A general winner has emerged, and it is

vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v0 :| []) = v0
vsum (v0 :| vs) = ST.runST $ do
  vec <- VG.thaw v0
  let vlen = VG.length v0
  forM_ vs $ \v ->
    let go 0 = pure ()
        go n = do
          let n1 = n - 1
          VG.unsafeModify vec (+ VG.unsafeIndex v n1) n1
          go n1
    in go vlen
  VG.unsafeFreeze vec

This performed very marginally worse in the unusual case of a single vector (a 2 ns loss). It actually performed very marginally better than the unchecked cases on other inputs. Occasionally some other ST function would match it or come very slightly ahead; the differences can easily be dismissed as measurement noise. Note that this function is quite unsafe and will do awful, horrible things if any of the vectors is smaller than the first one. This assumption is OK for our use case and can easily be statically verified. If this were to become a library function, more care should be taken: at the very least use ! for indexing instead, and crash the application rather than leak memory contents to an attacker.

For boxed vectors, the zipWithN versions came out on top past a certain amount of data. If you’re using those vectors, be wary. You usually shouldn’t be using them for numerical data anyway.

I would definitely like to understand why the zipWithN versions perform so badly for unboxed and storable vectors, as well as why we see such a different pattern for boxed vectors.

I would also like to confirm the reason for the back/front traversal speed difference. If there’s interest I might do that in another post; it might just be a case of peeking at the generated ASM. I would also like to repeat the experiments with LLVM, as well as play with different compilation flags for both GHC and LLVM.

Appendix: Functions used in the benchmarking

Fold zip

-- | Fold a (+) zip with the first vector as the initial value.
vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v :| vs) = foldl' (VG.zipWith (+)) v vs
{-# INLINABLE vsum #-}
{-# SPECIALISE vsum :: NonEmpty (V.Vector Double) -> V.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VS.Vector Double) -> VS.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VU.Vector Double) -> VU.Vector Double #-}

This is a straightforward zip going vector by vector.
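For intuition, on an input v :| [a, b] this computes VG.zipWith (+) (VG.zipWith (+) v a) b: one zipWith pass, and barring fusion one fresh vector allocation, per list element.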

Recurse zip

-- | Similar to FoldZip but use explicit recursion for the zips.
vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v :| v1 : vs) = vsum (VG.zipWith (+) v v1 :| vs)
vsum (v :| []) = v
{-# INLINABLE vsum #-}
{-# SPECIALISE vsum :: NonEmpty (V.Vector Double) -> V.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VS.Vector Double) -> VS.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VU.Vector Double) -> VU.Vector Double #-}

Recurse zipWithN

-- | Matches on up to 6 vectors and uses zipWithN to consume as many
-- of them as we can.
vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v0 :| v1 : v2 : v3 : v4 : v5 : vs) =
  vsum (VG.zipWith6 (\e0 e1 e2 e3 e4 e5 -> e0 + e1 + e2 + e3 + e4 + e5) v0 v1 v2 v3 v4 v5 :| vs)
vsum (v0 :| v1 : v2 : v3 : v4 : vs) =
  vsum (VG.zipWith5 (\e0 e1 e2 e3 e4 -> e0 + e1 + e2 + e3 + e4) v0 v1 v2 v3 v4 :| vs)
vsum (v0 :| v1 : v2 : v3 : vs) =
  vsum (VG.zipWith4 (\e0 e1 e2 e3 -> e0 + e1 + e2 + e3) v0 v1 v2 v3 :| vs)
vsum (v0 :| v1 : v2 : vs) =
  vsum (VG.zipWith3 (\e0 e1 e2 -> e0 + e1 + e2) v0 v1 v2 :| vs)
vsum (v0 :| v1 : vs) =
  vsum (VG.zipWith (+) v0 v1 :| vs)
vsum (v0 :| []) = v0
{-# INLINABLE vsum #-}
{-# SPECIALISE vsum :: NonEmpty (V.Vector Double) -> V.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VS.Vector Double) -> VS.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VU.Vector Double) -> VU.Vector Double #-}

This is a recursive implementation using up to zipWith6 (as that’s the highest defined in the vector library).
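As a rough count, each zipWith6 step turns 6 vectors into 1, so a list of k vectors shrinks by 5 per step: about (k - 1) / 5 passes and intermediate allocations, compared to the k - 1 passes of the plain zipWith versions.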

Checked ST from back

-- | Go through vectors one by one and mutate an accumulating vector
-- in place. Uses checked operation for indexing of non-initial vector.
-- Vectors are traversed from the back.
vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v :| vs) = ST.runST $ do
  vec <- VG.thaw v
  let vlen = VG.length v
  forM_ vs $ \v ->
    let go 0 = pure ()
        go n = do
          let n1 = n - 1
          VG.unsafeModify vec (+ v VG.! n1) n1
          go n1
    in go vlen
  VG.unsafeFreeze vec
{-# INLINABLE vsum #-}
{-# SPECIALISE vsum :: NonEmpty (V.Vector Double) -> V.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VS.Vector Double) -> VS.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VU.Vector Double) -> VU.Vector Double #-}

Checked ST from front

-- | Go through vectors one by one and mutate an accumulating vector
-- in place. Uses checked operation for indexing of non-initial vector.
-- Vectors are traversed from the front.
vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v :| vs) = ST.runST $ do
  vec <- VG.thaw v
  let vlen = VG.length v
  forM_ vs $ \v ->
    let go n | n == vlen = pure ()
        go n = do
          VG.unsafeModify vec (+ v VG.! n) n
          go (n + 1)
    in go 0
  VG.unsafeFreeze vec
{-# INLINABLE vsum #-}
{-# SPECIALISE vsum :: NonEmpty (V.Vector Double) -> V.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VS.Vector Double) -> VS.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VU.Vector Double) -> VU.Vector Double #-}

Unchecked ST from back (security vulnerability edition)

-- | Go through vectors one by one and mutate an accumulating vector
-- in place. Uses unsafe operations, performs no bounds checks.
-- Vectors are traversed from the back.
vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v :| vs) = ST.runST $ do
  vec <- VG.thaw v
  let vlen = VG.length v
  forM_ vs $ \v ->
    let go 0 = pure ()
        go n = do
          let n1 = n - 1
          VG.unsafeModify vec (+ VG.unsafeIndex v n1) n1
          go n1
    in go vlen
  VG.unsafeFreeze vec
{-# INLINABLE vsum #-}
{-# SPECIALISE vsum :: NonEmpty (V.Vector Double) -> V.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VS.Vector Double) -> VS.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VU.Vector Double) -> VU.Vector Double #-}

Unchecked ST from front (security vulnerability edition)

-- | Go through vectors one by one and mutate an accumulating vector
-- in place. Uses unsafe operations, performs no bounds checks.
-- Vectors are traversed from the front.
vsum :: VG.Vector v Double => NonEmpty (v Double) -> v Double
vsum (v :| vs) = ST.runST $ do
  vec <- VG.thaw v
  let vlen = VG.length v
  forM_ vs $ \v ->
    let go n | n == vlen = pure ()
        go n = do
          VG.unsafeModify vec (+ VG.unsafeIndex v n) n
          go (n + 1)
    in go 0
  VG.unsafeFreeze vec
{-# INLINABLE vsum #-}
{-# SPECIALISE vsum :: NonEmpty (V.Vector Double) -> V.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VS.Vector Double) -> VS.Vector Double #-}
{-# SPECIALISE vsum :: NonEmpty (VU.Vector Double) -> VU.Vector Double #-}