Adding huge arrays element-wise

If you had two huge arrays, say a hundred million elements in each, and you had to add them element wise, how would you go about doing it?

So, what I mean is, I have

arr1 = [1, 2, 3, 4, 5]
arr2 = [5, 4, 3, 2, 1]

and I want to end up with

arr3 # => [6, 6, 6, 6, 6]

I am looking for both idiomatic and optimized ways of doing it. How would you do it?

arr1.zip(arr2).map(&:sum)

That’s definitely one way of doing it. It’s also the most idiomatic Ruby I could think of myself. The problem is that with 100 million elements, this ends up creating 100 million temporary two-element arrays (from the #zip) and then summing each of them. On my MacBook Pro it takes about a minute to run.
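To make the allocation concrete, here is a small sketch of what #zip materializes before the summing even starts:

```ruby
arr1 = [1, 2, 3, 4, 5]
arr2 = [5, 4, 3, 2, 1]

# zip builds one temporary two-element Array per input element...
pairs = arr1.zip(arr2)  # => [[1, 5], [2, 4], [3, 3], [4, 2], [5, 1]]

# ...and only then does map(&:sum) walk those temporaries
arr3 = pairs.map(&:sum) # => [6, 6, 6, 6, 6]
```

With 100 million elements that is 100 million short-lived arrays for the GC to churn through.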

Anybody else have a more optimized way of doing this?


Disclaimer: I am trying to implement a Ruby front-end for the Bohrium project (GitHub - bh107/bohrium: Automatic parallelization of Python/NumPy, C, and C++ codes on Linux and MacOSX). With this I am currently able to accomplish the above array addition in about four seconds. I just want good benchmarks against optimized and idiomatic Ruby.

At least on current rubies you could do arr1.lazy.zip...
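Spelled out, that lazy chain might look like this (a sketch; it avoids materializing all the zipped pairs up front, though as the benchmarks further down show, it is not necessarily faster):

```ruby
arr1 = [1, 2, 3, 4, 5]
arr2 = [5, 4, 3, 2, 1]

# The lazy enumerator streams the pairs instead of building them all at once
arr3 = arr1.lazy.zip(arr2).map { |a, b| a + b }.to_a
# => [6, 6, 6, 6, 6]
```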


With something that large I would likely think of preserving memory and use pop-based iteration. It would be less idiomatic and more optimized.

arr1 = [1, 2, 3, 4, 5]
arr2 = [5, 4, 3, 2, 1]
arr3 = []

while !arr1.empty? || !arr2.empty?
  arr3 << arr1.pop.to_i + arr2.pop.to_i
end

arr3
# => [6, 6, 6, 6, 6]

Since adding zero is an identity operation, it doesn’t matter if the collections are of different sizes.


As far as I can tell your version is different from zipping. Your results will have the length of the longer input, while the zipping variants will have the length of the shorter input.


edit

I just remembered zipWith from Haskell, and it seems Enumerable#zip can take a block, though it works differently from what I’m used to from Haskell…

In Ruby the block passed to Enumerable#zip is executed for its side effects only: none of the inputs is changed, nor is an accumulator returned, so we have to build a result array in the block:

a = [1, 2, 3, 4, 5]
b = [5, 4, 3, 2, 1]
r = []

a.zip(b) { |x, y| r << x + y }
r #=> [6, 6, 6, 6, 6]

At least in theory, this should not allocate more than 4 arrays at once (2 for input, one for the current result, and one for the next result when the current one outgrows its capacity), though due to GC timing, older chunks may survive for a while.

But the GC does play tricks on us in any of the aforementioned solutions to the problem.


True. It depends on what a person wants; no data is lost here, which seems more applicable in an additive sense. To match the shorter-list behavior of the zipping versions, you only need to switch the || to an &&.
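A sketch of that && variant, which then matches #zip’s shorter-list semantics:

```ruby
arr1 = [9, 1, 2, 3, 4, 5]  # one extra element
arr2 = [5, 4, 3, 2, 1]
arr3 = []

# && stops as soon as either array is exhausted
while !arr1.empty? && !arr2.empty?
  arr3 << arr1.pop + arr2.pop
end

arr3
# => [6, 6, 6, 6, 6] -- the unmatched 9 stays behind in arr1
```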


I don’t quite remember exactly how Integer is implemented in Ruby. If the items being iterated over are truly integers, I believe small integers are encoded directly in the (tagged) pointer value itself, so any time you use one you’re not allocating any more memory than a pointer.

This is vaguely my memory of it and I could be wrong. But from what I understand, pointer-sized values are the cheapest thing to handle in C, so the language is designed to take advantage of that.

If this is correct… the only space in memory being taken up is a list of pointer-sized values. It takes the same amount of memory as raw integers would, but is more performant. Shortening the list through pop will allow the garbage collector to free up that space. Although current Ruby GC implementations don’t free up pages of memory well and the heap gets fragmented, so you don’t truly recover all the memory you should (it is reusable within Ruby, but not all of it is freed for other applications to use).
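For what it’s worth, in CRuby small Integers are in fact immediate values: the number is encoded in the tagged VALUE word itself (as 2n + 1), so using one allocates nothing on the heap. A quick way to observe this:

```ruby
# Immediate Integers in CRuby have object_id 2n + 1 -- always odd --
# because the value lives in the tagged word, not on the heap.
p 1.object_id      # 3 in CRuby
p 100.object_id    # 201 in CRuby

# Every occurrence of the same small Integer is literally the same object
p 1.equal?(2 - 1)  # true
```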

I still like to develop in terms of limited memory available. It’s true that my way completely modifies the original source, but that is my intent. Running apps on servers in VMs can often require stricter memory allotments.

Another implementation worth trying for performance is matrix addition. Just wrap the lists of numbers in Matrix and add those.

require 'matrix'
Matrix[[1,2,3]] + Matrix[[3,2,1]]
# => Matrix[[4, 4, 4]]

If I need the entire array as output, or for further computations, using #lazy will only add overhead.

Here are some of my preliminary results. bohrium is my own framework.

I’ll have to refine and run even more benchmarks, but these are quite nice results for a couple of days’ work. :wink:

ADD
n = 100000000
                           user     system      total        real
zip                   37.370000   3.880000  41.250000 ( 44.306667)
zip_lazy              67.280000   3.920000  71.200000 ( 73.213113)
zip_sum               23.390000   3.860000  27.250000 ( 29.049920)
zip2                  36.220000  45.820000  82.040000 ( 83.346458)
idx                   12.070000   1.930000  14.000000 ( 14.118037)
idx<<                 10.010000   1.400000  11.410000 ( 11.543182)
map!                   9.230000   0.800000  10.030000 ( 10.057617)
matrix                38.310000   3.650000  41.960000 ( 42.172704)
matrix 2d             37.490000   2.210000  39.700000 ( 39.855633)
numo/narray            2.490000   1.030000   3.520000 (  3.548754)
numo/narray 2d        17.810000   5.340000  23.150000 ( 23.444470)
bohrium               12.430000   5.510000  17.940000 ( 18.242302)
bohrium ones add       6.290000   7.220000  13.510000 ( 12.559481)
bohrium ones add!      3.800000   2.800000   6.600000 (  4.817494)
bohrium ones +         3.670000   2.630000   6.300000 (  4.189983)

The benchmark code can be found here: Benchmarking various array addition techniques with Ruby · GitHub

You might want to preload the two arrays in a setup/before section (initializing the memory for the initial arrays shouldn’t be part of this benchmark). And, more importantly, explicitly call GC.start before/after each benchmark, because the GC may kick in periodically during overlapping tests (the data from a previous benchmark being collected in the next one).

The Benchmark.bm method doesn’t call GC.start, but Benchmark.bmbm does: ruby/benchmark.rb at c08f7b80889b531865e74bc5f573df8fa27f2088 · ruby/ruby · GitHub. Or you can use any of the Minitest performance benchmarks, which include the GC calls properly: minitest/benchmark.rb at f99ff61540069c9efbb8f5dbc09b3384bd14d53e · seattlerb/minitest · GitHub
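As a sketch, forcing a collection between reports could look like this (array sizes and labels shrunk for illustration):

```ruby
require "benchmark"

ary1 = Array.new(1_000_000, 2)
ary2 = Array.new(1_000_000, 3)

Benchmark.bm(10) do |measure|
  GC.start # collect setup garbage before the first timing
  measure.report("zip") { ary1.zip(ary2).map { |a, b| a + b } }

  GC.start # keep the previous report's garbage out of this one
  measure.report("idx") { ary1.map.with_index { |e, i| e + ary2[i] } }
end
```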

If you’d like to go crazy with it try disabling the GC :wink:

Running GC.start in between every test doesn’t yield significantly different results, but thanks for your insight.


I’ve changed the benchmark to have all the array initialization done before benchmarks and I dropped the length of the arrays down from 100 million to 10 million since my computer can’t handle it (I run out of RAM).

require "benchmark"

n = 10_000_000
puts "ADD"
puts "n = #{n}"

ARY1 = [2] * n
ARY2 = [3] * n

ARY1B = [2] * n

ARY3 = Array.new(n)
ARY3B = []

require "matrix"
M1 = Matrix[[2] * n]
M2 = Matrix[[3] * n]
M12D = Matrix[[[2] * n]]
M22D = Matrix[[[3] * n]]

require "numo/narray"
M1N1 = Numo::Int64.new(n).fill(2)
M2N1 = Numo::Int64.new(n).fill(3)
M1N2 = Numo::Int64.new(n, 1).fill(2)
M2N2 = Numo::Int64.new(n, 1).fill(3)


Benchmark.bmbm(20) do |measure|
  measure.report("zip") do
    ARY1.zip(ARY2).map { |i| i.reduce(:+) }[0..10].to_s
  end

  measure.report("zip_lazy") do
    ARY1.lazy.zip(ARY2).map { |i| i.reduce(:+) }.to_a[0..10].to_s
  end

  measure.report("zip_sum") do
    ARY1.zip(ARY2).map(&:sum)[0..10].to_s
  end

  measure.report("zip2") do
    ARY1.zip(ARY2).map { |a, b| a + b }[0..10].to_s
  end

  measure.report("idx") do
    ARY1.each_with_index do |elem, idx|
      ARY3[idx] = elem + ARY2[idx]
    end

    ARY3[0..10].to_s
  end

  measure.report("idx<<") do
    ARY1.each_with_index do |elem, idx|
      ARY3B << elem + ARY2[idx]
    end

    ARY3B[0..10].to_s
  end

  measure.report("map!") do
    ARY1B.map!.with_index do |elem, idx|
      elem + ARY2[idx]
    end

    ARY1B[0..10].to_s
  end

  measure.report("matrix") do
    (M1 + M2).to_a[0..10].to_s
  end

  measure.report("matrix 2d") do
    (M12D + M22D).to_a[0][0..10].to_s
  end

  measure.report("numo/narray") do
    (M1N1 + M2N1).to_a[0..10].to_s
  end

  measure.report("numo/narray 2d") do
    (M1N2 + M2N2).to_a[0..10].to_s
  end
end

And here are the benchmark results I got.

ADD
n = 10000000
Rehearsal --------------------------------------------------------
zip                    4.000000   0.220000   4.220000 (  4.220262)
zip_lazy              10.770000   0.040000  10.810000 ( 10.826313)
zip_sum                3.790000   0.090000   3.880000 (  3.878067)
zip2                   2.890000   0.050000   2.940000 (  2.944797)
idx                    1.230000   0.000000   1.230000 (  1.234467)
idx<<                  1.320000   0.050000   1.370000 (  1.360323)
map!                   1.020000   0.000000   1.020000 (  1.024812)
matrix                 4.550000   0.110000   4.660000 (  4.704962)
matrix 2d              4.780000   0.150000   4.930000 (  4.992296)
numo/narray            0.240000   0.060000   0.300000 (  0.311132)
numo/narray 2d         2.780000   0.100000   2.880000 (  2.878992)
---------------------------------------------- total: 38.240000sec

                           user     system      total        real
zip                    3.540000   0.030000   3.570000 (  3.566836)
zip_lazy               9.950000   0.000000   9.950000 (  9.953135)
zip_sum                3.900000   0.080000   3.980000 (  3.983307)
zip2                   3.210000   0.030000   3.240000 (  3.246011)
idx                    1.280000   0.000000   1.280000 (  1.288621)
idx<<                  1.420000   0.040000   1.460000 (  1.462082)
map!                   1.110000   0.040000   1.150000 (  1.149247)
matrix                 4.460000   0.040000   4.500000 (  4.525342)
matrix 2d              4.080000   0.100000   4.180000 (  4.186811)
numo/narray            0.520000   0.020000   0.540000 (  0.550162)
numo/narray 2d         2.960000   0.040000   3.000000 (  2.999584)

Now I’ve found that when doing the same work in Rust I don’t have any issues with memory at 100 million integers per array.

In Rust doing 3 different kinds of tests 10 million times I get the following times.

running 3 tests
test add_push    ... bench:  (0.072995736) sec/iter (+/- 0.004712853)
test add_replace ... bench:  (0.036814299) sec/iter (+/- 0.000794285)
test add_zip     ... bench:  (0.063002991) sec/iter (+/- 0.002482924)

test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out

I’ve translated the times from nanoseconds to seconds. I’ve also pre-initialized the arrays of data outside the benchmark, just as I did in the Ruby code above. I had to cheat on replace, since Rust’s current benchmark system hard-codes 300 repetitions and has no begin/setup feature like Minitest… I wanted to measure only the cost of the action itself, so the resulting numbers are off by about that much, but that is negligible.


I was testing including the array initialization because I am interested in the overall user performance experience, but I will make benchmarks with only the inner operations as well. Thanks for your input @danielpclark.

Bohrium isn’t really meant for raw speed, but rather ease of use: the hope is that it is as fast as, or faster than, an equivalent array-programming application written in Fortran, while being easier to use (because of Ruby). It also includes a Python front-end aimed at scientific users.

With your benchmarks I get the following (adding in Bohrium):

ADD
n = 100000000
Rehearsal --------------------------------------------------------
zip                   41.230000   5.190000  46.420000 ( 47.307321)
zip_lazy              83.250000  35.390000 118.640000 (124.717859)
zip_sum               23.970000   3.830000  27.800000 ( 28.612711)
zip2                  27.010000  14.840000  41.850000 ( 44.468215)
idx                   10.800000   1.180000  11.980000 ( 12.399966)
idx<<                  9.900000   0.730000  10.630000 ( 10.906907)
map!                   9.150000   0.770000   9.920000 ( 10.070394)
matrix                56.550000  18.450000  75.000000 ( 88.927717)
matrix 2d             44.230000  10.820000  55.050000 ( 59.399583)
numo/narray            4.070000   7.420000  11.490000 ( 14.199084)
numo/narray 2d        24.340000  12.780000  37.120000 ( 41.537901)
bohrium ones add       8.220000  12.230000  20.450000 ( 23.152769)
bohrium ones add!      5.850000   9.020000  14.870000 ( 13.227341)
bohrium ones +         6.510000  12.340000  18.850000 ( 19.728395)
--------------------------------------------- total: 500.070000sec

                           user     system      total        real
zip                   43.600000  41.950000  85.550000 ( 89.065446)
zip_lazy              82.780000  49.240000 132.020000 (141.281292)
zip_sum               26.720000   7.280000  34.000000 ( 36.268452)
zip2                  31.520000  20.230000  51.750000 ( 55.223682)
idx                    9.470000   0.530000  10.000000 ( 10.098236)
idx<<                 10.200000   1.190000  11.390000 ( 11.776370)
map!                   9.310000   0.310000   9.620000 (  9.801279)
matrix                43.200000  11.730000  54.930000 ( 57.946711)
matrix 2d             42.400000   3.400000  45.800000 ( 47.867578)
numo/narray            3.760000   3.990000   7.750000 ( 10.218080)
numo/narray 2d        25.670000  12.800000  38.470000 ( 42.124430)
bohrium ones add       3.210000   4.500000   7.710000 (  5.254156)
bohrium ones add!      3.170000   4.270000   7.440000 (  4.834174)
bohrium ones +         2.770000   2.730000   5.500000 (  4.066540)

Showing that bohrium is fast, and I would argue easy to use :slight_smile:
