Last week I read an SO post detailing a rather coincidental situation where clashing cache lines from multiple threads did NOT cause a serious slowdown. It all boils down to a very specific "hack" the CPU performs while waiting to commit a write, which allows the thread to keep operating on the to-be-written value.

To demonstrate the point, the answer used atomic operations to show that if you force threads to wait for each other between writes, you get what you expect: a significant slowdown.

Today, Oscar Smith posted another SO post about Julia multi-threading not scaling linearly for a simple task. We will see how the previous SO post relates to it.

The problem statement

The sequential version:

using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    "total = $total"
end

@benchmark slow(10^3, "9")
total = 6.590720190283038
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.637 μs …  2.393 ms  ┊ GC (min … max): 0.00% … 95.40%
 Time  (median):     48.312 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   53.324 μs ± 92.308 μs  ┊ GC (mean ± σ):  7.83% ±  4.41%

   ▂▃▄▆██▇▅▄▃▂▂▂▂▁▁                                           ▂
  ▇████████████████████▇▅▇▇▆▆▄▄▅▃▅▃▃▁▄▃▄▄▄▃▅▆▅█████▆▆▆▄▄▅▅▆▇▆ █
  45.6 μs      Histogram: log(frequency) by time      70.1 μs <

 Memory estimate: 94.36 KiB, allocs estimate: 2006.

The author first tried a naive multi-threading approach using atomic addition:

function slow_thread(n::Int, digits::String)
   total = Threads.Atomic{Float64}(0)
   Threads.@threads for i in 1:n
       if !occursin(digits, string(i))
           Threads.atomic_add!(total, 1.0 / i)
       end
   end
   "total = $total"
end
@benchmark slow_thread(10^3, "9")
total = Base.Threads.Atomic{Float64}(6.590720190283038)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  51.518 μs …  19.429 ms  ┊ GC (min … max):  0.00% … 99.11%
 Time  (median):     71.666 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   82.123 μs ± 401.979 μs  ┊ GC (mean ± σ):  11.45% ±  2.42%

                           ▁▅█▇▃▁                               
  ▂▁▂▂▁▁▁▁▁▁▂▂▂▂▃▃▄▅▅▇▇▇█▇████████▆▆▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂ ▃
  51.5 μs         Histogram: frequency by time           95 μs <

 Memory estimate: 98.33 KiB, allocs estimate: 2064.

This yields almost zero speed-up (with 8 threads, in this case); in fact, the median time is worse than the sequential version, because every thread has to wait its turn on the shared atomic variable.

Pre-allocation and reduce?

The next step is to break up the accumulation variable so each thread gets its own. Concretely, we make a Vector whose length equals the number of threads (8 here) and let each thread add only to its own entry:

function slow_thread_pre(n::Int, digits::String)
   totals = zeros(Threads.nthreads())
   Threads.@threads for i in 1:n
       if !occursin(digits, string(i))
           totals[Threads.threadid()] += 1.0 / i
       end
   end
   "total = $(sum(totals))"
end
@benchmark slow_thread_pre(10^3, "9")
total = 6.5907201902830375
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  36.930 μs …  21.480 ms  ┊ GC (min … max):  0.00% … 99.50%
 Time  (median):     42.320 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   56.431 μs ± 475.956 μs  ┊ GC (mean ± σ):  18.81% ±  2.23%

   ▅█▅▂▃▅▅▄▂▁                                                   
  ▄██████████▇▆▅▄▄▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁ ▃
  36.9 μs         Histogram: frequency by time         78.2 μs <

 Memory estimate: 96.42 KiB, allocs estimate: 2028.

Nice! Finally some visible improvement. But this is definitely not scaling linearly, even though our task is trivially parallel. What's stopping us? False sharing is, as described in detail by an Intel article. The gist of it is that CPU cores don't access cache/memory in units of bits or even bytes: a run of consecutive bytes shares a single cache line. If multiple cores write to neighboring indices in memory, they keep invalidating the same cache line in each other's caches. The resulting coherency traffic wastes a lot of time.
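We can put a rough number on this. A quick back-of-the-envelope sketch, assuming 64-byte cache lines (typical on x86-64, but machine-dependent; check `hw.cachelinesize` or `/sys/devices/system/cpu/cpu0/cache/` on your machine):

```julia
# Assumption: 64-byte cache lines, which is typical but not universal.
cache_line_bytes = 64
floats_per_line = cache_line_bytes ÷ sizeof(Float64)  # 8 Float64s per line

# With 8 threads and adjacent entries, all 8 accumulators fit inside a
# single cache line, so every write by one core invalidates that line
# for all the others. Spacing the accumulators at least `floats_per_line`
# elements apart gives each one its own line.
stride_needed = floats_per_line
```

Under that assumption, a stride of 8 `Float64`s would already suffice; the stride of 10 used below is simply a safe over-estimate.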

The easiest way to verify this hypothesis is to manually increase the separation of the variables in memory:

#stri means strided
function slow_thread_pre_stri(n::Int, digits::String)
   totals = zeros(Threads.nthreads()*10)
   Threads.@threads for i in 1:n
       if !occursin(digits, string(i))
           totals[Threads.threadid()*10] += 1.0 / i
       end
   end
   "total = $(sum(totals))"
end
@benchmark slow_thread_pre_stri(10^3, "9")
total = 6.5907201902830375
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.221 μs …  23.513 ms  ┊ GC (min … max):  0.00% … 99.59%
 Time  (median):     26.991 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   39.570 μs ± 517.661 μs  ┊ GC (mean ± σ):  31.90% ±  2.44%

   ▄█▅▂            ▁▂▂▃▂▃▃▁▁                                    
  ▃█████▆▄▃▂▂▂▂▄▅▆███████████▇▇▆▆▅▄▃▃▃▂▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁ ▃
  16.2 μs         Histogram: frequency by time         47.7 μs <

 Memory estimate: 96.72 KiB, allocs estimate: 2028.

This yields a much better result than plain pre-allocation, confirming that false sharing was a major cause of the poor performance in this case.
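An alternative that sidesteps false sharing entirely (and the `threadid`-indexing pattern, which is unsafe on newer Julia versions where tasks can migrate between threads) is to give each chunk of work its own stack-local accumulator and reduce at the end. A minimal sketch, assuming we split `1:n` into one chunk per thread with `Threads.@spawn` (the function name `slow_thread_local` is my own):

```julia
using Base.Threads: nthreads, @spawn

function slow_thread_local(n::Int, digits::String)
    # One contiguous chunk of indices per thread.
    chunks = Iterators.partition(1:n, cld(n, nthreads()))
    tasks = map(chunks) do chunk
        @spawn begin
            # Each task accumulates into its own local variable, so no
            # two cores ever write to the same cache line.
            local_total = 0.0
            for i in chunk
                if !occursin(digits, string(i))
                    local_total += 1.0 / i
                end
            end
            local_total
        end
    end
    "total = $(sum(fetch.(tasks)))"
end
```

Note that the per-chunk partial sums are combined in a different order than the sequential loop, so the result may differ in the last few bits of floating-point precision, just as `slow_thread_pre` already does above.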