False sharing in multi-threading


Last week I read this SO post detailing a rather coincidental situation where clashing cache lines from multiple threads did NOT cause a serious slowdown. It all boils down to a specific "hack" the CPU performs while a write is still pending (the store buffer), which allows the thread to keep operating on the to-be-written value without waiting for ownership of the cache line.

To demonstrate the point, the answer used atomic operations to show that if you force threads to wait between each "write" until it becomes globally visible, you get exactly what you expect: a significant slowdown.
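Here is a minimal sketch of that experiment, rewritten from memory rather than copied from the answer (the function names are mine). Each thread updates its own slot, but all the slots sit in one cache line; the plain version stays surprisingly fast because each core can keep working out of its store buffer, while the atomic version must make every write globally visible:

function clashing_writes(n)
    x = zeros(Threads.nthreads())        # adjacent Float64s: one shared cache line
    Threads.@threads for _ in 1:n
        x[Threads.threadid()] += 1.0     # per-thread slot, but the line clashes
    end
    sum(x)
end

function atomic_writes(n)
    total = Threads.Atomic{Float64}(0.0)
    Threads.@threads for _ in 1:n
        Threads.atomic_add!(total, 1.0)  # serializes all threads on one location
    end
    total[]
end

Timing the two (e.g. @btime clashing_writes(10^7) vs @btime atomic_writes(10^7)) should show the atomic version losing badly.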

Today, Oscar Smith posted another SO post about Julia multi-threading not scaling linearly for a simple task. We will see how the previous SO post relates to it.

The problem statement

The sequential version:

using BenchmarkTools

# sum 1/i over i = 1:n, skipping any i whose decimal form contains `digits`
function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    "total = $total"
end

@benchmark slow(10^4, "9")
total = 8.223184402866208
BenchmarkTools.Trial: 
  memory estimate:  938.11 KiB
  allocs estimate:  20006
  --------------
  minimum time:     475.783 μs (0.00% GC)
  median time:      508.135 μs (0.00% GC)
  mean time:        559.904 μs (8.69% GC)
  maximum time:     2.980 ms (77.73% GC)
  --------------
  samples:          1785
  evals/sample:     1
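A quick setup note before the threaded versions: Julia only uses multiple threads if you ask for them at startup, via an environment variable or (since Julia 1.5) the -t flag:

$ JULIA_NUM_THREADS=8 julia    # any version
$ julia -t 8                   # Julia 1.5 and later

julia> Threads.nthreads()
8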

The author first tried a naive multi-threading approach using an atomic accumulator:

function slow_thread(n::Int, digits::String)
    total = Threads.Atomic{Float64}(0)
    Threads.@threads for i in 1:n
        if !occursin(digits, string(i))
            Threads.atomic_add!(total, 1.0 / i)  # every thread contends on `total`
        end
    end
    "total = $total"
end
@benchmark slow_thread(10^4, "9")
total = Base.Threads.Atomic{Float64}(8.223184402866172)
BenchmarkTools.Trial: 
  memory estimate:  944.05 KiB
  allocs estimate:  20085
  --------------
  minimum time:     372.056 μs (0.00% GC)
  median time:      397.800 μs (0.00% GC)
  mean time:        568.460 μs (27.83% GC)
  maximum time:     72.749 ms (99.27% GC)
  --------------
  samples:          1758
  evals/sample:     1

This shows almost zero speed-up (with 8 threads, in this case): every atomic_add! must take exclusive ownership of the cache line holding total, so the threads effectively take turns.

Pre-allocation and reduce?

The next step is to break up the accumulation variable so that each thread has its own. Concretely, we make a Vector whose length equals nthreads() and let each thread add only to its own entry:

function slow_thread_pre(n::Int, digits::String)
    totals = zeros(Threads.nthreads())
    Threads.@threads for i in 1:n
        if !occursin(digits, string(i))
            totals[Threads.threadid()] += 1.0 / i  # each thread writes its own slot
        end
    end
    "total = $(sum(totals))"
end
@benchmark slow_thread_pre(10^4, "9")
total = 8.223184402866204
BenchmarkTools.Trial: 
  memory estimate:  941.91 KiB
  allocs estimate:  20048
  --------------
  minimum time:     180.683 μs (0.00% GC)
  median time:      280.392 μs (0.00% GC)
  mean time:        472.197 μs (39.78% GC)
  maximum time:     71.602 ms (99.39% GC)
  --------------
  samples:          2213
  evals/sample:     1

Nice! Finally some visible improvement. But this is definitely not scaling linearly, even though our task is trivially parallel. What's stopping us? False sharing is, as described in detail by an Intel article. The gist of it is that the cores in your CPU don't access cache/memory in units of a bit or even a byte: consecutive bytes (typically 64 of them) are grouped into a "cache line", the smallest unit a core can own. If multiple cores write to neighboring indices of an array, they all keep updating the same cache line, and the coherency protocol forces that line to bounce between cores, each write invalidating the other cores' copies. That constant synchronization wastes a lot of time.
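If you are curious about the actual line size on your machine, on Linux it is exposed through sysfs (the path below is standard on Linux; other platforms differ):

# Linux-only: query the cache-line ("coherency") size of cpu0's first cache
path = "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
line_size = parse(Int, strip(read(path, String)))
println("cache line: $line_size bytes")   # typically 64, i.e. room for 8 Float64s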

The easiest way to verify this hypothesis is to manually increase the separation of the variables in memory. With 64-byte lines, eight Float64s share one line, so spacing the used entries 10 apart (80 bytes) guarantees each thread's slot sits on its own cache line:

# stri means strided
function slow_thread_pre_stri(n::Int, digits::String)
    totals = zeros(Threads.nthreads()*10)
    Threads.@threads for i in 1:n
        if !occursin(digits, string(i))
            totals[Threads.threadid()*10] += 1.0 / i  # slots 10 apart: separate cache lines
        end
    end
    "total = $(sum(totals))"
end
@benchmark slow_thread_pre_stri(10^4, "9")
total = 8.223184402866204
BenchmarkTools.Trial: 
  memory estimate:  942.48 KiB
  allocs estimate:  20048
  --------------
  minimum time:     70.103 μs (0.00% GC)
  median time:      122.823 μs (0.00% GC)
  mean time:        306.932 μs (57.85% GC)
  maximum time:     73.803 ms (99.77% GC)
  --------------
  samples:          3254
  evals/sample:     1

This yields a much better result than plain pre-allocation, confirming that false sharing was a major cause of the poor performance in this case.
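The stride trick is a good diagnostic, but it wastes memory and still performs a shared-array write on every iteration. As a final note, here is a sketch of the pattern I would reach for instead (my own variant, not from the SO thread): give each thread a truly local accumulator, so shared memory is touched only once per thread at the very end:

function slow_thread_local(n::Int, digits::String)
    nt = Threads.nthreads()
    partials = zeros(nt)
    Threads.@threads for t in 1:nt
        acc = 0.0                      # thread-local: no cache line is shared here
        for i in t:nt:n                # simple strided partition of 1:n
            if !occursin(digits, string(i))
                acc += 1.0 / i
            end
        end
        partials[t] = acc              # one shared write per thread
    end
    "total = $(sum(partials))"
end

Since acc lives in a register (or at worst the thread's own stack), the hot loop never writes to memory another core might own; the only potentially clashing writes are the final nt stores into partials, which is negligible.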