NUMA-aware SAXPY
This example demonstrates how to define and run a SAXPY kernel (single-precision `Y[i] = a * X[i] + Y[i]`) such that it runs efficiently on a system with multiple memory domains (NUMA) using multithreading. (You will likely need to fine-tune the value of `N` on your system of interest if you care about the particular measurement.)
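Note that Julia must be started with multiple threads for any of this to matter; the thread count used here is only an assumption, adapt it to your machine. A minimal sanity check:

```julia
# Start Julia with one thread per core you want to use, e.g. (the count is an assumption):
#   julia --project --threads=128
@show Threads.nthreads()   # should match the number of cores you intend to pin to
```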
```julia
using BenchmarkTools
using Statistics
using Random
using ThreadPinning
using KernelAbstractions

ThreadPinning.pinthreads(:numa)

@kernel function saxpy_kernel(a, @Const(X), Y)
    I = @index(Global)
    @inbounds Y[I] = a * X[I] + Y[I]
end
"""
measure_membw(; kwargs...) -> membw, flops
Estimate the memory bandwidth (GB/s) by performing a time measurement of a
SAXPY kernel. Returns the memory bandwidth (GB/s) and the compute (GFLOP/s).
"""
function measure_membw(
backend = CPU(); verbose = true, N = 1024 * 500_000, dtype = Float32,
init = :parallel,
)
bytes = 3 * sizeof(dtype) * N # num bytes transferred in SAXPY
    flops = 2 * N # num flops in SAXPY
    a = dtype(3.1415)
    if init == :serial
        X = rand!(zeros(dtype, N))
        Y = rand!(zeros(dtype, N))
    else
        X = rand!(KernelAbstractions.zeros(backend, dtype, N))
        Y = rand!(KernelAbstractions.zeros(backend, dtype, N))
    end
    workgroup_size = 1024

    t = @belapsed begin
        kernel = saxpy_kernel($backend, $workgroup_size, $(size(Y)))
        kernel($a, $X, $Y, ndrange = $(size(Y)))
        KernelAbstractions.synchronize($backend)
    end evals = 2 samples = 10

    mem_rate = bytes * 1.0e-9 / t # GB/s
    flop_rate = flops * 1.0e-9 / t # GFLOP/s

    if verbose
        println("\tMemory Bandwidth (GB/s): ", round(mem_rate; digits = 2))
        println("\tCompute (GFLOP/s): ", round(flop_rate; digits = 2))
    end
    return mem_rate, flop_rate
end

# Static should be much better (on a system with multiple NUMA domains)
measure_membw(CPU());
measure_membw(CPU(; static = true));

# The following has significantly worse performance (even on systems with a single memory domain)!
# measure_membw(CPU(); init=:serial);
# measure_membw(CPU(; static=true); init=:serial);
```
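Before interpreting any numbers, it can be worth double-checking that the pinning actually took effect. ThreadPinning.jl provides `threadinfo()` for this (the exact output depends on your system and ThreadPinning version); the following is an optional sanity check, not part of the measurement itself.

```julia
using ThreadPinning

# Optional sanity check: visualize on which cores the Julia threads are currently running.
threadinfo()
```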
Important remarks:

- Pin your threads systematically to the available physical (or virtual) CPU-cores. ThreadPinning.jl is your friend.
- Opt out of Julia's dynamic task scheduling (especially task migration) by using `CPU(; static=true)` instead of `CPU()`.
- Initialize your data in parallel(!). It is of utmost importance to use a parallel access pattern for initialization that is as similar as possible to the access pattern of your computational kernel. The reason for this is the "NUMA first-touch policy". `KernelAbstractions.zeros(backend, dtype, N)` is your friend; see the sketch after this list for what it achieves.
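For illustration, here is a minimal sketch (assuming Julia ≥ 1.8 for the `:static` schedule option) of what a first-touch friendly initialization looks like when written by hand: the thread that will later work on a chunk of the array is also the first one to write it, so the corresponding memory pages end up in that thread's NUMA domain. The helper name `first_touch_init` is made up for this sketch; the example above simply uses `KernelAbstractions.zeros(backend, dtype, N)` together with `rand!` instead.

```julia
using Random

# Hypothetical helper, for illustration only: allocate an array and fill it
# using a static, chunked thread schedule that mimics the access pattern of
# the SAXPY kernel. On Linux, pages are physically allocated on first write
# ("first touch"), which places them in the NUMA domain of the writing thread.
function first_touch_init(dtype, N)
    X = Vector{dtype}(undef, N)      # only reserves (virtual) memory
    Threads.@threads :static for i in eachindex(X)
        X[i] = rand(dtype)           # first write places the page near this thread
    end
    return X
end
```

This only approximates the partitioning of the actual kernel; allocating and initializing through the backend, as `KernelAbstractions.zeros` does, keeps the two access patterns aligned, which is why the example above relies on it.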
Demonstration:

If the above example is run with 128 Julia threads on a Noctua 2 compute node (128 physical cores distributed over two AMD Milan 7763 CPUs with 4 NUMA domains each), one may get the following numbers (comments added for demonstration purposes):
```
Memory Bandwidth (GB/s): 145.64 # backend = CPU(), init = :parallel
Compute (GFLOP/s): 24.27

Memory Bandwidth (GB/s): 333.83 # backend = CPU(; static=true), init = :parallel
Compute (GFLOP/s): 55.64

Memory Bandwidth (GB/s): 32.74 # backend = CPU(), init = :serial
Compute (GFLOP/s): 5.46

Memory Bandwidth (GB/s): 32.46 # backend = CPU(; static=true), init = :serial
Compute (GFLOP/s): 5.41
```
The key observations are the following:
- Serial initialization leads to subpar performance (at least a factor of 4.5) independent of the chosen CPU backend. This is a manifestation of remark 3 above.
- The static CPU backend gives >2x better performance than the one based on the dynamic `Threads.@spawn`. This is a manifestation of remark 2 (and, in some sense, also 1) above.