NUMA-aware SAXPY

This example demonstrates how to define and run a SAXPY kernel (single-precision Y[i] = a * X[i] + Y[i]) such that it runs efficiently on a multithreaded system with multiple memory domains (NUMA). (You will likely need to fine-tune the value of N for your system of interest if you care about the particular numbers.)

# EXCLUDE FROM TESTING
using BenchmarkTools
using Statistics
using Random
using ThreadPinning
using KernelAbstractions

ThreadPinning.pinthreads(:numa)
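# Optional sanity check: ThreadPinning.threadinfo() prints the resulting thread-to-core mapping.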

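# SAXPY kernel: each work-item updates one element of Y; @Const marks X as read-only inside the kernel.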
@kernel function saxpy_kernel(a, @Const(X), Y)
    I = @index(Global)
    @inbounds Y[I] = a * X[I] + Y[I]
end

"""
    measure_membw(; kwargs...) -> membw, flops

Estimate the memory bandwidth (GB/s) by performing a time measurement of a
SAXPY kernel. Returns the memory bandwidth (GB/s) and the compute (GFLOP/s).
"""
function measure_membw(backend = CPU(); verbose = true, N = 1024 * 500_000, dtype = Float32,
                       init = :parallel)
    bytes = 3 * sizeof(dtype) * N # num bytes transferred in SAXPY
    flops = 2 * N # num flops in SAXPY

    a = dtype(3.1415)
    if init == :serial
        X = rand!(zeros(dtype, N))
        Y = rand!(zeros(dtype, N))
    else
        X = rand!(KernelAbstractions.zeros(backend, dtype, N))
        Y = rand!(KernelAbstractions.zeros(backend, dtype, N))
    end
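    # Workgroup size, i.e. the number of work-items per workgroup of the kernel launch.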
    workgroup_size = 1024

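    # Time the kernel with BenchmarkTools; the $-interpolation avoids measuring overhead from non-constant globals.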
    t = @belapsed begin
        kernel = saxpy_kernel($backend, $workgroup_size, $(size(Y)))
        kernel($a, $X, $Y, ndrange = $(size(Y)))
        KernelAbstractions.synchronize($backend)
    end evals=2 samples=10

    mem_rate = bytes * 1e-9 / t # GB/s
    flop_rate = flops * 1e-9 / t # GFLOP/s

    if verbose
        println("\tMemory Bandwidth (GB/s): ", round(mem_rate; digits = 2))
        println("\tCompute (GFLOP/s): ", round(flop_rate; digits = 2))
    end
    return mem_rate, flop_rate
end

# Static should be much better (on a system with multiple NUMA domains)
measure_membw(CPU());
measure_membw(CPU(; static=true));

# The following has significantly worse performance (even on systems with a single memory domain)!
# measure_membw(CPU(); init=:serial);
# measure_membw(CPU(; static=true); init=:serial);

Important remarks:

  1. Pin your threads systematically to the available physical (or virtual) CPU cores. ThreadPinning.jl is your friend.
  2. Opt out of Julia's dynamic task scheduling (especially task migration) by using CPU(; static=true) instead of CPU().
  3. Initialize your data in parallel(!). It is of utmost importance to use a parallel access pattern for initialization that is as similar as possible to the access pattern of your computational kernel. The reason for this is the NUMA "first-touch" policy. KernelAbstractions.zeros(backend, dtype, N) is your friend. (A sketch illustrating this first-touch effect follows below.)
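
To make the first-touch effect behind remark 3 concrete, here is a minimal sketch (not part of the benchmark above; Float32 and N mirror the defaults used there) contrasting serial and parallel initialization. With pinned threads, the parallel variant touches each memory page from a thread in the NUMA domain that will later work on it, which is what the example relies on KernelAbstractions.zeros(backend, dtype, N) for.

using Random

N = 1024 * 500_000

# Serial first touch: the main thread writes every page, so the entire array ends up
# in the NUMA domain of the main thread and most other threads access it remotely.
X_serial = rand!(zeros(Float32, N))

# Parallel first touch: pages of a large allocation are only mapped when first written,
# so letting each (pinned) thread write its own chunk spreads the array over the
# NUMA domains that will later access it.
X_parallel = Vector{Float32}(undef, N)
Threads.@threads :static for i in eachindex(X_parallel)
    @inbounds X_parallel[i] = rand(Float32)
end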

Demonstration:

If the above example is run with 128 Julia threads on a Noctua 2 compute node (128 physical cores distributed over two AMD Milan 7763 CPUs with 4 NUMA domains each), one may get the following numbers (comments added for demonstration purposes):

Memory Bandwidth (GB/s): 145.64 # backend = CPU(), init = :parallel
Compute (GFLOP/s): 24.27

Memory Bandwidth (GB/s): 333.83 # backend = CPU(; static=true), init = :parallel
Compute (GFLOP/s): 55.64

Memory Bandwidth (GB/s): 32.74 # backend = CPU(), init = :serial
Compute (GFLOP/s): 5.46

Memory Bandwidth (GB/s): 32.46 # backend = CPU(; static=true), init = :serial
Compute (GFLOP/s): 5.41

The key observations are the following:

  • Serial initialization leads to subpar performance (slower by at least a factor of 4.5) independent of the chosen CPU backend. This is a manifestation of remark 3 above.
  • The static CPU backend gives more than 2x better performance than the dynamic one based on Threads.@spawn. This is a manifestation of remark 2 (and, in some sense, also 1) above.