Accumulate / Prefix Sum / Scan
AcceleratedKernels.accumulate!
— Functionaccumulate!(
op, v::AbstractArray, backend::Backend=get_backend(v);
init,
neutral=GPUArrays.neutral_element(op, eltype(v)),
dims::Union{Nothing, Int}=nothing,
inclusive::Bool=true,
# Algorithm choice
alg::AccumulateAlgorithm=DecoupledLookback(),
# GPU settings
block_size::Int=256,
temp::Union{Nothing, AbstractArray}=nothing,
temp_flags::Union{Nothing, AbstractArray}=nothing,
)
accumulate!(
op, dst::AbstractArray, src::AbstractArray, backend::Backend=get_backend(v);
init,
neutral=GPUArrays.neutral_element(op, eltype(dst)),
dims::Union{Nothing, Int}=nothing,
inclusive::Bool=true,
# Algorithm choice
alg::AccumulateAlgorithm=DecoupledLookback(),
# GPU settings
block_size::Int=256,
temp::Union{Nothing, AbstractArray}=nothing,
temp_flags::Union{Nothing, AbstractArray}=nothing,
)
Compute accumulated running totals along a sequence by applying a binary operator to all elements up to the current one; often used in GPU programming as a first step in finding / extracting subsets of data.
Other names: prefix sum, thrust::scan
, cumulative sum; inclusive (or exclusive) if the first element is included in the accumulation (or not).
For compatibility with the Base.accumulate!
function, we provide the two-array interface too, but we do not need the constraint of dst
and src
being different; to minimise memory use, we recommend using the single-array interface (the first one above).
CPU
The CPU implementation is currently single-threaded; we are waiting on a multithreaded implementation in OhMyThreads.jl (issue).
GPU
For the 1D case (dims=nothing
), the alg
can be one of the following:
DecoupledLookback()
: the default algorithm, using opportunistic lookback to reuse earlier blocks' results; requires device-level memory consistency guarantees, which Apple Metal does not provide.ScanPrefixes()
: a simpler algorithm that scans the prefixes of each block, with no lookback; it has similar performance asDecoupledLookback()
for large block sizes, and small to medium arrays, but poorer scaling for many blocks; there is no performance degradation belowblock_size^2
elements.
A different, unique algorithm is used for the multi-dimensional case (dims
is an integer).
The block_size
should be a power of 2 and greater than 0.
The temporaries are only used for the 1D case (dims=nothing
): temp
stores per-block aggregates; temp_flags
is only used for the DecoupledLookback()
algorithm for flagging if blocks are ready; they should both have at least (length(v) + 2 * block_size - 1) ÷ (2 * block_size)
elements; also, eltype(v) === eltype(temp)
is required; the elements in temp_flags
can be any integers, but Int8
is used by default to reduce memory usage.
Platform-Specific Notes
On Metal, the alg=ScanPrefixes()
algorithm is used by default, as Apple Metal GPUs do not have strong enough memory consistency guarantees for the DecoupledLookback()
algorithm - which produces incorrect results about 0.38% of the time (the beauty of parallel algorithms, ey). Also, block_size=1024
is used here by default to reduce the number of coupled lookbacks.
Examples
Example computing an inclusive prefix sum (the typical GPU "scan"):
import AcceleratedKernels as AK
using oneAPI
v = oneAPI.ones(Int32, 100_000)
AK.accumulate!(+, v, init=0)
# Use a different algorithm
AK.accumulate!(+, v, alg=AK.ScanPrefixes())
AcceleratedKernels.accumulate
— Functionaccumulate(
op, v::AbstractArray, backend::Backend=get_backend(v);
init,
neutral=GPUArrays.neutral_element(op, eltype(v)),
dims::Union{Nothing, Int}=nothing,
inclusive::Bool=true,
# Algorithm choice
alg::AccumulateAlgorithm=DecoupledLookback(),
# GPU settings
block_size::Int=256,
temp::Union{Nothing, AbstractArray}=nothing,
temp_flags::Union{Nothing, AbstractArray}=nothing,
)
Out-of-place version of accumulate!
.