Accumulate / Prefix Sum / Scan

AcceleratedKernels.accumulate!Function
accumulate!(
    op, v::AbstractArray, backend::Backend=get_backend(v);
    init,
    neutral=GPUArrays.neutral_element(op, eltype(v)),
    dims::Union{Nothing, Int}=nothing,
    inclusive::Bool=true,

    # Algorithm choice
    alg::AccumulateAlgorithm=DecoupledLookback(),

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    temp_flags::Union{Nothing, AbstractArray}=nothing,
)

accumulate!(
    op, dst::AbstractArray, src::AbstractArray, backend::Backend=get_backend(v);
    init,
    neutral=GPUArrays.neutral_element(op, eltype(dst)),
    dims::Union{Nothing, Int}=nothing,
    inclusive::Bool=true,

    # Algorithm choice
    alg::AccumulateAlgorithm=DecoupledLookback(),

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    temp_flags::Union{Nothing, AbstractArray}=nothing,
)

Compute accumulated running totals along a sequence by applying a binary operator to all elements up to the current one; often used in GPU programming as a first step in finding / extracting subsets of data.

Other names: prefix sum, thrust::scan, cumulative sum; inclusive (or exclusive) if the first element is included in the accumulation (or not).

For compatibility with the Base.accumulate! function, we provide the two-array interface too, but we do not need the constraint of dst and src being different; to minimise memory use, we recommend using the single-array interface (the first one above).

CPU

The CPU implementation is currently single-threaded; we are waiting on a multithreaded implementation in OhMyThreads.jl (issue).

GPU

For the 1D case (dims=nothing), the alg can be one of the following:

  • DecoupledLookback(): the default algorithm, using opportunistic lookback to reuse earlier blocks' results; requires device-level memory consistency guarantees, which Apple Metal does not provide.
  • ScanPrefixes(): a simpler algorithm that scans the prefixes of each block, with no lookback; it has similar performance as DecoupledLookback() for large block sizes, and small to medium arrays, but poorer scaling for many blocks; there is no performance degradation below block_size^2 elements.

A different, unique algorithm is used for the multi-dimensional case (dims is an integer).

The block_size should be a power of 2 and greater than 0.

The temporaries are only used for the 1D case (dims=nothing): temp stores per-block aggregates; temp_flags is only used for the DecoupledLookback() algorithm for flagging if blocks are ready; they should both have at least (length(v) + 2 * block_size - 1) ÷ (2 * block_size) elements; also, eltype(v) === eltype(temp) is required; the elements in temp_flags can be any integers, but Int8 is used by default to reduce memory usage.

Platform-Specific Notes

On Metal, the alg=ScanPrefixes() algorithm is used by default, as Apple Metal GPUs do not have strong enough memory consistency guarantees for the DecoupledLookback() algorithm - which produces incorrect results about 0.38% of the time (the beauty of parallel algorithms, ey). Also, block_size=1024 is used here by default to reduce the number of coupled lookbacks.

Examples

Example computing an inclusive prefix sum (the typical GPU "scan"):

import AcceleratedKernels as AK
using oneAPI

v = oneAPI.ones(Int32, 100_000)
AK.accumulate!(+, v, init=0)

# Use a different algorithm
AK.accumulate!(+, v, alg=AK.ScanPrefixes())
source
AcceleratedKernels.accumulateFunction
accumulate(
    op, v::AbstractArray, backend::Backend=get_backend(v);
    init,
    neutral=GPUArrays.neutral_element(op, eltype(v)),
    dims::Union{Nothing, Int}=nothing,
    inclusive::Bool=true,

    # Algorithm choice
    alg::AccumulateAlgorithm=DecoupledLookback(),

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    temp_flags::Union{Nothing, AbstractArray}=nothing,
)

Out-of-place version of accumulate!.

source