Accumulate · AcceleratedKernels.jl

Accumulate / Prefix Sum / Scan

AcceleratedKernels.accumulate! — Function

accumulate!(
    op, v::AbstractArray, backend::Backend=get_backend(v);
    init,
    neutral=neutral_element(op, eltype(v)),
    dims::Union{Nothing, Int}=nothing,
    inclusive::Bool=true,

    # CPU settings
    max_tasks::Int=Threads.nthreads(),
    min_elems::Int=2,

    # Algorithm choice
    alg::AccumulateAlgorithm=ScanPrefixes(),

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    temp_flags::Union{Nothing, AbstractArray}=nothing,
)

accumulate!(
    op, dst::AbstractArray, src::AbstractArray, backend::Backend=get_backend(dst);
    init,
    neutral=neutral_element(op, eltype(dst)),
    dims::Union{Nothing, Int}=nothing,
    inclusive::Bool=true,

    # CPU settings
    max_tasks::Int=Threads.nthreads(),
    min_elems::Int=2,

    # Algorithm choice
    alg::AccumulateAlgorithm=ScanPrefixes(),

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    temp_flags::Union{Nothing, AbstractArray}=nothing,
)

Compute accumulated running totals along a sequence by applying a binary operator to all elements up to the current one; often used in GPU programming as a first step in finding / extracting subsets of data.

Other names: prefix sum, thrust::scan, cumulative sum. A scan is inclusive if each output includes the element at its own position, and exclusive if it covers only the elements before that position.
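The difference between the two variants can be sketched in plain, serial Base Julia. This only illustrates the conventional definitions; `exclusive_scan` is a hypothetical helper written for this example and is not part of AcceleratedKernels.

```julia
# Plain Base Julia, serial; illustrates scan semantics only.
v = [1, 2, 3, 4]

# Inclusive: output i combines elements 1..i.
inclusive = accumulate(+, v)

# Exclusive (hypothetical helper): output i combines elements 1..i-1,
# so the first output is the starting value passed in.
function exclusive_scan(op, v, init)
    out = similar(v)
    running = init
    for i in eachindex(v)
        out[i] = running              # store the total *before* element i
        running = op(running, v[i])   # then fold element i in
    end
    return out
end

exclusive = exclusive_scan(+, v, 0)
```

Here `inclusive` is `[1, 3, 6, 10]` and `exclusive` is `[0, 1, 3, 6]`.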

For compatibility with the Base.accumulate! function, we provide the two-array interface too, but we do not need the constraint of dst and src being different; to minimise memory use, we recommend using the single-array interface (the first one above).

CPU

Use at most max_tasks threads with at least min_elems elements per task.

Accumulation is typically a memory-bound operation, so multithreaded accumulation only becomes faster when the work per element is compute-heavy enough to hide memory latency. That includes:

Accumulating more complex types, e.g. accumulation of tuples / structs / strings.
More complex operators, e.g. op=custom_complex_function.
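As an illustration of the kind of workload meant here, the sketch below accumulates tuples with a custom operator, using serial `Base.accumulate` so that it runs without any extra packages. The names `minmax_op` and `items` are made up for this example. With the AcceleratedKernels interface, a custom operator like this would presumably also need explicit `init` and `neutral` values.

```julia
# Running (minimum, maximum) over a sequence: each element is a tuple, and the
# operator does more work per element than a plain `+`.
minmax_op(a, b) = (min(a[1], b[1]), max(a[2], b[2]))

data = [3.0, -1.0, 4.0, 1.5]
items = [(x, x) for x in data]   # seed each element as (min, max) of itself

running = accumulate(minmax_op, items)
```

The last entry of `running` holds the overall minimum and maximum, `(-1.0, 4.0)`.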

GPU

For the 1D case (dims=nothing), the alg can be one of the following:

ScanPrefixes(): the default algorithm, which scans the prefixes of each block with no lookback. It performs better than DecoupledLookback() for large block sizes and small to medium arrays, but scales worse for many blocks. There is no performance degradation below block_size^2 elements, and it remains fast well into millions of elements.
DecoupledLookback(): a more complex algorithm using opportunistic lookback to reuse earlier blocks' results; requires device-level memory consistency guarantees (which Apple Metal does not provide) and atomic orderings; theoretically more scalable for many blocks.
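The general idea behind a block-wise scan can be sketched serially on the CPU: scan each block on its own, scan the per-block totals, then fold the preceding total into every later block. This is a minimal sketch under that assumption, not the library's GPU kernel; `blockwise_scan` and `block_len` are hypothetical names, and the input is assumed to be a 1-based vector.

```julia
# Serial sketch of a block-wise inclusive scan (hypothetical helper).
function blockwise_scan(op, v::AbstractVector, block_len::Int)
    out = copy(v)
    nblocks = cld(length(v), block_len)
    aggregates = Vector{eltype(v)}(undef, nblocks)

    # Pass 1: inclusive scan inside each block; record each block's total.
    for b in 1:nblocks
        lo = (b - 1) * block_len + 1
        hi = min(b * block_len, length(v))
        for i in (lo + 1):hi
            out[i] = op(out[i - 1], out[i])
        end
        aggregates[b] = out[hi]
    end

    # Pass 2: inclusive scan over the per-block totals.
    for b in 2:nblocks
        aggregates[b] = op(aggregates[b - 1], aggregates[b])
    end

    # Pass 3: combine each later block with the total of all blocks before it.
    for b in 2:nblocks
        lo = (b - 1) * block_len + 1
        hi = min(b * block_len, length(v))
        for i in lo:hi
            out[i] = op(aggregates[b - 1], out[i])
        end
    end
    return out
end

v = collect(1:10)
result = blockwise_scan(+, v, 4)
```

On a GPU, passes 1 and 3 are the parts that can run across blocks in parallel; the `aggregates` array plays the role of the per-block temporary described below.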

A different, unique algorithm is used for the multi-dimensional case (dims is an integer).
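For reference, this is what accumulating along a dimension means in `Base.accumulate`. The sketch assumes that `dims` carries the same meaning in AcceleratedKernels, which seems likely given the stated Base compatibility but is not confirmed here.

```julia
A = [1 2 3;
     4 5 6]

# dims=1: accumulate along the first dimension (down each column).
down_columns = accumulate(+, A; dims=1)

# dims=2: accumulate along the second dimension (across each row).
along_rows = accumulate(+, A; dims=2)
```

`down_columns` is `[1 2 3; 5 7 9]` and `along_rows` is `[1 3 6; 4 9 15]`.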

The block_size should be a power of 2 and greater than 0.

The temporaries are only used for the 1D case (dims=nothing). temp stores per-block aggregates; temp_flags is only used by the DecoupledLookback() algorithm to flag whether blocks are ready. Both should have at least (length(v) + 2 * block_size - 1) ÷ (2 * block_size) elements. eltype(v) === eltype(temp) is required. The elements of temp_flags can be any integers, but UInt8 is used by default to reduce memory usage.
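The sizing rule can be worked through with a small sketch. `min_temp_length` is a hypothetical helper that just evaluates the formula above, and a plain CPU `Array` stands in for a device array so the snippet runs without a GPU.

```julia
# Hypothetical helper: minimum number of elements for `temp` and `temp_flags`,
# following the formula quoted above.
min_temp_length(n, block_size) = (n + 2 * block_size - 1) ÷ (2 * block_size)

block_size = 256
@assert block_size > 0 && ispow2(block_size)   # block_size should be a power of 2

n = 100_000
num = min_temp_length(n, block_size)           # (100_000 + 511) ÷ 512

v = ones(Int32, n)                  # stand-in for a device array
temp = similar(v, eltype(v), num)   # element type must match eltype(v)
temp_flags = similar(v, UInt8, num) # any integer type is allowed; UInt8 is the default
```

For 100,000 elements and a block size of 256 this gives 196 elements each. The buffers would then be passed through the `temp` and `temp_flags` keywords shown in the signature, which allows them to be reused across repeated calls.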

Examples

Example computing an inclusive prefix sum (the typical GPU "scan"):

import AcceleratedKernels as AK
using oneAPI

v = oneAPI.ones(Int32, 100_000)
AK.accumulate!(+, v, init=0)

# Use a different algorithm
AK.accumulate!(+, v, init=0, alg=AK.DecoupledLookback())


AcceleratedKernels.accumulate — Function

accumulate(
    op, v::AbstractArray, backend::Backend=get_backend(v);
    init,
    neutral=neutral_element(op, eltype(v)),
    dims::Union{Nothing, Int}=nothing,
    inclusive::Bool=true,

    # CPU settings
    max_tasks::Int=Threads.nthreads(),
    min_elems::Int=2,

    # Algorithm choice
    alg::AccumulateAlgorithm=ScanPrefixes(),

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    temp_flags::Union{Nothing, AbstractArray}=nothing,
)

Out-of-place version of accumulate!.
