Reduce · AcceleratedKernels.jl

Reductions

Apply a custom binary operator reduction on all elements in an iterable; can be used to compute minima, sums, counts, etc.

Other names: Kokkos:parallel_reduce, fold, aggregate.

AcceleratedKernels.reduce — Function

reduce(
    op, src::AbstractArray, backend::Backend=get_backend(src);
    init,
    neutral=neutral_element(op, eltype(src)),
    dims::Union{Nothing, Int}=nothing,

    # CPU settings
    max_tasks::Int=Threads.nthreads(),
    min_elems::Int=1,

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    switch_below::Int=0,
)

Reduce src along dimensions dims using the binary operator op. If dims is nothing, reduce src to a scalar. If dims is an integer, reduce src along that dimension. The init value is used as the initial value for the reduction; neutral is the neutral element for the operator op.

The returned type is the same as init - to control output precision, specify init explicitly.

CPU settings

Use at most max_tasks threads with at least min_elems elements per task. For N-dimensional arrays (dims::Int) multithreading currently only becomes faster for max_tasks >= 4; all other cases are scaling linearly with the number of threads.

Note that multithreading reductions only improves performance for cases with more compute-heavy operations, which hide the memory latency and thread launch overhead - that includes:

Reducing more complex types, e.g. reduction of tuples / structs / strings.
More complex operators, e.g. op=custom_complex_op_function.

For non-memory-bound operations, reductions scale almost linearly with the number of threads.

GPU settings

The block_size parameter controls the number of threads per block.

The temp parameter can be used to pass a pre-allocated temporary array. For reduction to a scalar (dims=nothing), length(temp) >= 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size) is required. For reduction along a dimension (dims is an integer), temp is used as the destination array, and thus must have the exact dimensions required - i.e. same dimensionwise sizes as src, except for the reduced dimension which becomes 1; there are some corner cases when one dimension is zero, check against Base.reduce for CPU arrays for exact behavior.

The switch_below parameter controls the threshold below which the reduction is performed on the CPU and is only used for 1D reductions (i.e. dims=nothing).

Examples

Computing a sum, reducing down to a scalar that is copied to host:

import AcceleratedKernels as AK
using CUDA

v = CuArray{Int16}(rand(1:1000, 100_000))
vsum = AK.reduce((x, y) -> x + y, v; init=zero(eltype(v)))

Computing dimensionwise sums in a 2D matrix:

import AcceleratedKernels as AK
using Metal

m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
mrowsum = AK.reduce(+, m; init=zero(eltype(m)), dims=1)
mcolsum = AK.reduce(+, m; init=zero(eltype(m)), dims=2)

source