Reductions

Apply a custom binary-operator reduction over all elements of an iterable; this can be used to compute minima, sums, counts, etc.

  • Other names: Kokkos::parallel_reduce, fold, aggregate.

AcceleratedKernels.reduce — Function
reduce(
    op, src::AbstractArray, backend::Backend=get_backend(src);
    init,
    neutral=GPUArrays.neutral_element(op, eltype(src)),
    dims::Union{Nothing, Int}=nothing,

    # CPU settings
    scheduler=:static,
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    switch_below::Int=0,
)

Reduce src along dimensions dims using the binary operator op. If dims is nothing, reduce src to a scalar. If dims is an integer, reduce src along that dimension. The init value is used as the initial value for the reduction; neutral is the neutral element for the operator op.

CPU settings

The scheduler can be one of the OhMyThreads.jl schedulers, i.e. :static, :dynamic, :greedy or :serial. Assuming the workload is uniform (as the GPU algorithm prefers), :static is used by default; if you need fine-grained control over your threads, consider using OhMyThreads.jl directly.

Use at most max_tasks threads with at least min_elems elements per task.
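For plain CPU arrays the backend is inferred from the input, so the threading settings can be passed directly to the same call. A minimal sketch (the parameter values here are illustrative, not recommendations):

```julia
import AcceleratedKernels as AK

v = rand(1:1000, 100_000)

# Compute the minimum on the CPU with explicit threading settings:
# spawn at most 4 tasks, each handling at least 10_000 elements.
vmin = AK.reduce(
    min, v;
    init=typemax(eltype(v)),
    scheduler=:static,
    max_tasks=4,
    min_elems=10_000,
)
```

With fewer than `min_elems * max_tasks` elements, fewer tasks are spawned, so small inputs avoid oversubscription.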

GPU settings

The block_size parameter controls the number of threads per block.

The temp parameter can be used to pass a pre-allocated temporary array. For reduction to a scalar (dims=nothing), length(temp) >= 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size) is required. For reduction along a dimension (dims is an integer), temp is used as the destination array and must therefore have exactly the required dimensions: the same size as src along every dimension except the reduced one, which becomes 1. There are some corner cases when a dimension has size zero; check against Base.reduce on CPU arrays for the exact behaviour.
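The required temp length for a scalar reduction can be computed ahead of time from the formula above; a sketch (the final allocation line is illustrative and assumes a GPU array src):

```julia
block_size = 256
n = 100_000  # length(src)

# Minimum temp length for a scalar reduction, per the bound above
temp_len = 2 * (n + 2 * block_size - 1) ÷ (2 * block_size)

# Pre-allocate on the same backend as src, e.g.:
# temp = similar(src, temp_len)
# AK.reduce(+, src; init=zero(eltype(src)), temp=temp)
```

Reusing one temp buffer across repeated reductions of same-sized inputs avoids a per-call allocation.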

The switch_below parameter controls the input length below which the reduction is performed on the CPU instead; it is only used for reductions to a scalar (dims=nothing).

Platform-Specific Notes

N-dimensional reductions on the CPU are not yet parallelised (issue) and defer to Base.reduce.

Examples

Computing a sum, reducing down to a scalar that is copied to host:

import AcceleratedKernels as AK
using CUDA

v = CuArray{Int16}(rand(1:1000, 100_000))
vsum = AK.reduce((x, y) -> x + y, v; init=zero(eltype(v)))

Computing dimensionwise sums in a 2D matrix:

import AcceleratedKernels as AK
using Metal

m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
mrowsum = AK.reduce(+, m; init=zero(eltype(m)), dims=1)
mcolsum = AK.reduce(+, m; init=zero(eltype(m)), dims=2)
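The same calls run unchanged on plain CPU Arrays, since the backend is inferred from the input; this makes it easy to sanity-check results against Base. A sketch:

```julia
import AcceleratedKernels as AK

m = rand(Int32(1):Int32(100), 10, 1_000)

# Dimensionwise sums on the CPU, compared against Base's sum
mrowsum = AK.reduce(+, m; init=zero(eltype(m)), dims=1)
mcolsum = AK.reduce(+, m; init=zero(eltype(m)), dims=2)

mrowsum == sum(m; dims=1)  # expected to hold
mcolsum == sum(m; dims=2)  # expected to hold
```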