Reductions
Apply a custom binary operator reduction on all elements in an iterable; can be used to compute minima, sums, counts, etc.
- Other names:
Kokkos:parallel_reduce,fold,aggregate.
AcceleratedKernels.reduce — Functionreduce(
op, src::AbstractArray, backend::Backend=get_backend(src);
init,
neutral=neutral_element(op, eltype(src)),
dims::Union{Nothing, Int}=nothing,
# CPU settings
max_tasks::Int=Threads.nthreads(),
min_elems::Int=1,
# GPU settings
block_size::Int=256,
temp::Union{Nothing, AbstractArray}=nothing,
switch_below::Int=0,
)Reduce src along dimensions dims using the binary operator op. If dims is nothing, reduce src to a scalar. If dims is an integer, reduce src along that dimension. The init value is used as the initial value for the reduction; neutral is the neutral element for the operator op.
The returned type is the same as init - to control output precision, specify init explicitly.
CPU settings
Use at most max_tasks threads with at least min_elems elements per task. For N-dimensional arrays (dims::Int) multithreading currently only becomes faster for max_tasks >= 4; all other cases are scaling linearly with the number of threads.
Note that multithreading reductions only improves performance for cases with more compute-heavy operations, which hide the memory latency and thread launch overhead - that includes:
- Reducing more complex types, e.g. reduction of tuples / structs / strings.
- More complex operators, e.g.
op=custom_complex_op_function.
For non-memory-bound operations, reductions scale almost linearly with the number of threads.
GPU settings
The block_size parameter controls the number of threads per block.
The temp parameter can be used to pass a pre-allocated temporary array. For reduction to a scalar (dims=nothing), length(temp) >= 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size) is required. For reduction along a dimension (dims is an integer), temp is used as the destination array, and thus must have the exact dimensions required - i.e. same dimensionwise sizes as src, except for the reduced dimension which becomes 1; there are some corner cases when one dimension is zero, check against Base.reduce for CPU arrays for exact behavior.
The switch_below parameter controls the threshold below which the reduction is performed on the CPU and is only used for 1D reductions (i.e. dims=nothing).
Examples
Computing a sum, reducing down to a scalar that is copied to host:
import AcceleratedKernels as AK
using CUDA
v = CuArray{Int16}(rand(1:1000, 100_000))
vsum = AK.reduce((x, y) -> x + y, v; init=zero(eltype(v)))Computing dimensionwise sums in a 2D matrix:
import AcceleratedKernels as AK
using Metal
m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
mrowsum = AK.reduce(+, m; init=zero(eltype(m)), dims=1)
mcolsum = AK.reduce(+, m; init=zero(eltype(m)), dims=2)