Reductions
Apply a custom binary operator reduction over all elements of an iterable; this can be used to compute minima, sums, counts, etc.
- Other names: Kokkos's `parallel_reduce`, `fold`, `aggregate`.
AcceleratedKernels.reduce — Function

```julia
reduce(
    op, src::AbstractArray, backend::Backend=get_backend(src);
    init,
    neutral=GPUArrays.neutral_element(op, eltype(src)),
    dims::Union{Nothing, Int}=nothing,

    # CPU settings
    scheduler=:static,
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    switch_below::Int=0,
)
```
Reduce `src` along dimensions `dims` using the binary operator `op`. If `dims` is `nothing`, reduce `src` to a scalar. If `dims` is an integer, reduce `src` along that dimension. The `init` value is used as the initial value for the reduction; `neutral` is the neutral element for the operator `op`.
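For example, reducing with an operator like `min` needs an `init` that cannot affect the result; a minimal sketch, assuming a CUDA device is available (any other JuliaGPU backend works the same way):

```julia
import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 100_000))

# For `min`, typemax(eltype(v)) is an identity: it can never win a
# comparison, so it does not bias the result
vmin = AK.reduce(min, v; init=typemax(eltype(v)))
```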
CPU settings
The `scheduler` can be one of the OhMyThreads.jl schedulers, i.e. `:static`, `:dynamic`, `:greedy` or `:serial`. Assuming the workload is uniform (as the GPU algorithm prefers), `:static` is used by default; if you need fine-grained control over your threads, consider using OhMyThreads.jl directly.
Use at most `max_tasks` threads with at least `min_elems` elements per task.
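For example, a sketch of tuning the CPU path on a plain `Array` (the scheduler and thread counts here are illustrative, not recommendations):

```julia
import AcceleratedKernels as AK

v = rand(Float32, 1_000_000)

# Dynamic scheduling with at most 4 tasks, each handling at least
# 10_000 elements; smaller inputs spawn proportionally fewer tasks
vsum = AK.reduce(+, v; init=0.0f0,
                 scheduler=:dynamic, max_tasks=4, min_elems=10_000)
```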
GPU settings
The `block_size` parameter controls the number of threads per block.
The `temp` parameter can be used to pass a pre-allocated temporary array. For reduction to a scalar (`dims=nothing`), `length(temp) >= 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size)` is required. For reduction along a dimension (`dims` is an integer), `temp` is used as the destination array, and thus must have the exact dimensions required - i.e. the same dimensionwise sizes as `src`, except for the reduced dimension, which becomes 1; there are some corner cases when one dimension is zero - check against `Base.reduce` for CPU arrays for the exact behavior.
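For example, a sketch of reusing a pre-allocated `temp` across repeated scalar reductions, sized per the formula above (assuming a CUDA device and that the element type should match the reduction result):

```julia
import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 100_000))
block_size = 256

# Size temp per the formula above; reusing it across calls avoids
# allocating a fresh temporary on every reduction
n_temp = 2 * (length(v) + 2 * block_size - 1) ÷ (2 * block_size)
temp = CuArray{Float32}(undef, n_temp)

vsum = AK.reduce(+, v; init=0.0f0, block_size=block_size, temp=temp)
```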
The `switch_below` parameter controls the threshold below which the reduction is performed on the CPU; it is only used for 1D reductions (i.e. `dims=nothing`).
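For very small arrays, a CPU fallback can be cheaper than a kernel launch; a sketch with an illustrative threshold (assuming a CUDA device):

```julia
import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 500))

# Arrays shorter than 1000 elements are reduced on the CPU instead of
# launching a GPU kernel; 1000 is an arbitrary value for illustration
vsum = AK.reduce(+, v; init=0.0f0, switch_below=1000)
```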
Platform-Specific Notes
N-dimensional reductions on the CPU are not parallelized yet (issue), and defer to `Base.reduce`.
Examples
Computing a sum, reducing down to a scalar that is copied to host:
```julia
import AcceleratedKernels as AK
using CUDA

# Use Int32 rather than Int16: summing 100_000 values of up to 1000
# would overflow the Int16 accumulator implied by init
v = CuArray{Int32}(rand(1:1000, 100_000))
vsum = AK.reduce((x, y) -> x + y, v; init=zero(eltype(v)))
```
Computing dimensionwise sums in a 2D matrix:
```julia
import AcceleratedKernels as AK
using Metal

m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
mrowsum = AK.reduce(+, m; init=zero(eltype(m)), dims=1)
mcolsum = AK.reduce(+, m; init=zero(eltype(m)), dims=2)
```