MapReduce
Equivalent to `reduce(op, map(f, iterable))`, without saving the intermediate mapped collection; it can be used, e.g., to split documents into words (map) and count their frequency (reduce), as sketched below.
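A minimal CPU-only sketch of that word-count pattern, using Base Julia's `mapreduce` and a hypothetical `docs` collection (nothing here is AcceleratedKernels-specific):

```julia
docs = ["the quick brown fox", "the lazy dog", "the fox"]  # hypothetical input

# map: each word becomes a singleton count; reduce: merge counts by summing
counts = mapreduce(
    w -> Dict(w => 1),
    (a, b) -> mergewith(+, a, b),
    split(join(docs, " ")),
)
counts["the"]  # == 3
```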
- Other names: `transform_reduce`; some `fold` implementations include the mapping function too.
AcceleratedKernels.mapreduce — Function

```julia
mapreduce(
    f, op, src::AbstractArray, backend::Backend=get_backend(src);
    init,
    neutral=GPUArrays.neutral_element(op, eltype(src)),
    dims::Union{Nothing, Int}=nothing,

    # CPU settings
    scheduler=:static,
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    switch_below::Int=0,
)
```
Reduce `src` along dimensions `dims` using the binary operator `op` after applying `f` elementwise. If `dims` is `nothing`, reduce `src` to a scalar. If `dims` is an integer, reduce `src` along that dimension. The `init` value is used as the initial value for the reduction (i.e. applied after mapping).

The `neutral` value is the neutral element (zero) for the operator `op`, which is needed for an efficient GPU implementation that also allows a nonzero `init`.
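For example, a custom binary operator has no neutral element registered with `GPUArrays.neutral_element`, so `neutral` must be passed explicitly; `init` may then be any starting value, not necessarily the neutral one. A minimal sketch (the names `combine` and `x` are illustrative):

```julia
import AcceleratedKernels as AK

x = rand(Float32, 1_000)

# A custom operator: GPUArrays cannot infer its neutral element, so we pass
# `neutral` ourselves; `init` contributes a nonzero starting value on top.
combine(a, b) = a + b
total = AK.mapreduce(abs, combine, x; init=100.0f0, neutral=0.0f0)
# total ≈ 100 + sum(abs, x)
```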
CPU settings
The `scheduler` can be one of the OhMyThreads.jl schedulers, i.e. `:static`, `:dynamic`, `:greedy` or `:serial`. Assuming the workload is uniform (as the GPU algorithm prefers), `:static` is used by default; if you need fine-grained control over your threads, consider using OhMyThreads.jl directly.

Use at most `max_tasks` threads with at least `min_elems` elements per task.
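As a sketch (the values are illustrative, not tuned), the following limits the CPU path to at most four tasks with at least 10,000 elements each:

```julia
import AcceleratedKernels as AK

x = rand(Float32, 1_000_000)

# Plain Arrays dispatch to the multithreaded CPU implementation
s = AK.mapreduce(abs, +, x; init=0.0f0,
                 scheduler=:static, max_tasks=4, min_elems=10_000)
```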
GPU settings
The `block_size` parameter controls the number of threads per block.

The `temp` parameter can be used to pass a pre-allocated temporary array. For reduction to a scalar (`dims=nothing`), `length(temp) >= 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size)` is required. For reduction along a dimension (`dims` is an integer), `temp` is used as the destination array and thus must have exactly the required dimensions - i.e. the same dimensionwise sizes as `src`, except for the reduced dimension, which becomes 1; there are some corner cases when one dimension is zero, so check against `Base.reduce` on CPU arrays for the exact behavior.

The `switch_below` parameter controls the threshold below which the reduction is performed on the CPU; it is only used for 1D reductions (i.e. `dims=nothing`).
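A sketch of pre-allocating `temp` for a scalar reduction using the size formula above (assumes a CUDA device; the array names and tuning values are illustrative):

```julia
import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 1_000_000))
block_size = 256

# Minimum temp length required for scalar reduction (formula from above)
temp_len = 2 * (length(v) + 2 * block_size - 1) ÷ (2 * block_size)
temp = similar(v, temp_len)

s = AK.mapreduce(abs, +, v; init=0.0f0,
                 block_size=block_size, temp=temp,
                 switch_below=10_000)  # run on the CPU below 10_000 elements
```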
Example
Computing a sum of squares, reducing down to a scalar that is copied to the host:

```julia
import AcceleratedKernels as AK
using CUDA

v = CuArray{Int16}(rand(1:1000, 100_000))

# Square in a wider type: values up to 1000 squared would overflow Int16
vsumsq = AK.mapreduce(x -> Int64(x) * Int64(x), +, v; init=zero(Int64))
```
Computing dimensionwise sums of squares in a 2D matrix:

```julia
import AcceleratedKernels as AK
using Metal

f(x) = x * x

m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))

# dims=1 collapses the first dimension, giving a 1×100_000 result;
# dims=2 collapses the second, giving a 10×1 result
mrowsumsq = AK.mapreduce(f, +, m; init=zero(eltype(m)), dims=1)
mcolsumsq = AK.mapreduce(f, +, m; init=zero(eltype(m)), dims=2)
```
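Continuing the snippet above, the results can be sanity-checked against Base's `sum` (a sketch; it copies the data to the host):

```julia
@assert Array(mrowsumsq) == sum(f, Array(m); dims=1)
@assert Array(mcolsumsq) == sum(f, Array(m); dims=2)
```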