MapReduce

Equivalent to reduce(op, map(f, iterable)), but without materialising the intermediate mapped collection; it can be used, for example, to split documents into words (map) and count their frequencies (reduce), as sketched below.

  • Other names: transform_reduce; some fold implementations include the mapping function too.
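
For instance, the word-count use case above can be sketched in plain Base Julia (no AcceleratedKernels needed), mapping each word to a singleton count and merging the counts in the reduction:

text = "the quick brown fox jumps over the lazy dog the fox"
counts = mapreduce(
    w -> Dict(w => 1),             # map: each word becomes a one-entry count
    (a, b) -> mergewith(+, a, b),  # reduce: merge dictionaries, adding counts
    split(text),
)
counts["the"]  # 3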

AcceleratedKernels.mapreduce — Function
mapreduce(
    f, op, src::AbstractArray, backend::Backend=get_backend(src);
    init,
    neutral=GPUArrays.neutral_element(op, eltype(src)),
    dims::Union{Nothing, Int}=nothing,

    # CPU settings
    scheduler=:static,
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    switch_below::Int=0,
)

Reduce src along dimensions dims using the binary operator op after applying f elementwise. If dims is nothing, reduce src to a scalar. If dims is an integer, reduce src along that dimension. The init value is used as the initial value for the reduction (i.e. after mapping).
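
A quick CPU sketch of the dims semantics; result shapes follow the Base convention described below (the reduced dimension becomes 1):

import AcceleratedKernels as AK

m = collect(reshape(1:12, 3, 4))              # 3×4 matrix
AK.mapreduce(identity, +, m; init=0)          # scalar: 78
AK.mapreduce(identity, +, m; init=0, dims=1)  # 1×4 matrix (dimension 1 collapsed)
AK.mapreduce(identity, +, m; init=0, dims=2)  # 3×1 matrix (dimension 2 collapsed)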

The neutral value is the neutral element of the operator op (e.g. 0 for +, 1 for *, typemin(T) for max), which is needed for an efficient GPU implementation that also allows a nonzero init.
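
As an illustrative sketch, assuming a CUDA device: for op=max the neutral element is the typemin of the element type, while init only seeds the final result, so a nonzero init stays correct:

import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 100_000) .- 0.5f0)  # values in [-0.5, 0.5)
# typemin(Float32) == -Inf32 is the neutral element of max; abs maps into [0, 0.5)
vmaxabs = AK.mapreduce(abs, max, v; init=0.0f0, neutral=typemin(Float32))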

CPU settings

The scheduler can be one of the OhMyThreads.jl schedulers, i.e. :static, :dynamic, :greedy or :serial. Assuming the workload is uniform (as the GPU algorithm prefers), :static is used by default; if you need fine-grained control over your threads, consider using OhMyThreads.jl directly.

Use at most max_tasks threads with at least min_elems elements per task.
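
An illustrative CPU call combining these settings (the particular values are arbitrary, not tuning advice):

import AcceleratedKernels as AK

v = rand(1:100, 1_000_000)
s = AK.mapreduce(
    x -> x * x, +, v;
    init=0,
    scheduler=:dynamic,  # any of the OhMyThreads.jl schedulers listed above
    max_tasks=4,         # use at most 4 parallel tasks
    min_elems=10_000,    # give each task at least 10_000 elements
)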

GPU settings

The block_size parameter controls the number of threads per block.

The temp parameter can be used to pass a pre-allocated temporary array:

  • For reduction to a scalar (dims=nothing), length(temp) >= 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size) is required.
  • For reduction along a dimension (dims is an integer), temp is used as the destination array and must therefore have exactly the required dimensions: the same size as src in every dimension except the reduced one, which becomes 1. There are some corner cases when one dimension is zero; check against Base.reduce for CPU arrays for the exact behavior.
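
A sketch of pre-allocating temp for both cases, assuming a CUDA device; the scalar-reduction length comes directly from the inequality above:

using CUDA
import AcceleratedKernels as AK

src = CUDA.rand(Float32, 1_000_000)
block_size = 256

# Scalar reduction: temp length per the formula above
tlen = 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size)
temp1 = CuArray{Float32}(undef, tlen)
s = AK.mapreduce(abs, +, src; init=0.0f0, block_size=block_size, temp=temp1)

# Dimensionwise reduction: temp is the destination; the reduced dimension becomes 1
m2 = CUDA.rand(Float32, 10, 1000)
temp2 = CuArray{Float32}(undef, 1, 1000)
colsums = AK.mapreduce(abs, +, m2; init=0.0f0, dims=1, temp=temp2)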

The switch_below parameter controls the threshold below which the reduction is performed on the CPU; it is only used for reductions to a scalar (i.e. dims=nothing).
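
A short sketch, again assuming a CUDA device; the threshold of 4096 is arbitrary and purely illustrative:

using CUDA
import AcceleratedKernels as AK

u = CUDA.rand(Float32, 1000)
# with fewer than 4096 elements, the reduction falls back to the CPU
s = AK.mapreduce(abs, +, u; init=0.0f0, switch_below=4096)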

Examples

Computing a sum of squares, reducing down to a scalar that is copied to host:

import AcceleratedKernels as AK
using CUDA

v = CuArray{Int16}(rand(1:1000, 100_000))
vsumsq = AK.mapreduce(x -> x * x, (x, y) -> x + y, v; init=zero(eltype(v)))

Computing dimensionwise sums of squares in a 2D matrix:

import AcceleratedKernels as AK
using Metal

f(x) = x * x
m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
mrowsumsq = AK.mapreduce(f, +, m; init=zero(eltype(m)), dims=1)  # 1×100_000 (dimension 1 collapsed)
mcolsumsq = AK.mapreduce(f, +, m; init=zero(eltype(m)), dims=2)  # 10×1 (dimension 2 collapsed)