MapReduce

Equivalent to reduce(op, map(f, iterable)), without saving the intermediate mapped collection; can be used to e.g. split documents into words (map) and count the frequency thereof (reduce).

Other names: transform_reduce, some fold implementations include the mapping function too.

AcceleratedKernels.mapreduce — Function

mapreduce(
    f, op, src::AbstractArray, backend::Backend=get_backend(src);
    init,
    neutral=neutral_element(op, typeof(init)),
    dims=nothing,

    # CPU settings
    max_tasks::Int=Threads.nthreads(),
    min_elems::Int=1,

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    switch_below::Int=0,
)

mapreduce(f, op, A::AbstractArray, B::AbstractArray, As::AbstractArray...; init, kwargs...)
mapreduce(f, op, A::AbstractArray, B::AbstractArray, As::AbstractArray..., backend::Backend; init, kwargs...)

Reduce src along dimensions dims using the binary operator op after applying f elementwise. If dims is nothing or :, reduce src to a scalar. If dims is an integer or a collection of integers, reduce src along those dimension(s). The init value is used as the initial value for the reduction (i.e. after mapping).

The neutral value is the neutral element for the operator op, which is needed for an efficient GPU implementation that also allows a nonzero init.

The returned type is the same as init - to control output precision, specify init explicitly.

Multiple input arrays are supported with the same axes. This follows Base.mapreduce(f, op, A, B, ...) semantics: f is mapped across corresponding elements of the inputs and the mapped values are reduced without materializing the intermediate array. Mismatched axes throw DimensionMismatch; singleton-expanding broadcast semantics are reserved for internal Broadcasted sources used by array backends.

CPU settings

Use at most max_tasks threads with at least min_elems elements per task. For N-dimensional arrays (dims is an integer or a collection of integers) multithreading currently only becomes faster for max_tasks >= 4; all other cases are scaling linearly with the number of threads.

GPU settings

The block_size parameter controls the number of threads per block and must be a power of two.

The temp parameter can be used to pass a pre-allocated temporary array. For reduction to a scalar (dims=nothing or dims=:), length(temp) >= 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size) is required. For reduction along dimensions (dims is an integer or a collection of integers), temp is used as the destination array, and thus must have the exact dimensions required - i.e. same dimensionwise sizes as src, except for the reduced dimension(s) which become 1; there are some corner cases when one dimension is zero, check against Base.reduce for CPU arrays for exact behavior.

The switch_below parameter controls the threshold below which the reduction is performed on the CPU and is only used for 1D reductions (i.e. dims=nothing).

Example

Computing a sum of squares, reducing down to a scalar that is copied to host:

import AcceleratedKernels as AK
using CUDA

v = CuArray{Int16}(rand(1:1000, 100_000))
vsumsq = AK.mapreduce(x -> x * x, (x, y) -> x + y, v; init=zero(eltype(v)))

Computing dimensionwise sums of squares in a 2D matrix:

import AcceleratedKernels as AK
using Metal

f(x) = x * x
m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
mrowsumsq = AK.mapreduce(f, +, m; init=zero(eltype(m)), dims=1)
mcolsumsq = AK.mapreduce(f, +, m; init=zero(eltype(m)), dims=2)

Computing a two-input dimensional reduction:

rows = AK.mapreduce((x, y) -> x * y, +, a, b; init=0f0, dims=1)

An explicit backend may be passed after all input arrays:

rows = AK.mapreduce((x, y) -> x * y, +, a, b, backend; init=0f0, dims=1)

source