MapReduce

Equivalent to reduce(op, map(f, iterable)), but without materialising the intermediate mapped collection; it can be used, for example, to split documents into words (map) and count their frequencies (reduce), as sketched below.

  • Other names: transform_reduce; some fold implementations include the mapping function too.
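
For instance, the word-count use case above can be sketched in plain Base Julia (no AcceleratedKernels needed), mapping each word to a singleton count and merging the counts in the reduction:

text = "the quick brown fox jumps over the lazy dog the fox"
counts = mapreduce(
    w -> Dict(w => 1),             # map: each word becomes a one-entry count
    (a, b) -> mergewith(+, a, b),  # reduce: merge dictionaries, adding counts
    split(text),
)
counts["the"]  # 3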

AcceleratedKernels.mapreduce — Function
mapreduce(
    f, op, src::AbstractArray, backend::Backend=get_backend(src);
    init,
    neutral=GPUArrays.neutral_element(op, eltype(src)),
    dims::Union{Nothing, Int}=nothing,

    # CPU settings
    scheduler=:static,
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
    switch_below::Int=0,
)

Reduce src along dimensions dims using the binary operator op after applying f elementwise. If dims is nothing, reduce src to a scalar. If dims is an integer, reduce src along that dimension. The init value is used as the initial value for the reduction (i.e. after mapping).
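
A quick CPU sketch of the dims semantics; result shapes follow the Base convention described below (the reduced dimension becomes 1):

import AcceleratedKernels as AK

m = collect(reshape(1:12, 3, 4))              # 3×4 matrix
AK.mapreduce(identity, +, m; init=0)          # scalar: 78
AK.mapreduce(identity, +, m; init=0, dims=1)  # 1×4 matrix (dimension 1 collapsed)
AK.mapreduce(identity, +, m; init=0, dims=2)  # 3×1 matrix (dimension 2 collapsed)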

The neutral value is the neutral element of the operator op (e.g. 0 for +, 1 for *, typemin(T) for max), which is needed for an efficient GPU implementation that also allows a nonzero init.
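
As an illustrative sketch, assuming a CUDA device: for op=max the neutral element is the typemin of the element type, while init only seeds the final result, so a nonzero init stays correct:

import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 100_000) .- 0.5f0)  # values in [-0.5, 0.5)
# typemin(Float32) == -Inf32 is the neutral element of max; abs maps into [0, 0.5)
vmaxabs = AK.mapreduce(abs, max, v; init=0.0f0, neutral=typemin(Float32))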

CPU settings

The scheduler can be one of the OhMyThreads.jl schedulers, i.e. :static, :dynamic, :greedy or :serial. Assuming the workload is uniform (as the GPU algorithm prefers), :static is used by default; if you need fine-grained control over your threads, consider using OhMyThreads.jl directly.

Use at most max_tasks threads with at least min_elems elements per task.
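
An illustrative CPU call combining these settings (the particular values are arbitrary, not tuning advice):

import AcceleratedKernels as AK

v = rand(1:100, 1_000_000)
s = AK.mapreduce(
    x -> x * x, +, v;
    init=0,
    scheduler=:dynamic,  # any of the OhMyThreads.jl schedulers listed above
    max_tasks=4,         # use at most 4 parallel tasks
    min_elems=10_000,    # give each task at least 10_000 elements
)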

GPU settings

The block_size parameter controls the number of threads per block.

The temp parameter can be used to pass a pre-allocated temporary array:

  • For reduction to a scalar (dims=nothing), length(temp) >= 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size) is required.
  • For reduction along a dimension (dims is an integer), temp is used as the destination array and must therefore have exactly the required dimensions: the same size as src in every dimension except the reduced one, which becomes 1. There are some corner cases when one dimension is zero; check against Base.reduce for CPU arrays for the exact behavior.
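
A sketch of pre-allocating temp for both cases, assuming a CUDA device; the scalar-reduction length comes directly from the inequality above:

using CUDA
import AcceleratedKernels as AK

src = CUDA.rand(Float32, 1_000_000)
block_size = 256

# Scalar reduction: temp length per the formula above
tlen = 2 * (length(src) + 2 * block_size - 1) ÷ (2 * block_size)
temp1 = CuArray{Float32}(undef, tlen)
s = AK.mapreduce(abs, +, src; init=0.0f0, block_size=block_size, temp=temp1)

# Dimensionwise reduction: temp is the destination; the reduced dimension becomes 1
m2 = CUDA.rand(Float32, 10, 1000)
temp2 = CuArray{Float32}(undef, 1, 1000)
colsums = AK.mapreduce(abs, +, m2; init=0.0f0, dims=1, temp=temp2)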

The switch_below parameter controls the threshold below which the reduction is performed on the CPU; it is only used for reductions to a scalar (i.e. dims=nothing).
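
A short sketch, again assuming a CUDA device; the threshold of 4096 is arbitrary and purely illustrative:

using CUDA
import AcceleratedKernels as AK

u = CUDA.rand(Float32, 1000)
# with fewer than 4096 elements, the reduction falls back to the CPU
s = AK.mapreduce(abs, +, u; init=0.0f0, switch_below=4096)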

Examples

Computing a sum of squares, reducing down to a scalar that is copied to host:

import AcceleratedKernels as AK
using CUDA

v = CuArray{Int16}(rand(1:1000, 100_000))
vsumsq = AK.mapreduce(x -> x * x, (x, y) -> x + y, v; init=zero(eltype(v)))

Computing dimensionwise sums of squares in a 2D matrix:

import AcceleratedKernels as AK
using Metal

f(x) = x * x
m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
mrowsumsq = AK.mapreduce(f, +, m; init=zero(eltype(m)), dims=1)  # 1×100_000 (dimension 1 collapsed)
mcolsumsq = AK.mapreduce(f, +, m; init=zero(eltype(m)), dims=2)  # 10×1 (dimension 2 collapsed)