Predicates

Apply a predicate to check whether all or any elements of a collection return true. This could be implemented as a reduction, but it is better optimised by stopping the search as soon as a false (for all) or a true (for any) is found.
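To make the difference concrete, here is a minimal serial sketch using only Base Julia. It is not the library's implementation; any_by_reduction, any_short_circuit and counted are hypothetical helper names introduced for illustration. It counts how often the predicate is evaluated under each strategy.

```julia
# Reduction-style: the predicate is evaluated on every element.
any_by_reduction(pred, v) = mapreduce(pred, |, v; init=false)

# Short-circuit style: stop at the first element that satisfies the predicate.
function any_short_circuit(pred, v)
    for x in v
        pred(x) && return true
    end
    return false
end

# Hypothetical helper: wrap a predicate so that calls to it are counted.
function counted(pred)
    calls = Ref(0)
    wrapped = x -> (calls[] += 1; pred(x))
    return wrapped, calls
end

v = collect(1:1000)

p1, calls_reduction = counted(x -> x > 3)
found_reduction = any_by_reduction(p1, v)

p2, calls_short = counted(x -> x > 3)
found_short = any_short_circuit(p2, v)

println("reduction evaluated the predicate ", calls_reduction[], " times")
println("short-circuit evaluated the predicate ", calls_short[], " times")
```

Both strategies return the same answer; the short-circuit version simply stops evaluating the predicate once the outcome is known.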

Other names: not often implemented standalone on GPUs, typically included as part of a reduction.

AcceleratedKernels.any — Function

any(
    pred, v::AbstractArray, backend::Backend=get_backend(v);

    # Algorithm choice
    alg::PredicatesAlgorithm=ConcurrentWrite(),

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
)

Check if any element of v satisfies the predicate pred (i.e. some pred(v[i]) == true). Optimised differently from mapreduce because boolean evaluation can short-circuit.

Other names: not often implemented standalone on GPUs, typically included as part of a reduction.

CPU

Multithreaded parallelisation is only worth it for large arrays, relatively expensive predicates, and/or rare occurrence of true; use max_tasks and min_elems to only use parallelism when worth it in your application. When only one thread is needed, there is no overhead.
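As a rough illustration of how max_tasks and min_elems could gate parallelism, the following Base-only sketch splits the work into tasks only when each task would receive at least min_elems elements. It is an assumption about the general technique, not the library's code, and threaded_any_sketch is a hypothetical name.

```julia
using Base.Threads: @spawn, Atomic

# Hypothetical helper, not the AcceleratedKernels implementation.
function threaded_any_sketch(pred, v::AbstractVector;
                             max_tasks=Threads.nthreads(), min_elems=1)
    n = length(v)

    # Use at most max_tasks tasks, each handling at least min_elems elements.
    num_tasks = clamp(n ÷ max(min_elems, 1), 1, max_tasks)

    # Serial path: no tasks are spawned, so there is no threading overhead.
    num_tasks == 1 && return any(pred, v)

    found = Atomic{Bool}(false)
    chunk = cld(n, num_tasks)
    @sync for t in 1:num_tasks
        lo = firstindex(v) + (t - 1) * chunk
        hi = min(lo + chunk - 1, lastindex(v))
        @spawn for i in lo:hi
            found[] && break          # another task already found a match
            if pred(v[i])
                found[] = true
                break
            end
        end
    end
    return found[]
end

v = rand(Float32, 1_000_000)

# Small min_elems: work is split across up to Threads.nthreads() tasks.
threaded_any_sketch(x -> x > 0.999f0, v; min_elems=10_000)

# min_elems larger than the array: falls back to the serial path.
threaded_any_sketch(x -> x > 0.999f0, v; min_elems=10_000_000)
```

Raising min_elems is the simplest way to keep small inputs on the serial path, where spawning tasks would cost more than the scan itself.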

GPU

There are two possible alg choices:

ConcurrentWrite(): the default algorithm, which uses concurrent writes to a global flag. We are aware of only one platform (Intel UHD 620 integrated graphics) where multiple threads writing to the same memory location, even when writing the same value, hang the device.
MapReduce(; temp=nothing, switch_below=0): a conservative mapreduce-based implementation which can be used on all platforms, but does not use short-circuiting optimisations. You can set the temp and switch_below keyword arguments, which are forwarded to mapreduce.
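The two strategies can be sketched with KernelAbstractions as follows. This is a simplified illustration under stated assumptions, not the library's kernels: any_flag_kernel!, any_concurrent_write_sketch and any_mapreduce_sketch are hypothetical names, and the flag kernel below does not attempt to stop early once the flag is set. It runs on plain CPU arrays as well as on GPU arrays.

```julia
using KernelAbstractions

# Concurrent-write idea: every work-item whose element satisfies pred
# writes the same value to the single global flag.
@kernel function any_flag_kernel!(flag, pred, v)
    i = @index(Global, Linear)
    if pred(v[i])
        flag[1] = 0x01
    end
end

# Hypothetical wrapper around the kernel above.
function any_concurrent_write_sketch(pred, v; block_size=256)
    isempty(v) && return false
    backend = get_backend(v)
    flag = KernelAbstractions.zeros(backend, UInt8, 1)
    any_flag_kernel!(backend, block_size)(flag, pred, v; ndrange=length(v))
    KernelAbstractions.synchronize(backend)
    return Array(flag)[1] != 0x00      # copy the flag back to the host
end

# Conservative alternative: a plain reduction over all elements,
# with no concurrent writes to a shared location.
any_mapreduce_sketch(pred, v) = mapreduce(pred, |, v; init=false)

v = rand(Float32, 10_000)
any_concurrent_write_sketch(x -> x < 1, v)
any_mapreduce_sketch(x -> x < 1, v)
```

The flag-based version relies on the platform tolerating several work-items writing the same value to one location; the reduction-based version avoids that pattern entirely.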

Platform-Specific Notes

On oneAPI, alg=MapReduce() is the default as on some Intel GPUs concurrent global writes hang the device.

Examples

import AcceleratedKernels as AK
using CUDA

v = CuArray(rand(Float32, 100_000))
AK.any(x -> x < 1, v)

Using a different algorithm:

AK.any(x -> x < 1, v, alg=AK.MapReduce(switch_below=100))

Checking a more complex condition with unmaterialised index ranges:

function complex_any(x, y)
    AK.any(eachindex(x), AK.get_backend(x)) do i
        x[i] < 0 && y[i] > 0
    end
end

complex_any(CuArray(rand(Float32, 100)), CuArray(rand(Float32, 100)))

AcceleratedKernels.all — Function

all(
    pred, v::AbstractArray, backend::Backend=get_backend(v);

    # Algorithm choice
    alg::PredicatesAlgorithm=ConcurrentWrite(),

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
)

Check if all elements of v satisfy the predicate pred (i.e. all pred(v[i]) == true). Optimised differently from mapreduce because boolean evaluation can short-circuit.

Other names: not often implemented standalone on GPUs, typically included as part of a reduction.

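CPU

Multithreaded parallelisation is only worth it for large arrays, relatively expensive predicates, and/or rare occurrence of false; use max_tasks and min_elems to only use parallelism when worth it in your application. When only one thread is needed, there is no overhead.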

GPU

There are two possible alg choices:

ConcurrentWrite(): the default algorithm, which uses concurrent writes to a global flag. We are aware of only one platform (Intel UHD 620 integrated graphics) where multiple threads writing to the same memory location, even when writing the same value, hang the device.
MapReduce(; temp=nothing, switch_below=0): a conservative mapreduce-based implementation which can be used on all platforms, but does not use short-circuiting optimisations. You can set the temp and switch_below keyword arguments, which are forwarded to mapreduce.

Platform-Specific Notes

On oneAPI, alg=MapReduce() is the default as on some Intel GPUs concurrent global writes hang the device.

Examples

import AcceleratedKernels as AK
using Metal

v = MtlArray(rand(Float32, 100_000))
AK.all(x -> x > 0, v)

Using a different algorithm:

AK.all(x -> x > 0, v, alg=AK.MapReduce(switch_below=100))

Checking a more complex condition with unmaterialised index ranges:

function complex_all(x, y)
    AK.all(eachindex(x), AK.get_backend(x)) do i
        x[i] > 0 && y[i] < 0
    end
end

complex_all(MtlArray(rand(Float32, 100)), MtlArray(rand(Float32, 100)))

Note on the cooperative keyword: some older platforms crash when multiple threads write to the same memory location in a global array (e.g. old Intel Graphics). On other platforms the behaviour is well-defined when all threads write the same value; for example, section F.4.2 of the CUDA programming guide says: "If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined." This "cooperative" thread behaviour allows for a faster implementation. If your platform crashes (the only one I know of is Intel UHD Graphics), set cooperative=false to use a safer mapreduce-based implementation. The signatures above do not list a cooperative keyword; there, the mapreduce-based fallback is selected with alg=MapReduce().