Predicates
Apply a predicate to check whether all or any elements of a collection satisfy it. This could be implemented as a reduction, but it is better optimised by stopping the search as soon as a false (for all) or a true (for any) is found.
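As a rough illustration of why early stopping matters, the sketch below counts predicate evaluations for a reduction-style `any` and for a short-circuiting one. It uses only Base Julia on the CPU; the helper names `any_via_reduction` and `any_short_circuit` are hypothetical and are not part of AcceleratedKernels.

```julia
# Reduction-style `any`: the predicate runs on every element.
function any_via_reduction(pred, v)
    calls = 0
    result = mapreduce(|, v; init=false) do x
        calls += 1
        pred(x)
    end
    return (result, calls)
end

# Short-circuiting `any`: stop at the first element that satisfies `pred`.
function any_short_circuit(pred, v)
    calls = 0
    for x in v
        calls += 1
        pred(x) && return (true, calls)
    end
    return (false, calls)
end

v = [0.5, 2.0, 0.1, 0.2]
any_via_reduction(x -> x > 1, v)    # (true, 4): all four elements visited
any_short_circuit(x -> x > 1, v)    # (true, 2): stopped at the second element
```

Both return the same answer; the difference is only in how much work is done before it is known.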
- Other names: not often implemented standalone on GPUs, typically included as part of a reduction.
AcceleratedKernels.any — Function
any(
    pred, v::AbstractArray, backend::Backend=get_backend(v);

    # Algorithm choice
    alg::PredicatesAlgorithm=ConcurrentWrite(),

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
)
Check if any element of v satisfies the predicate pred (i.e. some pred(v[i]) == true). Optimised differently to mapreduce due to the short-circuiting behaviour of booleans.
Other names: not often implemented standalone on GPUs, typically included as part of a reduction.
CPU
Multithreaded parallelisation is only worth it for large arrays, relatively expensive predicates, and/or rare occurrences of true; use max_tasks and min_elems to enable parallelism only when it pays off in your application. When only one thread is needed, there is no overhead.
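A minimal CPU usage sketch of these two settings, assuming AcceleratedKernels is installed and that a plain `Array` resolves to the CPU backend through `get_backend(v)` as in the signature above. The specific values of `max_tasks` and `min_elems` are illustrative, not tuned recommendations.

```julia
import AcceleratedKernels as AK

v = rand(Float32, 1_000_000)    # plain CPU Array, values in [0, 1)

# Restrict the search to a single task.
r_serial = AK.any(x -> x > 2, v; max_tasks=1)

# Allow up to one task per Julia thread; min_elems is set high here so that
# small inputs would not be split across tasks (illustrative value).
r_parallel = AK.any(x -> x > 2, v; max_tasks=Threads.nthreads(), min_elems=100_000)

# Both must agree with Base.any; here no element exceeds 2, so all are false.
r_serial == r_parallel == any(x -> x > 2, v)
```

Whether the parallel call is faster depends on the array size, the cost of the predicate, and how early a true value appears, so it is worth benchmarking on your own data.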
GPU
There are two possible alg choices:
- ConcurrentWrite(): the default algorithm, using concurrent writes to a global flag. We are aware of only one platform (Intel UHD 620 integrated graphics cards) where multiple threads writing to the same memory location, even when writing the same value, hang the device.
- MapReduce(; temp=nothing, switch_below=0): a conservative mapreduce-based implementation which can be used on all platforms, but does not use short-circuiting optimisations. You can set the temp and switch_below keyword arguments, which are forwarded to mapreduce.
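The shared-flag idea behind ConcurrentWrite() can be sketched on the CPU with Base Julia threads. This is only an analogue under stated assumptions, not the library's GPU kernel: the function name `any_flag_sketch` is hypothetical, and it uses an atomic flag for safety on the CPU, whereas the text above describes plain concurrent writes of the same value on the GPU.

```julia
using Base.Threads

# CPU analogue of a shared "found" flag: every task may write `true` to it,
# and every task checks it to stop early once some other task has succeeded.
function any_flag_sketch(pred, v; ntasks=Threads.nthreads())
    isempty(v) && return false
    found = Threads.Atomic{Bool}(false)
    chunks = Iterators.partition(eachindex(v), cld(length(v), ntasks))
    @sync for chunk in chunks
        Threads.@spawn for i in chunk
            found[] && break            # another task already found a match
            if pred(v[i])
                found[] = true          # all writers store the same value
                break
            end
        end
    end
    return found[]
end

any_flag_sketch(x -> x > 0.5, [0.1, 0.9, 0.2])    # true
```

Because every writer stores the same value, the order of writes does not affect the result; a mapreduce-based version reaches the same answer but always visits every element.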
Platform-Specific Notes
On oneAPI, alg=MapReduce() is the default, as on some Intel GPUs concurrent global writes hang the device.
Examples
import AcceleratedKernels as AK
using CUDA
v = CuArray(rand(Float32, 100_000))
AK.any(x -> x < 1, v)
Using a different algorithm:
AK.any(x -> x < 1, v, alg=AK.MapReduce(switch_below=100))
Checking a more complex condition with unmaterialised index ranges:
function complex_any(x, y)
    AK.any(eachindex(x), AK.get_backend(x)) do i
        x[i] < 0 && y[i] > 0
    end
end
complex_any(CuArray(rand(Float32, 100)), CuArray(rand(Float32, 100)))
AcceleratedKernels.all — Function
all(
    pred, v::AbstractArray, backend::Backend=get_backend(v);

    # Algorithm choice
    alg::PredicatesAlgorithm=ConcurrentWrite(),

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,
)
Check if all elements of v satisfy the predicate pred (i.e. all pred(v[i]) == true). Optimised differently to mapreduce due to the short-circuiting behaviour of booleans.
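For `all`, the short circuit goes the other way: the search can stop at the first element that fails the predicate. The Base Julia sketch below (the helper name `all_short_circuit` is hypothetical) shows this, together with the standard identity that `all(pred, v)` equals `!any(!pred, v)`.

```julia
# Short-circuiting `all`: stop at the first element that fails `pred`.
function all_short_circuit(pred, v)
    for x in v
        pred(x) || return false
    end
    return true
end

v = [0.3, -1.0, 0.7]
ispos = x -> x > 0

all_short_circuit(ispos, v)                  # false: stops at the second element
all_short_circuit(ispos, v) == !any(!ispos, v)    # true: De Morgan's identity
```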
Other names: not often implemented standalone on GPUs, typically included as part of a reduction.
CPU
Multithreaded parallelisation is only worth it for large arrays, relatively expensive predicates, and/or rare occurrences of false; use max_tasks and min_elems to enable parallelism only when it pays off in your application. When only one thread is needed, there is no overhead.
GPU
There are two possible alg choices:
- ConcurrentWrite(): the default algorithm, using concurrent writes to a global flag. We are aware of only one platform (Intel UHD 620 integrated graphics cards) where multiple threads writing to the same memory location, even when writing the same value, hang the device.
- MapReduce(; temp=nothing, switch_below=0): a conservative mapreduce-based implementation which can be used on all platforms, but does not use short-circuiting optimisations. You can set the temp and switch_below keyword arguments, which are forwarded to mapreduce.
Platform-Specific Notes
On oneAPI, alg=MapReduce() is the default, as on some Intel GPUs concurrent global writes hang the device.
Examples
import AcceleratedKernels as AK
using Metal
v = MtlArray(rand(Float32, 100_000))
AK.all(x -> x > 0, v)
Using a different algorithm:
AK.all(x -> x > 0, v, alg=AK.MapReduce(switch_below=100))
Checking a more complex condition with unmaterialised index ranges:
function complex_all(x, y)
    AK.all(eachindex(x), AK.get_backend(x)) do i
        x[i] > 0 && y[i] < 0
    end
end
complex_all(MtlArray(rand(Float32, 100)), MtlArray(rand(Float32, 100)))
Note on the cooperative keyword: some older platforms crash when multiple threads write to the same memory location in a global array (e.g. old Intel Graphics). On others, the behaviour is well-defined if all threads write the same value (e.g. CUDA F4.2 says "If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined."). This "cooperative" thread behaviour allows for a faster implementation; if you have a platform that crashes (the only one I know of is Intel UHD Graphics), set cooperative=false to use a safer mapreduce-based implementation.