General Looping: foreachindex / foraxes
General workhorses for converting normal Julia for loops into GPU code, for example:
# Copy kernel testing throughput
function cpu_copy!(dst, src)
    for i in eachindex(src)
        dst[i] = src[i]
    end
end

This would be written for the GPU as:
import AcceleratedKernels as AK

function gpu_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end

Yes, simply change for i in eachindex(itr) into AK.foreachindex(itr) do i and it runs on GPUs or multithreaded CPUs - magic! (or just amazing language design)
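For instance, because the backend is taken from the array type (see the signature further down), the same gpu_copy! also runs multithreaded on plain CPU Arrays - a small usage sketch with illustrative data:

# Usage sketch (illustrative data): the gpu_copy! defined above, called on
# ordinary CPU Arrays; AK picks the CPU backend from the array type and
# multithreads the loop across the threads Julia was started with.
src = rand(Float32, 1_000_000)
dst = similar(src)
gpu_copy!(dst, src)
@assert dst == src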
This is a parallelised for-loop over the indices of an iterable; it converts normal Julia code into GPU kernels running one thread per index. On CPUs it executes static index ranges on max_tasks threads, with a user-defined minimum number of elements (min_elems) processed by each thread; if only a single thread ends up being needed, the loop is inlined and executed without spawning threads.
- Other names: Kokkos::parallel_for, RAJA::forall, thrust::transform.
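To picture that CPU behaviour, here is a minimal sketch of static index partitioning with Threads.@spawn - an illustration of the idea only, not AcceleratedKernels' actual implementation, and it assumes 1-based linear indexing:

# Illustration only - not AK's implementation. Split 1:length(itr) into at most
# max_tasks static chunks of at least min_elems indices each; if a single chunk
# suffices, run the loop inline without spawning any task.
function foreachindex_cpu_sketch(f, itr; max_tasks=Threads.nthreads(), min_elems=1)
    n = length(itr)
    ntasks = clamp(n ÷ max(min_elems, 1), 1, max_tasks)
    if ntasks == 1
        for i in eachindex(itr)    # inlined, no task spawned
            f(i)
        end
        return nothing
    end
    chunk = cld(n, ntasks)
    tasks = map(1:ntasks) do t
        istart = (t - 1) * chunk + 1
        istop = min(t * chunk, n)
        Threads.@spawn for i in istart:istop    # static index range per task
            f(i)
        end
    end
    foreach(wait, tasks)
    return nothing
end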
Example:
import AcceleratedKernels as AK
function f(a, b)
    # Don't use global arrays inside a `foreachindex`; types must be known
    @assert length(a) == length(b)
    AK.foreachindex(a) do i
        # Note that we don't have to explicitly pass b into the lambda
        if b[i] > 0.5
            a[i] = 1
        else
            a[i] = 0
        end

        # Showing arbitrary if conditions; can also be written as:
        # @inbounds a[i] = b[i] > 0.5 ? 1 : 0
    end
end
# Use any backend, e.g. CUDA, ROCm, oneAPI, Metal, or CPU
using oneAPI
v1 = oneArray{Float32}(undef, 100_000)
v2 = oneArray(rand(Float32, 100_000))
f(v1, v2)

All GPU functions allow you to specify a block size - this is often a power of two (mostly 64, 128, 256, 512); the optimum depends on the algorithm, input data and hardware - you can try different values and @time or @benchmark them:
@time AK.foreachindex(f, itr_gpu, block_size=512)
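For instance, a hedged sketch of such a sweep (the threshold! wrapper, CUDA arrays and sizes are illustrative, not from the AK docs), using BenchmarkTools and CUDA.@sync so the asynchronous kernel launches are fully timed:

using CUDA
using BenchmarkTools
import AcceleratedKernels as AK

# Illustrative helper wrapping the thresholding kernel from the example above
function threshold!(a, b; block_size)
    AK.foreachindex(a; block_size=block_size) do i
        a[i] = b[i] > 0.5 ? 1 : 0
    end
end

a = CuArray{Float32}(undef, 100_000)
b = CUDA.rand(Float32, 100_000)
for bs in (64, 128, 256, 512)
    # CUDA.@sync waits for the kernel, so launch + execution time is measured
    t = @belapsed CUDA.@sync threshold!($a, $b; block_size=$bs)
    println("block_size = $bs: ", round(t * 1e6, digits=1), " μs")
end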
Similarly, for performance on the CPU, the overhead of spawning threads should be masked by each thread processing more elements (though there is no reason to launch more threads than Threads.nthreads(), the number of threads Julia was started with); the optimum depends on how expensive f is - again, benchmarking is your friend:

@time AK.foreachindex(f, itr_cpu, max_tasks=16, min_elems=1000)

AcceleratedKernels.foreachindex — Function

foreachindex(
    f, itr, backend::Backend=get_backend(itr);
    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,
    # GPU settings
    block_size=256,
)

Parallelised for loop over the indices of an iterable.
It allows you to run normal Julia code on a GPU over multiple arrays - e.g. CuArray, ROCArray, MtlArray, oneArray - with one GPU thread per index.
On CPUs at most max_tasks threads are launched, or fewer such that each thread processes at least min_elems indices; if a single task ends up being needed, f is inlined and no thread is launched. Tune it to your function - the more expensive it is, the fewer elements are needed to amortise the cost of launching a thread (which is a few μs).
Examples
Normally you would write a for loop like this:
function f()
    x = Array(1:100)
    y = similar(x)
    for i in eachindex(x)
        @inbounds y[i] = 2 * x[i] + 1
    end
end

Using this function you can have the same for loop body over a GPU array:
using CUDA
import AcceleratedKernels as AK
function f()
    x = CuArray(1:100)
    y = similar(x)
    AK.foreachindex(x) do i
        @inbounds y[i] = 2 * x[i] + 1
    end
end

Note that the above code is pure arithmetic, which you can write directly (and on some platforms it may be faster) as:
using CUDA
x = CuArray(1:100)
y = 2 .* x .+ 1

Important note: to use this function on a GPU, the objects referenced inside the loop body must have known types - i.e. be inside a function. For example:
using oneAPI
import AcceleratedKernels as AK
x = oneArray(1:100)
# CRASHES - typical error message: "Reason: unsupported dynamic function invocation"
# AK.foreachindex(x) do i
# x[i] = i
# end
function somecopy!(v)
    # Because it is inside a function, the type of `v` will be known
    AK.foreachindex(v) do i
        v[i] = i
    end
end

somecopy!(x) # This works

AcceleratedKernels.foraxes — Function

foraxes(
    f, itr, dims::Union{Nothing, <:Integer}=nothing, backend::Backend=get_backend(itr);
    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,
    # GPU settings
    block_size=256,
)

Parallelised for loop over the indices along axis dims of an iterable.
It allows you to run normal Julia code on a GPU over multiple arrays - e.g. CuArray, ROCArray, MtlArray, oneArray - with one GPU thread per index.
On CPUs at most max_tasks threads are launched, or fewer such that each thread processes at least min_elems indices; if a single task ends up being needed, f is inlined and no thread is launched. Tune it to your function - the more expensive it is, the fewer elements are needed to amortise the cost of launching a thread (which is a few μs).
Examples
Normally you would write a for loop like this:
function f()
    x = Array(reshape(1:30, 3, 10))
    y = similar(x)
    for i in axes(x, 2)
        for j in axes(x, 1)
            @inbounds y[j, i] = 2 * x[j, i] + 1
        end
    end
end

Using this function you can have the same for loop body over a GPU array:
using CUDA
import AcceleratedKernels as AK
function f()
    x = CuArray(reshape(1:3000, 3, 1000))
    y = similar(x)
    AK.foraxes(x, 2) do i
        for j in axes(x, 1)
            @inbounds y[j, i] = 2 * x[j, i] + 1
        end
    end
end

Important note: to use this function on a GPU, the objects referenced inside the loop body must have known types - i.e. be inside a function. For example:
using oneAPI
import AcceleratedKernels as AK
x = oneArray(reshape(1:3000, 3, 1000))
# CRASHES - typical error message: "Reason: unsupported dynamic function invocation"
# AK.foraxes(x) do i
# x[i] = i
# end
function somecopy!(v)
    # Because it is inside a function, the type of `v` will be known
    AK.foraxes(v) do i
        v[i] = i
    end
end
somecopy!(x) # This works
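Finally, a hedged sketch of a typical foraxes pattern beyond copying (the colsums! name and data are illustrative, not from the AK docs): one thread per column along axis 2, each thread serially summing its own column:

using CUDA
import AcceleratedKernels as AK

# Illustrative example: one GPU thread per column (axis 2), each summing its column
function colsums!(out, m)
    @assert length(out) == size(m, 2)
    AK.foraxes(m, 2) do j
        s = zero(eltype(m))
        for i in axes(m, 1)
            @inbounds s += m[i, j]
        end
        @inbounds out[j] = s
    end
end

m = CuArray(rand(Float32, 3, 1000))
out = CuArray{Float32}(undef, 1000)
colsums!(out, m)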