General Looping: foreachindex / foraxes
General workhorses for converting normal Julia for loops into GPU code, for example:
# Copy kernel testing throughput
function cpu_copy!(dst, src)
    for i in eachindex(src)
        dst[i] = src[i]
    end
end
Would be written for GPU as:
import AcceleratedKernels as AK
function gpu_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end
Yes, simply change for i in eachindex(itr) into AK.foreachindex(itr) do i to run it on GPUs or multithreaded CPUs - magic! (or just amazing language design)
This is a parallelised for-loop over the indices of an iterable; it converts normal Julia code into GPU kernels running one thread per index. On CPUs it executes static index ranges on max_tasks threads, with a user-defined min_elems number of elements to be processed by each thread; if only a single thread ends up being needed, the loop is inlined and executed without spawning threads.
- Other names: Kokkos::parallel_for, RAJA::forall, thrust::transform.
Example:
import AcceleratedKernels as AK
function f(a, b)
    # Don't use global arrays inside a `foreachindex`; types must be known
    @assert length(a) == length(b)
    AK.foreachindex(a) do i
        # Note that we don't have to explicitly pass b into the lambda
        if b[i] > 0.5
            a[i] = 1
        else
            a[i] = 0
        end
        # Showing arbitrary if conditions; can also be written as:
        # @inbounds a[i] = b[i] > 0.5 ? 1 : 0
    end
end
# Use any backend, e.g. CUDA, ROCm, oneAPI, Metal, or CPU
using oneAPI
v1 = oneArray{Float32}(undef, 100_000)
v2 = oneArray(rand(Float32, 100_000))
f(v1, v2)
All GPU functions allow you to specify a block size - this is often a power of two (mostly 64, 128, 256, 512); the optimum depends on the algorithm, input data and hardware - you can try different values and @time or @benchmark them:
@time AK.foreachindex(f, itr_gpu, block_size=512)
Similarly, for performance on the CPU the overhead of spawning threads should be masked by processing more elements per thread (but there is no reason here to launch more threads than Threads.nthreads(), the number of threads Julia was started with); the optimum depends on how expensive f is - again, benchmarking is your friend:
@time AK.foreachindex(f, itr_cpu, max_tasks=16, min_elems=1000)
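Such a tuning sweep can be sketched as below - a minimal illustration only, where the array size and the candidate min_elems values are arbitrary choices; pick the fastest setting for your own function and hardware:

```julia
import AcceleratedKernels as AK

# Hypothetical helper wrapping the loop body so types are known
function scaled_copy!(y, x; min_elems)
    AK.foreachindex(x; min_elems=min_elems) do i
        y[i] = 2 * x[i] + 1
    end
end

x = rand(Float32, 1_000_000)
y = similar(x)

# Time each candidate value and keep the fastest for your machine
for min_elems in (100, 1_000, 10_000)
    @time scaled_copy!(y, x; min_elems=min_elems)
end
```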
AcceleratedKernels.foreachindex — Function

foreachindex(
    f, itr, backend::Backend=get_backend(itr);
    # CPU settings
    scheduler=:threads,
    max_tasks=Threads.nthreads(),
    min_elems=1,
    # GPU settings
    block_size=256,
)
Parallelised for loop over the indices of an iterable.

It allows you to run normal Julia code on a GPU over multiple arrays - e.g. CuArray, ROCArray, MtlArray, oneArray - with one GPU thread per index.

On CPUs at most max_tasks threads are launched, or fewer such that each thread processes at least min_elems indices; if a single task ends up being needed, f is inlined and no thread is launched. Tune it to your function - the more expensive it is, the fewer elements are needed to amortise the cost of launching a thread (which is a few μs). The scheduler can be :polyester to use Polyester.jl cheap threads or :threads to use normal Julia threads; either can be faster depending on the function, but in general the latter is more composable.
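As a minimal sketch, the scheduler is just a keyword argument - the hypothetical helper below runs the same loop body under each scheduler so you can benchmark both on your function (assuming the :polyester scheduler is available in your install):

```julia
import AcceleratedKernels as AK

# Hypothetical helper: double every element, with a selectable scheduler
function double!(v; scheduler)
    AK.foreachindex(v; scheduler=scheduler) do i
        v[i] *= 2
    end
end

v = collect(1.0:100.0)
double!(v; scheduler=:threads)     # composable Julia tasks
double!(v; scheduler=:polyester)   # cheap Polyester.jl threads
```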
Examples
Normally you would write a for loop like this:
function f()
    x = Array(1:100)
    y = similar(x)
    for i in eachindex(x)
        @inbounds y[i] = 2 * x[i] + 1
    end
end
Using this function you can have the same for loop body over a GPU array:
using CUDA
import AcceleratedKernels as AK
function f()
    x = CuArray(1:100)
    y = similar(x)
    AK.foreachindex(x) do i
        @inbounds y[i] = 2 * x[i] + 1
    end
end
Note that the above code is pure arithmetic, which you can write directly (and on some platforms it may be faster) as:
using CUDA
x = CuArray(1:100)
y = 2 .* x .+ 1
Important note: to use this function on a GPU, the objects referenced inside the loop body must have known types - i.e. be inside a function. For example:
using oneAPI
import AcceleratedKernels as AK
x = oneArray(1:100)
# CRASHES - typical error message: "Reason: unsupported dynamic function invocation"
# AK.foreachindex(x) do i
# x[i] = i
# end
function somecopy!(v)
    # Because it is inside a function, the type of `v` will be known
    AK.foreachindex(v) do i
        v[i] = i
    end
end
somecopy!(x) # This works
AcceleratedKernels.foraxes — Function

foraxes(
    f, itr, dims::Union{Nothing, <:Integer}=nothing, backend::Backend=get_backend(itr);
    # CPU settings
    scheduler=:threads,
    max_tasks=Threads.nthreads(),
    min_elems=1,
    # GPU settings
    block_size=256,
)
Parallelised for loop over the indices along axis dims of an iterable.

It allows you to run normal Julia code on a GPU over multiple arrays - e.g. CuArray, ROCArray, MtlArray, oneArray - with one GPU thread per index.

On CPUs at most max_tasks threads are launched, or fewer such that each thread processes at least min_elems indices; if a single task ends up being needed, f is inlined and no thread is launched. Tune it to your function - the more expensive it is, the fewer elements are needed to amortise the cost of launching a thread (which is a few μs). The scheduler can be :polyester to use Polyester.jl cheap threads or :threads to use normal Julia threads; either can be faster depending on the function, but in general the latter is more composable.
Examples
Normally you would write a for loop like this:
function f()
    x = Array(reshape(1:30, 3, 10))
    y = similar(x)
    for i in axes(x, 2)
        for j in axes(x, 1)
            @inbounds y[j, i] = 2 * x[j, i] + 1
        end
    end
end
Using this function you can have the same for loop body over a GPU array:
using CUDA
import AcceleratedKernels as AK
function f()
    x = CuArray(reshape(1:3000, 3, 1000))
    y = similar(x)
    AK.foraxes(x, 2) do i
        for j in axes(x, 1)
            @inbounds y[j, i] = 2 * x[j, i] + 1
        end
    end
end
Important note: to use this function on a GPU, the objects referenced inside the loop body must have known types - i.e. be inside a function. For example:
using oneAPI
import AcceleratedKernels as AK
x = oneArray(reshape(1:3000, 3, 1000))
# CRASHES - typical error message: "Reason: unsupported dynamic function invocation"
# AK.foraxes(x) do i
# x[i] = i
# end
function somecopy!(v)
    # Because it is inside a function, the type of `v` will be known
    AK.foraxes(v) do i
        v[i] = i
    end
end
somecopy!(x) # This works