General Looping: foreachindex / foraxes

General workhorses for converting normal Julia `for` loops into GPU or multithreaded CPU code. For example, this CPU copy kernel:

# Copy kernel testing throughput
function cpu_copy!(dst, src)
    for i in eachindex(src)
        dst[i] = src[i]
    end
end

This would be written for the GPU as:

import AcceleratedKernels as AK

function gpu_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end

Yes, simply change `for i in eachindex(itr)` into `AK.foreachindex(itr) do i` to run it on GPUs or multithreaded CPUs - magic! (or just amazing language design)

This is a parallelised for-loop over the indices of an iterable; it converts normal Julia code into GPU kernels running one thread per index. On CPUs it executes static index ranges on at most `max_tasks` threads, with a user-defined minimum of `min_elems` indices processed by each thread; if only a single thread ends up being needed, the loop is inlined and executed without spawning threads.

  • Other names: `Kokkos::parallel_for`, `RAJA::forall`, `thrust::transform`.

Example:

import AcceleratedKernels as AK

function f(a, b)
    # Don't use global arrays inside a `foreachindex`; types must be known
    @assert length(a) == length(b)
    AK.foreachindex(a) do i
        # Note that we don't have to explicitly pass b into the lambda
        if b[i] > 0.5
            a[i] = 1
        else
            a[i] = 0
        end

        # Showing arbitrary if conditions; can also be written as:
        # @inbounds a[i] = b[i] > 0.5 ? 1 : 0
    end
end

# Use any backend, e.g. CUDA, ROCm, oneAPI, Metal, or CPU
using oneAPI
v1 = oneArray{Float32}(undef, 100_000)
v2 = oneArray(rand(Float32, 100_000))
f(v1, v2)

All GPU functions allow you to specify a block size - this is often a power of two (mostly 64, 128, 256 or 512); the optimum depends on the algorithm, input data and hardware - you can try different values and `@time` or `@benchmark` them:

@time AK.foreachindex(f, itr_gpu, block_size=512)
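
A hedged sketch of such a sweep - the kernel `f`, the GPU array `itr_gpu` and the candidate block sizes below are placeholders, and the first call for each block size also includes compilation, so time a second run:

import AcceleratedKernels as AK

# Illustrative sweep over block sizes; `f` takes an index, `itr_gpu` is any GPU array
function sweep_block_sizes(f, itr_gpu)
    for bs in (64, 128, 256, 512)
        print("block_size = $bs:")
        @time AK.foreachindex(f, itr_gpu, block_size=bs)
    end
end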

Similarly, for performance on the CPU the overhead of spawning threads should be masked by processing more elements per thread (though there is no reason to launch more threads than `Threads.nthreads()`, the number of threads Julia was started with); the optimum depends on how expensive `f` is - again, benchmarking is your friend:

@time AK.foreachindex(f, itr_cpu, max_tasks=16, min_elems=1000)
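
An equally hedged sketch for the CPU side, sweeping `min_elems` for a given kernel (the values are placeholders; larger `min_elems` means fewer, longer-lived tasks):

import AcceleratedKernels as AK

# Illustrative CPU sweep; `f` takes an index, `itr_cpu` is a regular Array
function sweep_min_elems(f, itr_cpu)
    for me in (100, 1_000, 10_000, 100_000)
        print("min_elems = $me:")
        @time AK.foreachindex(f, itr_cpu, max_tasks=Threads.nthreads(), min_elems=me)
    end
end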

AcceleratedKernels.foreachindex - Function

foreachindex(
    f, itr, backend::Backend=get_backend(itr);

    # CPU settings
    scheduler=:threads,
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size=256,
)

Parallelised for loop over the indices of an iterable.

It allows you to run normal Julia code on a GPU over multiple arrays - e.g. CuArray, ROCArray, MtlArray, oneArray - with one GPU thread per index.

On CPUs at most max_tasks threads are launched, or fewer such that each thread processes at least min_elems indices; if a single task ends up being needed, f is inlined and no thread is launched. Tune it to your function - the more expensive it is, the fewer elements are needed to amortise the cost of launching a thread (which is a few μs). The scheduler can be :polyester to use Polyester.jl cheap threads or :threads to use normal Julia threads; either can be faster depending on the function, but in general the latter is more composable.
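
As a minimal sketch of switching schedulers on a regular Array (the loop body and array size are illustrative; either scheduler may win depending on how cheap the per-element work is):

import AcceleratedKernels as AK

function fill_squares!(v; scheduler=:threads)
    # Same loop body; only the CPU scheduler changes
    AK.foreachindex(v, scheduler=scheduler) do i
        v[i] = i * i
    end
end

v = zeros(Int, 1_000_000)
fill_squares!(v)                          # composable Julia threads
fill_squares!(v, scheduler=:polyester)    # cheaper Polyester.jl tasks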

Examples

Normally you would write a for loop like this:

function f()
    x = Array(1:100)
    y = similar(x)
    for i in eachindex(x)
        @inbounds y[i] = 2 * x[i] + 1
    end
end

Using this function you can have the same for loop body over a GPU array:

using CUDA
import AcceleratedKernels as AK

function f()
    x = CuArray(1:100)
    y = similar(x)
    AK.foreachindex(x) do i
        @inbounds y[i] = 2 * x[i] + 1
    end
end

Note that the above code is pure arithmetic, which you can write directly (and on some platforms it may be faster) as:

using CUDA
x = CuArray(1:100)
y = 2 .* x .+ 1
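
If you want to check which is faster on your platform, a rough benchmarking sketch (assuming BenchmarkTools.jl is installed; the array size and the `ak_version!` helper are illustrative, and `CUDA.@sync` ensures the asynchronous broadcast is fully timed):

using CUDA, BenchmarkTools
import AcceleratedKernels as AK

x = CUDA.rand(Float32, 1_000_000)
y = similar(x)

# The foreachindex formulation, wrapped in a function so all captured types are known
ak_version!(y, x) = AK.foreachindex(i -> (@inbounds y[i] = 2 * x[i] + 1), x)

@btime CUDA.@sync ak_version!($y, $x)
@btime CUDA.@sync ($y .= 2 .* $x .+ 1)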

Important note: to use this function on a GPU, the objects referenced inside the loop body must have known types - i.e. be inside a function. For example:

using oneAPI
import AcceleratedKernels as AK

x = oneArray(1:100)

# CRASHES - typical error message: "Reason: unsupported dynamic function invocation"
# AK.foreachindex(x) do i
#     x[i] = i
# end

function somecopy!(v)
    # Because it is inside a function, the type of `v` will be known
    AK.foreachindex(v) do i
        v[i] = i
    end
end

somecopy!(x)    # This works

AcceleratedKernels.foraxes - Function

foraxes(
    f, itr, dims::Union{Nothing, <:Integer}=nothing, backend::Backend=get_backend(itr);

    # CPU settings
    scheduler=:threads,
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size=256,
)

Parallelised for loop over the indices along axis dims of an iterable.

It allows you to run normal Julia code on a GPU over multiple arrays - e.g. CuArray, ROCArray, MtlArray, oneArray - with one GPU thread per index.

On CPUs at most max_tasks threads are launched, or fewer such that each thread processes at least min_elems indices; if a single task ends up being needed, f is inlined and no thread is launched. Tune it to your function - the more expensive it is, the fewer elements are needed to amortise the cost of launching a thread (which is a few μs). The scheduler can be :polyester to use Polyester.jl cheap threads or :threads to use normal Julia threads; either can be faster depending on the function, but in general the latter is more composable.

Examples

Normally you would write a for loop like this:

function f()
    x = Array(reshape(1:30, 3, 10))
    y = similar(x)
    for i in axes(x, 2)
        for j in axes(x, 1)
            @inbounds y[j, i] = 2 * x[j, i] + 1
        end
    end
end

Using this function you can have the same for loop body over a GPU array:

using CUDA
import AcceleratedKernels as AK

function f()
    x = CuArray(reshape(1:3000, 3, 1000))
    y = similar(x)
    AK.foraxes(x, 2) do i
        for j in axes(x, 1)
            @inbounds y[j, i] = 2 * x[j, i] + 1
        end
    end
end

Important note: to use this function on a GPU, the objects referenced inside the loop body must have known types - i.e. be inside a function. For example:

using oneAPI
import AcceleratedKernels as AK

x = oneArray(reshape(1:3000, 3, 1000))

# CRASHES - typical error message: "Reason: unsupported dynamic function invocation"
# AK.foraxes(x) do i
#     x[i] = i
# end

function somecopy!(v)
    # Because it is inside a function, the type of `v` will be known
    AK.foraxes(v) do i
        v[i] = i
    end
end

somecopy!(x)    # This works
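
As a further illustration of `foraxes` with an explicit `dims`, a sketch of a per-column sum with one thread per column (the array size is a placeholder; for large arrays a dedicated parallel reduction may well be faster):

import AcceleratedKernels as AK

function column_sums!(out, m)
    # One thread per column (dims=2); each thread loops over its rows serially
    AK.foraxes(m, 2) do j
        acc = zero(eltype(m))
        for i in axes(m, 1)
            @inbounds acc += m[i, j]
        end
        @inbounds out[j] = acc
    end
    out
end

m = rand(Float32, 100, 10_000)            # works the same with GPU arrays
out = similar(m, size(m, 2))
column_sums!(out, m)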