Overview · AcceleratedKernels.jl

Parallel algorithm building blocks for the Julia ecosystem, targeting multithreaded CPUs, and GPUs via Intel oneAPI, AMD ROCm, Apple Metal and Nvidia CUDA (and any future backends added to the JuliaGPU organisation).

What's Different?

As far as I am aware, this is the first cross-architecture parallel standard library built from a unified codebase. The code is written as KernelAbstractions.jl backend-agnostic kernels, which are then transpiled to a GPU backend, so we benefit from all the optimisations available on the native platform and official compiler stacks. For example, unlike open standards such as OpenCL, which require GPU vendors to implement the API for their hardware, we target the existing official compilers. And while performance-portability libraries like Kokkos and RAJA are powerful for large C++ codebases, they require US National Lab-level development and maintenance efforts to effectively forward calls from a single API to separately developed libraries such as OpenMP, CUDA Thrust, ROCm rocThrust and oneAPI DPC++.

As a simple example, this is how a normal Julia for-loop can be converted to an accelerated kernel - for both multithreaded CPUs and Nvidia / AMD / Intel / Apple GPUs, with native performance - by changing a single line:

<table> <tr> <td> CPU Code </td> <td> Multithreaded / GPU code </td> </tr> <tr> <td>

# Copy kernel testing throughput

function cpu_copy!(dst, src)
    for i in eachindex(src)
        dst[i] = src[i]
    end
end

</td> <td>

import AcceleratedKernels as AK

function ak_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end

</td> </tr> </table>
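A minimal usage sketch for the `ak_copy!` function from the table, assuming AcceleratedKernels.jl is installed and Julia was started with several threads (for example `julia --threads=4`). The array names and sizes are illustrative. The commented-out GPU lines are an assumption: they presume CUDA.jl and an Nvidia GPU are available, and any other JuliaGPU array type should be usable in the same way.

```julia
import AcceleratedKernels as AK

# Same function as in the table: the loop body runs once per index of `src`
function ak_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end

# CPU arrays: the work is split across the available Julia threads
src = rand(Float32, 100_000)
dst = similar(src)
ak_copy!(dst, src)
@assert dst == src

# GPU arrays (assumption: CUDA.jl installed, Nvidia GPU present);
# the function itself is unchanged, only the array type differs.
# using CUDA
# src_gpu = CuArray(src)
# dst_gpu = similar(src_gpu)
# ak_copy!(dst_gpu, src_gpu)
# @assert Array(dst_gpu) == src
```

The backend is picked from the type of the array passed in, which is why the same function body serves both cases.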

This is only possible because of Julia's unique compilation model, the JuliaGPU organisation's work on reusable GPU backend infrastructure, and especially the KernelAbstractions.jl backend-agnostic kernel language. Thank you.

Status

The AcceleratedKernels.jl GPU sort and accumulate implementations were adopted as the official AMDGPU algorithms! The API is starting to stabilise; it follows the Julia standard library fairly closely, and additionally exposes all temporary arrays for memory reuse. For any new ideas / requests, please join the conversation on Julia Discourse or post an issue.
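As a rough illustration of what "follows the Julia standard library" and "exposes temporary arrays" can look like in practice, here is a short sketch on CPU arrays. It assumes the `init` keyword of `AK.reduce` and the `temp` keyword of `AK.sort!` as described in the package documentation; check the docstrings of your installed version, since keyword names may differ between releases. The variable names are illustrative.

```julia
import AcceleratedKernels as AK

v = rand(1:100, 10_000)     # Vector{Int}, so the sum is exact

# Reduction: same shape as Base.reduce, with an explicit `init`
total = AK.reduce(+, v; init=0)
@assert total == sum(v)

# In-place sort: same shape as Base.sort!, with an optional
# caller-supplied scratch buffer that can be reused across calls
sorted = copy(v)
temp = similar(sorted)
AK.sort!(sorted; temp=temp)
@assert issorted(sorted)
```

Passing the same `temp` buffer to repeated calls avoids reallocating scratch memory each time, which matters most on GPUs where memory is scarcer.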

We have an extensive randomised test suite that we run on the CPU backend (single- and multi-threaded) on Windows, Ubuntu and macOS for Julia LTS, Stable and Pre-Release, plus the CUDA, AMDGPU, oneAPI and Metal backends on the JuliaGPU Buildkite - the exact same tests are run on all architectures to ensure uniform interfaces.

AcceleratedKernels.jl is also a fundamental building block of applications developed at EvoPhase, so it will see continuous heavy use with industry backing. Long-term stability, performance improvements and support are priorities for us.

Acknowledgements

Designed and built by Andrei-Leonard Nicusan, maintained with contributors.

Much of this work was possible because of the fantastic HPC resources at the University of Birmingham and the Birmingham Environment for Academic Research, which gave us free on-demand access to thousands of CPUs and GPUs that we experimented on, and the support teams we nagged. In particular, thank you to Kit Windows-Yule and Andrew Morris from the leadership of the BlueBEAR and Baskerville T2 supercomputers, and to Simon Branford, Simon Hartley, James Allsopp and James Carpenter for computing support.

License

AcceleratedKernels.jl is MIT-licensed. Enjoy.