sort and friends

Sorting algorithms with similar interface and default settings as the Julia Base ones, on GPUs:

  • sort! (in-place), sort (out-of-place)
  • sortperm!, sortperm
  • Other names: sort, sort_team, sort_team_by_key, stable_sort or variations in Kokkos, RAJA, Thrust that I know of.

Function signatures:

AcceleratedKernels.sort!Function
sort!(
    v::AbstractArray, backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)

Sorts the array v in-place using the specified backend. The lt, by, rev, and order arguments are the same as for Base.sort.

CPU

CPU settings: use at most max_tasks threads to sort the array such that at least min_elems elements are sorted by each thread. A parallel sample_sort! is used, processing independent slices of the array and deferring to Base.sort! for the final local sorts.

Note that the Base Julia sort! is mainly memory-bound, so multithreaded sorting only becomes faster if it is a more compute-heavy operation to hide memory latency - that includes:

  • Sorting more complex types, e.g. lexicographic sorting of tuples / structs / strings.
  • More complex comparators, e.g. by=custom_complex_function or lt=custom_lt_function.
  • Less cache-predictable data movement, e.g. sortperm.

GPU

GPU settings: use block_size threads per block to sort the array. A parallel merge_sort! is used.

For both CPU and GPU backends, the temp argument can be used to reuse a temporary buffer of the same size as v to store the sorted output.

Examples

Simple parallel CPU sort using all available threads (as given by julia --threads N):

import AcceleratedKernels as AK
v = rand(1000)
AK.sort!(v)

Parallel GPU sorting, passing a temporary buffer to avoid allocating a new one:

using oneAPI
import AcceleratedKernels as AK
v = oneArray(rand(1000))
temp = similar(v)
AK.sort!(v, temp=temp)
source
AcceleratedKernels.sortFunction
sort(
    v::AbstractArray, backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)

Out-of-place sort, same settings as sort!.

source
AcceleratedKernels.sortperm!Function
sortperm!(
    ix::AbstractArray,
    v::AbstractArray,
    backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)

Save into ix the index permutation of v such that v[ix] is sorted. The lt, by, rev, and order arguments are the same as for Base.sortperm. The same algorithms are used as for sort! with custom by-index comparators.

source
AcceleratedKernels.sortpermFunction
sortperm(
    v::AbstractArray,
    backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)

Out-of-place sortperm, same settings as sortperm!.

source

Specific implementations that the interfaces above forward to:

  • sample_sort! - multithreaded CPU sample sort, deferring to Base.sort! on independent slices.
  • merge_sort! (in-place), merge_sort (out-of-place) - sort arbitrary objects with custom comparisons.
  • merge_sort_by_key!, merge_sort_by_key - sort a vector of keys along with a "payload", a vector of corresponding values.
  • merge_sortperm!, merge_sortperm, merge_sortperm_lowmem!, merge_sortperm_lowmem - compute a sorting index permutation.

Function signatures:

AcceleratedKernels.sample_sort!Function
sample_sort!(
    v::AbstractArray;

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    max_tasks=Threads.nthreads(),
    min_elems=1,
    temp::Union{Nothing, AbstractArray}=nothing,
)
source
AcceleratedKernels.sample_sortperm!Function
sample_sortperm!(
    ix::AbstractArray, v::AbstractArray;

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    max_tasks=Threads.nthreads(),
    min_elems=1,
    temp::Union{Nothing, AbstractArray}=nothing,
)
source
AcceleratedKernels.merge_sort!Function
merge_sort!(
    v::AbstractGPUArray, backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp::Union{Nothing, AbstractGPUArray}=nothing,
)
source
AcceleratedKernels.merge_sortFunction
merge_sort(
    v::AbstractGPUArray, backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp::Union{Nothing, AbstractGPUArray}=nothing,
)
source
AcceleratedKernels.merge_sort_by_key!Function
merge_sort_by_key!(
    keys::AbstractArray,
    values::AbstractArray,
    backend::Backend=get_backend(keys);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp_keys::Union{Nothing, AbstractArray}=nothing,
    temp_values::Union{Nothing, AbstractArray}=nothing,
)
source
AcceleratedKernels.merge_sort_by_keyFunction
merge_sort_by_key(
    keys::AbstractGPUArray,
    values::AbstractGPUArray,
    backend::Backend=get_backend(keys);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp_keys::Union{Nothing, AbstractGPUArray}=nothing,
    temp_values::Union{Nothing, AbstractGPUArray}=nothing,
)
source
AcceleratedKernels.merge_sortperm!Function
merge_sortperm!(
    ix::AbstractGPUArray,
    v::AbstractGPUArray,
    backend::Backend=get_backend(v);

    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    inplace::Bool=false,
    block_size::Int=256,
    temp_ix::Union{Nothing, AbstractGPUArray}=nothing,
    temp_v::Union{Nothing, AbstractGPUArray}=nothing,
)
source
AcceleratedKernels.merge_sortpermFunction
merge_sortperm(
    v::AbstractGPUArray, backend::Backend=get_backend(v);

    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    inplace::Bool=false,
    block_size::Int=256,
    temp_ix::Union{Nothing, AbstractGPUArray}=nothing,
    temp_v::Union{Nothing, AbstractGPUArray}=nothing,
)
source
AcceleratedKernels.merge_sortperm_lowmem!Function
merge_sortperm_lowmem!(
    ix::AbstractGPUArray,
    v::AbstractGPUArray,
    backend::Backend=get_backend(v);

    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp::Union{Nothing, AbstractGPUArray}=nothing,
)
source
AcceleratedKernels.merge_sortperm_lowmemFunction
merge_sortperm_lowmem(
    v::AbstractGPUArray, backend::Backend=get_backend(v);

    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp::Union{Nothing, AbstractGPUArray}=nothing,
)
source

Example:

import AcceleratedKernels as AK
using AMDGPU

v = ROCArray(rand(Int32, 100_000))
AK.sort!(v)

As GPU memory is more expensive, all functions in AcceleratedKernels.jl expose any temporary arrays they will use (the temp argument); you can supply your own buffers to make the algorithms not allocate additional GPU storage, e.g.:

v = ROCArray(rand(Float32, 100_000))
temp = similar(v)
AK.sort!(v, temp=temp)