Sorting · AcceleratedKernels.jl

`sort` and friends

Sorting algorithms with similar interface and default settings as the Julia Base ones, on GPUs:

sort! (in-place), sort (out-of-place)
sortperm!, sortperm
Other names: sort, sort_team, sort_team_by_key, stable_sort or variations in Kokkos, RAJA, Thrust that I know of.

Function signatures:

AcceleratedKernels.sort! — Function

sort!(
    v::AbstractArray, backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)

Sorts the array v in-place using the specified backend. The lt, by, rev, and order arguments are the same as for Base.sort.

CPU

CPU settings: use at most max_tasks threads to sort the array such that at least min_elems elements are sorted by each thread. A parallel sample_sort! is used, processing independent slices of the array and deferring to Base.sort! for the final local sorts.

Note that the Base Julia sort! is mainly memory-bound, so multithreaded sorting only becomes faster if it is a more compute-heavy operation to hide memory latency - that includes:

Sorting more complex types, e.g. lexicographic sorting of tuples / structs / strings.
More complex comparators, e.g. by=custom_complex_function or lt=custom_lt_function.
Less cache-predictable data movement, e.g. sortperm.

GPU

GPU settings: use block_size threads per block to sort the array. A parallel merge_sort! is used.

For both CPU and GPU backends, the temp argument can be used to reuse a temporary buffer of the same size as v to store the sorted output.

Examples

Simple parallel CPU sort using all available threads (as given by julia --threads N):

import AcceleratedKernels as AK
v = rand(1000)
AK.sort!(v)

Parallel GPU sorting, passing a temporary buffer to avoid allocating a new one:

using oneAPI
import AcceleratedKernels as AK
v = oneArray(rand(1000))
temp = similar(v)
AK.sort!(v, temp=temp)

source

AcceleratedKernels.sort — Function

sort(
    v::AbstractArray, backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)

Out-of-place sort, same settings as sort!.

source

AcceleratedKernels.sortperm! — Function

sortperm!(
    ix::AbstractArray,
    v::AbstractArray,
    backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)

Save into ix the index permutation of v such that v[ix] is sorted. The lt, by, rev, and order arguments are the same as for Base.sortperm. The same algorithms are used as for sort! with custom by-index comparators.

source

AcceleratedKernels.sortperm — Function

sortperm(
    v::AbstractArray,
    backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)

Out-of-place sortperm, same settings as sortperm!.

source

Specific implementations that the interfaces above forward to:

sample_sort! - multithreaded CPU sample sort, deferring to Base.sort! on independent slices.
merge_sort! (in-place), merge_sort (out-of-place) - sort arbitrary objects with custom comparisons.
merge_sort_by_key!, merge_sort_by_key - sort a vector of keys along with a "payload", a vector of corresponding values.
merge_sortperm!, merge_sortperm, merge_sortperm_lowmem!, merge_sortperm_lowmem - compute a sorting index permutation.

Function signatures:

AcceleratedKernels.sample_sort! — Function

sample_sort!(
    v::AbstractArray;

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    max_tasks=Threads.nthreads(),
    min_elems=1,
    temp::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.sample_sortperm! — Function

sample_sortperm!(
    ix::AbstractArray, v::AbstractArray;

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    max_tasks=Threads.nthreads(),
    min_elems=1,
    temp::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.merge_sort! — Function

merge_sort!(
    v::AbstractArray, backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.merge_sort — Function

merge_sort(
    v::AbstractArray, backend::Backend=get_backend(v);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.merge_sort_by_key! — Function

merge_sort_by_key!(
    keys::AbstractArray,
    values::AbstractArray,
    backend::Backend=get_backend(keys);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp_keys::Union{Nothing, AbstractArray}=nothing,
    temp_values::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.merge_sort_by_key — Function

merge_sort_by_key(
    keys::AbstractArray,
    values::AbstractArray,
    backend::Backend=get_backend(keys);

    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp_keys::Union{Nothing, AbstractArray}=nothing,
    temp_values::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.merge_sortperm! — Function

merge_sortperm!(
    ix::AbstractArray,
    v::AbstractArray,
    backend::Backend=get_backend(v);

    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    inplace::Bool=false,
    block_size::Int=256,
    temp_ix::Union{Nothing, AbstractArray}=nothing,
    temp_v::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.merge_sortperm — Function

merge_sortperm(
    v::AbstractArray, backend::Backend=get_backend(v);

    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    inplace::Bool=false,
    block_size::Int=256,
    temp_ix::Union{Nothing, AbstractArray}=nothing,
    temp_v::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.merge_sortperm_lowmem! — Function

merge_sortperm_lowmem!(
    ix::AbstractArray,
    v::AbstractArray,
    backend::Backend=get_backend(v);

    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
)

source

AcceleratedKernels.merge_sortperm_lowmem — Function

merge_sortperm_lowmem(
    v::AbstractArray, backend::Backend=get_backend(v);

    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
)

source

Example:

import AcceleratedKernels as AK
using AMDGPU

v = ROCArray(rand(Int32, 100_000))
AK.sort!(v)

As GPU memory is more expensive, all functions in AcceleratedKernels.jl expose any temporary arrays they will use (the temp argument); you can supply your own buffers to make the algorithms not allocate additional GPU storage, e.g.:

v = ROCArray(rand(Float32, 100_000))
temp = similar(v)
AK.sort!(v, temp=temp)

sort and friends

`sort` and friends