sort and friends
Sorting algorithms with an interface and default settings similar to the Julia Base ones, on GPUs:
- `sort!` (in-place), `sort` (out-of-place), `sortperm!`, `sortperm`.
- Other names: `sort`, `sort_team`, `sort_team_by_key`, `stable_sort`, or variations thereof in Kokkos, RAJA and Thrust.
Function signatures:
AcceleratedKernels.sort! — Function

```julia
sort!(
    v::AbstractArray, backend::Backend=get_backend(v);
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

Sorts the array `v` in-place using the specified backend. The `lt`, `by`, `rev`, and `order` arguments are the same as for `Base.sort`.
CPU
CPU settings: use at most `max_tasks` threads to sort the array, such that at least `min_elems` elements are sorted by each thread. A parallel `sample_sort!` is used, processing independent slices of the array and deferring to `Base.sort!` for the final local sorts.
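For instance, the thread count and per-thread granularity can be tuned directly (a minimal sketch; the values chosen here are purely illustrative):

```julia
import AcceleratedKernels as AK

v = rand(1_000_000)

# Use at most 4 threads, with each thread sorting at least 100_000 elements
AK.sort!(v, max_tasks=4, min_elems=100_000)
```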
Note that the Base Julia `sort!` is mainly memory-bound, so multithreaded sorting only becomes faster when the operation is compute-heavy enough to hide memory latency. That includes:
- Sorting more complex types, e.g. lexicographic sorting of tuples / structs / strings.
- More complex comparators, e.g. `by=custom_complex_function` or `lt=custom_lt_function`.
- Less cache-predictable data movement, e.g. `sortperm`.
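As a sketch of such a compute-heavy case, lexicographic sorting of tuples with a custom `by` key (the key function here is illustrative):

```julia
import AcceleratedKernels as AK

# Sorting tuples by a derived key is compute-heavy enough that the
# multithreaded CPU path can hide memory latency better than a plain sort
v = [(rand(Int32), rand(Float32)) for _ in 1:100_000]
AK.sort!(v, by=t -> (t[2], t[1]))
```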
GPU
GPU settings: use `block_size` threads per block to sort the array. A parallel `merge_sort!` is used.
For both CPU and GPU backends, the `temp` argument can be used to reuse a temporary buffer of the same size as `v` to store the sorted output.
Examples
Simple parallel CPU sort using all available threads (as given by `julia --threads N`):

```julia
import AcceleratedKernels as AK

v = rand(1000)
AK.sort!(v)
```

Parallel GPU sorting, passing a temporary buffer to avoid allocating a new one:
```julia
using oneAPI
import AcceleratedKernels as AK

v = oneArray(rand(1000))
temp = similar(v)
AK.sort!(v, temp=temp)
```

AcceleratedKernels.sort — Function

```julia
sort(
    v::AbstractArray, backend::Backend=get_backend(v);
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

Out-of-place sort, same settings as `sort!`.
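For example, the out-of-place variant leaves the input untouched and returns a sorted copy:

```julia
import AcceleratedKernels as AK

v = rand(1000)
vsorted = AK.sort(v)    # `v` itself is left unmodified
```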
AcceleratedKernels.sortperm! — Function

```julia
sortperm!(
    ix::AbstractArray,
    v::AbstractArray,
    backend::Backend=get_backend(v);
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

Save into `ix` the index permutation of `v` such that `v[ix]` is sorted. The `lt`, `by`, `rev`, and `order` arguments are the same as for `Base.sortperm`. The same algorithms are used as for `sort!`, with custom by-index comparators.
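A minimal CPU sketch, preallocating the index array:

```julia
import AcceleratedKernels as AK

v = rand(1000)
ix = similar(v, Int)    # index array, same length as `v`
AK.sortperm!(ix, v)
# v[ix] is now sorted
```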
AcceleratedKernels.sortperm — Function

```julia
sortperm(
    v::AbstractArray,
    backend::Backend=get_backend(v);
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,

    # CPU settings
    max_tasks=Threads.nthreads(),
    min_elems=1,

    # GPU settings
    block_size::Int=256,

    # Temporary buffer, same size as `v`
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

Out-of-place sortperm, same settings as `sortperm!`.
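For example, on a GPU (shown here with oneAPI, as in the examples above):

```julia
using oneAPI
import AcceleratedKernels as AK

v = oneArray(rand(Float32, 1000))
ix = AK.sortperm(v)    # device array of indices such that v[ix] is sorted
```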
Specific implementations that the interfaces above forward to:
- `sample_sort!` - multithreaded CPU sample sort, deferring to `Base.sort!` on independent slices.
- `merge_sort!` (in-place), `merge_sort` (out-of-place) - sort arbitrary objects with custom comparisons.
- `merge_sort_by_key!`, `merge_sort_by_key` - sort a vector of keys along with a "payload", a vector of corresponding values.
- `merge_sortperm!`, `merge_sortperm`, `merge_sortperm_lowmem!`, `merge_sortperm_lowmem` - compute a sorting index permutation.
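As an illustration of the by-key variants, sorting a vector of keys while carrying a payload along with them (a minimal sketch):

```julia
import AcceleratedKernels as AK

keys = rand(Int32, 1000)
values = collect(1:1000)    # payload, reordered together with the keys

AK.merge_sort_by_key!(keys, values)
# `keys` is now sorted, and `values[i]` still corresponds to `keys[i]`
```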
Function signatures:
AcceleratedKernels.sample_sort! — Function

```julia
sample_sort!(
    v::AbstractArray;
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    max_tasks=Threads.nthreads(),
    min_elems=1,
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.sample_sortperm! — Function

```julia
sample_sortperm!(
    ix::AbstractArray, v::AbstractArray;
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    max_tasks=Threads.nthreads(),
    min_elems=1,
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.merge_sort! — Function

```julia
merge_sort!(
    v::AbstractArray, backend::Backend=get_backend(v);
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.merge_sort — Function

```julia
merge_sort(
    v::AbstractArray, backend::Backend=get_backend(v);
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.merge_sort_by_key! — Function

```julia
merge_sort_by_key!(
    keys::AbstractArray,
    values::AbstractArray,
    backend::Backend=get_backend(keys);
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    block_size::Int=256,
    temp_keys::Union{Nothing, AbstractArray}=nothing,
    temp_values::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.merge_sort_by_key — Function

```julia
merge_sort_by_key(
    keys::AbstractArray,
    values::AbstractArray,
    backend::Backend=get_backend(keys);
    lt=isless,
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    block_size::Int=256,
    temp_keys::Union{Nothing, AbstractArray}=nothing,
    temp_values::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.merge_sortperm! — Function

```julia
merge_sortperm!(
    ix::AbstractArray,
    v::AbstractArray,
    backend::Backend=get_backend(v);
    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    inplace::Bool=false,
    block_size::Int=256,
    temp_ix::Union{Nothing, AbstractArray}=nothing,
    temp_v::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.merge_sortperm — Function

```julia
merge_sortperm(
    v::AbstractArray, backend::Backend=get_backend(v);
    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    inplace::Bool=false,
    block_size::Int=256,
    temp_ix::Union{Nothing, AbstractArray}=nothing,
    temp_v::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.merge_sortperm_lowmem! — Function

```julia
merge_sortperm_lowmem!(
    ix::AbstractArray,
    v::AbstractArray,
    backend::Backend=get_backend(v);
    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

AcceleratedKernels.merge_sortperm_lowmem — Function

```julia
merge_sortperm_lowmem(
    v::AbstractArray, backend::Backend=get_backend(v);
    lt=(<),
    by=identity,
    rev::Union{Nothing, Bool}=nothing,
    order::Base.Order.Ordering=Base.Order.Forward,
    block_size::Int=256,
    temp::Union{Nothing, AbstractArray}=nothing,
)
```

Example:
```julia
import AcceleratedKernels as AK
using AMDGPU

v = ROCArray(rand(Int32, 100_000))
AK.sort!(v)
```

As GPU memory is more expensive, all functions in AcceleratedKernels.jl expose any temporary arrays they will use (the `temp` argument); you can supply your own buffers to make the algorithms not allocate additional GPU storage, e.g.:
```julia
v = ROCArray(rand(Float32, 100_000))
temp = similar(v)
AK.sort!(v, temp=temp)
```