sort and friends
Sorting algorithms with a similar interface and default settings to the Julia Base ones, on GPUs: sort! (in-place), sort (out-of-place), sortperm!, sortperm.
- Other names: sort, sort_team, sort_team_by_key, stable_sort, or variations thereof in Kokkos, RAJA and Thrust that I know of.
Function signatures:
AcceleratedKernels.sort! — Function
sort!(
v::AbstractArray, backend::Backend=get_backend(v);
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
# CPU settings
max_tasks=Threads.nthreads(),
min_elems=1,
# GPU settings
block_size::Int=256,
# Temporary buffer, same size as `v`
temp::Union{Nothing, AbstractArray}=nothing,
)
Sorts the array v in-place using the specified backend. The lt, by, rev, and order arguments are the same as for Base.sort.
CPU
CPU settings: use at most max_tasks threads to sort the array, such that at least min_elems elements are sorted by each thread. A parallel sample_sort! is used, processing independent slices of the array and deferring to Base.sort! for the final local sorts.
Note that the Base Julia sort! is mainly memory-bound, so multithreaded sorting only becomes faster when the work is compute-heavy enough to hide the memory latency - that includes (see the sketch after this list):
- Sorting more complex types, e.g. lexicographic sorting of tuples / structs / strings.
- More complex comparators, e.g. by=custom_complex_function or lt=custom_lt_function.
- Less cache-predictable data movement, e.g. sortperm.
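For instance, here is a minimal sketch of such a compute-heavy case - sorting tuples by a custom key. The by function and the max_tasks / min_elems values are purely illustrative; the actual crossover point depends on your machine:
import AcceleratedKernels as AK

# One million (id, score) pairs; the custom key makes each comparison more
# compute-heavy than comparing plain floats, so multithreading can pay off
v = [(rand(Int32), rand(Float64)) for _ in 1:1_000_000]

# Sort by id ascending, then score descending, with each task handling at
# least 10_000 elements (illustrative settings, not tuned defaults)
AK.sort!(v, by=t -> (t[1], -t[2]), max_tasks=Threads.nthreads(), min_elems=10_000)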
GPU
GPU settings: use block_size threads per block to sort the array. A parallel merge_sort! is used.
For both CPU and GPU backends, the temp argument can be used to reuse a temporary buffer of the same size as v to store the sorted output.
Examples
Simple parallel CPU sort using all available threads (as given by julia --threads N):
import AcceleratedKernels as AK
v = rand(1000)
AK.sort!(v)
Parallel GPU sorting, passing a temporary buffer to avoid allocating a new one:
using oneAPI
import AcceleratedKernels as AK
v = oneArray(rand(1000))
temp = similar(v)
AK.sort!(v, temp=temp)
AcceleratedKernels.sort — Function
sort(
v::AbstractArray, backend::Backend=get_backend(v);
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
# CPU settings
max_tasks=Threads.nthreads(),
min_elems=1,
# GPU settings
block_size::Int=256,
# Temporary buffer, same size as `v`
temp::Union{Nothing, AbstractArray}=nothing,
)
Out-of-place sort, same settings as sort!.
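A minimal usage sketch (out-of-place, so the input itself is left unchanged):
import AcceleratedKernels as AK

v = rand(1000)
vs = AK.sort(v)    # returns a sorted copy; v is untouched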
AcceleratedKernels.sortperm! — Function
sortperm!(
ix::AbstractArray,
v::AbstractArray,
backend::Backend=get_backend(v);
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
# CPU settings
max_tasks=Threads.nthreads(),
min_elems=1,
# GPU settings
block_size::Int=256,
# Temporary buffer, same size as `v`
temp::Union{Nothing, AbstractArray}=nothing,
)
Save into ix the index permutation of v such that v[ix] is sorted. The lt, by, rev, and order arguments are the same as for Base.sortperm. The same algorithms are used as for sort!, with custom by-index comparators.
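A minimal CPU sketch (as with Base.sortperm!, ix is assumed to be an integer array of the same length as v):
import AcceleratedKernels as AK

v = rand(1000)
ix = similar(v, Int)    # index buffer, same length as v
AK.sortperm!(ix, v)
@assert issorted(v[ix])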
AcceleratedKernels.sortperm — Function
sortperm(
v::AbstractArray,
backend::Backend=get_backend(v);
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
# CPU settings
max_tasks=Threads.nthreads(),
min_elems=1,
# GPU settings
block_size::Int=256,
# Temporary buffer, same size as `v`
temp::Union{Nothing, AbstractArray}=nothing,
)
Out-of-place sortperm, same settings as sortperm!.
Specific implementations that the interfaces above forward to:
- sample_sort! - multithreaded CPU sample sort, deferring to Base.sort! on independent slices.
- merge_sort! (in-place), merge_sort (out-of-place) - sort arbitrary objects with custom comparisons.
- merge_sort_by_key!, merge_sort_by_key - sort a vector of keys along with a "payload", a vector of corresponding values (see the sketch after this list).
- merge_sortperm!, merge_sortperm, merge_sortperm_lowmem!, merge_sortperm_lowmem - compute a sorting index permutation.
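For instance, a sketch of the by-key variant on a GPU, sorting a key array and reordering its payload alongside (shown here with AMDGPU; any supported backend should work the same way):
import AcceleratedKernels as AK
using AMDGPU

keys = ROCArray(rand(Int32, 100_000))
vals = ROCArray(Int32.(1:100_000))    # payload reordered alongside the keys
AK.merge_sort_by_key!(keys, vals)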
Function signatures:
AcceleratedKernels.sample_sort! — Function
sample_sort!(
v::AbstractArray;
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
max_tasks=Threads.nthreads(),
min_elems=1,
temp::Union{Nothing, AbstractArray}=nothing,
)
AcceleratedKernels.sample_sortperm! — Function
sample_sortperm!(
ix::AbstractArray, v::AbstractArray;
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
max_tasks=Threads.nthreads(),
min_elems=1,
temp::Union{Nothing, AbstractArray}=nothing,
)
AcceleratedKernels.merge_sort! — Function
merge_sort!(
v::AbstractGPUArray, backend::Backend=get_backend(v);
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256,
temp::Union{Nothing, AbstractGPUArray}=nothing,
)
AcceleratedKernels.merge_sort — Function
merge_sort(
v::AbstractGPUArray, backend::Backend=get_backend(v);
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256,
temp::Union{Nothing, AbstractGPUArray}=nothing,
)
AcceleratedKernels.merge_sort_by_key! — Function
merge_sort_by_key!(
keys::AbstractArray,
values::AbstractArray,
backend::Backend=get_backend(keys);
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256,
temp_keys::Union{Nothing, AbstractArray}=nothing,
temp_values::Union{Nothing, AbstractArray}=nothing,
)
AcceleratedKernels.merge_sort_by_key — Function
merge_sort_by_key(
keys::AbstractGPUArray,
values::AbstractGPUArray,
backend::Backend=get_backend(keys);
lt=isless,
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256,
temp_keys::Union{Nothing, AbstractGPUArray}=nothing,
temp_values::Union{Nothing, AbstractGPUArray}=nothing,
)
AcceleratedKernels.merge_sortperm! — Function
merge_sortperm!(
ix::AbstractGPUArray,
v::AbstractGPUArray,
backend::Backend=get_backend(v);
lt=(<),
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
inplace::Bool=false,
block_size::Int=256,
temp_ix::Union{Nothing, AbstractGPUArray}=nothing,
temp_v::Union{Nothing, AbstractGPUArray}=nothing,
)
AcceleratedKernels.merge_sortperm — Function
merge_sortperm(
v::AbstractGPUArray, backend::Backend=get_backend(v);
lt=(<),
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
inplace::Bool=false,
block_size::Int=256,
temp_ix::Union{Nothing, AbstractGPUArray}=nothing,
temp_v::Union{Nothing, AbstractGPUArray}=nothing,
)
AcceleratedKernels.merge_sortperm_lowmem! — Function
merge_sortperm_lowmem!(
ix::AbstractGPUArray,
v::AbstractGPUArray,
backend::Backend=get_backend(v);
lt=(<),
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256,
temp::Union{Nothing, AbstractGPUArray}=nothing,
)
AcceleratedKernels.merge_sortperm_lowmem — Function
merge_sortperm_lowmem(
v::AbstractGPUArray, backend::Backend=get_backend(v);
lt=(<),
by=identity,
rev::Union{Nothing, Bool}=nothing,
order::Base.Order.Ordering=Base.Order.Forward,
block_size::Int=256,
temp::Union{Nothing, AbstractGPUArray}=nothing,
)
Example:
import AcceleratedKernels as AK
using AMDGPU
v = ROCArray(rand(Int32, 100_000))
AK.sort!(v)
As GPU memory is more expensive, all functions in AcceleratedKernels.jl expose any temporary arrays they will use (the temp argument); you can supply your own buffers to make the algorithms not allocate additional GPU storage, e.g.:
v = ROCArray(rand(Float32, 100_000))
temp = similar(v)
AK.sort!(v, temp=temp)
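The same scratch buffer can also be reused across many sorts of equally-sized arrays, e.g. inside a loop - a sketch, assuming the buffer matches the element type and length of the arrays being sorted:
vs = [ROCArray(rand(Float32, 100_000)) for _ in 1:10]
temp = similar(vs[1])         # one scratch buffer shared by all sorts
for v in vs
    AK.sort!(v, temp=temp)    # reuses temp rather than allocating new GPU storage
end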