Roadmap / Future Plans
Help is very welcome for any of the below:
- Automated optimisation / tuning of e.g. `block_size` for a given input; can be made algorithm-agnostic.
  - Maybe something like `AK.@tune reduce(f, src, init=init, block_size=$block_size) block_size=(64, 128, 256, 512, 1024)`. Macro wizards, help!
  - Or make it general, like `AK.@tune begin reduce(f, src, init=init, block_size=$block_size, switch_below=$switch_below) block_size=(64, 128, 256, 512, 1024) switch_below=(1, 10, 100, 1000, 10000) end`.
  - A hand-rolled sketch of what such a macro could automate is included after this list.
- We need multithreaded implementations of `sort`, N-dimensional `mapreduce` (in `OhMyThreads.tmapreduce`) and `accumulate` (again, probably in `OhMyThreads`); a minimal `tmapreduce` sketch follows this list.
- Any way to expose the warp-size from the backends? It would be useful in reductions; a host-side CUDA query is sketched after this list.
- Add a performance regressions runner.
- Other ideas? Post an issue, or open a discussion on the Julia Discourse.
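
A minimal hand-rolled version of the sweep an `AK.@tune` macro could generate, as referenced in the tuning item above. The `AK.reduce` call with its `init` and `block_size` keywords follows the example in the list; the helper name `tune_block_size`, the BenchmarkTools timing and the default candidate list are illustrative assumptions, not an existing API.

```julia
using BenchmarkTools                 # provides @belapsed; any timer would do
import AcceleratedKernels as AK

# Hypothetical helper showing what `AK.@tune` could expand to: time the same
# call for each candidate block_size and keep the fastest one.
function tune_block_size(f, src; init, candidates=(64, 128, 256, 512, 1024))
    timings = map(candidates) do bs
        # `$` interpolation stops BenchmarkTools from benchmarking global lookups
        t = @belapsed AK.reduce($f, $src; init=$init, block_size=$bs)
        (block_size=bs, time=t)
    end
    return argmin(nt -> nt.time, timings)
end

# Usage sketch, e.g. on a backend array `src` of Float32:
# best = tune_block_size(+, src; init=0.0f0)
# best.block_size    # fastest candidate for this input
```

A macro version would mainly need to splice the `$`-marked parameters into such a loop and return the best combination found.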
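For the multithreading item: the flat (no `dims`) CPU case already maps directly onto `OhMyThreads.tmapreduce`; the N-dimensional and `accumulate` variants are the missing pieces. The `reduce_cols` helper below is a naive single-dimension illustration under that assumption, not a proposed implementation.

```julia
using OhMyThreads: tmapreduce, tmap

# Flat (no-dims) multithreaded mapreduce: a direct call into OhMyThreads.
src = rand(Float32, 1_000_000)
total = tmapreduce(abs2, +, src; init=0.0f0)    # threaded sum of abs2.(src)

# Naive sketch of one N-dimensional case: reduce a matrix over dims=1 by running
# an independent serial mapreduce per column, parallelised over the columns.
# A real implementation would have to handle arbitrary `dims` and output shapes.
reduce_cols(f, op, A::AbstractMatrix; init) =
    tmap(j -> mapreduce(f, op, view(A, :, j); init=init), 1:size(A, 2))

# col_sums = reduce_cols(abs2, +, rand(Float32, 100, 1_000); init=0.0f0)
```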
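On the warp-size question: CUDA.jl already reports it host-side, as in the snippet below; whether the other backends expose an equivalent query that could be surfaced uniformly is the open part.

```julia
using CUDA

# Host-side query on the CUDA backend; typically 32 on NVIDIA hardware.
# Other backends have analogous notions (wavefront / sub-group size), but there
# is no uniform cross-backend accessor yet; that is the open question above.
ws = CUDA.warpsize(CUDA.device())
```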