GPU code written in CUDAnative.jl can be as fast as, or even outperform, CUDA C compiled with nvcc (provided that the same hardware features are used). This section describes how to achieve that, and what to be careful about.
When optimizing code, it is important to know what to optimize. Luckily, the CUDA toolkit ships an excellent profiler, nvprof, with nvvp as its Eclipse-based UI. The CUDAnative compiler is fully compatible with these tools, and generates the line number information required to debug performance issues.
Although CUDAnative exports a @profile macro, it does not serve the same purpose as Base.@profile. Rather, it instructs the CUDA profiler to start right before the first kernel launch. This avoids profiling during the time Julia or CUDAnative precompile code, and results in a much more compact timeline view. If you want to use this feature, disable the nvvp option "Start profiling at application start". As with all Julia code, also perform a warm-up iteration without the profiler activated.
For true source-level profiling akin to Base.@profile, look at nvvp's PC Sampling View (requires compute capability >= 5.2 and CUDA >= 7.5). In the future, we might have a CUDAnative.@profile offering similar functionality, using the NVIDIA CUPTI library.
This section is a WIP. Some things to consider:
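Float64 is expensive, but literal floats are Float64 by default, so a single literal can silently widen a Float32 computation.
The same goes for integers: literals are Int64 on 64-bit systems, and although the performance hit is small, the wider type increases register pressure.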
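The effect of literals on types can be checked in plain Julia, without a GPU. The following is a minimal sketch rather than a recipe from CUDAnative itself: the variable names and the double helper are made up for illustration, and the same promotion rules are assumed to apply inside kernel code.

```julia
# Plain Julia, no GPU required: how literals change the type of an expression.
x = 1f0                      # Float32 value, as is typical for GPU data

wide   = x * 2.0             # 2.0 is a Float64 literal, so the product is Float64
narrow = x * 2f0             # 2f0 is a Float32 literal, so the product stays Float32

i = Int32(3)                 # 32-bit integer, e.g. an index

wide_idx   = i + 1           # 1 is an Int literal (Int64 on 64-bit systems)
narrow_idx = i + Int32(1)    # both operands are Int32, so the sum stays Int32
generic    = i + one(i)      # one(i) has the type of i, so this stays Int32 too

# Hypothetical helper (not part of CUDAnative): doubles a value without
# widening it, whatever the numeric type of the input is.
double(v) = v * oftype(v, 2)

println(typeof(wide), " ", typeof(narrow))
println(typeof(wide_idx), " ", typeof(narrow_idx), " ", typeof(generic))
```

Writing constants with one, zero or oftype keeps a function generic over its element type, whereas a suffix such as f0 ties it to Float32. Which one is preferable depends on how the kernel is meant to be reused.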