Intrinsics

This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C language extensions appendix of the CUDA C programming guide. For more information about specific intrinsics, refer to that NVIDIA documentation.

Indexing and Dimensions

CUDAnative.gridDim (Function)
gridDim()::CuDim3

Returns the dimensions of the grid.

CUDAnative.blockIdx (Function)
blockIdx()::CuDim3

Returns the block index within the grid.

CUDAnative.blockDim (Function)
blockDim()::CuDim3

Returns the dimensions of the block.

CUDAnative.threadIdx (Function)
threadIdx()::CuDim3

Returns the thread index within the block.

CUDAnative.warpsize (Function)
warpsize()::UInt32

Returns the warp size (in threads).

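To show how these intrinsics combine, here is a minimal sketch of a kernel in which every thread computes its global linear index. Note that, unlike their 0-based CUDA C counterparts, these intrinsics return 1-based indices in Julia. The kernel name and the commented launch line are assumptions for illustration.

using CUDAnative

# Minimal sketch: every thread computes its global, 1-based linear index
# and stores it into `out`.
function index_kernel(out)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)              # guard against a partial last block
        @inbounds out[i] = i
    end
    return nothing
end

# Hypothetical launch, assuming a 1024-element device array `out` and the
# keyword-based @cuda syntax of recent CUDAnative versions:
# @cuda blocks=4 threads=256 index_kernel(out)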

Memory Types

Shared Memory

@cuStaticSharedMem(typ::Type, dims) -> CuDeviceArray{typ,Shared}

Get an array of type typ and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant (without requiring constant propagation, see JuliaLang/julia#5560), or an error will be thrown and the generator function will be called dynamically.

Multiple statically-allocated shared memory arrays can be requested by calling this macro multiple times.

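As an illustration, here is a minimal sketch of a per-block array reversal staged through a statically-sized shared buffer; the buffer length of 256 (which must match the assumed block size) and the kernel name are hypothetical.

using CUDAnative

# Minimal sketch: reverse 256 elements within one block via shared memory.
function reverse_kernel(d)
    buf = @cuStaticSharedMem(Float32, 256)
    t = threadIdx().x
    buf[t] = d[t]                    # stage this thread's element
    sync_threads()                   # wait until every element is staged
    d[t] = buf[257 - t]              # write back in reverse order
    return nothing
end
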
@cuDynamicSharedMem(typ::Type, dims, offset::Integer=0) -> CuDeviceArray{typ,Shared}

Get an array of type typ and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable and the dimension and offset parameters should be constant (without requiring constant propagation, see JuliaLang/julia#5560), or an error will be thrown and the generator function will be called dynamically.

Dynamic shared memory also needs to be allocated beforehand, when calling the kernel.

Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.

Note that calling this macro multiple times does not result in different shared arrays; only a single dynamically-allocated shared memory array exists.

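For comparison, a minimal sketch of the dynamic variant, where the buffer length n is only known at launch time. The shmem keyword used to size the allocation in the commented launch line follows the keyword-based @cuda syntax of recent CUDAnative versions and is an assumption here.

using CUDAnative

# Minimal sketch: the same reversal, but with a launch-time buffer size.
function dyn_reverse_kernel(d, n)
    buf = @cuDynamicSharedMem(Float32, n)
    t = threadIdx().x
    buf[t] = d[t]
    sync_threads()
    d[t] = buf[n - t + 1]
    return nothing
end

# Hypothetical launch, allocating n * sizeof(Float32) bytes of dynamic
# shared memory for a block of n threads:
# @cuda threads=n shmem=n*sizeof(Float32) dyn_reverse_kernel(d, n)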

Synchronization

sync_threads()

Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.

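To sketch why the barrier matters, the classic tree reduction below halves the active range at every step; without sync_threads between steps, a thread could read its partner's slot before the partner has written it. The helper name is hypothetical, and a power-of-two block size is assumed.

using CUDAnative

# Minimal sketch: block-wide sum over a shared-memory array that every
# thread has already filled with its contribution.
function block_sum(shared)
    t = threadIdx().x
    s = blockDim().x ÷ 2
    while s > 0
        sync_threads()               # make the previous step's writes visible
        if t <= s
            shared[t] += shared[t + s]
        end
        s ÷= 2
    end
    sync_threads()                   # publish the final sum in shared[1]
    return shared[1]
end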

Warp Vote

The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one of several ways, broadcasting a single return value to each participating thread.

CUDAnative.vote_all (Function)
vote_all(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for all of them.

CUDAnative.vote_any (Function)
vote_any(predicate::Bool)

Evaluate predicate for all active threads of the warp and return non-zero if and only if predicate evaluates to non-zero for any of them.

vote_ballot(predicate::Bool)

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.

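A minimal sketch of the three vote intrinsics in concert: every lane of the (assumed fully active) warp evaluates its predicate, and lane 1 records the combined results. The kernel and array names are hypothetical, and storing the flags as UInt32 assumes the documented 0/1 results convert cleanly.

using CUDAnative

# Minimal sketch: collective decisions across one warp of 32 threads.
function vote_kernel(flags, results)
    lane = threadIdx().x
    pred = flags[lane] > 0
    a = vote_all(pred)               # non-zero iff the predicate holds on every lane
    b = vote_any(pred)               # non-zero iff it holds on at least one lane
    mask = vote_ballot(pred)         # bit N set iff it holds on lane N
    if lane == 1
        results[1] = UInt32(a)
        results[2] = UInt32(b)
        results[3] = UInt32(mask)
    end
    return nothing
end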

Warp Shuffle

CUDAnative.shfl (Function)
shfl_idx(val, src::Integer, width::Integer=32)

Shuffle a value from a directly indexed lane src.

CUDAnative.shfl_up (Function)
shfl_up(val, src::Integer, width::Integer=32)

Shuffle a value from a lane with lower ID relative to caller.

CUDAnative.shfl_down (Function)
shfl_down(val, src::Integer, width::Integer=32)

Shuffle a value from a lane with higher ID relative to caller.

CUDAnative.shfl_xor (Function)
shfl_xor(val, src::Integer, width::Integer=32)

Shuffle a value from the lane determined by a bitwise XOR of the caller's own lane ID with src.

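As a sketch of a common use, the tree reduction below sums a value across a full warp with shfl_down: after five halving steps, lane 1 holds the warp-wide total. All 32 lanes are assumed active, and the helper name is hypothetical; using shfl_xor with the same offsets instead would leave the total in every lane.

using CUDAnative

# Minimal sketch: device-side helper summing `val` across a warp.
function warpsum(val)
    offset = 16
    while offset > 0
        val += shfl_down(val, offset)   # fetch the value `offset` lanes up
        offset ÷= 2
    end
    return val                          # lane 1 now holds the sum of all lanes
end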

Formatted Output

Print a formatted string in device context on the host standard output:

@cuprintf("%Fmt", args...)

Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.

Also beware that this is an untyped and unforgiving printf implementation. Type widths need to match exactly; e.g., printing a 64-bit Julia integer requires the %ld format specifier.

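A minimal sketch of that width-matching rule; the explicit Int64 conversion makes the argument unambiguously 64 bits wide, and the kernel name is hypothetical.

using CUDAnative

# Minimal sketch: matching format specifiers to Julia types.
function print_kernel()
    i = Int64(threadIdx().x)         # 64-bit integer, hence %ld below
    @cuprintf("thread %ld sees value %f\n", i, 3.14)
    return nothing
end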