=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.



Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end emits device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

   !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function; the string
``"kernel"`` and the value ``i32 1`` mark that function as a kernel. The
following example shows a kernel function calling a device function in LLVM
IR. The function ``@my_kernel`` is callable from host code, but ``@my_fmad``
is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

   ============= ======================
   Address Space Memory Space
   ============= ======================
   0             Generic
   1             Global
   2             Internal Use
   3             Shared
   4             Constant
   5             Local
   ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read from and write to this array, and host-side code
can copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0.  Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.
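
As a further illustration of the mapping above, the following sketch declares
variables in the shared and constant address spaces. The names ``@tile`` and
``@coeffs`` are hypothetical, and the ``undef`` initializer on the shared
array reflects the assumption that shared memory cannot be statically
initialized. Treat this as a sketch under those assumptions rather than a
definitive reference.

.. code-block:: llvm

    ; Hypothetical scratch array in shared memory (address space 3).
    ; It is left uninitialized, hence the undef initializer.
    @tile = internal addrspace(3) global [64 x float] undef

    ; Hypothetical lookup table in constant memory (address space 4).
    @coeffs = internal addrspace(4) global [4 x float] [ float 1.0, float 2.0, float 3.0, float 4.0 ]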


NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  They can be used on any pointer type.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.
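
A minimal sketch of how such a conversion might be used is shown below. The
function name ``@load_via_generic`` is hypothetical; the intrinsic declaration
is the ``i8`` overload listed above, and the IR follows the same textual
syntax as the other examples in this document.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)

    ; Hypothetical device function: converts a global-space pointer to a
    ; generic pointer and loads through the generic pointer.
    define i8 @load_via_generic(i8 addrspace(1)* %gptr) {
      %generic = call i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)* %gptr)
      %val = load i8* %generic
      ret i8 %val
    }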


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  They can be used on any pointer type.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space.  Note that these
intrinsics are only useful if the non-generic address space that the pointer
actually refers to is known.  It is not legal to use address space conversion
intrinsics to convert a pointer from one non-generic address space to another
non-generic address space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
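
The sketch below illustrates the reverse direction, assuming the intrinsic
takes a generic pointer and returns a pointer in the target address space.
The function name ``@store_shared`` is hypothetical, and the example is only
meaningful if the caller knows that ``%p`` really refers to shared memory.

.. code-block:: llvm

    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)

    ; Hypothetical device function: %p is a generic pointer that is known
    ; to refer to shared memory, so it can be converted to address space 3.
    define void @store_shared(i8* %p, i8 %v) {
      %s = call i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8* %p)
      store i8 %v, i8 addrspace(3)* %s
      ret void
    }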


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the thread and block indices and dimensions
of a kernel launch.  These intrinsics map to CUDA builtins in the following
way:

   ============= =====================================
   CUDA Builtin  PTX Special Register Intrinsic
   ============= =====================================
   ``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
   ``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
   ``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
   ``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
   ============= =====================================
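
As an illustration, the sketch below combines three of these intrinsics to
compute a one-dimensional global thread index, the IR counterpart of the CUDA
expression ``blockIdx.x * blockDim.x + threadIdx.x``. The kernel name
``@double_elements`` is hypothetical, and the example assumes the launch
covers exactly as many threads as ``%data`` has elements, since no bounds
check is performed.

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    ; Hypothetical kernel: each thread doubles one element of %data.
    define void @double_elements(float addrspace(1)* %data) {
      %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
      ; idx = ctaid.x * ntid.x + tid.x
      %base  = mul i32 %ctaid, %ntid
      %idx   = add i32 %base, %tid
      %elem  = getelementptr float addrspace(1)* %data, i32 %idx
      %val   = load float addrspace(1)* %elem
      %dbl   = fadd float %val, %val
      store float %dbl, float addrspace(1)* %elem
      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = metadata !{void (float addrspace(1)*)* @double_elements, metadata !"kernel", i32 1}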


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
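
The following sketch shows one way the barrier might be used together with
shared memory: every thread writes one element, waits at the barrier, and then
reads an element written by a different thread. The names ``@buf`` and
``@reverse_block`` are hypothetical, and the example assumes a launch with a
single block of exactly 64 threads.

.. code-block:: llvm

    ; Hypothetical shared scratch buffer, one slot per thread.
    @buf = internal addrspace(3) global [64 x i32] undef

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare void @llvm.nvvm.barrier0()

    ; Hypothetical kernel: reverses 64 elements of %data in place.
    define void @reverse_block(i32 addrspace(1)* %data) {
      %tid  = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %elem = getelementptr i32 addrspace(1)* %data, i32 %tid
      %v    = load i32 addrspace(1)* %elem
      %slot = getelementptr [64 x i32] addrspace(3)* @buf, i32 0, i32 %tid
      store i32 %v, i32 addrspace(3)* %slot
      ; Wait until every thread in the block has written its slot.
      call void @llvm.nvvm.barrier0()
      %rev  = sub i32 63, %tid
      %src  = getelementptr [64 x i32] addrspace(3)* @buf, i32 0, i32 %rev
      %w    = load i32 addrspace(3)* %src
      store i32 %w, i32 addrspace(1)* %elem
      ret void
    }

    !nvvm.annotations = !{!0}
    !0 = metadata !{void (i32 addrspace(1)*)* @reverse_block, metadata !"kernel", i32 1}

Without the barrier, a thread could read a slot before the thread that owns it
has stored its value.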


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

The snippets below omit error checking for brevity; each Driver API call
returns a ``CUresult`` that should be compared against ``CUDA_SUCCESS``.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.