The CUDA Handbook

Nicholas Wilt

Mentioned 2

The CUDA Handbook begins where CUDA by Example (Addison-Wesley, 2011) leaves off, discussing CUDA hardware and software in greater detail and covering both CUDA 5.0 and Kepler. Every CUDA developer, from the casual to the most sophisticated, will find something here of interest and immediate usefulness. Newer CUDA developers will see how the hardware processes commands and how the driver checks progress; more experienced CUDA developers will appreciate the expert coverage of topics such as the driver API and context migration, as well as the guidance on how best to structure CPU/GPU data interchange and synchronization.

The accompanying open source code, more than 25,000 lines of it and freely available, is specifically intended to be reused and repurposed by developers.

Designed to be both a comprehensive reference and a practical cookbook, the text is divided into the following three parts:

Part I, Overview, gives high-level descriptions of the hardware and software that make CUDA possible.

Part II, Details, provides thorough descriptions of every aspect of CUDA, including

  • Memory
  • Streams and events
  • Models of execution, including the dynamic parallelism feature, new with CUDA 5.0 and SM 3.5
  • The streaming multiprocessors, including descriptions of all features through SM 3.5
  • Programming multiple GPUs
  • Texturing

The source code accompanying Part II is presented as reusable microbenchmarks and microdemos, designed to expose specific hardware characteristics or highlight specific use cases.

Part III, Select Applications, details specific families of CUDA applications and key parallel algorithms, including

  • Streaming workloads
  • Reduction
  • Parallel prefix sum (Scan)
  • N-body
  • Image Processing

These algorithms cover the full range of potential CUDA applications.

Mentioned in questions and answers.

CUDA is Nvidia's parallel computing platform and programming model for GPUs (Graphics Processing Units). CUDA provides an interface to Nvidia GPUs through a variety of programming languages, libraries, and APIs. Before posting CUDA questions, please read "How to get Useful Answers to your CUDA Questions on Stack Overflow" below.

Online documentation for many aspects of CUDA programming is available here.

The CUDA platform enables application development using several languages and associated APIs, including:

There are also frameworks that extend CUDA to smooth the development process, such as Managed CUDA, which offers features like debugging and profiling.

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

How to get Useful Answers to your CUDA Questions on Stack Overflow

Here are a number of suggestions to users new to CUDA and/or Stack Overflow. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which requires calling cudaDeviceSynchronize(). Here is an example of how to do error checking in CUDA programs.
  • If you are getting unspecified launch failure it is likely that your code is causing a segmentation fault, meaning the code is accessing memory that is not allocated for the code to use. Try to verify that the indexing is correct and check if cuda-memcheck is reporting any errors.
  • Search Stack Overflow (and the web!) for similar questions before asking yours. Some questions are frequently asked, as for example on
  • Include an as-simple-as-possible code example in your question and you are much more likely to get a useful answer. If the code is short and self-contained (so users can test it themselves), that is even better.
  • The CUDA debugger is also very useful when you are not really sure what your code is doing. You can monitor resources at the warp, thread, block, SM and grid level, and you can follow your program's execution. If a segmentation fault occurs in your program, the debugger can help you find where the crash occurred and see what the context is.
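
The error-checking suggestion above can be sketched as follows. This is a minimal illustration, not the example the tag wiki originally linked to: the `CUDA_CHECK` macro and the `fill` kernel are hypothetical names introduced here, while `cudaGetLastError()`, `cudaDeviceSynchronize()` and `cudaGetErrorString()` are standard CUDA runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): report the failing
// call's location and error string, then abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel used only to have something to launch.
__global__ void fill(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = value;
}

int main()
{
    const int n = 1 << 20;
    int *d_data = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&d_data, n * sizeof(int)));

    fill<<<(n + 255) / 256, 256>>>(d_data, n, 42);
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch time
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

Checking both `cudaGetLastError()` and the return value of `cudaDeviceSynchronize()` separates launch-configuration errors from faults that occur during kernel execution, which is usually the information a question needs to include.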


Assume there is a large int array in host memory that the threads of a kernel read or write, and that it cannot be held in GPU global memory. When accesses to the array's elements are coalesced, we can expect burst reads and writes from and to host memory that approach the theoretical bandwidth of the PCI-Express lanes. But how will the bandwidth compare with the theoretical maximum when the reads or writes are totally sporadic (non-coalesced)?

Based on the picture below, from here, every write transaction should be a 128-bit packet. But that assumes 32-bit addressing (even though UVA requires 64-bit machines) and a 4-byte data size. If these assumptions hold, one can expect around one-fourth of the maximum bandwidth for writes and one-seventh for reads. Am I correct? Are these assumptions true?

enter image description here
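
One way to arrive at the fractions in the question is sketched below. It rests on assumptions that the original picture may or may not share: a 12-byte (3-DW) transaction-layer header for 32-bit addressing, a 4-byte payload, a read counted as a request packet plus a completion packet, and no framing, sequence-number, or link-CRC overhead. The symbols $\eta_{\text{write}}$ and $\eta_{\text{read}}$ are introduced here for the useful fraction of the bytes moved.

```latex
\begin{align*}
\eta_{\text{write}} &= \frac{4}{\underbrace{12}_{\text{header}} + \underbrace{4}_{\text{data}}}
                     = \frac{4}{16} = \frac{1}{4}, \\[4pt]
\eta_{\text{read}}  &= \frac{4}{\underbrace{12}_{\text{request}} + \underbrace{12 + 4}_{\text{completion}}}
                     = \frac{4}{28} = \frac{1}{7}.
\end{align*}
```

Under the same simplifications, a 16-byte (4-DW) header for 64-bit addressing would lower the write figure to $4/(16+4) = 1/5$, so the 32-bit assumption matters for the estimate.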


I designed and ran a test for the above question with a uint array. The bandwidth I measured was around 480 MB/s for both heavy sporadic reads and heavy sporadic writes on a PCI-Express link with about 6.6 GB/s of bandwidth, a 13x to 14x slowdown. How can this be explained?

Only NVIDIA knows for sure (it might have to do with host memory only being accessible from one of the memory partitions), but that is a known performance limitation with mapped pinned memory. From pp. 127-128 of The CUDA Handbook: "It is important that mapped pinned memory be accessed with coalesced memory transactions (Section 5.2.9). The performance penalty for uncoalesced memory transactions ranges from 6x to 2x. But even on SM 2.x and later GPUs, whose caches were supposed to make coalescing an obsolete consideration, the penalty is significant."

Sounds like the penalty is even higher on more recent hardware.
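
For readers who want to measure the penalty on their own hardware, here is a minimal sketch of such a test. It is not the asker's benchmark and covers reads only. The kernel name `readMapped`, the `CUDA_CHECK` macro, the buffer size, and the stride of 4099 are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): abort on any failed call.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Each thread reads one 4-byte word of mapped host memory and stores it in
// device memory. n must be a power of two and stride odd, so that
// (i * stride) % n visits every element exactly once.
__global__ void readMapped(const unsigned int *hostMapped, unsigned int *sink,
                           size_t n, size_t stride)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        sink[i] = hostMapped[(i * stride) % n];
}

int main()
{
    const size_t n = size_t(1) << 24;              // 64 MiB of unsigned int
    const size_t bytes = n * sizeof(unsigned int);
    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((n + threads - 1) / threads);

    // Requested before any context exists so host allocations can be mapped.
    CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    unsigned int *hostBuf = nullptr, *hostMapped = nullptr, *sink = nullptr;
    CUDA_CHECK(cudaHostAlloc((void **)&hostBuf, bytes, cudaHostAllocMapped));
    for (size_t i = 0; i < n; ++i)
        hostBuf[i] = (unsigned int)i;
    CUDA_CHECK(cudaHostGetDevicePointer((void **)&hostMapped, hostBuf, 0));
    CUDA_CHECK(cudaMalloc((void **)&sink, bytes));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Untimed warm-up launch so one-time startup costs are not measured.
    readMapped<<<blocks, threads>>>(hostMapped, sink, n, 1);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    const size_t strides[2] = {1, 4099};           // coalesced, then scattered
    for (size_t stride : strides) {
        CUDA_CHECK(cudaEventRecord(start));
        readMapped<<<blocks, threads>>>(hostMapped, sink, n, stride);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaEventRecord(stop));
        CUDA_CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
        std::printf("stride %zu: %.2f GB/s\n", stride, bytes / (ms * 1.0e6));
    }

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(sink));
    CUDA_CHECK(cudaFreeHost(hostBuf));
    return 0;
}
```

With a stride of 1, neighbouring threads read neighbouring words, so the hardware can merge them into large transactions. With an odd stride of 4099, neighbouring threads read words roughly 16 KB apart. The ratio of the two printed figures gives the uncoalesced penalty for reads on the machine under test.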

Related tags