The CUDA Handbook begins where CUDA by Example (Addison-Wesley, 2011) leaves off, discussing CUDA hardware and software in greater detail and covering both CUDA 5.0 and Kepler. Every CUDA developer, from the casual to the most sophisticated, will find something here of interest and immediate usefulness. Newer CUDA developers will see how the hardware processes commands and how the driver checks progress; more experienced CUDA developers will appreciate the expert coverage of topics such as the driver API and context migration, as well as the guidance on how best to structure CPU/GPU data interchange and synchronization. The accompanying open source code (more than 25,000 lines of it, freely available at www.cudahandbook.com) is specifically intended to be reused and repurposed by developers.

Designed to be both a comprehensive reference and a practical cookbook, the text is divided into the following three parts:

Part I, Overview, gives high-level descriptions of the hardware and software that make CUDA possible.

Part II, Details, provides thorough descriptions of every aspect of CUDA, including:

- Memory
- Streams and events
- Models of execution, including the dynamic parallelism feature, new with CUDA 5.0 and SM 3.5
- The streaming multiprocessors, including descriptions of all features through SM 3.5
- Programming multiple GPUs
- Texturing

The source code accompanying Part II is presented as reusable microbenchmarks and microdemos, designed to expose specific hardware characteristics or highlight specific use cases.

Part III, Select Applications, details specific families of CUDA applications and key parallel algorithms, including:

- Streaming workloads
- Reduction
- Parallel prefix sum (Scan)
- N-body
- Image Processing

These algorithms cover the full range of potential CUDA applications.
CUDA is Nvidia's parallel computing platform and programming model for GPUs (Graphics Processing Units). CUDA provides an interface to Nvidia GPUs through a variety of programming languages, libraries, and APIs. Before posting CUDA questions, please read "How to get Useful Answers to your CUDA Questions on Stack Overflow" below.
Online documentation for many aspects of CUDA programming is available at https://docs.nvidia.com/cuda/.
The CUDA platform enables application development using several languages and associated APIs, including:

- C and C++ (CUDA C/C++, compiled with nvcc)
- Fortran (CUDA Fortran)
- Python (through libraries such as Numba, CuPy, and PyCUDA)

There are also frameworks that extend CUDA to enable a smoother development process, such as ManagedCuda for .NET, which adds features such as debugging and profiling support.
You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.
Here are a number of suggestions to users new to CUDA and/or Stack Overflow. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!
- Make sure you are doing proper CUDA error checking: verify that every CUDA API call returns cudaSuccess. If you are not, and you don't know why an error occurs, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which requires calling cudaDeviceSynchronize(). An example of how to do error checking in CUDA programs is sketched below this list.
- If you are getting an unspecified launch failure, it is likely that your code is causing a segmentation fault, meaning the code is accessing memory that is not allocated for it to use. Try to verify that the indexing is correct and check whether cuda-memcheck is reporting any errors.
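A minimal sketch of such error checking (the CUDA_CHECK macro name is just illustrative, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void kernel(int *p) { p[threadIdx.x] = threadIdx.x; }

int main() {
    int *d;
    CUDA_CHECK(cudaMalloc(&d, 32 * sizeof(int)));
    kernel<<<1, 32>>>(d);
    CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```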
Assume that there's a large int array in host memory that threads of a kernel read from or write to, and that cannot be held in GPU global memory. When accesses to the elements of the array are coalesced, we can expect burst reads/writes from/to the host memory that get close to the theoretical bandwidth of the PCI-Express lanes. But how much bandwidth can we expect, compared to the theoretical maximum, when the reads or writes are totally sporadic (non-coalesced)?
Based on the picture of the PCIe packet layout below, every write transaction should carry a 128-bit-long packet. But that assumes 32-bit addressing (even though UVA requires 64-bit machines) and a 4-byte data payload. If these assumptions are true, one can expect around one-fourth of the maximum bandwidth for writes and one-seventh for reads. Am I correct? Are these assumptions true?
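If I'm reading those fractions right, the arithmetic behind them would be (my reconstruction, assuming a 12-byte TLP header under 32-bit addressing and a 4-byte payload):

$$\text{writes: } \frac{4}{12 + 4} = \frac{1}{4}, \qquad \text{reads: } \frac{4}{12 + (12 + 4)} = \frac{1}{7},$$

where a read costs both a request packet (header only) and a completion packet (header plus data), while a write is a single posted packet.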
I designed and performed a test of the above with a uint array. The bandwidth I measured was around 480 MB/s for both heavy sporadic reads and heavy sporadic writes, on a PCI-Express link with about 6.6 GB/s of bandwidth, i.e., a 13x to 14x slowdown. How can that be explained?
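A minimal sketch of the kind of test described (names and sizes are illustrative; it assumes a platform that supports mapped pinned memory, and it omits error checking for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes one element of mapped (zero-copy) host memory at a
// scattered index, so consecutive threads touch non-contiguous addresses
// and the PCIe writes cannot be coalesced.
__global__ void sporadicWrite(unsigned int *hostArr, const unsigned int *idx,
                              size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) hostArr[idx[i]] = (unsigned int)i;
}

int main() {
    const size_t n = 1 << 24;               // 64 MB of uints
    cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapped pinned memory
    unsigned int *hArr, *dArr, *dIdx;
    cudaHostAlloc((void **)&hArr, n * sizeof(unsigned int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dArr, hArr, 0);

    // Scatter pattern: multiplying by an odd constant permutes 0..n-1
    // when n is a power of two, giving "sporadic" target addresses.
    unsigned int *hIdx = new unsigned int[n];
    for (size_t i = 0; i < n; ++i)
        hIdx[i] = (unsigned int)((i * 2654435761u) & (n - 1));
    cudaMalloc((void **)&dIdx, n * sizeof(unsigned int));
    cudaMemcpy(dIdx, hIdx, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    sporadicWrite<<<(unsigned)((n + 255) / 256), 256>>>(dArr, dIdx, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.1f MB/s\n", (double)(n * sizeof(unsigned int)) / (ms * 1e3));
    return 0;
}
```

Replacing the permuted index with idx[i] = i gives the coalesced baseline for comparison.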
Only NVIDIA knows for sure (it might have to do with host memory only being accessible from one of the memory partitions), but that is a known performance limitation with mapped pinned memory. From pp. 127-128 of The CUDA Handbook: "It is important that mapped pinned memory be accessed with coalesced memory transactions (Section 5.2.9). The performance penalty for uncoalesced memory transactions ranges from 6x to 2x. But even on SM 2.x and later GPUs, whose caches were supposed to make coalescing an obsolete consideration, the penalty is significant."
Sounds like the penalty is even higher on more recent hardware.