CUDA by Example

Jason Sanders, Edward Kandrot

Mentioned 8

The complete guide to developing high-performance applications with CUDA - written by CUDA development team members, and supported by NVIDIA.

  • Breakthrough techniques for using the power of graphics processors to create high-performance general-purpose applications.
  • Packed with realistic, C-based examples -- from basic to advanced.
  • Covers one of today's most highly anticipated new technologies for software development wherever performance is crucial: finance, design automation, science, simulation, graphics, and beyond.

NVIDIA graphics processors have immense computational power. With NVIDIA's breakthrough CUDA software platform, that power can be put to work in virtually any type of software development that requires exceptionally high performance, from finance to physics. Now, for the first time, two of NVIDIA's senior CUDA developers thoroughly introduce the platform, and show developers exactly how to make the most of it.

CUDA C by Example is the first book on CUDA development for professional programmers - and the only book created with NVIDIA's direct involvement. Concise and practical, it focuses on presenting proven techniques and concrete example code for building high-performance parallelized CUDA programs with C. Programmers familiar with C will need no other skills or experience to get started - making high-performance programming more accessible than it's ever been before.


Mentioned in questions and answers.

I am working to become a scientific programmer. I have enough background in math and statistics, but I am rather lacking in programming background. I have found it very hard to learn how to use a language for scientific programming (SP) because most of the references for SP are close to trivial.

My work involves statistical/financial modelling, and none of it involves physics models. Currently, I use Python extensively with numpy and scipy. I have also used R and Mathematica. I know enough C/C++ to read code. I have no experience in Fortran.

I don't know if this is a good list of languages for a scientific programmer. If it is, what is a good reading list for learning the syntax and design patterns of these languages in scientific settings?

This might be useful: The Nature of Mathematical Modeling.

Writing Scientific Software: A Guide to Good Style is a good book with overall advice for modern scientific programming.

I'm a scientific programmer who entered the field in the past 2 years. I'm more into biology and physics modeling, but I bet what you're looking for is pretty similar. While I was applying for jobs and internships, there were two things that I didn't think would be that important to know, but that caused me to miss out on opportunities. One was MATLAB, which has already been mentioned. The other was database design -- no matter what area of SP you're in, there's probably going to be a lot of data that has to be managed somehow.

The book Database Design for Mere Mortals by Michael Hernandez was recommended to me as being a good start and helped me out a lot in my preparation. I would also make sure you at least understand some basic SQL if you don't already.

One issue scientific programmers face is maintaining a repository of code (and data) that others can use to reproduce your experiments. In my experience this is a skill not required in commercial development.

Here are some readings on this:

These are in the context of computational biology but I assume it applies to most scientific programming.

Also, look at Python Scripting for Computational Science.

I am a newbie to GPU programming. I have a laptop with an NVIDIA GeForce GT 640 card. I am faced with two dilemmas; suggestions are most welcome.

  1. If I go for CUDA -- Ubuntu or Windows? Clearly CUDA is more suited to Windows, while it can be a severe issue to install on Ubuntu. I have seen some blog posts which claim to have installed CUDA 5 on Ubuntu 11.10 and Ubuntu 12.04. However, I have not been able to get them to work. Also, standard CUDA textbooks prefer to work in the Windows domain and are more or less silent about installation and use on Unix/Ubuntu.

  2. CUDA or OpenCL -- Now this is probably trickier than my first question! I have mostly come across GPGPU projects using CUDA/Nvidia, but OpenCL is probably the next best option in open source, and installing it on Ubuntu probably will not be an issue, though some suggestions here would be most useful. Am I sacrificing any functionality if I go for OpenCL and NOT CUDA?

Any help or suggestions?

  1. If you use OpenCL, you can easily use it on both Windows and Linux, because having display drivers is enough to run OpenCL programs, and for programming you would simply need to install the SDK. CUDA has more requirements, such as specific GCC versions. But it is not much more difficult to install on Linux either.

  2. On Linux, CUDA has strange requirements such as using GCC 4.6 or 4.7. If you use a different version of GCC, you won't be able to compile your program anymore. If you use OpenCL, you can use any compiler because you would just need to link with the common OpenCL library. So OpenCL is easier to set up, use and compile for. Once you compile an OpenCL program it can be run on any hardware (as long as it is coded to do so) even if it was compiled using another brand's OpenCL SDK.

You can write OpenCL programs which will function on Nvidia, AMD, and Intel hardware, on GPUs, CPUs, and accelerators. Even more, Altera is working on supporting OpenCL on FPGAs! If you use CUDA, you will have to use Nvidia GPUs only and rewrite your code in OpenCL or another language for other platforms. This is a serious limitation of CUDA and a cause of serious wasted time in the long run.

I see that somebody posted some old comparisons between CUDA and OpenCL, but they are old! When those documents came out, only AMD properly supported OpenCL. Since 2013, OpenCL has been supported by ARM, Altera, Intel etc. and has become an industry standard.

The only downside is that since OpenCL is so flexible, you are faced with more options and ways to code memory allocations, transfers etc. in your program. It may therefore feel more complicated.

The idea of doing remote rendering (typically for a video game) which is streamed to a client device is conceptually quite simple, barring obvious issues like lag for an interactive fast-paced game.

But - technically how could you do it? My understanding is that streaming video not only caches ahead of the current play-back position, but that video files are compressed by looking ahead many frames?

Are there libraries that would let you feed an arbitrary "display feed" into a serverside video-source, so that you could then play it on the client using regular Flash/HTML5 components? Avoiding the need for a custom app or bespoke browser-plugin would be a significant benefit... i.e. the client-side web-page doesn't know it's not a regular video.

It's a bit like a web-cam I suppose... but I want the 'camera' to be 'watching' the output of a window rendered to on the server.

I'm targeting Windows-based servers and C++ rendering apps.

Stream Encoding Video


I'm working on a similar problem and I'll share what I've learned. While I don't know how to stream them out, I do know how to generate and encode multiple HD video streams on the server. I've tested two approaches: NVIDIA CUDA Video Encode (C Library) API and Intel Performance Primitives Video Encoder. The NVIDIA link takes you right to the example. The Intel page does not have internal anchors so you'll have to search for "Video Encoder".

Test Setup

Both encode video streams, up to and including HD, to H.264. Other formats are supported, but I am interested in H.264. To test performance, I prepared input video in YUV format and fed it to the encoders as fast as they would take it. Output from both encoders was 1080P.

CPU Performance

Performance-wise, the Intel video encoder could encode a single stream at 0.5X real time with about a 12.5% total load on a Xeon E5520 @ 2.27GHz, i.e. one core of eight at 100% load. Newer Xeons are much faster, but I don't know if they can hit real-time yet.

GPU Performance

The NVIDIA encoder on a GTS 450 could encode 1080P at 9-10X real-time(!) with a 50% CPU load. The CPU load with the NVIDIA encoder appears to come primarily from copying data to and from the GPU.

What is particularly nice about the GPU solution is that it can take a GPU render surface as input; graphics are generated and encoded on the GPU, leaving it only to go out to the network. For details on using a render surface as an input, see CUDA by Example, an excellent and straightforward book on GPU programming. In that case I would expect CPU load to drop by approximately half. Since there is no point in going faster than real-time for real-time graphics, you could likely encode 8+ streams from render surfaces with adequate GPU resources, e.g. two GTS 450 cards, and perhaps many more if a resolution lower than 1080P is acceptable.

I've read the following and most of the NVIDIA manuals and other content. I was also at GTC last year for the papers and talks.

CUDA by Example: An Introduction to General-Purpose GPU Programming

Programming Massively Parallel Processors: A Hands-on Approach

And I'm aware of the latest GPU Computing Gems Emerald Edition but haven't read it yet.

What other books and resources would you recommend? For instance I'm sure there's some great content from the first wave of data parallel programming in the 80s (the Connection Machine etc). I know a lot of research was done on data parallel algorithms for that generation of hardware.

Followup... 30/Mar/2011

I also discovered that the GPU Gems books 1-3 have some chapters on GPU computing, not just graphics. They're available free online, but I've not had a chance to read them yet.

Hillis & Steele [1986], "Data Parallel Algorithms".

While compiling this hello world sample in Ubuntu 10.10

This is from CUDA by Example, chapter 3 (No compile instructions provided >:@)

#include <iostream>

__global__ void kernel (void){
}

int main(void){
    kernel <<<1,1>>>();
    printf("Hellow World!\n");
    return 0;
}

I got this:

$ nvcc -lcudart error: identifier "printf" is undefined

1 error detected in the compilation of "/tmp/tmpxft_00007812_00000000-4_hello.cpp1.ii".

Why? How should this code be compiled?

You need to include stdio.h not iostream (which is for std::cout stuff) for printf (see man 3 printf). I found the source code for the book here.

chapter03/ is actually:

/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 * NVIDIA Corporation and its licensors retain all intellectual property and 
 * proprietary rights in and to this software and related documentation. 
 * Any use, reproduction, disclosure, or distribution of this software 
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 * Please refer to the applicable NVIDIA end user license agreement (EULA) 
 * associated with this source code for terms and conditions that govern 
 * your use of this NVIDIA software.
 */

#include "../common/book.h"

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}

Where ../common/book.h includes stdio.h.

The README.txt file details how to compile the examples:

The vast majority of these code examples can be compiled quite easily by using 
NVIDIA's CUDA compiler driver, nvcc. To compile a typical example, say 
"," you will simply need to execute:

> nvcc

I'm using some standard GLSL (version 120) vertex and fragment shaders to simulate LIDAR. In other words, instead of just returning a color at each x,y position (each pixel, via the fragment shader), it should return color and distance.

I suppose I don't actually need all of the color bits, since I really only want the intensity; so I could store the distance in gl_FragColor.b, for example, and use .rg for the intensity. But then I'm not entirely clear on how I get the value back out again.

Is there a simple way to return values from the fragment shader? I've tried varying, but it seems like the fragment shader can't write variables other than gl_FragColor.

I understand that some people use the GLSL pipeline for general-purpose (non-graphics) GPU processing, and that might be an option — except I still do want to render my objects normally.

Fragment shaders output to a rendering buffer. If you want to use the GPU for computing and fetch the data back into host memory, you have a few options:

  • Create a framebuffer and attach a texture to it to hold your data. Once the image has been rendered you can read back information from the texture into host memory.
  • Use CUDA, OpenCL, or an OpenGL compute shader to write the data into an arbitrary bound buffer, and then read back the buffer contents.
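
As a rough illustration of the second option, the sketch below uses CUDA to write an (intensity, distance) pair per pixel into a device buffer and then copies the buffer back to host memory. The names `shade`, `check`, and `render_and_read_back`, and the placeholder intensity and distance values, are assumptions made for illustration; a real LIDAR simulation would compute those values from the scene.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Placeholder per-pixel computation: .x holds intensity, .y holds distance.
// Both values are made up here; they stand in for a real scene query.
__global__ void shade(float2 *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // guard threads outside the image
    float intensity = 0.5f;
    float distance = (float)(x + y);
    out[y * width + x] = make_float2(intensity, distance);
}

// Runs the kernel over a width x height image and returns the results on the host.
std::vector<float2> render_and_read_back(int width, int height) {
    size_t count = (size_t)width * (size_t)height;
    float2 *d_out = NULL;
    check(cudaMalloc((void **)&d_out, count * sizeof(float2)), "cudaMalloc");

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    shade<<<grid, block>>>(d_out, width, height);
    check(cudaGetLastError(), "kernel launch");

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<float2> host(count);
    check(cudaMemcpy(host.data(), d_out, count * sizeof(float2),
                     cudaMemcpyDeviceToHost), "cudaMemcpy");
    check(cudaFree(d_out), "cudaFree");
    return host;
}
```

Because the output buffer is ordinary device memory, each element can hold full-precision floats rather than values squeezed into color channels. Rendering the objects normally would still be a separate pass.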

CUDA is Nvidia's parallel computing platform and programming model for GPUs (Graphics Processing Units). CUDA provides an interface to Nvidia GPUs through a variety of programming languages, libraries, and APIs. Before posting CUDA questions, please read "How to get Useful Answers to your CUDA Questions on Stack Overflow" below.

Online documentation for many aspects of CUDA programming is available here.

The CUDA platform enables application development using several languages and associated APIs, including:

There are also frameworks that extend CUDA to enable a smoother development process like Managed CUDA, which has features like debugging and profiling.

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

How to get Useful Answers to your CUDA Questions on Stack Overflow

Here are a number of suggestions to users new to CUDA and/or Stack Overflow. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which requires calling cudaDeviceSynchronize(). Here is an example of how to do error checking in CUDA programs.
  • If you are getting unspecified launch failure it is likely that your code is causing a segmentation fault, meaning the code is accessing memory that is not allocated for the code to use. Try to verify that the indexing is correct and check if cuda-memcheck is reporting any errors.
  • Search Stack Overflow (and the web!) for similar questions before asking yours. Some questions are frequently asked, as for example on
  • Include an as-simple-as-possible code example in your question and you are much more likely to get a useful answer. If the code is short and self-contained (so users can test it themselves), that is even better.
  • The debugger for CUDA, , is also very useful when you are not really sure what you are doing. You can monitor resources by warp, thread, block, SM and grid level. You can follow your program's execution. If a segmentation fault occurs in your program, can help you find where the crash occurred and see what the context is.
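
The first suggestion in the list above can be sketched as follows. The macro name `CUDA_CHECK`, the kernel `fill`, and the helper `fill_and_read_first` are hypothetical names chosen for illustration; only the CUDA runtime calls themselves come from the CUDA API. This is one common pattern, not necessarily the example the original post linked to.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: wraps a CUDA runtime call and aborts if it does not
// return cudaSuccess, reporting the error string and source location.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Writes `value` into every element; the bounds check prevents the kind of
// out-of-range access that tends to show up as "unspecified launch failure".
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Fills n ints on the device with `value` and returns the first element.
int fill_and_read_first(int n, int value) {
    int *d_data = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_data, n * sizeof(int)));

    const int threads = 128;
    fill<<<(n + threads - 1) / threads, threads>>>(d_data, n, value);
    CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    int first = 0;
    CUDA_CHECK(cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_data));
    return first;
}
```

Kernel launches return no status, so the sketch checks `cudaGetLastError()` right after the launch and then the result of `cudaDeviceSynchronize()`. The message the macro prints is the kind of information worth including in a question.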


I have code in C++ and want to use it along with CUDA. Can anyone please help me? Should I provide my code? I tried doing so, but I need some starting code to proceed. I know how to write a simple square program (using CUDA and C++) on Windows (Visual Studio). Is that sufficient for my program?

The following are both good places to start. CUDA by Example is a good tutorial that gets you up and running pretty fast. Programming Massively Parallel Processors includes more background, e.g. chapters on the history of GPU architecture, and generally more depth.

CUDA by Example: An Introduction to General-Purpose GPU Programming

Programming Massively Parallel Processors: A Hands-on Approach

These both talk about CUDA 3.x so you'll want to look at the new features in CUDA 4.x at some point.

Thrust is definitely worth a look if your problem maps onto it well (see comment above). It's an STL-like library of containers, iterators and algorithms that implements data-parallel algorithms on top of CUDA.
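
For a sense of what Thrust code looks like, here is a minimal sketch that sorts and sums a small array on the GPU. The helper name `sort_and_sum` is hypothetical; the Thrust calls used (`thrust::device_vector`, `thrust::sort`, `thrust::reduce`, `thrust::copy`) are part of the library.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Hypothetical helper: sorts `values` in ascending order on the GPU and
// returns the sum of its elements.
int sort_and_sum(std::vector<int> &values) {
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());                   // parallel sort on the device
    int total = thrust::reduce(d.begin(), d.end(), 0);  // parallel sum, initial value 0

    // Copy the sorted data back into the host vector.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

There are no explicit kernels, launch configurations, or `cudaMemcpy` calls here. The containers and algorithms handle them, which is why Thrust is convenient when a problem maps onto its algorithms.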

Here are two tutorials on getting started with CUDA and Visual C++ 2010:

There's also a post on the NVIDIA forum:

Asking very general "how do I get started on ..." questions on Stack Overflow generally isn't the best approach. Typically the best reply you'll get is "go read a book or the manual". It's much better to ask specific questions here. Please don't create duplicate questions; it isn't helpful.