Saturday 9 May 2015

What is CUDA? / Basics of CUDA (Necessity of GPU, Host, Device, Kernel, Stream Multiprocessor, Stream Processor, Thread, Block, Grid, Warp, Memory architecture of GPU)

                     In this post, we will see the basics of CUDA programming and its core terminology: Host, Device, Kernel, Stream Multiprocessor, Stream Processor, Thread, Block, Grid, Warp, and the memory architecture of a GPU.


Why do we need a GPU when we already have a CPU?
              GPU stands for Graphics Processing Unit, while CPU stands for Central Processing Unit. A CPU with, say, four cores can run at most four threads truly in parallel (one thread per core). In graphics (e.g. image) processing, many pixels must be processed simultaneously, which requires many threads running at the same time. But as noted above, a CPU with its limited number of cores can run only a limited number of threads in parallel. Hence we need a GPU, which consists of many more cores, especially for graphics-heavy workloads such as computer games.
               For your information, Nvidia's GeForce GTX TITAN X GPU has 3072 cores. For its specifications, see Nvidia's product page.

What is CUDA?
              Now let us see what CUDA is. CUDA stands for Compute Unified Device Architecture. It is a parallel computing platform and programming model (with accompanying libraries) developed by Nvidia, originally for graphics processing. These libraries can be used with C or C++. CUDA is specific to Nvidia; this means you cannot use CUDA with graphics cards (GPUs) from other companies such as AMD or Intel.

Host and Device 
            In CUDA, the CPU is called the Host, while the GPU is called the Device.

            A function (in C/C++) to be executed on the GPU is called a Kernel. When defining a kernel, the function is prefixed with the keyword __global__:

 __global__ void matadd(int *a, int *b)
 {
     // code to be executed on GPU
 }

Here matadd() is a kernel.

Note: Only functions that are called from CPU code (e.g. the main() function) are kernels, and they are prefixed with __global__ in their definition. Functions called from a kernel (or from another device function) also run on the GPU, but they are device functions rather than kernels, and are prefixed with __device__ in their definition.
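As a minimal sketch (the function names, data, and launch configuration here are illustrative, not from the original post), a __global__ kernel can call a __device__ helper and is launched from the host with the <<<blocks, threads>>> syntax:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__ function: runs on the GPU, callable only from GPU code.
__device__ int add(int x, int y) { return x + y; }

// __global__ kernel: runs on the GPU, launched from the host.
__global__ void matadd(const int *a, const int *b, int *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    c[i] = add(a[i], b[i]);
}

int main()
{
    const int n = 128;
    int ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2 * i; }

    int *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(int));
    cudaMalloc(&db, n * sizeof(int));
    cudaMalloc(&dc, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(int), cudaMemcpyHostToDevice);

    matadd<<<1, n>>>(da, db, dc);        // launch 1 block of 128 threads
    cudaMemcpy(hc, dc, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("hc[10] = %d\n", hc[10]);     // 10 + 2*10 = 30
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```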

Streaming Multiprocessor(SM) and Stream Processor(SP)
             A GPU consists of smaller components called Streaming Multiprocessors (SMs). Each SM consists of many Stream Processors (SPs), on which the actual computation is done. Each SP is also called a CUDA core.

Thread, Block, Grid
             A Thread is a single instance of execution. One or more threads can be executed on one SP. A group of threads is called a Block. One or more blocks can be executed on one SM. A group of blocks is called a Grid. One grid is generated per kernel launch on one GPU. On early GPUs, only one kernel could execute at a time; modern GPUs can run multiple kernels concurrently.

             The group of threads from a block that runs simultaneously on an SM is called a Warp. On Nvidia GPUs, a warp consists of 32 threads executed in lockstep.
             Suppose a block has 128 threads and runs on an SM that has only 8 SPs. On 8 SPs, only 8 threads can execute at once. If you consider an instruction pipeline with four stages (say Fetch, Decode, Execute, Write-back), then 4 threads can be in flight on one SP. Hence the number of threads running simultaneously on that SM is 8 * 4 = 32. These 32 threads form one warp. Since our block contains 128 threads, it has 128 / 32 = 4 warps. This means that on that SM, the first warp runs, then the second, third, and fourth, and after that the first again, then the second, and so on.
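The warp bookkeeping above can be written as plain integer arithmetic (the helper name is mine, not from the post; 32 is the warp size on current Nvidia GPUs):

```cuda
// Number of warps needed for a block of a given size, rounding up
// so that a partially filled warp still counts as one warp.
__host__ __device__ int warpsPerBlock(int threadsPerBlock, int warpSize)
{
    return (threadsPerBlock + warpSize - 1) / warpSize;
}

// warpsPerBlock(128, 32) == 4   (the example in the text: 4 full warps)
// warpsPerBlock(100, 32) == 4   (3 full warps plus one partial warp)
```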

Memory architecture of a GPU: 


             Now let us see the memory architecture of a GPU. A GPU contains the following types of memory.

1. Local Memory
                   Each thread has its own Local memory. Variables declared in a kernel (a function to be executed on the GPU) that cannot be kept in registers, such as large per-thread arrays or spilled values, are saved into local memory, which physically resides in device memory.

2. Registers
                   A kernel may consist of several expressions. During execution of an expression, intermediate values are kept in the registers of the SP. Registers are the fastest memory on the GPU, and most scalar variables declared in a kernel live in registers.
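As an illustrative kernel fragment (not from the original post), scalar variables typically live in registers, while a large per-thread array is likely to be placed in local memory:

```cuda
__global__ void example(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float x = in[i] * 2.0f;   // scalar: almost certainly held in a register
    float buf[256];           // large per-thread array: likely placed in
                              // local memory (which resides in device memory)
    buf[i % 256] = x;
    out[i] = buf[i % 256] + x;
}
```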

3. Global Memory
                      Global memory is the main memory of the GPU. Whenever memory is allocated on the GPU for variables using the cudaMalloc() function, it is, by default, allocated in global memory.
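A minimal sketch of allocating global memory and copying data to and from it (the array size and names are illustrative):

```cuda
#include <cuda_runtime.h>

int main()
{
    const int n = 256;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));                               // allocate in global memory
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that read/write dev ...
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(dev);                                                   // release global memory
    return 0;
}
```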

4. Shared Memory
                     One or more threads run on one SP, and a collection of threads is called a Block; one or more blocks run on one SM. The advantage of shared memory is that it is shared by all the threads in one block.
                     Shared memory also reduces latency (memory access delay). How? Shared memory is a small on-chip memory located on the SM itself, whereas global memory is large and off-chip, so an access to shared memory completes in far fewer cycles than an access to global memory.
                     Keep in mind that the shared memory on an SM is shared by all threads in one block. To place a variable in shared memory, prefix its declaration with the keyword __shared__, e.g. __shared__ int x;.
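As a sketch of a common pattern (not code from the post), each block can stage data from global memory into shared memory, synchronize, and then let every thread in the block reuse values loaded by other threads:

```cuda
__global__ void reverseInBlock(const int *in, int *out)
{
    __shared__ int tile[128];            // one copy per block, visible to all its threads

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = in[i];                     // each thread loads one element
    __syncthreads();                     // wait until the whole block has loaded

    out[i] = tile[blockDim.x - 1 - t];   // read a value loaded by another thread
}
```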

5. Constant Memory
                       Constant Memory is used to store constant values.
                       The advantage of having a separate constant memory is reduced latency. It is useful in situations where multiple threads have to read the same value. How does constant memory reduce latency? Suppose there are 32 threads in one block, and all of them read the same variable. From global memory, this would take 32 accesses. If we store the variable in constant memory instead, the first thread's access fetches the value, which is broadcast to the other threads of its half-warp; the value is also kept in the constant cache and served from there to the threads of the other half-warp. Hence the total number of memory accesses is just one instead of 32.
                      Constant memory can be used for a variable by prefixing the keyword __constant__ to its declaration (e.g. __constant__ int x;).
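A minimal sketch of declaring and filling a __constant__ variable (the host writes it with cudaMemcpyToSymbol(); the names here are illustrative):

```cuda
#include <cuda_runtime.h>

__constant__ float scale;                // lives in constant memory, read-only for kernels

__global__ void scaleAll(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= scale;                    // every thread reads the same value,
}                                        // served by the constant cache/broadcast

int main()
{
    float s = 2.5f;
    cudaMemcpyToSymbol(scale, &s, sizeof(float));   // host writes the constant
    // ... allocate data with cudaMalloc and launch scaleAll<<<blocks, threads>>> ...
    return 0;
}
```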

6. Texture Memory
                      Texture memory is again used to reduce latency, and it serves a special case: accesses with spatial locality. Consider an image: when we access a particular pixel, there is a high chance that we will also access the surrounding pixels. Texture memory is cached in a way that is optimized for such groups of nearby values that are accessed together.
