Streaming Multiprocessor (SM) and Streaming Processor (SP)
A GPU consists of smaller components called Streaming Multiprocessors (SMs). Each SM contains many Streaming Processors (SPs), on which the actual computation is done. Each SP is also called a CUDA core.
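As a rough illustration of this organisation, the sketch below asks the CUDA runtime how many SMs a device has. The helper name sm_count is hypothetical; cudaGetDeviceProperties() and the multiProcessorCount field belong to the CUDA runtime API. It assumes a CUDA-capable GPU is present.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of SMs on `device`, or -1 on error.
int sm_count(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    std::printf("%s: %d SMs, up to %d threads per block\n",
                prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    return prop.multiProcessorCount;
}
```

Calling sm_count(0) from main() reports the first GPU. The runtime does not report the number of CUDA cores per SM directly; that figure depends on the GPU architecture.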
Memory architecture of a GPU:
Now let us look at the memory architecture of a GPU. The following diagram shows all the types of memory present in a GPU.
1. Local Memory
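Local memory is private to each thread running on an SP. Variables declared in a kernel (a function executed on the GPU) belong to the thread that declares them. The compiler tries to keep such variables in registers; those that do not fit, such as large arrays or arrays indexed with a value known only at run time, are saved into local memory. Despite its name, local memory resides in off-chip device memory, so it is about as slow as global memory.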
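2. Registers
A kernel may consist of several expressions. During the execution of an expression, the operands and intermediate values are held in the registers of the SP. Registers are the fastest memory on the GPU, and each thread has its own set.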
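The per-thread storage described above can be sketched with a small kernel. This is only an illustration: the names private_vars and run_private_vars are hypothetical, and whether a given variable ends up in a register or in local memory is decided by the compiler, not by the source code.

```cuda
#include <cuda_runtime.h>

// Every thread gets its own private copy of `scaled` and `scratch`.
__global__ void private_vars(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int scaled = in[i] * 2;        // scalar: normally kept in a register
    int scratch[4];                // small per-thread array
    for (int k = 0; k < 4; ++k) scratch[k] = scaled + k;
    out[i] = scratch[in[i] & 3];   // run-time index: the array may be placed in local memory
}

// Hypothetical host helper: runs the kernel on one value with a single thread.
int run_private_vars(int value) {
    int *d_in = nullptr, *d_out = nullptr, result = -1;
    cudaMalloc(&d_in, sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, &value, sizeof(int), cudaMemcpyHostToDevice);
    private_vars<<<1, 1>>>(d_in, d_out, 1);
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Compiling with nvcc -Xptxas -v prints how many registers and how many bytes of local memory the kernel actually uses.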
3. Global Memory
It is the main memory of the GPU. Memory allocated on the GPU with the cudaMalloc() function resides in global memory. It is the largest memory on the device and is accessible to all threads, but it also has the highest latency.
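A minimal sketch of the usual global-memory workflow is given below: allocate with cudaMalloc(), copy data in, run a kernel, copy the result back and free the memory. The names add_one and add_one_on_gpu are hypothetical, and the block size of 256 threads is an arbitrary choice for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread increments one element stored in global memory.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical host helper: returns true if every CUDA call succeeded.
bool add_one_on_gpu(std::vector<int> &host) {
    if (host.empty()) return true;
    int n = static_cast<int>(host.size());
    size_t bytes = n * sizeof(int);
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;  // global memory
    cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements
    add_one<<<blocks, threads>>>(d_data, n);
    // This copy waits for the kernel to finish before reading the result.
    cudaError_t err = cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err == cudaSuccess;
}
```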
4. Shared Memory
On one SP, one or more threads can run. A collection of threads is called a block. On one SM, one or more blocks can run. Shared memory is shared by all the threads in one block. Shared memory is used to reduce latency (memory access delay). How? Shared memory is located on the SM itself (on-chip), whereas global memory sits in off-chip DRAM, so an access to shared memory is much faster than an access to global memory. It is also far smaller than global memory. Keep in mind that the shared memory of an SM is divided among the blocks running on that SM, and each block's portion is visible only to the threads of that block. To place a variable in shared memory, prefix its declaration with the keyword __shared__, e.g. __shared__ int x;
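The use of __shared__ can be sketched with a block-level sum: each thread loads one value from global memory into shared memory once, and all further additions work on the shared copy. The names block_sum, sum_first_n and BLOCK_SIZE are hypothetical, and the reduction below assumes the block size is a power of two.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // threads per block; must be a power of two here

// Each block adds up its inputs in shared memory and writes one partial sum.
__global__ void block_sum(const int *in, int *out, int n) {
    __shared__ int tile[BLOCK_SIZE];   // one copy per block, visible to all its threads
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0;   // single read from global memory per thread
    __syncthreads();                   // wait until the whole tile is loaded
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();               // finish this step before starting the next
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}

// Hypothetical host helper: sums 1..n with a single block (1 <= n <= BLOCK_SIZE).
int sum_first_n(int n) {
    if (n <= 0 || n > BLOCK_SIZE) return -1;
    int host[BLOCK_SIZE];
    for (int k = 0; k < n; ++k) host[k] = k + 1;
    int *d_in = nullptr, *d_out = nullptr, total = -1;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host, n * sizeof(int), cudaMemcpyHostToDevice);
    block_sum<<<1, BLOCK_SIZE>>>(d_in, d_out, n);
    cudaMemcpy(&total, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return total;
}
```

The __syncthreads() calls matter: without them, one thread could read a tile entry before another thread of the same block has written it.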
5. Constant Memory
Constant memory is also used to reduce latency, but it helps only when multiple threads have to access the same value. Here is how it reduces latency. Suppose there are 32 threads in one block, and all of them use the same variable. If the variable is in global memory, there can be up to 32 accesses to global memory. Now suppose we store the variable in constant memory. Then a single read fetches the value and broadcasts it to all the threads of a half-warp. The value is also saved in a cache, from which it is provided to the threads of the other half-warp. Hence there is just one memory access in total instead of 32. To place a variable in constant memory, prefix its declaration with the keyword __constant__ (e.g. __constant__ int x;).
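The case described above, where every thread reads the same value, might look like the sketch below. The names scale, scale_all and scale_on_gpu are hypothetical. A __constant__ variable is read-only inside kernels, so the host sets it with cudaMemcpyToSymbol() before the launch.

```cuda
#include <cuda_runtime.h>

__constant__ int scale;  // lives in constant memory; kernels can only read it

// All threads read the same address (`scale`), which suits the constant cache.
__global__ void scale_all(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= scale;
}

// Hypothetical host helper: multiplies 32 copies of `value` by `factor`
// and returns the last element.
int scale_on_gpu(int value, int factor) {
    const int n = 32;
    int host[n];
    for (int k = 0; k < n; ++k) host[k] = value;
    cudaMemcpyToSymbol(scale, &factor, sizeof(int));  // host writes the constant
    int *d_data = nullptr;
    cudaMalloc(&d_data, sizeof(host));
    cudaMemcpy(d_data, host, sizeof(host), cudaMemcpyHostToDevice);
    scale_all<<<1, n>>>(d_data, n);
    cudaMemcpy(host, d_data, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return host[n - 1];
}
```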
6. Texture Memory
Texture memory is again used to reduce latency, in a special case. Consider an image. When we access a particular pixel, there is a good chance that we will also access the surrounding pixels. Texture memory is read through a cache optimized for this kind of spatial locality, so values that lie close together in the image, and are therefore likely to be accessed together, can be fetched quickly.
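The neighbouring-pixel access pattern can be sketched with the texture object API, which needs a GPU of compute capability 3.0 or newer. The names neighbour_sum and texture_demo are hypothetical, as are the 4x4 test image and the choice of clamped, unfiltered reads. Each thread adds up the four neighbours of its own pixel.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Each thread sums the left, right, upper and lower neighbours of pixel (x, y).
__global__ void neighbour_sum(cudaTextureObject_t tex, float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float fx = x + 0.5f, fy = y + 0.5f;  // +0.5f addresses the centre of a texel
    out[y * w + x] = tex2D<float>(tex, fx - 1, fy) + tex2D<float>(tex, fx + 1, fy)
                   + tex2D<float>(tex, fx, fy - 1) + tex2D<float>(tex, fx, fy + 1);
}

// Hypothetical host helper: builds a 4x4 image with img[k] = k and
// returns the neighbour sum computed for pixel (1, 1).
float texture_demo() {
    const int w = 4, h = 4;
    float img[w * h], result[w * h];
    for (int k = 0; k < w * h; ++k) img[k] = static_cast<float>(k);

    // Texture data is stored in a CUDA array.
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &fmt, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, img, w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res;
    std::memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td;
    std::memset(&td, 0, sizeof(td));
    td.addressMode[0] = cudaAddressModeClamp;  // out-of-range reads use the edge texel
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;       // no interpolation
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;                   // coordinates are in texels

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(result));
    neighbour_sum<<<dim3(1, 1), dim3(w, h)>>>(tex, d_out, w, h);
    cudaMemcpy(result, d_out, sizeof(result), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(d_out);
    return result[1 * w + 1];
}
```

For pixel (1, 1) the neighbours hold 4, 6, 1 and 9, so texture_demo() should return 20.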