CUDA Study Notes

Date: 2019-11-19

Tags: cuda, study notes


1. __global__: Alerts the compiler that a function should be compiled to run on a device instead of the host.

2. The trick is actually in calling the device code from the host code, which is handled by the CUDA compiler and runtime.

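3. The hardware limits the number of blocks in a single launch to 65,535 per grid dimension (on the older devices these notes were written for; the actual limits are reported in the maxGridSize field of the device properties structure). Similarly, the hardware limits the number of threads per block with which we can launch a kernel to the value in the maxThreadsPerBlock field of the device properties structure.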

4. The CUDA C compiler treats variables in shared memory differently than typical variables. It creates a copy of the variable for each block that you launch on the GPU. Every thread in that block shares the memory, but threads cannot see or modify the copy of this variable that is seen within other blocks.

5. Without synchronization, we have created a race condition where the correctness of the execution results depends on the nondeterministic details of the hardware.

6. A case where __syncthreads() waits forever. The key is to realize that the CUDA architecture guarantees that no thread will advance to an instruction beyond the __syncthreads() until every thread in the block has executed __syncthreads(). For example:

    if (cacheIndex < i) {
        cache[cacheIndex] += cache[cacheIndex + i];
        // Problem: only threads with cacheIndex < i ever reach this barrier.
        __syncthreads();
    }

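Written this way, the threads that fail the condition never reach __syncthreads(), yet the barrier keeps waiting for every thread in the block (including the ones that failed the condition) to reach it, so the kernel waits forever. Moving the __syncthreads() call outside the if block, so that every thread in the block executes it, avoids the hang.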

7. Reading from constant memory can conserve memory bandwidth when compared to reading the same data from global memory.

8. The trickiest part of using events arises as a consequence of the fact that some of the calls we make in CUDA C are actually asynchronous. For example, when we launch the kernel in our ray tracer, the GPU begins executing our code, but the CPU continues executing the next line of our program before the GPU finishes.

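9. Why does a kernel function have no return value? The kernel only actually starts running on the GPU some time after the host submits it. This asynchronous launch mechanism means a kernel cannot return a value to its caller; results have to be written to device memory instead.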

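10. Three basic rules of CUDA programming:

1) Get the data onto the GPGPU and keep it there.

2) Give the GPGPU enough work to do.

3) Focus on data reuse within the GPGPU to avoid being limited by memory bandwidth.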

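11. GPGPUs usable for CUDA computing contain two broad classes of memory: on-chip and off-chip. The on-chip memory of a streaming multiprocessor is the fastest and most scalable, which makes it the most sought after. Its capacity is limited, typically only in the KB range. The on-board global memory is a shared memory system that every streaming multiprocessor on the GPU can access. Its capacity can reach the GB range; it is the largest and most widely used memory on the GPU, but also the slowest.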

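12. Declaring a variable volatile says that its value may change unexpectedly, so the compiler makes no assumptions about what it currently holds. More precisely, every time the variable is used the optimizer must re-read its value from memory rather than reuse a copy kept in a register.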
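
As a rough illustration of the point above, here is a minimal sketch in plain C (host code, not a CUDA kernel) of a flag that changes behind the compiler's back: a signal handler writes it while a loop polls it. The names got_signal, on_signal and wait_for_flag are made up for this example.

```c
#include <signal.h>

/* Written by the handler, read by the loop below. volatile forces the
   compiler to reload it from memory on every access. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo) {
    (void)signo;
    got_signal = 1;
}

/* Hypothetical helper: installs the handler, sends SIGINT to this process,
   then polls the flag. Returns 1 once the handler has run, -1 on failure. */
int wait_for_flag(void) {
    got_signal = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR) {
        return -1;
    }
    if (raise(SIGINT) != 0) {
        return -1;
    }
    while (!got_signal) {
        /* Without volatile, an optimizer could hoist the read out of the loop. */
    }
    signal(SIGINT, SIG_DFL);  /* restore the default action */
    return 1;
}
```

Because raise() runs the handler before it returns, the loop exits on its first check here; the volatile qualifier matters when the write arrives at an unpredictable time.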

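13. On the streaming multiprocessors of the GF100 chip, a register spill places the excess local data in the L1 cache. Because the L1 cache has high bandwidth, the program can still perform well, which highlights how important the L1 cache is. Keep in mind, though, that register spills and the stack (which can consume up to 1 KB of L1 cache) take up L1 space, so other cached data is evicted more easily; this lowers the cache hit rate and hurts program performance.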

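14. A C compiler allows main() to take no parameters or two parameters (some implementations allow more, but that is an extension to the standard). With two parameters, the first is the number of strings on the command line. By convention this int parameter is called argc (for argument count); the system uses whitespace to decide where one string ends and the next begins. The second parameter is an array of pointers to strings. Each string on the command line is stored in memory, and a pointer is assigned to point to it. By convention this array of pointers is called argv (for argument values). Where possible (some operating systems do not allow it), the name of the program itself is assigned to argv[0], the first string after it to argv[1], and so on.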

int main(int argc, char **argv): this declaration of argv is equivalent to char *argv[]. It means that argv is a pointer to a "pointer to char".
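
To make the argc/argv description concrete, the sketch below walks the argument array. list_args is a hypothetical helper written for this note; it is declared with char *argv[] and then read through a char ** to show that the two spellings name the same type.

```c
#include <stdio.h>

/* Hypothetical helper: prints every command-line string and returns how many
   of them follow the program name. */
int list_args(int argc, char *argv[]) {
    char **cursor = argv;  /* legal because argv is really a char ** */
    for (int i = 0; i < argc; i++) {
        printf("argv[%d] = %s\n", i, cursor[i]);
    }
    return argc > 0 ? argc - 1 : 0;
}
```

Called as list_args(argc, argv) from main, a run with two arguments after the program name prints three lines and returns 2.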

15. Unlike an array, the name of a struct is not the address of that struct; you must use the & operator to get it.
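
The difference shows up when passing each kind of object to a function that takes a pointer. The following is a small sketch; struct point, move_right, sum and demo are names invented for the illustration.

```c
#include <stddef.h>

struct point {
    int x;
    int y;
};

/* Modifies the caller's struct through a pointer. */
void move_right(struct point *p, int dx) {
    p->x += dx;
}

/* Sums n ints; the array argument arrives as a pointer to its first element. */
int sum(const int *values, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    return total;
}

int demo(void) {
    int values[3] = {1, 2, 3};
    struct point p = {10, 20};
    int total = sum(values, 3);  /* array name converts to &values[0] */
    move_right(&p, total);       /* struct name does not: & is required */
    return p.x;                  /* 10 + (1 + 2 + 3) */
}
```

Writing move_right(p, total) instead would be rejected by the compiler, because p is a struct point value, not a struct point pointer.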

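16. A short summary of compiling and linking: at compile time, the compiler checks only the program's syntax and whether functions and variables have been declared. If a function has not been declared, the compiler issues a warning but can still generate an object file (this describes older C; C99 and later, and C++, treat a call to an undeclared function as an error). At link time, the linker searches all the object files for the function's implementation and reports a linker error if it cannot find one.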