OpenCL C

时间 2019-11-16

标签 opencl 栏目 C&C++ 繁體版

原文原文链接

OpenCL C

OpenCL

简介

opencl C是ISO C99的一个扩展，主要区别以下：html

去除了C99的一些特性，如：标准C99头文件，函数指针，递归，变长数组，和位域
增长了一些特性用于并行计算，如：工做项和工做组，向量类型，同步，地址空间限定符(Address space qualifiers)

内建类型

标量数据类型ios

char , uchar, short, ushort, int, uint, long, ulong, float
bool, intptr_t, ptrdiff_t, size_t, uintptr_t, void, half (storage)
图像类型
image2d_t, image3d_t, sampler_t
向量数据类型
Vector lengths 2, 4, 8, & 16 (char2, ushort4, int8, float16, double2, …)

向量操做

向量的n能够选择大小为2, 3, 4, 8, and 16，能够直接使用向量字面值，例如：
(float4)( float, float, float, float )
(float4)( float2, float, float )
(float4)( float, float2, float )
(float4)( float, float, float2 )
(float4)( float2, float2 )
(float4)( float3, float )
(float4)( float, float3 )
(float4)( float )只一个值则赋给全组算法

向量下标：
能够用xyzw表示0123进行索引，如s.xy将索引0, 1位置的值。数组

使用.odd, .even索引偶数，奇数位置值，下标是从0开始。
int8 v = (int8)(1, 2, 3, 4, 5, 6, 7, 8);
int4 v1 = v.odd; // 奇数索引位置值{2, 4, 6, 8}缓存

使用.lo表示向量前半部分，.hi表示向量的后半部分。
int8 v = (int8)(1, 2, 3, 4, 5, 6, 7, 8);
int4 v1 = v.lo; // {1, 2, 3, 4}
int4 v2 = v.hi; // {5, 6, 7, 8}
对于3个元素的向量，v.hi, v.odd的第二个元素为未定义。
int3 v = (int3)(1, 2, 3);
int2 v1 = v.lo; // {1, 2}
int2 v2 = v.hi; // {3, undefined}并发

在作赋值时，必须保证两边向量的元素个数相同：
float4 v = (float4)(1);
v.odd = (float2)(3, 3); //左边是2个元素，右边必需要是float2less

关系运算符

关系运算符返回值：
标量：specified relation is false返回0, true返回1
向量：specified relation is false返回0, true返回-1dom

NaN的状况：异步

The equality operator equal(==) returns 0 if one or both arguments are not a number (NaN).
The equality operator not equal (!=) returns 1 (for scalar source operands) or -1 (for vector source operands) if one or botharguments are not a number (NaN)

bitwise operator

bitwise operators and (&), or (|), exclusive or (^), and not (~)

类型转换

destType convert_destType<_sat><_roundingMode> (sourceType)
destTypen convert_destTypen<_sat><_roundingMode> (sourceTypen)

Modifier Rounding Mode Description
_rte Round to nearest even
_rtz Round toward zero
_rtp Round toward positive infinity
_rtn Round toward negative infinity

整型默认 _rtz，float是_rte;

标量支持显示转换，也能够用convert_type函数。

char n = 3;
int m = (int)n;
或
int m = convert_int(n);

向量转换，不支持显示转换，必须使用convert_type函数进行转换。

float4 v1 = (float4)(1.0 1.0 1.0 1.0);
int4 v2 = convert_int4(v1);

as_type不改变元素bit位，从新使用新的类型解析，注意不一样平台字节序（Endianness）可能不一致，不具备可移植性：

float4 v1 = (float4)(1.0 1.0 1.0 1.0);
int4 v3 = as_int4(v1); //(int4)(0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000)，不是1

内存操做

返回(p + offset * n)处的值：

gentypen vloadn(size_t offset, const __global gentype *p)

将data写到(p + offset *n)位置：

void vstoren (gentypen data, size_t offset, __global gentype *p)

判断地址类型：

bool is_global (const void *ptr)
bool is_local (const void *ptr)
bool is_private (const void *ptr)
cl_mem_fence_flags get_fence (const void *ptr)：返回地址对应的cl_mem_fence_flags

从global memory 到 local memory，或local memory 到 global memory 的异步拷贝，可使用DMA实现，快速。
参数event是须要等待的事件
返回一个event，能够给wait_group_events使用。

event_t async_work_group_copy(local gentype
*dst, const global gentype *src, size_t
num_gentypes, event_t event);

event_t async_work_group_strided_copy(__local gentype *dst, const __global gentype *src, size_t num_gentypes, size_t src_stride, event_t event);

将全局内存num_gentypes * sizeof(gentype)字节缓存到global cache中。

void prefetch(const _global gentype *_p, size_t num_gentypes)

同步

work_group_barrier之前的叫barrier函数，新标准仍然兼容barrier函数。一个工做组里的全部线程必须都执行到这个函数，才能继续往下执行。

void work_group_barrier (cl_mem_fence_flags flags)

cl_mem_fence_flags：
CLK_LOCAL_MEM_FENCE local内存操做对全部同组item可见
CLK_GLOBAL_MEM_FENCE global内存操做对同组可见

不论是CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE，都只能对相同的work-group里的item进行同步，没法同步全局item的内存操做。

若是真的须要进行全局全部item同步，那么最好将同步先后拆分红两个kernel，在host端调用时进行同步。

原子操做

使用原子操做作同步开销是至关大的，可是相对于使用更原始的阻塞当前线程执行的同步方式而言又是比较高效的。所以，当对某些特定数据作同步更新时，不须要使用栅栏（fence）等这种更低效的同步处理机制，咱们能够直接对那些存储地址采用原子操做。

在一个原子事务中执行。读取 p 指向位置的内容（用做返回值），将 p 指向位置的内容加上 val 后再存入该位置。

int atomic_add (volatile __global int *p, int val)

原子加 1 操做。读取 p 指向位置的内容（用做返回值），将 p 指向位置的内容加上常量值 1 后再存入该位置。原子减 1 操做 atomic_dec 和加 1 操做相似。

int atomic_inc(volatile __global int *p)

pipe

pipe能够用于在不一样kernel程序间传递数据。多个kernel程序(甚至是硬件许可)对同一pipe的同时访问结果都是不肯定的。主机端没法访问pipe。

OpenCL2.0新增了一个主机API函数来建立pipe，再经过设置参数将pipe传递给不一样的kernel使用：

cl_mem clCreatePipe ( cl_context context, cl_mem_flags flags, cl_uint pipe_packet_size, cl_uint pipe_max_packets,  
const cl_pipe_properties * properties,  cl_int *errcode_ret)

一个kernel进行写入：

//reserve space in pipe for writing random numbers.  
reserve_id_t rid = work_group_reserve_write_pipe(rng_pipe, szgr);  
write_pipe(rng_pipe,rid,lid, &gfrn);  
work_group_commit_write_pipe(rng_pipe, rid);

一个kernel进行读取：

//reserve pipe for reading  
reserve_id_t rid = work_group_reserve_read_pipe(rng_pipe, szgr);  
if(is_valid_reserve_id(rid)) {  
  //read random number from the pipe.  
  read_pipe(rng_pipe,rid,lid, &rn);  
  work_group_commit_read_pipe(rng_pipe, rid);  
  }

打印

printf常规：
%d
%x
%f
%s

打印向量vn, n取2, 3, 4, 8, 16:

int4 value = (int4)(1, 2, 3, 4);
printf("%v4d\n", value);

描述符

加下划线不加下划线均可以。

函数描述符:
__kernel and kernel

内存位置描述符：
__global, global,
__local, local,
__constant, constant,
__private and private

访问权限描述符：
__read_only, read_only,
__write_only, write_only,
__read_write and read_write

work item函数

get_local_id: 返回当前thread在group中的位置
get_group_id: 返回当前group的位置
get_global_id: 返回当前thread在全局thread中的位置

get_local_size返回一个work-group的大小
get_global_size返回全局work-item的个数，NDRange中的global_work_size

整体上有：
get_global_id = get_group_id * get_local_size + get_local_id

wave

wave是线程调度的基本单位，相似cuda里的warp(32), AMD的实现中，wave大小被定义为64。

访存合并

对于全局内存，一次访问，须要几百个cycles，咱们但愿进行访存合并，减小内存访问次数。

不必定要全部thread要进行数据读取，但要保证以下两点才能进行合并访问：

Aligned Memory access 对齐
Coalesced Memory access 连续

当要获取的Memory首地址是cache line的倍数时，就是Aligned Memory Access，若是是非对齐的，就会致使浪费带宽。至于Coalesced Memory Access则是warp的32个thread请求的是连续的内存块。

L1为128 byte，一次最小读入128 byte大小。

如下二者方式均可以一次传输：

enter description here

下面落入两个128-byte，因此须要两次传输：

enter description here

下面落入更多的区域，因此须要更多的传输：

enter description here

Uncached Loads

这里就是指不走L1可是仍是要走L2，也就是cache line从128-byte变为32-byte了.
下图是理想的对齐且连续情形，全部的128 bytes都落在四块32 bytes的块中

enter description here

下图请求没有对齐，请求落在了160-byte范围内，bus有效使用率是百分之八十，相对使用L1，性能要好很多。

enter description here

下图是全部thread都请求同一块数据的情形，bus有效使用率为4bytes/32bytes=12.5%，依然要比L1表现好。

enter description here

下图是状况最糟糕的，数据很是分散，可是因为所请求的128 bytes落在了多个以32 bytes为单位的segment中，所以无效的数据传输要少的多。

enter description here

收集来自： http://www.javashuo.com/article/p-gvervabt-cw.html

bank conflict

如今的warp通常是32个thread，在local memory中，存在32个bank，每一个bank是4 bytes，性能高的也多是8 bytes。

以下，一个local memory被映射到不一样的bank中，在一个warp中若是thread 0访问bank0，thread31访问bank31，这样就没有conflict。

int lid = get_local_id(0);
int v = data[lid];

但若是是下面的访问方法, thread 0, 8, 16, 24都会访问bank0，这就是一个4 way conflict，致使性能降低为原来的1/4。

int lid = get_local_id(0);
int v = data[lid*4];

bank

对于局部内存，一个warp中若是多个thread访问到相同的bank的不一样位置，便会产生bank conflict，这样访问会顺序执行。

另外，若是全部thread都访问到一个bank，会产生广播，不会形成conflict，如你们都访问data[0]，只会是一次访问。

延时隐藏

若是warp中线程执行一条指令须要等待前面启动的长延时操做的结果（就是该warp须要从全局存储器中提取数值计算），那么就不选择该warp，而是选择另外一个不须要等待结果的驻留的warp（这个warp已经获得了本身须要的结果，因此已经无需等待了，能够直接执行了），当多个warp准备执行的时候，采用优先机制选择一个warp执行，这种机制不产生延时的线程先执行，这就是所谓的延时隐藏（latency hiding）。

同一个warp中的thread能够以任意顺序执行，active warps被sm资源限制。当一个warp空闲时，SM就能够调度驻留在该SM中另外一个可用warp。在并发的warp之间切换是没什么消耗的，由于硬件资源早就被分配到全部thread和block，因此该新调度的warp的状态已经存储在SM中了。不一样于CPU，CPU切换线程须要保存/读取线程上下文（register内容），这是很是耗时的，而GPU为每一个threads提供物理register，无需保存/读取上下文。

Occupancy

要保证较高的CU资源利用率，如何保证呢，就是在进行内存访问请求资源时，有足够多的算术计算占据这部分时间。

向量化

向量化容许一个线程同时执行多个操做。咱们能够在kernel代码中，使用向量数据类型，好比float4来得到加速。向量化在AMD的GPU上效果更为明显，这是由于AMD的显卡的stream core是(x,y,z,w)这样的向量运算单元。
下图是在简单的向量赋值运算中，使用float和float4的性能比较。

opencl优化方法

思路：

更好的算法思想，如对矩阵相乘进行分块
使用本地内存（Local Memory）
- 本地内存的延迟比全局内存低，但可能会存在隐性开销。例如，使用本地内存常常有一个本地内存屏障，这种屏障将致使同步延迟，抵销了低延迟带来的好处。
- 在您将多级算法合并至单一内核函数中时，本地内存对于存储中间数据是有好处的，能够节省 DDR 带宽，从而下降功耗。
- 若是您但愿在本地内存缓存数据，便于屡次访问，一个好的经验法则是保证缓存数据被访问3次以上才有必要这么作。
避免本地内存的bank conflict
优化全局内存的访存合并
对于work-group大小，最好是wave的整数倍，若是是非整数倍，有部分wave里是空置的；若是小于wave的话，也会有一部分线程空操做
kernel要简单些，复杂的话须要的寄存器数量会增多，而一个sm所拥有的寄存器个数是固定的（GTX 1080TI 个数为: 65536）
尽可能按行操做,须要按列操做时能够先对矩阵进行转置
循环展开，减小分支（分支是分步执行的，好比说一个if (tid % 2)这样的分支，先执行奇数线程，再执行偶数线程）
向量化操做，向量化容许一个线程同时执行多个操做。咱们能够在 kernel 代码中,使用向量数据类型,好比 float4 来得到加速。

图像

采样器对象描述了读取图像数据时如何对图像进行采样。图像读取函数 read_imageX 包含一个采样器参数，该参数能够在主机端经过调用 OpenCL API 函数建立，而后使用 clSetKernelArg 传递给内核；也能够在内核程序中声明，在内核程序中声明的采样器对象为 sampler_t 类型的常量。采样器对象包含了一些属性，这些属性描述了在读取图像对象的像素时如何采样。分别是规格化浮点坐标，寻址模式和过滤模式。

规格化坐标：指定传递的 x、y 和 z 坐标值是规格化浮点坐标仍是非规格化坐标值。能够是 CLK_NORMALIZED_COORDS_TRUE 或者 CLK_NORMALIZED_COORDS_FALSE 枚举类型的值；
寻址模式：指定图像的寻址模式。即，当传递的坐标值超过图像坐标区域时该如何处理。能够是下面的枚举类型的值：
- CLK_ADDRESS_MIRRORED_REPEAT：图像区域外的坐标设置为区域内坐标的反射值对应的颜色；
- CLK_ADDRESS_REPEAT：图像区域外的坐标重复区域内坐标的颜色，只对规格化坐标有效；
- CLK_ADDRESS_CLAMP_TO_EDGE：图像区域外的坐标返回图像边缘的颜色；
- CLK_ADDRESS_CLAMP：图像区域外坐标返回的颜色和边框颜色保持一致；
过滤模式：指定使用的过滤模式。能够是 CLK_FILTER_NEAREST 或 CLK_FILTER_LINEAR 枚举类型值，分别表示最近邻插值和双线性插值。

Sample

1. vector add

每一个thread执行一个元素：

vector add

2. image scale

image scale

3. reduction

__kernel void reduce(__global uint4* input, __global uint4* output, int NUM) {
    NUM = NUM / 4;    //每四个数为一个总体uint4。
    unsigned int tid = get_local_id(0);
    unsigned int localSize = get_local_size(0);
    unsigned int globalSize = get_global_size(0);

    uint4 res=(uint4){0,0,0,0};
    __local uint4 resArray[64];

    
    unsigned int i = get_global_id(0);
    while(i < NUM)
    {
        res+=input[i];
        i+=globalSize;
    }
    resArray[tid]=res;    //将每一个work-item计算结果保存到对应__local memory中
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize >> 1; s > 0; s >>= 1) 
    {
        if(tid < s) 
        {
            resArray[tid] += resArray[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) 
        output[get_group_id(0)] = resArray[0];
}

#include <CL/cl.h>
#include "tool.h"
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <fstream>
using namespace std;

int isVerify(int NUM,int groupNUM,int *res) //校验结果 {
       int sum1 = (NUM+1)*NUM/2;
    int sum2 = 0;
    for(int i = 0;i < groupNUM*4; i++)
        sum2 += res[i];
    if(sum1 == sum2)
        return 0;
    return -1;
}

void isStatusOK(cl_int status) //判断状态码 {
    if(status == CL_SUCCESS)
        cout<<"RIGHT"<<endl;
    else
        cout<<"ERROR"<<endl;
}

int main(int argc, char* argv[]) {
    cl_int    status;
    /**Step 1: Getting platforms and choose an available one(first).*/
    cl_platform_id platform;
    getPlatform(platform);

    /**Step 2:Query the platform and choose the first GPU device if has one.*/
    cl_device_id *devices=getCl_device_id(platform);

    /**Step 3: Create context.*/
    cl_context context = clCreateContext(NULL,1, devices,NULL,NULL,NULL);

    /**Step 4: Creating command queue associate with the context.*/
    cl_command_queue commandQueue = clCreateCommandQueue(context, devices[0], 0, NULL);

    /**Step 5: Create program object */
    const char *filename = "Own_Reduction_Kernels.cl";
    string sourceStr;
    status = convertToString(filename, sourceStr);
    const char *source = sourceStr.c_str();
    size_t sourceSize[] = {strlen(source)};
    cl_program program = clCreateProgramWithSource(context, 1, &source, sourceSize, NULL);

    /**Step 6: Build program. */
    status=clBuildProgram(program, 1,devices,NULL,NULL,NULL);

    /**Step 7: Initial input,output for the host and create memory objects for the kernel*/
    int NUM=25600;    //6400*4
    size_t global_work_size[1] = {640};  ///
    size_t local_work_size[1]={64};    ///256 PE
    size_t groupNUM=global_work_size[0]/local_work_size[0];
    int* input = new int[NUM];
    for(int i=0;i<NUM;i++)
        input[i]=i+1;
    int* output = new int[(global_work_size[0]/local_work_size[0])*4];

    cl_mem inputBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, (NUM) * sizeof(int),(void *) input, NULL);
    cl_mem outputBuffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY , groupNUM*4* sizeof(int), NULL, NULL);

    /**Step 8: Create kernel object */
    cl_kernel kernel = clCreateKernel(program,"reduce", NULL);

    /**Step 9: Sets Kernel arguments.*/
    status = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&inputBuffer);
    status = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&outputBuffer);
    status = clSetKernelArg(kernel, 2, sizeof(int), &NUM);

    /**Step 10: Running the kernel.*/
    cl_event enentPoint;
    status = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, &enentPoint);
    clWaitForEvents(1,&enentPoint); ///wait
    clReleaseEvent(enentPoint);
    isStatusOK(status);
            
    /**Step 11: Read the cout put back to host memory.*/
    status = clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0,groupNUM*4 * sizeof(int), output, 0, NULL, NULL);
    isStatusOK(status);
    if(isVerify(NUM, groupNUM ,output) == 0)
        cout<<"The result is right!!!"<<endl;
    else
        cout<<"The result is wrong!!!"<<endl;

    /**Step 12: Clean the resources.*/
    status = clReleaseKernel(kernel);//*Release kernel.
    status = clReleaseProgram(program);    //Release the program object.
    status = clReleaseMemObject(inputBuffer);//Release mem object.
    status = clReleaseMemObject(outputBuffer);
    status = clReleaseCommandQueue(commandQueue);//Release Command queue.
    status = clReleaseContext(context);//Release context.

    free(input);
    free(output);
    free(devices);
    return 0;
}

4. 矩阵转置：

不管采起那种映射方式，总有一个buffer是非合并访问方式：

矩阵转置

先用local memory缓存，再进行coalesced访问：

优化

优化后的性能有显著提高：

性能比较

5. 直方图

6. 矩阵相乘：

分块