Using the GPU (English original: http://deeplearning.net/software/theano/tutorial/using_gpu.html)
For an introductory discussion of Graphical Processing Units (GPUs) and their use for intensive parallel computation, see GPGPU.
One of Theano's design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility about how to carry them out. One way this flexibility is exercised is by carrying out the calculations on a graphics card.
There are currently two ways to use the GPU: one that only supports NVIDIA cards (the CUDA backend), and a second one, still under development, that should support any OpenCL device as well as NVIDIA cards (the GpuArray backend).
If you have not done so already, you will need to install Nvidia's GPU programming toolchain (CUDA) and configure Theano to use it. We provide installation instructions for Linux, MacOS and Windows, which walk through the process step by step.
To check whether your GPU is enabled, cut and paste the following program into a file and run it.
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
The program just computes exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.
If I run this program (saved as check1.py) with device=cpu, my computer takes a little over 3 seconds, whereas on the GPU it takes just over 0.64 seconds. Note that the GPU will not always produce exactly the same floating-point numbers as the CPU. As a benchmark, a loop that calls numpy.exp(x.get_value()) takes about 46 seconds.
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
Note that GPU operations in Theano currently only support computations whose floatX is float32.
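The runs above achieve this through the floatX=float32 flag together with config.floatX; equivalently, you can cast the data explicitly when building the shared variable (a small illustration, not part of the original tutorial):

x = shared(numpy.asarray(rng.rand(vlen), dtype='float32'))  # force float32 storage regardless of floatX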
In the example above, the speed-up is not very impressive. This is because the function returns its result as a NumPy ndarray, which, for convenience, has already been copied from the device back to the host. That is what makes it so easy to swap in device=gpu, but if you don't mind less portability, you can get a bigger speed-up by changing the graph to express its result in terms of GPU-stored data. The gpu_from_host op means "copy the input from the host to the GPU"; it is then optimized away after T.exp(x) is replaced by a GPU version of exp().
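A sketch of the modified script, check2.py, reconstructed from the original tutorial to be consistent with the output shown below (the only changes from check1.py are the gpu_from_host wrapper from the old sandbox.cuda backend and the extra numpy.asarray() print):

from theano import function, config, shared, sandbox
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
# keep the result on the GPU instead of copying it back to the host
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'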
The output is:
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu
Here we have shaved off about 50% of the run time simply by not copying the resulting array back to the host. The object returned by each function call is now not a NumPy array but a "CudaNdarray", which can be converted to a NumPy ndarray through the normal NumPy casting mechanism, e.g. numpy.asarray().
For even more speed, you can play with the borrow flag; see Borrowing when Constructing Function Objects.
The performance characteristics will change as we continue to optimize our implementations, and they vary from device to device, so the numbers above only give a rough idea of what to expect right now.
Since Theano 0.6 we make use of the GPU's asynchronous capabilities. This makes us faster, but it may also mean that some errors are raised later than the point at which they actually occurred, which makes it harder to profile Theano apply nodes. There is an NVIDIA driver feature that helps with this: if you set the environment variable CUDA_LAUNCH_BLOCKING=1, all kernel calls become synchronous. This lowers performance, but gives good profiling results and properly located error messages.
This feature interacts with Theano's garbage collection of intermediate results. To get the most out of it, you need to disable the garbage collector, because it inserts synchronization points into the graph. Set the Theano flag allow_gc=False to go even faster! Note that this will raise memory usage.
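For example, combining these flags with the earlier run (an illustrative command line, following the same pattern as the runs above):

$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32,allow_gc=False python check1.py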
To change the value of a shared variable, e.g. to provide new data to a process, use shared_variable.set_value(new_value). For more about this, see Understanding Memory Aliasing for Speed and Correctness.
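A minimal sketch (hypothetical shared variable, not part of the original tutorial):

import numpy
import theano

w = theano.shared(numpy.zeros(10, dtype=theano.config.floatX), name="w")
# ... compile and call functions that use w ...
w.set_value(numpy.ones(10, dtype=theano.config.floatX))  # upload new data without recompiling the function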
import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))  # Probability of having a one
prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)  # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum()  # The cost to optimize
gw, gb = T.grad(cost, [w, b])

# Compile expressions to functions
train = theano.function(
            inputs=[x, y],
            outputs=[prediction, xent],
            updates={w: w-0.01*gw, b: b-0.01*gb},
            name="train")
predict = theano.function(inputs=[x], outputs=prediction,
            name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print 'Used the cpu'
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print 'Used the gpu'
else:
    print 'ERROR, not able to tell if theano used the cpu or the gpu'
    print train.maker.fgraph.toposort()

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print "Final model:"
#print w.get_value(), b.get_value()

print "target values for D"
print D[1]

print "prediction on D"
print predict(D[0])
Modify and execute the example above so that it runs on the GPU with floatX=float32, and time the execution using time python file.py. (See: Configuration Settings and Compiling Mode.)
Is there an increase in speed from CPU to GPU?
Where does it come from? (Use ProfileMode)
What can be done to further increase the speed-up on the GPU?
Solution
#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Using the GPU'


# 1. Raw results


from __future__ import print_function
import numpy
import theano
import theano.tensor as tt

from theano import sandbox, Out

theano.config.floatX = 'float32'

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = theano.shared(D[0], name="x")
y = theano.shared(D[1], name="y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
       0.01 * (w ** 2).sum()  # The cost to optimize
gw, gb = tt.grad(cost, [w, b])

"""
# Compile expressions to functions
train = theano.function(
            inputs=[x, y],
            outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
                     Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[x],
            outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
            name="predict")
"""

# Compile expressions to functions
train = theano.function(
            inputs=[],
            outputs=[prediction, xent],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[], outputs=prediction,
            name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train()
#print "Final model:"
#print w.get_value(), b.get_value()

print("target values for D")
print(D[1])

print("prediction on D")
print(predict())

"""

# 2. Profiling


# 2.1 Profiling for CPU computations

# In your terminal, type:
$ THEANO_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py

# You'll see first the output of the script:
Used the cpu
target values for D
prediction on D

# Followed by the output of profiling.. You'll see profiling results for each function
# in the script, followed by a summary for all functions.
# We'll show here only the summary:

Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz

Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 10002 calls to Function.__call__: 1.590916e+00s
  Time in Function.fn.__call__: 1.492365e+00s (93.805%)
  Time in thunks: 1.408159e+00s (88.512%)
  Total compile time: 6.309664e+00s
    Number of Apply nodes: 25
    Theano Optimizer time: 4.848340e-01s
      Theano validate time: 5.454302e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  59.6%    59.6%    0.839s    4.19e-05s    C    20001    3    theano.tensor.blas_c.CGemv
  30.1%    89.7%    0.424s    4.71e-06s    C    90001   10    theano.tensor.elemwise.Elemwise
   5.5%    95.2%    0.078s    7.79e-02s   Py        1    1    theano.tensor.blas.Gemv
   1.9%    97.1%    0.026s    1.30e-06s    C    20001    3    theano.tensor.basic.Alloc
   1.3%    98.4%    0.018s    1.85e-06s    C    10000    1    theano.tensor.elemwise.Sum
   1.0%    99.4%    0.014s    4.78e-07s    C    30001    4    theano.tensor.elemwise.DimShuffle
   0.6%   100.0%    0.008s    4.23e-07s    C    20001    3    theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  59.6%    59.6%    0.839s    4.19e-05s    C    20001    3    CGemv{inplace}
  15.8%    75.4%    0.223s    2.23e-05s    C    10000    1    Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
   7.7%    83.1%    0.109s    1.09e-05s    C    10000    1    Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
   5.5%    88.7%    0.078s    7.79e-02s   Py        1    1    Gemv{no_inplace}
   4.3%    92.9%    0.060s    6.00e-06s    C    10000    1    Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
   1.9%    94.8%    0.026s    1.30e-06s    C    20001    3    Alloc
   1.3%    96.1%    0.018s    1.85e-06s    C    10000    1    Sum{acc_dtype=float64}
   0.7%    96.8%    0.009s    4.73e-07s    C    20001    3    InplaceDimShuffle{x}
   0.6%    97.4%    0.009s    8.52e-07s    C    10000    1    Elemwise{sub,no_inplace}
   0.6%    98.0%    0.008s    4.23e-07s    C    20001    3    Shape_i{0}
   0.5%    98.5%    0.007s    7.06e-07s    C    10000    1    Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   0.5%    98.9%    0.007s    6.57e-07s    C    10000    1    Elemwise{neg,no_inplace}
   0.3%    99.3%    0.005s    4.88e-07s    C    10000    1    InplaceDimShuffle{1,0}
   0.3%    99.5%    0.004s    3.78e-07s    C    10000    1    Elemwise{inv,no_inplace}
   0.2%    99.8%    0.003s    3.44e-07s    C    10000    1    Elemwise{Cast{float32}}
   0.2%   100.0%    0.003s    3.01e-07s    C    10000    1    Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   0.0%   100.0%    0.000s    8.11e-06s    C        1    1    Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
   ... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  31.6%    31.6%    0.445s    4.45e-05s    10000    7    CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
  27.9%    59.6%    0.393s    3.93e-05s    10000   17    CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
  15.8%    75.4%    0.223s    2.23e-05s    10000   14    Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
   7.7%    83.1%    0.109s    1.09e-05s    10000   15    Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
   5.5%    88.7%    0.078s    7.79e-02s        1    0    Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
   4.3%    92.9%    0.060s    6.00e-06s    10000   13    Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
   1.3%    94.2%    0.018s    1.85e-06s    10000   16    Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
   1.0%    95.2%    0.013s    1.34e-06s    10000    5    Alloc(TensorConstant{0.0}, Shape_i{0}.0)
   0.9%    96.1%    0.013s    1.27e-06s    10000   12    Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
   0.6%    96.7%    0.009s    8.52e-07s    10000    4    Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
   0.5%    97.2%    0.007s    7.06e-07s    10000    9    Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
   0.5%    97.6%    0.007s    6.57e-07s    10000   11    Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   0.4%    98.1%    0.006s    6.27e-07s    10000    0    InplaceDimShuffle{x}(b)
   0.4%    98.5%    0.006s    5.90e-07s    10000    1    Shape_i{0}(x)
   0.3%    98.9%    0.005s    4.88e-07s    10000    2    InplaceDimShuffle{1,0}(x)
   0.3%    99.1%    0.004s    3.78e-07s    10000   10    Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
   0.2%    99.4%    0.003s    3.44e-07s    10000    8    Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
   0.2%    99.6%    0.003s    3.19e-07s    10000    6    InplaceDimShuffle{x}(Shape_i{0}.0)
   0.2%    99.8%    0.003s    3.01e-07s    10000   18    Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
   0.2%   100.0%    0.003s    2.56e-07s    10000    3    Shape_i{0}(y)
   ... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)



# 2.2 Profiling for GPU computations

# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py

# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D

Results were produced using a GeForce GTX TITAN

# Profiling summary for all functions:

Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 10002 calls to Function.__call__: 3.535239e+00s
  Time in Function.fn.__call__: 3.420863e+00s (96.765%)
  Time in thunks: 2.865905e+00s (81.067%)
  Total compile time: 4.728150e-01s
    Number of Apply nodes: 36
    Theano Optimizer time: 4.283385e-01s
      Theano validate time: 7.687330e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  45.7%    45.7%    1.308s    1.64e-05s    C    80001    9    theano.sandbox.cuda.basic_ops.GpuElemwise
  17.2%    62.8%    0.492s    2.46e-05s    C    20002    4    theano.sandbox.cuda.blas.GpuGemv
  15.1%    77.9%    0.433s    2.17e-05s    C    20001    3    theano.sandbox.cuda.basic_ops.GpuAlloc
   8.2%    86.1%    0.234s    1.17e-05s    C    20002    4    theano.sandbox.cuda.basic_ops.HostFromGpu
   7.2%    93.3%    0.207s    2.07e-05s    C    10000    1    theano.sandbox.cuda.basic_ops.GpuCAReduce
   4.4%    97.7%    0.127s    1.27e-05s    C    10003    4    theano.sandbox.cuda.basic_ops.GpuFromHost
   0.9%    98.6%    0.025s    8.23e-07s    C    30001    4    theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.7%    99.3%    0.020s    9.88e-07s    C    20001    3    theano.tensor.elemwise.Elemwise
   0.5%    99.8%    0.014s    7.18e-07s    C    20001    3    theano.compile.ops.Shape_i
   0.2%   100.0%    0.006s    5.78e-07s    C    10000    1    theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  17.2%    17.2%    0.492s    2.46e-05s    C    20001    3    GpuGemv{inplace}
   8.2%    25.3%    0.234s    1.17e-05s    C    20002    4    HostFromGpu
   8.0%    33.3%    0.228s    2.28e-05s    C    10001    2    GpuAlloc{memset_0=True}
   7.4%    40.7%    0.211s    2.11e-05s    C    10000    1    GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
   7.2%    47.9%    0.207s    2.07e-05s    C    10000    1    GpuCAReduce{add}{1}
   7.1%    55.0%    0.205s    2.05e-05s    C    10000    1    GpuAlloc
   6.9%    62.0%    0.198s    1.98e-05s    C    10000    1    GpuElemwise{sub,no_inplace}
   6.9%    68.9%    0.198s    1.98e-05s    C    10000    1    GpuElemwise{inv,no_inplace}
   6.2%    75.1%    0.178s    1.78e-05s    C    10000    1    GpuElemwise{neg,no_inplace}
   5.6%    80.6%    0.159s    1.59e-05s    C    10000    1    GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
   4.4%    85.1%    0.127s    1.27e-05s    C    10003    4    GpuFromHost
   4.3%    89.4%    0.124s    1.24e-05s    C    10000    1    GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   4.2%    93.6%    0.121s    1.21e-05s    C    10000    1    GpuElemwise{ScalarSigmoid}[(0, 0)]
   4.2%    97.7%    0.119s    1.19e-05s    C    10000    1    GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   0.5%    98.2%    0.014s    7.18e-07s    C    20001    3    Shape_i{0}
   0.5%    98.7%    0.013s    1.33e-06s    C    10001    2    Elemwise{gt,no_inplace}
   0.3%    99.0%    0.010s    9.81e-07s    C    10000    1    GpuDimShuffle{1,0}
   0.3%    99.3%    0.008s    7.90e-07s    C    10000    1    GpuDimShuffle{0}
   0.2%    99.6%    0.007s    6.97e-07s    C    10001    2    GpuDimShuffle{x}
   0.2%    99.8%    0.006s    6.50e-07s    C    10000    1    Elemwise{Cast{float32}}
   ... (remaining 3 Ops account for 0.20%(0.01s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   8.8%     8.8%    0.251s    2.51e-05s    10000   22    GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
   8.4%    17.2%    0.241s    2.41e-05s    10000    7    GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
   8.0%    25.1%    0.228s    2.28e-05s    10000    5    GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
   7.4%    32.5%    0.211s    2.11e-05s    10000   13    GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
   7.2%    39.7%    0.207s    2.07e-05s    10000   21    GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
   7.1%    46.9%    0.205s    2.05e-05s    10000   17    GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
   6.9%    53.8%    0.198s    1.98e-05s    10000    4    GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
   6.9%    60.7%    0.198s    1.98e-05s    10000   12    GpuElemwise{inv,no_inplace}(GpuFromHost.0)
   6.2%    66.9%    0.178s    1.78e-05s    10000   11    GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   5.6%    72.5%    0.159s    1.59e-05s    10000   19    GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
   4.8%    77.3%    0.138s    1.38e-05s    10000   18    HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
   4.4%    81.7%    0.126s    1.26e-05s    10000   10    GpuFromHost(Elemwise{Cast{float32}}.0)
   4.3%    86.0%    0.124s    1.24e-05s    10000    9    GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
   4.2%    90.2%    0.121s    1.21e-05s    10000   15    GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
   4.2%    94.4%    0.119s    1.19e-05s    10000   23    GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
   3.4%    97.7%    0.096s    9.61e-06s    10000   16    HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
   0.5%    98.2%    0.013s    1.33e-06s    10000   20    Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
   0.3%    98.5%    0.010s    9.81e-07s    10000    2    GpuDimShuffle{1,0}(x)
   0.3%    98.8%    0.008s    8.27e-07s    10000    1    Shape_i{0}(x)
   0.3%    99.1%    0.008s    7.90e-07s    10000   14    GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
   ... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)


# 3. Conclusions

Examine and compare 'Ops' summaries for CPU and GPU. Usually GPU ops 'GpuFromHost' and 'HostFromGpu' by themselves
consume a large amount of extra time, but by making as few as possible data transfers between GPU and CPU, you can minimize their overhead.
Notice that each of the GPU ops consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.

"""
If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit. See the installation instructions for libgpuarray.
While all types of devices are supported if you are using OpenCL, for the rest of this section we will use the term GPU to refer to whatever computing device you are using.
Warning: we want to fully support OpenCL, but as of May 2014 this support is still a wish rather than a reality. Some useful ops are still not supported, because the existing ones were ported from the old backend with a minimum of changes.
To see whether your GPU is being used, cut and paste the following program into a file and run it:
from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
The program just computes exp() of a bunch of random numbers. Note that we use the theano.shared() function to make sure that the input x is stored on the GPU.
$ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the cpu

$ THEANO_FLAGS=device=cuda0 python check1.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu
By default, functions that execute on the GPU still return a standard NumPy ndarray. A transfer operation is inserted just before the results are returned, copying the data back to the CPU to ensure compatibility with CPU code. This lets you change the device on which the code runs by changing only the device flag, without touching the source code.
If you don't mind a loss of flexibility, you can ask Theano to return the GPU object directly. The following code does just that:
from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
Here the call to theano.sandbox.gpuarray.basic_ops.gpu_from_host() means "copy the input to the GPU". However, it is removed during the optimization phase (i.e. the copy is skipped), because the result is already on the GPU. It is used here only to tell Theano that we want the result on the GPU.
The output is:
$ THEANO_FLAGS=device=cuda0 python check2.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.455810785294 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu
While the time per call appears to be only a little lower than in the previous two runs (and it is indeed lower, since the transfer of r is avoided), the very large speed-up here is partly due to the asynchronous nature of execution on the GPU: the work has only been "launched", not necessarily completed, when the function returns.
The object returned is a GpuArray from pygpu. It behaves almost like a NumPy ndarray, with some exceptions because its data lives on the GPU. You can copy it to the host and convert it to a regular ndarray by using the usual NumPy casting, e.g. numpy.asarray().
For even more speed, you can play with the borrow flag; see Borrowing when Constructing Function Objects.
Of course, the performance characteristics are not quite the same from device to device, and they will also change as we improve our implementation.
The backend supports all regular Theano data types (float32, float64, int, ...), but GPU support varies, and some units can't deal with double (float64) or small (less than 32 bits, e.g. int16) data types. You will get an error at compile time or run time if that is the case.
Complex support is untested and most likely completely broken.
In general, large operations, such as matrix multiplication or element-wise operations with large inputs, will be significantly faster.
By default, all operations on the GPU are run asynchronously; the underlying libgpuarray makes this transparent.
A synchronization point is introduced when memory is transferred between the device and the host. Another synchronization point is introduced when an active memory buffer is freed on the GPU (an active buffer is a buffer that is still being used by a kernel).
You can force synchronization of a particular GpuArray by calling its sync() method. This is useful to get accurate timings when benchmarking.
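A minimal benchmarking sketch, assuming f and iters are defined as in check2.py above and that f returns a GpuArray (not part of the original tutorial):

t0 = time.time()
for i in xrange(iters):
    r = f()     # kernels are launched asynchronously
r.sync()        # wait for the GPU to finish before stopping the clock
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'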
These forced synchronization points interact with the garbage collection of intermediate results. To get the fastest speed, you should disable the garbage collector with the Theano flag allow_gc=False. Be aware, though, that this will raise memory usage.
Leaving aside Theano, which is a meta-programmer, the software for directly programming a GPU includes:
CUDA: the GPU programming API from NVIDIA, based on an extension of C (CUDA C)
OpenCL: the multi-vendor counterpart of CUDA
PyCUDA: Python bindings to the CUDA driver interface, which allow access to Nvidia's CUDA parallel computation API from Python
Convenience:
Makes it easy to do GPU meta-programming from within Python.
Abstractions to compile low-level CUDA code from Python (pycuda.driver.SourceModule).
A GPU memory buffer (pycuda.gpuarray.GPUArray); see the short sketch after this list.
Helpful documentation.
Completeness: binds all of CUDA's driver API.
Automatic error checking: all CUDA errors are automatically translated into Python exceptions.
Speed: PyCUDA's base layer is written in C++.
Good memory management of GPU objects:
Object cleanup is tied to the lifetime of the objects (RAII, 'Resource Acquisition Is Initialization').
This makes it much easier to write correct, leak-free and crash-free code.
PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory allocated in it is freed).
(See the PyCUDA documentation and Andreas Kloeckner's website on PyCUDA.)
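As a small illustration of the pycuda.gpuarray.GPUArray buffer mentioned above (a sketch based on PyCUDA's public API, not part of the original tutorial):

import pycuda.autoinit             # creates a CUDA context
import pycuda.gpuarray as gpuarray
import numpy

a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))  # copy host data to the GPU
a_doubled = (2 * a_gpu).get()                                            # element-wise work on the GPU, then copy back
print a_doubled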
PyOpenCL: PyCUDA for OpenCL
If you are already proficient in C, you can easily leverage your knowledge by first learning to program a GPU with the CUDA extension to C (CUDA C), and then using PyCUDA to access the CUDA API from Python.
The following resources will assist you in this learning process.
The examples below give a foretaste of programming a GPU with PyCUDA. Once you feel competent enough, you can try the corresponding exercises.
Example: PyCUDA
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400, 1, 1), grid=(1, 1))

assert numpy.allclose(dest, a*b)
print dest
Run the preceding example.
Modify and execute it to work with a matrix of shape (20, 10).
Example: Theano + PyCUDA
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda

class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<size){
        o0[i] = i0[i]*2;
    }
  }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk
Test it with this code:
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
Run the preceding example.
Modify and execute it to multiply two matrices: x * y.
Modify and execute it to return two outputs: x + y and x - y.
(Note that Theano's current elemwise fusion optimization only applies to computations involving a single output. Hence, to gain efficiency over the basic solution asked for here, the two operations would have to be explicitly optimized together in the code.)
Modify and execute it to support strided inputs (i.e. avoid requiring that the inputs be C-contiguous).
See Other Implementations to learn how to handle random numbers on the GPU.
References:
[1] Official tutorial: http://deeplearning.net/software/theano/tutorial/using_gpu.html
[2] Chinese translation (blog): https://www.cnblogs.com/shouhuxianjian/p/4590224.html