Paper 167: Using the GPU (Theano tutorial)

Date: 2019-12-11

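Using the GPU with Theano

English version: http://deeplearning.net/software/theano/tutorial/using_gpu.html (Using the GPU)

For an introductory discussion of GPUs and their use for intensive parallel computation, see GPGPU.

One of Theano's design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility in how it carries out those computations. One way this flexibility is used is to run computations on a graphics card.

There are currently two ways to use a GPU: one supports only NVIDIA cards (CUDA backend); the other, still in development, should support any OpenCL device as well as NVIDIA cards (GpuArray backend).

1. CUDA backend

If you have not done so already, you will need to install NVIDIA's GPU programming toolchain (CUDA) and configure Theano to use it. Installation instructions are provided for Linux, MacOS and Windows.

1.1 Testing Theano with GPU

To check whether your GPU is being used, cut and paste the following program into a file and run it.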

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print('Looping %d times took %s seconds' % (iters, t1 - t0))
print('Result is %s' % (r,))
# A plain Elemwise op left in the compiled graph means exp() ran on the CPU.
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

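The program just computes the exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.

If I run this program (saved as check1.py) with device=cpu, my computer takes a little over 3 seconds, whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce exactly the same floating-point numbers as the CPU. As a benchmark, a loop that calls numpy.exp(x.get_value()) takes about 46 seconds.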

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu

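Note that GPU operations in Theano currently require floatX to be float32.

1.2 Returning a handle to device-allocated data

The speedup in the preceding example is limited because the function returns its result as a NumPy ndarray, which has already been copied from the device to the host for your convenience. This is what makes it so easy to swap in device=gpu, but if you don't mind less portability, you can gain a bigger speedup by changing the graph to express a computation with a GPU-stored result. The gpu_from_host op means "copy the input from the host to the GPU", and it is optimized away after T.exp(x) is replaced by a GPU version of exp().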

# check2.py
from theano import function, config, shared, sandbox
import theano.sandbox.cuda.basic_ops
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), 'float32'))
# gpu_from_host asks for the result to stay on the GPU instead of being copied back.
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print('Looping %d times took %s seconds' % (iters, t1 - t0))
print('Result is %s' % (r,))
print('Numpy result is %s' % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

The output from this program is:

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
  1.62323296]
Used the gpu

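Here we have shaved off about 50% of the run-time simply by not copying the resulting array back to the host. The object returned by each function call is now not a NumPy array but a "CudaNdarray", which can be converted to a NumPy ndarray by the normal NumPy casting mechanism, for example numpy.asarray().

For even more speed you can play with the borrow flag. See Borrowing when Constructing Function Objects.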
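The timings reported so far can be compared with a few lines of arithmetic. The sketch below only reuses the figures printed in the runs above (plus the approximate 46 seconds quoted for the plain NumPy loop); the dictionary keys and helper names are made up for illustration.

```python
# Wall-clock times (seconds) copied from the runs reported above.
timings = {
    "numpy.exp loop": 46.0,  # approximate figure quoted in the text
    "theano cpu": 3.06635117531,
    "theano gpu, copy back": 0.638810873032,
    "theano gpu, stay on device": 0.34898686409,
}


def speedup(slow, fast):
    """How many times faster the `fast` run is than the `slow` run."""
    return timings[slow] / timings[fast]


def time_saved(before, after):
    """Fraction of run-time removed when going from `before` to `after`."""
    return 1.0 - timings[after] / timings[before]


cpu_vs_gpu = speedup("theano cpu", "theano gpu, copy back")
no_copy_saving = time_saved("theano gpu, copy back", "theano gpu, stay on device")

print("GPU vs CPU speedup: %.1fx" % cpu_vs_gpu)
print("Run-time saved by skipping the copy: %.0f%%" % (100 * no_copy_saving))
```

On these numbers the saving from skipping the copy is a little under one half, which is what the "about 50%" above refers to.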

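1.3 What can be accelerated on the GPU?

The performance characteristics will change as we continue to optimize our implementations, and they vary from device to device, but here is a rough idea of what to expect right now:

Only computations with the float32 data type can be accelerated. Better support for float64 is expected in upcoming hardware, but float64 computations are still relatively slow (as of January 2010).
Matrix multiplication, convolution, and large element-wise operations can be accelerated a lot (5-50x) when the arguments are large enough to keep 30 processors busy.
Indexing, dimension-shuffling and constant-time reshaping are equally fast on the GPU and on the CPU.
Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
Copying large quantities of data to and from a device is relatively slow, and often cancels most of the advantage of one or two accelerated functions on that data. Getting a performance gain from the GPU hinges on making the cost of transferring data to the device pay off.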
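The last point can be made concrete with a toy cost model. This is only a sketch under invented assumptions: the per-op time, the GPU speedup factor and the transfer cost below are made-up numbers, not measurements, and the function names are hypothetical.

```python
def gpu_pays_off(cpu_seconds_per_op, gpu_speedup, transfer_seconds, n_ops):
    """True if running n_ops on the GPU beats the CPU once the
    host<->device copy is included. All inputs are hypothetical."""
    cpu_total = n_ops * cpu_seconds_per_op
    gpu_total = transfer_seconds + n_ops * cpu_seconds_per_op / gpu_speedup
    return gpu_total < cpu_total


def break_even_ops(cpu_seconds_per_op, gpu_speedup, transfer_seconds, max_ops=1000):
    """Smallest number of accelerated ops for which the GPU wins, or None."""
    for n in range(1, max_ops + 1):
        if gpu_pays_off(cpu_seconds_per_op, gpu_speedup, transfer_seconds, n):
            return n
    return None


# Made-up numbers: each op takes 1 ms on the CPU, the GPU is 10x faster,
# and moving the data to and from the device costs 2.5 ms in total.
n = break_even_ops(cpu_seconds_per_op=1e-3, gpu_speedup=10.0, transfer_seconds=2.5e-3)
print("GPU starts to win at %s ops on the same data" % n)
```

With these assumed values, one or two accelerated ops are not enough to recover the transfer cost, which matches the rule of thumb above: keep the data on the device and do as much work on it as possible.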

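1.4 Tips for improving performance on GPU

Consider adding floatX=float32 to your .theanorc file.
Use the Theano flag allow_gc=False. See GPU async capabilities.
Prefer constructors like matrix, vector and scalar to dmatrix, dvector and dscalar, because the former give you float32 variables when floatX=float32.
Ensure that your output variables have a float32 dtype and not float64. The more float32 variables are in your graph, the more work the GPU can do for you.
Minimize transfers to the GPU device by using shared float32 variables to store frequently accessed data (see shared()). When using the GPU, float32 tensor shared variables are stored on the GPU by default, which eliminates transfer time for GPU ops using those variables. (In other words, the data is placed on the GPU when the variable is created and stays there, instead of being sent from the CPU on every call.)
If you aren't happy with the performance you see, try building your functions with mode='ProfileMode'. This prints some timing information at program termination. If an op or Apply is taking more time than its share, and you know something about GPU programming, have a look at how it is implemented in theano.sandbox.cuda. Check the line reporting time spent as Xs(X%) in CPU ops, Xs(X%) in GPU ops and Xs(X%) in transfer ops. This tells you whether not enough of your graph is on the GPU or whether there is too much memory transfer.
Use nvcc options. nvcc supports options that speed up some computations: -ftz=true flushes denormal values to zero, and --prec-div=false and --prec-sqrt=false speed up division and square root operations by being less precise. You can enable all of them at once with the Theano flag nvcc.flags=--use_fast_math, or enable them individually, as in nvcc.flags=-ftz=true --prec-div=false.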
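Several of these tips end up as entries in .theanorc. The sketch below renders such a file as text with Python's configparser. It assumes, as I understand Theano's configuration scheme, that top-level flags live in a [global] section and that a dotted flag such as nvcc.flags maps to the option flags in a section named [nvcc]; the helper name render_theanorc is hypothetical, and nothing is written to disk.

```python
import configparser
import io


def render_theanorc(float_x="float32", device="gpu", allow_gc=False, nvcc_flags=None):
    """Return the text of a .theanorc carrying the tips above (illustrative only)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the case of option names such as floatX
    parser["global"] = {
        "floatX": float_x,
        "device": device,
        "allow_gc": str(allow_gc),
    }
    if nvcc_flags:
        # Assumed mapping: the dotted flag nvcc.flags becomes [nvcc] flags = ...
        parser["nvcc"] = {"flags": nvcc_flags}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


text = render_theanorc(nvcc_flags="-ftz=true --prec-div=false")
print(text)
```

The same settings can be passed for a single run through the THEANO_FLAGS environment variable, as in the command lines shown earlier.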

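1.5 GPU async capabilities

Since Theano 0.6 we have been using the asynchronous capability of GPUs. This makes execution faster, but some errors may be raised later than the point where they actually occur, which can make it harder to profile Theano Apply nodes. An NVIDIA driver feature helps with these issues: if you set the environment variable CUDA_LAUNCH_BLOCKING=1, all kernel calls are automatically synchronized. This reduces performance but gives good profiling and appropriately placed error messages.

This feature interacts with Theano's garbage collection of intermediate results. To get the most out of the asynchronous execution, you need to disable the gc, because it inserts synchronization points in the graph. Set the Theano flag allow_gc=False to get even faster speed. Note that this raises memory usage.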
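When profiling from a launcher script, the two settings can be put into the environment of a child process. This is a minimal sketch: the helper name profiling_env is hypothetical, the flag values are the ones used in the profiling commands in the solution further down, and the actual launch is left as a comment because it needs a working Theano and GPU setup.

```python
import os


def profiling_env(base_env=None, device="gpu"):
    """Copy an environment and add the settings used for synchronous profiling.

    Note: any THEANO_FLAGS already present in the base environment is overwritten.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # make every kernel call synchronous
    env["THEANO_FLAGS"] = "profile=True,device=%s" % device
    return env


env = profiling_env(base_env={"PATH": "/usr/bin"})
print(env)

# To run a script with these settings (not executed here):
#   import subprocess
#   subprocess.call(["python", "check1.py"], env=profiling_env())
```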

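1.6 Changing the value of shared variables

To change the value of a shared variable, for example to provide new data to a process, use shared_variable.set_value(new_value). For more detail, see Understanding Memory Aliasing for Speed and Correctness.

Exercise: consider the logistic regression example again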

import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print("Initial model:")
#print(w.get_value(), b.get_value())

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
gw,gb = T.grad(cost, [w,b])

# Compile expressions to functions
train = theano.function(
            inputs=[x,y],
            outputs=[prediction, xent],
            updates={w:w-0.01*gw, b:b-0.01*gb},
            name = "train")
predict = theano.function(inputs=[x], outputs=prediction,
            name = "predict")

# Look for BLAS matrix-vector ops in the compiled graph to tell CPU from GPU.
if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train(D[0], D[1])
#print("Final model:")
#print(w.get_value(), b.get_value())

print("target values for D")
print(D[1])

print("prediction on D")
print(predict(D[0]))

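Modify and execute this example to run on the GPU with floatX=float32, and time it using the command line time python file.py. (Help: Configuration Settings and Compiling Mode.)

Is there an increase in speed from CPU to GPU?

Where does it come from? (Use ProfileMode)

What can be done to further increase the speed of the GPU version?

Note:

Only 32-bit floats are currently supported (other types are in development).
Shared variables with a float32 dtype are moved to the GPU memory space by default.
There is a limit of one GPU per process.
Use the Theano flag device=gpu to require use of the GPU device.
Use device=gpu{0, 1, ...} to specify which GPU to use if you have more than one.
Apply the Theano flag floatX=float32 (through theano.config.floatX) in your code.
Cast inputs before storing them into a shared variable.
Circumvent the automatic cast of int32 with float32 to float64:
- Insert a manual cast in your code or use [u]int{8,16}.
- Insert a manual cast around the mean operator (this involves a division by the length, which is an int64).
- Notice that a new casting mechanism is being developed.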
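The upcast described in the last items can be observed with plain NumPy arrays, whose promotion rules give the same results for these dtype combinations. The snippet is a sketch that uses NumPy only, not Theano; the variable names are arbitrary.

```python
import numpy

a_int32 = numpy.arange(4, dtype="int32")
a_int16 = numpy.arange(4, dtype="int16")
w = numpy.ones(4, dtype="float32")

mixed = a_int32 * w                     # int32 with float32 is promoted to float64
small = a_int16 * w                     # a 16-bit integer fits in float32
manual = a_int32.astype("float32") * w  # an explicit cast keeps float32

print(mixed.dtype, small.dtype, manual.dtype)
```

Either remedy from the list, using a narrower integer type or casting by hand, keeps the result in float32 so it can stay on the GPU.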

Solution

#!/usr/bin/env python
# Theano tutorial
# Solution to Exercise in section 'Using the GPU'


# 1. Raw results


from __future__ import print_function
import numpy
import theano
import theano.tensor as tt

from theano import sandbox, Out

theano.config.floatX = 'float32'

rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats).astype(theano.config.floatX),
     rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
training_steps = 10000

# Declare Theano symbolic variables
x = theano.shared(D[0], name="x")
y = theano.shared(D[1], name="y")
w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
x.tag.test_value = D[0]
y.tag.test_value = D[1]
#print "Initial model:"
#print w.get_value(), b.get_value()

# Construct Theano expression graph
p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
cost = tt.cast(xent.mean(), 'float32') + \
       0.01 * (w ** 2).sum()  # The cost to optimize
gw, gb = tt.grad(cost, [w, b])

"""
# Compile expressions to functions
train = theano.function(
            inputs=[x, y],
            outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
            name="predict")
"""

# Compile expressions to functions
train = theano.function(
            inputs=[],
            outputs=[prediction, xent],
            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
            name="train")
predict = theano.function(inputs=[], outputs=prediction,
            name="predict")

if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
        train.maker.fgraph.toposort()]):
    print('Used the cpu')
elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
          train.maker.fgraph.toposort()]):
    print('Used the gpu')
else:
    print('ERROR, not able to tell if theano used the cpu or the gpu')
    print(train.maker.fgraph.toposort())

for i in range(training_steps):
    pred, err = train()
#print "Final model:"
#print w.get_value(), b.get_value()

print("target values for D")
print(D[1])

print("prediction on D")
print(predict())

"""

# 2. Profiling


# 2.1 Profiling for CPU computations

# In your terminal, type:
$ THEANO_FLAGS=profile=True,device=cpu python using_gpu_solution_1.py

# You'll see first the output of the script:
Used the cpu
target values for D
prediction on D

# Followed by the output of profiling.. You'll see profiling results for each function
# in the script, followed by a summary for all functions.
# We'll show here only the summary:

Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz

Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 10002 calls to Function.__call__: 1.590916e+00s
  Time in Function.fn.__call__: 1.492365e+00s (93.805%)
  Time in thunks: 1.408159e+00s (88.512%)
  Total compile time: 6.309664e+00s
    Number of Apply nodes: 25
    Theano Optimizer time: 4.848340e-01s
       Theano validate time: 5.454302e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  59.6%    59.6%       0.839s       4.19e-05s     C    20001       3   theano.tensor.blas_c.CGemv
  30.1%    89.7%       0.424s       4.71e-06s     C    90001      10   theano.tensor.elemwise.Elemwise
   5.5%    95.2%       0.078s       7.79e-02s     Py       1       1   theano.tensor.blas.Gemv
   1.9%    97.1%       0.026s       1.30e-06s     C    20001       3   theano.tensor.basic.Alloc
   1.3%    98.4%       0.018s       1.85e-06s     C    10000       1   theano.tensor.elemwise.Sum
   1.0%    99.4%       0.014s       4.78e-07s     C    30001       4   theano.tensor.elemwise.DimShuffle
   0.6%   100.0%       0.008s       4.23e-07s     C    20001       3   theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  59.6%    59.6%       0.839s       4.19e-05s     C     20001        3   CGemv{inplace}
  15.8%    75.4%       0.223s       2.23e-05s     C     10000        1   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
   7.7%    83.1%       0.109s       1.09e-05s     C     10000        1   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
   5.5%    88.7%       0.078s       7.79e-02s     Py       1        1   Gemv{no_inplace}
   4.3%    92.9%       0.060s       6.00e-06s     C     10000        1   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
   1.9%    94.8%       0.026s       1.30e-06s     C     20001        3   Alloc
   1.3%    96.1%       0.018s       1.85e-06s     C     10000        1   Sum{acc_dtype=float64}
   0.7%    96.8%       0.009s       4.73e-07s     C     20001        3   InplaceDimShuffle{x}
   0.6%    97.4%       0.009s       8.52e-07s     C     10000        1   Elemwise{sub,no_inplace}
   0.6%    98.0%       0.008s       4.23e-07s     C     20001        3   Shape_i{0}
   0.5%    98.5%       0.007s       7.06e-07s     C     10000        1   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   0.5%    98.9%       0.007s       6.57e-07s     C     10000        1   Elemwise{neg,no_inplace}
   0.3%    99.3%       0.005s       4.88e-07s     C     10000        1   InplaceDimShuffle{1,0}
   0.3%    99.5%       0.004s       3.78e-07s     C     10000        1   Elemwise{inv,no_inplace}
   0.2%    99.8%       0.003s       3.44e-07s     C     10000        1   Elemwise{Cast{float32}}
   0.2%   100.0%       0.003s       3.01e-07s     C     10000        1   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   0.0%   100.0%       0.000s       8.11e-06s     C        1        1   Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  31.6%    31.6%       0.445s       4.45e-05s   10000     7   CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
  27.9%    59.6%       0.393s       3.93e-05s   10000    17   CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
  15.8%    75.4%       0.223s       2.23e-05s   10000    14   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
   7.7%    83.1%       0.109s       1.09e-05s   10000    15   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
   5.5%    88.7%       0.078s       7.79e-02s      1     0   Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
   4.3%    92.9%       0.060s       6.00e-06s   10000    13   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
   1.3%    94.2%       0.018s       1.85e-06s   10000    16   Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
   1.0%    95.2%       0.013s       1.34e-06s   10000     5   Alloc(TensorConstant{0.0}, Shape_i{0}.0)
   0.9%    96.1%       0.013s       1.27e-06s   10000    12   Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
   0.6%    96.7%       0.009s       8.52e-07s   10000     4   Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
   0.5%    97.2%       0.007s       7.06e-07s   10000     9   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
   0.5%    97.6%       0.007s       6.57e-07s   10000    11   Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   0.4%    98.1%       0.006s       6.27e-07s   10000     0   InplaceDimShuffle{x}(b)
   0.4%    98.5%       0.006s       5.90e-07s   10000     1   Shape_i{0}(x)
   0.3%    98.9%       0.005s       4.88e-07s   10000     2   InplaceDimShuffle{1,0}(x)
   0.3%    99.1%       0.004s       3.78e-07s   10000    10   Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
   0.2%    99.4%       0.003s       3.44e-07s   10000     8   Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
   0.2%    99.6%       0.003s       3.19e-07s   10000     6   InplaceDimShuffle{x}(Shape_i{0}.0)
   0.2%    99.8%       0.003s       3.01e-07s   10000    18   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
   0.2%   100.0%       0.003s       2.56e-07s   10000     3   Shape_i{0}(y)
   ... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)



# 2.2 Profiling for GPU computations

# In your terminal, type:
$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py

# You'll see first the output of the script:
Used the gpu
target values for D
prediction on D

Results were produced using a GeForce GTX TITAN

# Profiling summary for all functions:

Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 10002 calls to Function.__call__: 3.535239e+00s
  Time in Function.fn.__call__: 3.420863e+00s (96.765%)
  Time in thunks: 2.865905e+00s (81.067%)
  Total compile time: 4.728150e-01s
    Number of Apply nodes: 36
    Theano Optimizer time: 4.283385e-01s
       Theano validate time: 7.687330e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  45.7%    45.7%       1.308s       1.64e-05s     C    80001       9   theano.sandbox.cuda.basic_ops.GpuElemwise
  17.2%    62.8%       0.492s       2.46e-05s     C    20002       4   theano.sandbox.cuda.blas.GpuGemv
  15.1%    77.9%       0.433s       2.17e-05s     C    20001       3   theano.sandbox.cuda.basic_ops.GpuAlloc
   8.2%    86.1%       0.234s       1.17e-05s     C    20002       4   theano.sandbox.cuda.basic_ops.HostFromGpu
   7.2%    93.3%       0.207s       2.07e-05s     C    10000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
   4.4%    97.7%       0.127s       1.27e-05s     C    10003       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.9%    98.6%       0.025s       8.23e-07s     C    30001       4   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.7%    99.3%       0.020s       9.88e-07s     C    20001       3   theano.tensor.elemwise.Elemwise
   0.5%    99.8%       0.014s       7.18e-07s     C    20001       3   theano.compile.ops.Shape_i
   0.2%   100.0%       0.006s       5.78e-07s     C    10000       1   theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  17.2%    17.2%       0.492s       2.46e-05s     C     20001        3   GpuGemv{inplace}
   8.2%    25.3%       0.234s       1.17e-05s     C     20002        4   HostFromGpu
   8.0%    33.3%       0.228s       2.28e-05s     C     10001        2   GpuAlloc{memset_0=True}
   7.4%    40.7%       0.211s       2.11e-05s     C     10000        1   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
   7.2%    47.9%       0.207s       2.07e-05s     C     10000        1   GpuCAReduce{add}{1}
   7.1%    55.0%       0.205s       2.05e-05s     C     10000        1   GpuAlloc
   6.9%    62.0%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{sub,no_inplace}
   6.9%    68.9%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{inv,no_inplace}
   6.2%    75.1%       0.178s       1.78e-05s     C     10000        1   GpuElemwise{neg,no_inplace}
   5.6%    80.6%       0.159s       1.59e-05s     C     10000        1   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
   4.4%    85.1%       0.127s       1.27e-05s     C     10003        4   GpuFromHost
   4.3%    89.4%       0.124s       1.24e-05s     C     10000        1   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
   4.2%    93.6%       0.121s       1.21e-05s     C     10000        1   GpuElemwise{ScalarSigmoid}[(0, 0)]
   4.2%    97.7%       0.119s       1.19e-05s     C     10000        1   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
   0.5%    98.2%       0.014s       7.18e-07s     C     20001        3   Shape_i{0}
   0.5%    98.7%       0.013s       1.33e-06s     C     10001        2   Elemwise{gt,no_inplace}
   0.3%    99.0%       0.010s       9.81e-07s     C     10000        1   GpuDimShuffle{1,0}
   0.3%    99.3%       0.008s       7.90e-07s     C     10000        1   GpuDimShuffle{0}
   0.2%    99.6%       0.007s       6.97e-07s     C     10001        2   GpuDimShuffle{x}
   0.2%    99.8%       0.006s       6.50e-07s     C     10000        1   Elemwise{Cast{float32}}
   ... (remaining 3 Ops account for   0.20%(0.01s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   8.8%     8.8%       0.251s       2.51e-05s   10000    22   GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
   8.4%    17.2%       0.241s       2.41e-05s   10000     7   GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
   8.0%    25.1%       0.228s       2.28e-05s   10000     5   GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
   7.4%    32.5%       0.211s       2.11e-05s   10000    13   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
   7.2%    39.7%       0.207s       2.07e-05s   10000    21   GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
   7.1%    46.9%       0.205s       2.05e-05s   10000    17   GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
   6.9%    53.8%       0.198s       1.98e-05s   10000     4   GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
   6.9%    60.7%       0.198s       1.98e-05s   10000    12   GpuElemwise{inv,no_inplace}(GpuFromHost.0)
   6.2%    66.9%       0.178s       1.78e-05s   10000    11   GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
   5.6%    72.5%       0.159s       1.59e-05s   10000    19   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
   4.8%    77.3%       0.138s       1.38e-05s   10000    18   HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
   4.4%    81.7%       0.126s       1.26e-05s   10000    10   GpuFromHost(Elemwise{Cast{float32}}.0)
   4.3%    86.0%       0.124s       1.24e-05s   10000     9   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
   4.2%    90.2%       0.121s       1.21e-05s   10000    15   GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
   4.2%    94.4%       0.119s       1.19e-05s   10000    23   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
   3.4%    97.7%       0.096s       9.61e-06s   10000    16   HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
   0.5%    98.2%       0.013s       1.33e-06s   10000    20   Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
   0.3%    98.5%       0.010s       9.81e-07s   10000     2   GpuDimShuffle{1,0}(x)
   0.3%    98.8%       0.008s       8.27e-07s   10000     1   Shape_i{0}(x)
   0.3%    99.1%       0.008s       7.90e-07s   10000    14   GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
   ... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)


# 3. Conclusions

Examine and compare 'Ops' summaries for CPU and GPU. Usually GPU ops 'GpuFromHost' and 'HostFromGpu' by themselves
consume a large amount of extra time, but by making as few as possible data transfers between GPU and CPU, you can minimize their overhead.
Notice that each of the GPU ops consumes more time than its CPU counterpart. This is because the ops operate on small inputs;
if you increase the input data size (e.g. set N = 4000), you will see a gain from using the GPU.

"""
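The conclusions above can be checked against the two profiling summaries with a little arithmetic. The sketch below only reuses numbers printed in those summaries (total call time on each device, GPU time in thunks, and the time of the two transfer op classes); the variable names are made up.

```python
# Figures copied from the two "Function profiling" summaries above (seconds).
cpu_total_call_time = 1.590916
gpu_total_call_time = 3.535239
gpu_time_in_thunks = 2.865905
gpu_transfer_ops = {"HostFromGpu": 0.234, "GpuFromHost": 0.127}

transfer_time = sum(gpu_transfer_ops.values())
transfer_share = transfer_time / gpu_time_in_thunks
gpu_slowdown = gpu_total_call_time / cpu_total_call_time

print("Transfer ops: %.3fs, %.1f%% of GPU thunk time" % (transfer_time, 100 * transfer_share))
print("GPU run is %.2fx slower than the CPU run at N = 400" % gpu_slowdown)
```

On these figures the transfers account for roughly an eighth of the GPU thunk time, and the GPU run as a whole is about twice as slow as the CPU run at this input size.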

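2. GpuArray Backend

If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit. See the instructions for libgpuarray.

If you use OpenCL, all types of devices are supported. For the remainder of this section, whatever compute device you are using will be referred to as the GPU.

Warning: while we fully intend to support OpenCL, as of May 2014 this support is still in its infancy. A lot of very useful ops still do not support it, because they were ported from the old backend with minimal changes.

2.1 Testing Theano with GPU

To see whether your GPU is being used, cut and paste the following program into a file and run it: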

from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print('Looping %d times took %s seconds' % (iters, t1 - t0))
print('Result is %s' % (r,))
# An Elemwise op whose class name lacks 'Gpu' means exp() ran on the CPU.
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

The program just computes the exp() of a bunch of random numbers. Note that we use the theano.shared() function to make sure that the input x is stored on the GPU.

$ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the cpu

$ THEANO_FLAGS=device=cuda0 python check1.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu

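2.2 Returning a handle to device-allocated data

By default, functions that execute on the GPU still return a standard numpy ndarray. A transfer operation is inserted just before the results are returned, to keep the interface consistent with CPU code. This lets you change the device some code runs on by changing only the value of the device flag, without touching the code.

If you don't mind a loss of flexibility, you can ask Theano to return the GPU object directly. The following code is modified to do just that: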

from theano import function, config, shared, tensor, sandbox
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
# gpu_from_host tells Theano that the result should stay on the GPU.
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print('Looping %d times took %s seconds' % (iters, t1 - t0))
print('Result is %s' % (numpy.asarray(r),))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

Here the theano.sandbox.gpuarray.basic_ops.gpu_from_host() call means "copy the input to the GPU". However, during the optimization phase it is removed, since the result is already on the GPU. It is used here only to tell Theano that we want the result on the GPU.

The output is:

$ THEANO_FLAGS=device=cuda0 python check2.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.455810785294 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753
  1.62323285]
Used the gpu

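The time per call appears to be much lower than in the two previous runs, and it should indeed be lower, since we avoid a transfer. However, the massive speedup obtained here is partly due to the asynchronous nature of execution on GPUs: the work is not completed yet, it has only been "launched".

The object returned is a GpuArray from pygpu. It mostly behaves like a numpy ndarray, with some exceptions because its data lives on the GPU. You can copy it to the host and convert it to a regular ndarray with the usual numpy casting, for example numpy.asarray().

For even more speed, you can play with the borrow flag. See Borrowing when Constructing Function Objects.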

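2.3 What can be accelerated on the GPU?

The performance characteristics will of course vary from device to device, and also as we refine our implementation.

This backend supports all regular Theano data types (float32, float64, int, ...). However, GPU support varies, and some units cannot deal with double (float64) or small (less than 32 bits, like int16) data types. If this is the case, you will get an error at compile time or at run time.

Support for complex numbers is untested and most likely broken.

In general, large operations such as matrix multiplication, or element-wise operations with large inputs, will be significantly faster.

2.4 GPU async capabilities

By default, all operations on the GPU are run asynchronously. The underlying libgpuarray makes this mostly transparent.

A forced synchronization point is introduced when doing memory transfers between device and host. Another is introduced when releasing active memory buffers on the GPU (active buffers are buffers that are still in use by a kernel).

It is possible to force synchronization for a particular GpuArray by calling its sync() method. This is useful for getting accurate timings when benchmarking.

The forced synchronization points interact with the garbage collection of intermediate results. To get the fastest speed possible, you should disable the garbage collector by using the Theano flag allow_gc=False. Be aware that this increases memory usage.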
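The effect of asynchronous launches on timing can be illustrated without a GPU. The sketch below is only an analogy: a worker thread that sleeps stands in for a GPU kernel, and waiting on the future plays the role that sync() plays for a GpuArray. The function slow_kernel and the 0.2 second duration are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_kernel(duration):
    """Stand-in for a GPU kernel: it just sleeps for `duration` seconds."""
    time.sleep(duration)
    return duration


with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.time()
    pending = pool.submit(slow_kernel, 0.2)  # returns at once: work is only "launched"
    launch_time = time.time() - t0

    pending.result()                         # wait for completion, like sync()
    synced_time = time.time() - t0

print("time measured without waiting: %.4f s" % launch_time)
print("time measured after waiting:   %.4f s" % synced_time)
```

A benchmark that stops the clock right after the launch measures almost nothing, which is why a synchronization point is needed before reading the timer.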

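3. Software for directly programming a GPU

Leaving aside Theano, which is a meta-programmer, there are:

CUDA: GPU programming API by NVIDIA, based on an extension to C (CUDA C)
- Vendor-specific.
- Mature numeric libraries (BLAS, RNG, FFT).
OpenCL: multi-vendor version of CUDA
- More general and standardized.
- Fewer libraries, less widespread.
PyCUDA: Python bindings to the CUDA driver interface, which give access to NVIDIA's CUDA parallel computation API from Python
- Convenience:

  Makes GPU meta-programming from within Python easier.

  Abstractions to compile low-level CUDA code from Python (pycuda.driver.SourceModule).

  GPU memory buffer (pycuda.gpuarray.GPUArray).

  Helpful documentation.
- Completeness: binding to all of CUDA's driver API.
- Automatic error checking: all CUDA errors are automatically translated into Python exceptions.
- Speed: PyCUDA's base layer is written in C++.
- Good memory management of GPU objects:

  Object cleanup is tied to the lifetime of objects (RAII, 'Resource Acquisition Is Initialization').

  Makes it much easier to write correct, leak-free and crash-free code.

  PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory allocated in it is also freed).
(See PyCUDA's documentation and Andreas Kloeckner's website on PyCUDA.)
PyOpenCL: PyCUDA for OpenCL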

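4. Learning to program with PyCUDA

If you are already proficient in C, you can easily leverage that knowledge by learning, first, to program a GPU with CUDA C and, second, to use PyCUDA to access the CUDA API.

The following resources will assist you in this learning process:

CUDA API and CUDA C: introductory
- NVIDIA's slides
- Stein's (NYU) slides
CUDA API and CUDA C: advanced
- MIT IAP2009 CUDA (full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
- Course U. of Illinois (full lectures, Kirk-Hwu textbook)
- NVIDIA's knowledge base (extensive coverage, from introductory to advanced)
- practical issues (on the relationship between grids, blocks and threads; see also the related issues on the same page)
- CUDA optimisation
PyCUDA: introductory
- Kloeckner's slides
- Kloeckner's website
PyCUDA: advanced
- PyCUDA documentation website

The following examples give a foretaste of programming a GPU with PyCUDA. Once you feel competent enough, you may try the corresponding exercises.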

Example: PyCUDA

# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
# One block of 400 threads: thread i multiplies element i.
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

assert numpy.allclose(dest, a*b)
print(dest)

Exercise

Run the preceding example.

Modify and execute it to work for a matrix of shape (20, 10).

Example: Theano + PyCUDA

import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda

class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)

    def __hash__(self):
        return hash(type(self))

    def __str__(self):
        return self.__class__.__name__

    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
           cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<size){
        o0[i] = i0[i]*2;
    }
  }""")
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]

        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk

Use this code to test it:

>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv = numpy.ones((4, 5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print(numpy.asarray(f(xv)))
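The launch configuration used in thunk() (blocks of 512 threads, enough blocks to cover the input, and the i < size guard in the kernel) can be checked in plain Python without a GPU. This is a sketch that emulates the index arithmetic only; launch_config and covered_indices are hypothetical helper names.

```python
import math

BLOCK_SIZE = 512  # threads per block, as in the thunk above


def launch_config(n_elements, block_size=BLOCK_SIZE):
    """Return (block, grid) tuples with enough blocks to cover n_elements."""
    n_blocks = int(math.ceil(n_elements / float(block_size)))
    return (block_size, 1, 1), (n_blocks, 1)


def covered_indices(n_elements, block, grid):
    """Emulate i = blockIdx.x*blockDim.x + threadIdx.x with the i < size guard."""
    hit = []
    for block_idx in range(grid[0]):
        for thread_idx in range(block[0]):
            i = block_idx * block[0] + thread_idx
            if i < n_elements:  # surplus threads in the last block do nothing
                hit.append(i)
    return hit


block, grid = launch_config(4 * 5)         # the (4, 5) test matrix has 20 elements
big_block, big_grid = launch_config(1000)  # needs two blocks of 512 threads

print(block, grid)
print(big_block, big_grid)
```

Every element is visited exactly once, and the guard keeps the surplus threads of the last block from writing past the end of the array.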

Exercise

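Run the preceding example.

Modify and execute it to multiply two matrices: x * y.

Modify and execute it to return two outputs: x + y and x - y.

(Note that Theano's current element-wise fusion optimization only applies to computations involving a single output. Hence, to gain efficiency over the basic solution asked for here, the two operations would have to be jointly optimized explicitly in the code.)

Modify and execute it to support strides (i.e. to avoid constraining the input to be C-contiguous).

5. Note

See Other Implementations to learn how to handle random numbers on the GPU.

References:

[1] Official site: http://deeplearning.net/software/theano/tutorial/using_gpu.html

[2] Personal blog: https://www.cnblogs.com/shouhuxianjian/p/4590224.html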