NumPy 底层进行了不错的优化。html
loops = 25000000
from math import *
# 策略1:对每个元素依次进行计算
a = range(1, loops)
def f(x):
return 3 * log(x) + cos(x) ** 2
%timeit r = [f(x) for x in a]
# 策略2:使用np
import numpy as np
a = np.arange(1, loops)
%timeit r = 3 * np.log(a) + np.cos(a) ** 2
# 策略3:numexpr方法,避免了数组在内存中复制
import numexpr as ne
ne.set_num_threads(1)
f = '3 * log(a) + cos(a) ** 2'
%timeit r = ne.evaluate(f)
# 策略4:再并发四线程
ne.set_num_threads(4) # 支持并发
%timeit r = ne.evaluate(f)
NumPy 虽然经过底层高度优化过的计算库能够实现接近C的高效计算,但在计算复杂且计算量庞大的时候多少仍是有些嫌慢。Numexpr 库是最近发现的一个很是简单易用的 Numpy性能提高工具。python
Goto: NumExpr: Fast numerical expression evaluator for NumPygit
先安装cuda驱动:[CUDA] 00 - GPU Driver Installation & Concurrency Programminggithub
再安装CuPy驱动:CuPy Installationweb
须要了解:PyPy和CPython的不一样点。数据库
简单理解:PyPy 为何会比 CPython 还要快?express
CPython的思路编程
def add(x, y): return x + y if instance_has_method(x, '__add__') { return call(x, '__add__', y) // x.__add__ 里面又有一大堆针对不一样类型的 y 的判断
} else if isinstance_has_method(super_class(x), '__add__' { return call(super_class, '__add__', y)
} else if isinstance(x, str) and isinstance(y, str) { return concat_str(x, y)
} else if isinstance(x, float) and isinstance(y, float) { return add_float(x, y)
} else if isinstance(x, int) and isinstance(y, int) { return add_int(x, y)
} else ...
Just In Time 编译数组
为什么加速了呢?PyPy 执行的时候,发现执行了一百遍 add 函数,发现你 TM 每次都只传两个整数进来,那我何苦每次还给你作这么多计算。因而当场生成了一个相似 C 的函数:缓存
int add_int_int(int x, int y) {
return x + y; }
原始代码,做为以后的优化参照。
import numpy as np import time grid_size = (512,) def laplacian(grid, out): np.copyto(out, grid) np.multiply(out, -2.0, out) np.add(out, np.roll(grid, +1), out) np.add(out, np.roll(grid, -1), out) def evolve(grid, dt, out, D=1): laplacian(grid, out) np.multiply(out, D * dt, out) np.add(out, grid, out)
def run_experiment(num_iterations):
scratch = np.zeros(grid_size) grid = np.zeros(grid_size) block_low = int(grid_size[0] * .4) block_high = int(grid_size[0] * .5) grid[block_low:block_high] = 0.005 start = time.time() for i in range(num_iterations): evolve(grid, 0.1, scratch) grid, scratch = scratch, grid return time.time() - start
if __name__ == "__main__": run_experiment(500)
性能分析一:
$ sudo perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python diffusion_python_memory.py
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory.py' (3 runs): 358139881 cycles # 3.605 GHz ( +- 2.30% ) (65.11%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 453383936 instructions # 1.27 insn per cycle ( +- 1.22% ) (82.56%) 22986206 cache-references # 231.400 M/sec ( +- 0.56% ) (82.71%) 2022121 cache-misses # 8.797 % of all cache refs ( +- 15.29% ) (83.90%) 100438152 branches # 1011.101 M/sec ( +- 1.59% ) (83.90%) 3004796 branch-misses # 2.99% of all branches ( +- 0.49% ) (84.38%) 99.34 msec task-clock # 0.996 CPUs utilized ( +- 4.31% ) 4984 faults # 0.050 M/sec ( +- 0.07% ) 4984 minor-faults # 0.050 M/sec ( +- 0.07% ) 1 cs # 0.007 K/sec ( +- 50.00% ) 0 migrations # 0.000 K/sec 0.09976 +- 0.00448 seconds time elapsed ( +- 4.49% )
写在一块儿的好处是:短代码每每意味着高性能,由于容易利用内建优化机制。
def laplacian(grid): return np.roll(grid, +1) + np.roll(grid, -1) - 2 * grid
def evolve(grid, dt, D=1): return grid + dt * D * laplacian(grid)
性能分析二:cache-references很是大,但0.97指令减小才是重点。
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy.py' (3 runs): 345131518 cycles # 3.368 GHz ( +- 2.05% ) (63.87%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 441624002 instructions # 1.28 insn per cycle ( +- 0.41% ) (82.69%) 22383742 cache-references # 218.428 M/sec ( +- 0.44% ) (84.39%) 2334617 cache-misses # 10.430 % of all cache refs ( +- 4.03% ) (84.39%) 98647251 branches # 962.634 M/sec ( +- 1.17% ) (84.39%) 2950862 branch-misses # 2.99% of all branches ( +- 0.24% ) (82.97%) 102.48 msec task-clock # 0.998 CPUs utilized ( +- 2.02% ) 4978 faults # 0.049 M/sec ( +- 0.04% ) 4978 minor-faults # 0.049 M/sec ( +- 0.04% ) 1 cs # 0.007 K/sec ( +- 50.00% ) 0 migrations # 0.000 K/sec 0.10270 +- 0.00209 seconds time elapsed ( +- 2.04% )
咱们使用 numpy下降了 CPU 负担,咱们下降了解决问题所必要的内存分配次数。
事实上 evolve 函数有 93%的时间花费在 laplacian 函数中。
参见:[Optimized Python] 17 - Bottle-neck in performance
向操做系统进行请求所需的开销比简单填充缓存大不少。
填充一次缓存失效是一个在主板上优化过的硬件行为,而内存分配则须要跟另外一个进程、内核打交道来完成。
(1)"+="运算符的使用减小了“没必要要的变量内存分配”。
(2)roll_add替代过于全能的np.roll,以减小没必要要的指令。
def roll_add(rollee, shift, out): if shift == 1: out[1:] += rollee[:-1] out[0] += rollee[-1] elif shift == -1: out[:-1] += rollee[1:] out[-1] += rollee[0] def laplacian(grid, out): np.copyto(out, grid) np.multiply(out, -2.0, out) roll_add(grid, +1, out) roll_add(grid, -1, out) def evolve(grid, dt, out, D=1): laplacian(grid, out) np.multiply(out, D * dt, out) np.add(out, grid, out
性能分析三:减低了内存缺页,提升了指令效率。
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory2.py' (3 runs): 314575986 cycles # 3.670 GHz ( +- 1.89% ) (62.67%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 402037134 instructions # 1.28 insn per cycle ( +- 0.19% ) (81.33%) 18138094 cache-references # 211.611 M/sec ( +- 2.65% ) (81.94%) 2298500 cache-misses # 12.672 % of all cache refs ( +- 6.84% ) (84.68%) 88677913 branches # 1034.575 M/sec ( +- 0.68% ) (86.00%) 2885039 branch-misses # 3.25% of all branches ( +- 1.19% ) (84.71%) 85.71 msec task-clock # 0.997 CPUs utilized ( +- 1.86% ) 4974 faults # 0.058 M/sec ( +- 0.10% ) 4974 minor-faults # 0.058 M/sec ( +- 0.10% ) 0 cs # 0.004 K/sec ( +-100.00% ) 0 migrations # 0.000 K/sec 0.08596 +- 0.00163 seconds time elapsed ( +- 1.89% )
numexpr 的一个重要特色是它考虑到了 CPU 缓存。它特意移动数据来让各级CPU 缓存可以拥有正确的数据让缓存失效最小化。
(a)当咱们增长矩阵的大小时,咱们会发现 numexpr 比原生 numpy 更好地利用了咱们的缓存。并且,numexpr 利用了多核来进行计算并尝试填满每一个核心的缓存。
(b)当矩阵较小时,管理多核的开销盖过了任何可能的性能提高。
import numexpr as ne
def evolve(grid, dt, out, D=1): laplacian(grid, out) ne.evaluate("out*D*dt+grid", out=out)
性能分析四:减低了内存缺页,提升了指令效率。
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_numpy_memory2_numexpr.py' (3 runs): 389532813 cycles # 3.888 GHz ( +- 1.01% ) (65.40%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 478293442 instructions # 1.23 insn per cycle ( +- 1.70% ) (82.72%) 23279515 cache-references # 232.372 M/sec ( +- 0.91% ) (82.89%) 2447425 cache-misses # 10.513 % of all cache refs ( +- 10.84% ) (83.77%) 103451737 branches # 1032.639 M/sec ( +- 1.21% ) (84.97%) 3324954 branch-misses # 3.21% of all branches ( +- 0.61% ) (82.97%) 100.18 msec task-clock # 1.000 CPUs utilized ( +- 5.75% ) 8591 faults # 0.086 M/sec ( +- 0.01% ) 8591 minor-faults # 0.086 M/sec ( +- 0.01% ) 12 cs # 0.123 K/sec ( +- 2.70% ) 3 migrations # 0.030 K/sec 0.10018 +- 0.00582 seconds time elapsed ( +- 5.81% )
直接使用scipy提供的接口。
from scipy.ndimage.filters import laplace def laplacian(grid, out): laplace(grid, out, mode='wrap')
性能分析五:可能过于通用而没有明显提高。
Performance counter stats for '/usr/local/anaconda3/bin/python3 diffusion_scipy.py' (3 runs): 346206970 cycles # 3.647 GHz ( +- 0.37% ) (63.49%) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 441742003 instructions # 1.28 insn per cycle ( +- 1.10% ) (81.74%) 21243859 cache-references # 223.757 M/sec ( +- 0.37% ) (82.77%) 2092136 cache-misses # 9.848 % of all cache refs ( +- 1.46% ) (84.55%) 96943942 branches # 1021.088 M/sec ( +- 0.80% ) (85.19%) 3120156 branch-misses # 3.22% of all branches ( +- 1.27% ) (84.00%) 94.94 msec task-clock # 0.997 CPUs utilized ( +- 11.09% ) 5156 faults # 0.054 M/sec ( +- 0.10% ) 5156 minor-faults # 0.054 M/sec ( +- 0.10% ) 0 cs # 0.004 K/sec ( +-100.00% ) 0 migrations # 0.000 K/sec 0.0952 +- 0.0106 seconds time elapsed ( +- 11.14% )
看来,scipy没有大的必要,“定制的” 优化比较靠谱。
v = [0.5, 0.75, 1.0, 1.5, 2.0] # vector of numbers
m = [v, v, v] # matrix of numbers
m
m[1]
m[1][0]
v1 = [0.5, 1.5]
v2 = [1, 2]
m = [v1, v2]
c = [m, m] # cube of numbers
c
c[1][1][0]
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [v, v, v]
m
v[0] = 'Python'
m
from copy import deepcopy
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = 3 * [deepcopy(v), ]
m
v[0] = 'Python'
m
import numpy as np
a = np.array([0, 0.5, 1.0, 1.5, 2.0])
type(a)
a[:2] # indexing as with list objects in 1 dimension
a.sum() # sum of all elements
a.std() # standard deviation
a.cumsum() # running cumulative sum
a * 2
a ** 2
np.sqrt(a)
b = np.array([a, a * 2])
b
b[0] # first row
b[0, 2] # third element of first row
b.sum()
b.sum(axis=0)
# sum along axis 0, i.e. column-wise sum
b.sum(axis=1)
# sum along axis 1, i.e. row-wise sum
c = np.zeros((2, 3, 4), dtype='i', order='C') # also: np.ones()
c
d = np.ones_like(c, dtype='f16', order='C') # also: np.zeros_like()
d
import random
I = 5000
%time mat = [[random.gauss(0, 1) for j in range(I)] for i in range(I)]
# a nested list comprehension
%time mat = np.random.standard_normal((I, I))
%time mat.sum()
dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
('Height', 'f'), ('Children/Pets', 'i4', 2)]) # 相似与数据库的表,定义列的属性;每列数据的类型是要确保一致的
s = np.array([('Smith', 45, 1.83, (0, 1)),
('Jones', 53, 1.72, (2, 2))], dtype=dt)
s
s['Name']
s['Height'].mean()
s[1]['Age']
r = np.random.standard_normal((4, 3))
s = np.random.standard_normal((4, 3))
r + s
2 * r + 3
s = np.random.standard_normal(3)
r + s
# causes intentional error
# s = np.random.standard_normal(4)
# r + s
# r.transpose() + s
np.shape(r.T)
def f(x):
return 3 * x + 5
f(0.5) # float object
f(r) # NumPy array
# causes intentional error
# import math
# math.sin(r)
np.sin(r) # array as input
np.sin(np.pi) # float as input
x = np.random.standard_normal((5, 10000000))
y = 2 * x + 3 # linear equation y = a * x + b
C = np.array((x, y), order='C') # 按行统计快些
F = np.array((x, y), order='F') # 按列统计快些
x = 0.0; y = 0.0 # memory clean-up
C[:2].round(2) # 四舍五入
%timeit C.sum()
%timeit F.sum()
%timeit C[0].sum(axis=0)
%timeit C[0].sum(axis=1)
%timeit F.sum(axis=0)
%timeit F.sum(axis=1)
F = 0.0; C = 0.0 # memory clean-up