Understanding the Memory Layout and Design Philosophy of NumPy's ndarray

Date: 2020-02-11


This article aims to explain the memory structure of numpy.ndarray and the design philosophy behind it.

What is an ndarray

NumPy provides an N-dimensional array type, the ndarray, which describes a collection of “items” of the same type. The items can be indexed using for example N integers.

—— from https://docs.scipy.org/doc/numpy-1.17.0/reference/arrays.html

An ndarray is NumPy's multidimensional array: all of its elements have the same type, and elements can be accessed by index.

For example:

>>> import numpy as np
>>> a = np.array([[0,1,2,3],[4,5,6,7],[8,9,10,11]])
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> type(a)
<class 'numpy.ndarray'>
>>> a.dtype   
dtype('int32')
>>> a[1,2]
6
>>> a[:,1:3]
array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])

>>> a.ndim    
2
>>> a.shape   
(3, 4)        
>>> a.strides 
(16, 4)

Note: np.array is not a class but one of several functions that create np.ndarray objects; the class of NumPy's multidimensional array is np.ndarray.
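The distinction in the note above can be checked directly. The following is a minimal sketch; the variable names `a`, `z` and `r` are illustrative and not taken from the article.

```python
import numpy as np

# np.ndarray is a class; np.array is a function that builds instances of it.
print(isinstance(np.ndarray, type))  # True
print(isinstance(np.array, type))    # False

# Other creation functions return the same class.
a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])
z = np.zeros((3, 4))
r = np.arange(12).reshape(3, 4)
for arr in (a, z, r):
    print(type(arr) is np.ndarray)  # True for each
```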

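The design philosophy of ndarray

The design philosophy of ndarray is to separate the stored data from how it is interpreted, or in other words to separate copies from views: as many operations as possible act on the interpretation (the view), and as few as possible touch the memory that actually holds the data.

As shown below, reshape returns a new object b whose shape differs from that of a, yet the two share the same data block. Likewise, c = b.T is the transpose of b, and the two still share the same data block. The data itself does not change; only its interpretation does.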

>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> b = a.reshape(4, 3)
>>> b
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

# reshape produces a view: only the interpretation of the data changes, and the data's memory address stays the same
>>> a.ctypes.data
80831392
>>> b.ctypes.data
80831392
>>> id(a) == id(b)
False

# the data is stored contiguously in memory
>>> from ctypes import string_at
>>> string_at(b.ctypes.data, b.nbytes).hex()
'000000000100000002000000030000000400000005000000060000000700000008000000090000000a0000000b000000'

# c, the transpose of b, still shares the same data block; only the interpretation changes: row-major storage is read in column-major order
>>> c = b.T
>>> c
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
>>> c.ctypes.data
80831392
>>> string_at(c.ctypes.data, c.nbytes).hex()
'000000000100000002000000030000000400000005000000060000000700000008000000090000000a0000000b000000'
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

# copy duplicates the data into a new buffer at a different memory address
>>> c = b.copy()
>>> c
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
>>> c.ctypes.data
80831456
>>> string_at(c.ctypes.data, c.nbytes).hex()
'000000000100000002000000030000000400000005000000060000000700000008000000090000000a0000000b000000'

# slicing also produces a view that still points into the original data block
>>> d = b[1:3, :]
>>> d
array([[3, 4, 5],
       [6, 7, 8]])
>>> d.ctypes.data
80831404
>>> print('data buff address from {0} to {1}'.format(b.ctypes.data, b.ctypes.data + b.nbytes))
data buff address from 80831392 to 80831440

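A copy is a complete duplicate of the data. Modifying a copy does not affect the original data, because the two occupy different locations in physical memory.

A view is an alias for, or reference to, the data. The original data can be accessed and manipulated through it without being copied. Modifying a view does affect the original data, because both refer to the same location in physical memory.

Views typically arise when:

1. A NumPy slicing operation returns a view of the original data.

2. ndarray's view() method is called to create a view.

Copies typically arise when:

A Python sequence is sliced, or copy.deepcopy() is called.

ndarray's copy() method is called to create a copy.

—— from NumPy 副本和视图 (NumPy copies and views)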

The benefits of the view mechanism are clear: it saves memory and it is fast, because no data is copied.
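The difference between a view and a copy can be observed without comparing raw addresses. The sketch below uses np.shares_memory and the .base attribute; the names `view`, `dup` and `fancy` are illustrative. It also shows a case the list above does not mention: indexing with a list of integers (advanced indexing) returns a copy rather than a view.

```python
import numpy as np

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], dtype=np.int32)

view = a[:, 1:3]   # basic slicing: a view onto a's buffer
dup = a.copy()     # explicit copy: owns a separate buffer
fancy = a[[0, 2]]  # integer-list (advanced) indexing: also a copy

print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, dup))    # False
print(np.shares_memory(a, fancy))  # False
print(view.base is a)              # True: the view records whose data it borrows
print(dup.base is None)            # True: the copy owns its data

view[0, 0] = 100   # writing through the view changes a
dup[0, 0] = -1     # writing to the copy does not
print(a[0, 1], a[0, 0])            # 100 0
```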

The memory layout of ndarray

NumPy arrays consist of two major components, the raw array data (from now on, referred to as the data buffer), and the information about the raw array data. The data buffer is typically what people think of as arrays in C or Fortran, a contiguous (and fixed) block of memory containing fixed sized data items. NumPy also contains a significant set of data that describes how to interpret the data in the data buffer.

—— from NumPy internals

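The memory layout of an ndarray is illustrated in the following diagram:

It can be divided into roughly two parts, corresponding to the data and its interpretation in the design philosophy:

raw array data: a contiguous memory block that stores the raw data, laid out contiguously like an array in C or Fortran
metadata: describes how the memory block above is to be interpreted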
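As a small illustration of this split, the sketch below attaches different metadata to one and the same block of bytes. The buffer `raw` and the array names are made up for the example, and '<u2' is chosen explicitly so the result does not depend on the machine's byte order.

```python
import numpy as np

raw = bytearray(range(12))                 # 12 raw bytes: 0, 1, ..., 11

as_u8 = np.frombuffer(raw, dtype=np.uint8) # 12 one-byte items
as_grid = as_u8.reshape(3, 4)              # same bytes seen as a 3x4 grid
as_u16 = np.frombuffer(raw, dtype='<u2')   # 6 little-endian two-byte items

print(as_grid.shape, as_u16.shape)         # (3, 4) (6,)
print(as_u16[0])                           # 256, i.e. bytes 0x00 0x01 read as one item

# All three arrays interpret the same memory, so one write is seen by all.
raw[0] = 255
print(as_u8[0], as_grid[0, 0], as_u16[0])  # 255 255 511
```

Nothing is copied in any of these steps; only dtype, shape and strides differ between the three array objects.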

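What information does the metadata contain?

dtype: the data type, which specifies how many bytes each item occupies and how those bytes are interpreted, for example int32 or float32;
ndim: the number of dimensions;
shape: the number of items along each dimension;
strides: the spacing along each dimension, i.e. the number of bytes to step forward to reach the next adjacent item along that dimension; because of considerations such as memory alignment, it is not necessarily an integer multiple of the item size;

These four pieces of information make up the indexing scheme of an ndarray: how to locate the item at a given position, and how to interpret it.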
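The indexing scheme can be reproduced by hand: the byte offset of an item is the sum of index[k] * strides[k] over all dimensions, and the dtype says how to decode the bytes found there. The helper `item_via_strides` below is a hypothetical name introduced only for this sketch; it assumes non-negative indices.

```python
from ctypes import string_at

import numpy as np

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], dtype=np.int32)

def item_via_strides(arr, index):
    """Locate arr[index] manually from the start address, strides and dtype."""
    # Byte offset from the first item: sum over dimensions of index * stride.
    offset = sum(i * s for i, s in zip(index, arr.strides))
    # Read exactly one item's worth of bytes and decode it with the dtype.
    raw = string_at(arr.ctypes.data + offset, arr.itemsize)
    return np.frombuffer(raw, dtype=arr.dtype)[0]

print(a.strides)                        # (16, 4)
print(item_via_strides(a, (1, 2)))      # 6, same as a[1, 2]

# The same rule works for views, whose strides differ from the base array's.
t = a.T                                 # strides (4, 16)
d = a[1:3, ::2]                         # strides (16, 8), starts 16 bytes in
print(item_via_strides(t, (2, 1)))      # 6, same as t[2, 1]
print(item_via_strides(d, (1, 1)))      # 10, same as d[1, 1]
```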

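Other metadata includes byte order (big-endian or little-endian), read/write permission, and whether the data is in C order (row-major) or Fortran order (column-major), as shown below: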

>>> a.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

The core of ndarray is implemented in C, and the attributes above have counterparts in its source code; see structs such as PyArrayObject and PyArray_Descr.
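To see how the flags relate to strides and storage order, the sketch below compares a C-ordered array, a Fortran-ordered copy of it, and a transposed view, and then marks a view read-only. The names `c`, `f`, `t` and `ro` are illustrative.

```python
import numpy as np

c = np.arange(12, dtype=np.int32).reshape(3, 4)  # C order (row-major)
f = np.asfortranarray(c)                         # same values, column-major copy
t = c.T                                          # transposed view of c

print(c.strides, f.strides, t.strides)           # (16, 4) (4, 12) (4, 16)
print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])  # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # False True
print(t.flags['F_CONTIGUOUS'], t.flags['OWNDATA'])       # True False

# WRITEABLE is per array object: a read-only view rejects assignment.
ro = c.view()
ro.flags.writeable = False
try:
    ro[0, 0] = 99
except ValueError as exc:
    print('write rejected:', exc)
```

The transpose and the Fortran-ordered copy hold the same values, but only the copy needed new memory; the transpose just swaps the strides.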

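Why this design is possible

Why can ndarray be designed this way?

Because ndarray exists to serve matrix computation, all the items in an ndarray have the same type, such as int32 or float64. Every item occupies the same number of bytes and is interpreted the same way, so the items can be packed densely together; when one is read out, its bytes are copied and assembled on the fly into a scalar object according to the dtype. This saves a great deal of space, since the fields of a scalar object other than the data itself need not be stored repeatedly. And because the memory is contiguous, items can be accessed directly by index (random access), which is also much faster.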

>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[1,1]
5
>>> i,j = a[1,1], a[1,1]

# i and j are different objects: each access "assembles" a new object
>>> id(i)
102575536
>>> id(j)
102575584
>>> a[1,1] = 4
>>> i
5
>>> j
5
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  4,  6,  7],
       [ 8,  9, 10, 11]])

# isinstance(val, np.generic) will return True if val is an array scalar object. Alternatively, what kind of array scalar is present can be determined using other members of the data type hierarchy.
>>> isinstance(i, np.generic)
True

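It is instructive to compare ndarray with Python's list. A list can hold objects of different types: strings, ints, tuples and so on can all sit in the same list. A list therefore stores references to objects, and each object is found by following its reference; the objects themselves are not at contiguous memory addresses, as illustrated below.

Compared with ndarray, then, a list needs one extra hop to reach the data: a list gives random access only to the object references, not to the underlying data, so ndarray is much faster than list. As for space, ndarray stores only the tightly packed data, whereas a list must keep every field of every object, so ndarray is also more compact.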
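The space argument can be made concrete with a rough measurement. The sketch below is only an estimate under stated assumptions: sys.getsizeof reports CPython's per-object sizes, which vary between Python versions and platforms, and the values are chosen so that every list element is a distinct int object.

```python
import sys

import numpy as np

n = 1000
values = list(range(1000, 1000 + n))    # 1000 separate Python int objects
arr = np.array(values, dtype=np.int32)  # the same numbers, densely packed

# List cost: the array of references plus every int object it points to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print(arr.itemsize)   # 4 bytes per item
print(arr.nbytes)     # 4000 bytes of raw data
print(list_bytes)     # considerably larger; exact value depends on the interpreter
```

The ndarray figure excludes the fixed-size metadata of the array object itself, which does not grow with the number of items.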

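Summary

To summarize:

The design philosophy of ndarray is to separate the data from its interpretation, so that the vast majority of multidimensional array operations act only on the interpretation;
The data of an ndarray is stored contiguously in memory and is assembled into an object according to the dtype when read, which allows direct access by index and is both fast and compact;
This is possible because ndarray exists to serve matrix computation, and all of its items have the same type.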
