In Python 3, the str type is what used to be Python 2's unicode type, while the old str type became the new bytes type. We could analyze the bytes type, which is essentially what 《Python源码剖析》 covers, but given how heavily str is used and how shallow our understanding of it tends to be, we will instead dissect this considerably more complex type.
As we saw in earlier analysis, Python 2's integer objects are fixed-length objects, while string objects are variable-length objects. A string object is also immutable: once created, its value can never change.
In Python 3, a unicode string comes in one of four forms:
compact ascii
compact
legacy string, not ready
legacy string, ready
compact means that the string object uses a single memory block to store everything: the characters follow the struct directly in memory. For a non-compact object, i.e. a PyUnicodeObject, Python uses one memory block for the PyUnicodeObject struct and a separate block for the characters.
For ASCII-only strings, Python uses PyUnicode_New to create the object and stores it in a PyASCIIObject struct. Because the string is pure ASCII, its UTF-8 encoding is the character data itself; the two are identical, so no separate UTF-8 buffer is needed.
Legacy strings are stored in a PyUnicodeObject.
Let's look at the source first and then continue the discussion.
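These forms can be observed from pure Python. The sketch below uses sys.getsizeof to show that CPython stores each character in 1, 2, or 4 bytes depending on the largest code point in the string. This is CPython-specific behavior; exact header sizes vary between versions, so only the per-character growth is measured:

```python
import sys

def char_width(ch):
    """Bytes of storage per character, measured by growing a string by one char."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

print(char_width('a'))   # ASCII        -> 1 byte per character
print(char_width('é'))   # Latin-1      -> 1 byte per character
print(char_width('中'))  # BMP          -> 2 bytes per character
print(char_width('𝄞'))   # astral plane -> 4 bytes per character
```

Note that an ASCII string and a Latin-1 string both use one byte per character, but differ in header size (PyASCIIObject vs PyCompactUnicodeObject).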
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                   terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                   surrogates count as two code points. */
} PyCompactUnicodeObject;

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;
As you can see, the whole string-object machinery is built on top of PyASCIIObject, so let's start there. length stores the number of code points in the string. hash caches the string's hash value: since string objects are immutable, the hash never changes, so Python stores it here to avoid the performance cost of recomputing it. The state struct records information about the object corresponding to the four string forms introduced above. The wstr field holds the string's wchar_t representation.
What do the fields in the state struct mean? (To save space, their comments were removed.) Let's go through them one by one. interned relates to the string interning mechanism and takes one of three values: SSTATE_NOT_INTERNED (0), SSTATE_INTERNED_MORTAL (1), and SSTATE_INTERNED_IMMORTAL (2), meaning not interned, interned but deletable, and permanently interned, respectively; we will cover the mechanism itself later. kind records how many bytes each character is stored in. compact was explained above, and ascii is self-explanatory. ready indicates whether the object's layout has been initialized: if it is 1, the object is either compact, or its data pointer has already been filled.
As mentioned earlier, an ASCII string is created with PyUnicode_New and stored in a PyASCIIObject. A non-ASCII string created with the same PyUnicode_New is stored in a PyCompactUnicodeObject. A PyUnicodeObject is created via PyUnicode_FromUnicode(NULL, len); its character data initially lives in the wstr block and is later copied into the data block by _PyUnicode_Ready.
Now let's look at PyUnicode_Type:
PyTypeObject PyUnicode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",                      /* tp_name */
    sizeof(PyUnicodeObject),    /* tp_basicsize */
    ……
    unicode_repr,               /* tp_repr */
    &unicode_as_number,         /* tp_as_number */
    &unicode_as_sequence,       /* tp_as_sequence */
    &unicode_as_mapping,        /* tp_as_mapping */
    (hashfunc) unicode_hash,    /* tp_hash */
    ……
};
As you can see, str in Python 3 really is the old unicode.
PyObject *
PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
{
    PyObject *unicode;
    Py_UCS4 maxchar = 0;
    Py_ssize_t num_surrogates;

    if (u == NULL)
        return (PyObject*)_PyUnicode_New(size);

    /* If the Unicode data is known at construction time, we can apply
       some optimizations which share commonly used objects. */

    /* Optimization for empty strings */
    if (size == 0)
        _Py_RETURN_UNICODE_EMPTY();

    /* Single character Unicode objects in the Latin-1 range are
       shared when using this constructor */
    if (size == 1 && (Py_UCS4)*u < 256)
        return get_latin1_char((unsigned char)*u);

    /* If not empty and not single character, copy the Unicode data into
       the new object */
    if (find_maxchar_surrogates(u, u + size,
                                &maxchar, &num_surrogates) == -1)
        return NULL;

    unicode = PyUnicode_New(size - num_surrogates, maxchar);
    if (!unicode)
        return NULL;

    switch (PyUnicode_KIND(unicode)) {
    case PyUnicode_1BYTE_KIND:
        _PyUnicode_CONVERT_BYTES(Py_UNICODE, unsigned char,
                                 u, u + size, PyUnicode_1BYTE_DATA(unicode));
        break;
    case PyUnicode_2BYTE_KIND:
#if Py_UNICODE_SIZE == 2
        memcpy(PyUnicode_2BYTE_DATA(unicode), u, size * 2);
#else
        _PyUnicode_CONVERT_BYTES(Py_UNICODE, Py_UCS2,
                                 u, u + size, PyUnicode_2BYTE_DATA(unicode));
#endif
        break;
    case PyUnicode_4BYTE_KIND:
#if SIZEOF_WCHAR_T == 2
        /* This is the only case which has to process surrogates, thus
           a simple copy loop is not enough and we need a function. */
        unicode_convert_wchar_to_ucs4(u, u + size, unicode);
#else
        assert(num_surrogates == 0);
        memcpy(PyUnicode_4BYTE_DATA(unicode), u, size * 4);
#endif
        break;
    default:
        assert(0 && "Impossible state");
    }

    return unicode_result(unicode);
}
PyObject *
PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
{
    PyObject *obj;
    PyCompactUnicodeObject *unicode;
    void *data;
    enum PyUnicode_Kind kind;
    int is_sharing, is_ascii;
    Py_ssize_t char_size;
    Py_ssize_t struct_size;

    /* Optimization for empty strings */
    if (size == 0 && unicode_empty != NULL) {
        Py_INCREF(unicode_empty);
        return unicode_empty;
    }

    is_ascii = 0;
    is_sharing = 0;
    struct_size = sizeof(PyCompactUnicodeObject);
    if (maxchar < 128) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
        is_ascii = 1;
        struct_size = sizeof(PyASCIIObject);
    }
    else if (maxchar < 256) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
    }
    else if (maxchar < 65536) {
        kind = PyUnicode_2BYTE_KIND;
        char_size = 2;
        if (sizeof(wchar_t) == 2)
            is_sharing = 1;
    }
    else {
        if (maxchar > MAX_UNICODE) {
            PyErr_SetString(PyExc_SystemError,
                            "invalid maximum character passed to PyUnicode_New");
            return NULL;
        }
        kind = PyUnicode_4BYTE_KIND;
        char_size = 4;
        if (sizeof(wchar_t) == 4)
            is_sharing = 1;
    }

    /* Ensure we won't overflow the size. */
    if (size < 0) {
        PyErr_SetString(PyExc_SystemError,
                        "Negative size passed to PyUnicode_New");
        return NULL;
    }
    if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
        return PyErr_NoMemory();

    /* Duplicated allocation code from _PyObject_New() instead of a call to
       PyObject_New() so we are able to allocate space for the object and
       it's data buffer. */
    obj = (PyObject *) PyObject_MALLOC(struct_size + (size + 1) * char_size);
    if (obj == NULL)
        return PyErr_NoMemory();
    obj = PyObject_INIT(obj, &PyUnicode_Type);
    if (obj == NULL)
        return NULL;

    unicode = (PyCompactUnicodeObject *)obj;
    if (is_ascii)
        data = ((PyASCIIObject*)obj) + 1;
    else
        data = unicode + 1;
    _PyUnicode_LENGTH(unicode) = size;
    _PyUnicode_HASH(unicode) = -1;
    _PyUnicode_STATE(unicode).interned = 0;
    _PyUnicode_STATE(unicode).kind = kind;
    _PyUnicode_STATE(unicode).compact = 1;
    _PyUnicode_STATE(unicode).ready = 1;
    _PyUnicode_STATE(unicode).ascii = is_ascii;
    if (is_ascii) {
        ((char*)data)[size] = 0;
        _PyUnicode_WSTR(unicode) = NULL;
    }
    else if (kind == PyUnicode_1BYTE_KIND) {
        ((char*)data)[size] = 0;
        _PyUnicode_WSTR(unicode) = NULL;
        _PyUnicode_WSTR_LENGTH(unicode) = 0;
        unicode->utf8 = NULL;
        unicode->utf8_length = 0;
    }
    else {
        unicode->utf8 = NULL;
        unicode->utf8_length = 0;
        if (kind == PyUnicode_2BYTE_KIND)
            ((Py_UCS2*)data)[size] = 0;
        else /* kind == PyUnicode_4BYTE_KIND */
            ((Py_UCS4*)data)[size] = 0;
        if (is_sharing) {
            _PyUnicode_WSTR_LENGTH(unicode) = size;
            _PyUnicode_WSTR(unicode) = (wchar_t *)data;
        }
        else {
            _PyUnicode_WSTR_LENGTH(unicode) = 0;
            _PyUnicode_WSTR(unicode) = NULL;
        }
    }
#ifdef Py_DEBUG
    unicode_fill_invalid((PyObject*)unicode, 0);
#endif
    assert(_PyUnicode_CheckConsistency((PyObject*)unicode, 0));
    return obj;
}
Let's walk through PyUnicode_FromUnicode first. If the incoming u is a null pointer, _PyUnicode_New(size) is called to return a PyUnicodeObject of the requested size with no value. If size == 0, _Py_RETURN_UNICODE_EMPTY() returns the shared empty string directly. If it is a single-character string in the Latin-1 range, the cached object for that character is returned; much like the small-integer cache described in the previous chapter, there is a character cache here too. Otherwise, a new object is created and the data is copied into it.
The flow of PyUnicode_New is easy to follow: given the size and maxchar, it decides whether to return a PyASCIIObject, a PyCompactUnicodeObject, or a PyUnicodeObject.
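Both caches are visible from Python. Note that identity checks like these rely on CPython implementation details, not on any language guarantee; the strings are built at runtime to avoid the compiler sharing equal constants:

```python
# The empty string and single characters below U+0100 are cached singletons
# in CPython; characters outside the Latin-1 range are created fresh each time.
a = ''.join([])                  # empty string built at runtime
b = 'x'[:0]                      # another empty string built at runtime
print(a is b)                    # True: both are the shared unicode_empty

print(chr(0xFF) is chr(0xFF))    # True: Latin-1 range, get_latin1_char cache
print(chr(0x100) is chr(0x100))  # False: outside the cache, two fresh objects
```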
We mentioned interning earlier: when a new string object is created, if an object with the same value already exists, a reference to that existing object is returned instead of the newly created one. Where does Python look for it? Python maintains a dict-like structure named interned, keyed by string value. Automatic interning is not applied to every string, though; roughly speaking, Python interns strings that follow its identifier naming rules, i.e. strings consisting only of letters, digits, and underscores. The standard library also exposes a function that forces interning on any string: sys.intern(). Here is its documentation:
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
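The effect of sys.intern can be demonstrated directly. The strings below are built at runtime, because equal string constants appearing in the same code object are already shared by the compiler and would mask the effect:

```python
import sys

# Two equal strings built at runtime are distinct objects...
a = '-'.join(['not', 'an', 'identifier'])
b = '-'.join(['not', 'an', 'identifier'])
print(a == b, a is b)                    # True False

# ...but interning maps both to one shared object in the interned table.
print(sys.intern(a) is sys.intern(b))    # True
```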
The concrete mechanism is in the code below:
PyObject *
PyUnicode_InternFromString(const char *cp)
{
    PyObject *s = PyUnicode_FromString(cp);
    if (s == NULL)
        return NULL;
    PyUnicode_InternInPlace(&s);
    return s;
}
void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}
When Python calls PyUnicode_InternFromString, an interned object is returned; the actual work happens in PyUnicode_InternInPlace.
In fact, even when Python interns a string, it first creates a regular string object and only then checks whether an object with the same value already exists. If one does, the object stored in interned is returned, and the newly created one, its reference count having dropped to zero, is reclaimed.
Interned objects fall into two classes: mortal and immortal. The former can still be reclaimed; the latter lives exactly as long as the Python virtual machine itself.
The original 《Python源码剖析》 points out that concatenating strings with + is extremely inefficient, since every concatenation creates a new string object, and recommends the string join method instead. In my own tests on Python 3.6, however, concatenating with + now takes about the same time as join; CPython can resize a string in place during += when no other references to it exist, which plausibly explains this. Of course, this is only a test in one particular environment, and I don't claim a definitive answer.
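A quick way to try this yourself is sketched below. Timings vary across machines and CPython versions, and the in-place += fast path only applies when the variable being grown holds the sole reference to the string, so treat the numbers as indicative only:

```python
import timeit

N = 10_000

def concat_plus():
    s = ''
    for _ in range(N):
        s += 'x'          # CPython may resize s in place when refcount == 1
    return s

def concat_join():
    return ''.join('x' for _ in range(N))

# Both build the same string; compare wall-clock time on your machine.
assert concat_plus() == concat_join()
print('+=  :', timeit.timeit(concat_plus, number=100))
print('join:', timeit.timeit(concat_join, number=100))
```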
In Python 3, str is implemented on top of unicode, which neatly resolves all the messy problems Python 2 had with non-ASCII strings. At the same time, the implementation treats ASCII and non-ASCII strings differently under the hood, and since UTF-8 is backward compatible with ASCII, this balances performance against simplicity. Immutable objects in Python often come with something like an intern mechanism, which spares unnecessary memory consumption, but the real implementation again strikes a balance: interning indiscriminately could incur extra computation and lookups, which would defeat the purpose of the optimization.