Python In-Memory Databases/Engines

Date: 2019-11-26


1 A First Look

In day-to-day development we sometimes need an in-memory database or data engine that lets us perform database operations (such as inserts and queries) in a Pythonic way.

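As a concrete example, insert two records, "a=1, b=1" and "a=1, b=2", into a database db. To query the records with a=1 we would like to write db.query(a=1), which returns both records; to query the records with a=1 and b=2 we would write db.query(a=1, b=2), which returns only the second record.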
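
The desired behaviour can be pinned down with a minimal sketch. ToyDB is a hypothetical name used only for illustration; it is not PyDbLite or any other library, it has no indexes, and it is written for Python 3.

```python
class ToyDB(object):
    """Hypothetical stand-in: stores records as dicts in a list, no indexes."""

    def __init__(self):
        self.rows = []

    def insert(self, **record):
        self.rows.append(record)

    def query(self, **conditions):
        # Keep the rows whose fields equal every given condition.
        return [row for row in self.rows
                if all(row.get(k) == v for k, v in conditions.items())]


db = ToyDB()
db.insert(a=1, b=1)
db.insert(a=1, b=2)
print(db.query(a=1))       # both records
print(db.query(a=1, b=2))  # only the second record
```

Every query here scans all rows, which is exactly the cost that an index is meant to avoid.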

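Is there an existing third-party library that meets this need? After some searching, PyDbLite turned out to fit. Both PyDbLite and the SQLite module that ships with Python (sqlite3) support an in-memory mode; the difference is that the former has a Pythonic API while the latter uses conventional SQL.
Their usage looks like this: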

PyDbLite

import pydblite
# Use an in-memory database
pydb = pydblite.Base(':memory:')
# Create three fields: a, b, c
pydb.create('a', 'b', 'c')
# Create indexes on fields a and b
pydb.create_index('a', 'b')
# Insert one record
pydb.insert(a=-1, b=0, c=1)
# Query the records that match the given conditions
results = pydb(a=-1, b=0)

SQLite

import sqlite3
# Use an in-memory database
con = sqlite3.connect(':memory:')
# Create a table with three columns: a, b, c
cur = con.cursor()
cur.execute('create table test (a char(256), b char(256), c char(256));')
# Create indexes on columns a and b
cur.execute('create index a_index on test(a)')
cur.execute('create index b_index on test(b)')
# Insert one record
cur.execute('insert into test values(?, ?, ?)', (-1, 0, 1))
# Query the records that match the given conditions
cur.execute('select * from test where a=? and b=?', (-1, 0))
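
The SQLite snippet above runs the select but never reads the result. A minimal sketch of the missing step, written for Python 3: the integer column types are a substitution made here so that the stored values come back as Python ints, and the index statements are left out for brevity.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
# Assumption: integer columns instead of the char(256) columns used above.
cur.execute('create table test (a integer, b integer, c integer)')
cur.execute('insert into test values (?, ?, ?)', (-1, 0, 1))
cur.execute('select * from test where a=? and b=?', (-1, 0))
rows = cur.fetchall()   # list of tuples, one tuple per matching row
print(rows)
con.close()
```

Unlike pydblite, which hands back records keyed by field name, sqlite3 returns plain tuples by default.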

2 Performance of pydblite and sqlite

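There is no doubt that pydblite is very Pythonic to use, but how efficient is it? Since we mainly care about insert and query speed, we compare only those two. Here is a simple test script (Python 2 syntax):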

import time
count = 100000

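def timeit(func):
    # Decorator: run func once, then print the elapsed wall-clock seconds
    # followed by the label passed as the `des` keyword argument.
    def wrapper(*args, **kws):
        t = time.time()
        func(*args, **kws)
        print time.time() - t, kws.get('des', '')
    return wrapper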

@timeit
def test_insert(mdb, des=''):
    for i in xrange(count):
        mdb.insert(a=i-1, b=i, c=i+1)

@timeit
def test_query_object(mdb, des=''):
    for i in xrange(count):
        c = mdb(a=i-1, b=i)

@timeit
def test_sqlite_insert(cur, des=''):
    for i in xrange(count):
        cur.execute('insert into test values(?, ?, ?)', (i-1, i, i+1))

@timeit
def test_sqlite_query(cur, des=''):
    for i in xrange(count):
        cur.execute('select * from test where a=? and b=?', (i-1, i))

print '-------pydblite--------'
import pydblite
pydb = pydblite.Base(':memory:')
pydb.create('a', 'b', 'c')
pydb.create_index('a', 'b')
test_insert(pydb, des='insert')
test_query_object(pydb, des='query, object call')


print '-------sqlite3--------'
import sqlite3
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('create table test (a char(256), b char(256), c char(256));')
cur.execute('create index a_index on test(a)')
cur.execute('create index b_index on test(b)')
test_sqlite_insert(cur, des='insert')
test_sqlite_query(cur, des='query')

With indexes created, the times (in seconds) for 100,000 inserts and queries are as follows:

-------pydblite--------
1.14199995995 insert
0.308000087738 query, object call
-------sqlite3--------
0.411999940872 insert
0.30999994278 query

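Without indexes (with the index-creation statements in the test commented out), the times (in seconds) for 10,000 inserts and queries are as follows: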

-------pydblite--------
0.0989999771118 insert
5.15300011635 query, object call
-------sqlite3--------
0.0169999599457 insert
7.43400001526 query

We can readily draw the following conclusions:

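sqlite inserts roughly 3 to 6 times as fast as pydblite (0.41 s vs. 1.14 s with indexes, 0.017 s vs. 0.099 s without). With indexes, sqlite and pydblite query at about the same speed; without indexes, sqlite queries are slower, taking about 1.4 times as long as pydblite (7.43 s vs. 5.15 s).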
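
The ratios behind these conclusions can be recomputed from the timings printed above. The sketch below only copies those numbers into a dictionary (the dictionary layout and key names are introduced here) and divides them; it is written for Python 3.

```python
# Timings in seconds, copied from the two result listings above.
timings = {
    'indexed, 100000 ops': {
        'pydblite': {'insert': 1.14199995995, 'query': 0.308000087738},
        'sqlite3': {'insert': 0.411999940872, 'query': 0.30999994278},
    },
    'no index, 10000 ops': {
        'pydblite': {'insert': 0.0989999771118, 'query': 5.15300011635},
        'sqlite3': {'insert': 0.0169999599457, 'query': 7.43400001526},
    },
}

ratios = {}
for case, t in timings.items():
    for op in ('insert', 'query'):
        # A ratio above 1 means pydblite took longer than sqlite3.
        ratios[(case, op)] = t['pydblite'][op] / t['sqlite3'][op]
        print('%s %s: pydblite/sqlite3 = %.2f' % (case, op, ratios[(case, op)]))
```

Each figure comes from a single run, so the ratios are indicative rather than precise.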

3 Optimization

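Our goal is clear: a Pythonic in-memory database with faster inserts and queries, with no concern for persistence. Can we keep pydblite's Pythonic interface while matching whichever of pydblite and sqlite is faster at inserts and at queries? With that goal in mind, let us see whether pydblite can be optimized.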

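Reading the pydblite source, the first thing that stands out is a simple split between Python 2 and Python 3. The public Base is either _BasePy2 or _BasePy3; they differ only slightly in __iter__, and both derive from the _Base class, which does the real work.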

class _BasePy2(_Base):

    def __iter__(self):
        """Iteration on the records"""
        return iter(self.records.itervalues())


class _BasePy3(_Base):

    def __iter__(self):
        """Iteration on the records"""
        return iter(self.records.values())

if sys.version_info[0] == 2:
    Base = _BasePy2
else:
    Base = _BasePy3

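Next, look at the constructor of _Base. It does some simple file initialization; since we only use the in-memory database, everything file-related can be discarded.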

class _Base(object):

    def __init__(self, path, protocol=pickle.HIGHEST_PROTOCOL, save_to_file=True,
                 sqlite_compat=False):
        """protocol as defined in pickle / pickle.
        Defaults to the highest protocol available.
        For maximum compatibility use protocol = 0

        """
        self.path = path
        """The path of the database in the file system"""
        self.name = os.path.splitext(os.path.basename(path))[0]
        """The basename of the path, stripped of its extension"""
        self.protocol = protocol
        self.mode = None
        if path == ":memory:":
            save_to_file = False
        self.save_to_file = save_to_file
        self.sqlite_compat = sqlite_compat
        self.fields = []
        """The list of the fields (does not include the internal
        fields __id__ and __version__)"""
        # if base exists, get field names
        if save_to_file and self.exists():
            if protocol == 0:
                _in = open(self.path)  # don't specify binary mode !
            else:
                _in = open(self.path, 'rb')
            self.fields = pickle.load(_in)

The next important pieces are two functions, create (creates the fields) and create_index (creates the indexes):

    def create(self, *fields, **kw):
        """
        Create a new base with specified field names.

        Args:
            - \*fields (str): The field names to create.
            - mode (str): the mode used when creating the database.

        - if mode = 'create' : create a new base (the default value)
        - if mode = 'open' : open the existing base, ignore the fields
        - if mode = 'override' : erase the existing base and create a
          new one with the specified fields

        Returns:
            - the database (self).
        """
        self.mode = kw.get("mode", 'create')
        if self.save_to_file and os.path.exists(self.path):
            if not os.path.isfile(self.path):
                raise IOError("%s exists and is not a file" % self.path)
            elif self.mode is 'create':
                raise IOError("Base %s already exists" % self.path)
            elif self.mode == "open":
                return self.open()
            elif self.mode == "override":
                os.remove(self.path)
            else:
                raise ValueError("Invalid value given for 'open': '%s'" % open)

        self.fields = []
        self.default_values = {}
        for field in fields:
            if type(field) is dict:
                self.fields.append(field["name"])
                self.default_values[field["name"]] = field.get("default", None)
            elif type(field) is tuple:
                self.fields.append(field[0])
                self.default_values[field[0]] = field[1]
            else:
                self.fields.append(field)
                self.default_values[field] = None

        self.records = {}
        self.next_id = 0
        self.indices = {}
        self.commit()
        return self

    def create_index(self, *fields):
        """
        Create an index on the specified field names

        An index on a field is a mapping between the values taken by the field
        and the sorted list of the ids of the records whose field is equal to
        this value

        For each indexed field, an attribute of self is created, an instance
        of the class Index (see above). Its name it the field name, with the
        prefix _ to avoid name conflicts

        Args:
            - fields (list): the fields to index
        """
        reset = False
        for f in fields:
            if f not in self.fields:
                raise NameError("%s is not a field name %s" % (f, self.fields))
            # initialize the indices
            if self.mode == "open" and f in self.indices:
                continue
            reset = True
            self.indices[f] = {}
            for _id, record in self.records.items():
                # use bisect to quickly insert the id in the list
                bisect.insort(self.indices[f].setdefault(record[f], []), _id)
            # create a new attribute of self, used to find the records
            # by this index
            setattr(self, '_' + f, Index(self, f))
        if reset:
            self.commit()

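As we can see, pydblite keeps a dictionary named records in memory to hold the individual records. Its keys are internally maintained ids that auto-increment from 0, and its values are the records inserted by the user; for convenience in later queries and bookkeeping, each record additionally carries __id__ and __version__. The internally maintained indices dictionary is the index table: its keys are field names, and each value is itself a dictionary whose keys are all the known values of that field and whose values are the sorted lists of ids of the records holding that value.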

For example, suppose we insert two records, "a=-1,b=0,c=1" and "a=0,b=1,c=2". The contents of records and indices would then be:

# records
{0: {'__id__': 0, '__version__': 0, 'a': -1, 'b': 0, 'c': 1},
 1: {'__id__': 1, '__version__': 0, 'a': 0, 'b': 1, 'c': 2}}

# indices
{'a': {-1: [0], 0: [1]}, 'b': {0: [0], 1: [1]}}

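Say we now want to find the records with a=0. pydblite looks up the value for key 'a' in indices, which is {-1: [0], 0: [1]}, and then looks up the value for key 0 in that dictionary, which is [1]. This tells us that the id of the record we want is 1 (there may be more than one id). If we have further conditions, such as a=0,b=1, it repeats the lookup for each condition, finds the ids matching a=0 and those matching b=1, intersects them to get the ids satisfying both conditions, and finally fetches the corresponding records from records by id.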
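
The lookup-then-intersect procedure can be sketched directly on the example data. The lookup function below is a hypothetical helper written for illustration (Python 3), not pydblite's actual __call__ implementation, and it assumes every queried field is indexed.

```python
records = {
    0: {'__id__': 0, '__version__': 0, 'a': -1, 'b': 0, 'c': 1},
    1: {'__id__': 1, '__version__': 0, 'a': 0, 'b': 1, 'c': 2},
}
indices = {'a': {-1: [0], 0: [1]}, 'b': {0: [0], 1: [1]}}


def lookup(**conditions):
    """Return the records matching all conditions, using only the indexes."""
    ids = None
    for field, value in conditions.items():
        # Ids of the records whose `field` equals `value` (empty if unknown).
        matched = set(indices[field].get(value, []))
        # Intersect with the ids that satisfied the previous conditions.
        ids = matched if ids is None else ids & matched
    return [records[_id] for _id in sorted(ids or [])]


print(lookup(a=0))        # the record with id 1
print(lookup(a=0, b=1))   # still the record with id 1
print(lookup(a=0, b=0))   # no record satisfies both conditions
```

Each condition costs one dictionary lookup plus one set intersection, which is why indexed queries stay fast as the table grows.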

Now that we understand how it works, let us see what can be optimized:

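Data structures. The overall records and indices structures are already quite lean and need no changes for now. The __version__ field can be dropped, because we do not care how many times a record has been modified. Also, the ids stored at the leaves of indices are kept in a list, which is relatively slow for queries and inserts; since the internally maintained ids are guaranteed to be unique, a set is a better fit here.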
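
A small sketch of the two index layouts side by side, for Python 3. The names list_index and set_index are hypothetical, and the sample values are made up for illustration.

```python
import bisect

list_index = {}   # value -> sorted list of ids (the layout described above)
set_index = {}    # value -> set of ids (the proposed change)

for _id, value in enumerate(['x', 'y', 'x', 'x']):
    # The list layout keeps the ids sorted, so each insert shifts elements.
    bisect.insort(list_index.setdefault(value, []), _id)
    # The set layout just adds the id.
    set_index.setdefault(value, set()).add(_id)

other = {2, 3, 5}   # ids matched by some other condition
from_list = set(list_index['x']) & other   # list must be converted first
from_set = set_index['x'] & other          # intersects directly
print(list_index['x'], from_list, from_set)
```

Both layouts give the same answer; the set version skips the sorted insert and the list-to-set conversion on every query.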

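Python statements. To stay compatible with both Python 2 and Python 3, _Base has to use statements that both versions support, which wastes time on a specific version. For example, where d is a dictionary, the author writes statements like for key in d.keys() for the sake of compatibility; in Python 2, d.keys() first builds a list, so d.iterkeys() is the wiser choice. Similarly, the author uses statements like set(d.keys()) - set([1]), but in Python 2 d.viewkeys() - set([1]) is more efficient because it does not need to convert a list into a set.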
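
For comparison, here is the same idea on Python 3, where d.keys() already returns a view like Python 2's d.viewkeys(). This is a sketch with a made-up dictionary; the Python 2 spellings discussed above (iterkeys, viewkeys) do not exist on Python 3.

```python
d = {1: 'one', 2: 'two', 3: 'three'}

via_copy = set(d.keys()) - {1}   # copies every key into a new set first
via_view = d.keys() - {1}        # set operation directly on the key view

visited = []
for key in d:                    # iterates the keys without building a list
    visited.append(key)

print(via_copy, via_view, visited)
```

Both expressions produce the same set; the view version avoids the intermediate copy.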

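We will not list every version-specific optimization. In short, pydblite can be optimized further in terms of data structures, Python statements, and dropping features that are not needed. Only create and create_index were discussed above; the optimizations for insert and __call__ are very similar. In addition, replacing a magic method with an ordinary method gives a small speedup, so __call__ was rewritten as query in the optimized version.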
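
Whether an ordinary method beats __call__ on a given interpreter is easy to measure. The sketch below uses a hypothetical Lookup class (not MemLite code) and the standard timeit module on Python 3; the difference may be small and will vary between Python versions, so no expected numbers are given.

```python
import timeit


class Lookup(object):
    """Hypothetical class exposing the same lookup two ways."""

    def __init__(self):
        self.data = {i: i * i for i in range(100)}

    def query(self, key):
        return self.data.get(key)

    def __call__(self, key):
        return self.data.get(key)


obj = Lookup()
# Time 200,000 calls through each spelling.
t_call = timeit.timeit(lambda: obj(7), number=200000)
t_method = timeit.timeit(lambda: obj.query(7), number=200000)
print('__call__: %.4f s, query(): %.4f s' % (t_call, t_method))
```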

For the optimized code, see MemLite.

4 Performance of memlite, pydblite and sqlite

Let us add a memlite test to the test code above:

@timeit
def test_query_method(mdb, des=''):
    for i in xrange(count):
        c = mdb.query(a=i-1, b=i)

print '-------memlite-------'
import memlite
db = memlite.Base()
db.create('a', 'b', 'c')
db.create_index('a', 'b')
test_insert(db, des='insert')
test_query_method(db, des='query, method call')

With indexes created, the times (in seconds) for 100,000 inserts and queries are as follows:

-------memlite-------
0.378000020981 insert
0.285000085831 query, method call
-------pydblite--------
1.3140001297 insert
0.309000015259 query, object call
-------sqlite3--------
0.414000034332 insert
0.3109998703 query

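Without indexes (with the index-creation statements in the test commented out), the times (in seconds) for 10,000 inserts and queries are as follows: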

-------memlite-------
0.0179998874664 insert
5.90199995041 query, method call
-------pydblite--------
0.0980000495911 insert
4.87400007248 query, object call
-------sqlite3--------
0.0170001983643 insert
7.42399978638 query

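As we can see, with indexes memlite beats both sqlite and pydblite on inserts and queries. Without indexes, memlite inserts about as fast as sqlite and faster than pydblite, while its queries are slightly slower than pydblite's but faster than sqlite's. Overall, memlite has pydblite's Pythonic interface together with roughly the speed of whichever of pydblite and sqlite is faster, which meets the optimization goal.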

When reposting, please credit the source: http://www.cnblogs.com/dreamlofter/p/5843355.html Thank you!