TFRecord handling for NumPy data

Original article: https://blog.csdn.net/songbinxu/article/details/80136836

TensorFlow data I/O: saving NumPy arrays as TFRecord files and reading them back

There are three ways to read data when training a model with TensorFlow:

  • Feed NumPy data from memory into a placeholder at every epoch/batch. This only suits small datasets and is quite memory-hungry.
  • Read txt or csv files from disk; the I/O operations are relatively slow.
  • Read TFRecord files, the format TensorFlow recommends.

  A TFRecord file is a binary format that stores data and labels together; it uses memory more efficiently and can be copied, moved, and read faster within a TensorFlow graph. A TFRecord file contains tf.train.Example protocol buffers: the data is first serialized into strings and filled into the protocol buffer, and a TFRecordWriter then writes it to the TFRecord file.
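
To make the serialization step concrete, here is a minimal round-trip sketch, not from the original article: tf.train.Example is an ordinary protocol buffer message, so the generated FromString classmethod is available on it.

import numpy as np
import tensorflow as tf

# build one Example holding a float vector (as raw bytes) and an integer label
vec = np.arange(3, dtype=np.float32)
example = tf.train.Example(features=tf.train.Features(feature={
    "data": tf.train.Feature(bytes_list=tf.train.BytesList(value=[vec.tostring()])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

serialized = example.SerializeToString()            # Example -> bytes
restored = tf.train.Example.FromString(serialized)  # bytes -> Example
print(restored.features.feature["label"].int64_list.value)  # [1]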


Saving NumPy data to TFRecord

import numpy as np
import tensorflow as tf

def save_tfrecords(data, label, desfile):
    with tf.python_io.TFRecordWriter(desfile) as writer:
        for i in range(len(data)):
            features = tf.train.Features(
                feature = {
                    # the dtype used here must match the dtype passed to
                    # tf.decode_raw when the file is read back
                    "data":tf.train.Feature(bytes_list = tf.train.BytesList(value = [data[i].astype(np.float32).tostring()])),
                    "label":tf.train.Feature(int64_list = tf.train.Int64List(value = [label[i]]))
                }
            )
            example = tf.train.Example(features = features)
            serialized = example.SerializeToString()
            writer.write(serialized)

Usage example

For example, let's store a dataset of 10 samples of unequal lengths, together with their labels, in a TFRecord file.

# Pad variable-length samples with zeros up to a fixed length
def padding(data, maxlen=10):
    for i in range(len(data)):
        data[i] = np.hstack([data[i], np.zeros((maxlen-len(data[i])))])

lens = np.random.randint(low=3,high=10,size=(10,))
data = [np.arange(l) for l in lens]
padding(data)
label = [0,0,0,0,0,1,1,1,1,1]

save_tfrecords(data, label, "./data.tfrecords")
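
As an aside beyond the original article: variable-length samples can also be stored without padding, by writing the floats directly into a tf.train.FloatList and parsing them with tf.VarLenFeature. A minimal sketch of this alternative (the function names are hypothetical):

# store variable-length float samples directly, no padding required
def save_tfrecords_varlen(data, label, desfile):
    with tf.python_io.TFRecordWriter(desfile) as writer:
        for i in range(len(data)):
            features = tf.train.Features(feature = {
                "data": tf.train.Feature(float_list = tf.train.FloatList(value = data[i].tolist())),
                "label": tf.train.Feature(int64_list = tf.train.Int64List(value = [label[i]]))
            })
            writer.write(tf.train.Example(features = features).SerializeToString())

def _parse_varlen(example_proto):
    features = {"data": tf.VarLenFeature(tf.float32),
                "label": tf.FixedLenFeature((), tf.int64)}
    parsed = tf.parse_single_example(example_proto, features)
    # VarLenFeature yields a SparseTensor; convert to dense before use
    return tf.sparse_tensor_to_dense(parsed["data"]), parsed["label"]

Note that batching samples of different lengths then requires Dataset.padded_batch rather than Dataset.batch.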

Reading TFRecord back into NumPy

def _parse_function(example_proto):
    # the feature spec and the decode dtype must match what save_tfrecords wrote
    features = {"data": tf.FixedLenFeature((), tf.string),
                "label": tf.FixedLenFeature((), tf.int64)}
    parsed_features = tf.parse_single_example(example_proto, features)
    data = tf.decode_raw(parsed_features['data'], tf.float32)
    return data, parsed_features["label"]

def load_tfrecords(srcfile):
    sess = tf.Session()

    dataset = tf.data.TFRecordDataset(srcfile) # load tfrecord file
    dataset = dataset.map(_parse_function) # parse data into tensor
    dataset = dataset.repeat(2) # repeat for 2 epochs
    dataset = dataset.batch(5) # set batch_size = 5

    iterator = dataset.make_one_shot_iterator()
    next_data = iterator.get_next()

    while True:
        try:
            data, label = sess.run(next_data)
            print(data)
            print(label)
        except tf.errors.OutOfRangeError:
            break

Usage example

load_tfrecords(srcfile="./data.tfrecords")

Output

# 10 samples, 2 epochs, equivalent to 20 samples in total, 5 samples per batch
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]]
[0 0 0 0 0]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 0. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 0. 0. 0.]]
[1 1 1 1 1]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]]
[0 0 0 0 0]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 0. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 0. 0. 0.]]
[1 1 1 1 1]
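
Note that the batches above come out in file order. If random batches are wanted, the Dataset API's shuffle transformation can be inserted before batching; a minimal sketch (the buffer size of 10 is an arbitrary choice for this toy dataset):

dataset = tf.data.TFRecordDataset("./data.tfrecords")
dataset = dataset.map(_parse_function)
dataset = dataset.shuffle(buffer_size=10)  # draw samples randomly from a 10-element buffer
dataset = dataset.repeat(2)
dataset = dataset.batch(5)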

Training from TFRecord with the Dataset API

Preparing the dataset

To run a real training example, we use the iris dataset here, stored as a TFRecord file.

import time
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
label = iris.target
save_tfrecords(data, label, "./iris.tfrecord")
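
As a quick sanity check, not in the original article: in TF 1.x you can iterate over the serialized records directly, which confirms that one record per iris sample was written.

n_records = sum(1 for _ in tf.python_io.tf_record_iterator("./iris.tfrecord"))
print(n_records)  # expect 150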

Designing the model

Here we simply use a two-layer neural network with relu as the activation function.

def model_function(X=None, Y=None):
    # data & label
    if X is None or Y is None:
        X = tf.placeholder(tf.float32, [None, 4])
        Y = tf.placeholder(tf.int64, [None,])

    # params
    W1 = tf.Variable(tf.random_normal([4,32], 0.0, 0.01))
    b1 = tf.Variable(tf.zeros([32,]))
    W2 = tf.Variable(tf.random_normal([32,3], 0.0, 0.01))
    b2 = tf.Variable(tf.zeros([3,]))

    # transform
    H1 = tf.nn.relu(tf.matmul(X, W1) + b1)
    H2 = tf.nn.relu(tf.matmul(H1, W2) + b2)

    cross_entropy = tf.losses.sparse_softmax_cross_entropy(Y, H2)

    return X, Y, cross_entropy

Conventional training

  First, here is the approach of reading each batch from memory and feeding it into the network's placeholders. Its drawback is higher memory consumption, but in theory it should be faster, since no extra I/O operations are needed.

def common_training():

    iris = load_iris()
    data = iris.data
    label = iris.target

    with tf.Session() as sess:
        X,Y,loss = model_function()
        training_op = tf.train.AdamOptimizer().minimize(loss)
        tf.global_variables_initializer().run()

        start = time.time()
        for epoch in range(1000):
            S = 0
            for batch in range(3):
                index = range(batch*50, (batch+1)*50)
                batch_x, batch_y = data[index], label[index]
                L, _ = sess.run([loss, training_op], feed_dict={X:batch_x, Y:batch_y})
                S += L
            if epoch % 100 == 0:
                print(S / 3.0, len(index), len(batch_x))
        print(time.time() - start, 's')

Reading data from TFRecord and feeding it into the model for training

  Initialize a tf.data.TFRecordDataset object with the TFRecord file, set the batch size and number of epochs, and during training just run(loss); the data advances batch by batch automatically. Note that an error is raised when the file queue reaches its end, so catch tf.errors.OutOfRangeError to avoid crashing.

def tfrecord_training():
    sess = tf.Session()
    iris = tf.data.TFRecordDataset("./iris.tfrecord")
    iris = iris.map(_parse_function)
    iris = iris.batch(50)
    iris = iris.repeat(1000)

    iterator = iris.make_one_shot_iterator()
    next_example, next_label = iterator.get_next()

    _, _, loss = model_function(next_example, next_label)
    training_op = tf.train.AdamOptimizer().minimize(loss)

    sess.run(tf.global_variables_initializer()) # must initialize

    start = time.time()
    for epoch in range(1000):
        S = 0
        for batch in range(3):
            try:
                L, _ = sess.run([loss, training_op])
            except tf.errors.OutOfRangeError:
                break
            S += L
        if epoch % 100 == 0:
            print(S, S / 3.0)
    print(time.time() - start, 's')

Speed comparison

  common_training takes about 4 seconds, tfrecord_training about 8 seconds.
  I had expected the Dataset approach to be faster, but it is actually slower, and the slowdown shows up in each batch's training step.
  My guess is that Dataset.map() and Dataset.batch() only register function interfaces, so a map computation runs for every batch, which slows things down; the file may not even be read in up front, in which case I/O time is added as well, whereas common_training simply pulls data from memory.
  Seen this way, the advantage of tfrecord_training is that the data need not be loaded into memory in advance, which is better for training on large datasets; the trade-off is processing time.
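
If the per-batch parsing cost really is the bottleneck, the Dataset API itself offers some mitigation (available from roughly TF 1.4; whether it closes the gap here is untested): parse records in parallel, cache the parsed tensors after the first pass, and prefetch batches so preparation overlaps with training. A sketch of the same pipeline with those knobs turned on:

iris = tf.data.TFRecordDataset("./iris.tfrecord")
iris = iris.map(_parse_function, num_parallel_calls=2)  # parse records in parallel
iris = iris.cache()      # keep parsed tensors in memory after the first epoch
iris = iris.batch(50)
iris = iris.repeat(1000)
iris = iris.prefetch(1)  # prepare the next batch while the current one trains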