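Original link: https://blog.csdn.net/songbinxu/article/details/80136836
There are three ways to read data when training a model with TensorFlow. This post focuses on reading from TFRecord files and compares it with feeding data from memory.
A TFRecord file is a binary file that stores data and labels together. It makes better use of memory and can be copied, moved, and read faster inside a TensorFlow graph. A TFRecord file contains tf.train.Example protocol buffers: you first serialize the data into a byte string, fill it into the protocol buffer, and then write it to the TFRecord file with TFRecordWriter.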
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

def save_tfrecords(data, label, desfile):
    """Write each (data[i], label[i]) pair to desfile as a tf.train.Example."""
    with tf.python_io.TFRecordWriter(desfile) as writer:
        for i in range(len(data)):
            features = tf.train.Features(
                feature={
                    # the sample is stored as the raw bytes of a float64 array
                    "data": tf.train.Feature(
                        bytes_list=tf.train.BytesList(
                            value=[data[i].astype(np.float64).tobytes()])),
                    "label": tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[int(label[i])])),
                }
            )
            example = tf.train.Example(features=features)
            serialized = example.SerializeToString()
            writer.write(serialized)
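Because the sample is stored as raw bytes, the reader has to decode it with the same dtype the writer used. The following is a minimal NumPy-only sketch of that round trip. It does not touch TFRecord files at all, and the names `sample`, `raw`, `same_dtype`, and `wrong_dtype` are illustrative, not part of the original post.

```python
import numpy as np

# A small sample, converted to bytes the same way save_tfrecords does it.
sample = np.arange(5)
raw = sample.astype(np.float64).tobytes()

# Decoding with the dtype used for writing recovers the original values.
same_dtype = np.frombuffer(raw, dtype=np.float64)

# Decoding with a narrower dtype reinterprets the same bytes:
# twice as many elements come back, and they no longer mean anything.
wrong_dtype = np.frombuffer(raw, dtype=np.float32)

print(len(raw))          # 5 values * 8 bytes each = 40
print(same_dtype)
print(len(wrong_dtype))
```

The same reasoning applies when the bytes are decoded inside a TensorFlow graph: the dtype passed to the decoding op should match the one used in `astype(...)` at write time.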
For example, we take a dataset of 10 samples with varying dimensions and store it, together with the labels, in a tfrecord file.
# Zero-pad variable-length samples to a fixed length (modifies data in place)
def padding(data, maxlen=10):
    for i in range(len(data)):
        data[i] = np.hstack([data[i], np.zeros(maxlen - len(data[i]))])

lens = np.random.randint(low=3, high=10, size=(10,))
data = [np.arange(l) for l in lens]
padding(data)
label = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
save_tfrecords(data, label, "./data.tfrecords")
def _parse_function(example_proto):
    features = {"data": tf.FixedLenFeature((), tf.string),
                "label": tf.FixedLenFeature((), tf.int64)}
    parsed_features = tf.parse_single_example(example_proto, features)
    # The bytes were written as float64, so decode with the same dtype,
    # then cast to float32 for the model.
    data = tf.decode_raw(parsed_features["data"], tf.float64)
    data = tf.cast(data, tf.float32)
    return data, parsed_features["label"]

def load_tfrecords(srcfile):
    sess = tf.Session()
    dataset = tf.data.TFRecordDataset(srcfile)  # load tfrecord file
    dataset = dataset.map(_parse_function)      # parse data into tensor
    dataset = dataset.repeat(2)                 # repeat for 2 epochs
    dataset = dataset.batch(5)                  # set batch_size = 5
    iterator = dataset.make_one_shot_iterator()
    next_data = iterator.get_next()
    while True:
        try:
            data, label = sess.run(next_data)
            print(data)
            print(label)
        except tf.errors.OutOfRangeError:
            # raised once the dataset is exhausted
            break

load_tfrecords(srcfile="./data.tfrecords")
# 10 samples, 2 epochs, equivalent to 20 samples; each batch holds 5 samples
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]]
[0 0 0 0 0]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 0. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 0. 0. 0.]]
[1 1 1 1 1]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]]
[0 0 0 0 0]
[[0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 0. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 0. 0. 0. 0. 0.]
 [0. 1. 2. 3. 4. 5. 6. 0. 0. 0.]]
[1 1 1 1 1]
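The batch count in this output follows from simple arithmetic: 10 samples repeated for 2 epochs give 20 samples, and a batch size of 5 gives 4 batches. The sketch below imitates "repeat, then batch" with plain Python generators to make the ordering visible. The helpers `repeat` and `batch` are hypothetical stand-ins written for this illustration, not the tf.data implementation.

```python
def repeat(items, count):
    """Yield every item of `items`, `count` times over."""
    for _ in range(count):
        for item in items:
            yield item

def batch(items, size):
    """Group consecutive items into lists of at most `size` elements."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == size:
            yield current
            current = []
    if current:  # final partial batch, if any
        yield current

labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
batches = list(batch(repeat(labels, 2), 5))
for b in batches:
    print(b)
```

The label batches alternate between all zeros and all ones, which matches the label rows printed above, because the samples are read in file order without shuffling.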
To run an actual training job, we use the iris dataset and store it as a TFRecord file.
from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
label = iris.target  # the class labels are stored in `target`
save_tfrecords(data, label, "./iris.tfrecord")
Here we simply use a two-layer neural network with ReLU as the activation function.
def model_function(X=None, Y=None):
    # data & label: fall back to placeholders when no tensors are passed in
    if X is None or Y is None:
        X = tf.placeholder(tf.float32, [None, 4])
        Y = tf.placeholder(tf.int64, [None,])
    # params
    W1 = tf.Variable(tf.random_normal([4, 32], 0.0, 0.01))
    b1 = tf.Variable(tf.zeros([32,]))
    W2 = tf.Variable(tf.random_normal([32, 3], 0.0, 0.01))
    b2 = tf.Variable(tf.zeros([3,]))
    # transform
    H1 = tf.nn.relu(tf.matmul(X, W1) + b1)
    H2 = tf.nn.relu(tf.matmul(H1, W2) + b2)
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(Y, H2)
    return X, Y, cross_entropy
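To make the shapes and the loss concrete, here is a rough NumPy sketch of the same forward pass. It is an illustration only: the function name `forward_loss`, the random inputs, and the labels are all made up for the example, and it assumes the loss is the mean softmax cross-entropy over the batch. With weights initialized near zero, the loss should start close to ln 3, the value for a uniform guess over 3 classes.

```python
import numpy as np

def forward_loss(X, Y, W1, b1, W2, b2):
    """Two-layer ReLU network followed by mean sparse softmax cross-entropy."""
    H1 = np.maximum(X @ W1 + b1, 0.0)              # hidden layer, shape (n, 32)
    H2 = np.maximum(H1 @ W2 + b2, 0.0)             # output layer, shape (n, 3)
    shifted = H2 - H2.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of the true class for every sample
    return -log_probs[np.arange(len(Y)), Y].mean()

rng = np.random.RandomState(0)
X = rng.rand(6, 4)                   # 6 made-up samples with 4 features
Y = np.array([0, 1, 2, 0, 1, 2])     # made-up class labels
W1 = rng.normal(0.0, 0.01, (4, 32))
b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.01, (32, 3))
b2 = np.zeros(3)

loss = forward_loss(X, Y, W1, b1, W2, b2)
print(loss, np.log(3))
```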
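First, here is a method that reads each batch from memory and feeds it to the network's placeholders for training. Its drawback is higher memory consumption, but in theory it should be faster because it needs no extra I/O.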
import time

def common_training():
    iris = load_iris()
    data = iris.data
    label = iris.target
    with tf.Session() as sess:
        X, Y, loss = model_function()
        training_op = tf.train.AdamOptimizer().minimize(loss)
        tf.global_variables_initializer().run()
        start = time.time()
        for epoch in range(1000):
            S = 0
            for batch in range(3):
                # 150 samples split into 3 batches of 50
                index = list(range(batch * 50, (batch + 1) * 50))
                batch_x, batch_y = data[index], label[index]
                L, _ = sess.run([loss, training_op],
                                feed_dict={X: batch_x, Y: batch_y})
                S += L
            if epoch % 100 == 0:
                print(S / 3.0, len(index), len(batch_x))
        print(time.time() - start, 's')
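Next, initialize a tf.data.TFRecordDataset object from the TFRecord file and set the batch size and the number of epochs. During training you only need to run(loss), and the data advances to the next batch automatically. Note that an error is raised when the dataset is exhausted, so catch tf.errors.OutOfRangeError with an except clause.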
def tfrecord_training():
    sess = tf.Session()
    iris = tf.data.TFRecordDataset("./iris.tfrecord")
    iris = iris.map(_parse_function)
    iris = iris.batch(50)
    iris = iris.repeat(1000)
    iterator = iris.make_one_shot_iterator()
    next_example, next_label = iterator.get_next()
    _, _, loss = model_function(next_example, next_label)
    training_op = tf.train.AdamOptimizer().minimize(loss)
    sess.run(tf.global_variables_initializer())  # must initialize
    start = time.time()
    for epoch in range(1000):
        S = 0
        for batch in range(3):
            try:
                L, _ = sess.run([loss, training_op])
            except tf.errors.OutOfRangeError:
                break
            S += L
        if epoch % 100 == 0:
            print(S, S / 3.0)
    print(time.time() - start, 's')
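common_training takes 4 seconds, while tfrecord_training takes 8 seconds.
I had expected the Dataset approach to be somewhat faster, but it is actually slower, and the slowdown shows up in the training of every single batch.
My guess is that Dataset.map() and Dataset.batch() only register a function interface, so the map operation is executed again for every batch, which slows things down. It may even be that the file is not read in advance at all, in which case I/O time is added on top, whereas common_training simply fetches data from memory.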
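The guess above describes lazy evaluation: building the pipeline does no work, and the parse step runs only when a batch is requested. The following pure-Python generator is an analogy for that behavior, not a description of how tf.data is implemented internally. The names `parse`, `lazy_pipeline`, and `calls` are hypothetical and exist only for this illustration.

```python
calls = []

def parse(record):
    """Stand-in for a per-record parse function; remembers each call."""
    calls.append(record)
    return record * 2

def lazy_pipeline(records, batch_size):
    """Parse records only as batches are requested."""
    current = []
    for record in records:
        current.append(parse(record))
        if len(current) == batch_size:
            yield current
            current = []
    if current:  # final partial batch, if any
        yield current

pipeline = lazy_pipeline(range(10), batch_size=5)
calls_before = len(calls)         # building the pipeline parses nothing
first = next(pipeline)
calls_after_first = len(calls)    # only the first batch has been parsed

print(calls_before, calls_after_first, first)
```

If a pipeline behaves this way, the parsing cost is paid again on every pass over the data, which would be consistent with the per-batch slowdown measured above.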
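Seen this way, the advantage of tfrecord_training is that the data does not have to be loaded into memory beforehand, which is better for training on large datasets. The price is extra processing time.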