After covering the dataset in "TensorFlow image captioning implementation (1): the flickr30k dataset", the next step is to parse the token file and do a first pass of preprocessing on the data.
Since each entry is a full-sentence description, we need to tokenize it, count the frequency of every word, and store the word/frequency pairs in a file. This file serves two purposes: a word's position in the file gives it an integer id for the model, and the stored frequencies let us filter out rare words later, mapping them to the shared <UNK> token.
```python
import pprint

input_description_file = "./data/results_20130124.token"
output_vocab_file = "./data/vocab.txt"

def count_vocab(input_description_file):
    with open(input_description_file) as f:
        lines = f.readlines()
    max_length_of_sentences = 0  # length of the longest sentence seen
    length_dict = {}  # sentence-length histogram {length: number of sentences}
    vocab_dict = {}   # vocabulary {word: frequency}
    for line in lines:
        image_id, description = line.strip('\n').split('\t')
        words = description.strip(' ').split()  # tokenize on whitespace
        # words looks like ['Two', 'young', 'guys', 'with', 'shaggy', 'hair', ...]
        max_length_of_sentences = max(max_length_of_sentences, len(words))
        length_dict.setdefault(len(words), 0)
        length_dict[len(words)] += 1
        # update the word-frequency table
        for word in words:
            vocab_dict.setdefault(word, 0)
            vocab_dict[word] += 1
    print(max_length_of_sentences)
    pprint.pprint(length_dict)
    return vocab_dict

vocab_dict = count_vocab(input_description_file)
# sort the vocabulary by frequency, highest first
sorted_vocab_dict = sorted(vocab_dict.items(), key=lambda d: d[1], reverse=True)

with open(output_vocab_file, 'w') as f:
    f.write('<UNK>\t1000000\n')
    for item in sorted_vocab_dict:
        f.write('%s\t%d\n' % item)
```
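The printed `length_dict` is what you would consult when choosing a caption-length cap for later batching. As a minimal sketch of that idea (the `length_for_coverage` helper and the 0.99 coverage target are illustrative assumptions, not part of the original code, and `count_vocab` would have to be changed to also return `length_dict` for the commented-out usage to work):

```python
def length_for_coverage(length_dict, coverage=0.99):
    """Smallest length L such that at least `coverage` of all
    sentences have length <= L (hypothetical helper)."""
    total = sum(length_dict.values())
    covered = 0
    for length in sorted(length_dict):
        covered += length_dict[length]
        if covered / total >= coverage:
            return length
    return max(length_dict)

# e.g. cap captions at a length that covers 99% of the data:
# max_caption_length = length_for_coverage(length_dict, 0.99)
```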
The generated vocab file looks like this:
```
<UNK>	1000000
a	181627
.	151039
A	90071
in	83224
the	57402
on	45538
and	44253
is	41108
man	40277
of	38773
with	36171
,	25285
woman	21236
are	20189
to	17603
Two	16446
at	16157
wearing	15694
people	14148
white	13039
shirt	12975
black	12084
young	12021
while	11650
his	11489
blue	11268
an	11119
red	9857
sitting	9608
...
```
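To make the two purposes of this file concrete, here is a minimal sketch of how `vocab.txt` might be consumed later. The `Vocab` class and the `word_num_threshold` parameter are assumptions for illustration, not taken from the original post; only the file format (word, tab, frequency) comes from the code above.

```python
class Vocab(object):
    """Hypothetical loader for vocab.txt: maps words to integer ids.

    Words whose frequency falls below word_num_threshold are skipped,
    so at lookup time they fall back to the <UNK> id.
    """
    def __init__(self, filename, word_num_threshold=3):
        self._word_to_id = {}
        self._unk_id = -1
        with open(filename) as f:
            for line in f:
                word, frequency = line.strip('\n').split('\t')
                if int(frequency) < word_num_threshold:
                    continue  # filter out rare words
                idx = len(self._word_to_id)  # id = position in the file
                if word == '<UNK>':
                    self._unk_id = idx
                self._word_to_id[word] = idx

    def word_to_id(self, word):
        # unseen or filtered words map to <UNK>
        return self._word_to_id.get(word, self._unk_id)

vocab = Vocab(output_vocab_file)
print(vocab.word_to_id('man'), vocab.word_to_id('Xanadu'))  # second falls back to <UNK>
```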