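This post follows up on an earlier data-matching experiment done in Python. It collects the CSV read and write operations used there, together with the conversions between CSV files and common Python data structures such as dictionaries and lists.
(Python version 2.7)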
Converting a list to a CSV file
Converting a list of nested dictionaries to a CSV file
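The most basic conversion writes each element of a list to its own row of the CSV file.

    import csv

    def list2csv(list, file):
        # 'wb' is the mode the Python 2.7 csv module expects; the with block closes the file
        with open(file, 'wb') as f:
            wr = csv.writer(f, quoting=csv.QUOTE_ALL)
            for word in list:
                # wrap the element in a list so it is written as a single field
                wr.writerow([word])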
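For readers on Python 3, the same idea can be sketched as follows. This is only an illustration under the assumption of Python 3, where the file is opened in text mode with newline=''; the name list2csv_py3 and the sample data are hypothetical and not part of the original toolkit.

```python
import csv
import os
import tempfile


def list2csv_py3(words, path):
    # Python 3: open in text mode and pass newline='' so that the csv
    # module controls the line endings itself.
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for word in words:
            writer.writerow([word])  # one single-field row per element


# Hypothetical sample data, written to a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), 'words.csv')
list2csv_py3(['Bordeaux', 'Sunderland', 'Vitesse'], sample_path)

# Read the file back to check what was written.
with open(sample_path, newline='') as f:
    written_rows = list(csv.reader(f))
```

Reading the file back gives one single-element row per list item, and because of QUOTE_ALL every field in the file itself is wrapped in double quotes.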
Converting a list of nested dictionaries is the typical CSV read/write case. The first row of a CSV file is usually a header that names the fields, and every following row holds the values for those fields. When such a file is read, each row is commonly stored as a dictionary whose keys come from the header row and whose values come from that row, for example:

    my_list = [
        {'players.vis_name': 'Khazri', 'players.role': 'Midfielder',
         'players.country': 'Tunisia', 'players.last_name': 'Khazri',
         'players.player_id': '989', 'players.first_name': 'Wahbi',
         'players.date_of_birth': '08/02/1991', 'players.team': 'Bordeaux'},
        {'players.vis_name': 'Khazri', 'players.role': 'Midfielder',
         'players.country': 'Tunisia', 'players.last_name': 'Khazri',
         'players.player_id': '989', 'players.first_name': 'Wahbi',
         'players.date_of_birth': '08/02/1991', 'players.team': 'Sunderland'},
        {'players.vis_name': 'Lewis Baker', 'players.role': 'Midfielder',
         'players.country': 'England', 'players.last_name': 'Baker',
         'players.player_id': '9574', 'players.first_name': 'Lewis',
         'players.date_of_birth': '25/04/1995', 'players.team': 'Vitesse'},
    ]
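All of these dictionaries end up stored in one list. What follows is the reverse step: writing such a list of dictionaries back to a CSV file.

    # write nested list of dict to csv
    def nestedlist2csv(list, out_file):
        with open(out_file, 'wb') as f:
            w = csv.writer(f)
            # take the header from the keys of the first dict
            fieldnames = list[0].keys()
            w.writerow(fieldnames)
            for row in list:
                # look the values up by field name so every row follows the header order
                w.writerow([row[k] for k in fieldnames])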
Note fieldnames here: it carries the keys, that is, the field names that make up the first row of the CSV file.
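The standard library also offers csv.DictWriter, which matches values to columns by key rather than by position. The following is a minimal sketch of the same conversion with it, assuming Python 3 and a writable text file object; the function name dicts_to_csv and the shortened sample rows are hypothetical.

```python
import csv
import io


def dicts_to_csv(rows, out):
    # `out` is any writable text file object (an assumption of this sketch).
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()    # first row: the field names
    writer.writerows(rows)  # values are matched to columns by key


players = [
    {'players.player_id': '989', 'players.team': 'Bordeaux'},
    # Same keys in a different order: the output columns still line up.
    {'players.team': 'Sunderland', 'players.player_id': '989'},
]
buffer = io.StringIO()
dicts_to_csv(players, buffer)
csv_text = buffer.getvalue()
```

With a real file, the io.StringIO buffer would be replaced by open(path, 'w', newline='').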
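Converting a CSV file to a dictionary
First row holds the keys, the remaining rows hold the values
Each row is a key-value record
Converting a CSV file to a two-level dictionary
Converting a dictionary to a CSV file
First row holds the keys, the remaining rows hold the values
Each row is a key-value record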
For the common case where the first row holds the field names and the remaining rows hold the values:

    # convert csv file to dict
    # @params:
    # key/value: the columns of the original csv file to use as the key and value of the dict
    def csv2dict(in_file, key, value):
        new_dict = {}
        with open(in_file, 'rb') as f:
            reader = csv.reader(f, delimiter=',')
            # read the header row, then hand the rest of the file to DictReader
            fieldnames = next(reader)
            reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',')
            for row in reader:
                new_dict[row[key]] = row[value]
        return new_dict
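In new_dict[row[key]] = row[value], key and value are field names taken from the first row of the CSV file. Note that this assumes a fairly simple CSV file in which the column chosen as the key is unique. Otherwise, converting the CSV file straight into a dict overwrites entries that share a key, and data is lost. If the column chosen as the key contains duplicates in the original data, build a dict of lists instead, with each value stored as a list; see the code for building a dict of lists later in this post.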
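The overwriting described here is easy to reproduce. The snippet below is a small illustration, assuming Python 3 and an in-memory file; the sample rows are hypothetical and modelled on the player records shown earlier, where player 989 appears with two teams.

```python
import csv
import io

# Hypothetical sample: player 989 appears twice, with two different teams.
sample = io.StringIO(
    "players.player_id,players.team\n"
    "989,Bordeaux\n"
    "989,Sunderland\n"
    "9574,Vitesse\n"
)

plain = {}
for row in csv.DictReader(sample):
    # A later row with the same key replaces the earlier value.
    plain[row['players.player_id']] = row['players.team']
```

After the loop, plain holds only two entries and the Bordeaux record is gone, which is exactly the data loss that a dict of lists avoids.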
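For the special case where every row is a key-value pair: by default the first column is taken as the key of the resulting dict and the second column as the value. Adjust this as needed.

    # convert csv file to dict(key-value pairs each row)
    def row_csv2dict(csv_file):
        dict_club = {}
        # open in 'rb' mode, as the Python 2.7 csv module expects
        with open(csv_file, 'rb') as f:
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                dict_club[row[0]] = row[1]
        return dict_club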
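When every row has exactly two fields, the loop can be shortened, because dict() accepts any iterable of two-element sequences. This is a sketch only, assuming Python 3 and hypothetical club data; unlike the function above, it raises ValueError on any row that does not have exactly two fields (including blank lines).

```python
import csv
import io

# Hypothetical two-column data: club name, country.
pairs = io.StringIO(
    "Bordeaux,France\n"
    "Sunderland,England\n"
    "Vitesse,Netherlands\n"
)

# Each row is a two-element list, which dict() treats as a (key, value) pair.
club_country = dict(csv.reader(pairs))
```

The explicit loop remains the better choice when rows may carry extra columns, since it simply ignores everything after the second field.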
[Update]
Building a dict whose values are lists. This is mainly useful when the values of several CSV columns should be collected under the value of one particular column, or in other words when the data does not fit a plain dict because the same key maps to more than one value.

    # build a dict of list like {key:[...element of lst_inner_value...]}
    # key is certain column name of csv file
    # the lst_inner_value is a list of specific column name of csv file
    def build_list_dict(source_file, key, lst_inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                for element in lst_inner_value:
                    # create the list on first sight of the key, then append
                    new_dict.setdefault(row[key], []).append(row[element])
        return new_dict

    # sample:
    # test_club=build_list_dict('test_info.csv','season',['move from','move to'])
    # print test_club
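To make the resulting shape concrete, here is a small sketch of the same grouping, assuming Python 3 and using collections.defaultdict in place of setdefault. The function name list_dict_from and the transfer records are hypothetical; only the column names follow the sample call above.

```python
import csv
import io
from collections import defaultdict


def list_dict_from(csv_file, key, value_columns):
    grouped = defaultdict(list)  # missing keys start out as an empty list
    for row in csv.DictReader(csv_file):
        for column in value_columns:
            grouped[row[key]].append(row[column])
    return dict(grouped)


# Hypothetical transfer records.
transfers = io.StringIO(
    "season,move from,move to\n"
    "2015,Bordeaux,Sunderland\n"
    "2015,Chelsea,Vitesse\n"
    "2016,Vitesse,Chelsea\n"
)
by_season = list_dict_from(transfers, 'season', ['move from', 'move to'])
```

Note that the values of all requested columns land in one flat list per key, so for 2015 the list interleaves the "move from" and "move to" values of both rows.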
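Converting a CSV file to a two-level dictionary is usually a special-purpose step that structures the CSV data further: the values of one column (field) become the keys, and the remaining key-value pairs form a sub-dictionary that is stored as the value. It is typically used in matching, where filtering on the outer key first sets up a hierarchy and improves accuracy.
For example, suppose the CSV file holds the following records (shown here as a table):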
id | name | age | country |
---|---|---|---|
1 | danny | 21 | China |
2 | Lancelot | 22 | America |
... | ... | ... | ... |
After conversion to a two-level dictionary (here with country as the outer level and name as the inner level), the result is:

    dct = {'China': {'danny': {'id': '1', 'age': '21'}},
           'America': {'Lancelot': {'id': '2', 'age': '22'}}}

The code is as follows:

    # build specific nested dict from csv files(country->name)
    def build_level2_dict(source_file):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                # fetch the inner dict for this country, or start a new one
                item = new_dict.get(row['country'], dict())
                item[row['name']] = {k: row[k] for k in ('id', 'age')}
                new_dict[row['country']] = item
        return new_dict
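The sample table can be run through the same logic to check the expected result. The snippet below is a sketch assuming Python 3 and an in-memory copy of the two visible table rows; it uses setdefault to fetch or create the inner dict in one step.

```python
import csv
import io

# The two visible rows of the sample table.
table = io.StringIO(
    "id,name,age,country\n"
    "1,danny,21,China\n"
    "2,Lancelot,22,America\n"
)

nested = {}
for row in csv.DictReader(table):
    # Fetch the inner dict for this country, creating it on first use.
    inner = nested.setdefault(row['country'], {})
    inner[row['name']] = {k: row[k] for k in ('id', 'age')}
```

All values stay strings, because the csv module does not convert types.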
[Update]
A further improvement gives a more flexible way to build the two-level dictionary: instead of editing the code inside the function, the keys and values are passed in as arguments. There are two variants; pick the one you need.
In the first variant, the key of each level and the innermost value are each explicitly set to a chosen column:

    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # inner_key:the inner level key of nested dict
    # inner_value:set the inner value for the inner key
    def build_level2_dict2(source_file, outer_key, inner_key, inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                item[row[inner_key]] = row[inner_value]
                new_dict[row[outer_key]] = item
        return new_dict
In the second variant, the keys of the first and second levels are specified, and the remaining key-value pairs of the CSV file are stored as the innermost value:

    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # inner_key:the inner level key of nested dict,and rest key-value will be store as the value of inner key
    def build_level2_dict(source_file, outer_key, inner_key):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            reader = csv.reader(csv_file, delimiter=',')
            fieldnames = next(reader)
            # the columns left over after removing both key columns
            inner_keyset = fieldnames
            inner_keyset.remove(outer_key)
            inner_keyset.remove(inner_key)
            # rewind so that DictReader sees the header row again
            csv_file.seek(0)
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                item[row[inner_key]] = {k: row[k] for k in inner_keyset}
                new_dict[row[outer_key]] = item
        return new_dict
There is another way to build a two-level dictionary, using the pop() method. Personally I find it less intuitive than the approach above, but here it is:

    from collections import defaultdict

    def build_dict(source_file):
        projects = defaultdict(dict)
        # if there is no header within the csv file you need to set the header
        # and utilize fieldnames parameter in csv.DictReader method
        # headers = ['id', 'name', 'age', 'country']
        with open(source_file, 'rb') as fp:
            reader = csv.DictReader(fp, dialect='excel', skipinitialspace=True)
            for rowdict in reader:
                # DictReader stores surplus fields under the key None; drop them
                if None in rowdict:
                    del rowdict[None]
                nationality = rowdict.pop("country")
                name = rowdict.pop("name")
                # whatever is left in rowdict becomes the innermost value
                projects[nationality][name] = rowdict
        return dict(projects)
[Update]
Here are a few more ways to build a two-level dictionary. They are meant for CSV files that do not fit a plain dict structure because some keys map to several values, so the values have to be kept in lists inside the dict, or each set of key-value pairs has to be kept in a list.

    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct
    # {outer_key:[{pairs of lst_inner_value}]}
    def build_level2_dict3(source_file, outer_key, lst_inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                new_dict.setdefault(row[outer_key], []).append({k: row[k] for k in lst_inner_value})
        return new_dict

    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct
    # {outer_key:{key of lst_inner_value:[...value of lst_inner_value...]}}
    def build_level2_dict4(source_file, outer_key, lst_inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                # for lst_inner_value = ['move from', 'move to'] this is equivalent to:
                # item.setdefault('move from',[]).append(row['move from'])
                # item.setdefault('move to', []).append(row['move to'])
                for element in lst_inner_value:
                    item.setdefault(element, []).append(row[element])
                new_dict[row[outer_key]] = item
        return new_dict

    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # lst_inner_key: a column name whose values become the inner keys
    # lst_inner_value: a column name whose values are collected in a list,
    #                  for circumstance that the inner value of the same lst_inner_key are not distinct
    # {outer_key:{lst_inner_key:[...lst_inner_value...]}}
    def build_list_dict2(source_file, outer_key, lst_inner_key, lst_inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                item.setdefault(row[lst_inner_key], []).append(row[lst_inner_value])
                new_dict[row[outer_key]] = item
        return new_dict

    # dct=build_list_dict2('test_info.csv','season','move from','move to')
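The three shapes produced by build_level2_dict3, build_level2_dict4 and build_list_dict2 are easier to tell apart side by side. The snippet below is an illustrative sketch, assuming Python 3; it inlines the core statement of each function and applies it to the same hypothetical transfer records (the data and the variable names are not from the original).

```python
import csv
import io

# Hypothetical transfer records, two rows for the same season.
DATA = (
    "season,move from,move to\n"
    "2015,Bordeaux,Sunderland\n"
    "2015,Chelsea,Vitesse\n"
)
rows = list(csv.DictReader(io.StringIO(DATA)))
columns = ['move from', 'move to']

shape3, shape4, shape_pairs = {}, {}, {}
for row in rows:
    season = row['season']
    # build_level2_dict3: one small dict per row, collected in a list
    shape3.setdefault(season, []).append({c: row[c] for c in columns})
    # build_level2_dict4: one list per column
    for c in columns:
        shape4.setdefault(season, {}).setdefault(c, []).append(row[c])
    # build_list_dict2: values of one column grouped under values of another
    shape_pairs.setdefault(season, {}).setdefault(row['move from'], []).append(row['move to'])
```

The first shape keeps each row together, the second keeps each column together, and the third is a lookup from one column to the list of matching values in another.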
Similarly, three-level or even deeper dictionaries can be built from a CSV file. The approach is the same as above, so I will not repeat the explanation and just post the code.

    # build specific nested dict from csv files
    # a dict like {outer_key:{inner_key1:{inner_key2:{rest_key:rest_value...}}}}
    # the params are extract from the csv column name as you like
    def build_level3_dict(source_file, outer_key, inner_key1, inner_key2):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            reader = csv.reader(csv_file, delimiter=',')
            fieldnames = next(reader)
            # the columns left over after removing the three key columns
            inner_keyset = fieldnames
            inner_keyset.remove(outer_key)
            inner_keyset.remove(inner_key1)
            inner_keyset.remove(inner_key2)
            # rewind so that DictReader sees the header row again
            csv_file.seek(0)
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                sub_item = item.get(row[inner_key1], dict())
                sub_item[row[inner_key2]] = {k: row[k] for k in inner_keyset}
                item[row[inner_key1]] = sub_item
                new_dict[row[outer_key]] = item
        return new_dict

    # build specific nested dict from csv files
    # a dict like {outer_key:{inner_key1:{inner_key2:inner_value}}}
    # the params are extract from the csv column name as you like
    def build_level3_dict2(source_file, outer_key, inner_key1, inner_key2, inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                sub_item = item.get(row[inner_key1], dict())
                sub_item[row[inner_key2]] = row[inner_value]
                item[row[inner_key1]] = sub_item
                new_dict[row[outer_key]] = item
        return new_dict
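Again there are two variants for different needs: one stores the remaining key-value pairs unchanged as the innermost value, the other keeps only the column that is asked for.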
There is one more special case: when the innermost value should not be a single element but a list holding several elements that share the same key, only the innermost key-value assignment needs to change.

    # build specific nested dict from csv files
    # a dict like {outer_key:{inner_key1:{inner_key2:[inner_value]}}}
    # for multiple inner_value with the same inner_key2,thus gather them in a list
    # the params are extract from the csv column name as you like
    def build_level3_dict3(source_file, outer_key, inner_key1, inner_key2, inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                sub_item = item.get(row[inner_key1], dict())
                sub_item.setdefault(row[inner_key2], []).append(row[inner_value])
                item[row[inner_key1]] = sub_item
                new_dict[row[outer_key]] = item
        return new_dict
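The core of it is the statement sub_item.setdefault(row[inner_key2], []).append(row[inner_value]), which creates the list the first time a key is seen and appends to it on every later row.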
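Since the two-level and three-level builders differ only in how many key columns they walk through, the nesting depth can also be made a parameter. The following is a hypothetical generalisation, not part of the original toolkit, sketched under the assumption of Python 3; the function name build_nested_dict and the sample data are illustrative.

```python
import csv
import io


def build_nested_dict(csv_file, key_columns):
    # key_columns lists the nesting levels from outermost to innermost.
    result = {}
    for row in csv.DictReader(csv_file):
        node = result
        # Walk (and create) one dict level per key column except the last.
        for column in key_columns[:-1]:
            node = node.setdefault(row[column], {})
        # Everything that is not used as a key becomes the innermost value.
        rest = {k: v for k, v in row.items() if k not in key_columns}
        node[row[key_columns[-1]]] = rest
    return result


people = io.StringIO(
    "id,name,age,country\n"
    "1,danny,21,China\n"
    "2,Lancelot,22,America\n"
)
by_country_age_name = build_nested_dict(people, ['country', 'age', 'name'])
```

With two key columns this gives the same shape as the two-level builder that stores the remaining pairs, and with three it matches the first three-level builder. As in those functions, a repeated combination of keys overwrites the earlier row.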
Each row is a key-value record
First row holds the keys, the remaining rows hold the values
Writing out a dict of lists
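Converting a dictionary to a CSV file is the reverse of the CSV-to-dictionary conversion described earlier. It is fairly simple, so here is the code directly, first with one key-value pair per row and then with the keys in the first row:

    def dict2csv(dict, file):
        with open(file, 'wb') as f:
            w = csv.writer(f)
            # write each key/value pair on a separate row
            w.writerows(dict.items())

    def dict2csv(dict, file):
        with open(file, 'wb') as f:
            w = csv.writer(f)
            # write all keys on one row and all values on the next
            w.writerow(dict.keys())
            w.writerow(dict.values())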
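The difference between the two layouts shows up clearly in the output text. The snippet below is a small sketch, assuming Python 3 (3.7 or later, where dicts keep insertion order) and in-memory buffers instead of files; the club data is hypothetical.

```python
import csv
import io

# Hypothetical data: club name -> country.
club_country = {'Bordeaux': 'France', 'Vitesse': 'Netherlands'}

# Layout 1: one key-value pair per row.
by_row = io.StringIO()
csv.writer(by_row).writerows(club_country.items())

# Layout 2: all keys on the first row, all values on the second.
by_column = io.StringIO()
writer = csv.writer(by_column)
writer.writerow(club_country.keys())
writer.writerow(club_country.values())

row_lines = by_row.getvalue().splitlines()
column_lines = by_column.getvalue().splitlines()
```

The first layout grows downwards with the number of entries, the second grows sideways, which is why the second is usually only practical for small dicts.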
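Writing out a dict of lists is actually not that common. The reverse direction comes up more often: importing a regular CSV file into a dict of lists, that is, a single dict whose keys come from the first row of the CSV file and whose values are lists built from the rows under each column. If you do need to save such a dict as a CSV file, it can be done as follows:

    import csv
    import pandas as pd
    from collections import OrderedDict

    dct = OrderedDict()
    dct['a'] = [1, 2, 3, 4]
    dct['b'] = [5, 6, 7, 8]
    dct['c'] = [9, 10, 11, 12]

    header = dct.keys()
    # turn the dict of lists into a list of dicts, one dict per output row
    rows = pd.DataFrame(dct).to_dict('records')

    with open('outTest.csv', 'wb') as f:
        f.write(','.join(header))
        f.write('\n')
        for data in rows:
            f.write(",".join(str(data[h]) for h in header))
            f.write('\n')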
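Three packages are involved here. Apart from csv, which handles ordinary CSV reading and writing, OrderedDict keeps the columns in their original order in the output file, and pandas is used for the intermediate step that turns a dict of lists into a list of dicts. For example, it converts

    [('a', [1, 2, 3, 4]), ('b', [5, 6, 7, 8]), ('c', [9, 10, 11, 12])]

to

    [{'a': 1, 'c': 9, 'b': 5}, {'a': 2, 'c': 10, 'b': 6}, {'a': 3, 'c': 11, 'b': 7}, {'a': 4, 'c': 12, 'b': 8}]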
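If pandas is not available, the same dict-of-lists to list-of-dicts step can be done with the built-in zip. This is an alternative sketch rather than the author's method, written for Python 3; the variable names header and records are illustrative.

```python
from collections import OrderedDict

dct = OrderedDict()
dct['a'] = [1, 2, 3, 4]
dct['b'] = [5, 6, 7, 8]
dct['c'] = [9, 10, 11, 12]

header = list(dct.keys())
# zip(*columns) turns the column lists into row tuples: (1, 5, 9), (2, 6, 10), ...
records = [dict(zip(header, values)) for values in zip(*dct.values())]
```

One difference to keep in mind: zip stops at the shortest column, so columns of unequal length are silently truncated.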
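The last case concerns CSV files with unusual delimiters. Normally a CSV file uses one delimiter throughout and nothing special is needed (the operations above all assume that the delimiter is , everywhere). A file in which the header row is delimited by , while the value rows that follow are delimited by ; has to be read slightly differently. In general, the rows can be converted to dictionaries one by one and then processed. The code is as follows:

    def func(id_list, input_file, output_file):
        with open(input_file, 'rb') as f:
            # if the delimiter for header is ',' while ';' for rows
            reader = csv.reader(f, delimiter=',')
            fieldnames = next(reader)
            reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=';')
            # build the set once instead of once per row
            ids = set(id_list)
            rows = [row for row in reader if row['players.player_id'] in ids]
            # operation on rows...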
Change the delimiters as needed.
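The two-reader technique can be tried out on a small in-memory file. The snippet below is a sketch assuming Python 3; the sample content and the set of wanted ids are hypothetical.

```python
import csv
import io

# Hypothetical file: header separated by ',', data rows separated by ';'.
mixed = io.StringIO(
    "players.player_id,players.team\n"
    "989;Bordeaux\n"
    "9574;Vitesse\n"
)

# The first reader consumes only the header line.
fieldnames = next(csv.reader(mixed, delimiter=','))
# The second reader continues from the second line with the other delimiter.
reader = csv.DictReader(mixed, fieldnames=fieldnames, delimiter=';')

wanted = {'989'}
rows = [row for row in reader if row['players.player_id'] in wanted]
```

This works because both readers pull lines from the same file object, so the second one picks up exactly where the first stopped.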
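These are more or less all the CSV-related problems I ran into during the experiment. Most of them can be solved by searching Stack Overflow or by asking a question there; the people on it are very helpful. Later I will summarize some of the other data processing done in the experiment, such as format conversion, deduplication and duplicate detection.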
Finally, the source code is published on GitHub as csv_toolkit. Feel free to play with it.
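Changelog
1. 2016-12-22: Improved the method for building two-level dictionaries to make it more flexible.
2. 2016-12-24 14:55:30: Added methods for building three-level dictionaries.
3. 2017-01-09 11:26:59: The innermost level can now hold a list of elements from a specified column.
4. 2017-01-16 10:29:44: Added the construction of dicts of lists, and of special two-level dictionaries that need to keep several values for the same key.
5. 2017-02-09 10:54:41: Added a new way of building two-level dicts of lists.
6. 2017-02-10 11:18:01: Improved the code that builds a dict from a simple CSV file.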