整理了一些我的在利用python处理csv文件时常常用到的一些自定义方法,放在这里主要方便本身查阅,也能够给其余人作参考python
输出CSV文件某列的匹配/不匹配的记录git
调整csv文件的列的顺序github
CSV转换器segmentfault
抽取特定列app
除去彻底重复的记录ui
根据列名排序rest
键值互换日志
主要用于从csv文件中抽取出匹配特定列的特定字段集合的记录,好比现有这么一个csv文件(表格化后)code
name | age | sex |
---|---|---|
Danny | 24 | male |
Daisy | 23 | female |
Lancelot | 23 | unknown |
Lydia | 21 | female |
... | ... | ... |
须要输出其中age
为23的记录到新的csv文件,则咱们能够先把23这么个关键词用一个列表收集起来,而后经过下列代码从csv文件中找出全部符合条件的记录并输出排序
import sys import csv # try to fix '_csv.Error: field larger than field limit (131072)' csv.field_size_limit(sys.maxint) # write to common csv file with delimiter ',' # output the rows with matched id in id_list to a new csv file def csv_match(id_list,key,input_file,output_file): with open(input_file, 'rb') as f: reader = csv.DictReader(f) rows = [row for row in reader if row[key] in set(id_list)] header = rows[0].keys() with open(output_file, 'w') as f: f.write(','.join(header)) f.write('\n') for data in rows: f.write(",".join(data[h] for h in header)) f.write('\n')
调用的时候:
lst=['23'] csv_match(lst,'age','in.csv','out.csv')
key
为须要匹配的列名,另外咱们也能够提取不符合该条件的记录,‘取个反’就好了
# output the rows with not matched id in id_list to a new csv file def csv_not_match(id_list, key, input_file, output_file): with open(input_file, 'rb') as f: reader = csv.DictReader(f) rows = [row for row in reader if not row[key] in set(id_list)] header = rows[0].keys() with open(output_file, 'w') as f: f.write(','.join(header)) f.write('\n') for data in rows: f.write(",".join(data[h] for h in header)) f.write('\n')
对于须要判断csv文件中多个列的值的状况,只需修改对应的判别条件和传入参数状况便可
# output the rows with matched key1 or key2 in refer_list to a new csv file # @params # refer_list: the list referred to # key,key2: column name of csv file to check the value in the refer_list or not def csv_match2(refer_list, key1, key2, input_file, output_file): with open(input_file, 'rb') as f: reader = csv.DictReader(f) rows = [row for row in reader if (row[key1] in set(refer_list)) or (row[key2] in set(refer_list))] header = rows[0].keys() with open(output_file, 'w') as f: f.write(','.join(header)) f.write('\n') for data in rows: f.write(",".join(data[h] for h in header)) f.write('\n')
有时候咱们输出的或者拿到的csv文件的列的顺序不够‘人性化’,为了让咱们看起来更加直观,更舒服一点,咱们能够按照咱们的须要调整列的顺序
import csv # reorder the column of the csv file to what you want def csv_reorder(in_file, out_file,lst_order): with open(in_file, 'rb') as infile, open(out_file, 'wb') as outfile: fieldnames=lst_order writer = csv.DictWriter(outfile, fieldnames=fieldnames) writer.writeheader() for row in csv.DictReader(infile): writer.writerow(row)
其中lst_order
为咱们须要的列名顺序,用list存储,举个例子
season_id,league_name,league_size 2003,scottish-premiership,12 2016,1-hnl,10 2004,alka-superligaen,12 2006,allsvenskan,14 1992,premier-league,22 ...
如今咱们想调整他的顺序,按照league_name,season_id,league_size
的顺序从新组合一下
则调用
lst_order = ['league_name','season_id','league_size'] csv_reorder('leagues_size.csv', 'leagues_size_new.csv', lst_order)
获得结果
league_name,season_id,league_size scottish-premiership,2003,12 1-hnl,2016,10 alka-superligaen,2004,12 allsvenskan,2006,14 premier-league,1992,22 ...
这个主要是用来进行csv和python的一些内置的容器例如list,dict之类的转换,包括一些特殊的多级字典,或者是嵌套列表的字典等等,这里只是把他们打个包放在一块儿,具体的能够参照我以前写的一篇文章
import csv #---------------------------------------------------csv <--> dict-------------------------------------------- # convert csv file to dict # @params: # key/value: the column of original csv file to set as the key and value of dict def csv2dict(in_file,key,value): new_dict = {} with open(in_file, 'rb') as f: reader = csv.reader(f, delimiter=',') fieldnames = next(reader) reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',') for row in reader: new_dict[row[key]] = row[value] return new_dict # convert csv file to dict(key-value pairs each row) # default: set row[0] as key and row[1] as value of the dict def row_csv2dict(csv_file): dict_club={} with open(csv_file)as f: reader=csv.reader(f,delimiter=',') for row in reader: dict_club[row[0]]=row[1] return dict_club # write dict to csv file # write each key/value pair on a separate row def dict2csv(dict, file): with open(file, 'wb') as f: w = csv.writer(f) # write each key/value pair on a separate row w.writerows(dict.items()) # write dict to csv file # write all keys on one row and all values on the next def dict2csv2(dict, file): with open(file, 'wb') as f: w = csv.writer(f) # write all keys on one row and all values on the next w.writerow(dict.keys()) w.writerow(dict.values()) # build a dict of list like {key:[...element of lst_inner_value...]} # key is certain column name of csv file # the lst_inner_value is a list of specific column name of csv file def build_list_dict(source_file, key, lst_inner_value): new_dict = {} with open(source_file, 'rb')as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: for element in lst_inner_value: new_dict.setdefault(row[key], []).append(row[element]) return new_dict # sample: # test_club=build_list_dict('test_info.csv','season',['move from','move to']) # print test_club # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # inner_key:the inner level key of nested dict,and rest key-value will be store as the value of inner key def build_level2_dict(source_file,outer_key,inner_key): new_dict = {} with open(source_file, 'rb')as csv_file: reader = csv.reader(csv_file, delimiter=',') fieldnames = next(reader) inner_keyset=fieldnames inner_keyset.remove(outer_key) inner_keyset.remove(inner_key) csv_file.seek(0) data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) item[row[inner_key]] = {k: row[k] for k in inner_keyset} new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # inner_key:the inner level key of nested dict # inner_value:set the inner value for the inner key def build_level2_dict2(source_file,outer_key,inner_key,inner_value): new_dict = {} with open(source_file, 'rb')as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) item[row[inner_key]] = row[inner_value] new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct # {outer_key:[{pairs of lst_inner_value}]} def build_level2_dict3(source_file,outer_key,lst_inner_value): new_dict = {} with open(source_file, 'rb')as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: new_dict.setdefault(row[outer_key], []).append({k: row[k] for k in lst_inner_value}) return new_dict # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct # {outer_key:{key of lst_inner_value:[...value of lst_inner_value...]}} def build_level2_dict4(source_file,outer_key,lst_inner_value): new_dict = {} with open(source_file, 'rb')as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: # print row item = new_dict.get(row[outer_key], dict()) # item.setdefault('move from',[]).append(row['move from']) # item.setdefault('move to', []).append(row['move to']) for element in lst_inner_value: item.setdefault(element, []).append(row[element]) new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # lst_inner_key:a list of column name # lst_inner_value: a list of column name,for circumstance that the inner value of the same lst_inner_key are not distinct # {outer_key:{lst_inner_key:[...lst_inner_value...]}} def build_list_dict2(source_file,outer_key,lst_inner_key,lst_inner_value): new_dict = {} with open(source_file, 'rb')as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: # print row item = new_dict.get(row[outer_key], dict()) item.setdefault(row[lst_inner_key], []).append(row[lst_inner_value]) new_dict[row[outer_key]] = item return new_dict # dct=build_list_dict2('test_info.csv','season','move from','move to') # build specific nested dict from csv files # a dict like {outer_key:{inner_key1:{inner_key2:{rest_key:rest_value...}}}} # the params are extract from the csv column name as you like def build_level3_dict(source_file,outer_key,inner_key1,inner_key2): new_dict = {} with open(source_file, 'rb')as csv_file: reader = csv.reader(csv_file, delimiter=',') fieldnames = next(reader) inner_keyset=fieldnames inner_keyset.remove(outer_key) inner_keyset.remove(inner_key1) inner_keyset.remove(inner_key2) csv_file.seek(0) data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) sub_item = item.get(row[inner_key1], dict()) sub_item[row[inner_key2]] = {k: row[k] for k in inner_keyset} item[row[inner_key1]] = sub_item new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # a dict like {outer_key:{inner_key1:{inner_key2:inner_value}}} # the params are extract from the csv column name as you like def build_level3_dict2(source_file,outer_key,inner_key1,inner_key2,inner_value): new_dict = {} with open(source_file, 'rb')as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) sub_item = item.get(row[inner_key1], dict()) sub_item[row[inner_key2]] = row[inner_value] item[row[inner_key1]] = sub_item new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # a dict like {outer_key:{inner_key1:{inner_key2:[inner_value]}}} # for multiple inner_value with the same inner_key2,thus gather them in a list # the params are extract from the csv column name as you like def build_level3_dict3(source_file,outer_key,inner_key1,inner_key2,inner_value): new_dict = {} with open(source_file, 'rb')as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) sub_item = item.get(row[inner_key1], dict()) sub_item.setdefault(row[inner_key2], []).append(row[inner_value]) item[row[inner_key1]] = sub_item new_dict[row[outer_key]] = item return new_dict #---------------------------------------------------------------------------------------------------------- #---------------------------------------------------csv <--> list-------------------------------------------- def list2csv(list, file): # def list2csv(list): # wr = csv.writer(open(file, 'wb'), quoting=csv.QUOTE_ALL) wr=open(file,'w') for word in list: # print ''.join(word) # wr.writerow([word]) wr.write(word+'\n') # wr.writerow(str.split(word,'"')[0]) # print [word] # test_list = ['United States', 'China', 'America', 'England'] # list2csv(test_list,'small_test.csv') # write nested list of dict to csv def nestedlist2csv(list, out_file): with open(out_file, 'wb') as f: w = csv.writer(f) fieldnames=list[0].keys() # solve the problem to automatically write the header w.writerow(fieldnames) for row in list: w.writerow(row.values()) # my_list = [{'players.vis_name': 'Khazri', 'players.role': 'Midfielder', 'players.country': 'Tunisia', # 'players.last_name': 'Khazri', 'players.player_id': '989', 'players.first_name': 'Wahbi', # 'players.date_of_birth': '08/02/1991', 'players.team': 'Bordeaux'}, # {'players.vis_name': 'Khazri', 'players.role': 'Midfielder', 'players.country': 'Tunisia', # 'players.last_name': 'Khazri', 'players.player_id': '989', 'players.first_name': 'Wahbi', # 'players.date_of_birth': '08/02/1991', 'players.team': 'Sunderland'}, # {'players.vis_name': 'Lewis Baker', 'players.role': 'Midfielder', 'players.country': 'England', # 'players.last_name': 'Baker', 'players.player_id': '9574', 'players.first_name': 'Lewis', # 'players.date_of_birth': '25/04/1995', 'players.team': 'Vitesse'} # ] # nestedlist2csv(my_list, 'dict2csv_test.csv') # collect and convert the first column of csv file to list def csv2list(csv_file): lst = [] with open(csv_file, 'rb')as f: reader = csv.reader(f, delimiter=',') for row in reader: lst.append(row[0]) return list(set(lst)) #----------------------------------------------------------------------------------------------------------
抽取特定列的全部值并存储于列表
根据下标抽取特定列到某个新的csv文件
获取某列原始的数据并保存为列表
# get certain column value of csv(for common csv file(',')) def get_origin_column_value(file, column_name): with open(file, 'rb') as f: role_list = [] reader = csv.reader(f, delimiter=',') fieldnames = next(reader) reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',') for row in reader: role_list.append(row[column_name]) return role_list
对于某些有特殊须要的能够直接修改代码,好比对原始的列的值进行除重和排序后获取,以下
# get certain column value of csv(for common csv file(',')),and judge if it's repeated def get_column_value2(file, column_name): with open(file, 'rb') as f: role_list = [] reader = csv.reader(f, delimiter=',') fieldnames = next(reader) reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',') for row in reader: role_list.append(row[column_name]) role_set = set(role_list) return sorted(list(role_set))
import csv # extract certain column from csv file according to the column# def column_extract(file_in,file_out,index): with open(file_in,'r') as f_in: with open(file_out,'w') as f_out: for line in f_in: f_out.write(line.split(',')[index]) f_out.write('\n') # comment if a new line already exists
# eliminated the completely repeated record in repeated file for further analysis def eliminate_repeated_row(in_file,out_file): with open(in_file,'rb') as in_file,open(out_file,'wb')as out_file: seen=set() for line in in_file: # print line if line in seen:continue seen.add(line) out_file.write(line)
# sort the csv file by certain column to put the similar record together for further analysis def sort_csv_byColumn(in_file, out_file,column_name): with open(in_file, 'rb') as f: reader = csv.reader(f, delimiter=',') fieldnames = next(reader) reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',') sorted_list = sorted(reader, key=lambda row: row[column_name], reverse=True) # print sorted_list csv_converter.nestedlist2csv(sorted_list, out_file)
例如咱们按照league_name
排序(注意这里调用了csv转换器中的方法将列表的字典转换为csv文件)
sort_csv_byColumn('leagues_size.csv','ordered_leagues_size.csv','league_name')
获得结果
season_id,league_name,league_size 2016,ykkonen,9 2003,ykkonen,14 2005,ykkonen,14 2006,ykkonen,14 2007,ykkonen,14 2010,ykkonen,13 2011,ykkonen,10 2009,ykkonen,14 2008,ykkonen,14 2012,ykkonen,10 2013,ykkonen,10 2014,ykkonen,10 2015,ykkonen,10 2016,wiener-stadtliga,16 1988,wiener-stadtliga,16 1993,wiener-stadtliga,16 1994,wiener-stadtliga,16 1995,wiener-stadtliga,16 1996,wiener-stadtliga,16 1997,wiener-stadtliga,16 1998,wiener-stadtliga,16
若是咱们按league_size
排序
sort_csv_byColumn('leagues_size.csv', 'orderedbysize_leagues_size.csv','league_size')
获得结果
season_id,league_name,league_size 2008,virsliga,9 2010,virsliga,9 2012,a-lyga,9 2012,a-pojat-sm-sarja,9 2013,a-pojat-sm-sarja,9 1953,salzburger-liga,9 2010,3-lig-grup-1,9 2013,armenian-first-league,9 2016,ykkonen,9 2014,stirling-sports-premiership,9 2014,hong-kong-premier-league,9 2015,hong-kong-premier-league,9 1996,s-league,9 2015,s-league,9 2013,united-football-league,9 2016,i-league,9
csv文件每一条记录其实能够看做是一个字典,有时csv文件里有不一样的键对应同一个值的状况,咱们想讲记录反转一下,即让值做为键,对应的键做为值
# return a dict with the same value in original as new key and keys as value def dict_same_value(original_dict): new_dict={} for k,v in original_dict.iteritems(): new_dict.setdefault(v,[]).append(k) return new_dict
最后欢迎你们fork关于这个的github上的repository,一块儿丰富更多好玩的功能~
更新日志一、2016-12-18 修复了从csv文件中获取特定的列的值保存为集合的问题,而是存储为原始的列表二、2016-12-22 改进了csv转换器中的构建二级字典的方法,使其变得更加灵活三、2016年12月24日14:57:48 在csv转换器部分加入三级字典构造的参照方法四、2017年1月9日11:28:45 在csv转换器部分,三级字典构造中,加入了最内部存储值为列表的构造方法五、2017年1月16日10:43:41 在csv转换器部分,加入了构造列表字典的方法以及构造特殊的二级字典(内部为列表)的方法六、2017年2月9日10:58:17 在csv转换器部分,加入了新的构造特殊的二级字典(内部为列表)的方法七、2017年2月10日11:21:45 在csv转换器部分,改进了简单的csv文件转换为字典的方法,此外在Csv_Match部分,加入了匹配判断多个列对应的元素条件的方法