Data analysis frequently involves reading and writing data. Pandas implements APIs for many IO operations; here is a brief overview.
Format | Data | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | clipboard | read_clipboard | to_clipboard |
binary | Excel | read_excel | to_excel |
binary | HDF5 | read_hdf | to_hdf |
binary | Feather | read_feather | to_feather |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google BigQuery | read_gbq | to_gbq |
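As a quick orientation, each Reader/Writer pair round-trips cleanly. A minimal sketch with the CSV pair, kept entirely in memory so no files are involved:

```python
import io
import pandas as pd

# Write a small frame out with to_csv, then read it back with read_csv.
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
csv_text = df.to_csv(index=False)         # to_csv returns a string when no path is given
df2 = pd.read_csv(io.StringIO(csv_text))  # read_csv accepts any file-like object
print(df.equals(df2))  # True
```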
pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
pd.read_excel(io, sheetname=0, header=0, skiprows=None, index_col=None, names=None, parse_cols=None, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, dtype=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)
Key parameters explained
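A small sketch of a few common read_csv parameters working together; the semicolon-separated content below is made up, and an in-memory buffer stands in for a file:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated file.
raw = "id;score;city\n1;90;北京\n2;85;上海\n3;70;广州\n"
df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                  # field separator
    usecols=["id", "score"],  # keep only these columns
    dtype={"score": float},   # force a dtype
    nrows=2,                  # read only the first 2 data rows
)
print(df.shape)  # (2, 2)
```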
pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, tupleize_cols=None, thousands=', ', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True)
Parameter details
io : a URL, a file-like object, or a raw string containing HTML. Note that lxml only accepts the http, ftp and file URL protocols. If you have a URL that starts with 'https' you might try removing the 's'.
match : the set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to '.+' (match any non-empty string), which returns all tables on the page. The value is converted to a regular expression so that behavior is consistent between Beautiful Soup and lxml.
flavor : the parsing engine to use. 'bs4' and 'html5lib' are synonymous with each other; both are there for backwards compatibility. The default of None tries to use lxml to parse, and if that fails falls back on bs4 + html5lib.
header : the row (or list of rows for a MultiIndex) to use to make the column headers.
index_col : the column (or list of columns) to use to create the index.
skiprows : 0-based. Number of rows to skip after parsing the column integer. If a sequence of integers or a slice is given, the rows indexed by that sequence are skipped. Note that a single-element sequence means 'skip the nth row' whereas an integer means 'skip n rows'.
attrs : a dictionary of attributes used to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup, but they must be valid HTML table attributes to work correctly. For example, attrs = {'id': 'table'} is a valid attribute dictionary because 'id' is a valid HTML attribute for any HTML tag, whereas attrs = {'asdf': 'table'} is not, because 'asdf' is not a valid HTML attribute even if it is a valid XML attribute. Valid HTML 4.01 table attributes are listed in the HTML 4.01 spec; the working draft of the HTML 5 spec contains the latest information on table attributes for the modern web.
parse_dates : boolean or list of ints or names or list of lists or dict, default False.
- boolean: if True, try parsing the index.
- list of ints or names: e.g. [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
- list of lists: e.g. [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
- dict: e.g. {'foo': [1, 3]} -> parse columns 1 and 3 as dates and call the result 'foo'.
If a column or index contains an unparseable date, the entire column or index is returned unaltered as object dtype. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv. Note: a fast-path exists for ISO 8601-formatted dates.
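The list-of-names form can be sketched with read_csv and an in-memory buffer (the column names are invented; note that the column-combining list-of-lists/dict forms have been deprecated in recent pandas versions):

```python
import io
import pandas as pd

# Parse the "when" column as dates while reading.
raw = "when,value\n2018-11-11,1\n2018-12-01,2\n"
df = pd.read_csv(io.StringIO(raw), parse_dates=["when"])
print(str(df["when"].dtype))  # datetime64[ns]
```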
tupleize_cols : if False, try to parse multiple header rows into a MultiIndex, otherwise return raw tuples. Defaults to False. Deprecated since version 0.21.0: this argument will be removed and multiple header rows will always be converted to a MultiIndex.
thousands : separator used to parse thousands. Defaults to ','.
encoding : the encoding used to decode the web page. Defaults to None, which preserves the previous encoding behavior and depends on the underlying parser library (e.g. the parser library will try to use the encoding provided by the document).
decimal : character to recognize as the decimal point (e.g. use ',' for European data). Defaults to '.'. New in version 0.19.0.
converters : dict of functions for converting values in certain columns. Keys can be integers or column labels; values are functions that take one argument, the cell (not column) content, and return the transformed content. New in version 0.19.0.
na_values : custom NA values. New in version 0.19.0.
keep_default_na : if na_values are specified and keep_default_na is False, the default NaN values are overridden; otherwise they are appended to. Used together with na_values. New in version 0.19.0.
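A minimal sketch of attrs-based selection on an inline page (requires an HTML parser such as lxml or html5lib to be installed; the table ids below are invented):

```python
import io
import pandas as pd

html = """
<table id="prices">
  <tr><th>item</th><th>price</th></tr>
  <tr><td>apple</td><td>3</td></tr>
</table>
<table id="other">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
"""
# read_html returns a LIST of DataFrames; attrs narrows it to the table with id="prices".
tables = pd.read_html(io.StringIO(html), attrs={"id": "prices"})
df = tables[0]
print(df.columns.tolist())  # ['item', 'price']
```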
```python
# -*- coding: utf-8 -*-
"""
@Datetime: 2018/11/11
@Author: Zhang Yafei
"""
from multiprocessing import Pool
import os

import pandas
import requests

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
HTML_DIR = os.path.join(BASE_DIR, '药品商品名通用名称数据库')

if not os.path.exists(HTML_DIR):
    os.mkdir(HTML_DIR)

name_list = []
if os.path.exists('drug_name.csv'):
    data = pandas.read_csv('drug_name.csv', encoding='utf-8')

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '248',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'JSESSIONID=0000ixyj6Mwe6Be4heuHcvtSW4C:-1; Hm_lvt_3849dadba32c9735c8c87ef59de6783c=1541937281; Hm_lpvt_3849dadba32c9735c8c87ef59de6783c=1541940406',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'http://pharm.ncmi.cn',
    'Referer': 'http://pharm.ncmi.cn/dataContent/dataSearch.do?did=27',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}


def spider(page):
    """Download one result page and save it into HTML_DIR as <page>.html."""
    adverse_url = 'http://pharm.ncmi.cn/dataContent/dataSearch.do?did=27'
    form_data = {
        'method': 'list',
        'did': 27,
        'ec_i': 'ec',
        'ec_crd': 15,
        'ec_p': page,
        'ec_rd': 15,
        'ec_pd': page,
    }
    response = requests.post(url=adverse_url, headers=header, data=form_data)
    # Save into HTML_DIR so that get_response() can find the file later.
    filename = os.path.join(HTML_DIR, '{}.html'.format(page))
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(response.text)
    print(filename, 'downloaded')


def get_response(page):
    file = os.path.join(HTML_DIR, '{}.html')
    with open(file.format(page), 'r', encoding='utf-8') as f:
        response = f.read()
    return response


def parse(page):
    """Extract the data table from a saved page and write it to drug_name.csv."""
    response = get_response(page)
    result = pandas.read_html(response, attrs={'id': 'ec_table'})[0]
    data = result.iloc[:, :5]
    data.columns = ['序号', '批准文号', '药品中文名称', '药品商品名称', '生产单位']
    if page == 1:
        data.to_csv('drug_name.csv', mode='w', encoding='utf_8_sig', index=False)
    else:
        data.to_csv('drug_name.csv', mode='a', encoding='utf_8_sig', header=False, index=False)
    print('page {} saved'.format(page))


def get_unparse_data():
    if os.path.exists('drug_name.csv'):
        pages = data['序号']
        pages = list(set(range(1, 492)) - set(pages.values))
    else:
        pages = list(range(1, 492))
    return pages


def download():
    pool = Pool()
    pool.map(spider, list(range(1, 492)))
    pool.close()
    pool.join()


def write_to_csv():
    pages = get_unparse_data()
    print(pages)
    list(map(parse, pages))


def new_data(chinese_name):
    trade_name = '/'.join(set(data[data.药品中文名称 == chinese_name].药品商品名称))
    name_list.append(trade_name)


def read_from_csv():
    name = data['药品中文名称'].values
    print(len(name))
    chinese_name = list(set(data['药品中文名称'].values))
    list(map(new_data, chinese_name))
    df_data = {'药品中文名称': chinese_name, '药品商品名称': name_list}
    new_dataframe = pandas.DataFrame(df_data)
    new_dataframe.to_csv('unique_chinese_name.csv', mode='w', encoding='utf_8_sig', index=False)
    return new_dataframe


def main():
    # download()
    # write_to_csv()
    return read_from_csv()


if __name__ == '__main__':
    drugname_dataframe = main()
```
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
Effect: read a SQL query or database table into a DataFrame.
This function is a convenience wrapper around read_sql_table and read_sql_query (for backward compatibility). It delegates to the specific function depending on the input: a SQL query is routed to read_sql_query, while a database table name is routed to read_sql_table. Note that the delegated function may have more specific notes about its functionality not listed here.
Parameter details
sql : SQL query to be executed, or a table name.
con : SQLAlchemy connectable (engine/connection) or DBAPI2 connection (fallback mode). Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
index_col : column(s) to set as the index (MultiIndex).
coerce_float : attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point; useful for SQL result sets.
params : list of parameters to pass to the execute method. The syntax used to pass parameters is database-driver dependent; check your driver documentation for which of the five syntax styles described in PEP 249's paramstyle is supported. E.g. psycopg2 uses %(name)s, so use params={'name': 'value'}.
parse_dates : list of column names to parse as dates; a dict of {column_name: format string}, where the format string is strftime-compatible when parsing strings, or one of (D, s, ns, ms, us) when parsing integer timestamps; or a dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases without native datetime support, such as SQLite.
columns : list of column names to select from the SQL table (only used when reading a table).
chunksize : if specified, return an iterator where chunksize is the number of rows to include in each chunk.
Usage example
```python
import pymysql
import pandas as pd

con = pymysql.connect(host="127.0.0.1", user="root", password="password", db="world")
# Read with a SQL query
data_sql = pd.read_sql("SQL查询语句", con)
# Save
data_sql.to_csv("test.csv")
```
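read_sql also accepts a plain DBAPI2 sqlite3 connection, which makes for a self-contained sketch of params and index_col (the table and data below are invented):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a small table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "张三"), (2, "李四")])

# sqlite3 uses qmark paramstyle; index_col turns "id" into the index.
df = pd.read_sql("SELECT * FROM users WHERE id >= ?", con, params=(1,), index_col="id")
print(len(df))  # 2
```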
pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)
Effect: read a SQL database table into a DataFrame.
Given a table name and a SQLAlchemy connectable, returns a DataFrame. This function does not support DBAPI connections.
Parameter details
table_name : name of the SQL table in the database.
con : SQLAlchemy connectable. The SQLite DBAPI connection mode is not supported.
schema : name of the SQL schema in the database to query (if the database flavor supports this). Uses the default schema if None (default).
index_col : column(s) to set as the index (MultiIndex).
coerce_float : attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. Can result in loss of precision.
parse_dates : list of column names to parse as dates; a dict of {column_name: format string}, where the format string is strftime-compatible when parsing strings, or one of (D, s, ns, ms, us) when parsing integer timestamps; or a dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases without native datetime support, such as SQLite.
columns : list of column names to select from the SQL table.
chunksize : if specified, returns an iterator where chunksize is the number of rows to include in each chunk.
Usage example
```python
import pandas as pd
from sqlalchemy import create_engine

con = create_engine('mysql+pymysql://user_name:password@127.0.0.1:3306/database_name')
data = pd.read_sql_table("table_name", con)
data.to_csv("table_name.csv")
```
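The chunksize parameter is easiest to see with an in-memory SQLite table. read_sql_table itself needs a SQLAlchemy connectable, so the sketch below uses read_sql with a sqlite3 connection instead (the table is invented):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

# With chunksize, read_sql returns an iterator of DataFrames instead of one frame.
chunks = pd.read_sql("SELECT x FROM t", con, chunksize=4)
sizes = [len(c) for c in chunks]
print(sizes)  # [4, 4, 2]
```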
DataFrame.to_csv(path_or_buf=None, sep=', ', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', line_terminator=None, chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')
Parameter details
path_or_buf : string or file handle, default None. File path or object; if None is provided, the result is returned as a string.
sep : string, default ','. Field delimiter for the output file.
na_rep : string, default ''. Missing data representation.
float_format : string, default None. Format string for floating point numbers.
columns : sequence, optional. Columns to write.
header : boolean or list of strings, default True. Write out the column names; if a list of strings is given, it is assumed to be aliases for the column names.
index : boolean, default True. Write row names (index).
index_label : string or sequence, or False, default None. Column label for the index column(s) if desired. If None is given and header and index are True, the index names are used. A sequence should be given if the DataFrame uses a MultiIndex. If False, the index labels are not printed; using index_label=False makes it easier to import the file into R.
mode : string. Python write mode, default 'w'.
encoding : string, optional. A string representing the encoding to use in the output file; defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
compression : string, optional. A string representing the compression to use in the output file; allowed values are 'gzip', 'bz2' and 'xz'. Only used when the first argument is a filename.
line_terminator : string, default '\n'. The newline character or character sequence to use in the output file.
quoting : optional constant from the csv module. Defaults to csv.QUOTE_MINIMAL. If you have set float_format, floats are converted to strings, so csv.QUOTE_NONNUMERIC will treat them as non-numeric.
quotechar : string (length 1), default '"'. Character used to quote fields.
doublequote : boolean, default True. Controls the quoting of quotechar inside a field.
escapechar : string (length 1), default None. Character used to escape sep and quotechar when appropriate.
chunksize : rows to write at a time.
tupleize_cols : boolean, default False. Write MultiIndex columns as a list of tuples (if True), or in the new, expanded format, where each MultiIndex column is a row in the CSV (if False). Deprecated since version 0.21.0: this argument will be removed, and MultiIndex columns will always be written as separate rows in the CSV file.
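A short sketch of a few of these to_csv parameters working together (na_rep, float_format and index):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.5, None], "b": ["x", "y"]})
# index=False drops the row labels; na_rep fills the missing value;
# float_format controls how floats are rendered.
text = df.to_csv(index=False, na_rep="NULL", float_format="%.1f")
print(text.splitlines())  # ['a,b', '1.5,x', 'NULL,y']
```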
DataFrame.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True, freeze_panes=None)
Common parameters
```python
writer = pd.ExcelWriter('data/excel.xlsx')
df.to_excel(writer, sheet_name='user', index=False)
writer.save()
```
Extra: fixing the output column order
```python
data = pd.DataFrame(data=data_list)
# Fix the output order of the columns
data = data.loc[:, columns]
```
```python
import pandas as pd

data = [
    {"name": "张三", "age": 18, "city": "北京"},
    {"name": "李四", "age": 19, "city": "上海"},
    {"name": "王五", "age": 20, "city": "广州"},
    {"name": "赵六", "age": 21, "city": "深圳"},
    {"name": "孙七", "age": 22, "city": "武汉"},
]
df = pd.DataFrame(data, columns=["name", "age", "city"])
df
```
```python
from sqlalchemy import create_engine

table_name = "user"
engine = create_engine(
    "mysql+pymysql://root:0000@127.0.0.1:3306/db_test?charset=utf8",
    max_overflow=0,   # max connections created beyond the pool size
    pool_size=5,      # pool size
    pool_timeout=30,  # max seconds to wait for a pooled connection before raising
    pool_recycle=-1,  # how often to recycle (reset) pooled connections
)
conn = engine.connect()
df.to_sql(table_name, conn, if_exists='append', index=False)
```
Notes:
1. The library used here is sqlalchemy; the official documentation notes that to_sql is supported via sqlalchemy. Documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
2. Use your own database configuration; db_flag is the database type and changes per situation. Before saving data, create the database fields first. 3. engine_config is the database connection configuration.
4. create_engine creates a connection object from the database configuration.
5. if_exists='append' appends the data.
6. index=False means the DataFrame's row index is not saved, so the DataFrame's 3 columns map one-to-one to the database's 3 fields and the save succeeds. Without it the data effectively has 4 columns, which does not match the 3 MySQL columns, and an error is raised.
- A small question: what if we want to save each record as it arrives while iterating, instead of saving only after building the whole DataFrame? The if_exists parameter mentioned above supports appending, which achieves exactly this; saving to CSV has a similar parameter as well (see the official documentation).
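The record-at-a-time idea can be sketched for CSV with mode='a', writing the header only on the first pass (the file name and records below are invented):

```python
import os
import pandas as pd

path = "rows_demo.csv"
records = [{"name": "a", "age": 1}, {"name": "b", "age": 2}]
for i, rec in enumerate(records):
    # First record: mode='w' with a header; later records: mode='a' without one.
    pd.DataFrame([rec]).to_csv(path, mode="a" if i else "w", header=(i == 0), index=False)

out = pd.read_csv(path)
os.remove(path)  # clean up the demo file
print(len(out))  # 2
```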