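Python has no shortage of libraries for reading and writing Excel files; the best known include xlrd, xlwt, xlutils, and openpyxl. xlrd and xlwt are usually used together, one for reading and the other for writing. xlutils combined with xlrd can modify an existing Excel file. openpyxl can both read and write Excel files.
When it comes to data preprocessing, however, pandas shows its strength. It also reads and writes many file formats, Excel among them. This article focuses on using pandas to preprocess Excel datasets.
Commonly used machine learning models place requirements on their input data; the most basic requirement of most algorithms is that the training data be converted to numeric form. Some algorithms, such as decision trees, do not need a numeric conversion, but those special cases are not discussed here.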
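As a rough illustration of what "converting to numeric form" can look like, the sketch below finds the non-numeric columns of a small DataFrame and encodes one of them with pd.factorize. The data and the column names (height, color, color_code) are made up for this example and do not come from the article's dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "height": [1.62, 1.80, 1.75],
    "color": ["red", "blue", "red"],
})

# Columns that are not already numeric need an encoding step.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# pd.factorize assigns integer codes in order of first appearance.
codes, uniques = pd.factorize(df["color"])
df["color_code"] = codes

print(non_numeric)
print(df)
```

Integer codes like these imply an ordering that may not exist in the data, so for some models a one-hot encoding may be the better choice.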
The pandas function for reading Excel files is pandas.read_excel(). Its main parameters include:
io : the location of the Excel file to read
string, path object (pathlib.Path or py._path.local.LocalPath),
file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx
sheet_name : the sheet (or sheets) of the Excel file to read
string, int, mixed list of strings/ints, or None, default 0
Strings are used for sheet names, Integers are used in zero-indexed sheet positions.
Lists of strings/integers are used to request multiple sheets.
Specify None to get all sheets.
str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.
Available Cases
- Defaults to 0 -> 1st sheet as a DataFrame
- 1 -> 2nd sheet as a DataFrame
- "Sheet1" -> 1st sheet as a DataFrame
- [0, 1, "Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames
- None -> All sheets as a dictionary of DataFrames
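The sheet_name cases listed above can be tried out with a small sketch. It builds a two-sheet workbook in memory instead of on disk; the sheet names and data are hypothetical, and it assumes the openpyxl engine is installed.

```python
import io
import pandas as pd

# Build a small two-sheet workbook in memory (sheet names are hypothetical).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="Sheet2", index=False)

# Open the workbook once and reuse it for several reads.
buffer.seek(0)
xls = pd.ExcelFile(buffer, engine="openpyxl")

first = pd.read_excel(xls, sheet_name=0)                # int -> one DataFrame
by_name = pd.read_excel(xls, sheet_name="Sheet2")       # str -> one DataFrame
several = pd.read_excel(xls, sheet_name=[0, "Sheet2"])  # list -> dict of DataFrames
everything = pd.read_excel(xls, sheet_name=None)        # None -> dict of all sheets

print(type(first).__name__, type(several).__name__)
print(list(everything.keys()))
```

Note that the dictionary returned for a list is keyed by whatever was requested (here 0 and "Sheet2"), while sheet_name=None keys the result by sheet name.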
header : which row of the Excel sheet, if any, to use as the column names
int, list of ints, default 0
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.
names : the name to give each column, passed as an array-like
array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None
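A minimal sketch of how header and names interact, assuming a sheet that has no header row. The workbook is built in memory with hypothetical data, and the openpyxl engine is assumed to be installed.

```python
import io
import pandas as pd

# Write a sheet with no header row and no index column (hypothetical data).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame([["string1", 1], ["string2", 2]]).to_excel(
        writer, header=False, index=False
    )

# With the default header=0, the first data row is consumed as column labels.
buffer.seek(0)
default_read = pd.read_excel(buffer, engine="openpyxl")

# header=None keeps every row as data; names supplies the column labels.
buffer.seek(0)
named = pd.read_excel(
    buffer, header=None, names=["Name", "Value"], engine="openpyxl"
)

print(len(default_read), len(named))
print(named)
```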
index_col : which column of the Excel sheet, if any, to use as the row labels
int, list of ints, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.
usecols : the columns to read; a loaded Excel file often contains columns that are not needed
int or list, default None
- If None then parse all columns,
- If int then indicates last column to be parsed
- If list of ints then indicates list of column numbers to be parsed
- If string then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides.
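The following sketch shows usecols together with index_col, including the point that index_col is counted within the selected subset. The column names (id, Name, Value, Extra) and the data are hypothetical, and the openpyxl engine is assumed to be installed.

```python
import io
import pandas as pd

# Hypothetical sheet with four columns: A=id, B=Name, C=Value, D=Extra.
source = pd.DataFrame({
    "id": [10, 20],
    "Name": ["string1", "string2"],
    "Value": [1, 2],
    "Extra": ["x", "y"],
})
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    source.to_excel(writer, index=False)

# Excel-letter range: keep columns A to C, then use the first kept column as index.
buffer.seek(0)
by_letters = pd.read_excel(buffer, usecols="A:C", index_col=0, engine="openpyxl")

# List of ints: keep the columns at zero-based positions 1 and 2.
buffer.seek(0)
by_position = pd.read_excel(buffer, usecols=[1, 2], engine="openpyxl")

print(by_letters)
print(by_position)
```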
Below are some examples of reading Excel data with pandas.
Write a dataset to an Excel file:
>>> import pandas as pd
>>> df_out = pd.DataFrame([('string1', 1),
...                        ('string2', 2),
...                        ('string3', 3)],
...                       columns=['Name', 'Value'])
>>> df_out
      Name  Value
0  string1      1
1  string2      2
2  string3      3
>>> df_out.to_excel('tmp.xlsx')
Read the Excel file:
>>> pd.read_excel('tmp.xlsx', index_col=0)  # first column holds the saved index
      Name  Value
0  string1      1
1  string2      2
2  string3      3
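Setting both index_col and header to None means the first row and first column of the Excel file are not used as the column names and the index; they are read as ordinary data: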
>>> pd.read_excel('tmp.xlsx', index_col=None, header=None)
     0        1      2
0  NaN     Name  Value
1  0.0  string1      1
2  1.0  string2      2
3  2.0  string3      3
You can even specify the data type of each column:
>>> pd.read_excel('tmp.xlsx', index_col=0,
...               dtype={'Name': str, 'Value': float})
      Name  Value
0  string1    1.0
1  string2    2.0
2  string3    3.0
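Here is a combined example: read the sheet named sheet1 from text.xlsx, loading only columns D:F. Column F holds the class label (the column is named 类别, "class"), and its values 类别1 and 类别2 ("class 1" and "class 2") need to be converted to numbers before the data can be used as input to a machine learning model.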
import pandas as pd

def reader(path, sheet):
    # Load only columns D:F of the given sheet
    return pd.read_excel(path, sheet_name=sheet, usecols='D:F')

trainrd = reader('text.xlsx', 'sheet1')
print(trainrd.head(5))  # inspect the first 5 rows
trainrd['x'] = 0  # create a new column x
trainrd.loc[trainrd['类别'] == '类别1', 'x'] = 0  # convert the text in the 类别 column to numbers
trainrd.loc[trainrd['类别'] == '类别2', 'x'] = 1
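An alternative way to do the same label conversion is Series.map with a dictionary. The sketch below is not the article's method; it uses a small hand-made DataFrame as a stand-in for the data loaded from text.xlsx, and the extra label 其他 ("other") is invented to show what happens to an unexpected value.

```python
import pandas as pd

# Hypothetical stand-in for the data loaded from text.xlsx.
trainrd = pd.DataFrame({"类别": ["类别1", "类别2", "类别1", "其他"]})

# Explicit mapping from text label to number.
label_map = {"类别1": 0, "类别2": 1}
trainrd["x"] = trainrd["类别"].map(label_map)

# Labels missing from the mapping become NaN instead of silently staying 0.
unmapped = int(trainrd["x"].isna().sum())

print(trainrd)
print("unmapped labels:", unmapped)
```

With the .loc approach, any label other than 类别1 or 类别2 keeps the initial value 0 and is indistinguishable from 类别1; with map it shows up as NaN and can be checked before training.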