Data Preprocessing for Machine Learning: Reading Excel Data with Pandas

Date: 2019-12-08

Tags: machine learning, data preprocessing, pandas, reading Excel


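Python has quite a few libraries for reading and writing Excel files; the best known are xlrd, xlwt, xlutils, and openpyxl. xlrd and xlwt are usually used together: one reads Excel files and the other writes them. Combining xlutils with xlrd makes it possible to modify an existing Excel file. openpyxl can both read and write Excel files.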

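When it comes to data preprocessing, pandas shows its strength. It also reads and writes many file formats, Excel among them. This article focuses on using pandas to preprocess datasets stored in Excel.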

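Commonly used machine learning models place requirements on their input data. The most basic requirement of most machine learning algorithms is that the training data be converted to numeric form. Some algorithms, such as decision trees, do not in principle need this conversion; they are not treated as a special case here.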
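
A quick way to see which columns still need converting is to list the non-numeric ones. The following is a minimal sketch on a made-up DataFrame; the column names `height`, `weight`, and `label` are assumptions for illustration and do not come from the article's dataset.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are made up for illustration.
df = pd.DataFrame({
    "height": [1.72, 1.65, 1.80],
    "weight": [68, 54, 77],
    "label": ["A", "B", "A"],
})

# Columns whose dtype is not numeric still need to be encoded before training.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Here only `label` is reported, so it is the one column that would need encoding.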

The pandas function for reading Excel files is pandas.read_excel(). Its main parameters include:

io: the location of the Excel document to read.

string, path object (pathlib.Path or py._path.local.LocalPath), file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx

sheet_name: the sheet or sheets of the Excel file to read.

string, int, mixed list of strings/ints, or None, default 0

Strings are used for sheet names, Integers are used in zero-indexed sheet positions.

Lists of strings/integers are used to request multiple sheets.

Specify None to get all sheets.

str|int -> DataFrame is returned. list|None -> Dict of DataFrames is returned, with keys representing sheets.

Available Cases

Defaults to 0 -> 1st sheet as a DataFrame

1 -> 2nd sheet as a DataFrame

"Sheet1" -> 1st sheet as a DataFrame

[0, 1, "Sheet5"] -> 1st, 2nd & 5th sheet as a dictionary of DataFrames

None -> All sheets as a dictionary of DataFrames
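
When sheet_name is a list or None, the result is a dictionary of DataFrames, so it usually needs one more step before it can be used as a single dataset. The sketch below assumes every sheet has the same columns. The helper name `combine_sheets` and the sample data are hypothetical, and the dictionary is built by hand as a stand-in for what `pd.read_excel(path, sheet_name=None)` would return.

```python
import pandas as pd

def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} dict into one DataFrame with a 'sheet' column."""
    # The dict keys become the outer index level, named 'sheet' here.
    combined = pd.concat(sheets, names=["sheet", "row"])
    # Move the sheet name into a regular column and discard the per-sheet row index.
    return combined.reset_index(level="sheet").reset_index(drop=True)

# Hand-built stand-in for the result of pd.read_excel(path, sheet_name=None).
sheets = {
    "Sheet1": pd.DataFrame({"Name": ["a", "b"], "Value": [1, 2]}),
    "Sheet2": pd.DataFrame({"Name": ["c"], "Value": [3]}),
}
combined = combine_sheets(sheets)
print(combined)
```

Keeping the sheet name as a column preserves where each row came from, which can matter if the sheets represent different batches or sources.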

header: whether the first row of the Excel sheet is used as the column names.

int, list of ints, default 0

Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.

names: the name to use for each column, passed as an array-like.

array-like, default None

List of column names to use. If file contains no header row, then you should explicitly pass header=None

index_col: whether the first column of the Excel sheet is used as the row labels.

int, list of ints, default None

Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

usecols: specifies which columns to read; the Excel file being loaded often contains columns that are not needed.

int or list, default None

If None then parse all columns,

If int then indicates last column to be parsed

If list of ints then indicates list of column numbers to be parsed

If string then indicates comma separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of both sides.
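
To make the letter-based form concrete, the following pure-Python sketch expands such a string into 0-based column positions. It only illustrates the semantics described above and is not how pandas implements it; the helper names `column_letter_to_index` and `expand_usecols` are hypothetical.

```python
def column_letter_to_index(letters):
    """Convert an Excel column letter such as 'A' or 'AB' to a 0-based index."""
    index = 0
    for ch in letters.strip().upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1

def expand_usecols(spec):
    """Expand a spec like 'A,C,E:F' into a list of 0-based column indices."""
    indices = []
    for part in spec.split(","):
        if ":" in part:
            start, end = part.split(":")
            lo, hi = column_letter_to_index(start), column_letter_to_index(end)
            indices.extend(range(lo, hi + 1))  # ranges include both ends
        else:
            indices.append(column_letter_to_index(part))
    return indices

print(expand_usecols("A:E"))      # [0, 1, 2, 3, 4]
print(expand_usecols("A,C,E:F"))  # [0, 2, 4, 5]
print(expand_usecols("D:F"))      # [3, 4, 5]
```

So a spec of "D:F" selects the fourth through sixth columns of the sheet.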

Below are some examples of reading Excel data with pandas:

Write a dataset to an Excel file:

 
    >>> df_out = pd.DataFrame([('string1', 1),
    ...                        ('string2', 2),
    ...                        ('string3', 3)],
    ...                       columns=['Name', 'Value'])
    >>> df_out
          Name  Value
    0  string1      1
    1  string2      2
    2  string3      3
    >>> df_out.to_excel('tmp.xlsx')
 

Read the Excel file:

 
    >>> pd.read_excel('tmp.xlsx', index_col=0)
          Name  Value
    0  string1      1
    1  string2      2
    2  string3      3
 

Setting both index_col and header to None means that the first row and first column of the Excel file are not used as the header and the index:

 
    >>> pd.read_excel('tmp.xlsx', index_col=None, header=None)
         0        1      2
    0  NaN     Name  Value
    1  0.0  string1      1
    2  1.0  string2      2
    3  2.0  string3      3
 

You can even specify the data type of each column:

 
    >>> pd.read_excel('tmp.xlsx', index_col=0,
    ...               dtype={'Name': str, 'Value': float})
          Name  Value
    0  string1    1.0
    1  string2    2.0
    2  string3    3.0
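
If the file has already been loaded, a similar conversion can be applied afterwards with DataFrame.astype, which accepts the same kind of column-to-type mapping. This is a minimal sketch on an in-memory DataFrame that mirrors the tmp.xlsx data, so no Excel file is needed to run it.

```python
import pandas as pd

# In-memory copy of the data written to tmp.xlsx in the examples.
df = pd.DataFrame({"Name": ["string1", "string2", "string3"],
                   "Value": [1, 2, 3]})

# Cast after loading; the mapping has the same shape as the dtype argument.
converted = df.astype({"Name": str, "Value": float})
print(converted.dtypes)
```

Passing dtype at read time avoids a second pass over the data, while astype is handy when the right types only become clear after inspecting the loaded values.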
 

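Below is a combined example: read the sheet1 sheet of text.xlsx, loading only columns D:F. Column F holds the class label, and its values 类别1 ("category 1") and 类别2 ("category 2") need to be converted to numbers so the data can serve as input for machine learning modeling.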

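import pandas as pd

def reader(path, sheet):
    # Load only columns D:F from the given sheet
    return pd.read_excel(path, sheet_name=sheet, usecols='D:F')

trainrd = reader('text.xlsx', 'sheet1')
print(trainrd.head(5))  # show the first 5 rows
trainrd['x'] = 0  # create a new column x
# Convert the text labels in the '类别' (category) column to numbers
trainrd.loc[trainrd['类别'] == '类别1', 'x'] = 0
trainrd.loc[trainrd['类别'] == '类别2', 'x'] = 1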
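
One possible alternative to the two .loc assignments is to map the labels through a dictionary. In the version above, any label other than 类别1 or 类别2 silently keeps the default value 0. With Series.map, labels missing from the dictionary become NaN and are easy to detect. The sketch below uses a hand-built DataFrame in place of the loaded sheet, and the extra label 类别3 is a made-up value added only to show the unmapped case.

```python
import pandas as pd

# Stand-in for the loaded sheet; '类别' is the label column from the example.
# '类别3' is a hypothetical unexpected label.
trainrd = pd.DataFrame({"类别": ["类别1", "类别2", "类别1", "类别3"]})

label_map = {"类别1": 0, "类别2": 1}
trainrd["x"] = trainrd["类别"].map(label_map)

# Labels missing from the mapping become NaN, which makes them easy to spot.
unmapped = trainrd.loc[trainrd["x"].isna(), "类别"].unique().tolist()
print(unmapped)
```

Because of the NaN, column x is stored as floating point here; once every label is covered by the mapping it can be cast back to integers with astype(int).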