Python Data Processing: pandas Basics

Date: 2019-11-12

Sources for this article:

　　Python for Data Analysis: Chapter 5

　　10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html#min

The examples for this article can be viewed at: http://nbviewer.jupyter.org/github/RZAmber/for_blog/blob/master/learn_numpy.ipynb

1. Introduction to pandas

After several years of development, pandas has become the most commonly used Python package for data processing. The following were the original goals behind pandas, and they are still its most frequently used features today:

　　a: Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.

　　b: Integrated time series functionality

　　c: The same data structures handle both time series data and non-time series data.

　　d: Arithmetic operations and reductions (like summing across an axis) would pass on the metadata (axis labels).

　　e: Flexible handling of missing data

　　f: Merge and other relational operations found in popular databases (SQL-based, for example)

An article titled "Don't use Hadoop when your data isn't that big" argues that Hadoop is a reasonable technology choice only once the data exceeds roughly 5 TB. So for data below that scale, Python with pandas is usually enough.

2. pandas data structure

2.1 Series

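A Series is a one-dimensional array-like object made up of two parts: (1) an array of any NumPy data type, and (2) the data labels, called the index.

So a Series has two main attributes: values and index.

A typical first step is to create a Series and then read its values and index.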
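As a minimal sketch of that step (the numbers and labels are made up for illustration), creating a Series with and without an explicit index might look like this:

```python
import pandas as pd

# Without an index argument, pandas supplies a default integer index 0..n-1.
obj = pd.Series([4, 7, -5, 3])
print(obj.values)  # the underlying NumPy array
print(obj.index)   # the default integer index

# Explicit labels can be supplied through the index argument.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(obj2["a"])   # look up a value by its label
```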

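A Series can also be created by passing a dict, which pandas can convert into a series-like structure.

The dict keys become the index. You can also pass an index argument to Series to fix the order of the index; the values are then matched to it automatically by key.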
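A small sketch of the dict-based construction, with illustrative state names and figures. Note that a label in the index argument with no matching key gets NaN, and a key that is not listed in the index is left out:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

# Dict keys become the index labels.
obj3 = pd.Series(sdata)

# An explicit index fixes the order; values are matched by key.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4)  # 'California' has no key in sdata, so its value is NaN
```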

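An important feature of Series is data alignment: in arithmetic operations, data with different indexes is aligned automatically so that values sharing the same label are combined with each other.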
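To illustrate the alignment behaviour, here is a short sketch with two made-up Series whose indexes only partly overlap. Labels found in only one operand produce NaN in the result:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Values are paired by label, not by position.
total = a + b
print(total)  # 'y' and 'z' are summed; 'x' and 'w' are NaN
```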

In addition, both the Series object itself and its index have a name attribute, for example obj.name = 'population' and obj.index.name = 'state'.

2.2 DataFrame

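A DataFrame represents tabular, spreadsheet-like or relational-database-like data. It contains an ordered collection of columns; the values within one column share a data type, but different columns can have different types.

A DataFrame has two indexes: a row index and a column index.

Ways to create a DataFrame: from a dict of equal-length lists, arrays, or tuples; from a nested dict of dicts; from a dict of Series; and so on. See Table 5.1 in the book for details.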
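The three construction routes listed above can be sketched as follows. All data is illustrative, and the variable names (frame, frame2, frame3) are placeholders rather than names from the original notebook:

```python
import pandas as pd

# 1) dict of equal-length lists; columns= fixes the column order
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(data, columns=["year", "state", "pop"])

# 2) nested dict of dicts: outer keys become columns, inner keys become row labels
nested = {"Nevada": {2001: 2.4, 2002: 2.9}, "Ohio": {2000: 1.5, 2001: 1.7}}
frame2 = pd.DataFrame(nested)

# 3) dict of Series: the row index is the union of the Series indexes
frame3 = pd.DataFrame({
    "a": pd.Series([1, 2], index=["x", "y"]),
    "b": pd.Series([3, 4], index=["y", "z"]),
})
print(frame, frame2, frame3, sep="\n\n")
```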

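Retrieving a column: use obj3['state'] or obj3.year. The result is a Series with the same index as the DataFrame.

Retrieving a row: use the ix indexer with the row's position or label. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0; in current versions use loc (by label) or iloc (by position) instead.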
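A minimal sketch of column and row retrieval on a current pandas version, using loc and iloc in place of the removed ix. The frame contents and row labels are made up:

```python
import pandas as pd

obj3 = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]},
    index=["one", "two", "three"],
)

col = obj3["state"]             # column by key, returned as a Series
same_col = obj3.year            # attribute-style access to a column
row_by_label = obj3.loc["two"]  # row by label
row_by_pos = obj3.iloc[1]       # row by integer position
print(row_by_label)
```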

Commonly used operations:

del: deletes a column, e.g. del obj['year']

Common attributes: index and columns (each of which has a name attribute), and values
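A brief sketch of these operations on a made-up frame. The names 'state' and 'item' are arbitrary choices for illustration:

```python
import pandas as pd

obj = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]}, index=["a", "b"])

del obj["year"]            # removes the column in place

obj.index.name = "state"   # name of the row index
obj.columns.name = "item"  # name of the column index
vals = obj.values          # the data as a 2-D NumPy array
print(obj)
```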

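2.3 Index objects and reindexing

The role of a pandas Index: holding the axis labels and other metadata (like the axis name or names).

Index objects are immutable, meaning the user cannot modify them in place, so an assignment such as index[1] = 'd' raises an error. This relates to point a in the introduction: because labels cannot change underneath you, they can be shared safely between data structures.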
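The immutability can be checked with a short sketch like the one below. The Series contents are arbitrary; the point is only that item assignment on an Index raises a TypeError:

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

try:
    index[1] = "d"   # item assignment is not supported on an Index
    modified = True
except TypeError as exc:
    modified = False
    print("Index is immutable:", exc)
```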

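The reindex() method can change, add, or delete index entries on a given axis. It returns a copy of the original data.

Parameters of reindex():

　　　　index: the new sequence to use as the index in place of the old one; an Index instance is used as-is without copying. pandas operations generally return a copy of the original values, which differs from ndarray

　　　　method: the fill method, either ffill (forward fill) or bfill (backward fill)

　　　　fill_value: the value to use in place of the NaN introduced by reindexing

　　　　copy, and others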
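The parameters above can be sketched as follows, with made-up data. The method argument needs a sorted (monotonic) index, which is why the second example uses integer labels:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# New label 'c' does not exist in obj, so it becomes NaN ...
obj2 = obj.reindex(["a", "b", "c", "d"])
# ... unless fill_value supplies a replacement.
obj3 = obj.reindex(["a", "b", "c", "d"], fill_value=0)

# Forward-fill gaps when reindexing an ordered index.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = colors.reindex(range(6), method="ffill")
print(filled)
```

The original obj keeps its own index, since reindex() returns a new object.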

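3. Viewing data

　　 3.1 Sorting: returns a new, sorted object

　　　　a: Sorting by axis (rows or columns)

　　　　　　sort_index()

　　　　　　Parameters: sorts by the row index by default; axis=1 sorts by the column labels

　　　　　　　　　　　ascending order by default; use ascending=False for descending order

　　　　b: Sorting by value

　　　　　　order(): missing values are placed at the end. Note that order() was deprecated in pandas 0.17 and later removed; use sort_values() in current versions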
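A minimal sketch of both kinds of sorting on a current pandas version, using sort_values() in place of the removed order(). The frame and Series are made up for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(8).reshape((2, 4)),
    index=["three", "one"],
    columns=["d", "a", "b", "c"],
)

by_rows = frame.sort_index()                      # sort by row labels (default)
by_cols = frame.sort_index(axis=1)                # sort by column labels
desc = frame.sort_index(axis=1, ascending=False)  # descending column order

s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
by_value = s.sort_values()                        # NaN values go to the end
print(by_value)
```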

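　　3.2 Ranking

　　　　rank(): assigns each value its rank (its position in sorted order, starting from 1) and returns a new object. When values are tied, the default is to give each of them the mean rank of the tied group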
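A short sketch of the default tie handling, next to method='first' (which breaks ties by order of appearance) for comparison. The values are illustrative:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = obj.rank()                # ties share the mean of their ranks
first = obj.rank(method="first")  # ties are ranked in order of appearance
print(ranks.tolist())
print(first.tolist())
```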

　　3.3 unique

　　　　is_unique: tells you whether the values are unique or not; returns True or False

　　　　unique(): returns the distinct values as an array
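For instance, with a made-up Series of repeated letters, a sketch of both calls could be:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

print(obj.is_unique)        # False, because values repeat
uniques = obj.unique()      # distinct values in order of first appearance
print(uniques)
print(obj.index.is_unique)  # the default integer index has no duplicates
```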

　　3.4 value_counts(): counts how many times each value occurs in a Series

　　3.5 describe(): produces a quick statistical summary of the data
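A minimal sketch of these two methods on illustrative data:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
counts = obj.value_counts()  # a Series mapping each value to its frequency
print(counts)

nums = pd.Series([1.0, 2.0, 3.0, 4.0])
summary = nums.describe()    # count, mean, std, min, quartiles, max
print(summary)
```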

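4. Selecting data

　　4.1 drop

　　Dropping rows:

　　pandas operations generally return a copy of the original values, which differs from ndarray. For example, if you drop a row and then look at the original object, you will find it unchanged.

　　Dropping columns: obj4.drop('Nevada', axis=1)

　　　　　　Many pandas functions work on rows by default, which is why they take an axis parameter

　　　　　　axis=1 refers to the columns

　　　　　　axis=0 refers to the rows (the default)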
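The row and column cases can be sketched together. The frame here is made up (only the column name 'Nevada' comes from the text above), and the last line shows that the original object is left untouched:

```python
import numpy as np
import pandas as pd

obj4 = pd.DataFrame(
    np.arange(6).reshape((3, 2)),
    index=["a", "b", "c"],
    columns=["Nevada", "Ohio"],
)

no_row = obj4.drop("b")               # axis=0 is the default: drop a row
no_col = obj4.drop("Nevada", axis=1)  # axis=1: drop a column

print(obj4.shape)  # still (3, 2); drop() returned new objects
```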

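　　4.2 Selection, slicing, and indexing

　　a: Selecting a single column returns a Series; df['A'] and df.A mean the same thing

　　b: Selecting with [] and a slice slices the rows

　　c: Selecting by label: the endpoint is inclusive, i.e. obj['b':'c'] includes row 'c'

　　d: Selecting a subset of rows and columns: ix (removed in pandas 1.0; use loc or iloc in current versions)

　　e: Indexing by label: loc

　　f: Indexing by position: iloc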
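The selection styles listed above can be sketched on a small made-up frame. ix is left out because it no longer exists in current pandas; loc and iloc cover the same ground:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=["a", "b", "c", "d"],
    columns=["A", "B", "C"],
)

col = df["A"]                            # single column, same as df.A
first_two = df[0:2]                      # [] with a slice selects rows
label_slice = df["b":"c"]                # label slice, endpoint 'c' included
subset = df.loc[["a", "d"], ["A", "C"]]  # rows and columns by label
by_pos = df.iloc[1:3, 0:2]               # rows and columns by position
print(subset)
```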

　　4.3 Filtering with the isin() method

　　　　isin() returns a boolean mask that can be used to filter the data
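A short sketch of filtering with isin(), using made-up column names and values:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "E": ["one", "two", "three", "four"]})

# True where the value in column E is one of the listed items.
mask = df["E"].isin(["two", "four"])
filtered = df[mask]
print(filtered)
```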

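5. Handling missing values

　　5.1 Missing values

　　　　pandas uses NaN (a floating-point value) to represent missing data

　　 5.2 Dropping rows or columns that contain missing values

　　　　dropna

　　　　Parameters: how='all' drops only the rows in which every value is NA

　　　　　　　　 axis=1 drops columns instead of rows

　　　　　　　　 thresh=3 keeps only the rows that have at least 3 non-NA observations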
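These dropna options can be sketched on a small made-up frame. The extra column name 'empty' is a placeholder added only to demonstrate axis=1:

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([
    [1.0, 6.5, 3.0, 4.0],
    [1.0, NA, NA, 2.0],
    [NA, NA, NA, NA],
    [NA, 6.5, 3.0, 1.0],
])

any_na = data.dropna()            # drop rows containing any NA
all_na = data.dropna(how="all")   # drop only rows that are entirely NA
by_thresh = data.dropna(thresh=3) # keep rows with at least 3 non-NA values

# Add an all-NA column, then drop columns that are entirely NA.
with_empty = data.assign(empty=NA)
no_empty_col = with_empty.dropna(axis=1, how="all")
print(by_thresh)
```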

　　5.3 Filling in missing values

　　　　fillna
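As a minimal sketch with made-up data, fillna accepts either a single replacement value or a dict that gives one value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, np.nan]})

zero_filled = df.fillna(0)                      # one value for every gap
per_column = df.fillna({"x": -1.0, "y": 99.0})  # a different value per column
print(per_column)
```

Both calls return new objects; df itself still contains its NaN values.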

　　5.4 isnull: returns a like-typed object of boolean values indicating which values are missing

　　　 notnull: the negation of isnull
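A quick sketch of the two methods on an illustrative Series, including their use as a filter:

```python
import numpy as np
import pandas as pd

s = pd.Series(["aardvark", "artichoke", np.nan, "avocado"])

missing = s.isnull()   # True where the value is missing
present = s.notnull()  # the element-wise negation of isnull()
print(s[present])      # keep only the non-missing values
```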

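6. Computation functions

　　a: Adding two DataFrame objects with different indexes using "+" gives a result whose labels are the union of both, similar to a union in a database; positions without a match in both operands are NaN

　　b: For explicit addition and subtraction use add() or sub(); missing values can be replaced through fill_value

　　c: sum, count, min, max, and other reduction and summary methods

　　d: Correlation and covariance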

　　　　　.corr()

　　　　　.cov()
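Points a to d above can be sketched as follows. All frames and column names are made up; note that fill_value only helps where one operand has a value, so a position missing from both frames stays NaN:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.0).reshape((2, 2)), index=["a", "b"], columns=["x", "y"])
df2 = pd.DataFrame(np.ones((2, 2)), index=["b", "c"], columns=["y", "z"])

plus = df1 + df2                     # a: union of labels, NaN where unmatched
filled = df1.add(df2, fill_value=0)  # b: treat a one-sided gap as 0

col_sums = df1.sum()                 # c: reduce down the rows, one value per column
row_sums = df1.sum(axis=1)           #    reduce across the columns, one value per row

returns = pd.DataFrame({"p": [1.0, 2.0, 3.0, 4.0], "q": [2.0, 4.0, 6.0, 8.0]})
corr = returns.corr()                # d: pairwise correlation matrix
cov = returns.cov()                  #    pairwise covariance matrix
print(plus, filled, corr, cov, sep="\n\n")
```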

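7. Merging and reshaping

8. Grouping

　　By "group by" we usually mean a process involving one or more of the following steps:

　　(Splitting) splitting the data into groups according to some criteria;

　　(Applying) applying a function to each group independently;

　　(Combining) combining the results into a data structure.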
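The three steps above can be sketched with a tiny made-up frame. The column names 'key' and 'value' are placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

grouped = df.groupby("key")["value"]  # split: one group per distinct key
totals = grouped.sum()                # apply sum to each group, combine into a Series
means = grouped.mean()                # same pattern with a different function
print(totals)
print(means)
```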

Note: this article is not comprehensive; it only summarizes the parts I currently need.