pandas进阶

  pandas是基于numpy构建的库,在数据处理方面能够把它理解为numpy的增强版,因为numpy主要用于科学计算,特长不在于数据处理,咱们日常处理的数据通常带有列标签和index索引,这时pandas做为数据分析包而被开发出来。html

pandas数据结构(Series/DataFrame)python

1、Series

一、Series建立

  Series相似一维数组的数据结构,由一组数据(各类numpy数据类型)和与之关联的数据标签(索引)组成,结构至关于定长有序的字典,index和value之间相互独立.数组

In [2]:
import pandas as pd
import numpy as np
In [3]:
# 建立Series
a1 = pd.Series([1, 2, 3])  # 数组生成Series
a1
Out[3]:
0    1
1    2
2    3
dtype: int64
In [4]:
a2 = pd.Series(np.array([1, 2, 3]))  # numpy数组生成Series
a2
Out[4]:
0    1
1    2
2    3
dtype: int32
In [5]:
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"])  # 指定标签index生成
a3
Out[5]:
index1    1
index2    2
index3    3
dtype: int64
In [6]:
a4 = pd.Series({"index1": 1, "index2": 2, "index3": 3})  # 字典生成Series
a4
Out[6]:
index1    1
index2    2
index3    3
dtype: int64
In [8]:
a5 = pd.Series({"index": 1, "index2": 2, "index3": 3},
               index=["index1", "index2", "index3"])  # 字典生成Series,指定index,不匹配部分为NaN
a5
Out[8]:
index1    NaN
index2    2.0
index3    3.0
dtype: float64
In [9]:
a6 = pd.Series(10, index=["index1", "index2", "index3"])
a6
Out[9]:
index1    10
index2    10
index3    10
dtype: int64
 

二、Series属性

  能够把Series当作一个定长的有序字典数据结构

  能够经过shape(维度),size(长度),index(键),values(值)等获得series的属性app

In [10]:
a1 = pd.Series([1, 2, 3])
a1.index  # Series索引
Out[10]:
RangeIndex(start=0, stop=3, step=1)
In [12]:
a1.values  # Series数值
Out[12]:
array([1, 2, 3], dtype=int64)
In [13]:
a1.name = "population"  # 指定Series名字
a1.index.name = "state"  # 指定Series索引名字
a1
Out[13]:
state
0    1
1    2
2    3
Name: population, dtype: int64
In [14]:
a1.shape
Out[14]:
(3,)
In [15]:
a1.size
Out[15]:
3
 

三、Series查找元素

loc为显示切片(经过键),iloc为隐式切片(经过索引)ide

访问单个元素

s[indexname]
s.loc[indexname] 推荐
s[loc]
s.iloc[loc] 推荐<函数

访问多个元素

s[[indexname1,indexname2]]
s.loc[[indexname1,indexname2]] 推荐
s[[loc1,loc2]]
s.iloc[[loc1,loc2]] 推荐ui

In [17]:
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"])
a3
Out[17]:
index1    1
index2    2
index3    3
dtype: int64
In [18]:
a3["index1"]
Out[18]:
1
In [19]:
a3.loc['index1']
Out[19]:
1
In [20]:
a3[1]
Out[20]:
2
In [22]:
a3.iloc[1]
Out[22]:
2
In [23]:
a3[['index1','index2']]
Out[23]:
index1    1
index2    2
dtype: int64
In [24]:
a3.loc[['index1','index2']]
Out[24]:
index1    1
index2    2
dtype: int64
In [25]:
a3[[1,2]]
Out[25]:
index2    2
index3    3
dtype: int64
In [26]:
a3.iloc[[1,2]]
Out[26]:
index2    2
index3    3
dtype: int64
In [27]:
a3[a3 > np.mean(a3)]  # 布尔值查找元素
Out[27]:
index3    3
dtype: int64
In [28]:
a3[0:2]  # 绝对位置切片
Out[28]:
index1    1
index2    2
dtype: int64
In [30]:
a3["index1":"index2"]  # 索引切片
Out[30]:
index1    1
index2    2
dtype: int64
 

四、Series修改元素

In [32]:
# 修改元素
a3["index3"] = 100  # 按照索引修改元素
a3
Out[32]:
index1      1
index2      2
index3    100
dtype: int64
In [33]:
a3[2] = 1000  # 按照绝对位置修改元素
a3
Out[33]:
index1       1
index2       2
index3    1000
dtype: int64
 

五、Series添加元素

In [34]:
# 添加元素
a3["index4"] = 10  # 按照索引添加元素
a3
Out[34]:
index1       1
index2       2
index3    1000
index4      10
dtype: int64
 

六、Series删除元素

In [35]:
a3.drop(["index4", "index3"], inplace=True)  # inplace=True表示做用在当前Series
a3
Out[35]:
index1    1
index2    2
dtype: int64
 

七、Series方法

In [36]:
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"])
a3["index3"] = np.NaN  # 添加元素
a3
Out[36]:
index1    1.0
index2    2.0
index3    NaN
dtype: float64
In [37]:
a3.isnull()  # 判断Series是否有缺失值
Out[37]:
index1    False
index2    False
index3     True
dtype: bool
In [38]:
a3.notnull()  # 判断Series是否没有缺失值
Out[38]:
index1     True
index2     True
index3    False
dtype: bool
In [39]:
"index1" in a3  # 判断Series中某个索引是否存在
Out[39]:
True
In [47]:
a3.isin([1,2])  # 判断Series中某个值是否存在
Out[47]:
index1     True
index2     True
index3    False
dtype: bool
In [48]:
a3.unique()  # 统计Series中去重元素
Out[48]:
array([ 1.,  2., nan])
In [49]:
a3.value_counts()  # 统计Series中去重元素和个数
Out[49]:
2.0    1
1.0    1
dtype: int64
 

2、Dataframe

  DataFrame是一个【表格型】的数据结构,能够看作是【由Series组成的字典】(共用同一个索引)。DataFrame由按必定顺序排列的多列数据组成。设计初衷是将Series的使用场景从一维拓展到多维。DataFrame既有行索引,也有列索引。spa

行索引:index
列索引:columns
值:values(numpy的二维数组)设计

 

一、建立DataFrame

1.1经过字典建立

In [50]:
data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]}
dataFrame1 = pd.DataFrame(data=data)  # 经过字典建立
dataFrame1
Out[50]:
 
  color price
0 green 1
1 red 2
2 blue 3
3 black 4
4 yellow 5
In [51]:
dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"])
dataFrame2
Out[51]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [52]:
dataFrame3 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"],
                          columns=["price"])  # 指定列索引
dataFrame3
Out[52]:
 
  price
index1 1
index2 2
index3 3
index4 4
index5 5
In [53]:
dataFrame4 = pd.DataFrame(data=np.arange(12).reshape(3, 4))  # 经过numpy数组建立
dataFrame4
Out[53]:
 
  0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [54]:
dic = {
    '张三':[150,150,150,300],
    '李四':[0,0,0,0]
}
pd.DataFrame(data=dic,index=['语文','数学','英语','理综'])
Out[54]:
 
  张三 李四
语文 150 0
数学 150 0
英语 150 0
理综 300 0
In [56]:
data = [[0,150],[0,150],[0,150],[0,300]]
index = ['语文','数学','英语','理综']
columns = ['李四','张三']
pd.DataFrame(data=data,index=index,columns=columns)
Out[56]:
 
  李四 张三
语文 0 150
数学 0 150
英语 0 150
理综 0 300
 

1.2经过Series建立

In [59]:
cars = pd.Series({"Beijing": 300000, "Shanghai": 350000, "Shenzhen": 300000, "Tianjian": 200000, "Guangzhou": 250000,
                  "Chongqing": 150000})
cars
Out[59]:
Beijing      300000
Shanghai     350000
Shenzhen     300000
Tianjian     200000
Guangzhou    250000
Chongqing    150000
dtype: int64
In [60]:
cities = {"Shanghai": 90000, "Foshan": 4500, "Dongguan": 5500, "Beijing": 6600, "Nanjing": 8000, "Lanzhou": None}
apts = pd.Series(cities, name="price")
apts
Out[60]:
Shanghai    90000.0
Foshan       4500.0
Dongguan     5500.0
Beijing      6600.0
Nanjing      8000.0
Lanzhou         NaN
Name: price, dtype: float64
In [61]:
df = pd.DataFrame({"apts": apts, "cars": cars})
df
Out[61]:
 
  apts cars
Beijing 6600.0 300000.0
Chongqing NaN 150000.0
Dongguan 5500.0 NaN
Foshan 4500.0 NaN
Guangzhou NaN 250000.0
Lanzhou NaN NaN
Nanjing 8000.0 NaN
Shanghai 90000.0 350000.0
Shenzhen NaN 300000.0
Tianjian NaN 200000.0
 

1.3经过dicts的list来构建Dataframe

In [62]:
data = [{"Beijing": 1000, "Shanghai": 2500, "Nanjing": 9850}, {"Beijing": 5000, "Shanghai": 4600, "Nanjing": 7000}]
pd.DataFrame(data)
Out[62]:
 
  Beijing Nanjing Shanghai
0 1000 9850 2500
1 5000 7000 4600
 

二、查找DataFrame中的元素

In [65]:
data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]}
dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"])
dataFrame2
Out[65]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [66]:
dataFrame2.columns  # 查找dataFrame中全部列标签
Out[66]:
Index(['color', 'price'], dtype='object')
In [67]:
dataFrame2.index  # 查找dataFrame中的全部行标签
Out[67]:
Index(['index1', 'index2', 'index3', 'index4', 'index5'], dtype='object')
In [68]:
dataFrame2.values  # 查找dataFrame中的全部值
Out[68]:
array([['green', 1],
       ['red', 2],
       ['blue', 3],
       ['black', 4],
       ['yellow', 5]], dtype=object)
In [72]:
dataFrame2["color"]["index1"]  # 索引查找数值(先列后行,不然报错)
Out[72]:
'green'
In [73]:
dataFrame2.at["index1", "color"]  # 索引查找数值(先行后列,不然报错)
Out[73]:
'green'
In [79]:
dataFrame2.iat[0, 1]  # 绝对位置查找数值
Out[79]:
1
 

三、查找DataFrame中某一行/列元素

In [89]:
data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]}
dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"])
dataFrame2
Out[89]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [91]:
dataFrame2.loc["index1"]  # 查找一行元素
Out[91]:
color    green
price        1
Name: index1, dtype: object
In [92]:
dataFrame2.iloc[0]  # 查找一行元素(绝对位置)
Out[92]:
color    green
price        1
Name: index1, dtype: object
In [96]:
dataFrame2.iloc[0:2]  # 经过iloc方法能够拿到行和列,直接按照index的顺序来取。# 能够当作numpy的ndarray的二维数组来操做。
Out[96]:
 
  color price
index1 green 1
index2 red 2
In [100]:
dataFrame2.loc[:, "price"]  # 查找一列元素
Out[100]:
index1    1
index2    2
index3    3
index4    4
index5    5
Name: price, dtype: int64
In [101]:
dataFrame2.iloc[:, 0]  # 查找一列元素(绝对位置)
Out[101]:
index1     green
index2       red
index3      blue
index4     black
index5    yellow
Name: color, dtype: object
In [102]:
dataFrame2.values[0]  # 查找一行元素
Out[102]:
array(['green', 1], dtype=object)
In [103]:
dataFrame2["price"]  # 查找一列元素,#经过列名的方式,查找列,不能查找行
Out[103]:
index1    1
index2    2
index3    3
index4    4
index5    5
Name: price, dtype: int64
In [104]:
dataFrame2["color"] 
Out[104]:
index1     green
index2       red
index3      blue
index4     black
index5    yellow
Name: color, dtype: object
 

四、查找DataFrame中的多行/列元素

In [106]:
dataFrame2.head(5)  # 查看前5行元素
Out[106]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [107]:
dataFrame2.tail(5)  # 查看后5行元素
Out[107]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [108]:
dataFrame2["index1":"index4"]  # 切片多行
Out[108]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
In [109]:
dataFrame2[0:4]  # 切片多行
Out[109]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
In [111]:
dataFrame2.loc[["index1", "index2"]]  # 多行
Out[111]:
 
  color price
index1 green 1
index2 red 2
In [113]:
dataFrame2.iloc[[0, 1]]  # 多行
Out[113]:
 
  color price
index1 green 1
index2 red 2
In [114]:
dataFrame2.loc[:, ["price"]]  # 多列
Out[114]:
 
  price
index1 1
index2 2
index3 3
index4 4
index5 5
In [115]:
dataFrame2.iloc[:, [0, 1]]  # 多列
Out[115]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [116]:
dataFrame2.loc[["index1", "index3"], ["price"]]  # 索引查找
Out[116]:
 
  price
index1 1
index3 3
In [117]:
dataFrame2.iloc[[1, 2], [0]]  # 绝对位置查找
Out[117]:
 
  color
index2 red
index3 blue
 

五、添加一行/列元素

In [119]:
dataFrame2.loc["index6"] = ["pink", 3] 
dataFrame2
Out[119]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
index6 pink 3
In [120]:
dataFrame2.loc["index6"]=10
dataFrame2
Out[120]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
index6 10 10
In [123]:
dataFrame2.iloc[5] = 10
dataFrame2
Out[123]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
index6 10 10
In [125]:
dataFrame2.loc["index7"] = 100
dataFrame2
Out[125]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
index6 10 10
index7 100 100
In [129]:
dataFrame2.loc[:, "size"] = "small"
dataFrame2
Out[129]:
 
  color price size
index1 green 1 small
index2 red 2 small
index3 blue 3 small
index4 black 4 small
index5 yellow 5 small
index6 10 10 small
index7 100 100 small
In [130]:
dataFrame2.iloc[:, 2] = 10
dataFrame2
Out[130]:
 
  color price size
index1 green 1 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
 

六、修改元素

In [131]:
dataFrame2.loc["index1", "price"] = 100
dataFrame2
Out[131]:
 
  color price size
index1 green 100 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
In [132]:
dataFrame2.iloc[0, 1] = 10
dataFrame2
Out[132]:
 
  color price size
index1 green 10 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
In [133]:
dataFrame2.at["index1", "price"] = 100
dataFrame2
Out[133]:
 
  color price size
index1 green 100 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
In [135]:
dataFrame2.iat[0, 1] = 1000
dataFrame2
Out[135]:
 
  color price size
index1 green 1000 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
 

七、删除元素

In [136]:
dataFrame2.drop(["index6", "index7"], inplace=True)  # inplace=True表示做用在原数组
dataFrame2
Out[136]:
 
  color price size
index1 green 1000 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
In [141]:
a=dataFrame2.drop(["price"], axis=1, inplace=False)
dataFrame2
Out[141]:
 
  color price
index1 green 1000
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [142]:
a
Out[142]:
 
  color
index1 green
index2 red
index3 blue
index4 black
index5 yellow
 

8. 处理NaN数据

In [148]:
dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])
df.iloc[1, 1], df.iloc[2, 2] = np.nan, np.nan
df
Out[148]:
 
  a b c d
2018-01-01 0 1.0 2.0 3
2018-01-02 4 NaN 6.0 7
2018-01-03 8 9.0 NaN 11
 

8.1删除NaN数据

In [151]:
re=df.dropna(axis=1, inplace=False)  # inplace默认为false
df
Out[151]:
 
  a b c d
2018-01-01 0 1.0 2.0 3
2018-01-02 4 NaN 6.0 7
2018-01-03 8 9.0 NaN 11
In [152]:
re
Out[152]:
 
  a d
2018-01-01 0 3
2018-01-02 4 7
2018-01-03 8 11
 

8.2填充NaN数据

In [153]:
re2 = df.fillna(value='*')
re2
Out[153]:
 
  a b c d
2018-01-01 0 1 2 3
2018-01-02 4 * 6 7
2018-01-03 8 9 * 11
 

8.3 检查是否存在NaN

In [155]:
df.isnull()
Out[155]:
 
  a b c d
2018-01-01 False False False False
2018-01-02 False True False False
2018-01-03 False False True False
 

9.合并DataFrame

 

9.1 concat函数

In [156]:
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df1
Out[156]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
In [157]:
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df2
Out[157]:
 
  a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
In [158]:
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
df3
Out[158]:
 
  a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
In [159]:
# ignore_index=True将从新对index排序
pd.concat([df1, df2, df3], axis=0, ignore_index=True)
Out[159]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
In [160]:
# ignore_index=True将从新对index排序
pd.concat([df1, df2, df3], axis=0, ignore_index=False)
Out[160]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
 

join参数用法

In [164]:
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
# join默认为'outer',不共有的列用NaN填充 
pd.concat([df1, df2], sort=False, join='outer')
Out[164]:
 
  a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
In [166]:
# join='inner'只合并共有的列
pd.concat([df1, df2], sort=False, join='inner',ignore_index=True)
Out[166]:
 
  b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
 

join_axes参数用法

In [167]:
# 按照df1的index进行合并
pd.concat([df1, df2], axis=1, join_axes=[df1.index])
Out[167]:
 
  a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
 

9.2 append函数

In [169]:
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])

re = df1.append(df2, ignore_index=True)
re
Out[169]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
 

append一组数据

In [170]:
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
s = pd.Series([4, 4, 4, 4], index=['a', 'b', 'c', 'd'])

re = df1.append(s, ignore_index=True)
re
Out[170]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 4.0 4.0 4.0 4.0
 

9.3 merge函数

基于某一列进行合并

In [171]:
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'KEY': ['K1', 'K2', 'K3']})
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3'],
                    'KEY': ['K1', 'K2', 'K3']})

df1
Out[171]:
 
  A B KEY
0 A1 B1 K1
1 A2 B2 K2
2 A3 B3 K3
In [172]:
df2
Out[172]:
 
  C D KEY
0 C1 D1 K1
1 C2 D2 K2
2 C3 D3 K3
In [173]:
re = pd.merge(df1, df2, on='KEY')
re
Out[173]:
 
  A B KEY C D
0 A1 B1 K1 C1 D1
1 A2 B2 K2 C2 D2
2 A3 B3 K3 C3 D3
 

基于某两列进行合并

In [175]:
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'KEY1': ['K1', 'K2', 'K0'],
                    'KEY2': ['K0', 'K1', 'K3']})
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3'],
                    'KEY1': ['K0', 'K2', 'K1'],
                    'KEY2': ['K1', 'K1', 'K0']})
# how:['left','right','outer','inner']
re = pd.merge(df1, df2, on=['KEY1', 'KEY2'], how='inner')
re
Out[175]:
 
  A B KEY1 KEY2 C D
0 A1 B1 K1 K0 C3 D3
1 A2 B2 K2 K1 C2 D2
 

按index合并

In [176]:
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3']},
                   index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3']},
                   index=['K0', 'K1', 'K3'])

re = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
re
Out[176]:
 
  A B C D
K0 A1 B1 C1 D1
K1 A2 B2 C2 D2
K2 A3 B3 NaN NaN
K3 NaN NaN C3 D3
 

为列加后缀

In [177]:
df_boys = pd.DataFrame({'id': ['1', '2', '3'],
                        'age': ['23', '25', '18']})
df_girls = pd.DataFrame({'id': ['1', '2', '3'],
                         'age': ['18', '18', '18']})
re = pd.merge(df_boys, df_girls, on='id', suffixes=['_boys', '_girls'])
re
Out[177]:
 
  id age_boys age_girls
0 1 23 18
1 2 25 18
2 3 18 18
相关文章
相关标签/搜索