Data preparation
import pandas as pd
import numpy as np

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
data = {
    "age": [18, 30, np.nan, 40, np.nan, 30],
    "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen", np.nan, " "],
    "sex": [None, "male", "female", "male", np.nan, "unknown"],
    "birth": ["2000-02-10", "1988-10-17", None, "1978-08-08", np.nan, "1988-10-17"]
}
user_info = pd.DataFrame(data=data, index=index)
user_info
Out[181]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female        None
James  40.0   ShenZhen     male  1978-08-08
Andy    NaN        NaN      NaN         NaN
Alice  30.0             unknown  1988-10-17
Convert the birth column to a datetime type
user_info.birth = pd.to_datetime(user_info.birth)
user_info
Out[182]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female         NaT
James  40.0   ShenZhen     male  1978-08-08
Andy    NaN        NaN      NaN         NaT
Alice  30.0             unknown  1988-10-17
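Real data often contains date strings that cannot be parsed at all. The following is a minimal sketch, assuming a hypothetical Series `raw_birth` that is not part of the data above: passing `errors="coerce"` to `pd.to_datetime` turns unparseable entries into NaT instead of raising an exception.

```python
import pandas as pd

# Hypothetical input for illustration; "not a date" stands in for dirty data.
raw_birth = pd.Series(["2000-02-10", None, "not a date"])

# errors="coerce" converts anything that cannot be parsed into NaT.
birth = pd.to_datetime(raw_birth, errors="coerce")

# True marks the entries that ended up as NaT.
print(birth.isna().tolist())
```

Coercing is convenient, but it also hides bad input, so it is worth counting the NaT values afterwards.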
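As you can see, Tom's sex is None, while Mary's age is NaN and her birthday is NaT. To pandas these are all missing values, and you can detect them with the isnull() or notnull() methods (isna() and notna() are aliases).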
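To make the point concrete, here is a small sketch on a hypothetical object Series (the name `values` is only for illustration). It shows that None, NaN and NaT are all reported as missing, while placeholder strings such as "unknown" or a single space are not.

```python
import numpy as np
import pandas as pd

# Hypothetical values: three kinds of missing marker plus two placeholders.
values = pd.Series([None, np.nan, pd.NaT, "unknown", " "], dtype=object)

is_missing = values.isna()      # True for None, NaN and NaT
is_present = values.notna()     # the exact inverse of isna()

print(is_missing.tolist())
```

Placeholder strings therefore have to be converted to real missing values before isna() can see them.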
1. Detecting which entries are missing
user_info.isna()
Out[183]:
         age   city    sex  birth
name
Tom    False  False   True  False
Bob    False  False  False  False
Mary    True  False  False   True
James  False  False  False  False
Andy    True   True   True   True
Alice  False  False  False  False

user_info.isnull()
Out[184]:
         age   city    sex  birth
name
Tom    False  False   True  False
Bob    False  False  False  False
Mary    True  False  False   True
James  False  False  False  False
Andy    True   True   True   True
Alice  False  False  False  False
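The boolean mask is rarely the end goal. A common next step, sketched below on a small hypothetical frame `df`, is to aggregate it: `isna().sum()` counts missing values per column, and `isna().any(axis=1)` flags the rows that contain at least one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration, a reduced version of the user table.
df = pd.DataFrame(
    {
        "age": [18, 30, np.nan, np.nan],
        "sex": [None, "male", "female", np.nan],
    },
    index=["Tom", "Bob", "Mary", "Andy"],
)

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Index labels of rows that have at least one missing value.
rows_with_missing = df.index[df.isna().any(axis=1)].tolist()

print(missing_per_column)
print(rows_with_missing)
```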
2. Filtering out users whose age is missing
user_info[user_info.age.notnull()]
Out[185]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
James  40.0   ShenZhen     male  1978-08-08
Alice  30.0             unknown  1988-10-17
Using dropna on a Series is straightforward. On a DataFrame, more parameters are available.
user_info.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# Drop missing values from a Series
user_info.age.dropna()
Out[187]:
name
Tom      18.0
Bob      30.0
James    40.0
Alice    30.0
Name: age, dtype: float64

# Drop a row as soon as any of its fields is missing
user_info.dropna(axis=0, how='any')
Out[188]:
        age       city      sex       birth
name
Bob    30.0   ShangHai     male  1988-10-17
James  40.0   ShenZhen     male  1978-08-08
Alice  30.0             unknown  1988-10-17

# Drop a row only when all of its fields are missing
user_info.dropna(axis=0, how='all')
Out[189]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female         NaT
James  40.0   ShenZhen     male  1978-08-08
Alice  30.0             unknown  1988-10-17

# Drop a row when either city or sex is missing
user_info.dropna(axis=0, how="any", subset=["city", "sex"])
Out[190]:
        age       city      sex       birth
name
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female         NaT
James  40.0   ShenZhen     male  1978-08-08
Alice  30.0             unknown  1988-10-17
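The examples above cover how and subset but not thresh. As a rough sketch on a hypothetical frame `df`, `thresh=n` keeps only the rows that have at least n non-missing values. It is passed on its own here because, as far as I know, recent pandas releases reject combining thresh with how.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration, a reduced version of the user table.
df = pd.DataFrame(
    {
        "age": [18, np.nan, np.nan],
        "city": ["BeiJing", "GuangZhou", np.nan],
        "sex": [None, "female", np.nan],
    },
    index=["Tom", "Mary", "Andy"],
)

# Keep rows with at least 2 non-missing values: Tom and Mary each have 2.
at_least_two = df.dropna(thresh=2)

# Requiring all 3 values to be present leaves no rows at all.
at_least_three = df.dropna(thresh=3)

print(at_least_two.index.tolist())
```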
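Besides dropping missing values, you can also fill them in. The most common tool is fillna which, as its name suggests, fills missing values.
A common approach is to fill with a scalar. For example, here all missing ages are filled with 0. Missing values can also be filled from the neighbouring entries (forward or backward fill) or by interpolation.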
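# Fill with a scalar
user_info.age.fillna(0)
Out[191]:
name
Tom      18.0
Bob      30.0
Mary      0.0
James    40.0
Andy      0.0
Alice    30.0
Name: age, dtype: float64

# Forward fill: reuse the previous valid value
# (newer pandas versions deprecate the method argument in favour of .ffill())
user_info.age.fillna(method="ffill")
Out[192]:
name
Tom      18.0
Bob      30.0
Mary     30.0
James    40.0
Andy     40.0
Alice    30.0
Name: age, dtype: float64

# Backward fill: reuse the next valid value (equivalent to .bfill())
user_info.age.fillna(method="backfill")
Out[193]:
name
Tom      18.0
Bob      30.0
Mary     40.0
James    40.0
Andy     30.0
Alice    30.0
Name: age, dtype: float64

# Linear interpolation between the neighbouring valid values
user_info.age.interpolate()
Out[194]:
name
Tom      18.0
Bob      30.0
Mary     35.0
James    40.0
Andy     35.0
Alice    30.0
Name: age, dtype: float64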
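The same fills can be written with the dedicated `ffill()` and `bfill()` methods, and a statistic such as the mean is another common fill value. The sketch below rebuilds the age column as a standalone hypothetical Series `age` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Standalone copy of the age column, for illustration only.
age = pd.Series(
    [18, 30, np.nan, 40, np.nan, 30],
    index=["Tom", "Bob", "Mary", "James", "Andy", "Alice"],
    name="age",
)

forward = age.ffill()                 # reuse the previous valid value
backward = age.bfill()                # reuse the next valid value
mean_filled = age.fillna(age.mean())  # mean() skips NaN: (18+30+40+30)/4

print(mean_filled.tolist())
```

Which fill is appropriate depends on the data: forward and backward fills assume the row order is meaningful, while a mean fill does not.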
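Not every missing value is stored as NaN. Suppose, for example, that all of our users are supposed to be young, so an age of 40 can be treated as an anomaly. Likewise, sex is recorded as either male or female, and users whose sex is not known were recorded as "unknown", which we can clearly treat as missing as well. Blank strings also show up from time to time, and those can be considered missing too. In cases like these, the replace method can turn such values into real missing values.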
user_info.age.replace(40, np.nan)
Out[195]:
name
Tom      18.0
Bob      30.0
Mary      NaN
James     NaN
Andy      NaN
Alice    30.0
Name: age, dtype: float64

user_info.age.replace({40: np.nan})  # specify a mapping dict
Out[196]:
name
Tom      18.0
Bob      30.0
Mary      NaN
James     NaN
Andy      NaN
Alice    30.0
Name: age, dtype: float64

user_info.replace({"age": 40, "birth": pd.Timestamp("1978-08-08")}, np.nan)
Out[197]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female         NaT
James   NaN   ShenZhen     male         NaT
Andy    NaN        NaN      NaN         NaT
Alice  30.0             unknown  1988-10-17

user_info.sex.replace("unknown", np.nan)
Out[198]:
name
Tom        None
Bob        male
Mary     female
James      male
Andy        NaN
Alice       NaN
Name: sex, dtype: object

user_info.city.replace(r'\s+', np.nan, regex=True)
Out[199]:
name
Tom        BeiJing
Bob       ShangHai
Mary     GuangZhou
James     ShenZhen
Andy           NaN
Alice          NaN
Name: city, dtype: object
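One caveat about the regex version: when the replacement is not a string, pandas replaces the whole cell whenever the pattern matches anywhere inside it. A pattern like `\s+` would therefore also wipe out a city name that merely contains a space. The sketch below uses a hypothetical Series `city` (including a made-up "New York" entry) to compare it with the anchored pattern `^\s*$`, which only matches cells that are empty or entirely whitespace.

```python
import numpy as np
import pandas as pd

# Hypothetical values; "New York" is added only to show the difference.
city = pd.Series(["BeiJing", " ", "New York", ""], dtype=object)

# Unanchored: any cell containing whitespace becomes NaN, "New York" too.
loose = city.replace(r"\s+", np.nan, regex=True)

# Anchored: only empty or whitespace-only cells become NaN.
strict = city.replace(r"^\s*$", np.nan, regex=True)

print(loose.isna().tolist())
print(strict.isna().tolist())
```

For the data in this article both patterns give the same result, since none of the city names contains a space.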
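Besides manually dropping, filling, and replacing missing values, we can also fill them from another object.
For example, given two Series of user ages, one with missing values and one without, we can pass the elements of the complete Series to the one that has gaps.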
age_new = user_info.age.copy()
age_new.fillna(20, inplace=True)
age_new
Out[200]:
name
Tom      18.0
Bob      30.0
Mary     20.0
James    40.0
Andy     20.0
Alice    30.0
Name: age, dtype: float64

user_info.age.combine_first(age_new)
Out[201]:
name
Tom      18.0
Bob      30.0
Mary     20.0
James    40.0
Andy     20.0
Alice    30.0
Name: age, dtype: float64
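In the example above both Series share the same index, so it is easy to miss that combine_first matches values by index label rather than by position. Here is a minimal sketch with two hypothetical Series, `primary` and `backup`, whose labels are in a different order and only partly overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical data: primary has gaps, backup is ordered differently
# and has no entry for James.
primary = pd.Series(
    [18, np.nan, 40, np.nan], index=["Tom", "Mary", "James", "Andy"]
)
backup = pd.Series([25, 22, 19], index=["Mary", "Tom", "Andy"])

# Values from primary win; backup is consulted only where primary is missing.
merged = primary.combine_first(backup)

print(merged.to_dict())
```

Tom keeps 18 even though backup holds 22 for him, while Mary and Andy pick up 25 and 19 from backup.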