版权声明:本套技术专栏是做者(秦凯新)平时工做的总结和升华,经过从真实商业环境抽取案例进行总结和分享,并给出商业应用的调优建议和集群环境容量规划等内容,请持续关注本套博客。QQ邮箱地址:1120746959@qq.com,若有任何学术交流,可随时联系。dom
数据包含球员和裁判的信息,2012-2013年的比赛数据,总共设计球员2053名,裁判3147名,特征列表以下:ide
Variable Name: | Variable Description: |
---|---|
playerShort | short player ID |
player | player name |
club | player club |
leagueCountry | country of player club (England, Germany, France, and Spain) |
height | player height (in cm) |
weight | player weight (in kg) |
position | player position |
games | number of games in the player-referee dyad |
goals | number of goals in the player-referee dyad |
yellowCards | number of yellow cards player received from the referee |
yellowReds | number of yellow-red cards player received from the referee |
redCards | number of red cards player received from the referee |
photoID | ID of player photo (if available) |
rater1 | skin rating of photo by rater 1 |
rater2 | skin rating of photo by rater 2 |
refNum | unique referee ID number (referee name removed for anonymizing purposes) |
refCountry | unique referee country ID number |
meanIAT | mean implicit bias score (using the race IAT) for referee country |
nIAT | sample size for race IAT in that particular country |
seIAT | standard error for mean estimate of race IAT |
meanExp | mean explicit bias score (using a racial thermometer task) for referee country |
nExp | sample size for explicit bias in that particular country |
seExp | standard error for mean estimate of explicit bias measure |
数据基本特征挖掘this
# Uncomment one of the following lines and run the cell:
df = pd.read_csv("redcard.csv.gz", compression='gzip')
df.shape
(146028, 28)
df.head()
复制代码
df.describe().T
复制代码
df.dtypesspa
playerShort object
player object
club object
leagueCountry object
birthday object
height float64
weight float64
position object
games int64
victories int64
ties int64
defeats int64
goals int64
yellowCards int64
yellowReds int64
redCards int64
photoID object
rater1 float64
rater2 float64
refNum int64
refCountry int64
Alpha_3 object
meanIAT float64
nIAT float64
seIAT float64
meanExp float64
nExp float64
seExp float64
dtype: object
复制代码
all_columns = df.columns.tolist()设计
all_columns
['playerShort',
'player',
'club',
'leagueCountry',
'birthday',
'height',
'weight',
'position',
'games',
'victories',
'ties',
'defeats',
'goals',
'yellowCards',
'yellowReds',
'redCards',
'photoID',
'rater1',
'rater2',
'refNum',
'refCountry',
'Alpha_3',
'meanIAT',
'nIAT',
'seIAT',
'meanExp',
'nExp',
'seExp']
复制代码
df['height'].mean()code
181.93593798236887
复制代码
df['height'].mean()cdn
181.93593798236887
复制代码
np.mean(df.groupby('playerShort').height.mean())blog
181.74372848007872
复制代码
Tidy Dataip
df2 = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
'key2':['one', 'two', 'one', 'two', 'one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})
复制代码
分组聚合ci
grouped = df2['data1'].groupby(df['key1'])
grouped.mean()
key1
a -0.093686
b -0.322711
Name: data1, dtype: float64
player_index = 'playerShort'
player_cols = [#'player', # drop player name, we have unique identifier
'birthday',
'height',
'weight',
'position',
'photoID',
'rater1',
'rater2',
]
all_cols_unique_players = df.groupby(' ').agg({col:'nunique' for col in player_cols})
all_cols_unique_players.head()
复制代码
all_cols_unique_players[all_cols_unique_players > 1].dropna().shape[0] == 0
True
复制代码
去重
def get_subgroup(dataframe, g_index, g_columns):
g = dataframe.groupby(g_index).agg({col:'nunique' for col in g_columns})
if g[g > 1].dropna().shape[0] != 0:
print("Warning: you probably assumed this had all unique values but it doesn't.")
return dataframe.groupby(g_index).agg({col:'max' for col in g_columns})
players = get_subgroup(df, player_index, player_cols)
players.head()
复制代码
未完待续