Analyzing and Predicting Employee Attrition with R

Date: 2019-11-05


About the author

糖甜甜甜, columnist for the R语言中文社区 (R Language Chinese Community)

WeChat public account: 经管人学数据分析

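After a stint of grunt work in the lab, we continue our Kaggle data analysis journey. This time the data is again a popular dataset I picked on Kaggle. It covers human resources, and the focus is on analyzing and predicting employee attrition. As before, we go through reading the data, preprocessing, EDA, and machine learning modeling, and finally use random forest, a popular ensemble learning algorithm, to predict attrition.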

Reading the data

setwd("E:/kaggle/human resource")
library(data.table)
library(plotly)
library(corrplot)
library(randomForest)
library(pROC)
library(tidyverse)
library(caret)

# fread() reads the CSV; as_tibble() replaces the deprecated as.tibble()
hr <- as_tibble(fread("HR_comma_sep.csv"))
glimpse(hr)
# Count the missing values in each column
sapply(hr, function(x) sum(is.na(x)))

Output:

Observations: 14,999
Variables: 10
$ satisfaction_level    <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.89, 0.42, 0.45, 0.11, 0.84, 0.41, 0.36, 0.38, 0.45, 0.78, 0.45, 0.76, 0.11, 0.3...
$ last_evaluation       <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 1.00, 0.53, 0.54, 0.81, 0.92, 0.55, 0.56, 0.54, 0.47, 0.99, 0.51, 0.89, 0.83, 0.5...
$ number_project        <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2, 2, 2, 2, 4, 2, 5, 6, 2, 6, 2, 2, 5, 4, 2, 2, 2, 6, 2, 2, 2, 4, 6, 2, 2, 6, 2, 5, 2, 2, ...
$ average_montly_hours  <int> 157, 262, 272, 223, 159, 153, 247, 259, 224, 142, 135, 305, 234, 148, 137, 143, 160, 255, 160, 262, 282, 147, 304, 139, 158, 242,...
$ time_spend_company    <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3, 3, 3, 3, 6, 3, 5, 4, 3, 4, 3, 3, 5, 5, 3, 3, 3, 4, 3, 3, 3, 6, 4, 3, 3, 4, 3, 5, 3, 3, ...
$ Work_accident         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ left                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ sales                 <chr> "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sales", "sa...
$ salary                <chr> "low", "medium", "medium", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low", "low...

   satisfaction_level       last_evaluation        number_project  average_montly_hours    time_spend_company         Work_accident                  left
                    0                     0                     0                     0                     0                     0                     0
promotion_last_5years                 sales                salary
                    0                     0                     0

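The dataset looks like this: 10 variables and 14,999 observations. The variables are:
satisfaction_level: satisfaction level
last_evaluation: most recent evaluation
number_project: number of projects worked on
average_montly_hours: average monthly working hours
time_spend_company: time spent at the company
Work_accident: work accident (0 = none, 1 = at least one)
left: whether the employee left
promotion_last_5years: whether the employee was promoted in the last five years
sales: job type
salary: salary level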

A quick check also shows no missing values, so we can move straight on to the analysis.

Data preprocessing

Based on the values each feature takes, we can convert several of them to factors, which makes it easier to compare categories later on.

# Convert the categorical columns to factors
hr$sales <- as.factor(hr$sales)
hr$salary <- as.factor(hr$salary)
hr$left <- as.factor(hr$left)
hr$Work_accident <- as.factor(hr$Work_accident)
# Relabel attrition: 1 -> "yes", 0 -> "no"
hr$left <- recode(hr$left, '1' = "yes", '0' = "no")
hr$promotion_last_5years <- as.factor(hr$promotion_last_5years)
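The conversion and relabeling can also be tried on a toy vector first. This is only a sketch: the values below are invented for illustration, and it uses base R's factor() with explicit levels and labels rather than the recode() call used above.

```r
# Toy 0/1 attrition flags (hypothetical values, not taken from the dataset)
left_raw <- c(1, 0, 0, 1, 0)

# factor() with explicit levels and labels converts and relabels in one step,
# and fixes the level order as "no" then "yes"
left <- factor(left_raw, levels = c(0, 1), labels = c("no", "yes"))

print(levels(left))
print(table(left))
```

Fixing the level order explicitly matters later, because functions such as predict(..., type = "prob") order their columns by factor level.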

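As you can see, most of the variables are numeric. We use correlation coefficients to measure how strongly the variables are related to each other: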

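# Drop the two text-valued categorical columns
cor.hr <- hr %>% select(-sales, -salary)
# Convert the 0/1 factors back to numbers so that cor() can use them
cor.hr$Work_accident <- as.numeric(as.character(cor.hr$Work_accident))
cor.hr$promotion_last_5years <- as.numeric(as.character(cor.hr$promotion_last_5years))
# left was recoded to "yes"/"no" above, so as.numeric(as.character(.)) would return NA;
# map it to 1/0 instead
cor.hr$left <- as.numeric(cor.hr$left == "yes")
corrplot(corr = cor(cor.hr), type = "lower", method = "square", title = "Variable correlations", order = "AOE")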

At a glance, whether an employee left is clearly associated with satisfaction level.
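To see how a yes/no factor enters a correlation, here is a minimal sketch on invented toy data (the numbers are not from the dataset). It assumes the attrition column carries the labels "yes" and "no", and maps it to 1/0 before calling cor().

```r
# Toy data, invented for illustration only
satisfaction <- c(0.38, 0.80, 0.11, 0.72, 0.90, 0.85, 0.10, 0.92)
left <- factor(c("yes", "no", "yes", "no", "no", "no", "yes", "no"))

# A factor with text labels cannot be converted with as.numeric(as.character(.));
# comparing against the label gives a clean 0/1 indicator instead
left_num <- as.integer(left == "yes")

# Pearson correlation between a numeric variable and a 0/1 indicator
r <- cor(satisfaction, left_num)
print(r)
```

In this toy sample the leavers all have low satisfaction, so the coefficient comes out negative.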

EDA

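# Headcount by job type (polar bar chart)
ggplot(hr, aes(x = sales, fill = sales)) +
  geom_bar(width = 1) +
  coord_polar(theta = "x") +
  ggtitle("Headcount by job type")

# Satisfaction by job type; the white point marks the mean
# (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
ggplot(hr, aes(x = sales, y = satisfaction_level, fill = sales)) +
  geom_boxplot() +
  ggtitle("Satisfaction by job type") +
  stat_summary(fun = mean, size = 3, color = "white", geom = "point") +
  theme(legend.position = "none")

# Satisfaction by job type, split by attrition
ggplot(hr, aes(x = sales, y = satisfaction_level, fill = left)) +
  geom_boxplot() +
  ggtitle("Satisfaction by job type and attrition")

# Monthly working hours by job type, split by attrition
ggplot(hr, aes(x = sales, y = average_montly_hours, fill = left)) +
  geom_boxplot() +
  ggtitle("Working hours by job type")

# Number of projects by job type, split by attrition
ggplot(hr, aes(x = sales, y = number_project, fill = left)) +
  geom_boxplot() +
  ggtitle("Number of projects by job type")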

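First, look at the headcount for each job type. There really are a lot of people in sales. Could many of them be my fellow life-science graduates? (Hahaha, just kidding, although honestly working in biology really is exhausting.)

Sales, support, and technical roles take the top three spots by headcount.

The distribution of satisfaction is broadly similar across job types. The accounting folks seem to give lower scores, though; for the other job types neither the mean nor the median differs noticeably. Next, let's look at satisfaction scores by job type, split by whether the employee left:

The result is almost exactly what you would expect: the scores separate leavers from stayers very clearly, and job type makes almost no difference.

What about average working hours by job type? Judging from the plot, the hours of employees who stayed are very stable, while the hours of those who left are polarized. It seems that being too busy and being too idle are both bad, which is a real challenge for HR.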
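One way to put a number on that polarization is to bin working hours and compute the attrition rate per bin. The sketch below uses invented toy data, and the bin boundaries (160 and 240 hours) are hypothetical cut points chosen only for illustration.

```r
# Toy data, invented for illustration: monthly hours and a 0/1 attrition flag
hours <- c(130, 140, 150, 160, 180, 200, 210, 220, 260, 280, 300, 310)
left  <- c(1,   1,   1,   0,   0,   0,   0,   0,   1,   1,   1,   1)

# Hypothetical bin boundaries; cut() uses right-closed intervals by default
bins <- cut(hours, breaks = c(0, 160, 240, Inf), labels = c("low", "medium", "high"))

# The mean of a 0/1 flag within each bin is the attrition rate for that bin
rate <- tapply(left, bins, mean)
print(rate)
```

A U-shaped pattern, with a lower rate in the middle bin than at either end, is what "too busy and too idle are both bad" would look like in a table.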

Next, let's look at how each feature in turn relates to attrition:

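# Density of satisfaction level, by attrition
ggplot(hr, aes(x = satisfaction_level, color = left)) +
  geom_line(stat = "density") +
  ggtitle("Satisfaction vs. attrition")

# Counts by salary level; geom_bar() is the idiomatic form of geom_histogram(stat = "count")
ggplot(hr, aes(x = salary, fill = left)) +
  geom_bar() +
  ggtitle("Salary vs. attrition")

# Counts by promotion status
ggplot(hr, aes(x = promotion_last_5years, fill = left)) +
  geom_bar() +
  ggtitle("Promotion in the last 5 years vs. attrition")

# Number of employees at each evaluation score, by attrition
ggplot(hr, aes(x = last_evaluation, color = left)) +
  geom_point(stat = "count") +
  ggtitle("Last evaluation vs. attrition")

# Work accidents by job type
ggplot(hr, aes(x = sales, fill = Work_accident)) +
  geom_bar() +
  coord_flip() +
  theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank()) +
  scale_fill_discrete(labels = c("no accident", "at least once"))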

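Scores from employees who stayed are consistently stable, while scores from those who left are harder to pin down.

As the saying goes: "with money, everything is easier."

If you don't give me a promotion, I'll get upset and quit.

This looks much like the density plot earlier. HR should also keep an eye on those with a very high last evaluation: most of them are not planning to leave, but some will tell a "white lie" to save face for their old employer.

Accidents are unavoidable, and the number of people with an accident is roughly proportional to headcount, so this is not a factor in attrition.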

Model building and evaluation

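# Arbitrary seed, added so the split can be repeated; the results shown below
# came from an unseeded run, so the numbers will differ slightly
set.seed(2019)
# Randomly assign each row to train (about 70%) or test (about 30%)
index <- sample(2, nrow(hr), replace = TRUE, prob = c(0.7, 0.3))
train <- hr[index == 1, ]
test <- hr[index == 2, ]

# Random forest on all features
model <- randomForest(left ~ ., data = train)
predict.hr <- predict(model, test)

# caret's signature is confusionMatrix(data = predicted, reference = actual).
# This call passes the actual labels first, so in the printed matrix the rows
# labeled "Prediction" are the actual classes and the columns labeled
# "Reference" are the model's predictions.
confusionMatrix(test$left, predict.hr)

# Class probabilities for the ROC curve; column 2 is the "yes" class
prob.hr <- predict(model, test, type = "prob")
roc.hr <- roc(test$left, prob.hr[, 2], levels = levels(test$left))
plot(roc.hr, type = "S", col = "red", main = paste("AUC=", roc.hr$auc, sep = ""))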

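After the feature analysis above, I did not see any particularly good features to engineer, so I simply threw everything into the algorithm. The resulting confusion matrix looks excellent: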
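The split used here assigns each row independently, so the class proportions in the train and test sets can drift slightly. As an alternative, a stratified 70/30 split can be sketched in base R. The label vector below is invented toy data and the seed is arbitrary; this is not the method used in this post.

```r
set.seed(1)  # arbitrary seed, for repeatability of the sketch only

# Toy labels, invented for illustration: 76 stayers and 24 leavers
left <- factor(rep(c("no", "yes"), times = c(76, 24)))

# Sample 70% of the row indices within each class separately
train_idx <- unlist(
  lapply(split(seq_along(left), left), function(idx) {
    sample(idx, size = floor(0.7 * length(idx)))
  }),
  use.names = FALSE
)

train_left <- left[train_idx]
test_left <- left[-train_idx]

print(table(train_left))
print(table(test_left))
```

Sampling within each class keeps the share of leavers nearly identical in both sets, which matters more as the classes get more imbalanced.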

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  3429    5
       yes   28 1010

               Accuracy : 0.9926
                 95% CI : (0.9897, 0.9949)
    No Information Rate : 0.773
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9791
 Mcnemar's Test P-Value : 0.0001283

            Sensitivity : 0.9919
            Specificity : 0.9951
         Pos Pred Value : 0.9985
         Neg Pred Value : 0.9730
             Prevalence : 0.7730
         Detection Rate : 0.7668
   Detection Prevalence : 0.7679
      Balanced Accuracy : 0.9935

       'Positive' Class : no

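Accuracy is 0.9926. Because test$left was passed as the first argument to confusionMatrix, the rows of the matrix are the actual classes, so for the "yes" (left) class recall = 1010/1038 = 0.9730 and precision = 1010/1015 = 0.9951. These are almost unbelievably good numbers. It seems the Kaggle dataset has already been cleaned very well, and random forest delivers as reliably as ever. Finally, here is the ROC curve.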
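The headline metrics can be recomputed by hand from the four cells of the matrix above. This sketch treats "yes" (left) as the class of interest and reads the rows as actual classes, since test$left was the first argument in the confusionMatrix call; the variable names are my own.

```r
# Cell counts from the confusion matrix above, with "yes" as the class of interest
tp <- 1010  # actual yes, predicted yes
fn <- 28    # actual yes, predicted no
fp <- 5     # actual no,  predicted yes
tn <- 3429  # actual no,  predicted no

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
recall    <- tp / (tp + fn)  # share of actual leavers that the model caught
precision <- tp / (tp + fp)  # share of predicted leavers that really left

print(round(c(accuracy = accuracy, recall = recall, precision = precision), 4))
# accuracy 0.9926, recall 0.9730, precision 0.9951
```

These agree with the Accuracy, Neg Pred Value, and Specificity lines that caret printed for this argument order.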

Closing thoughts

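There was honestly not much technique in this analysis. My ggplot2 skills have hit a plateau and need steady work, and knowing only how to call packages without understanding the theory behind the algorithms is not acceptable either, so I have recently been slowly picking up probability theory, linear algebra, and statistics again. Of course, the R data analysis practice will not stop. My English is decent, and I can "chat up" the foreign professors in the lab a bit, which counts as real progress.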

The road is long and hard; let's keep each other going~~~
