Predicting Telecom Customer Churn in R with a k-Nearest-Neighbors (KNN) Model

Date: 2020-08-06

Original article: http://tecdat.cn/?p=5521

Data background

A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.

The data set is Churn. The fields are as follows:

state: discrete
account length: continuous
area code: continuous
phone number: discrete
international plan: discrete
voice mail plan: discrete
number vmail messages: continuous
total day minutes: continuous
total day calls: continuous
total day charge: continuous
total eve minutes: continuous
total eve calls: continuous
total eve charge: continuous
total night minutes: continuous
total night calls: continuous
total night charge: continuous
total intl minutes: continuous
total intl calls: continuous
total intl charge: continuous
number customer service calls: continuous
churn: discrete

Data Preparation and Exploration

View a summary of the data:
## state account.length area.code phone.number
## WV : 158 Min. : 1.0 Min. :408.0 327-1058: 1
## MN : 125 1st Qu.: 73.0 1st Qu.:408.0 327-1319: 1
## AL : 124 Median :100.0 Median :415.0 327-2040: 1
## ID : 119 Mean :100.3 Mean :436.9 327-2475: 1
## VA : 118 3rd Qu.:127.0 3rd Qu.:415.0 327-3053: 1
## OH : 116 Max. :243.0 Max. :510.0 327-3587: 1
## (Other):4240 (Other) :4994
## international.plan voice.mail.plan number.vmail.messages
## no :4527 no :3677 Min. : 0.000
## yes: 473 yes:1323 1st Qu.: 0.000
## Median : 0.000
## Mean : 7.755
## 3rd Qu.:17.000
## Max. :52.000
## total.day.minutes total.day.calls total.day.charge total.eve.minutes
## Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0
## 1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4
## Median :180.1 Median :100 Median :30.62 Median :201.0
## Mean :180.3 Mean :100 Mean :30.65 Mean :200.6
## 3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1
## Max. :351.5 Max. :165 Max. :59.76 Max. :363.7
## total.eve.calls total.eve.charge total.night.minutes total.night.calls
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 87.0 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00
## Median :100.0 Median :17.09 Median :200.4 Median :100.00
## Mean :100.2 Mean :17.05 Mean :200.4 Mean : 99.92
## 3rd Qu.:114.0 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00
## Max. :170.0 Max. :30.91 Max. :395.0 Max. :175.00
## total.night.charge total.intl.minutes total.intl.calls total.intl.charge
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 7.510 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median : 9.020 Median :10.30 Median : 4.000 Median :2.780
## Mean : 9.018 Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:10.560 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :17.770 Max. :20.00 Max. :20.000 Max. :5.400
## number.customer.service.calls churn
## Min. :0.00 False.:4293
## 1st Qu.:1.00 True. : 707
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00

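The summary shows that there are no missing values. It also shows that the phone number and area code carry no predictive value, so both can be dropped.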
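The two checks described above (no missing values, then dropping the identifier-like columns) can be sketched as follows. This is a minimal illustration on a made-up four-row data frame; the name `churn_df` and all of its values are assumptions, and only the column names come from the summary output.

```r
# Toy stand-in for the churn data; values are made up for illustration only.
churn_df <- data.frame(
  state          = c("WV", "MN", "AL", "OH"),
  account.length = c(128, 107, 137, 84),
  area.code      = c(415, 415, 408, 510),
  phone.number   = c("382-4657", "371-7191", "358-1921", "375-9999"),
  churn          = c("False.", "False.", "True.", "False."),
  stringsAsFactors = TRUE
)

# Count missing values per column; all zeros means nothing needs imputing.
na_counts <- colSums(is.na(churn_df))
print(na_counts)

# Drop the two columns the text identifies as uninformative.
drop_cols <- c("phone.number", "area.code")
churn_clean <- churn_df[, !(names(churn_df) %in% drop_cols)]
print(names(churn_clean))
```

On the real data the same two statements would be applied to the full 5000-row data frame.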

Examine the variables graphically

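The results above show that samples with churn = no far outnumber those with churn = yes, so non-churners make up the majority of the sample (4293 of 5000, about 86%).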

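The results above also show that, apart from the voice mail message count (written as emailcode in the original, presumably number.vmail.messages) and area.code, the numeric variables are approximately normally distributed.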

## account.length area.code number.vmail.messages total.day.minutes
## Min. : 1.0 Min. :408.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 73.0 1st Qu.:408.0 1st Qu.: 0.000 1st Qu.:143.7
## Median :100.0 Median :415.0 Median : 0.000 Median :180.1
## Mean :100.3 Mean :436.9 Mean : 7.755 Mean :180.3
## 3rd Qu.:127.0 3rd Qu.:415.0 3rd Qu.:17.000 3rd Qu.:216.2
## Max. :243.0 Max. :510.0 Max. :52.000 Max. :351.5
## total.day.calls total.day.charge total.eve.minutes total.eve.calls
## Min. : 0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0
## Median :100 Median :30.62 Median :201.0 Median :100.0
## Mean :100 Mean :30.65 Mean :200.6 Mean :100.2
## 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0
## Max. :165 Max. :59.76 Max. :363.7 Max. :170.0
## total.eve.charge total.night.minutes total.night.calls total.night.charge
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510
## Median :17.09 Median :200.4 Median :100.00 Median : 9.020
## Mean :17.05 Mean :200.4 Mean : 99.92 Mean : 9.018
## 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560
## Max. :30.91 Max. :395.0 Max. :175.00 Max. :17.770
## total.intl.minutes total.intl.calls total.intl.charge
## Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median :10.30 Median : 4.000 Median :2.780
## Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :20.00 Max. :20.000 Max. :5.400
## number.customer.service.calls
## Min. :0.00
## 1st Qu.:1.00
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00

Relationships between variables

The results show a significant positive linear relationship between the two variables.

Using the statistics node, report

## account.length area.code
## account.length 1.0000000000 -0.018054187
## area.code -0.0180541874 1.000000000
## number.vmail.messages -0.0145746663 -0.003398983
## total.day.minutes -0.0010174908 -0.019118245
## total.day.calls 0.0282402279 -0.019313854
## total.day.charge -0.0010191980 -0.019119256
## total.eve.minutes -0.0095913331 0.007097877
## total.eve.calls 0.0091425790 -0.012299947
## total.eve.charge -0.0095873958 0.007114130
## total.night.minutes 0.0006679112 0.002083626
## total.night.calls -0.0078254785 0.014656846
## total.night.charge 0.0006558937 0.002070264
## total.intl.minutes 0.0012908394 -0.004153729
## total.intl.calls 0.0142772733 -0.013623309
## total.intl.charge 0.0012918112 -0.004219099
## number.customer.service.calls -0.0014447918 0.020920513
## number.vmail.messages total.day.minutes
## account.length -0.0145746663 -0.001017491
## area.code -0.0033989831 -0.019118245
## number.vmail.messages 1.0000000000 0.005381376
## total.day.minutes 0.0053813760 1.000000000
## total.day.calls 0.0008831280 0.001935149
## total.day.charge 0.0053767959 0.999999951
## total.eve.minutes 0.0194901208 -0.010750427
## total.eve.calls -0.0039543728 0.008128130
## total.eve.charge 0.0194959757 -0.010760022
## total.night.minutes 0.0055413838 0.011798660
## total.night.calls 0.0026762202 0.004236100
## total.night.charge 0.0055349281 0.011782533
## total.intl.minutes 0.0024627018 -0.019485746
## total.intl.calls 0.0001243302 -0.001303123
## total.intl.charge 0.0025051773 -0.019414797
## number.customer.service.calls -0.0070856427 0.002732576
## total.day.calls total.day.charge
## account.length 0.0282402279 -0.001019198
## area.code -0.0193138545 -0.019119256
## number.vmail.messages 0.0008831280 0.005376796
## total.day.minutes 0.0019351487 0.999999951
## total.day.calls 1.0000000000 0.001935884
## total.day.charge 0.0019358844 1.000000000
## total.eve.minutes -0.0006994115 -0.010747297
## total.eve.calls 0.0037541787 0.008129319
## total.eve.charge -0.0006952217 -0.010756893
## total.night.minutes 0.0028044650 0.011801434
## total.night.calls -0.0083083467 0.004234934
## total.night.charge 0.0028018169 0.011785301
## total.intl.minutes 0.0130972198 -0.019489700
## total.intl.calls 0.0108928533 -0.001306635
## total.intl.charge 0.0131613976 -0.019418755
## number.customer.service.calls -0.0107394951 0.002726370
## total.eve.minutes total.eve.calls
## account.length -0.0095913331 0.009142579
## area.code 0.0070978766 -0.012299947
## number.vmail.messages 0.0194901208 -0.003954373
## total.day.minutes -0.0107504274 0.008128130
## total.day.calls -0.0006994115 0.003754179
## total.day.charge -0.0107472968 0.008129319
## total.eve.minutes 1.0000000000 0.002763019
## total.eve.calls 0.0027630194 1.000000000
## total.eve.charge 0.9999997749 0.002778097
## total.night.minutes -0.0166391160 0.001781411
## total.night.calls 0.0134202163 -0.013682341
## total.night.charge -0.0166420421 0.001799380
## total.intl.minutes 0.0001365487 -0.007458458
## total.intl.calls 0.0083881559 0.005574500
## total.intl.charge 0.0001593155 -0.007507151
## number.customer.service.calls -0.0138234228 0.006234831
## total.eve.charge total.night.minutes
## account.length -0.0095873958 0.0006679112
## area.code 0.0071141298 0.0020836263
## number.vmail.messages 0.0194959757 0.0055413838
## total.day.minutes -0.0107600217 0.0117986600
## total.day.calls -0.0006952217 0.0028044650
## total.day.charge -0.0107568931 0.0118014339
## total.eve.minutes 0.9999997749 -0.0166391160
## total.eve.calls 0.0027780971 0.0017814106
## total.eve.charge 1.0000000000 -0.0166489191
## total.night.minutes -0.0166489191 1.0000000000
## total.night.calls 0.0134220174 0.0269718182
## total.night.charge -0.0166518367 0.9999992072
## total.intl.minutes 0.0001320238 -0.0067209669
## total.intl.calls 0.0083930603 -0.0172140162
## total.intl.charge 0.0001547783 -0.0066545873
## number.customer.service.calls -0.0138363623 -0.0085325365

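Keeping highly correlated variables in the model can cause multicollinearity, so one variable from each highly correlated pair needs to be removed.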
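One way to find and remove such variables is to scan the upper triangle of the correlation matrix for pairs above a cutoff. The sketch below uses synthetic data in which the charge is a fixed multiple of the minutes, mirroring the near-1 correlation between total.day.minutes and total.day.charge in the matrix above. The rate of 0.17, the cutoff of 0.9, and the names `toy` and `high_pairs` are assumptions for illustration, not the original author's code.

```r
set.seed(1)
n <- 200
# Synthetic data: the charge is a fixed rate times the minutes.
minutes <- runif(n, 0, 350)
toy <- data.frame(
  total.day.minutes = minutes,
  total.day.charge  = round(0.17 * minutes, 2),  # 0.17 is an assumed rate
  total.day.calls   = rpois(n, 100)
)

cor_mat <- cor(toy)

# List each pair once (upper triangle) whose |r| exceeds the cutoff.
threshold <- 0.9  # assumed cutoff
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)

# Keep the first variable of each pair and drop the second.
toy_reduced <- toy[, !(names(toy) %in% high_pairs$var2)]
print(names(toy_reduced))
```

Which member of a pair to drop is a judgment call; here the charge is dropped because it is derived from the minutes.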

Data Manipulation

The results show some correlation between total.day.calls and total.day.charge.

In particular, among customers with voice.mail.plan = no, the variables are negatively correlated.

Discretize (make categorical) a relevant numeric variable

Discretize the variable.
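A numeric variable can be discretized in R with `cut()`. The regression output later in this article contains terms named total.day.minutes1medium and total.day.minutes1short, which suggests that total.day.minutes was binned into a factor called total.day.minutes1. The sketch below is written under that assumption: the cut points (roughly the first and third quartiles from the summary), the third label "long", and the toy churn values are all made up for illustration.

```r
# Toy values for total.day.minutes; the real column has 5000 rows.
total.day.minutes <- c(0, 95.2, 143.7, 180.1, 216.2, 260.4, 351.5)
# Made-up churn labels, used only to show the overlay table.
churn <- c("False.", "False.", "False.", "False.", "False.", "True.", "True.")

# Assumed cut points: roughly the 1st and 3rd quartiles from the summary.
breaks <- c(-Inf, 143.7, 216.2, Inf)
total.day.minutes1 <- cut(
  total.day.minutes,
  breaks = breaks,
  labels = c("short", "medium", "long"),
  right  = TRUE  # intervals are (lower, upper]
)

# Distribution of the binned variable, then the same counts split by churn.
bin_counts <- table(total.day.minutes1)
print(bin_counts)
overlay <- table(total.day.minutes1, churn)
print(overlay)
```

The `overlay` table is the numeric counterpart of a bar chart with a churn overlay: each row is a bin and each column a churn class.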

construct a distribution of the variable with a churn overlay

construct a histogram of the variable with a churn overlay

Find a pair of numeric variables which are interesting with respect to churn.

The results show some correlation between total.day.calls and total.day.charge.

Model Building

In particular, the variables are correlated among customers with churn = no.

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3082150 0.0735760 4.189 2.85e-05 ***
## stateAL 0.0151188 0.0462343 0.327 0.743680
## stateAR 0.0894792 0.0490897 1.823 0.068399 .
## stateAZ 0.0329566 0.0494195 0.667 0.504883
## stateCA 0.1951511 0.0567439 3.439 0.000588 ***
## international.plan yes 0.3059341 0.0151677 20.170 < 2e-16 ***
## voice.mail.plan yes -0.1375056 0.0337533 -4.074 4.70e-05 ***
## number.vmail.messages 0.0017068 0.0010988 1.553 0.120402
## total.day.minutes 0.3796323 0.2629027 1.444 0.148802
## total.day.calls 0.0002191 0.0002235 0.981 0.326781
## total.day.charge -2.2207671 1.5464583 -1.436 0.151056
## total.eve.minutes 0.0288233 0.1307496 0.220 0.825533
## total.eve.calls -0.0001585 0.0002238 -0.708 0.478915
## total.eve.charge -0.3316041 1.5382391 -0.216 0.829329
## total.night.minutes 0.0083224 0.0695916 0.120 0.904814
## total.night.calls -0.0001824 0.0002225 -0.820 0.412290
## total.night.charge -0.1760782 1.5464674 -0.114 0.909355
## total.intl.minutes -0.0104679 0.4192270 -0.025 0.980080
## total.intl.calls -0.0063448 0.0018062 -3.513 0.000447 ***
## total.intl.charge 0.0676460 1.5528267 0.044 0.965254
## number.customer.service.calls 0.0566474 0.0033945 16.688 < 2e-16 ***
## total.day.minutes1medium 0.0502681 0.0160228 3.137 0.001715 **
## total.day.minutes1short 0.2404020 0.0322293 7.459 1.02e-13 ***

The results show that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have a significant effect.
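The coefficient table above reports t values, which suggests it came from an ordinary linear model with churn coded numerically. The original fitting code is not shown, so the sketch below is only an assumption about its general shape: it fits `lm()` to synthetic data with two of the predictors and lists the terms with a p-value below an assumed 0.05 level. The effect sizes used to generate the data are made up.

```r
set.seed(42)
n <- 500
toy <- data.frame(
  international.plan = factor(
    sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1))
  ),
  number.customer.service.calls = rpois(n, 1.5)
)
# Synthetic churn probability rising with both predictors (made-up effects).
p <- 0.05 + 0.30 * (toy$international.plan == "yes") +
  0.05 * toy$number.customer.service.calls
toy$churn <- rbinom(n, 1, pmin(p, 0.95))

# Linear probability model: churn coded 0/1 regressed on the predictors.
fit <- lm(churn ~ international.plan + number.customer.service.calls,
          data = toy)
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))

# Terms whose p-value falls below the assumed 0.05 level.
sig <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
print(sig)
```

For a binary outcome, `glm(..., family = binomial)` is the more common choice; it reports z values instead of t values.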

Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn

## Direction.2005
## knn.pred 1 2
## 1 760 97
## 2 100 43
[1] 0.803

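A confusion matrix is a tabular summary used especially in supervised learning; in unsupervised learning it is usually called a matching matrix. One axis holds the predicted classes and the other the actual classes. In the tables shown here, each row is a predicted class (knn.pred) and each column is an actual class.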
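The accuracy printed under each table is the sum of the diagonal divided by the total count. The sketch below re-enters the first confusion matrix shown above by hand and reproduces that figure; it also computes the share of each actual class that was predicted correctly, which accuracy alone hides when the classes are imbalanced. The variable names are illustrative.

```r
# First confusion matrix from the output above:
# rows = predicted class (knn.pred), columns = actual class.
cm <- matrix(
  c(760, 100, 97, 43),  # filled column by column
  nrow = 2,
  dimnames = list(knn.pred = c("1", "2"), actual = c("1", "2"))
)
print(cm)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

# Share of each actual class that was predicted correctly.
per_class <- diag(cm) / colSums(cm)
print(round(per_class, 3))
```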

## Direction.2005
## knn.pred 1 2
## 1 827 104
## 2 33 36
[1] 0.863

On the test set, the accuracy reaches about 86%.
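The original KNN code is not shown, so the following is a minimal sketch of how such a model could be fitted with `knn()` from the `class` package. Everything about the data is an assumption: the two predictors are synthetic, the split is 75/25, and k = 5 is an arbitrary choice. The predictors are standardized with training-set statistics because KNN relies on distances and would otherwise be dominated by the variable with the largest scale.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(123)
n <- 400
# Synthetic two-class data standing in for the churn predictors.
x <- data.frame(
  total.day.minutes = c(rnorm(n / 2, 170, 40), rnorm(n / 2, 230, 40)),
  number.customer.service.calls = c(rpois(n / 2, 1), rpois(n / 2, 3))
)
y <- factor(rep(c("no", "yes"), each = n / 2))

# Hold out 25% of the rows as a test set.
test_idx <- sample(n, n / 4)

# Standardize both sets using the training-set mean and standard deviation.
mu  <- colMeans(x[-test_idx, ])
sdv <- apply(x[-test_idx, ], 2, sd)
train_x <- scale(x[-test_idx, ], center = mu, scale = sdv)
test_x  <- scale(x[test_idx, ],  center = mu, scale = sdv)

# k = 5 is an assumed value; in practice it would be tuned.
knn.pred <- knn(train = train_x, test = test_x, cl = y[-test_idx], k = 5)

# Confusion matrix (rows = predicted, columns = actual) and accuracy.
cm <- table(knn.pred, actual = y[test_idx])
print(cm)
accuracy <- mean(knn.pred == y[test_idx])
print(accuracy)
```

Passing the training predictors as the `test` argument gives the training-set confusion matrix in the same way.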

Findings

We find some correlation between total.day.calls and total.day.charge, particularly among customers with churn = no. The regression shows that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have a significant effect. Finally, the KNN model reaches about 80% accuracy on the training set and about 86% on the test set, which suggests that the model predicts reasonably well.
