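Translated from: http://news.csdn.net/article_preview.html?preview=1&reload=1&arcid=2825492html
Abstract: This article explains regression analysis and its benefits, summarizes the seven most commonly used regression techniques (linear, logistic, polynomial, stepwise, ridge, lasso, and ElasticNet regression) together with their key points, and closes with the key factors for choosing the right regression model.
[Editor's note] Regression analysis is an important tool for modelling and analyzing data. This article explains what regression analysis is and what it offers, summarizes the seven most commonly used regression techniques (linear, logistic, polynomial, stepwise, ridge, lasso, and ElasticNet regression) together with their key points, and closes with the key factors for choosing the right regression model.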
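Regression analysis is a predictive modelling technique that studies the relationship between a dependent variable (the target) and one or more independent variables (the predictors). It is commonly used for forecasting, time series modelling, and finding causal relationships between variables. For example, the relationship between a driver's rash driving and the number of road accidents is best studied through regression.
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve or line to the data points so that the distances from the data points to the curve or line are minimized. I will explain this in more detail in the sections that follow.
As mentioned above, regression analysis estimates the relationship between two or more variables. Let's understand it with a simple example:
Say you want to estimate a company's sales growth under current economic conditions. You have recent company data showing that sales growth is about 2.5 times economic growth. Using regression analysis, we can then predict the company's future sales from current and past information.
Regression analysis has many benefits, as follows:
Regression analysis also lets us compare the effects of variables measured on different scales, such as the effect of price changes and of the number of promotional activities. This helps market researchers, data analysts, and data scientists eliminate variables and evaluate the best set to use for building predictive models.
There are many kinds of regression techniques for making predictions. They are mostly driven by three metrics: the number of independent variables, the type of dependent variable, and the shape of the regression line. We will discuss them in detail in the sections below.
For the creative ones, you can even cook up a regression model nobody has used before, if you feel the need to combine the parameters above. But before you start, get to know the most commonly used regression methods: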
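It is one of the most widely known modelling techniques. Linear regression is usually among the first techniques people pick when learning predictive modelling. In this technique, the dependent variable is continuous, the independent variables can be continuous or discrete, and the regression line is linear.
Linear regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).
It is represented by the equation Y=a+b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable from the given predictor variable(s).
The difference between simple and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. Now the question is: "How do we obtain the best-fit line?"
How do we obtain the best-fit line (the values of a and b)?
This task is easily accomplished with the least squares method, the most common method for fitting a regression line. For the observed data, it computes the best-fit line by minimizing the sum of the squared vertical deviations from each data point to the line. Because the deviations are squared before being added, positive and negative values do not cancel out.
We can evaluate model performance using the R-square metric. For more details about these metrics, you can read: Model Performance Metrics Part 1, Part 2.
Key points: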
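Logistic regression is used to compute the probability of event=Success and event=Failure. We should use logistic regression when the dependent variable is binary (1/0, True/False, Yes/No). Here the value of Y ranges from 0 to 1, and it can be represented by the following equations.
odds = p/(1-p) = probability of event occurrence / probability of not event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk
Above, p is the probability that the characteristic of interest is present. A question you should ask here is: "Why do we use the log in the equation?"
Since the dependent variable here follows a binomial distribution, we need to choose the link function best suited to that distribution, and that is the logit function. In the equation above, the parameters are chosen to maximize the likelihood of observing the sample values, rather than to minimize the sum of squared errors (as in ordinary regression).
Key points: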
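A regression equation is a polynomial regression equation if the power of an independent variable is greater than 1, as in the equation below:
y=a+b*x^2
In this regression technique, the best-fit line is not a straight line. It is instead a curve that fits the data points.
Key points: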
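This form of regression is used when we deal with multiple independent variables. In this technique, the independent variables are selected by an automatic process, with no human intervention.
This is achieved by observing statistical values such as R-square, t-stats, and the AIC metric to identify the significant variables. Stepwise regression fits the model by adding or dropping covariates one at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:
The aim of this modelling technique is to maximize predictive power with the minimum number of predictor variables. It is also one of the methods for handling high-dimensional data sets.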
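Ridge regression is a technique used when the data suffer from multicollinearity (highly correlated independent variables). Under multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, which can push the estimates far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
Above, we saw the equation for linear regression. Remember? It can be represented as:
y=a+ b*x
This equation also has an error term. The complete equation is:
y=a+b*x+e (error term), [error term is the value needed to correct for a prediction error between the observed and predicted value]
=> y = a + b1x1 + b2x2 + .... + e, for multiple independent variables.
In a linear equation, the prediction error can be decomposed into two components: one due to bias and one due to variance. Prediction error can arise from either of them or from both. Here we discuss the error caused by variance.
Ridge regression addresses multicollinearity through the shrinkage parameter λ (lambda). See the formula below.
This formula has two components. The first is the least squares term, and the second is λ times the sum of β2 (β squared), where β is the coefficient. It is added to the least squares term to shrink the parameters and obtain a very low variance.
Key points: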
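Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it can reduce variability and improve the accuracy of linear regression models. See the formula below:
Lasso regression differs from ridge regression in that its penalty function uses absolute values instead of squares. Penalizing (or, equivalently, constraining) the sum of the absolute values of the estimates causes some of the parameter estimates to turn out exactly zero. The larger the penalty applied, the further the estimates are shrunk towards zero. This results in variable selection out of the given n variables.
Key points:
· If a group of predictors is highly correlated, Lasso picks only one of them and shrinks the others to zero.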
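ElasticNet is a hybrid of the Lasso and Ridge regression techniques. It is trained with both L1 and L2 priors as regularizers. ElasticNet is useful when there are multiple correlated features: Lasso is likely to pick one of them at random, while ElasticNet is likely to pick both.
A practical advantage of trading off between Lasso and Ridge is that it allows ElasticNet to inherit some of Ridge's stability under rotation.
Key points: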
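Beyond these 7 most commonly used regression techniques, you can also look at other models such as Bayesian, Ecological, and Robust regression.
Life is usually simple when you know only one or two techniques. One training institute I know of tells its students: if the outcome is continuous, use linear regression; if it is binary, use logistic regression! However, the more options we have at our disposal, the harder it becomes to choose the right one. The same thing happens with regression models.
Among the many types of regression models, it is important to choose the best-suited technique based on the types of the independent and dependent variables, the dimensionality of the data, and other essential characteristics of the data. Below are the key factors for selecting the right regression model: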
/*******************************************************************************************/
Linear and logistic regression are usually the first algorithms people learn in predictive modelling. Because of their popularity, many analysts even end up thinking that they are the only forms of regression. Those who are slightly more involved think that they are the most important among all forms of regression analysis.
The truth is that there are innumerable forms of regression that can be performed. Each form has its own importance and a specific condition where it is best suited. In this article, I have explained the seven most commonly used forms of regression in a simple manner. Through this article, I also hope that people develop an idea of the breadth of regression, instead of just applying linear / logistic regression to every problem they come across and hoping that it will just fit!
Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. This technique is used for forecasting, time series modelling and finding the causal effect relationship between the variables. For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression.
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve / line to the data points in such a manner that the distances of the data points from the curve or line are minimized. I'll explain this in more detail in the coming sections.
As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s understand this with an easy example:
Let’s say, you want to estimate growth in sales of a company based on current economic conditions. You have the recent company data which indicates that the growth in sales is around two and a half times the growth in the economy. Using this insight, we can predict future sales of the company based on current & past information.
There are multiple benefits of using regression analysis. They are as follows:
Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities. These benefits help market researchers / data analysts / data scientists to eliminate and evaluate the best set of variables to be used for building predictive models.
There are various kinds of regression techniques available to make predictions. These techniques are mostly driven by three metrics (number of independent variables, type of dependent variables and shape of regression line). We’ll discuss them in detail in the following sections.
For the creative ones, you can even cook up new regressions, if you feel the need to use a combination of the parameters above, which people haven’t used before. But before you start that, let us understand the most commonly used regressions:
It is one of the most widely known modelling techniques. Linear regression is usually among the first few topics which people pick while learning predictive modelling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.
Linear regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).
It is represented by the equation Y=a+b*X + e, where a is the intercept, b is the slope of the line and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s).
The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. Now, the question is "How do we obtain the best-fit line?"
This task can be easily accomplished by Least Square Method. It is the most common method used for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are first squared, when added, there is no cancelling out between positive and negative values.
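To make the least squares idea concrete, here is a minimal sketch for the one-variable case, using the standard closed-form solution for the slope and intercept. The function name `fit_line` and the data points are made up for illustration and are not from the original article.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize sum((y - (a + b*x))**2) over the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b

# Hypothetical observations that lie roughly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} * X")
```

Because the deviations are squared, a point 0.2 above the line and a point 0.2 below it both add 0.04 to the total instead of cancelling.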
We can evaluate the model performance using the R-square metric. To know more about these metrics, you can read: Model Performance Metrics Part 1, Part 2.
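As a small illustration of R-square, the sketch below computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the numbers are assumptions for the example only.

```python
def r_squared(ys, preds):
    """R-square = 1 - SS_res / SS_tot for observed ys and predictions preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

ys = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]    # predictions that match every observation
mean_only = [6.0, 6.0, 6.0, 6.0]  # always predicting the mean of ys

print(r_squared(ys, perfect))    # a perfect fit scores 1
print(r_squared(ys, mean_only))  # predicting the mean scores 0
```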
Logistic regression is used to find the probability of event=Success and event=Failure. We should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No) in nature. Here the value of Y ranges from 0 to 1 and it can be represented by the following equations.
odds = p/(1-p) = probability of event occurrence / probability of not event occurrence
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0+b1X1+b2X2+b3X3....+bkXk
Above, p is the probability of presence of the characteristic of interest. A question that you should ask here is “why have we used log in the equation?”.
Since the dependent variable here follows a binomial distribution, we need to choose a link function which is best suited for this distribution, and that is the logit function. In the equation above, the parameters are chosen to maximize the likelihood of observing the sample values rather than minimizing the sum of squared errors (like in ordinary regression).
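The sketch below illustrates both ideas under some assumptions of mine: the logit link and its inverse (the sigmoid), and a fit that climbs the log-likelihood with plain gradient ascent instead of minimizing squared errors. The data, learning rate and step count are hypothetical, and real libraries typically use more refined optimizers.

```python
import numpy as np

def sigmoid(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds of a probability p."""
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Maximize the Bernoulli log-likelihood by gradient ascent."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ beta)
        # Gradient of the average log-likelihood is X^T (y - p) / n.
        beta += lr * Xb.T @ (y - p) / len(y)
    return beta

# Hypothetical binary outcomes that become more likely as x grows.
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)

beta = fit_logistic(X, y)
p_low = sigmoid(beta[0] + beta[1] * 1.0)   # predicted probability at x = 1
p_high = sigmoid(beta[0] + beta[1] * 4.0)  # predicted probability at x = 4
print(beta, p_low, p_high)
```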
A regression equation is a polynomial regression equation if the power of an independent variable is more than 1. The equation below represents a polynomial equation:
y=a+b*x^2
In this regression technique, the best fit line is not a straight line. It is rather a curve that fits into the data points.
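As a quick sketch, NumPy's `polyfit` can fit such a curve by least squares. The data below are generated from a made-up quadratic without noise, so the fit recovers the coefficients, while a straight line fitted to the same points leaves a large error.

```python
import numpy as np

# Hypothetical noise-free data from y = 1 + 2x + 3x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 + 2.0 * x + 3.0 * x**2

quad = np.polyfit(x, y, deg=2)  # coefficients, highest power first
line = np.polyfit(x, y, deg=1)  # straight line fitted for comparison

def sse(coeffs):
    """Sum of squared errors of a fitted polynomial on the data."""
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

print(quad)
print(sse(line), sse(quad))
```

Raising the degree always lowers the training error, so it is worth plotting the fitted curve to check that it follows the data rather than chasing noise.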
This form of regression is used when we deal with multiple independent variables. In this technique, the selection of independent variables is done with the help of an automatic process, which involves no human intervention.
This feat is achieved by observing statistical values like R-square, t-stats and AIC metric to discern significant variables. Stepwise regression basically fits the regression model by adding/dropping co-variates one at a time based on a specified criterion. Some of the most commonly used Stepwise regression methods are listed below:
The aim of this modelling technique is to maximize the prediction power with the minimum number of predictor variables. It is one of the methods for handling higher dimensionality in a data set.
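One way to picture the automatic process is a forward-selection loop: start with no predictors and repeatedly add the one that improves a criterion the most, stopping when nothing helps. The sketch below uses AIC for a Gaussian linear model as the criterion. The function names and the synthetic data are my own assumptions, not the article's.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on an intercept plus `cols`."""
    Xb = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    resid = y - Xb @ coef
    return float(resid @ resid)

def aic(X, y, cols):
    """AIC for Gaussian errors, up to an additive constant."""
    n = len(y)
    k = len(cols) + 1  # fitted coefficients, including the intercept
    return n * np.log(rss(X, y, cols) / n) + 2 * k

def forward_select(X, y):
    """Add one predictor at a time while the AIC keeps decreasing."""
    selected, remaining = [], list(range(X.shape[1]))
    best = aic(X, y, selected)
    while remaining:
        score, col = min((aic(X, y, selected + [c]), c) for c in remaining)
        if score >= best:
            break  # no candidate improves the criterion
        selected.append(col)
        remaining.remove(col)
        best = score
    return selected

# Hypothetical data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

selected = forward_select(X, y)
print(selected)
```

Backward elimination works the same way in reverse: start with all predictors and drop the least useful one at each step.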
Ridge regression is a technique used when the data suffers from multicollinearity (independent variables are highly correlated). In multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, which can push the estimated values far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
Above, we saw the equation for linear regression. Remember? It can be represented as:
y=a+ b*x
This equation also has an error term. The complete equation becomes:
y=a+b*x+e (error term), [error term is the value needed to correct for a prediction error between the observed and predicted value]
=> y = a + b1x1 + b2x2 + .... + e, for multiple independent variables.
In a linear equation, prediction errors can be decomposed into two sub-components. The first is due to bias and the second is due to variance. Prediction error can occur due to either one of these components or both. Here, we'll discuss the error caused by variance.
Ridge regression solves the multicollinearity problem through the shrinkage parameter λ (lambda). Look at the equation below.
β(ridge) = argmin over β of { sum_i (y_i - x_i*β)^2 + λ * sum_j β_j^2 }
In this equation, we have two components. The first one is the least squares term and the other one is lambda times the summation of β2 (beta squared), where β is the coefficient. This is added to the least squares term in order to shrink the parameters so that they have a very low variance.
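A minimal sketch of this idea, assuming the standard closed-form ridge solution and leaving the intercept unpenalized by centering the data first. The two nearly identical predictors below are synthetic and only meant to mimic multicollinearity.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (Xc^T Xc + lam * I) beta = Xc^T yc on centered data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering keeps the intercept unpenalized
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Hypothetical, almost perfectly collinear predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.1, size=100)

_, beta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
_, beta_ridge = ridge_fit(X, y, lam=1.0)
print(beta_ols, beta_ridge)
```

With λ = 0 the two coefficients can swing far apart because the data barely distinguish x1 from x2. A positive λ pulls them towards each other while their sum stays close to the true combined effect.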
Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Look at the equation below:
β(lasso) = argmin over β of { sum_i (y_i - x_i*β)^2 + λ * sum_j |β_j| }
Lasso regression differs from ridge regression in that it uses absolute values in the penalty function, instead of squares. Penalizing (or, equivalently, constraining) the sum of the absolute values of the estimates causes some of the parameter estimates to turn out exactly zero. The larger the penalty applied, the further the estimates get shrunk towards zero. This results in variable selection out of the given n variables.
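To see how an absolute-value penalty produces exact zeros, here is a rough sketch of coordinate descent with soft-thresholding, one common way to fit a lasso. As an assumption of mine, the least squares term is scaled by 1/(2n), which only changes the scale of λ. The data and the value of `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam, returning exactly 0 when |rho| <= lam."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for (1/(2n)) * ||y - X beta||^2 + lam * sum(|beta_j|)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta

# Hypothetical centered data: only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
y = y - y.mean()

beta = lasso_cd(X, y, lam=0.5)
print(beta)
```

The coefficients of the irrelevant columns land exactly on zero, while the two real effects survive in shrunken form.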
ElasticNet is a hybrid of the Lasso and Ridge regression techniques. It is trained with both L1 and L2 priors as regularizers. Elastic-net is useful when there are multiple features which are correlated. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.
A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge's stability under rotation.
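The following sketch extends coordinate descent with both penalties. The parametrization (a single `lam` split between L1 and L2 by `alpha`) is one common convention that I am assuming here, and the two nearly identical features are synthetic.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, sweeps=200):
    """Coordinate descent for
    (1/(2n)) * ||y - X beta||^2 + lam * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(sweeps):
        for j in range(p):
            partial = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial / n
            z = X[:, j] @ X[:, j] / n
            # L1 part thresholds the numerator, L2 part inflates the denominator.
            beta[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
    return beta

# Hypothetical pair of almost identical, centered features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y = x1 + x2 + rng.normal(scale=0.1, size=200)
y = y - y.mean()

beta = elastic_net_cd(X, y, lam=0.5, alpha=0.5)
print(beta)
```

Both correlated features keep a similar, non-zero coefficient here, which is the grouping behaviour described above.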
Beyond these 7 most commonly used regression techniques, you can also look at other models like Bayesian, Ecological and Robust regression.
Life is usually simple when you know only one or two techniques. One of the training institutes I know of tells its students: if the outcome is continuous, apply linear regression. If it is binary, use logistic regression! However, the higher the number of options available at our disposal, the more difficult it becomes to choose the right one. A similar case happens with regression models.
Within multiple types of regression models, it is important to choose the best-suited technique based on the type of independent and dependent variables, the dimensionality of the data and other essential characteristics of the data. Below are the key factors that you should consider to select the right regression model:
By now, I hope you have got an overview of regression. These regression techniques should be applied with the conditions of the data in mind. One of the best tricks for finding out which technique to use is to check the family of the variables, i.e. discrete or continuous.
In this article, I discussed 7 types of regression and some key facts associated with each technique. If you are new to this industry, I'd advise you to learn these techniques and later implement them in your models.
From: http://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/