Predictive power and the pitfalls in regression

SeungUk Lee 2021. 5. 28. 12:05

In this post, I want to talk about the predictive power of linear regression and the pitfalls you can run into when you use it.

 

1. Predictive power : It's related to the correlation coefficient. More specifically, it describes how closely your sample approximates the regression line (the best-fitting straight line). You can think of the predictive power as how much better the regression line predicts the response variable than simply using the mean of the response variable. It's also called the proportion of explained variation, or R squared.

 1-1) The formula : There are 3 new terms to define: the total sum of squares, the residual sum of squares, and the regression sum of squares.

y hat means the predicted value of y for a specific value of x, and y bar means the mean of the response variable (y). The total sum of squares describes all the variation in the response variable. The residual sum of squares measures the residuals (errors), the vertical distances between the observed values and the regression line. The regression sum of squares is the difference between the total sum of squares and the residual sum of squares. Calculating the proportion of explained variation is then easy: you simply divide the regression sum of squares by the total sum of squares. The result tells you how closely the data in your sample approximate the regression line.
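For reference, the definitions above can be written out in standard notation (y_i are the observed values, y hat_i the predicted values, y bar the mean of y):

```latex
\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 ,\qquad
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ,\qquad
\mathrm{SS}_{\mathrm{reg}} = \mathrm{TSS} - \mathrm{RSS}

R^2 = \frac{\mathrm{SS}_{\mathrm{reg}}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
```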

 1-2) The number between 0 and 1 : Since the correlation coefficient is a number between -1 and 1, you get an interval from 0 to 1 when you square r. So, the proportion of explained variation is between 0 and 1. When the value is 0, it indicates that the mean of the response variable is the best prediction you can make. But if the value is 1, you have a perfect linear fit. In other words, all the data in your sample lie exactly on the regression line!
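To make this concrete, here is a minimal sketch in Python using numpy and scipy (the six data points and variable names are made up for illustration). It computes R squared once from the sums of squares and once by squaring r, and the two values agree:

```python
import numpy as np
from scipy import stats

# Small illustrative dataset (made up for this example)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Fit the best-fitting straight line: y_hat = slope * x + intercept
slope, intercept, r, p_value, std_err = stats.linregress(x, y)
y_hat = slope * x + intercept

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_reg = tss - rss                  # regression sum of squares

print(ss_reg / tss)  # proportion of explained variation
print(r ** 2)        # same number, obtained by squaring r
```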

 

2. The pitfalls in regression : As always, when you do something or face a new situation, you should consider many factors or variables to avoid the worst problems that could happen to you, and you need to be prepared for them. I think mathematics is no different. So, in this chapter, I want to share the things you should know when you use linear regression.

 2-1) Correlation and simple linear regression capture linear association. To show what can go wrong if you apply a linear regression to nonlinear data, ANSCOMBE'S QUARTET is the classic example.

Those four cases have (nearly) the same correlation coefficient (about 0.82) and proportion of explained variation (about 0.67). But depending on how the data spread, they have very different shapes. So, you should be aware that the scatter can look completely different in spite of the same r and r squared.
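If you want to check this yourself, here is a minimal sketch (it assumes seaborn and scipy are installed; seaborn bundles Anscombe's quartet as an example dataset):

```python
import seaborn as sns
from scipy import stats

# seaborn ships Anscombe's quartet as an example dataset
# (load_dataset downloads and caches it on first use)
df = sns.load_dataset("anscombe")

for name, group in df.groupby("dataset"):
    r = stats.pearsonr(group["x"], group["y"])[0]
    print(f"dataset {name}: r = {r:.3f}, r squared = {r**2:.3f}")

# All four datasets print nearly identical values (r around 0.82),
# even though their scatter plots look completely different.
```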

 2-2) Outliers can have a large influence if they are far away from the regression line. There are cases where most of the data fit the line, but because a single point deviates strongly from it, the correlation coefficient decreases. Or the data don't really have a linear relation, but a single extreme point lines up so well that the correlation coefficient increases. So you should always check whether there is an outlier.
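A small sketch of the first case, with made-up data, shows how a single point far from the line pulls the correlation down:

```python
import numpy as np
from scipy import stats

# Points that lie almost perfectly on a line (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 2.9, 4.1, 5.0])
print("without outlier:", stats.pearsonr(x, y)[0])  # close to 1

# Add a single point far below the line
x_out = np.append(x, 6.0)
y_out = np.append(y, 0.5)
print("with outlier:   ", stats.pearsonr(x_out, y_out)[0])  # much lower
```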

 2-3) Erroneously inferring causation : the phrase sounds abstract, but it's quite straightforward. It simply means that a correlation coefficient, by itself, tells you nothing about causation. The only way to tell whether the relation is causal is to perform a truly randomized experiment.

 2-4) Inappropriate extrapolation : you can't extrapolate the regression line endlessly beyond a specific range. Within a particular range the data may have a linear relation, but that doesn't guarantee the relation goes on forever, because at some point the relation can change.
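A quick sketch with a hypothetical curved relation (made up for illustration) shows how far off an extrapolated linear prediction can be:

```python
import numpy as np
from scipy import stats

# A relation that is only roughly linear over a limited range
x = np.linspace(0.0, 10.0, 50)
y = 10.0 * np.sqrt(x)  # grows more and more slowly

# Fit a straight line using only x between 0 and 10
slope, intercept, r, _, _ = stats.linregress(x, y)

# Extrapolate far outside the observed range
x_new = 100.0
print("linear extrapolation:", slope * x_new + intercept)
print("actual value:        ", 10.0 * np.sqrt(x_new))
# The fitted line keeps climbing at the same rate while the real
# relation flattens out, so the extrapolated prediction is far off.
```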

 2-5) Ecological fallacy : it means drawing inappropriate conclusions about individual cases when the correlation or regression is based on aggregates of those cases. For example, in some study you collect data from different countries or areas, and you find a similar pattern between the variables even though the places where the phenomenon happens are different. So you think it's fine to put all the data together, work with the aggregated values, and draw a conclusion about the whole data. But this can lead to wrong conclusions, because aggregating the data tends to tighten the relation: the correlation can increase, and the data cloud becomes thinner than the original, individual-level data. A good thing is that the regression line itself usually isn't influenced that much. It's fine as long as you don't use results based on the aggregated data to draw conclusions at the level of individuals.
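A minimal simulation sketch (hypothetical data: five "countries" of 100 individuals each, all numbers made up) shows how aggregation inflates the correlation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Individual-level data: a weak positive relation plus a lot of noise
group_centers = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.concatenate([c + rng.normal(0.0, 1.0, 100) for c in group_centers])
y = 0.5 * x + rng.normal(0.0, 2.0, 500)
groups = np.repeat(np.arange(5), 100)

# Correlation at the level of individuals
print("individual-level r:", stats.pearsonr(x, y)[0])   # modest

# Correlation after aggregating each group to its mean
x_agg = np.array([x[groups == g].mean() for g in range(5)])
y_agg = np.array([y[groups == g].mean() for g in range(5)])
print("aggregated r:      ", stats.pearsonr(x_agg, y_agg)[0])  # much higher
```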

 2-6) The restriction of range : it simply means that our sample contains only a limited range of predictor values. To me, it sounds a bit similar to inappropriate extrapolation. It can seriously lower Pearson's r and r squared.
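A small simulated sketch (made-up data) shows the effect: restricting the predictor to a narrow slice clearly lowers r:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A noisy linear relation over the full range of the predictor
x = rng.uniform(0.0, 100.0, 500)
y = x + rng.normal(0.0, 20.0, 500)
print("full range:      ", stats.pearsonr(x, y)[0])  # fairly strong

# Keep only a narrow slice of predictor values
mask = (x > 40.0) & (x < 60.0)
print("restricted range:", stats.pearsonr(x[mask], y[mask])[0])  # clearly weaker
```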