The regression equation and the regression model
In this post, I'm going to talk about the regression equation and the regression model.
1. The regression equation : In simple linear regression, we assume the relationship between the two quantitative variables is linear, so first of all, the line must be straight. Second, we get the best predictions from the line that produces the predicted scores. So we need to find the straight line that minimizes the distance between the observed and predicted scores across all the cases.
1-1) A residual (prediction error) : the difference between each observed and predicted score. It can be expressed like this: residual = y - y hat.
Y hat is the predicted score generated by the regression equation. The regression equation, from the earlier post, is the intercept plus the regression coefficient times a value of x (the independent, or predictor, variable): y hat = a + b·x.
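To make this concrete, here is a tiny sketch in Python. The data values and the intercept and slope are all made up for illustration:

```python
# Made-up numbers: a fitted line y_hat = a + b*x and its residuals.
x = [1.0, 2.0, 3.0, 4.0, 5.0]       # predictor values
y = [2.1, 3.9, 6.2, 8.1, 9.8]       # observed responses

a, b = 0.1, 2.0                     # assumed intercept and regression coefficient

y_hat = [a + b * xi for xi in x]                   # predicted scores
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # observed minus predicted

print(residuals)
```

Each residual tells you how far the line missed that particular case; the best line keeps these misses as small as possible overall.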
It's a pretty simple idea. But you also have to consider whether the values you get for the intercept and the regression coefficient are the best ones. In other words, the regression line (or equation) has to fit the data as tightly as possible. From here, you might have a question: which values for the intercept and the regression coefficient are the best? Fortunately, we can calculate the best prediction line with two formulas. I'm going to introduce these formulas.
Let's look at these formulas. First, a is the intercept of the regression equation, which I mentioned in the previous post. It determines where the line is placed: it is where the line crosses the y axis when the predictor value (x) equals 0. Its formula looks a bit weird because there are bars over the variables, y bar and x bar, but don't worry: the bars just denote the means of those variables. The formula is a = y bar - b·x bar.

Second, the regression coefficient b. To calculate the best value for it, you need the correlation coefficient of the given data and the standard deviation of each variable: b = r·(s_y / s_x). With those three values, you can get the best slope without a bothersome process.

When we use linear regression, each case carries two pieces of information, the values of x and y, and these formulas do something almost magical with them. All of them come from a method called ordinary least squares. To understand the method fully, you need some calculus, and since I'm not good at mathematics, I won't deal with the calculus parts. But the concept itself is simple to explain: take the difference between each observed and predicted score, square those differences, and add them up. The best-fitting line is the one that makes this sum as small as possible.
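As a sketch, here is how the two formulas can be computed by hand in Python (the data values are made up for illustration):

```python
import statistics as stats

# Made-up example data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = stats.mean(x), stats.mean(y)  # the "bars": means of x and y
s_x, s_y = stats.stdev(x), stats.stdev(y)    # sample standard deviations

# Pearson's r, computed from the deviations about the means.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * (s_y / s_x)    # regression coefficient (slope): b ≈ 0.6 here
a = y_bar - b * x_bar  # intercept: a ≈ 2.2 here

print(a, b)
```

Note that the line always passes through the point (x bar, y bar); that is exactly what the intercept formula guarantees.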
1-2) The regression coefficient : as the formula shows, the slope is built from the correlation coefficient of the data and the standard deviation of each variable. Because of the correlation coefficient (or Pearson's r), it has an interesting property. Here it is.
The formula for the regression coefficient contains the correlation coefficient, so Pearson's r influences the direction of the line. If the value of the correlation coefficient is positive, then the regression coefficient is also positive and the line goes up (increases); if Pearson's r is negative, then the regression coefficient is also negative and the line goes down (decreases). When the correlation coefficient equals 0, the regression line is horizontal, because there is no linear relation between the two quantitative variables.
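A small sketch (with made-up data) showing that the slope's sign follows the sign of the association:

```python
# Made-up data illustrating how the slope's sign tracks the association.
def slope(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx  # algebraically equal to r * (s_y / s_x)

x = [1, 2, 3, 4, 5]
up = [2, 4, 5, 4, 5]      # positive association
down = up[::-1]           # same values reversed: negative association
flat = [3, 3, 3, 3, 3]    # no association at all

print(slope(x, up))    # 0.6  -> line goes up
print(slope(x, down))  # -0.6 -> line goes down
print(slope(x, flat))  # 0.0  -> horizontal line
```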
1-3) The differences between a correlation coefficient and a regression coefficient
As you can see, there are some obvious differences between them. But from a beginner's point of view, I think they can be a bit confusing because of the terminology; especially if you aren't a native speaker, it may take a while to keep them apart. In short, Pearson's r is unitless, symmetric in x and y, and always between -1 and 1, whereas the regression coefficient depends on the units of the variables and on which variable is being predicted. But as the saying goes, "practice makes perfect", so you can make it!
1-4) The intercept : This part has nothing special, so I'll just explain it briefly. To get the best value of the intercept, you just need the means of the predictor and the response, and the value of the regression coefficient: a = y bar - b·x bar. With those three values, you can get the intercept.
You should note that if the value of b equals 0, then the intercept becomes the mean of y (a = y bar - 0·x bar = y bar), so the regression line is simply the horizontal line at the mean of y.
This implies that when the two variables are not related to each other, the best-fitting line is horizontal, which means the mean of the response variable is the best prediction.
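A quick sketch (with made-up numbers) of why the mean is the best constant prediction under least squares:

```python
# When b = 0, every prediction is y_bar, and y_bar is the constant
# that minimizes the sum of squared errors.
y = [3.0, 5.0, 4.0, 6.0, 2.0]
y_bar = sum(y) / len(y)  # 4.0

def sse(c):
    """Sum of squared errors if we predict the constant c for every case."""
    return sum((yi - c) ** 2 for yi in y)

print(sse(y_bar))        # 10.0
print(sse(y_bar + 0.5))  # 11.25 -- worse
print(sse(y_bar - 1.0))  # 15.0  -- worse
```

Any constant other than the mean produces a larger sum of squared errors, which is exactly the ordinary-least-squares sense in which the mean is "best".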
2. The regression model : So far I've talked about the regression line in a small sample. This time, I want to discuss the population! The concept of the regression model in this chapter is pretty similar to the regression line in a sample, but here we need different assumptions.
2-1) The assumptions : In simple linear regression, we assume that the conditional distributions of y have exactly the same shape and standard deviation for every value of x. Another assumption is that the line describes the population means of those conditional distributions.
Note that the second assumption is based on the relation being linear.
2-2) The equation for this situation : since we're dealing with population distributions, we cannot use the same equation as before. Instead, there is another equation for this situation: μ = α + β·x, where μ is the population mean of y at a given value of x.
This equation looks very similar to the equation for a sample, but it uses different symbols for the values. Here it uses Greek letters (α and β), which are used to describe a population; for a sample, it uses Roman letters (a and b), which describe sample statistics.
2-3) Describing individual cases in the population : obviously, a difference exists between linear regression for populations and for samples. But there is also a formula to predict an individual case in the population. I think this works because a sample comes from the population, so every case in the sample is also a case in the population. Here is the formula for this situation: y = α + β·x + ε.
Alpha and beta appear here as well because we want to describe an individual case with the population parameters. And here you can see a new term, called epsilon (ε). It represents the variation around the conditional mean. When you use this formula, you have to check these assumptions; otherwise, you won't get a good answer.
- The epsilons are normally distributed with a uniform (constant) standard deviation.
- The epsilons have a mean of zero.
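The population model and its error assumptions can be simulated like this (alpha, beta, and sigma below are made-up values for illustration):

```python
import random

# Simulated sketch of the population model y = alpha + beta*x + epsilon.
random.seed(42)
alpha, beta, sigma = 1.0, 2.0, 0.5

xs = [i / 100 for i in range(1000)]
eps = [random.gauss(0.0, sigma) for _ in xs]  # normal errors, constant SD
ys = [alpha + beta * x + e for x, e in zip(xs, eps)]

# With many cases, the errors average out to roughly zero,
# which is the mean-zero assumption in action:
print(sum(eps) / len(eps))
```

Each simulated y is the conditional mean α + β·x plus a normal error, which is exactly what the individual-case formula says.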