
Fisher's exact test and linear regression (regression line)

SeungUk Lee 2021. 5. 24. 20:18

In this post, I'm going to talk about Fisher's exact test and linear regression.

 

1. Fisher's exact test : When the expected frequencies in a contingency table are small, you cannot use a chi-square test for independence. However, in the case of a two-by-two table, there's an alternative test that is designed for small samples. It's called Fisher's exact test.

 1-1) How to do Fisher's exact test? Fisher's exact test takes the count in a single cell as the test statistic. You can then formulate a null hypothesis that its observed count is equal to its expected count, versus the alternative that it is unequal. When using a two-sided hypothesis, Fisher's exact test tests the same independence hypothesis as the chi-square test. It's also possible to use a one-sided test, but you should have a very good reason to do so.
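As a quick illustration (not from the original post), here is a minimal Python sketch using scipy.stats.fisher_exact; the 2x2 counts and the "greater" alternative are purely hypothetical:

```python
# A minimal sketch of Fisher's exact test, assuming SciPy is available.
# The 2x2 table below is made up purely for illustration.
from scipy.stats import fisher_exact

# Rows: group A / group B, columns: success / failure (hypothetical counts)
table = [[8, 2],
         [1, 5]]

# Two-sided test: the usual choice, testing the same independence
# hypothesis as the chi-square test would on a larger sample.
odds_ratio, p_two_sided = fisher_exact(table, alternative="two-sided")

# One-sided test: only use this when you have a strong a priori reason.
_, p_one_sided = fisher_exact(table, alternative="greater")

print(f"two-sided p-value: {p_two_sided:.4f}")
print(f"one-sided p-value: {p_one_sided:.4f}")
```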

 

 1-2) The idea of the test : Given that the marginal totals are fixed, there is only a limited number of arrangements possible in the table. By listing all possible arrangements, you can build the probability distribution of all possible values in a given cell, and then check how extreme your observed frequency is in this distribution.
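To make this concrete, here is a small Python sketch (with hypothetical margins) that lists every possible value of the top-left cell under fixed margins, together with its probability:

```python
# Sketch of the idea behind Fisher's exact test: with the margins fixed,
# only a limited number of tables are possible, and the top-left cell
# follows a hypergeometric distribution. The margins below are hypothetical.
from scipy.stats import hypergeom

row1_total, row2_total = 10, 6      # fixed row totals
col1_total = 9                      # fixed first-column total
n = row1_total + row2_total         # grand total

# The smallest and largest values the top-left cell can take given the margins
a_min = max(0, col1_total - row2_total)
a_max = min(row1_total, col1_total)

for a in range(a_min, a_max + 1):
    # P(top-left cell = a) under the null hypothesis of independence
    prob = hypergeom.pmf(a, n, row1_total, col1_total)
    print(f"top-left cell = {a:2d}   probability = {prob:.4f}")
```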

 

 1-3) The formula for the probability of a specific configuration :
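In standard notation, for a 2x2 table with cell counts a, b, c, d and total n = a + b + c + d, the probability of a specific configuration (with the margins fixed) is given by the hypergeometric formula:

$$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}$$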

You use the formula to calculate the probability of each possible configuration. The p-value is the sum of the probabilities of all configurations at least as extreme as the observed one, and you compare it to the 5% significance level.

 

** The general method behind Fisher's exact test can in fact be applied to larger tables. The general approach and the interpretation of the results are exactly the same.

 

 

2. The regression line : We analyze the relation between two quantitative variables: one independent variable and one dependent variable.

 

 2-1) The correlation coefficient: It determines how strongly the variables are linearly related. It is a number between -1 and 1 that expresses how tightly the data fit around an imaginary straight line through the scatter plot. The formula to calculate the correlation coefficient is shown below.
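In standard notation, with sample means $\bar{x}$ and $\bar{y}$:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}$$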

It's also called Pearson's R.

 

 2-2) The regression equation : The correlation coefficient, however, cannot describe the relation between two quantitative variables in specific terms. It would be useful to be able to do that, and to predict an exact score on the dependent variable for a specific value of the independent variable. This is why we use linear regression. It makes it possible to describe the relation mathematically through a regression equation. This allows us to do a couple of interesting things.

  • we can use inferential statistics to test if the equation is likely to be an accurate description of the relation in the population.
  • we can also see how closely the predictions approximate the observed data points. In other words, how good our predictions are.
  • we can use the regression equation to identify outliers.
  • we can generate predictions for new cases.

How does regression work? We distinguish between an independent and a dependent variable. The dependent variable, also called the outcome or response variable, is the variable that we want to predict. The independent variable, also called the predictor or explanatory variable, is the variable that is used to predict the dependent variable. The response variable always goes on the y axis, and the predictor variable always goes on the x axis.

 

** In some cases, when the causal direction is unclear, it's arbitrary which variable we consider the predictor and which the response. In such cases, the choice is simply determined by how we choose to frame the research question.

 

 2-3) The regression line : It's an imaginary straight line, the best-fitting straight line through the scatter plot. We call this the regression line, and it's described by the regression equation, shown below.
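In the usual notation, the regression equation is:

$$\hat{y}_i = a + b\,x_i$$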

Y hat means the predicted score on the response variable y for case i, given the value x. The predicted score is determined by the intercept a and the regression coefficient, or slope, b. The intercept a is the predicted value of y when x equals 0, which is where the line crosses the y axis; it determines the vertical position of the line. The regression coefficient b determines the slope of the line.
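To tie this together, here is a minimal Python sketch (with made-up data) that fits a regression line and reports the intercept a, the slope b, and Pearson's r:

```python
# Minimal sketch: fit the regression line y-hat = a + b*x on made-up data
# and report the intercept a, the slope b, and Pearson's r.
from scipy.stats import linregress

# Hypothetical data: hours studied (predictor x) vs. exam score (response y)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [52, 55, 61, 60, 68, 71, 75, 80]

result = linregress(x, y)
a = result.intercept    # predicted y when x = 0 (where the line crosses the y axis)
b = result.slope        # change in the predicted y per one-unit increase in x
r = result.rvalue       # Pearson's correlation coefficient

print(f"regression equation: y-hat = {a:.2f} + {b:.2f} * x")
print(f"Pearson's r = {r:.3f}")

# Use the equation to generate a prediction for a new case, e.g. x = 10
print(f"predicted score for x = 10: {a + b * 10:.1f}")
```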