
Spearman correlation and the runs test

SeungUk Lee 2021. 6. 30. 13:20

On this page, I'm going to talk about the Spearman correlation and the runs test.

1. Spearman correlation : A correlation coefficient is a standardized measure of the degree to which two variables are associated. Standardization means that it varies over a fixed range; most correlation coefficients lie between -1 and +1. In general, the Pearson correlation coefficient is sensitive to outliers and to skewness in the distribution of one or both variables. The Spearman correlation coefficient is a good replacement for the Pearson correlation if one of the following conditions applies to your variables.

 1-1) conditions

  • one or both of the variables are ordinal rather than numerical.
  • they're not linearly related.
  • they contain one or more outliers.
  • they don't follow a bivariate normal distribution, or you cannot check this distribution due to lack of data.

 ** A monotonic function is one that either never increases or never decreases as its independent variable increases. The Spearman correlation measures the strength of a monotonic relationship between paired data. This implies that the Spearman correlation coefficient can be applied to relationships that are monotonically increasing or monotonically decreasing.
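The definition of a monotonic relationship can be checked directly on paired data. The sketch below is only an illustration: `is_monotonic` is a hypothetical helper name, and the sample values are made up.

```python
def is_monotonic(xs, ys):
    """Return True if y never decreases, or never increases, as x increases."""
    # Sort the pairs by x so consecutive pairs follow increasing x.
    pairs = sorted(zip(xs, ys))
    # Change in y between each pair and the next one.
    diffs = [b[1] - a[1] for a, b in zip(pairs, pairs[1:])]
    never_decreases = all(d >= 0 for d in diffs)
    never_increases = all(d <= 0 for d in diffs)
    return never_decreases or never_increases


# y = x^2 on positive x is not linear, but it is monotonic.
print(is_monotonic([1, 2, 3, 4], [1, 4, 9, 16]))
# y = x^2 across zero first falls, then rises: not monotonic.
print(is_monotonic([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The first case is the kind of relationship where the Spearman correlation is appropriate even though the relationship is not linear.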

 1-2) The interpretation : To interpret the coefficient, you assume that the variables are monotonically related. Testing whether its value is significantly different from zero places no requirement on the distribution of the data.

 1-3) The calculation : Spearman's correlation is calculated by first ranking the values of each variable, assigning average ranks in the case of ties, and then computing the Pearson correlation on the ranked values. Because it works on ranked data, the Spearman correlation coefficient is also called the rank correlation coefficient.
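The two steps above (rank with average ranks for ties, then Pearson on the ranks) can be sketched in plain Python. The function names `average_ranks`, `pearson` and `spearman` and the sample data are illustrative assumptions, not part of any library.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole block of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data: y always rises with x, but the last point is an outlier.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 1000]
print("Pearson :", pearson(x, y))
print("Spearman:", spearman(x, y))
```

On this data the Pearson coefficient is pulled well below 1 by the outlier, while the Spearman coefficient is 1 up to floating-point rounding, because the ranks of `x` and `y` agree exactly. In practice, `scipy.stats.spearmanr` computes the same coefficient together with a p-value.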

 

 

2. The runs test : The runs test is a formal test of whether a sequence of categorical or numerical data is random, based on the order in which the individual labels occur.

 2-1) The meaning of a run : A run is a succession of identical values or labels that is preceded and followed by different values or labels (or by the start or end of the sequence). The number of data items it contains is the run length. The runs test only considers binary data, i.e. data with two possible labels. Because the runs test focuses on the ordering of the data, it's crucial that this ordering is unchanged from the moment the data was collected.
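To make the notion of a run and its length concrete, here is a minimal sketch using `itertools.groupby` from the standard library. The helper name `runs` and the coin-flip string are made up for illustration.

```python
from itertools import groupby


def runs(sequence):
    """Split a sequence into (label, run_length) pairs, in original order."""
    # groupby groups consecutive equal items, which is exactly a run.
    return [(label, len(list(group))) for label, group in groupby(sequence)]


flips = "HHTHTTTH"
print(runs(flips))       # [('H', 2), ('T', 1), ('H', 1), ('T', 3), ('H', 1)]
print(len(runs(flips)))  # 5 runs
```

Because `groupby` only merges neighbouring items, shuffling or sorting the data beforehand would change the result, which is why the original ordering must be preserved.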

 ** The runs test simply counts the number of runs in a given binary data sequence. This statistic, despite being rather simple, gives a good indication of whether the data is random.
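The text above does not say how the number of runs is turned into a decision. One common choice, assumed here, is the Wald-Wolfowitz normal approximation: with n1 and n2 items of each label, the expected number of runs under randomness is 2·n1·n2/(n1+n2) + 1, and the observed count is compared with it through a z-score. The function name `runs_test` and the sample sequences are illustrative, and the approximation is only reasonable for sequences that are not too short.

```python
from itertools import groupby
from math import erfc, sqrt


def runs_test(sequence):
    """Wald-Wolfowitz runs test (normal approximation) for a two-label sequence.

    Returns (observed_runs, expected_runs, z, two_sided_p_value).
    """
    labels = sorted(set(sequence))
    if len(labels) != 2:
        raise ValueError("the runs test needs exactly two distinct labels")
    n1 = sum(1 for v in sequence if v == labels[0])
    n2 = len(sequence) - n1
    n = n1 + n2
    # Number of runs = number of groups of consecutive equal labels.
    observed = sum(1 for _ in groupby(sequence))
    # Mean and variance of the run count if the ordering is random.
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (observed - expected) / sqrt(variance)
    # Two-sided p-value of a standard normal: P(|Z| >= |z|).
    p_value = erfc(abs(z) / sqrt(2))
    return observed, expected, z, p_value


alternating = "HT" * 10           # too many runs to look random
clustered = "H" * 10 + "T" * 10   # too few runs to look random
mixed = "HHTHTTHTHHTTHTHTTHHT"    # made-up sequence with a moderate run count

for name, seq in [("alternating", alternating), ("clustered", clustered), ("mixed", mixed)]:
    observed, expected, z, p = runs_test(seq)
    print(f"{name}: runs={observed}, expected={expected:.1f}, z={z:.2f}, p={p:.4f}")
```

Both too many runs (strict alternation) and too few runs (long clusters) give a small p-value, so the test is two-sided: either extreme is evidence against randomness.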