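- Pearson correlation coefficient: prerequisites are datasets of equal length that are continuous and normally distributed;
- Spearman rank correlation coefficient: consider it when any one of the Pearson prerequisites is not met;
- Kendall's tau rank correlation coefficient: unlike the previous two, it is intended for ordered categorical (ordinal) data and measures their ordinal association.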
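The list above mentions continuous and categorical data. What types of data are there in statistics?
1. Common data types in statistics
The following uses scipy.stats to walk briefly through where each of the three coefficients applies.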
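2. Pearson correlation coefficient
Prerequisites: datasets of equal length, continuous data, and normally distributed data. The scipy description follows, together with my own notes: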
scipy.stats.pearsonr(x, y)
Pearson correlation coefficient and p-value for testing non-correlation.
The Pearson correlation coefficient [1] measures the linear relationship between two datasets [Note: it measures the linear correlation between the two datasets]. The calculation of the p-value relies on the assumption that each dataset is normally distributed [Note: the assumption is that both datasets are normally distributed, which means the data must be continuous]. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. [Note: the Pearson coefficient ranges from -1 to 1; negative values mean negative correlation, 0 means no correlation, and positive values mean positive correlation]
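To make the description concrete, here is a minimal sketch of calling `scipy.stats.pearsonr`. The data are hypothetical (a noisy linear relationship generated with a fixed seed), and the variable names are my own, not from the scipy documentation.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y is a noisy linear function of x (fixed seed for reproducibility).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# pearsonr returns the coefficient and the p-value for the null hypothesis of no correlation.
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.3g}")

# An exact linear relationship with negative slope gives r = -1 (up to floating-point error).
r_exact, _ = stats.pearsonr(x, -3.0 * x + 1.0)
print(f"exact negative linear relationship: r = {r_exact:.3f}")
```

Both inputs must have the same length, which matches the "equal length" prerequisite listed above.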
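3. Spearman rank correlation coefficient
Prerequisites: consider this coefficient when any one of the Pearson prerequisites is not met.
Spearman is computed in much the same way as Pearson, except that the calculation is carried out on the ranks of the two variables rather than on their raw values.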
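The relationship between the two coefficients can be checked numerically. The sketch below, on hypothetical data, ranks both variables with `scipy.stats.rankdata` and feeds the ranks to `pearsonr`; the result should agree with `spearmanr` on the raw values up to floating-point error.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a nonlinear but mostly monotonic relationship with a little noise.
rng = np.random.default_rng(1)
a = rng.normal(size=50)
b = a ** 3 + rng.normal(scale=0.1, size=50)

# Spearman computed directly on the raw values.
rho, _ = stats.spearmanr(a, b)

# Pearson computed on the ranks of the same values.
r_on_ranks, _ = stats.pearsonr(stats.rankdata(a), stats.rankdata(b))

print(f"spearmanr: {rho:.6f}, pearsonr on ranks: {r_on_ranks:.6f}")
```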
The scipy description follows, together with my own notes:
scipy.stats.spearmanr(a, b=None, axis=0, nan_policy='propagate')
Calculate a Spearman correlation coefficient with associated p-value.
The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed [Note: the two datasets do not need to be normally distributed]. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. [Note: the coefficient assesses how well the relationship between two variables can be described by a monotonic function. When the data contain no repeated values and the two variables are perfectly monotonically related, the Spearman coefficient is +1 or -1.]
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so. [Note: the p-value is probably only trustworthy when the dataset has more than about 500 elements.]
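As an illustration of "monotonic" versus "linear", the sketch below uses a hypothetical exponential relationship with no repeated values. Spearman should report a perfect monotonic relationship, while Pearson stays clearly below 1 because the relationship is not linear.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y grows strictly with x, but exponentially rather than linearly.
x = np.linspace(0.1, 5.0, 50)
y = np.exp(x)

rho, _ = stats.spearmanr(x, y)   # ranks of x and y are identical, so rho is 1 up to rounding
r, _ = stats.pearsonr(x, y)      # the linear fit is imperfect, so r is below 1

print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```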
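4. Kendall's tau rank correlation coefficient
Prerequisites: unlike the previous two, it is intended for ordered categorical (ordinal) data and measures their ordinal association.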
scipy.stats.kendalltau(x, y, initial_lexsort=None, nan_policy='propagate', method='auto')
Calculate Kendall’s tau, a correlation measure for ordinal data. [Note: it assesses the correlation between ordered categorical variables]
Kendall’s tau is a measure of the correspondence between two rankings [Note: between two rankings]. Values close to 1 indicate strong agreement, values close to -1 indicate strong disagreement. This is the 1945 “tau-b” version of Kendall’s tau [2], which can account for ties and which reduces to the 1938 “tau-a” version [1] in absence of ties.
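A minimal sketch of `scipy.stats.kendalltau` on ordinal data follows. The two rating lists are hypothetical (two reviewers scoring the same eight items on a 1 to 5 scale) and contain ties, which is the case tau-b is designed to handle. Only the two positional arguments are passed, so the defaults of the remaining parameters apply.

```python
from scipy import stats

# Hypothetical ordinal ratings (1 = lowest, 5 = highest) from two reviewers for the same 8 items.
reviewer_a = [1, 2, 2, 3, 4, 4, 5, 5]
reviewer_b = [1, 1, 2, 3, 3, 5, 4, 5]

# Mostly concordant rankings with some ties: tau is positive but below 1.
tau, p_value = stats.kendalltau(reviewer_a, reviewer_b)
print(f"tau = {tau:.3f}, p-value = {p_value:.3g}")

# A fully reversed ranking without ties gives tau = -1 (strong disagreement).
tau_rev, _ = stats.kendalltau([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(f"reversed ranking: tau = {tau_rev:.3f}")
```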
References:
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html#scipy.stats.kendalltau
- https://wenku.baidu.com/view/898ace2fb4daa58da0114aac.html
- https://segmentfault.com/a/1190000007904710
- https://blog.csdn.net/lilanfeng1991/article/details/25681947