Numpy/Pandas correlate multiple arrays of different length: answers

【Question title】: Numpy/Pandas correlate multiple arrays of different length
【Posted】: 2021-06-07 16:46:42
【Question】:

I can correlate two arrays of different lengths using this method:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
df = pd.DataFrame(dict(x=a))

CORR_VALS = np.array(b)
def get_correlation(vals):
    return pearsonr(vals, CORR_VALS)[0]

df['correlation'] = df.rolling(window=len(CORR_VALS)).apply(get_correlation)

I get a result like this:

In [1]: df
Out[1]: 

    x  correlation
0  0.0          NaN
1  0.4          NaN
2  0.2          NaN
3  0.4          NaN
4  0.2          NaN
5  0.4     0.527932
6  0.2    -0.159167
7  0.5     0.189482

First, the Pearson coefficient should just be the highest number in this data set...
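To make the rolling output concrete: each non-NaN value is the Pearson r of b against one 6-element window of a, and the single number asked for here is the maximum over those windows. The following is only a minimal sketch of that idea using np.corrcoef instead of rolling/pearsonr; the helper name window_correlations is hypothetical and not part of the original post.

```python
import numpy as np

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]

def window_correlations(long_vals, short_vals):
    """Pearson r of short_vals against every same-length window of long_vals."""
    long_arr = np.asarray(long_vals, dtype=float)
    short_arr = np.asarray(short_vals, dtype=float)
    n = len(short_arr)
    # One correlation per starting offset of the window inside long_vals
    return np.array([
        np.corrcoef(long_arr[i:i + n], short_arr)[0, 1]
        for i in range(len(long_arr) - n + 1)
    ])

r_values = window_correlations(a, b)  # three windows, matching rows 5 to 7 above
best_r = r_values.max()               # the "highest number" the question refers to
print(r_values)
print(best_r)
```

With the data above, the three values correspond to rows 5, 6 and 7 of the rolling output, and best_r is the first of them.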

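Second, how can I do this for multiple sets of data? I would like an output like the one I get from df.corr(), with the index and columns labeled appropriately.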

For example, say I have the following data sets:

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
c = [ 0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
d = [ 0.4, 0.2, 0.5]

I would like a correlation matrix containing 16 Pearson coefficients...

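【Question comments】:

Pearson R should lie in [-1, 1]... Regardless, it is unclear what you mean by correlation, because technically it is undefined for vectors of unequal length. The solution you point to computes the correlation of the smaller vector with every consecutive subset of the larger vector (so you get a rolling array of correlations), but 1) it is not clear what single value you want from that, and 2) it stops making sense once you do this pairwise with multiple vectors: how do you organize all of these unrelated rolling correlations?
Interpolating onto a regular grid is another option, but it comes with a whole other set of assumptions, and only you can decide whether they are appropriate.
@ALollz Yes, Pearson R should lie in [-1, 1]. The data sets a, b, c, d are not coefficients, in case that is how you interpreted them. 1) I want the Pearson coefficient, which is the highest number, reached when the two data sets are most correlated. I understand that it scans the smaller vector along the larger vector, but why doesn't it return the maximum (the Pearson coefficient)?
@ALollz 2) I don't want to organize any unrelated correlations; I want to organize the relevant ones, i.e. the Pearson coefficients. I want an organized grid of the highest correlation coefficients, like the one from df.corr(). Some of these vectors were "cut off" during data collection. I want to see whether they were correlated at least up to the point where data collection stopped.
You should transform the data according to its distribution to get an acceptable result; otherwise you are just producing numbers to display that do not mean anything. You can use scipy.signal.resample to make them the same length. After that, use pearsonr or any other method to get their correlation.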
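The two suggestions in the comments above (resampling with scipy.signal.resample, or interpolating onto a regular grid) could be sketched roughly as follows. This is an illustration rather than code from the thread: the helper names resample_to_common_length and interpolate_to_common_length are hypothetical, and both helpers assume every series spans the same interval with evenly spaced samples, which may not hold for the "cut off" series described in the question.

```python
import numpy as np
import pandas as pd
from scipy.signal import resample

series = {
    "a": [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5],
    "b": [25, 40, 62, 58, 53, 54],
    "c": [0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51],
    "d": [0.4, 0.2, 0.5],
}

def resample_to_common_length(data, target_len):
    """Fourier-resample every series to target_len samples.

    Assumption: scipy.signal.resample treats each signal as periodic.
    """
    return pd.DataFrame({
        name: resample(np.asarray(vals, dtype=float), target_len)
        for name, vals in data.items()
    })

def interpolate_to_common_length(data, target_len):
    """Linearly interpolate every series onto a shared grid on [0, 1]."""
    grid = np.linspace(0.0, 1.0, target_len)
    return pd.DataFrame({
        name: np.interp(grid, np.linspace(0.0, 1.0, len(vals)), vals)
        for name, vals in data.items()
    })

# Assumption: stretch everything to the length of the longest series
target_len = max(len(vals) for vals in series.values())

# Once the columns have equal length, df.corr() gives the labeled 4x4 matrix
corr_resampled = resample_to_common_length(series, target_len).corr()
corr_interpolated = interpolate_to_common_length(series, target_len).corr()
print(corr_resampled)
print(corr_interpolated)
```

Both matrices answer a different question than the sliding-window maximum: they compare the overall shapes after stretching, so the numbers will generally not match the rolling results.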

Tags: python pandas statistics data-science pearson-correlation

【Solution 1】:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
c = [ 0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
d = [ 0.4, 0.2, 0.5]

# Store the named series
dict_series = {'a': a, 'b': b, 'c': c, 'd': d}
list_series_names = list(dict_series)

def get_max_correlation_from_lists(a, b):
    # Make sure the longer list is the one placed in the DataFrame
    if len(b) >= len(a):
        a, b = b, a
    # Body taken from the original code in the question
    df = pd.DataFrame(dict(x=a))
    CORR_VALS = np.array(b)
    def get_correlation(vals):
        return pearsonr(vals, CORR_VALS)[0]
    # Keep only the largest of the rolling correlations
    return df.rolling(window=len(CORR_VALS)).apply(get_correlation).max().values[0]

# Build the matrix of maximum "correlations"
correlations_matrix = pd.DataFrame(index=list_series_names, columns=list_series_names)
for i in list_series_names:
    for j in list_series_names:
        correlations_matrix.loc[i, j] = get_max_correlation_from_lists(dict_series[i], dict_series[j])

print(correlations_matrix)
          a         b         c         d
a       1.0  0.527932  0.995791       1.0
b  0.527932       1.0   0.52229  0.427992
c  0.995791   0.52229       1.0  0.992336
d       1.0  0.427992  0.992336       1.0
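A possible variation on the solution above, offered only as a sketch: it keeps the same sliding-window maximum but stores the result as a float matrix and also records the offset at which the best window starts, which can help judge whether a high value is meaningful. The names best_window_correlation, max_corr and best_offset are hypothetical, and np.corrcoef is swapped in for pearsonr.

```python
import numpy as np
import pandas as pd

series = {
    "a": [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5],
    "b": [25, 40, 62, 58, 53, 54],
    "c": [0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51],
    "d": [0.4, 0.2, 0.5],
}

def best_window_correlation(x, y):
    """Return (max Pearson r, offset) of the shorter list slid along the longer one."""
    long_vals, short_vals = (x, y) if len(x) >= len(y) else (y, x)
    long_arr = np.asarray(long_vals, dtype=float)
    short_arr = np.asarray(short_vals, dtype=float)
    n = len(short_arr)
    r_values = np.array([
        np.corrcoef(long_arr[i:i + n], short_arr)[0, 1]
        for i in range(len(long_arr) - n + 1)
    ])
    # nanargmax skips windows where r is undefined (e.g. a constant window)
    best = int(np.nanargmax(r_values))
    return r_values[best], best

names = list(series)
max_corr = pd.DataFrame(np.nan, index=names, columns=names, dtype=float)
best_offset = pd.DataFrame(0, index=names, columns=names, dtype=int)
for i in names:
    for j in names:
        r, offset = best_window_correlation(series[i], series[j])
        max_corr.loc[i, j] = r
        best_offset.loc[i, j] = offset

print(max_corr)
print(best_offset)
```

Note that taking the maximum over many short windows is optimistic: d has only three points, so a value near 1.0 against a longer series is easy to reach and should be read with caution.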

【Comments】: