Answers: Efficiently calculating the Pearson correlation coefficient for only one column of an array

[Question title]: Calculate Pearson correlation coefficient for only 1 column of array efficiently
[Posted]: 2020-08-18 09:44:13
[Question]:

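I have an array of shape ~(700, 36000) and want to compute the Pearson correlation coefficient for only one specific column (against all other columns), but thousands of times. I have tried several approaches, and none of them seems efficient enough: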

import numpy

# numpy.corrcoef returns a plain ndarray, so index it directly (it has no .iloc)
df_corr = numpy.corrcoef(df.T)
corr_column = df_corr[:, column_index]

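This of course computes the entire correlation matrix, which takes about 12 seconds on my machine. That is a problem because I need to do this ~35,000 times (slightly altering the array each time before creating the correlation matrix)!

I have also tried iterating column by column: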

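# One coefficient per column, so size the result by the number of columns
corr_column = numpy.zeros(df.shape[1])

for x in range(df.shape[1]):
    # Correlation between the target column and column x
    corr_column[x] = numpy.corrcoef(df.iloc[:, column_index], df.iloc[:, x])[0, 1]

corr_column = corr_column.reshape(-1, 1)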

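At ~10 seconds per iteration this is slightly faster, but still too slow. Is there a faster way to find the correlation coefficients between one column and all the other columns?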

[Question comments]:

Tags: python numpy correlation

[Solution 1]:

You can implement the formula yourself:

import numpy as np

def corr(a, i):
    '''
    Parameters
    ----------
    a: numpy array
    i: column index

    Returns
    -------
    c: numpy array
       correlation coefficients of a[:,i] against all other columns of a
    '''

    mean_t = np.mean(a, axis=0)
    std_t = np.std(a, axis=0)

    mean_i = mean_t[i]
    std_i = std_t[i]

    mean_xy = np.mean(a*a[:,i][:,None], axis=0)

    c = (mean_xy - mean_i * mean_t)/(std_i * std_t)
    return c
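
For reference, the function above appears to rely on the "mean of products minus product of means" form of the Pearson coefficient. The notation below (n rows, row index k, column means written with a bar, population standard deviations written as sigma) is my own and is not taken from the answer. np.std defaults to ddof=0, which matches the 1/n averages used by np.mean, so the two are consistent:

```latex
r_{ij} \;=\; \frac{\dfrac{1}{n}\sum_{k=1}^{n} a_{ki}\,a_{kj} \;-\; \bar{a}_i\,\bar{a}_j}{\sigma_i\,\sigma_j},
\qquad
\bar{a}_j = \frac{1}{n}\sum_{k=1}^{n} a_{kj},
\qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(a_{kj}-\bar{a}_j\right)^2}
```

In the code, mean_xy holds the first term of the numerator for every column j at once, mean_i * mean_t is the second term, and std_i * std_t is the denominator.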


a = np.random.randint(0,10, (700,36000))

%timeit corr(a,0)
608 ms ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.corrcoef(a.T)
# I did not have the patience to let this finish on my machine.
# On a smaller sample, the implementation above is about 100x faster.
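
Since the full np.corrcoef run was not completed above, a quick correctness check on a small array may be useful. The sketch below is a variant of the same idea, not the answer's own code: it centers the columns once and replaces the element-wise product with a matrix-vector product. The name corr_centered and the test sizes are hypothetical, and it has not been benchmarked against corr here.

```python
import numpy as np

def corr_centered(a, i):
    """Pearson correlation of column i against every column of a.

    Hypothetical helper for illustration; assumes a is a 2-D array
    with observations in rows and variables in columns.
    """
    a = np.asarray(a, dtype=np.float64)
    centered = a - a.mean(axis=0)  # subtract each column's mean
    # Per-column sum of squares without building a squared copy of the array
    norms = np.sqrt(np.einsum("ij,ij->j", centered, centered))
    target = centered[:, i]
    # Vector-matrix product: un-normalized covariance of column i with each column
    return (target @ centered) / (norms[i] * norms)

# Sanity check against np.corrcoef on a small random sample
rng = np.random.default_rng(0)
small = rng.integers(0, 10, size=(50, 200))
expected = np.corrcoef(small.T)[:, 3]
result = corr_centered(small, 3)
print(np.allclose(result, expected))
```

Note that the result includes the entry for column i itself (equal to 1), and that a constant column has zero standard deviation, so both this sketch and corr above divide by zero for that column and yield nan.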

[Comments]: