How to perform linear correlation on a data set and return the column name which has the most correlation? Answers

【Question title】: How to perform linear correlation on a data set and return the column name which has the most correlation?
【Posted】: 2018-02-20 09:51:50
【Question】:

I am working on a data set that contains stock closing prices.

'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]

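These are the closing prices for the past 12 days. I need to determine which pair of the given companies' stocks has the most highly correlated daily percentage change in closing price, and return the two tickers as an array.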

import pandas as pd
import numpy as np

class StockPrices:
    # param prices dict of string to list. A dictionary containing the tickers of the stocks, and each ticker's daily prices.
    # returns list of strings. A list containing the tickers of the two most correlated stocks.
    @staticmethod
    def most_corr(prices):
        return 


#For example, with the parameters below the function should return ['FB', 'MSFT'].
prices = {
    'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]
}

print(StockPrices.most_corr(prices))

I have gone through the numpy correlation function, but how do I use it to determine which two of the vectors above are the most correlated?

【Question comments】:

Tags: python python-3.x numpy vector correlation

【Solution 1】:

import pandas as pd
import numpy as np

def most_corr(prices):
    """
    :param prices: (pandas.DataFrame) A dataframe containing each ticker's 
                   daily closing prices.
    :returns: (container of strings) A container, containing the two tickers that 
              are the most highly (linearly) correlated by daily percentage change.
    """
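    # Correlate the daily percentage changes, not the raw prices.
    returns = prices.pct_change().dropna(how="any")
    corr = returns.corr()
    # The largest value in each column is the 1.0 self-correlation, so the
    # second largest is that ticker's best correlation with another ticker.
    best = [sorted(corr[col].values)[-2] for col in corr.columns]
    # Rows containing the overall maximum belong to the two tickers of that pair.
    mask = corr.isin([max(best)]).any(axis=1)
    return list(corr.index[mask.to_numpy()])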
    



#For example, the code below should print: ('FB', 'MSFT')
print(most_corr(pd.DataFrame.from_dict({
    'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]
})))

【Comments】:

Some plain-text explanation alongside the code you posted would be very helpful.

【Solution 2】:

Here is my solution, which passes all the tests:

import pandas as pd
import numpy as np

def most_corr(prices):
    """
    :param prices: (pandas.DataFrame) A dataframe containing each ticker's 
                   daily closing prices.
    :returns: (container of strings) A container, containing the two tickers that 
              are the most highly (linearly) correlated by daily percentage change.
    """
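    n_cols = prices.shape[1]
    # Correlate the daily percentage changes, not the raw prices.
    df = prices.pct_change().dropna(how="any")
    cor = df.corr()
    # Start below any possible correlation so the first pair is always accepted.
    mx, row, col = float("-inf"), 0, 0
    # Visit each unordered pair once (upper triangle, diagonal excluded).
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # Keep the pair with the largest correlation seen so far.
            if cor.iloc[i, j] > mx:
                mx = cor.iloc[i, j]
                row = i
                col = j
    return [prices.columns[row], prices.columns[col]]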

【Comments】:

【Solution 3】:

As mentioned above, you can use the built-in calculation of Pearson's R on a DataFrame by calling the corr() function:

df = pd.DataFrame(prices)
df = df.pct_change()
df.corr()

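Note that what you are most interested in is the correlation of the stocks' daily returns, that is, the daily percentage change of each instrument. If you compute the correlation on the raw prices, you may see distortion caused by the different price levels. The daily returns can be computed with pandas' pct_change() function.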
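As a small illustration of that point, the sketch below shows what pct_change() computes for a short price series: each value is (p[t] - p[t-1]) / p[t-1], and the first entry is NaN because it has no predecessor. The series `fb` holds just the first four FB prices from the question, and `manual` is a hypothetical name for the hand-written equivalent.

```python
import math

import pandas as pd

# First four FB closing prices from the question.
fb = pd.Series([108.40, 107.92, 109.64, 112.22], name="FB")

# Built-in daily percentage change; the first entry is NaN.
returns = fb.pct_change()

# The same quantity written out by hand: (p[t] - p[t-1]) / p[t-1].
manual = (fb - fb.shift(1)) / fb.shift(1)

print(returns)
print(math.isnan(returns.iloc[0]))  # the first day has no previous price
```

Dropping that leading NaN row with dropna() before calling corr() is optional, since corr() ignores missing values pairwise, but it keeps the intermediate frame tidy.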

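You can then get the highest correlations for a given symbol by calling, for example, df.corr()['AAPL'].nlargest(2). Note that df.corr().max() simply returns each symbol's 1.0 correlation with itself, and for the same reason the first of the two nlargest values is the self-correlation. In many cases, however, you may be more interested in selecting the values above some threshold, for example: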

df.corr() > 0.85
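The snippets above stop at the correlation matrix and do not pick the single best pair out of it. One possible way to do that, sketched here as an assumption rather than as the answer's own method, is to blank out the diagonal and the lower triangle and then take the position of the largest remaining entry. The function name `most_corr_pair` is hypothetical.

```python
import numpy as np
import pandas as pd

prices = {
    'GOOG': [742.66, 738.40, 738.22, 741.16, 739.98, 747.28,
             746.22, 741.80, 745.33, 741.29, 742.83, 750.50],
    'FB':   [108.40, 107.92, 109.64, 112.22, 109.57, 113.82,
             114.03, 112.24, 114.68, 112.92, 113.28, 115.40],
    'MSFT': [55.40, 54.63, 54.98, 55.88, 54.12, 59.16,
             58.14, 55.97, 61.20, 57.14, 56.62, 59.25],
    'AAPL': [106.00, 104.66, 104.87, 105.69, 104.22, 110.16,
             109.84, 108.86, 110.14, 107.66, 108.08, 109.90],
}


def most_corr_pair(prices):
    """Return the two tickers whose daily percentage changes correlate most."""
    returns = pd.DataFrame(prices).pct_change().dropna(how="any")
    corr = returns.corr()
    values = corr.to_numpy().copy()
    # Mask the diagonal (self-correlation) and the duplicated lower triangle.
    values[np.tril_indices_from(values)] = -np.inf
    # Position of the largest remaining coefficient, as (row, column).
    i, j = np.unravel_index(np.argmax(values), values.shape)
    return [corr.columns[i], corr.columns[j]]


print(most_corr_pair(prices))
```

For the question's data this should agree with the expected result stated there, the FB and MSFT pair. Replacing -np.inf with a comparison such as values > 0.85 would give the thresholded selection mentioned above instead of a single pair.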

【Comments】:

【Solution 4】:

If you don't want to go the pandas route, you can do it yourself with plain Python tools:

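import itertools

import numpy as np

# All unordered pairs of tickers.
tuples = list(itertools.combinations(prices.keys(), 2))

# Map each pair to its Pearson coefficient.
# Note: this correlates the raw prices; convert each series to daily
# percentage changes first if that is what you need.
correlations = {}
for pair in tuples:
    correlations[pair] = np.corrcoef(prices[pair[0]], prices[pair[1]])[1, 0]

# The pair with the largest coefficient.
max(correlations, key=correlations.get)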

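The first step creates all pairwise combinations. It then builds a dictionary that maps each pair to its coefficient, and finally returns the pair with the largest value.

The pandas answers are good, but you need to parse the resulting DataFrame to find the right values, so this is also a nice way to handle it :)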
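Since the question asks about daily percentage changes rather than raw prices, the same itertools approach can be applied to the returns. The following is a minimal sketch under that assumption; the function name `most_corr_numpy` and the helper dictionary `returns` are hypothetical names introduced for illustration.

```python
import itertools

import numpy as np

prices = {
    'GOOG': [742.66, 738.40, 738.22, 741.16, 739.98, 747.28,
             746.22, 741.80, 745.33, 741.29, 742.83, 750.50],
    'FB':   [108.40, 107.92, 109.64, 112.22, 109.57, 113.82,
             114.03, 112.24, 114.68, 112.92, 113.28, 115.40],
    'MSFT': [55.40, 54.63, 54.98, 55.88, 54.12, 59.16,
             58.14, 55.97, 61.20, 57.14, 56.62, 59.25],
    'AAPL': [106.00, 104.66, 104.87, 105.69, 104.22, 110.16,
             109.84, 108.86, 110.14, 107.66, 108.08, 109.90],
}


def most_corr_numpy(prices):
    """Return the pair of tickers with the most correlated daily returns."""
    # Daily percentage change: (p[t] - p[t-1]) / p[t-1] for each ticker.
    returns = {
        ticker: np.diff(series) / np.asarray(series[:-1], dtype=float)
        for ticker, series in prices.items()
    }

    def coefficient(pair):
        # Off-diagonal entry of the 2x2 correlation matrix for this pair.
        return np.corrcoef(returns[pair[0]], returns[pair[1]])[0, 1]

    # Evaluate every unordered pair once and keep the best one.
    return list(max(itertools.combinations(returns, 2), key=coefficient))


print(most_corr_numpy(prices))
```

Computing the coefficient inside the key function avoids building the intermediate dictionary, at the cost of not keeping the individual values around for inspection.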

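【Comments】:

【Solution 5】:

You can use the pandas corr function by converting the dictionary to a DataFrame. This function returns the correlation matrix of the numeric columns in the DataFrame.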

import pandas as pd

prices = {
    'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]
}

df = pd.DataFrame.from_dict(prices)
print(df.corr())

Output:

          AAPL        FB      GOOG      MSFT
AAPL  1.000000  0.886750  0.853015  0.894846
FB    0.886750  1.000000  0.799421  0.858784
GOOG  0.853015  0.799421  1.000000  0.820544
MSFT  0.894846  0.858784  0.820544  1.000000

By default the Pearson correlation is computed (this is the standard); kendall and spearman are also available if you need another method.
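As a brief sketch of how the method is selected, corr() takes a `method` argument. The example below uses only two of the question's tickers to stay short, and the variable names `pearson` and `spearman` are illustrative. Kendall is left out of the code because, as far as I know, pandas computes it through SciPy, which may not be installed.

```python
import pandas as pd

prices = {
    'FB':   [108.40, 107.92, 109.64, 112.22, 109.57, 113.82,
             114.03, 112.24, 114.68, 112.92, 113.28, 115.40],
    'MSFT': [55.40, 54.63, 54.98, 55.88, 54.12, 59.16,
             58.14, 55.97, 61.20, 57.14, 56.62, 59.25],
}

df = pd.DataFrame.from_dict(prices)

# Pearson is the default, so these two calls compute the same matrix.
default = df.corr()
pearson = df.corr(method="pearson")

# Spearman correlates the ranks of the values instead of the values themselves.
spearman = df.corr(method="spearman")

print(pearson)
print(spearman)
```

Like the example above it, this correlates the raw prices; calling df.pct_change() first would give the correlation of daily percentage changes instead.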

【Comments】: