【Question title】: How to perform a (modified) t-test for multiple variables and multiple models
【Posted】: 2019-08-25 11:12:14
【Question description】:

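I created and analyzed about 16 machine learning models with WEKA. I now have a CSV file containing the models' metrics (percent_correct, F-measure, recall, precision, and so on). I am trying to run a (modified) Student's t-test on these models. I can already run one (following this link) that compares a single variable common to only two models. I would like to run one (or more) t-tests with MULTIPLE variables and MULTIPLE models at once.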

As mentioned, I can only run the test with one variable (say, F-measure) across two models (say, a decision table and a neural network).

Here is the code. I am running a Kolmogorov-Smirnov test (the modified t):

from matplotlib import pyplot
from pandas import read_csv, DataFrame
from scipy.stats import ks_2samp

results = DataFrame()
results['A'] = read_csv('LMT (f-measure).csv', header=None).values[:, 0]
results['B'] = read_csv('LWL (f-measure).csv', header=None).values[:, 0]
print(results.describe())
results.boxplot()
pyplot.show()
results.hist()
pyplot.show()

value, pvalue = ks_2samp(results['A'], results['B'])
alpha = 0.05
print(value, pvalue)
if pvalue > alpha:
    print('Samples are likely drawn from the same distributions (fail to reject H0)')
else:
    print('Samples are likely drawn from different distributions (reject H0)')

Any ideas?

【Question comments】:

    Tags: python pandas data-visualization t-test hypothesis-test


    【Solution 1】:

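    Here is a simple solution to my question. It handles only two models and one metric at a time, but you could easily keep lists with the names of the classifiers and the metrics you want to analyze. For my purposes, I just change the values of COI, ROI_1, and ROI_2.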

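    Note: this solution also generalizes. How? Change the values of COI, ROI_1, and ROI_2, and load any dataset you choose in df = pandas.read_csv("FILENAME.csv", ...). If you want a different visualization, change the pyplot settings near the end.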

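    The key was to assign the result of the .loc["SOMESTRING"] indexer on the original DataFrame to a new DataFrame. The result contains only the rows whose index label matches the argument; every other row is left out.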
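
    To make the row selection concrete, here is a minimal sketch. The model names, metric values, and the two-runs-per-model layout are made up for illustration and are not taken from the original results file.

```python
import pandas as pd

# Hypothetical results: two runs per classifier, so index labels repeat.
df = pd.DataFrame(
    {
        "F_measure": [0.81, 0.79, 0.64, 0.66],
        "Area_under_PRC": [0.90, 0.88, 0.70, 0.72],
    },
    index=["ModelA", "ModelA", "ModelB", "ModelB"],
)

# .loc[label] returns only the rows whose index equals the label.
# The original df is not modified.
model_a = df.loc["ModelA"]

print(model_a)
print(len(df), len(model_a))
```

    Because the label appears more than once in the index, the result is itself a DataFrame, which is what the column filtering in the script below expects.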

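    Remember, however, to either pass index_col=0 when reading the file or set the DataFrame's index some other way. Otherwise the row labels are just the integer positions from 0 to MAX_INDEX, and .loc cannot look rows up by classifier name.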
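
    The difference can be seen on a small in-memory file. This is only a sketch: the column name Key_Scheme and the values are assumptions, not the layout of the actual WEKA export.

```python
import io

import pandas as pd

# Hypothetical file contents; the first column holds the classifier name.
csv_text = "Key_Scheme,F_measure\nModelA,0.81\nModelA,0.79\nModelB,0.64\n"

# Without index_col the rows are labelled by position.
by_position = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row labels.
by_name = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Alternative: set the index after loading.
also_by_name = by_position.set_index("Key_Scheme")

print(list(by_position.index))
print(list(by_name.index))
```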

    # Written: April 4, 2019
    
    import pandas                       # for loading and filtering the data
    from matplotlib import pyplot       # for visualizations
    from scipy.stats import ks_2samp    # for 2-sample Kolmogorov-Smirnov test
    import os                           # for deleting CSV files
    
    # Keep only the columns whose name contains stringOfInterest (drops the others in place)
    def removeColumns(DataFrame, typeArray, stringOfInterest):
        for columnName in typeArray:
            if stringOfInterest not in columnName:
                DataFrame.drop(columnName, axis = 1, inplace = True)
    
    # Get the whole DataFrame
    df = pandas.read_csv("ExperimentResultsCondensed.csv", index_col=0)
    dfCopy = df.copy()                  # a real copy; plain assignment would only alias df
    
    # Specified metrics and models for comparison
    COI = "Area_under_PRC"
    ROI_1 = "weka.classifiers.meta.AdaBoostM1[DecisionTable]"
    ROI_2 = "weka.classifiers.meta.AdaBoostM1[DecisionStump]"
    
    # Lists of column headers and row labels in the DataFrame
    #  `rows` may act strangely
    headers = list(df.columns)
    rows = list(df.index)
    
    # keep only the rows of each model; copy so the in-place column drops below do not touch dfCopy
    df1 = dfCopy.loc[ROI_1].copy()
    df2 = dfCopy.loc[ROI_2].copy()
    
    # remove irrelevant columns
    removeColumns(df1, headers, COI)
    removeColumns(df2, headers, COI)
    
    # Make CSV files (no header row, because they are read back with header=None)
    df1.to_csv(str(ROI_1 + "-" + COI + ".csv"), index=False, header=False)
    df2.to_csv(str(ROI_2 + "-" + COI + ".csv"), index=False, header=False)
    
    results = pandas.DataFrame()
    # Read CSV files
    # The CSV files can hold any metric/measure; COI selects which one
    results[ROI_1] = pandas.read_csv(str(ROI_1 + "-" + COI + ".csv"), header=None).values[:, 0]
    results[ROI_2] = pandas.read_csv(str(ROI_2 + "-" + COI + ".csv"), header=None).values[:, 0]
    
    # Kolmogorov-Smirnov test, since the samples are non-Gaussian, independent, and have unequal variances
    # Test configurations
    value, pvalue = ks_2samp(results[ROI_1], results[ROI_2])
    # Corresponding confidence level: 95%
    alpha = 0.05
    
    # Output the results
    print('\n')
    print('\033[1m' + '>>>TEST STATISTIC: ')
    print(value)
    print(">>>P-VALUE: ")
    print(pvalue)
    if pvalue > alpha:
        print('\t>>Samples are likely drawn from the same distributions (fail to reject H0 - NOT SIGNIFICANT)')
    else:
        print('\t>>Samples are likely drawn from different distributions (reject H0 - SIGNIFICANT)')
    
    # Plot files
    df1.plot.density()
    pyplot.xlabel(str(COI + " Values"))
    pyplot.ylabel(str("Density"))
    pyplot.title(str(COI + " Density Distribution of " + ROI_1))
    pyplot.show()
    
    df2.plot.density()
    pyplot.xlabel(str(COI + " Values"))
    pyplot.ylabel(str("Density"))
    pyplot.title(str(COI + " Density Distribution of " + ROI_2))
    pyplot.show()
    
    # Delete Files
    os.remove(str(ROI_1 + "-" + COI + ".csv"))
    os.remove(str(ROI_2 + "-" + COI + ".csv"))
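
    As a rough sketch of the "lists of classifiers and metrics" idea mentioned above, the same test can be looped over every pair of models and every metric without the intermediate CSV files. The function name compare_all, the model names, and the synthetic data are all assumptions made for illustration; only ks_2samp and the 0.05 threshold come from the script above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def compare_all(df, models, metrics, alpha=0.05):
    """Run a two-sample KS test for every pair of models on every metric."""
    records = []
    for metric in metrics:
        for model_1, model_2 in combinations(models, 2):
            # df.loc[label, column] gives all runs of one model for one metric.
            sample_1 = np.atleast_1d(df.loc[model_1, metric])
            sample_2 = np.atleast_1d(df.loc[model_2, metric])
            statistic, pvalue = ks_2samp(sample_1, sample_2)
            records.append(
                {
                    "metric": metric,
                    "model_1": model_1,
                    "model_2": model_2,
                    "statistic": statistic,
                    "pvalue": pvalue,
                    "reject_H0": bool(pvalue <= alpha),
                }
            )
    return pd.DataFrame(records)


# Synthetic stand-in for the results file: 30 runs per (hypothetical) model.
rng = np.random.default_rng(0)
runs = 30
names = ["ModelA", "ModelB", "ModelC"]
df = pd.DataFrame(
    {
        "F_measure": np.concatenate(
            [rng.normal(mean, 0.02, runs) for mean in (0.80, 0.80, 0.60)]
        ),
        "Area_under_PRC": np.concatenate(
            [rng.normal(mean, 0.02, runs) for mean in (0.90, 0.70, 0.70)]
        ),
    },
    index=np.repeat(names, runs),
)

summary = compare_all(df, names, ["F_measure", "Area_under_PRC"])
print(summary)
```

    With a real results file, df would come from pandas.read_csv(..., index_col=0) and the two lists would hold the classifier names and metric columns of interest. When many pairs are tested, some small p-values are expected by chance, so the threshold alpha may need adjusting.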
    

    【Comments】:
