【Question title】:Compare two dataframes cell by cell with conditions on columns
【Posted】:2020-10-01 18:18:44
【Question】:

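I want to compare two dataframes and output a dataframe that shows their differences. However, I can tolerate date differences of up to 2 days and score differences of up to 5 points. When a difference is within the acceptable range, I keep the value from df1.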

df1

id    group      date        score
10     A       2020-01-10     50
29     B       2020-01-01     80
39     C       2020-01-21     84
38     A       2020-02-02     29

df2

id    group      date        score
10     B       2020-01-11     56
29     B       2020-01-01     81
39     C       2020-01-22     85
38     A       2020-02-12     29

My expected output:

id    group           date                      score
10     A -> B       2020-01-10                50 -> 56
29     B            2020-01-01                   80
39     C            2020-01-21                   84
38     A            2020-02-02 -> 2020-02-12     29

So I want to compare the dataframes cell by cell, with conditions on some of the columns.

I started with this:

df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
result = []
for col in df1.columns:
    for index, row in df1.iterrows():
        diff = []
        compare_item = row[col][index]
        for index, row in df2.iterrows():
            if col == 'date':
                # acceptable if it's within 2 days differences
            if col == 'score':
                # acceptable if it's within 5 points differences
            if compare_item == row[col][index]:
                diff.append(compare_item)
            else:
                diff.append('{} --> {}'.format(compare_item, row[col]))
    result.append(diff)
df = pd.DataFrame(result, columns = [df1.columns]) 

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    Let's try:

    thresh = {'date':pd.to_timedelta('2D'),
              'score':5}
    
    def update(col):
        name = col.name
    
        # if there is a threshold, we update only if threshold is surpassed
        if name in thresh:
            return col.where(col.sub(df2[name]).abs()<=thresh[name], df2[name])
    
        # there is no threshold for the column
        # return the corresponding column from df2
        return df2[name]
    
    df1.apply(update)
    

    Output:

       group       date  score
    id                        
    10     B 2020-01-10     56
    29     B 2020-01-01     80
    39     C 2020-01-21     84
    38     A 2020-02-12     29
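
The snippet above appears to rely on two things it does not state: both frames are indexed by `id`, and `date` holds datetime values, so that subtracting the columns yields a timedelta that can be compared with `thresh['date']`. A minimal setup sketch under those assumptions, with the sample data from the question re-entered by hand:

```python
import pandas as pd

# Sample data copied from the question (typed in by hand for this sketch).
df1 = pd.DataFrame({
    "id": [10, 29, 39, 38],
    "group": ["A", "B", "C", "A"],
    "date": ["2020-01-10", "2020-01-01", "2020-01-21", "2020-02-02"],
    "score": [50, 80, 84, 29],
})
df2 = pd.DataFrame({
    "id": [10, 29, 39, 38],
    "group": ["B", "B", "C", "A"],
    "date": ["2020-01-11", "2020-01-01", "2020-01-22", "2020-02-12"],
    "score": [56, 81, 85, 29],
})

# Assumption: dates arrive as strings, so convert them before subtracting.
df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])

# Index both frames by id so that columns align row by row.
df1 = df1.set_index("id")
df2 = df2.set_index("id")

# Boolean masks: True where the df1 value is within tolerance of df2.
date_ok = (df1["date"] - df2["date"]).abs() <= pd.Timedelta(days=2)
score_ok = (df1["score"] - df2["score"]).abs() <= 5
```

These two masks are the conditions that `col.where(...)` evaluates per column in the answer: where a mask is True the value from `df1` is kept, otherwise the value from `df2` is taken.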
    

    【Comments】:

    • Thanks for the quick response, but the output I expect marks the differences with arrows. E.g. for id 10, I want group to be A -> B.
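
One possible way to get the arrow-style output requested in this comment is to build the same per-column tolerance mask and, where it fails, join the two values as text. This is only a sketch and not part of the original answer: the helper names `as_text` and `annotate` are hypothetical, and it assumes both frames contain exactly the same ids.

```python
import pandas as pd

# Sample data from the question, indexed by id, with dates parsed.
ids = [10, 29, 39, 38]
df1 = pd.DataFrame({
    "group": ["A", "B", "C", "A"],
    "date": pd.to_datetime(["2020-01-10", "2020-01-01", "2020-01-21", "2020-02-02"]),
    "score": [50, 80, 84, 29],
}, index=pd.Index(ids, name="id"))
df2 = pd.DataFrame({
    "group": ["B", "B", "C", "A"],
    "date": pd.to_datetime(["2020-01-11", "2020-01-01", "2020-01-22", "2020-02-12"]),
    "score": [56, 81, 85, 29],
}, index=pd.Index(ids, name="id"))

# Tolerances per column; columns not listed must match exactly.
thresh = {"date": pd.Timedelta(days=2), "score": 5}

def as_text(s):
    """Render a column as strings (hypothetical helper)."""
    if pd.api.types.is_datetime64_any_dtype(s):
        return s.dt.strftime("%Y-%m-%d")
    return s.astype(str)

def annotate(left, right, name):
    """Keep the df1 value if within tolerance, else 'old -> new' (hypothetical helper)."""
    if name in thresh:
        keep = (left - right).abs() <= thresh[name]
    else:
        keep = left == right
    old, new = as_text(left), as_text(right)
    return old.where(keep, old + " -> " + new)

result = pd.DataFrame(
    {name: annotate(df1[name], df2[name], name) for name in df1.columns}
)
print(result)
```

Every cell of `result` is a string, so the frame is suited to reporting rather than further arithmetic.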