【Question title】: Getting uncommon records from two dataframes
【Posted】: 2019-03-17 04:54:52
【Question】:

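I am trying to get the difference between two dataframes. Specifically, I want to take a number of records out of each dataframe and create separate dataframes from them. I followed the approach explained in "Comparing two dataframes and getting the differences":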

import pandas as pd

train_abusive = pd.read_csv('train_abusive.csv', low_memory=False)
train_non_abusive = pd.read_csv('train_non_abusive.csv', low_memory=False)
print(len(train_abusive), len(train_non_abusive))

# Sample the validation rows
val_abusive = train_abusive.sample(frac=0.1)
val_non_abusive = train_non_abusive.sample(frac=0.2)

# Try to remove the sampled rows: concatenate, then drop every row that occurs more than once
train_abusive = pd.concat([val_abusive, train_abusive], ignore_index=True)
train_abusive = train_abusive.drop_duplicates(keep=False)

train_non_abusive = pd.concat([val_non_abusive, train_non_abusive], ignore_index=True)
train_non_abusive = train_non_abusive.drop_duplicates(keep=False)

print(len(train_abusive), len(train_non_abusive))

It gives the following output:

50000 200000
44596 155010

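But the numbers don't add up: removing 10% of 50000 and 20% of 200000 should leave 45000 and 160000 rows. I don't know why.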
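
One possible explanation, assuming the CSV files contain rows that are exact copies of each other: `drop_duplicates(keep=False)` drops every row that occurs more than once, so rows that were already duplicated in the original data disappear along with the sampled ones. The observed counts are 404 and 4990 rows short of 45000 and 160000, which would fit that picture. The toy data below is hypothetical and only illustrates the effect:

```python
import pandas as pd

# Hypothetical toy data: 10 rows, the last two are exact copies of each other.
df = pd.DataFrame({"text": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "i"]})

# Pick two "validation" rows by position so the result is deterministic.
val = df.iloc[[0, 1]]

# Same pattern as in the question: concatenate, then drop all repeated rows.
combined = pd.concat([val, df], ignore_index=True)
remaining = combined.drop_duplicates(keep=False)

# Expected 10 - 2 = 8 rows, but the pre-existing duplicate "i" rows are dropped too.
print(len(df), len(val), len(remaining))  # 10 2 6
```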

【Question discussion】:

    Tags: python pandas dataframe


    【Solution 1】:

    Edited: if you only want to check whether two dataframes are equal, you can use an assertion.

    import pandas as pd
    from pandas.testing import assert_frame_equal

    train_abusive = pd.read_csv('train_abusive.csv', low_memory=False)
    train_non_abusive = pd.read_csv('train_non_abusive.csv', low_memory=False)

    # Raises AssertionError if the two frames differ; returns None otherwise
    assert_frame_equal(train_abusive, train_non_abusive)
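
Note that `assert_frame_equal` only reports whether two frames match; it does not return the differing rows. A minimal sketch of how it behaves, using made-up frames and a hypothetical helper `frames_equal` that is not part of pandas:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up frames for illustration only.
left = pd.DataFrame({"id": [1, 2, 3], "label": ["x", "y", "z"]})
right = pd.DataFrame({"id": [1, 2, 4], "label": ["x", "y", "z"]})

def frames_equal(a, b):
    """Hypothetical helper: True if the frames match, False if the assertion fails."""
    try:
        assert_frame_equal(a, b)
        return True
    except AssertionError:
        return False

print(frames_equal(left, left.copy()))  # True
print(frames_equal(left, right))        # False
```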
    

    I also saw Tom Chapin's answer in another post, which may interest you.

    def get_different_rows(train_abusive, train_non_abusive):
        """Returns just the rows from the new dataframe that differ from the source dataframe"""
        merged_df = train_abusive.merge(train_non_abusive, indicator=True, how='outer')
        changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
        return changed_rows_df.drop('_merge', axis=1)
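
A short usage sketch of the function above, repeated here so the example runs on its own. The frames and the parameter names `source_df` and `new_df` are made up for illustration. With no `on` argument, `merge` joins on all columns the two frames share, so a row counts as "different" if any shared column differs:

```python
import pandas as pd

def get_different_rows(source_df, new_df):
    """Return the rows of new_df that do not appear in source_df."""
    merged_df = source_df.merge(new_df, indicator=True, how='outer')
    changed_rows_df = merged_df[merged_df['_merge'] == 'right_only']
    return changed_rows_df.drop('_merge', axis=1)

# Made-up frames: rows with id 2 and 3 are shared, id 4 exists only in new_df.
source_df = pd.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "c"]})
new_df = pd.DataFrame({"id": [2, 3, 4], "label": ["b", "c", "d"]})

result = get_different_rows(source_df, new_df)
print(result)  # one row: id 4, label "d"
```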
    

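    【Discussion】:

    • I need to randomly sample records from the dataframe and remove them from the original dataframe.
    • You can try using assert to compare the two dataframes.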
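
For the sample-and-remove case raised in the first comment, one common approach is to drop the sampled rows by index label instead of by row content. This is a sketch, not the answerer's method. It assumes the frame has a unique index, such as the default `RangeIndex` that `read_csv` produces; the toy data and `random_state` value are arbitrary:

```python
import pandas as pd

# Hypothetical toy data with a default, unique RangeIndex; two rows are exact duplicates.
df = pd.DataFrame({"text": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "i"]})

# Draw 20% of the rows; the sample keeps the original index labels.
val = df.sample(frac=0.2, random_state=0)

# Remove exactly the sampled rows by label, regardless of their content.
train = df.drop(val.index)

print(len(df), len(val), len(train))  # 10 2 8
```

Because rows are matched by index rather than by value, duplicated rows in the original data do not affect the counts.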