【Posted】:2019-03-17 04:54:52
【Question】:
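I am trying to get the difference between two data frames: I want to sample some records from a data frame, remove them from it, and keep them in a separate data frame. I followed the approach explained in "Comparing two dataframes and getting the differences":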
import pandas as pd

train_abusive = pd.read_csv('train_abusive.csv', low_memory=False)
train_non_abusive = pd.read_csv('train_non_abusive.csv', low_memory=False)
print(len(train_abusive), len(train_non_abusive))
# Sample the validation rows from each frame
val_abusive = train_abusive.sample(frac=0.1)
val_non_abusive = train_non_abusive.sample(frac=0.2)
# Concatenate each sample back and drop every row that now appears more than once
train_abusive = pd.concat([val_abusive, train_abusive], ignore_index=True)
train_abusive = train_abusive.drop_duplicates(keep=False)
train_non_abusive = pd.concat([val_non_abusive, train_non_abusive], ignore_index=True)
train_non_abusive = train_non_abusive.drop_duplicates(keep=False)
print(len(train_abusive), len(train_non_abusive))
It gives the following output:
50000 200000
44596 155010
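But the math doesn't add up: removing 10% of 50000 rows and 20% of 200000 rows should leave 45000 and 160000, not 44596 and 155010. I don't know why.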
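One possible explanation, offered as an assumption since the CSV files are not shown: `drop_duplicates(keep=False)` removes every row whose content occurs more than once, so rows that were already duplicated in the original file disappear too, whether or not they were sampled. A minimal sketch on hypothetical toy data (the `text` column and the fixed `iloc` selection are stand-ins, not the real data or the real random sample):

```python
import pandas as pd

# Hypothetical toy frame: the rows at index 1 and 2 have identical content.
df = pd.DataFrame({"text": ["a", "b", "b", "c", "d"]})

# Stand-in for df.sample(frac=...), fixed so the result is reproducible.
val = df.iloc[[0]]

# Approach from the question: concatenate, then drop every duplicated row.
via_concat = pd.concat([val, df], ignore_index=True).drop_duplicates(keep=False)

# Alternative: remove the sampled rows by their index labels instead.
via_index = df.drop(val.index)

# Expected remainder, concat result, index-drop result.
print(len(df) - len(val), len(via_concat), len(via_index))  # 4 2 4
```

Here the concat approach loses the two "b" rows as well as the sampled row, while dropping by `val.index` leaves exactly the rows that were not sampled. Whether this accounts for the whole gap in the real data depends on how many duplicated rows the files contain.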
【Comments】: