Combining two dataframes efficiently, without duplicates or reversed pairs | Python answers

【Question title】: Combination of two dataframes without duplicate and reversion in efficient way | python
【Posted】: 2018-08-07 20:22:07
【Question description】:

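I have two dataframes with thousands of rows each, and I need to combine them into a single dataframe without duplicates or reversed pairs. For example: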

Dataframe 1

drug1
drug2
drug3

Dataframe 2

disease1
disease2
disease3

So the output dataframe would be:

Output dataframe

drug1 disease1
drug1 disease2
drug1 disease3
drug2 disease1
drug2 disease2
drug2 disease3 
drug3 disease1
drug3 disease2
drug3 disease3

I do not want the output to contain combinations such as:

disease1 drug1
drug1 drug1
disease1 disease1

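I actually tried this with pd.merge, but it returns duplicates and reversed pairs, and it also takes a long time because Dataframes 1 and 2 each contain thousands of rows.

Any help?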

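【Question comments】:

This is a Cartesian product; I'm sure it's available in itertools.
No, that's fine, I do want drug2 disease2 and drug3 disease3 @ScottBoston
I think @ScottBoston means that your "unwanted" output includes drug1 disease1.
Oh sorry, I mixed things up. I mean that if I have a combination, I don't want its reverse @ScottBoston
Possible duplicate of cartesian product in pandas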

Tags: python pandas dataframe combinations

【Solution 1】:

One way to do this purely in pandas is to create a MultiIndex with from_product, then convert it to a dataframe:

>>> df1
       0
0  drug1
1  drug2
2  drug3
>>> df2
          0
0  disease1
1  disease2
2  disease3

df3 = (pd.MultiIndex.from_product([df1[0],df2[0]])
       .to_frame()
       .reset_index(drop=True))

>>> df3
       0         1
0  drug1  disease1
1  drug1  disease2
2  drug1  disease3
3  drug2  disease1
4  drug2  disease2
5  drug2  disease3
6  drug3  disease1
7  drug3  disease2
8  drug3  disease3
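
As a variation on the snippet above, here is a minimal sketch that names the output columns and skips the reset_index step. The column names "drug" and "disease" are my own assumption, not part of the original answer, and to_frame(index=False) needs a reasonably recent pandas.

```python
import pandas as pd

df1 = pd.DataFrame(["drug1", "drug2", "drug3"])
df2 = pd.DataFrame(["disease1", "disease2", "disease3"])

# names= labels the index levels, which become the column names;
# index=False returns a plain 0..n-1 index, so reset_index is not needed.
df3 = pd.MultiIndex.from_product(
    [df1[0], df2[0]], names=["drug", "disease"]  # hypothetical column names
).to_frame(index=False)

print(df3)
```

Each drug appears only in the first column and each disease only in the second, so a reversed pair such as disease1 drug1 cannot occur.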

【Comments】:

【Solution 2】:

Try this solution:

from pandas import DataFrame, merge

df1['key'] = 1
df2['key'] = 1

result = df1.merge(df2, on='key').drop('key', axis=1)
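
If the inputs themselves contain repeated rows, the product will contain repeated pairs. The sketch below is one possible way to guard against that, assuming pandas 1.2 or later, where merge accepts how="cross" and no helper key column is needed. The column names and the deliberately repeated drug2 row are illustrative assumptions.

```python
import pandas as pd

# Illustrative data; drug2 is repeated on purpose.
df1 = pd.DataFrame({"drug": ["drug1", "drug2", "drug2", "drug3"]})
df2 = pd.DataFrame({"disease": ["disease1", "disease2", "disease3"]})

# Remove repeated rows first so the product cannot contain repeated pairs.
left = df1.drop_duplicates()
right = df2.drop_duplicates()

# how="cross" (pandas >= 1.2) builds the Cartesian product directly.
result = left.merge(right, how="cross")

print(len(result), result.duplicated().any())
```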

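【Comments】:

This is what I tried before, and it returned duplicates and reversed pairs :(
But it shouldn't. Are there duplicates in the dataframes?
No, not that I know of, but it also took a long time @Lev Zakharov

【Solution 3】:

Setup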

import numpy as np
import pandas as pd
df1 = pd.DataFrame(dict(col1=[f"drug{i}" for i in range(1, 4)]))
df2 = pd.DataFrame(dict(col2=[f"disease{i}" for i in range(1, 4)]))

`merge` on a specified column

df1.assign(A=1).merge(df2.assign(A=1)).drop(columns='A')

    col1      col2
0  drug1  disease1
1  drug1  disease2
2  drug1  disease3
3  drug2  disease1
4  drug2  disease2
5  drug2  disease3
6  drug3  disease1
7  drug3  disease2
8  drug3  disease3

List comprehension

pd.DataFrame([
    (i, j) for i in df1.col1
           for j in df2.col2
], columns=['col1', 'col2'])
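
A comment on the question suggests itertools for the Cartesian product. The sketch below is an equivalent of the comprehension above using itertools.product; it is an alternative spelling, not part of the original answer.

```python
from itertools import product

import pandas as pd

df1 = pd.DataFrame(dict(col1=[f"drug{i}" for i in range(1, 4)]))
df2 = pd.DataFrame(dict(col2=[f"disease{i}" for i in range(1, 4)]))

# product yields (col1 value, col2 value) tuples; the first input varies slowest.
pairs = pd.DataFrame(
    list(product(df1.col1, df2.col2)),
    columns=["col1", "col2"],
)

print(pairs)
```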

`pandas.concat`

This generalizes to the cross product of any two dataframes

i = df1.index.repeat(len(df2))
j = np.tile(df2.index, len(df1))

pd.concat([
    df1.loc[i].reset_index(drop=True),
    df2.loc[j].reset_index(drop=True)
], sort=True, axis=1)
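
To show that nothing here depends on how the values are named, the concat approach can be wrapped in a small helper. This is a sketch under assumptions: cross_join is a hypothetical name, it uses positional .iloc indexing rather than the .loc used above so that non-unique index labels also work, and all identifiers other than CID00757 and DOID_3762 (and the "dose" column) are made-up placeholders.

```python
import numpy as np
import pandas as pd


def cross_join(left, right):
    """Pair every row of `left` with every row of `right` (hypothetical helper)."""
    # Positional indices: each left row is repeated len(right) times,
    # and the right rows are cycled len(left) times.
    i = np.repeat(np.arange(len(left)), len(right))
    j = np.tile(np.arange(len(right)), len(left))
    return pd.concat(
        [
            left.iloc[i].reset_index(drop=True),
            right.iloc[j].reset_index(drop=True),
        ],
        axis=1,
    )


# Placeholder data: only CID00757 and DOID_3762 come from the thread.
drugs = pd.DataFrame({"drug": ["CID00757", "CID00001"], "dose": [10, 20]})
diseases = pd.DataFrame({"disease": ["DOID_3762", "DOID_0001", "DOID_0002"]})

pairs = cross_join(drugs, diseases)
print(pairs)
```

The helper works for frames with any number of columns, provided the two frames do not share column names.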

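【Comments】:

In the setup, my drugs and diseases aren't numbered like that; they can have any name
What do you mean they aren't numbered like that? It looks the same as your input.
That was just an example; in reality the drugs and diseases look like CID00757 DOID_3762 @piRSquared
OK. The solution should generalize to your values. I used the example you gave. What example should I use?
I mean your solution seems hard-coded; how can I apply it to drugs and diseases with arbitrary names?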
