【Title】: New column based on matching values from another dataframe pandas
【Posted】: 2019-03-13 13:31:35
【Question】:

Given two dataframes such as df1 and df2 in the example below, how can we merge them to produce df3?

import pandas as pd
import numpy as np

data1 = [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])]
df1 = pd.DataFrame(data1, columns=["column1", "column2"])
print(df1)

data2 = [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])]
df2 = pd.DataFrame(data2, columns=["column3", "column4"])
print(df2)

# Desired result: column5 holds the de-duplicated column4 values for every key in column2
data3 = [("a1", ["A", "B"], ["1", "2", "3", "4"]),
         ("a2", ["A", "B", "C"], ["1", "2", "3", "4", "5"]),
         ("a3", ["B", "C"], ["1", "3", "4", "5"])]
df3 = pd.DataFrame(data3, columns=["column1", "column2", "column5"])
print(df3)

My goal is to avoid for loops, because I am working with large datasets.

【Comments】:

    Tags: python pandas dataframe merge


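    【Solution 1】:

    Rebuild the list column of df1 as a DataFrame, stack it, then map in the values from df2.


    Also, since you asked for a solution without a for loop, I am using sum here, but sum in this case is much slower than a for loop or itertools.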


    s = pd.DataFrame(df1.column2.tolist()).stack()
    # Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
    df1['New'] = s.map(df2.set_index('column3').column4).groupby(level=0).sum().apply(set)
    df1
    Out[36]: 
      column1    column2              New
    0      a1     [A, B]     {2, 4, 3, 1}
    1      a2  [A, B, C]  {3, 5, 4, 2, 1}
    2      a3     [B, C]     {4, 3, 1, 5}
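
For readers on pandas 0.25 or later, the same stack-and-map idea can be sketched with explode and merge instead of summing lists. This is only an illustration under stated assumptions, not part of the original answer: the names `keys`, `long`, and the `row` axis label are made up for the example, and the output is written to `column5` as sorted lists to match df3 from the question.

```python
import pandas as pd

df1 = pd.DataFrame(
    [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])],
    columns=["column1", "column2"],
)
df2 = pd.DataFrame(
    [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])],
    columns=["column3", "column4"],
)

# One row per (df1 row, key); "row" (a hypothetical name) remembers the df1 row label.
keys = df1["column2"].explode().rename_axis("row").rename("column3").reset_index()

# Join against the long form of df2: one row per (key, value).
long = keys.merge(df2.explode("column4"), on="column3")

# Drop repeated values within each df1 row, sort, and collect back into lists.
df1["column5"] = (
    long.drop_duplicates(["row", "column4"])
        .sort_values(["row", "column4"])
        .groupby("row")["column4"]
        .agg(list)
)
print(df1)
```

The values are strings, so the sort is lexicographic ("10" would come before "2"). Whether this beats the plain loop shown later depends on the data, so it is worth timing on a realistic sample.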
    

    As I mentioned, and as most of us would suggest, you can also take a look at "For loops with pandas - When should I care?"

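    import itertools
    # Map each key in column3 to its list of values in column4
    d = dict(zip(df2.column3, df2.column4))

    # For every row, chain the value lists of its keys and de-duplicate with set
    l = [set(itertools.chain(*[d[y] for y in x])) for x in df1.column2.tolist()]
    df1['New'] = l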
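
A small variation on the dictionary lookup above, offered as a sketch rather than as the answer's own code: the helper `collect` is hypothetical. It assumes that keys missing from df2 should simply be skipped, and that first-seen order is preferable to the arbitrary order of a set.

```python
import itertools
import pandas as pd

df1 = pd.DataFrame(
    [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])],
    columns=["column1", "column2"],
)
df2 = pd.DataFrame(
    [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])],
    columns=["column3", "column4"],
)

# key -> list of values, built once
lookup = dict(zip(df2["column3"], df2["column4"]))

def collect(keys):
    """Hypothetical helper: flatten the value lists for `keys`.

    Keys absent from `lookup` contribute nothing (an assumption), and
    dict.fromkeys removes duplicates while keeping first-seen order.
    """
    flat = itertools.chain.from_iterable(lookup.get(k, []) for k in keys)
    return list(dict.fromkeys(flat))

df1["New"] = [collect(keys) for keys in df1["column2"]]
print(df1)
```

Using `lookup.get(k, [])` avoids the KeyError that `d[y]` raises for an unknown key. If a missing key should be treated as an error instead, the plain indexing form is the better choice.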
    

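    【Comments】:

    • You have to remove the duplicates, don't you?
    • @Wen-Ben .apply(set) instead of .apply(tuple)?
    • @jezrael Yes, I'd say a for loop is a great fit for this kind of problem
    • @coldspeed Glad I could help :-)
    【Solution 2】:

    You can do it like this: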

    df2_dict = {i:j for i,j in zip(df2['column3'].values, df2['column4'].values)}
    # print(df2_dict)
    
    def func(val):
        return sorted(list(set(np.concatenate([df2_dict.get(i) for i in val]))))
    
    df1['column5'] = df1['column2'].apply(func)
    print(df1)
    

    Output:

      column1    column2          column5
    0      a1     [A, B]     [1, 2, 3, 4]
    1      a2  [A, B, C]  [1, 2, 3, 4, 5]
    2      a3     [B, C]     [1, 3, 4, 5]
    

    【Comments】:

      【Solution 3】:

      This works:

      df1['column2'].apply(lambda x: list(set((np.concatenate([df2.set_index('column3')['column4'][i] for i in list(x)])) )))
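
The one-liner above rebuilds `df2.set_index('column3')` for every key and does not assign its result. A possible tidier form is sketched below, assuming the goal is the sorted `column5` from the question; the name `lookup` is introduced only for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])],
    columns=["column1", "column2"],
)
df2 = pd.DataFrame(
    [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])],
    columns=["column3", "column4"],
)

# Build the key -> values Series once instead of once per element.
lookup = df2.set_index("column3")["column4"]

def merged_values(keys):
    # Select the value lists for this row's keys, flatten, de-duplicate, sort.
    flat = np.concatenate(lookup.loc[keys].tolist()).tolist()
    return sorted(set(flat))

df1["column5"] = df1["column2"].apply(merged_values)
print(df1)
```

`lookup.loc[keys]` raises a KeyError if any key is missing from df2, which matches the behaviour of the original expression.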

      【Comments】:
