【Question title】: Merging Multiple Pandas DataFrames - Some with Shared Unique IDs, Some with Shared Columns
【Posted】: 2016-01-09 20:56:55
【Question】:

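OK, I'm relatively new to pandas and Python, so apologies if my question is glaringly obvious. I've read all of the pandas documentation on merge, join, and concat, gone through all the similar questions on Stack Overflow and Scriptscoop, and watched hours of pandas tutorials on YouTube. I still haven't figured out how to do what I want, which seems like it should be relatively easy in pandas.

Basically, I have one DataFrame for each type of positive bacterial result (E. coli, S. aureus, etc.). Each DataFrame has a unique ID associated with a patient (Order), along with the result, the date, and the ward name. A patient may be positive for just one bacterium or for several, so some order numbers overlap between DataFrames and some appear only once.

For example: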

    Order   Test_EC  Results_EC  Date        Ward Name
0   K70201  E. coli  MODERATE    2014-01-02  North
1   K70277  E. coli  MODERATE    2014-01-02  North
2   K70205  E. coli  FEW         2014-01-02  West
3   K70818  E. coli  MODERATE    2014-01-03  South
4   K70202  E. coli  FEW         2014-01-03  West
5   K80070  E. coli  RARE        2014-01-03  North
6   K80666  E. coli  FEW         2014-01-03  East

    Order   Test_SA   Results_SA  Date        Ward Name
0   K80766  S.aureus  MANY        2014-01-01  West
1   K70201  S.aureus  MANY        2014-01-02  North
2   K70277  S.aureus  MANY        2014-01-02  North
3   K70205  S.aureus  FEW         2014-01-02  West
4   K90107  S.aureus  FEW         2014-01-06  North

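I want to build a master database keyed on the patient's order number, with an associated column for each positive test and result, plus the date and ward name. If a patient is positive for one test and negative for another, a NaN fill is fine. If two order numbers from different DataFrames match, they will by definition have the same date and ward name, so the test and result columns are essentially the only new information.

In short, I want to keep all of the information contained in each table while having all of the data associated with each order number show up on a single row.

I'm hoping to get something like this: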

    Order   Test_EC  Results_EC  Test_SA   Results_SA  Date        Ward Name
0   K70201  E. coli  MODERATE    S.aureus  MANY        2014-01-02  North
1   K70277  E. coli  MODERATE    S.aureus  MANY        2014-01-02  North
2   K70205  E. coli  FEW         S.aureus  FEW         2014-01-02  West
3   K70818  E. coli  MODERATE    NaN       NaN         2014-01-03  South
4   K70202  E. coli  FEW         NaN       NaN         2014-01-03  West
5   K80070  E. coli  RARE        NaN       NaN         2014-01-03  North
6   K80666  E. coli  FEW         NaN       NaN         2014-01-03  East
7   K80766  NaN      NaN         S.aureus  MANY        2014-01-01  West
8   K90107  NaN      NaN         S.aureus  FEW         2014-01-06  North

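As you can see, the resulting DataFrame is three rows shorter than the two inputs combined (9 rows rather than 12), because 3 patients had both E. coli and S. aureus. There are no duplicate values in the Order column, but all of the information is preserved.

I'd also like to keep building up this database by doing the same thing about 20 more times with different bacteria. The actual dataset has around 100,000 unique order numbers.

This post would be far too long if I went through every combination of join, merge, and concat I've tried and why each didn't work. I know I'm missing something obvious. Any ideas would be greatly appreciated!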

【Comments】:

    Tags: python join pandas merge dataframe


    【Answer 1】:

    Looks like you want an "outer" merge?

    In [154]: df1
    Out[154]: 
        Order  Test_EC Results_EC        Date Ward Name
    0  K70201  E. coli   MODERATE  2014-01-02     North
    1  K70277  E. coli   MODERATE  2014-01-02     North
    2  K70205  E. coli        FEW  2014-01-02      West
    3  K70818  E. coli   MODERATE  2014-01-03     South
    4  K70202  E. coli        FEW  2014-01-03      West
    5  K80070  E. coli       RARE  2014-01-03     North
    6  K80666  E. coli        FEW  2014-01-03      East
    
    In [155]: df2
    Out[155]: 
        Order   Test_SA Results_SA        Date Ward Name
    0  K80766  S.aureus       MANY  2014-01-01      West
    1  K70201  S.aureus       MANY  2014-01-02     North
    2  K70277  S.aureus       MANY  2014-01-02     North
    3  K70205  S.aureus        FEW  2014-01-02      West
    4  K90107  S.aureus        FEW  2014-01-06     North
    
    In [156]: df1.merge(df2, how='outer')
    Out[156]: 
        Order  Test_EC Results_EC        Date Ward Name   Test_SA Results_SA
    0  K70201  E. coli   MODERATE  2014-01-02     North  S.aureus       MANY
    1  K70277  E. coli   MODERATE  2014-01-02     North  S.aureus       MANY
    2  K70205  E. coli        FEW  2014-01-02      West  S.aureus        FEW
    3  K70818  E. coli   MODERATE  2014-01-03     South       NaN        NaN
    4  K70202  E. coli        FEW  2014-01-03      West       NaN        NaN
    5  K80070  E. coli       RARE  2014-01-03     North       NaN        NaN
    6  K80666  E. coli        FEW  2014-01-03      East       NaN        NaN
    7  K80766      NaN        NaN  2014-01-01      West  S.aureus       MANY
    8  K90107      NaN        NaN  2014-01-06     North  S.aureus        FEW
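
For anyone who wants to reproduce the session above, here is a minimal sketch that rebuilds the two sample frames from the question and runs the same outer merge. When `on` is omitted, `merge` joins on every column name the two frames share (here `Order`, `Date`, and `Ward Name`); the sketch spells those keys out. The `validate` and `indicator` arguments are optional extras that were not part of the original answer: they are included only as a sanity check that each key appears once per frame and to show which rows came from which side.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Order": ["K70201", "K70277", "K70205", "K70818", "K70202", "K80070", "K80666"],
    "Test_EC": ["E. coli"] * 7,
    "Results_EC": ["MODERATE", "MODERATE", "FEW", "MODERATE", "FEW", "RARE", "FEW"],
    "Date": ["2014-01-02", "2014-01-02", "2014-01-02", "2014-01-03",
             "2014-01-03", "2014-01-03", "2014-01-03"],
    "Ward Name": ["North", "North", "West", "South", "West", "North", "East"],
})
df2 = pd.DataFrame({
    "Order": ["K80766", "K70201", "K70277", "K70205", "K90107"],
    "Test_SA": ["S.aureus"] * 5,
    "Results_SA": ["MANY", "MANY", "MANY", "FEW", "FEW"],
    "Date": ["2014-01-01", "2014-01-02", "2014-01-02", "2014-01-02", "2014-01-06"],
    "Ward Name": ["West", "North", "North", "West", "North"],
})

# Columns shared by both frames; these identify a patient order.
keys = ["Order", "Date", "Ward Name"]

merged = df1.merge(
    df2,
    on=keys,
    how="outer",            # keep orders that appear in either frame
    validate="one_to_one",  # raise if a key combination repeats within a frame
    indicator=True,         # adds a "_merge" column: left_only / right_only / both
)

print(merged)
print(merged["_merge"].value_counts())
```

Row order in an outer merge may differ between pandas versions, so it is safer to check row counts and the `_merge` column than to rely on the exact ordering shown above.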
    

    【Comments】:

    • Hmm. Yes! df1.merge(df2, on=['Order', 'Date', 'Ward Name'], how='outer') seems to work just fine. Not sure why I was having so much trouble. Thanks for your help!
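
The question also mentions repeating this for roughly 20 more bacteria. One way to do that, sketched here under assumptions rather than taken from the thread, is to fold the same outer merge over a list of frames with `functools.reduce`. The helper name `merge_all`, the `Test_PA`/`Results_PA` columns, and the `K1`..`K4` order numbers are all made up for illustration. The sketch also assumes every frame uses distinct test and result column names, since otherwise pandas would append `_x`/`_y` suffixes to the clashing columns.

```python
from functools import reduce

import pandas as pd


def merge_all(frames, keys=("Order", "Date", "Ward Name")):
    """Outer-merge a sequence of DataFrames on the shared key columns.

    Hypothetical helper: assumes every frame contains all of `keys`
    and that the remaining column names are unique per frame.
    """
    return reduce(
        lambda left, right: left.merge(right, on=list(keys), how="outer"),
        frames,
    )


# Tiny made-up frames (not data from the question), one per organism.
ec = pd.DataFrame({
    "Order": ["K1", "K2"],
    "Test_EC": ["E. coli", "E. coli"],
    "Results_EC": ["FEW", "MANY"],
    "Date": ["2014-01-01", "2014-01-02"],
    "Ward Name": ["North", "West"],
})
sa = pd.DataFrame({
    "Order": ["K2", "K3"],
    "Test_SA": ["S.aureus", "S.aureus"],
    "Results_SA": ["RARE", "FEW"],
    "Date": ["2014-01-02", "2014-01-03"],
    "Ward Name": ["West", "South"],
})
pa = pd.DataFrame({
    "Order": ["K1", "K3", "K4"],
    "Test_PA": ["P. aeruginosa"] * 3,
    "Results_PA": ["FEW", "MODERATE", "MANY"],
    "Date": ["2014-01-01", "2014-01-03", "2014-01-04"],
    "Ward Name": ["North", "South", "East"],
})

master = merge_all([ec, sa, pa])
print(master)
```

Each order ends up on one row, with NaN in the test and result columns for organisms that order was not positive for.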