【Question title】: Merging DataFrames with multi-column conditions and comparison
【Posted】: 2021-08-15 07:12:00
【Question】:

I'm not sure this is the best or even the correct way to do what I want.

I have the following df:

import numpy as np
import pandas as pd

# One row per email event; every value is stored as a string.
df = pd.DataFrame(
    np.array([
        ['1-1-2020', '123', 'How can I Help?', 'Delivered'],
        ['1-1-2020', '123', 'How can I Help?', 'Opened'],
        ['1-2-2021', '100', 'New Offer', 'Delivered'],
        ['1-2-2021', '100', 'New Offer', 'Delivered'],
        ['1-4-2021', '144', 'Last chance, buy now!', 'Delivered'],
        ['1-4-2021', '144', 'Last chance, buy now!', 'Delivered'],
        ['2-4-2021', '144', 'Last chance, buy now!', 'Opened'],
    ]),
    columns=['Date', 'Customer_ID', 'Subject', 'Status'],
)


    Date    Customer_ID     Subject              Status
0   1-1-2020    123     How can I Help?         Delivered
1   1-1-2020    123     How can I Help?         Opened
2   1-2-2021    100     New Offer               Delivered
3   1-2-2021    100     New Offer               Delivered
4   1-4-2021    144     Last chance, buy now!   Delivered
5   1-4-2021    144     Last chance, buy now!   Delivered
6   2-4-2021    144     Last chance, buy now!   Opened

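In this df, customer 123 was sent an email and then opened it (the second row). Customer 100 was sent the email twice, and customer 144 was sent the email twice and opened one of them.

I'm trying to track, for every email sent to every customer, its delivered and opened status along with the date of the last action.

To do that, I created two DataFrames, one for deliveries and one for opens, and left-merged the opens onto the deliveries to see which emails were opened.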

df_del = df.loc[(df['Status'] == 'Delivered')]
df_open = df.loc[(df['Status'] == 'Opened')]

d = df_del.rename(columns={'Date': 'Date Delivered'})
o = df_open.rename(columns={'Date': 'last action date', 'Status': 'Open Status'})

w = d.merge(o, on=['Customer_ID','Subject'], how='left')

w

This gives:

Date Delivered  Customer_ID       Subject            Status     last action date Open Status
0   1-1-2020    123         How can I Help?           Delivered     1-1-2020       Opened
1   1-2-2021    100         New Offer                 Delivered         NaN        NaN
2   1-2-2021    100         New Offer                 Delivered         NaN        NaN
3   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened
4   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened

What I expect:

Date Delivered  Customer_ID       Subject            Status     last action date Open Status
0   1-1-2020    123         How can I Help?           Delivered     1-1-2020       Opened
1   1-2-2021    100         New Offer                 Delivered     1-2-2021       NaN
2   1-2-2021    100         New Offer                 Delivered     1-2-2021       NaN
3   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened
4   1-4-2021    144         Last chance, buy now!     Delivered     1-4-2021       NaN

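【Comments】:

  • The Subject/Customer_ID combination is not unique. Don't you have a unique message identifier?
  • @JanWilamowski Unfortunately the emails have no unique identifier, which is why I use the combination of customer ID and email subject to track open rates to some extent.
  • Maybe this answer helps: stackoverflow.com/questions/40575486/…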
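
The non-uniqueness raised in the comments can be checked before merging. The following is a minimal sketch, assuming a cut-down copy of the question's data (the frames `delivered` and `opened` are illustrative, not from the original post). It uses the `validate` argument of `DataFrame.merge`, which raises `pandas.errors.MergeError` when the key columns are not unique on the sides you declare as unique.

```python
import pandas as pd
from pandas.errors import MergeError

keys = ['Customer_ID', 'Subject']

# Illustrative subset: customer 144 has two deliveries and one open.
delivered = pd.DataFrame({
    'Customer_ID': ['144', '144'],
    'Subject': ['Last chance, buy now!'] * 2,
})
opened = pd.DataFrame({
    'Customer_ID': ['144'],
    'Subject': ['Last chance, buy now!'],
    'Open Status': ['Opened'],
})

try:
    # 'one_to_one' requires the keys to be unique in both frames.
    delivered.merge(opened, on=keys, how='left', validate='one_to_one')
    keys_unique = True
except MergeError:
    keys_unique = False

print(keys_unique)
```

When the check fails, as it does here, every delivery row is paired with every matching open row, which is why both deliveries for customer 144 show up as opened in the question's output.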

Tags: python python-3.x pandas dataframe


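【Solution 1】:

Let's use a pseudo "order" column that numbers the events per customer, so that the n-th delivery is matched only with the n-th open: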

df_del = df.loc[(df['Status'] == 'Delivered')].copy()
df_open = df.loc[(df['Status'] == 'Opened')].copy()

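# Number events within each (customer, subject) pair, the same keys used in the merge
df_del['order'] = df_del.groupby(['Customer_ID', 'Subject']).cumcount()
df_open['order'] = df_open.groupby(['Customer_ID', 'Subject']).cumcount()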

d = df_del.rename(columns={'Date': 'Date Delivered'})
o = df_open.rename(columns={'Date': 'last action date', 'Status': 'Open Status'})

w = d.merge(o, on=['Customer_ID','Subject','order'], how='left')

w['last action date'] = w['last action date'].fillna(w['Date Delivered'])

w

Output:

  Date Delivered Customer_ID                Subject     Status  order last action date Open Status
0       1-1-2020         123        How can I Help?  Delivered      0         1-1-2020      Opened
1       1-2-2021         100              New Offer  Delivered      0         1-2-2021         NaN
2       1-2-2021         100              New Offer  Delivered      1         1-2-2021         NaN
3       1-4-2021         144  Last chance, buy now!  Delivered      0         2-4-2021      Opened
4       1-4-2021         144  Last chance, buy now!  Delivered      1         1-4-2021         NaN
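
To see why the extra key matters, here is a small sketch on a hypothetical two-delivery, one-open subset of the data (the frame names and the `matches_*` variables are illustrative assumptions). It compares the left merge with and without the `cumcount` counter as a key.

```python
import pandas as pd

keys = ['Customer_ID', 'Subject']

delivered = pd.DataFrame({
    'Customer_ID': ['144', '144'],
    'Subject': ['Last chance, buy now!'] * 2,
    'Date Delivered': ['1-4-2021', '1-4-2021'],
})
opened = pd.DataFrame({
    'Customer_ID': ['144'],
    'Subject': ['Last chance, buy now!'],
    'last action date': ['2-4-2021'],
})

# Number the rows within each (customer, subject) group: 0, 1, 2, ...
for frame in (delivered, opened):
    frame['order'] = frame.groupby(keys).cumcount()

# Without the counter, the single open is attached to both deliveries.
without_order = delivered.merge(opened.drop(columns='order'), on=keys, how='left')
# With the counter, only the first delivery finds a matching open.
with_order = delivered.merge(opened, on=keys + ['order'], how='left')

matches_without = int(without_order['last action date'].notna().sum())
matches_with = int(with_order['last action date'].notna().sum())
print(matches_without, matches_with)  # 2 1
```

Note that this pairing is positional: it assumes the first open belongs to the first delivery, which the data itself cannot confirm because there is no message identifier.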


    【Solution 2】:

    Another option: generate a slightly different pseudo message ID with groupby cumcount, and fill the NaN values with combine_first:

    # Create a "message_id"
    df['m_id'] = (
        df.groupby(['Customer_ID', 'Subject', 'Status']).cumcount()
    )
    
    # Create Mask For Delivered Status
    m = df.Status.eq('Delivered')
    
    # Merge Delivered and ~Delivered
    df = (
        df[m].rename(columns={'Date': 'Date Delivered'})
            .merge(df[~m].rename(columns={'Date': 'last action date',
                                          'Status': 'Open Status'}),
                   on=['Customer_ID', 'Subject', 'm_id'],
                   how='left')
    )
    
    # Fill NaN in last action date column
    df['last action date'] = (
        df['last action date'].combine_first(df['Date Delivered'])
    )
    

    df:

      Date Delivered Customer_ID                Subject     Status  m_id last action date Open Status
    0       1-1-2020         123        How can I Help?  Delivered     0         1-1-2020      Opened
    1       1-2-2021         100              New Offer  Delivered     0         1-2-2021         NaN
    2       1-2-2021         100              New Offer  Delivered     1         1-2-2021         NaN
    3       1-4-2021         144  Last chance, buy now!  Delivered     0         2-4-2021      Opened
    4       1-4-2021         144  Last chance, buy now!  Delivered     1         1-4-2021         NaN
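
As a small illustration of the last step, the sketch below applies `combine_first` to two stand-alone Series that mimic the merged columns (the Series names are assumptions for this example). For two Series sharing the same index, it fills the same gaps as `fillna` with a Series argument, which is what Solution 1 uses.

```python
import numpy as np
import pandas as pd

# Stand-ins for the merged columns: NaN marks deliveries with no matching open.
last_action = pd.Series(['1-1-2020', np.nan, '2-4-2021', np.nan])
date_delivered = pd.Series(['1-1-2020', '1-2-2021', '1-4-2021', '1-4-2021'])

# Keep last_action where present, otherwise fall back to the delivery date.
via_combine_first = last_action.combine_first(date_delivered).tolist()
via_fillna = last_action.fillna(date_delivered).tolist()

print(via_combine_first)
# ['1-1-2020', '1-2-2021', '2-4-2021', '1-4-2021']
```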
    


      【Solution 3】:

      Just adding another way, using np.where and groupby:

      df['last action date'] = df.groupby('Customer_ID').Date.transform('last')
      df['op'] = (
          df.groupby(['Customer_ID', 'Subject'])['Status'].cumcount()
      )
      df['Open Status'] = np.where((df.groupby(['Customer_ID'])\
          .Status.transform('last') == 'Opened') & (df.op==0), 'Opened',np.nan)
      df[df.Status=='Delivered'].drop(columns=['op'])
      

      Output:

          Date    Customer_ID Subject             Status  last action date    Open Status
      0   1-1-2020    123 How can I Help?         Delivered   1-1-2020    Opened
      2   1-2-2021    100 New Offer               Delivered   1-2-2021    nan
      3   1-2-2021    100 New Offer               Delivered   1-2-2021    nan
      4   1-4-2021    144 Last chance, buy now!   Delivered   2-4-2021    Opened
      5   1-4-2021    144 Last chance, buy now!   Delivered   2-4-2021    nan
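
The lowercase `nan` in the Open Status column above is a string, not a missing value: `np.where` builds one array from `'Opened'` and `np.nan`, and NumPy promotes the pair to a string dtype. A minimal sketch of the difference, assuming a stand-alone boolean mask `is_opened` (an illustrative name), with `Series.where` as one possible way to keep real missing values:

```python
import numpy as np
import pandas as pd

# Illustrative mask: True where the email counts as opened.
is_opened = pd.Series([True, False, True, False])

# np.where mixes a string with np.nan, so the NaN becomes the text 'nan'.
as_text = pd.Series(np.where(is_opened, 'Opened', np.nan))

# Series.where keeps 'Opened' where the mask is True and inserts NaN elsewhere.
as_missing = pd.Series('Opened', index=is_opened.index).where(is_opened)

text_missing = int(as_text.isna().sum())
real_missing = int(as_missing.isna().sum())
print(text_missing, real_missing)
```

With real missing values, later calls such as `isna()`, `fillna()` or `dropna()` treat the unopened rows as expected.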
      

