【Question title】: How to find values based on the ID information in two dataframes?
【Posted】: 2019-02-11 03:09:07
【Question】:

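The first dataframe contains order information; one Lead Order may have multiple OrderIDs. The other dataframe has a list of OrderIDs, and I want to use the first dataframe as a reference to find the LeadOrderID. How can I find the LeadOrderID using Python (pandas)? Thanks for your help, I really appreciate it.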

【Comments】:

  • @jpp - This question is not only about merging; it also deals with rows that hold multiple values, which have to be handled somehow.

Tags: python pandas list dataframe boolean


【Solution 1】:

You should use pd.merge() with on=['OrderID'] and how='inner'. The session below assumes pandas has been imported as pd.

In [207]: df1 = pd.DataFrame({'OrderID':[i for i in range(10)], 'Lead Order':[1,3,5,8,6,7,7,5,2,1]}, index=[0,1,2,3,4,5,6,7,8,9])

In [208]: df1
Out[208]: 
   OrderID  Lead Order
0        0           1
1        1           3
2        2           5
3        3           8
4        4           6
5        5           7
6        6           7
7        7           5
8        8           2
9        9           1

In [209]: df2 = pd.DataFrame({'OrderID':[3,8,6,2]}, index=[0,1,2,3])

In [210]: df2
Out[210]: 
   OrderID
0        3
1        8
2        6
3        2

In [211]: df3 = pd.merge(df1, df2, on=['OrderID'], how='inner')

In [212]: df3
Out[212]: 
   OrderID  Lead Order
0        2           5
1        3           8
2        6           7
3        8           2
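
If the result should follow the row order of the lookup table and also flag IDs that have no lead order, a left merge starting from df2 is one option. This is a minimal sketch reusing df1 and df2 from this answer; the OrderID 42 is a made-up value added only to show what an unmatched row looks like.

```python
import pandas as pd

df1 = pd.DataFrame({'OrderID': list(range(10)),
                    'Lead Order': [1, 3, 5, 8, 6, 7, 7, 5, 2, 1]})
# 42 is a hypothetical ID that does not exist in df1
df2 = pd.DataFrame({'OrderID': [3, 8, 6, 2, 42]})

# a left merge keeps every row of df2, in df2's order;
# indicator=True adds a '_merge' column ('both' or 'left_only')
df3 = df2.merge(df1, on='OrderID', how='left', indicator=True)

# IDs without a match have NaN in 'Lead Order' and 'left_only' in '_merge'
missing = df3.loc[df3['_merge'] == 'left_only', 'OrderID'].tolist()

print(df3)
print(missing)
```

Because of the NaN in the unmatched row, the 'Lead Order' column is upcast to float in df3.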

【Comments】:

    【Solution 2】:

    This answer also handles rows where the OrderID(s) column holds multiple values.

    The full code without comments is at the bottom.

    # imports
    import pandas as pd
    import numpy as np
    
    # create sample dataframe
    df_orig = \
        pd.DataFrame({'OrderID(s)':['0001, 0007, 0002', '0008', '0009, 0005, 0003',],
                      'Lead Order': ['00011', '00022', '00033']})
    

    df_orig

              OrderID(s)    Lead Order
    0   0001, 0007, 0002    00011
    1               0008    00022
    2   0009, 0005, 0003    00033
    

    -

    # force df values to strings
    # this makes splitting of multiple
    # values in OrderID(s) easier
    df_orig = df_orig.astype(str)
    
    # series created from data within df_orig['OrderID(s)'] column
    # remove spaces and split by commas
    split_col = df_orig['OrderID(s)'].str.replace(' ', '').str.split(",")
    print(split_col)
    
    0    [0001, 0007, 0002]
    1                [0008]
    2    [0009, 0005, 0003]
    Name: OrderID(s), dtype: object
    

    -

    # find length of each split_col row (how many OrderIDs in each row).
    # these values will be used to duplicate rows in the
    # df_orig dataframe with the numpy repeat function
    repeats = split_col.str.len().values
    print(repeats)
    
    [3 1 3]
    

    -

    # concatenate all values in split_col.
    # the length of this array will be the same as the length
    # of the df_stack_ids dataframe
    orderid_col = np.concatenate(split_col.values)
    print(orderid_col)
    
    ['0001' '0007' '0002' '0008' '0009' '0005' '0003']
    

    -

    # use pandas iloc and numpy repeat function to make a dataframe with
    # rows from df_orig duplicated according to the number of
    # df_orig['OrderID(s)'] values in each row relating to a common
    # Lead Order value (using repeats input from above)
    df_stack_ids = df_orig.iloc[np.repeat(df_orig.index.values, repeats)]. \
        reset_index(drop=True)
    

    df_stack_ids

              OrderID(s)    Lead Order
    0   0001, 0007, 0002    00011
    1   0001, 0007, 0002    00011
    2   0001, 0007, 0002    00011
    3               0008    00022
    4   0009, 0005, 0003    00033
    5   0009, 0005, 0003    00033
    6   0009, 0005, 0003    00033
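
To make the np.repeat step concrete: given the row positions and the per-row counts, it emits each position as many times as its count, and iloc then uses that array to duplicate rows. A small illustration with the same counts as in this answer:

```python
import numpy as np

positions = np.array([0, 1, 2])   # row positions of df_orig
repeats = np.array([3, 1, 3])     # number of OrderIDs in each row

# each position is repeated according to the matching count
row_positions = np.repeat(positions, repeats)
print(row_positions)  # [0 0 0 1 2 2 2]
```

Passing df_orig.index.values to iloc works here because the sample dataframe has the default 0..n-1 index; for a dataframe with any other index, np.arange(len(df_orig)) would be the safer input.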
    

    -

    # add the orderid_col to dataframe
    df_stack_ids['OrderID'] = orderid_col
    

    df_stack_ids

              OrderID(s)    Lead Order  OrderID
    0   0001, 0007, 0002         00011     0001
    1   0001, 0007, 0002         00011     0007
    2   0001, 0007, 0002         00011     0002
    3               0008         00022     0008
    4   0009, 0005, 0003         00033     0009
    5   0009, 0005, 0003         00033     0005
    6   0009, 0005, 0003         00033     0003
    

    -

    # get rid of the original OrderID(s) column
    df_stack_ids = df_stack_ids[['OrderID', 'Lead Order']]
    
    
    # this may be enough to answer the question
    # because each order id has a corresponding
    # lead order
    

    df_stack_ids

        OrderID Lead Order
    0      0001      00011
    1      0007      00011
    2      0002      00011
    3      0008      00022
    4      0009      00033
    5      0005      00033
    6      0003      00033
    

    -

    # to find matches for a specific list of order ids,
    # continue...
    # sort the OrderID column for easy reference and
    # reset index
    df_stack_ids = df_stack_ids.sort_values(by=['OrderID'])
    df_stack_ids.index = range(len(df_stack_ids))
    
    
    # create sample dataframe with a few order ids for lookup
    df_find_lead = pd.DataFrame({'OrderID': ['0001', '0002', '0005']})
    # force to string type for matching with df_stack_ids values
    # when merging
    df_find_lead = df_find_lead.astype(str)
    

    df_find_lead

        OrderID
    0      0001
    1      0002
    2      0005
    

    -

    # merge values from df_stack_ids['Lead Order'] column
    df_found_lead = pd.merge(df_find_lead, df_stack_ids,
                             on=['OrderID'], how='inner')
    

    df_found_lead

        OrderID Lead Order
    0      0001      00011
    1      0002      00011
    2      0005      00033
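
When the lead order is only needed as an extra column, Series.map with a lookup Series is a possible alternative to the merge. The sketch below assumes every OrderID appears only once in the stacked table, since map needs a unique index; the data is retyped from the tables in this answer.

```python
import pandas as pd

df_stack_ids = pd.DataFrame({
    'OrderID': ['0001', '0002', '0003', '0005', '0007', '0008', '0009'],
    'Lead Order': ['00011', '00011', '00033', '00033', '00011', '00022', '00033'],
})
df_find_lead = pd.DataFrame({'OrderID': ['0001', '0002', '0005']})

# a Series indexed by OrderID acts as the lookup table
lookup = df_stack_ids.set_index('OrderID')['Lead Order']

# map each requested OrderID to its lead order; unknown IDs become NaN
df_find_lead['Lead Order'] = df_find_lead['OrderID'].map(lookup)
print(df_find_lead)
```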
    

    -

    # if all original order data is formatted as numbers,
    # convert result dataframe back to integers
    df_found_lead.astype(int)
    
        OrderID Lead Order
    0         1         11
    1         2         11
    2         5         33
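
On pandas 0.25 or newer, the split-and-repeat steps can likely be condensed with DataFrame.explode. This is a different technique from the np.repeat approach used in this answer, shown only as a sketch on the same sample data.

```python
import pandas as pd

df_orig = pd.DataFrame({'OrderID(s)': ['0001, 0007, 0002', '0008', '0009, 0005, 0003'],
                        'Lead Order': ['00011', '00022', '00033']})

# turn each comma-separated string into a list of IDs
id_lists = df_orig['OrderID(s)'].str.replace(' ', '', regex=False).str.split(',')

df_stack_ids = (
    df_orig
    .assign(OrderID=id_lists)
    # one row per list element; the Lead Order value is repeated
    .explode('OrderID')
    .loc[:, ['OrderID', 'Lead Order']]
    .reset_index(drop=True)
)

# look up a few order ids, as in the walkthrough
df_find_lead = pd.DataFrame({'OrderID': ['0001', '0002', '0005']})
df_found_lead = df_find_lead.merge(df_stack_ids, on='OrderID', how='inner')
print(df_found_lead)
```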
    

    Full code:

    import pandas as pd
    import numpy as np
    
    df_orig = \
        pd.DataFrame({'OrderID(s)':['0001, 0007, 0002', '0008', '0009, 0005, 0003',],
                      'Lead Order': ['00011', '00022', '00033']})
    
    df_orig = df_orig.astype(str)
    split_col = df_orig['OrderID(s)'].str.replace(' ', '').str.split(",")
    
    repeats = split_col.str.len().values
    orderid_col = np.concatenate(split_col.values)
    
    df_stack_ids = df_orig.iloc[np.repeat(df_orig.index.values, repeats)]. \
        reset_index(drop=True)
    
    df_stack_ids['OrderID'] = orderid_col
    df_stack_ids = df_stack_ids[['OrderID', 'Lead Order']]
    df_stack_ids = df_stack_ids.sort_values(by=['OrderID'])
    df_stack_ids.index = range(len(df_stack_ids))
    
    df_find_lead = pd.DataFrame({'OrderID': ['0001', '0002', '0005']})
    df_find_lead = df_find_lead.astype(str)
    
    df_found_lead = pd.merge(df_find_lead, df_stack_ids, on=['OrderID'], how='inner')
    df_found_lead = df_found_lead.astype(int)
    

    【Comments】:
