【Question title】: Merge Dataframes Based on Partial Substring Match
【Posted】: 2021-01-29 23:38:00
【Question】:

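Please help me with the following. I think this is the most unusual thing I have needed to do in pandas so far.

Basically, I need to merge 2 dataframes. In df1 I have a partial string (address_id), and in df2 I have the same information, but concatenated with another ID (concat_address_id).

I have tried several approaches: merging, extracting the string, preprocessing the strings, and checking a list for partial string matches. However, I have not found a clean way to do what I need, which is to merge the dataframes based on a substring match, as shown in the example below.

This is df1: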

process     sku    qty  address_id  customer    country
process1    sku1    1   address1    customer5   BR
process1    sku2    1   address2    customer5   BR
process1    sku3    1   address3    customer5   BR
process1    sku4    1   address4    customer5   BR
process1    sku5    1   address5    customer5   BR

This is df2:

concat_address_id   last_login  country_of_login
address1address5    15/10/2020  CN
address6address2    18/02/2020  NL
address3address5    13/05/2019  BR
address6address4    18/06/2020  NL
address5address8    13/05/2019  RU

This is the expected result:

process        sku  qty address_id  customer     country    last_login  country_of_login
process1    sku1    1   address1    customer5   BR  15/10/2020  CN
process1    sku2    1   address2    customer5   BR  18/02/2020  NL
process1    sku3    1   address3    customer5   BR  13/05/2019  BR
process1    sku4    1   address4    customer5   BR  18/06/2020  NL
process1    sku5    1   address5    customer5   BR  13/05/2019  RU

【Comments】:

Tags: python pandas


【Solution 1】:

Based on this: How to merge pandas on string contains?

>>> df1
    process   sku address_id   customer country
0  process1  sku1   address1  customer5      BR
1  process1  sku2   address2  customer5      BR
2  process1  sku3   address3  customer5      BR
3  process1  sku4   address4  customer5      BR
4  process1  sku5   address5  customer5      BR
>>> df2
  concat_address_id  last_login   customer country_of_login
0  address1address5  15/10/2020  customer5               CN
1  address6address2   18/2/2020  customer5               NL
2  address3address5  13/05/2019  customer5               BR
3  address6address4  18/06/2020  customer5               NL
4  address5address8  13/05/2019  customer5               RU

>>> check = [
...     (process, sku, address_id, customer, country, cust, last_login, country_li)
...     for i, (process, sku, address_id, customer, country) in df1.iterrows()
...     for j, (concat_addr, last_login, cust, country_li) in df2.iterrows()
...     if address_id in concat_addr
... ]

>>> check
[('process1', 'sku1', 'address1', 'customer5', 'BR', 'customer5', '15/10/2020', 'CN'), ('process1', 'sku2', 'address2', 'customer5', 'BR', 'customer5', '18/2/2020', 'NL'), ('process1', 'sku3', 'address3', 'customer5', 'BR', 'customer5', '13/05/2019', 'BR'), ('process1', 'sku4', 'address4', 'customer5', 'BR', 'customer5', '18/06/2020', 'NL'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '15/10/2020', 'CN'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '13/05/2019', 'BR'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '13/05/2019', 'RU')]


>>> (pd.DataFrame(check, columns=["process", "sku", "address_id", "customer", "country", "customer", "last_login", "country_login"]))
    process   sku address_id   customer country   customer  last_login country_login
0  process1  sku1   address1  customer5      BR  customer5  15/10/2020            CN
1  process1  sku2   address2  customer5      BR  customer5   18/2/2020            NL
2  process1  sku3   address3  customer5      BR  customer5  13/05/2019            BR
3  process1  sku4   address4  customer5      BR  customer5  18/06/2020            NL
4  process1  sku5   address5  customer5      BR  customer5  15/10/2020            CN
5  process1  sku5   address5  customer5      BR  customer5  13/05/2019            BR
6  process1  sku5   address5  customer5      BR  customer5  13/05/2019            RU


The customer column appears twice in the result, so the redundant copy can be dropped.

Let me know if this helps.
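The nested iterrows() comprehension above can also be written as a cross join followed by a filter. This is only a minimal sketch, assuming pandas 1.2 or later (for how="cross") and the sample frames from the question. It still compares every df1 row with every df2 row, so it only suits small frames, and a plain substring test would also match address1 inside something like address10.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "process": ["process1"] * 5,
    "sku": ["sku1", "sku2", "sku3", "sku4", "sku5"],
    "qty": [1] * 5,
    "address_id": ["address1", "address2", "address3", "address4", "address5"],
    "customer": ["customer5"] * 5,
    "country": ["BR"] * 5,
})
df2 = pd.DataFrame({
    "concat_address_id": ["address1address5", "address6address2",
                          "address3address5", "address6address4",
                          "address5address8"],
    "last_login": ["15/10/2020", "18/02/2020", "13/05/2019",
                   "18/06/2020", "13/05/2019"],
    "country_of_login": ["CN", "NL", "BR", "NL", "RU"],
})

# Pair every df1 row with every df2 row (needs pandas >= 1.2).
pairs = df1.merge(df2, how="cross")

# Keep only the pairs where address_id occurs inside concat_address_id.
mask = [a in c for a, c in zip(pairs["address_id"], pairs["concat_address_id"])]
matched = pairs[mask].reset_index(drop=True)
print(matched)
```

As with the comprehension, address5 matches three df2 rows, so the result has seven rows rather than five.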

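【Comments】:

  • I tried this approach, but it never finished on my dataframe of 23 million rows. def strmerge(strcolumn): for i in df2['column_common']: if strcolumn in i: return df2[df2['column_common'] == i]['column_b'].values[0] break else: pass df1['column_b'] = df1.apply(lambda x: strmerge(x['column_common']),axis=1). I also can't go through the data entry by entry, because I'm working with a very large dataset.
  • 23 million rows is important information. Do address_id and concat_address_id line up by index across the dataframes? Could the address1 info be on row 10 of df1 while address1address5 is on row 7 of df2?
  • Yes, the same substring will be in the concatenated value, but in some cases it can be in a different order, because that is the customer's choice.
  • What if an address from df1 appears more than once in df2?
  • Yes, that happens.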
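For the row counts mentioned in these comments, one option is to avoid pairwise substring checks altogether: split each concatenated value into its individual IDs, put one ID per row, and then do an ordinary equality merge. The sketch below is not from either answer. It assumes every ID has the form address followed by digits (the pattern is an assumption; adjust it to the real format). It is independent of the order of the parts and keeps every match when an address appears in several df2 rows.

```python
import pandas as pd

# Sample data copied from the question (df1 trimmed to the relevant columns).
df1 = pd.DataFrame({
    "sku": ["sku1", "sku2", "sku3", "sku4", "sku5"],
    "address_id": ["address1", "address2", "address3", "address4", "address5"],
})
df2 = pd.DataFrame({
    "concat_address_id": ["address1address5", "address6address2",
                          "address3address5", "address6address4",
                          "address5address8"],
    "last_login": ["15/10/2020", "18/02/2020", "13/05/2019",
                   "18/06/2020", "13/05/2019"],
    "country_of_login": ["CN", "NL", "BR", "NL", "RU"],
})

# Assumption: IDs look like "address<digits>". findall returns a list of IDs
# per row, and explode turns each list element into its own row.
tokens = (
    df2.assign(address_id=df2["concat_address_id"].str.findall(r"address\d+"))
       .explode("address_id")
       .drop(columns="concat_address_id")
)

# A regular equality merge on the extracted ID.
merged = df1.merge(tokens, on="address_id", how="left")
print(merged)
```

Because the join is on exact IDs, address1 can no longer match address10 by accident. Picking a single row per address, as in the expected result, would still need an explicit tie-break rule.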
【Solution 2】:

This should work too:

import pandas as pd

# Split concat_address_id on the literal prefix 'address' (a plain string split)
# and rebuild the two IDs; this assumes exactly two IDs per row
df2['address_id_1'] = 'address' + df2['concat_address_id'].str.split('address').str.get(1)
df2['address_id_2'] = 'address' + df2['concat_address_id'].str.split('address').str.get(2)

# Create an empty address_id column to merge on with df1
df2['address_id'] = ''

# Where the first ID is not present in df1, fall back to the second ID
df2.loc[~df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_2']

# Where the first ID is present in df1, use it
df2.loc[df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_1']

concat_address_id   last_login  country_of_login    address_id_1    address_id_2    address_id
0   address1address5    15/10/2020  CN                  address1    address5    address1
1   address6address2    18/02/2020  NL                  address6    address2    address2
2   address3address5    13/05/2019  BR                  address3    address5    address3
3   address6address4    18/06/2020  NL                  address6    address4    address4
4   address5address8    13/05/2019  RU                  address5    address8    address5

# Merge df1 and df2
df_final = pd.merge(df1,df2[['address_id', 'last_login', 'country_of_login']],
                    on='address_id',how='left')

    process     sku     address_id  customer    country last_login  country_of_login
0   process1    sku1    address1    customer5   BR      15/10/2020  CN
1   process1    sku2    address2    customer5   BR      18/02/2020  NL
2   process1    sku3    address3    customer5   BR      13/05/2019  BR
3   process1    sku4    address4    customer5   BR      18/06/2020  NL
4   process1    sku5    address5    customer5   BR      13/05/2019  RU
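The same idea can be written more compactly with str.extract and Series.where. This is a hedged sketch rather than a drop-in replacement. It assumes exactly two IDs per row, each of the form address followed by digits, and like the code above it keeps only one ID per df2 row (the first one if it exists in df1, otherwise the second).

```python
import pandas as pd

# Sample data copied from the question (df1 trimmed to the relevant columns).
df1 = pd.DataFrame({
    "sku": ["sku1", "sku2", "sku3", "sku4", "sku5"],
    "address_id": ["address1", "address2", "address3", "address4", "address5"],
})
df2 = pd.DataFrame({
    "concat_address_id": ["address1address5", "address6address2",
                          "address3address5", "address6address4",
                          "address5address8"],
    "last_login": ["15/10/2020", "18/02/2020", "13/05/2019",
                   "18/06/2020", "13/05/2019"],
    "country_of_login": ["CN", "NL", "BR", "NL", "RU"],
})

# Assumption: each value is exactly two "address<digits>" IDs back to back.
# extract returns a frame with one column per capture group (0 and 1).
parts = df2["concat_address_id"].str.extract(r"(address\d+)(address\d+)")

# Use the first ID when df1 knows it, otherwise fall back to the second ID.
first_known = parts[0].isin(df1["address_id"])
df2["address_id"] = parts[0].where(first_known, parts[1])

df_final = df1.merge(
    df2[["address_id", "last_login", "country_of_login"]],
    on="address_id",
    how="left",
)
print(df_final)
```

On the sample data this gives one row per address. On real data, where an address can appear in several df2 rows, the merge can still return more than one row per address.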

【Comments】:
