【Question title】: Merge if two string columns are substrings of one column from another dataframe in Python
【Posted】: 2021-07-28 14:53:13
【Question description】:

Given two dataframes as follows:

df1:

   id                                      address  price
0   1         8563 Parker Ave. Lexington, NC 27292      3
1   2         242 Bellevue Lane Appleton, WI 54911      3
2   3       771 Greenview Rd. Greenfield, IN 46140      5
3   4       93 Hawthorne Street Lakeland, FL 33801      6
4   5  8952 Green Hill Street Gettysburg, PA 17325      3
5   6    7331 S. Sherwood Dr. New Castle, PA 16101      4

df2:

  state            street  quantity
0    PA       S. Sherwood        12
1    IN  Hawthorne Street         3
2    NC       Parker Ave.         7
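
To make the question reproducible, the two frames shown above can be built as follows. This is a minimal sketch that only copies the values from the printed tables; the dtypes are whatever pandas infers.

```python
import pandas as pd

# df1: one free-text address per row
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "address": [
        "8563 Parker Ave. Lexington, NC 27292",
        "242 Bellevue Lane Appleton, WI 54911",
        "771 Greenview Rd. Greenfield, IN 46140",
        "93 Hawthorne Street Lakeland, FL 33801",
        "8952 Green Hill Street Gettysburg, PA 17325",
        "7331 S. Sherwood Dr. New Castle, PA 16101",
    ],
    "price": [3, 3, 5, 6, 3, 4],
})

# df2: the (state, street) pairs to look for inside df1["address"]
df2 = pd.DataFrame({
    "state": ["PA", "IN", "NC"],
    "street": ["S. Sherwood", "Hawthorne Street", "Parker Ave."],
    "quantity": [12, 3, 7],
})
```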

If both state and street from df2 are contained in the address of df1, merge df2 into df1.

How can I do this in Pandas? Thanks.

Expected result df:

   id                                      address  ...       street quantity
0   1         8563 Parker Ave. Lexington, NC 27292  ...  Parker Ave.     7.00
1   2         242 Bellevue Lane Appleton, WI 54911  ...          NaN      NaN
2   3       771 Greenview Rd. Greenfield, IN 46140  ...          NaN      NaN
3   4       93 Hawthorne Street Lakeland, FL 33801  ...          NaN      NaN
4   5  8952 Green Hill Street Gettysburg, PA 17325  ...          NaN      NaN
5   6    7331 S. Sherwood Dr. New Castle, PA 16101  ...  S. Sherwood    12.00

[6 rows x 6 columns]

My test code:

df2['addr'] = df2['state'].astype(str) + df2['street'].astype(str)

pat = '|'.join(r'\b{}\b'.format(x) for x in df2['addr'])
df1['addr'] = df1['address'].str.extract('(' + pat + ')', expand=False)

df = df1.merge(df2, on='addr', how='left')

Output:

   id                                      address  ...  street_y quantity_y
0   1         8563 Parker Ave. Lexington, NC 27292  ...       NaN        nan
1   2         242 Bellevue Lane Appleton, WI 54911  ...       NaN        nan
2   3       771 Greenview Rd. Greenfield, IN 46140  ...       NaN        nan
3   4       93 Hawthorne Street Lakeland, FL 33801  ...       NaN        nan
4   5  8952 Green Hill Street Gettysburg, PA 17325  ...       NaN        nan
5   6    7331 S. Sherwood Dr. New Castle, PA 16101  ...       NaN        nan

[6 rows x 10 columns]

【Question comments】:

    Tags: python-3.x pandas dataframe


    【Solution 1】:

    Try this:

    import numpy as np

    # One capture group per column: an alternation of all values in df2
    pat_state = f"({'|'.join(df2['state'])})"
    pat_street = f"({'|'.join(df2['street'])})"
    df1['street'] = df1['address'].str.extract(pat_street, expand=False)
    df1['state'] = df1['address'].str.extract(pat_state, expand=False)
    # Keep the extracted keys only when both state and street were found
    df1.loc[df1['state'].isna(), 'street'] = np.nan
    df1.loc[df1['street'].isna(), 'state'] = np.nan
    df1 = df1.merge(df2, on=['state', 'street'], how='left')
    

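    【Comments】:

    • Thanks, I will test it with my real data and let you know.
    • Sorry, it raises an error: error: missing ), unterminated subpattern
    • It works after removing punctuation with df2["street"] = df2['street'].str.replace('[^\w\s]','')
    • What if I need to merge on 3 columns?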
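
The "unterminated subpattern" error in the comments is what Python's re module typically reports when a value joined into the pattern contains an unbalanced "(". Rather than stripping punctuation, the values can be escaped with re.escape, and the same loop covers three or more key columns. A minimal sketch, assuming every key column holds strings; merge_on_substrings, text_col and key_cols are hypothetical names introduced here, not part of pandas:

```python
import re

import numpy as np
import pandas as pd


def merge_on_substrings(left, right, text_col, key_cols, how="left"):
    """Merge `right` into `left` where every `key_cols` value of a `right`
    row occurs inside `left[text_col]` (hypothetical helper)."""
    out = left.copy()
    for col in key_cols:
        # Longer values first, so they win over shorter values that share a prefix
        values = sorted(right[col].unique(), key=len, reverse=True)
        # re.escape makes "(", ")", "." and similar characters match literally
        pattern = "(" + "|".join(re.escape(v) for v in values) + ")"
        out[col] = out[text_col].str.extract(pattern, expand=False)
    # Discard partial hits: keep the keys only if all of them were found
    incomplete = out[key_cols].isna().any(axis=1)
    out.loc[incomplete, key_cols] = np.nan
    return out.merge(right, on=key_cols, how=how)


# Data copied from the question
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "address": [
        "8563 Parker Ave. Lexington, NC 27292",
        "242 Bellevue Lane Appleton, WI 54911",
        "771 Greenview Rd. Greenfield, IN 46140",
        "93 Hawthorne Street Lakeland, FL 33801",
        "8952 Green Hill Street Gettysburg, PA 17325",
        "7331 S. Sherwood Dr. New Castle, PA 16101",
    ],
    "price": [3, 3, 5, 6, 3, 4],
})
df2 = pd.DataFrame({
    "state": ["PA", "IN", "NC"],
    "street": ["S. Sherwood", "Hawthorne Street", "Parker Ave."],
    "quantity": [12, 3, 7],
})

result = merge_on_substrings(df1, df2, text_col="address", key_cols=["state", "street"])
print(result)
```

For a merge on three columns, the third column name would simply be added to key_cols. Two caveats of this sketch: str.extract keeps only the first match per column, so an address that mentions several candidate values may be missed, and the values are matched as plain substrings, so a short state code could also match inside a longer word.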
    【Solution 2】:
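    import pandas as pd

    k = "|".join(df2['street'].to_list())  # regex alternation of all street names
    # temp: street(s) found in the address; temp1: state code, the first token after the last comma
    df1 = df1.assign(temp=df1['address'].str.findall(k).str.join(', '), temp1=df1['address'].str.split(",").str[-1].str.split().str[0])
    dfnew = pd.merge(df1, df2, how='left', left_on=['temp', 'temp1'], right_on=['street', 'state'])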
    

    【Comments】:

    • Thanks, but you did not use df2['state'].
    • Thanks. If address has no "," to split on, how should your code be modified?
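
One way to avoid relying on a "," in address is to pair every address with every (state, street) row and keep the pairs where both values occur in the address. A minimal sketch, assuming pandas 1.2 or later (for how="cross") and that id uniquely identifies the rows of df1; contains_token is a hypothetical helper, and the data is copied from the question:

```python
import re

import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "address": [
        "8563 Parker Ave. Lexington, NC 27292",
        "242 Bellevue Lane Appleton, WI 54911",
        "771 Greenview Rd. Greenfield, IN 46140",
        "93 Hawthorne Street Lakeland, FL 33801",
        "8952 Green Hill Street Gettysburg, PA 17325",
        "7331 S. Sherwood Dr. New Castle, PA 16101",
    ],
    "price": [3, 3, 5, 6, 3, 4],
})
df2 = pd.DataFrame({
    "state": ["PA", "IN", "NC"],
    "street": ["S. Sherwood", "Hawthorne Street", "Parker Ave."],
    "quantity": [12, 3, 7],
})


def contains_token(text, token):
    """True if `token` occurs literally in `text` and is not glued to
    neighbouring word characters (hypothetical helper)."""
    return re.search(r"(?<!\w)" + re.escape(token) + r"(?!\w)", text) is not None


# Cartesian product: len(df1) * len(df2) candidate pairs
pairs = df1.merge(df2, how="cross")

# Keep a pair only when both the state and the street appear in the address
hit = [
    contains_token(addr, state) and contains_token(addr, street)
    for addr, state, street in zip(pairs["address"], pairs["state"], pairs["street"])
]
matches = pairs.loc[hit, ["id", "state", "street", "quantity"]]

# Attach the matches to df1; addresses without a match get NaN
result = df1.merge(matches, on="id", how="left")
print(result[["id", "state", "street", "quantity"]])
```

The cross join grows with the product of the two row counts, so this suits small or medium lookup tables. An address that matches several df2 rows will appear once per match in the result.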