【Question title】: PySpark function for matching strings
【Posted】: 2020-12-26 12:30:38
【Question】:

I have two tables.

Table 1: (comment_df)

| Date | Comment | 
|:---- |:------:| 
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday. |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. |
| 21/01/2020 | AMEX receipts from Euro Account delayed. |

Table 2: (code_df)

| Tag | Comment | 
|:---- |:------:| 
| EURO | Euro Account to HSBC |
| Natwest | Euro Account to Natwest |
| AMEX | AMEX payment |

The desired table is:

| Date | Comment | Tag |
|:---- |:------:| ----:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday. | EURO |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. | Natwest |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. | AMEX | 
| 21/01/2020 | AMEX receipts from Euro Account delayed. | |

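I could probably handle a few categories with .contains or a matcher (nlp.vocab?). But I have more than 30 categories, and the list will grow over time, so I am hoping there is a PySpark function that does this elegantly.

Cheers!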

【Comments】:

    Tags: apache-spark pyspark apache-spark-sql string-matching


    【Solution 1】:

    A left join may be appropriate:

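    # Rename code_df's Comment column so it does not clash with comment_df.Comment
    code_df = code_df.withColumnRenamed('Comment', 'Commentcode')

    # Left join keeps every comment; Tag is null when no code phrase is a substring of it
    result = comment_df.join(code_df, comment_df.Comment.contains(code_df.Commentcode), 'left').drop('Commentcode')

    result.show(truncate=False)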
    +----------+---------------------------------------------------------------------+-------+
    |Date      |Comment                                                              |Tag    |
    +----------+---------------------------------------------------------------------+-------+
    |20/01/2020|Transfer from Euro Account to HSBC account done on Monday.           |EURO   |
    |20/01/2020|Brian initiated a Transfer from Euro Account to Natwest last Tuesday.|Natwest|
    |21/01/2020|AMEX payment to Natwest was delayed for second time in a row.        |AMEX   |
    |21/01/2020|AMEX receipts from Euro Account delayed.                             |null   |
    +----------+---------------------------------------------------------------------+-------+
    

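    【Discussion】:

    • The challenge I am running into is that when a comment matches several keywords, it ends up with several tags. I would like it to stop at the first matching keyword. Any suggestions for solving this?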
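
One possible way to get "first keyword wins" behaviour, sketched here under several assumptions: code_df is small enough to collect to the driver, it still has its original `Tag` and `Comment` columns (before the rename in Solution 1), and the order of its collected rows is the intended priority. The helper names `first_matching_tag` and `add_first_tag` are hypothetical and not part of PySpark. Spark does not guarantee row order in general, so an explicit priority column would be safer if the order really matters. This swaps the join for a Python UDF, which is simpler to reason about but likely slower on large data.

```python
from typing import Iterable, Optional, Tuple


def first_matching_tag(
    comment: Optional[str], codes: Iterable[Tuple[str, str]]
) -> Optional[str]:
    """Return the tag of the first (tag, phrase) pair whose phrase occurs in comment."""
    if comment is None:
        return None
    for tag, phrase in codes:
        if phrase in comment:  # plain case-sensitive substring test
            return tag
    return None  # no phrase matched; becomes null in Spark


def add_first_tag(comment_df, code_df):
    """Add a single Tag column to comment_df using the first matching row of code_df."""
    # Imported lazily so first_matching_tag stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Assumption: code_df is small, and its collected order is the priority order.
    codes = [(row["Tag"], row["Comment"]) for row in code_df.collect()]
    tag_udf = F.udf(lambda c: first_matching_tag(c, codes), StringType())
    return comment_df.withColumn("Tag", tag_udf(F.col("Comment")))
```

With the tables from the question, `add_first_tag(comment_df, code_df).show(truncate=False)` should give one row per comment, each with at most one tag.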