[Posted]: 2020-12-26 12:30:38
[Question]:
I have two tables.
Table 1 (comment_df):
| Date | Comment |
|:---- |:------:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday. |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. |
| 21/01/2020 | AMEX receipts from Euro Account delayed. |
Table 2 (code_df):
| Tag | Comment |
|:---- |:------:|
| EURO | Euro Account to HSBC |
| Natwest | Euro Account to Natwest |
| AMEX | AMEX payment |
The desired table is:
| Date | Comment | Tag |
|:---- |:------:| ----:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday. | EURO |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. | Natwest |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. | AMEX |
| 21/01/2020 | AMEX receipts from Euro Account delayed. | |
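I could probably handle a handful of categories with .contains or matcher(nlp.vocab?). But I have more than 30 categories, and the list will keep growing over time, so I am hoping there is a PySpark function that can do this elegantly.
Cheers!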
[Comments]:
Tags: apache-spark pyspark apache-spark-sql string-matching