【Title】:How to transform a DataFrame to add a column with the list of strings contained within another column
【Posted】:2021-03-24 14:54:29
【Question】:

Suppose I have a list of keywords in Scala

val keywords = List("pineapple", "lemon")

and a DataFrame like this

+---+-------------------------------------------+
|ID |Body                                       |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything...          |
|789|Pineapple's are delicious                  |
+---+-------------------------------------------+

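How can I transform this DataFrame to add a new column holding the keywords that Body contains? The result I am looking for is something like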

+---+-------------------------------------------+------------------+
|ID |Body                                       |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything...          |[]                |
|789|Pineapple's are delicious                  |[pineapple]       |
+---+-------------------------------------------+------------------+

    Tags: scala apache-spark


    【Answer 1】:

    See the code below.

    Create the DataFrame with the required sample data.

    scala> val df = Seq(
          (123,"I contain both keywords pineapple and lemon"),
          (456,"I sadly don't contain anything"),
          (789,"Pineapple's are delicious")).toDF("id","body")
    
    df: org.apache.spark.sql.DataFrame = [id: int, body: string]
    
    scala> val keywords = List("pineapple", "lemon")
    keywords: List[String] = List(pineapple, lemon)
    

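    Add keywords to the DataFrame as a literal array column with typedLit, then use the filter higher-order function (Spark 2.4+) to keep only the keywords that the body column contains. The body is lowercased first, so the match is case-insensitive as long as the keywords themselves are lowercase.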

    scala> df
    .withColumn("keywords",typedLit(keywords))
    .withColumn("Contains_Keywords",expr("filter(keywords,keyword -> instr(lower(body),keyword) > 0)"))
    .show(false)
    

    Final output

    +---+-------------------------------------------+------------------+------------------+
    |id |body                                       |keywords          |Contains_Keywords |
    +---+-------------------------------------------+------------------+------------------+
    |123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
    |456|I sadly don't contain anything             |[pineapple, lemon]|[]                |
    |789|Pineapple's are delicious                  |[pineapple, lemon]|[pineapple]       |
    +---+-------------------------------------------+------------------+------------------+
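
The question's expected output has no helper keywords column, so it can be dropped once the filter step has run. A minimal sketch, assuming it runs as a standalone program rather than in spark-shell; the object name KeywordFilterSketch, the helper withContainedKeywords and the local[1] master are illustrative choices, not part of the answer above:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, typedLit}

object KeywordFilterSketch {
  // Hypothetical helper: adds Contains_Keywords and removes the temporary keywords column.
  // Assumes the input has a string column named "body" and that keywords are lowercase.
  def withContainedKeywords(df: DataFrame, keywords: List[String]): DataFrame =
    df.withColumn("keywords", typedLit(keywords))
      .withColumn("Contains_Keywords",
        expr("filter(keywords, keyword -> instr(lower(body), keyword) > 0)"))
      .drop("keywords")

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("keyword-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (123, "I contain both keywords pineapple and lemon"),
      (456, "I sadly don't contain anything"),
      (789, "Pineapple's are delicious")
    ).toDF("id", "body")

    withContainedKeywords(df, List("pineapple", "lemon")).show(false)
    spark.stop()
  }
}
```

Note that instr is a plain substring test, so a keyword such as apple would also be reported for a body that only contains pineapple.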
    


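      【Answer 2】:

      You can convert the keyword list to a DataFrame and then join on an rlike condition. It is best to add \\\\b before and after the keyword to mark word boundaries, which prevents partial matches such as apple matching pineapple. The left join keeps rows that match no keyword, and collect_list skips the resulting nulls, so those rows end up with an empty list.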

      val result = df.as("df")
          .join(keywords.toDF("keywords").as("keywords"), 
                expr("lower(df.body) rlike '\\\\b' || keywords.keywords || '\\\\b'"), 
                "left"
               )
          .groupBy("id", "body")
          .agg(collect_list("keywords").as("Contains_keywords"))
      
      result.show(false)
      +---+-------------------------------------------+------------------+
      |id |body                                       |Contains_keywords |
      +---+-------------------------------------------+------------------+
      |123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
      |789|Pineapple's are delicious                  |[pineapple]       |
      |456|I sadly don't contain anything             |[]                |
      +---+-------------------------------------------+------------------+
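
The effect of the word boundaries can be checked outside Spark, since rlike is evaluated with Java regular expressions. A small sketch in plain Scala; the object WordBoundarySketch and the helper matchesWholeWord are hypothetical names, and Pattern.quote is an extra precaution (not in the answer above) for keywords that contain regex metacharacters:

```scala
import java.util.regex.Pattern

object WordBoundarySketch {
  // Hypothetical helper: true if `keyword` occurs in the lowercased `body` as a whole word.
  def matchesWholeWord(body: String, keyword: String): Boolean =
    Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b")
      .matcher(body.toLowerCase)
      .find()

  def main(args: Array[String]): Unit = {
    val body = "Pineapple's are delicious"
    println(matchesWholeWord(body, "pineapple"))  // true: the apostrophe ends the word
    println(matchesWholeWord(body, "apple"))      // false: "apple" sits inside "pineapple"
    println(body.toLowerCase.contains("apple"))   // true: a plain substring test does match
  }
}
```

In the Spark expression the pattern is written as '\\\\b' because the backslashes are unescaped twice, once by the Scala string literal and once by the SQL string literal, before the regex engine sees \b.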
      
