【Question title】:Parse a URL string in a Spark DataFrame with PySpark
【Posted】:2020-10-19 17:58:41
【Question】:

I need to parse the URL strings in the refererurl column of a Spark DataFrame. The data looks like this:

refererurl
https://www.delish.com/cooking/recipes/t678
https://www.delish.com/food/recipes/a463/
https://www.delish.com/cooking/recipes/g877

I am only interested in the path segment that comes immediately after delish.com. The expected output is:

content
cooking
food
cooking

I tried:

data.withColumn("content", fn.regexp_extract('refererurl', 'param1=(\d)', 2))

This returns all null values.
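One way to see why the attempt finds nothing is to test the same pattern in plain Python, outside Spark. The sketch below is only a diagnostic under that assumption (the `urls` list and variable names are made up for illustration): the pattern looks for a literal `param1=` followed by a digit, which none of the sample URLs contain, and it defines a single capture group while the call asks for group 2.

```python
import re

# Sample URLs copied from the question.
urls = [
    "https://www.delish.com/cooking/recipes/t678",
    "https://www.delish.com/food/recipes/a463/",
    "https://www.delish.com/cooking/recipes/g877",
]

# The pattern used in the attempted regexp_extract call.
attempted = re.compile(r"param1=(\d)")

# Number of capture groups the pattern defines.
group_count = attempted.groups

# Try the pattern against every URL; None means no match.
matches = [attempted.search(u) for u in urls]

print(group_count)
print(matches)
```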

【Comments】:

    Tags: python string apache-spark parsing pyspark


    【Solution 1】:

    When the target string always sits at the same position in the URL, another way to solve this is with the split and element_at functions.

    from pyspark.sql import functions as F

    # Assumes an active SparkSession named `spark`.
    df = spark.createDataFrame([(1,"https://www.delish.com/cooking/recipes/t678"), (2,"https://www.delish.com/food/recipes/a463/"),(3,"https://www.delish.com/cooking/recipes/g877")],[ "col1","col2"])
    df.show(truncate=False)
    # Split on "/"; element_at is 1-based, so index 4 is the first path segment.
    df = df.withColumn("splited_col", F.split("col2", "/"))
    df = df.withColumn("content", F.element_at(F.col("splited_col"), 4))
    df.show(truncate=False)
    

    Input

    +----+-------------------------------------------+
    |col1|col2                                       |
    +----+-------------------------------------------+
    |1   |https://www.delish.com/cooking/recipes/t678|
    |2   |https://www.delish.com/food/recipes/a463/  |
    |3   |https://www.delish.com/cooking/recipes/g877|
    +----+-------------------------------------------+
      
    

    Output

    +----+-------------------------------------------+--------------------------------------------------+-------+
    |col1|col2                                       |splited_col                                       |content|
    +----+-------------------------------------------+--------------------------------------------------+-------+
    |1   |https://www.delish.com/cooking/recipes/t678|[https:, , www.delish.com, cooking, recipes, t678]|cooking|
    |2   |https://www.delish.com/food/recipes/a463/  |[https:, , www.delish.com, food, recipes, a463, ] |food   |
    |3   |https://www.delish.com/cooking/recipes/g877|[https:, , www.delish.com, cooking, recipes, g877]|cooking|
    +----+-------------------------------------------+--------------------------------------------------+-------+
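The choice of index 4 can be checked with an ordinary Python string split, which produces the same pieces as the splited_col column above. This is only an illustrative sketch (the variable names are hypothetical); the one thing to keep in mind is that Python lists are 0-based while element_at is 1-based, so `parts[3]` here corresponds to `element_at(..., 4)` in Spark.

```python
# One of the sample URLs, including its trailing slash.
url = "https://www.delish.com/food/recipes/a463/"

# "https://" yields "https:" and an empty string before the host,
# and the trailing slash yields an empty string at the end.
parts = url.split("/")
print(parts)

# 0-based index 3 in Python is the 4th element, i.e. element_at(..., 4).
content = parts[3]
print(content)
```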
    

    【Comments】:

      【Solution 2】:

      You can use parse_url to get the path of the URL, then regexp_extract to get the first level of the path:

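      import pyspark.sql.functions as fn
      # parse_url(..., 'PATH') yields e.g. /cooking/recipes/t678; the lazy group captures the text between the first two slashes.
      df.withColumn("content", fn.expr("regexp_extract(parse_url(refererurl, 'PATH'),'/(.*?)/')")) \
          .show(truncate=False)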
      

      Output:

      +-------------------------------------------+-------+
      |refererurl                                 |content|
      +-------------------------------------------+-------+
      |https://www.delish.com/cooking/recipes/t678|cooking|
      |https://www.delish.com/food/recipes/a463/  |food   |
      |https://www.delish.com/cooking/recipes/g877|cooking|
      +-------------------------------------------+-------+
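Note that the pattern '/(.*?)/' needs a second slash after the first segment, so a URL whose path has only one level (for example a hypothetical https://www.delish.com/cooking) would not match. A possible variant, sketched below under that assumption, applies a single regular expression to the whole URL instead. `PATTERN`, `first_segment`, and `with_content` are illustrative names, not part of either answer; the plain-Python helper is there only to check the pattern without a Spark session, and it returns an empty string on no match to mirror regexp_extract.

```python
import re

# Scheme, then host, then the first path segment (stops at "/", "?" or "#").
PATTERN = r"^https?://[^/]+/([^/?#]+)"


def first_segment(url):
    """Return the first path segment of url, or "" when there is none."""
    m = re.search(PATTERN, url)
    return m.group(1) if m else ""


def with_content(df, url_col="refererurl"):
    """Add a `content` column to a Spark DataFrame using the same pattern."""
    # Imported here so first_segment can be tried without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn("content", F.regexp_extract(F.col(url_col), PATTERN, 1))


if __name__ == "__main__":
    for u in ["https://www.delish.com/cooking/recipes/t678",
              "https://www.delish.com/food/recipes/a463/",
              "https://www.delish.com/cooking"]:
        print(u, "->", first_segment(u))
```

With an existing DataFrame `df` that has a refererurl column, `with_content(df).show(truncate=False)` would be the Spark-side usage.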
      

      【Comments】:
