Parse URL strings in a Spark DataFrame with PySpark

【Question title】: parse url string in spark df with PySpark
【Posted】: 2020-10-19 17:58:41
【Question description】:

I need to parse the URL strings in the refererurl column of a Spark DataFrame. The data looks like this:

refererurl
https://www.delish.com/cooking/recipes/t678
https://www.delish.com/food/recipes/a463/
https://www.delish.com/cooking/recipes/g877

I am only interested in the part that comes directly after delish.com. The expected output is:

content
cooking
food
cooking

I tried:

data.withColumn("content", fn.regexp_extract('refererurl', 'param1=(\d)', 2))

This returns all empty values.

【Comments】:

Tags: python string apache-spark parsing pyspark

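【Solution 1】:

Another way to solve this, when the position of the target string within the URL is known to stay the same, is to use the split and element_at functions.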

from pyspark.sql import functions as F

# Sample data; assumes an active SparkSession named `spark`
df = spark.createDataFrame([(1, "https://www.delish.com/cooking/recipes/t678"), (2, "https://www.delish.com/food/recipes/a463/"), (3, "https://www.delish.com/cooking/recipes/g877")], ["col1", "col2"])
df.show(truncate=False)
# Split on "/" and take the 4th element (element_at is 1-based)
df = df.withColumn("splited_col", F.split("col2", "/"))
df = df.withColumn("content", F.element_at(F.col("splited_col"), 4))
df.show(truncate=False)

Input

+----+-------------------------------------------+
|col1|col2                                       |
+----+-------------------------------------------+
|1   |https://www.delish.com/cooking/recipes/t678|
|2   |https://www.delish.com/food/recipes/a463/  |
|3   |https://www.delish.com/cooking/recipes/g877|
+----+-------------------------------------------+

Output

+----+-------------------------------------------+--------------------------------------------------+-------+
|col1|col2                                       |splited_col                                       |content|
+----+-------------------------------------------+--------------------------------------------------+-------+
|1   |https://www.delish.com/cooking/recipes/t678|[https:, , www.delish.com, cooking, recipes, t678]|cooking|
|2   |https://www.delish.com/food/recipes/a463/  |[https:, , www.delish.com, food, recipes, a463, ] |food   |
|3   |https://www.delish.com/cooking/recipes/g877|[https:, , www.delish.com, cooking, recipes, g877]|cooking|
+----+-------------------------------------------+--------------------------------------------------+-------+
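
The indexing above can be checked without a cluster. The sketch below mirrors the split and element_at steps in plain Python for a single string; `first_segment_by_split` and `add_content_by_split` are hypothetical names, not part of the original answer. It assumes ANSI mode is off, in which case element_at returns null for an out-of-range position, so the plain-Python version returns None for a URL with no path.

```python
def first_segment_by_split(url):
    """Mirror F.split(col, "/") followed by F.element_at(..., 4) for one string."""
    parts = url.split("/")  # "https://host/a/b" -> ["https:", "", "host", "a", "b"]
    # element_at is 1-based, so position 4 corresponds to Python index 3.
    return parts[3] if len(parts) > 3 else None


def add_content_by_split(df, url_col="col2"):
    """Hypothetical helper: the same lookup as one PySpark expression."""
    from pyspark.sql import functions as F  # lazy import; requires pyspark
    return df.withColumn("content", F.element_at(F.split(F.col(url_col), "/"), 4))
```

Because the lookup depends on a fixed position, it only holds while every URL includes the scheme and host in the same form; a value such as www.delish.com/cooking/recipes would shift the segments by two positions.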

【Comments】:

【Solution 2】:

You can use parse_url to get the path of the URL, then regexp_extract to get the first level of the path:

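from pyspark.sql import functions as fn
# parse_url(..., 'PATH') yields e.g. /cooking/recipes/t678; group 1 is the first path segment
df.withColumn("content", fn.expr("regexp_extract(parse_url(refererurl, 'PATH'), '/(.*?)/', 1)")) \
    .show(truncate=False)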

Output:

+-------------------------------------------+-------+
|refererurl                                 |content|
+-------------------------------------------+-------+
|https://www.delish.com/cooking/recipes/t678|cooking|
|https://www.delish.com/food/recipes/a463/  |food   |
|https://www.delish.com/cooking/recipes/g877|cooking|
+-------------------------------------------+-------+
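
One edge case worth noting: the pattern '/(.*?)/' needs a second slash after the first segment, so a URL whose path is only /cooking would not match, and regexp_extract returns an empty string when nothing matches. The sketch below reproduces that behaviour on single strings with Python's re and urllib.parse, and compares it with an anchored pattern. The anchored pattern and both function names are assumptions for illustration, not part of the original answer.

```python
import re
from urllib.parse import urlparse

ORIGINAL = r"/(.*?)/"    # pattern used in the answer above
ANCHORED = r"^/([^/]+)"  # assumed alternative: first segment, no trailing slash needed


def first_segment(url, pattern):
    """Apply `pattern` to the URL path; return group 1, or "" when nothing matches."""
    match = re.search(pattern, urlparse(url).path)
    return match.group(1) if match else ""


def add_content_anchored(df, url_col="refererurl"):
    """Hypothetical helper: the anchored pattern as a PySpark expression."""
    from pyspark.sql import functions as fn  # lazy import; requires pyspark
    return df.withColumn(
        "content",
        fn.expr(f"regexp_extract(parse_url({url_col}, 'PATH'), '{ANCHORED}', 1)"),
    )
```

For the three sample URLs both patterns give the same result; they differ only when the path has a single segment and no trailing slash.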

【Comments】: