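If the string you need always sits at the same position in the URL, another way to solve this is with the split and element_at functions: split the URL on "/" and take the element at a fixed index. Note that element_at uses 1-based indexing.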
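from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "https://www.delish.com/cooking/recipes/t678"), (2, "https://www.delish.com/food/recipes/a463/"), (3, "https://www.delish.com/cooking/recipes/g877")], ["col1", "col2"])
df.show(truncate=False)
# Split each URL on "/" into an array of segments
df = df.withColumn("splited_col", F.split("col2", "/"))
# element_at is 1-based, so index 4 is the first path segment after the host
df = df.withColumn("content", F.element_at(F.col("splited_col"), 4))
df.show(truncate=False)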
Input
+----+-------------------------------------------+
|col1|col2                                       |
+----+-------------------------------------------+
|1   |https://www.delish.com/cooking/recipes/t678|
|2   |https://www.delish.com/food/recipes/a463/  |
|3   |https://www.delish.com/cooking/recipes/g877|
+----+-------------------------------------------+
Output
+----+-------------------------------------------+--------------------------------------------------+-------+
|col1|col2                                       |splited_col                                       |content|
+----+-------------------------------------------+--------------------------------------------------+-------+
|1   |https://www.delish.com/cooking/recipes/t678|[https:, , www.delish.com, cooking, recipes, t678]|cooking|
|2   |https://www.delish.com/food/recipes/a463/  |[https:, , www.delish.com, food, recipes, a463, ] |food   |
|3   |https://www.delish.com/cooking/recipes/g877|[https:, , www.delish.com, cooking, recipes, g877]|cooking|
+----+-------------------------------------------+--------------------------------------------------+-------+
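
To see why index 4 is the right position, the same split can be reproduced in plain Python without a Spark session. This is only a rough sketch for illustration: `element_at_1based` is a hypothetical helper that imitates 1-based lookup and simply returns None when the index is out of range; it is not part of PySpark.

```python
def element_at_1based(items, index):
    """Hypothetical helper: return the item at a 1-based position, or None if out of range."""
    if 1 <= index <= len(items):
        return items[index - 1]
    return None


urls = [
    "https://www.delish.com/cooking/recipes/t678",
    "https://www.delish.com/food/recipes/a463/",
    "https://www.delish.com/cooking/recipes/g877",
]

# "https://" contains two consecutive slashes, so each split has an
# empty string at position 2 and the host name at position 3.
parts = [url.split("/") for url in urls]

# Position 4 (1-based) is therefore the first path segment after the host.
contents = [element_at_1based(p, 4) for p in parts]
print(contents)
```

The empty string produced by the "//" after "https:" is what pushes the first path segment to position 4 rather than 3. A trailing slash, as in the second URL, only adds an empty element at the end, so it does not affect the lookup.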