How to split a list with delimiters in pyspark: answers

【Question title】: how to split a list with delimiters in pyspark
【Posted】: 2021-05-27 22:10:36
【Question description】:

I am trying to split a list on the delimiter ",", but some list elements themselves contain the character ",". For example:

1|[this is first element, this is seconde element, this is (bad, element)]

I want to work with this in a DataFrame, but the comma inside the third element breaks the logic.

current output:
id |name   |val
1  |Column0|this is first element
1  |Column1|this is seconde element
1  |Column2|this is (bad
1  |Column3|element)


expected output:
id |name   |val
1  |Column0|this is first element
1  |Column1|this is seconde element
1  |Column2|this is (bad, element)

df = df.select("id", f.split("text", ",").alias("text"), f.posexplode_outer(f.split("text", ",")).alias("pos", "val")).drop("val") \
    .select("id", "text", f.concat(f.lit("Column"), f.col("pos").cast("string")).alias("name"), f.expr("text[pos]").alias("val"))

【Question comments】:

Tags: python apache-spark pyspark

【Solution 1】:

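You need to find the right pattern for split, one that ignores any , that sits between ( and ).

You can use this regex, which is based on a negative lookahead: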

,\s*(?![^()]*\))

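This regex matches a comma (plus any whitespace after it) only when an assertion confirms that the comma is not inside parentheses. The check is a negative lookahead: it first consumes every character that is neither ( nor ), then requires a ). If that lookahead succeeds, the comma sits inside a pair of parentheses and is not treated as a delimiter. This assumes the parentheses are balanced, unescaped, and not nested.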
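To see which commas the lookahead accepts, the pattern can be tried outside Spark. The sketch below uses Python's `re` module rather than Spark's Java regex engine, on the assumption that the two behave the same for the constructs in this pattern; the variable names are illustrative only.

```python
import re

# Same pattern as in the answer. Assumption: Python's `re` and the Java
# regex engine used by Spark treat these constructs identically.
SPLIT_PATTERN = re.compile(r",\s*(?![^()]*\))")

text = "this is first element, this is seconde element, this is (bad, element)"

# Offsets of every comma in the string
all_comma_offsets = [i for i, ch in enumerate(text) if ch == ","]

# Offsets of the commas the pattern accepts as delimiters
delimiter_offsets = [m.start() for m in SPLIT_PATTERN.finditer(text)]

# The comma inside "(bad, element)" is rejected, so it stays in one part
parts = SPLIT_PATTERN.split(text)
print(parts)
```

Only the first two of the three commas are accepted as delimiters, so the split yields three parts and the third part keeps its inner comma.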

# Import functions
from pyspark.sql import functions as f

# Create data frame (assumes an active SparkSession named `spark`)
df = spark.createDataFrame(
    [(1, "this is first element, this is seconde element, this is (bad, element)")],
    ("id", "text"))

# Split on commas that are not inside parentheses (raw string keeps the backslashes intact)
pattern = r",\s*(?![^()]*\))"

# Apply transformation
df1 = df.select("id", f.split("text", pattern).alias("text"),
                f.posexplode_outer(f.split("text", pattern)).alias("pos", "val")) \
    .drop("val") \
    .select("id", "text",
            f.concat(f.lit("Column"), f.col("pos").cast("string")).alias("name"),
            f.expr("text[pos]").alias("val"))
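For a quick check without a Spark session, the split-then-posexplode step can be mirrored in plain Python. This is only a rough sketch: `explode_row` is a hypothetical helper, not part of PySpark. It omits the array-valued `text` column that the DataFrame version keeps, and it assumes Python's `re` treats the pattern the same way Spark's regex engine does.

```python
import re

SPLIT_PATTERN = re.compile(r",\s*(?![^()]*\))")


def explode_row(row_id, text):
    """Hypothetical helper: mimic split + posexplode for a single row.

    Returns one (id, name, val) tuple per split part, where name is
    "Column" followed by the zero-based position of the part.
    """
    parts = SPLIT_PATTERN.split(text)
    return [(row_id, f"Column{pos}", val) for pos, val in enumerate(parts)]


rows = explode_row(
    1, "this is first element, this is seconde element, this is (bad, element)"
)
for row in rows:
    print(row)
```

The resulting tuples line up with the expected output in the question: Column0, Column1, and Column2, with the parenthesized element kept whole.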

Paste your string into the demo linked below and it will show you the result.

RegEx Demo

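【Comments】:

The solution works perfectly, thank you. However, I found another pattern that breaks the logic: a field that contains a comma but no parentheses, for example: 1|[this is an element]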