Posted: 2022-08-09 17:36:24
Question:
I have the following code:
from pyspark.sql import functions as F

L = {'L1': ['us']}
#df1 = df1.withColumnRenamed("name","OriginalCompanyName")
for key, vals in L.items():
    # regex pattern for extracting vals
    pat = r'\\b(%s)\\b' % '|'.join(vals)
    # extract matching occurrences
    col1 = F.expr("regexp_extract_all(array_join(loc, ' '), '%s')" % pat)
    # Mask the rows with null when there are no matches
    df1 = df1.withColumn(key, F.when((F.size(col1) == 0), None).otherwise(col1))
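It extracts us from the loc column into the key column (L1 here) when there is a match, and gives null otherwise. I also have some empty lists [] in the loc column. When loc is empty, I also want to put us in the key column. Changing L = {'L1': ['us']} to L = {'L1': ['us', '[]']} does not work.
For some reason, when loc is empty, this code actually eliminates the rows. Can I modify the code?
Hint: an empty loc can be found with the following code: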
df1 = df1.withColumn('empty_country', F.when(F.size('loc') == 0, 'us'))
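One possible way to fold that hint into the loop, offered as a sketch rather than a confirmed fix: with an empty loc, array_join presumably yields an empty string, so regexp_extract_all finds nothing and the size == 0 branch returns null. Checking size(loc) = 0 first avoids that. The helper name match_or_default_expr and its default and column parameters are hypothetical and not part of the original post; the helper only builds the Spark SQL expression string.

```python
def match_or_default_expr(vals, default="us", column="loc"):
    """Build a Spark SQL expression string for one output column.

    Hypothetical helper (not from the original post):
      - empty `column` array -> array(default)
      - no regex match       -> NULL
      - otherwise            -> all matches
    """
    # Pattern kept exactly as written in the question.
    pat = r'\\b(%s)\\b' % '|'.join(vals)
    matches = "regexp_extract_all(array_join(%s, ' '), '%s')" % (column, pat)
    # The empty-array check comes first so it wins over the "no match" branch.
    return (
        "CASE WHEN size(%s) = 0 THEN array('%s') "
        "WHEN size(%s) = 0 THEN NULL "
        "ELSE %s END" % (column, default, matches, matches)
    )
```

Inside the original loop this would be used as df1 = df1.withColumn(key, F.expr(match_or_default_expr(vals))). Values in vals are inserted into the regex unescaped, as in the question, so an entry like '[]' would still be read as regex syntax.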
Data sample:
loc
["this is ,us, better life"]
["no one is, in charge"]
["I am, very far, from us"]
[]

Expected output:
loc                             L1
["this is ,us, better life"]    ["us"]
["no one is, in charge"]        null
["I am, very far, from us"]     ["us"]
[]                              ["us"]
Tags: pyspark