Zero replacement for missing values in a JSON file for PySpark: answers

【Question title】: Zero Replacement for missing values in JSON file for Pyspark
【Posted】: 2020-01-16 08:08:38
【Question description】:

The JSON looks like this.

{
"ThresholdTime": "48min", 
"FallTime": "Min", 
"description": "PowerAmplifier"
}
{
"ThresholdTime": "min", 
"FallTime": "200min", 
"description": "DolbyDigitall"
}

I am using regexp_extract to strip the alphabetic characters from the alphanumeric strings.

df.withColumn("NewThresholdTime",regexp_extract("ThresholdTime","(\\d+)",1))

How can I put 0 in the new column when ThresholdTime or FallTime contains no numeric time value?

The output should be:

+--------+-------------+-----------+----------------+
|FallTime|ThresholdTime|NewFallTime|NewThresholdTime|
+--------+-------------+-----------+----------------+
|     Min|        48min|          0|              48|
|  200min|          min|        200|               0|
+--------+-------------+-----------+----------------+

【Question comments】:

In Scala we use the when clause for this; something similar should be available in PySpark.

Tags: python apache-spark pyspark pyspark-sql

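【Solution 1】:

Assuming we have a DataFrame containing the values from the JSON, you can check whether a column value stays the same once its digits are removed. If it does, the value contains no digits, so use 0; otherwise, strip the letters and keep the number.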

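# F is the conventional alias for the pyspark.sql.functions module
from pyspark.sql import functions as F

df = sqlContext.createDataFrame(
    [{"ThresholdTime": "48min",
      "FallTime": "15Min",
      "description": "PowerAmplifier"
    },
    {"ThresholdTime": "min",
     "FallTime": "200min",
     "description": "DolbyDigitall"}])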

# The column with all letters removed (only the digits remain)
col_without_alphabets = F.regexp_replace(df["ThresholdTime"], "[a-zA-Z]", "")

# The column with all digits removed (only the letters remain)
col_without_numerals = F.regexp_replace(df["ThresholdTime"], "[0-9]", "")

# If removing the digits changes nothing, the value has no digits: use 0.
# Otherwise keep the value with its letters removed.
df.withColumn("NewThresholdTime",
              F.when(col_without_numerals == df["ThresholdTime"],
                     F.lit(0))
              .otherwise(col_without_alphabets)).show()

Output:

+--------+-------------+--------------+----------------+
|FallTime|ThresholdTime|   description|NewThresholdTime|
+--------+-------------+--------------+----------------+
|   15Min|        48min|PowerAmplifier|              48|
|  200min|          min| DolbyDigitall|               0|
+--------+-------------+--------------+----------------+

Extending the answer to handle any number of columns:

Loop over every column you want to apply the same operation to.

new_columns = []
for column in ["ThresholdTime", "FallTime"]:

    # The column with all letters removed (only the digits remain)
    col_without_alphabets = F.regexp_replace(df[column], "[a-zA-Z]", "")

    # The column with all digits removed (only the letters remain)
    col_without_numerals = F.regexp_replace(df[column], "[0-9]", "")

    # If removing the digits changes nothing, the value has no digits: use 0.
    # Otherwise keep the value with its letters removed.
    new_columns.append(
        F.when(col_without_numerals == df[column], F.lit(0))
        .otherwise(col_without_alphabets)
        .alias("New{}".format(column)))

df.select(["*"] + new_columns).show()

Output:

+--------+-------------+--------------+----------------+-----------+
|FallTime|ThresholdTime|   description|NewThresholdTime|NewFallTime|
+--------+-------------+--------------+----------------+-----------+
|   15Min|        48min|PowerAmplifier|              48|         15|
|  200min|          min| DolbyDigitall|               0|        200|
+--------+-------------+--------------+----------------+-----------+

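【Discussion】:

Thanks a lot, Sunny... but what I am looking for is to remove the characters and keep only the integers in NewFallTime and NewThresholdTime.
Ah, sorry. I have made a few small changes, so it should work now. Please accept the answer if it meets your needs.
I am getting the error "NameError: name 'F' is not defined". Please help.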
import pyspark.sql.functions as F
from pyspark.sql import functions as F