【Question Title】: Zero Replacement for missing values in JSON file for Pyspark
【Posted】: 2020-01-16 08:08:38
【Question】:

The JSON looks like this.

{
"ThresholdTime": "48min", 
"FallTime": "Min", 
"description": "PowerAmplifier"
}
{
"ThresholdTime": "min", 
"FallTime": "200min", 
"description": "DolbyDigitall"
}

I am using regexp_extract to strip the alphabetic characters from the alphanumeric strings.

df.withColumn("NewThresholdTime",regexp_extract("ThresholdTime","(\\d+)",1))

How can I put 0 in the new column when ThresholdTime or FallTime contains no number?

The output should be:

+--------+-------------+-----------+----------------+
|FallTime|ThresholdTime|NewFallTime|NewThresholdTime|
+--------+-------------+-----------+----------------+
|     Min|        48min|          0|              48|
|  200min|          min|        200|               0|
+--------+-------------+-----------+----------------+

【Comments】:

  • In Scala we use a when clause; something similar should be available in PySpark.

Tags: python apache-spark pyspark pyspark-sql


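【Solution 1】:

Assuming we have a DataFrame containing the values from the JSON, you can check whether a column stays the same once its digits are removed. If it does, the value contains no number, so use 0; otherwise strip the letters.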

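from pyspark.sql import functions as F

# sqlContext is assumed to exist already (e.g. in the PySpark shell)
df = sqlContext.createDataFrame(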
    [{"ThresholdTime": "48min", 
      "FallTime": "15Min", 
      "description": "PowerAmplifier"
    },
    {"ThresholdTime": "min", 
     "FallTime": "200min", 
     "description": "DolbyDigitall"}])

# What would the column look like without letters
col_without_alphabets = F.regexp_replace(df["ThresholdTime"], "[a-zA-Z]", "")

# What would column look like without numerals
col_without_numerals = F.regexp_replace(df["ThresholdTime"], "[0-9]", "")

# If removing numerals leaves the column unchanged it holds no number, so use 0; else remove the letters
df.withColumn("NewThresholdTime",
              F.when(col_without_numerals == df["ThresholdTime"], 
                     F.lit(0))
              .otherwise(col_without_alphabets)).show()

Output:

+--------+-------------+--------------+----------------+
|FallTime|ThresholdTime|   description|NewThresholdTime|
+--------+-------------+--------------+----------------+
|   15Min|        48min|PowerAmplifier|              48|
|  200min|          min| DolbyDigitall|               0|
+--------+-------------+--------------+----------------+
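
To see what this rule does with less tidy values, here is a plain-Python mirror of it for a single string. This is only a sketch for reasoning about edge cases, not part of the Spark job: `digits_or_zero` is a hypothetical helper name, and returning the string "0" (rather than the integer 0) is an assumption about how Spark reconciles `F.lit(0)` with the string branch.

```python
import re


def digits_or_zero(value):
    """Plain-Python mirror of the when/otherwise rule for one string value."""
    if value is None:
        # Assumption: a null input stays null, as it would fall through to otherwise()
        return None
    without_numerals = re.sub(r"[0-9]", "", value)
    without_letters = re.sub(r"[a-zA-Z]", "", value)
    # Removing digits changed nothing, so the value holds no number
    if without_numerals == value:
        return "0"
    return without_letters


# Sample values: the ones from the question plus a few awkward ones
for sample in ["48min", "min", "15Min", "1h30min", "4.5 min", ""]:
    print(repr(sample), "->", repr(digits_or_zero(sample)))
```

Two things stand out when you run it: a value with several numbers such as "1h30min" has its digits glued together, and anything that is neither a letter nor a digit (dots, spaces) is left in place. If your data can contain such values you may want a stricter pattern.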

Extending the answer to any number of columns.

Loop over whichever columns you want to apply the same operation to.

new_columns = list()
for column in ["ThresholdTime", "FallTime"]:

    # What would column look like without alphabets
    col_without_alphabets = F.regexp_replace(df[column], "[a-zA-Z]", "")

    # What would column look like without numerals
    col_without_numerals = F.regexp_replace(df[column], "[0-9]", "")

    # If removing numerals leaves the column unchanged it holds no number, so use 0; else remove the letters
    new_columns.append(F.when(col_without_numerals == df[column], 
                        F.lit(0)).otherwise(col_without_alphabets).alias("New{}".format(column)))

df.select(["*"] + new_columns).show()

Output:

+--------+-------------+--------------+----------------+-----------+
|FallTime|ThresholdTime|   description|NewThresholdTime|NewFallTime|
+--------+-------------+--------------+----------------+-----------+
|   15Min|        48min|PowerAmplifier|              48|         15|
|  200min|          min| DolbyDigitall|               0|        200|
+--------+-------------+--------------+----------------+-----------+
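
A possible alternative, closer to the regexp_extract call in the question: Spark's regexp_extract returns an empty string when the pattern does not match, so a SQL CASE WHEN can turn that empty string into '0'. The sketch below only builds the expression strings; `zero_filled_expr` and its `prefix` parameter are hypothetical names of mine, not part of the answer above or of PySpark.

```python
def zero_filled_expr(column, prefix="New"):
    """Build a Spark SQL expression string for one column.

    The expression takes the first run of digits in the column and
    falls back to '0' when regexp_extract finds none (empty string).
    """
    digits = "regexp_extract(`{c}`, '([0-9]+)', 1)".format(c=column)
    return "CASE WHEN {d} = '' THEN '0' ELSE {d} END AS `{p}{c}`".format(
        d=digits, p=prefix, c=column)


# One expression per column that needs the same treatment
exprs = [zero_filled_expr(c) for c in ["ThresholdTime", "FallTime"]]
for e in exprs:
    print(e)
```

With the `df` from above, `df.selectExpr("*", *exprs).show()` should then add both columns in one pass. Note that this keeps only the first run of digits, so on values containing several numbers it will not give the same result as stripping the letters.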

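【Comments】:

  • Thanks a lot, Sunny... but what I am looking for is to remove the characters and keep only the integers in NewFallTime and NewThresholdTime.
  • Ah, sorry, I have made a few small changes; it should work now. Please accept the answer if it is what you need.
  • I am getting the error "NameError: name 'F' is not defined". Please help.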
  • import pyspark.sql.functions as F
  • from pyspark.sql import functions as F