Combine the mx values with the same name in one line in PySpark: answers

【Question title】: Combine the mx values with the same name in one line (PySpark)
【Posted】: 2020-10-18 00:43:39
【Question description】:

I want to convert these values:

    {"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"20 alt1.aspmx.l.google.com"}
    {"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"20 alt2.aspmx.l.google.com"}
    {"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"30 aspmx2.googlemail.com"}
    {"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"30 aspmx3.googlemail.com"}
    {"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"20 alt1.aspmx.l.google.com"}
    {"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"20 alt2.aspmx.l.google.com"}
    {"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"30 aspmx2.googlemail.com"}
    {"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"30 aspmx3.googlemail.com"}

    test.printSchema()
root
 |-- name: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- type: string (nullable = true)
 |-- value: string (nullable = true)

I want to combine the mx values that share the same name into a single line in PySpark. The result I want:

   { "timestamp":"1601093713", "name":"exmple1.com", "type":"mx", "value":" alt1.aspmx.l.google.com,alt2.aspmx.l.google.com , aspmx2.googlemail.com, aspmx3.googlemail.com" }
   { "timestamp":"1601093713", "name":"exmple2.com", "type":"mx", "value":" alt1.aspmx.l.google.com, alt2.aspmx.l.google.com , aspmx2.googlemail.com, aspmx3.googlemail.com" }

【Question discussion】:

Tags: python apache-spark pyspark apache-spark-sql

【Solution 1】:

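You can do this with groupBy, agg, and collect_list (see the PySpark docs). Note that this gives you a list of values rather than a string. If you need a string, the conversion is described in "Convert PySpark dataframe column from list to string".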

from pyspark.sql import functions as F
df_grouped = df.groupby('name').agg(F.collect_list('value').alias('values'))

The follow-up question here is how you want to handle the other columns, e.g. timestamp or type.

【Discussion】:

df = spark.read.json("1.json")
df_grouped = df.groupby('name').agg(F.collect_list('value').alias('values')).write \
    .save("mxrecord22", format="json")
NameError: name 'F' is not defined
You need to import the functions module: from pyspark.sql import functions as F
AttributeError: 'DataFrame' object has no attribute 'collect_list'
import pyspark
from pyspark.sql.functions import col, countDistinct
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import *
from pyspark.sql import functions as F
spark.conf.set("spark.sql.caseSensitive", "true")
spark.conf.set("spark.sql.debug.maxToStringFields", 1000)
df = spark.read.json("1.json")
df = df.groupby('name').agg(df.collect_list('value').alias('values')).write \
    .save("mxrecord22", format="json")
It is not df.collect_list, it is F.collect_list.