【Question title】:Combine the MX values with the same name in one line (PySpark)
【Posted】:2020-10-18 00:43:39
【Question】:

I want to convert these records:

    {"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"20 alt1.aspmx.l.google.com"}
    {"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"20 alt2.aspmx.l.google.com"}
    {"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"30 aspmx2.googlemail.com"}
    {"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"30 aspmx3.googlemail.com"}
    {"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"20 alt1.aspmx.l.google.com"}
    {"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"20 alt2.aspmx.l.google.com"}
    {"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"30 aspmx2.googlemail.com"}
    {"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"30 aspmx3.googlemail.com"}

    test.printSchema()
    root
     |-- name: string (nullable = true)
     |-- timestamp: string (nullable = true)
     |-- type: string (nullable = true)
     |-- value: string (nullable = true)

I want to combine the MX values that share the same name into a single line in PySpark. The result I want:

   { "timestamp":"1601093713", "name":"exmple1.com", "type":"mx", "value":" alt1.aspmx.l.google.com,alt2.aspmx.l.google.com , aspmx2.googlemail.com, aspmx3.googlemail.com" }
   { "timestamp":"1601093713", "name":"exmple2.com", "type":"mx", "value":" alt1.aspmx.l.google.com, alt2.aspmx.l.google.com , aspmx2.googlemail.com, aspmx3.googlemail.com" }

【Comments】:

    Tags: python apache-spark pyspark apache-spark-sql


    【Solution 1】:

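    You can do this with groupBy, agg, and collect_list [docs (external link)]. Note that this produces a list of values rather than a string. If you need a string, the conversion is described in "Convert PySpark dataframe column from list to string".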

    from pyspark.sql import functions as F
    df_grouped = df.groupby('name').agg(F.collect_list('value').alias('values'))
    

    A follow-up question is how you want to handle the other columns, e.g. timestamp or type.

    【Discussion】:

    • df = spark.read.json("1.json"); df_grouped = df.groupby('name').agg(F.collect_list('value').alias('values')).write.save("mxrecord22", format="json") gives NameError: name 'F' is not defined
    • You need to import the functions module: from pyspark.sql import functions as F
    • AttributeError: 'DataFrame' object has no attribute 'collect_list'
    • import pyspark; from pyspark.sql.functions import col, countDistinct; from pyspark.sql import SparkSession; from pyspark.sql.types import StructType, StructField, StringType, IntegerType; from pyspark.sql.functions import *; from pyspark.sql import functions as F; spark.conf.set("spark.sql.caseSensitive", "true"); spark.conf.set("spark.sql.debug.maxToStringFields", 1000); df = spark.read.json("1.json"); df = df.groupby('name').agg(df.collect_list('value').alias('values')).write.save("mxrecord22", format="json")
    • It is F.collect_list, not df.collect_list.