【问题标题】:How to make join between dataframes of pspark如何在pyspark中的数据框之间进行连接
【发布时间】:2021-03-08 16:48:52
【问题描述】:

我有两个DataFrame,分别叫DF1和DF2,每个DataFrame的内容如下:

df1:

line_item_usage_account_id  line_item_unblended_cost    name 
100000000001                12.05                       account1
200000000001                52                          account2
300000000003                12.03                       account3

df2:

accountname     accountproviderid   clustername     app_pmo     app_costcenter
account1        100000000001        cluster1        111111      11111111
account2        200000000001        cluster2        222222      22222222

我需要为字段 df1.line_item_usage_account_id 和 df2.accountproviderid 进行连接

当两个字段具有相同的 ID 时,必须添加 DF1 line_item_unblended_cost 列的值。 而当DF1的line_item_usage_account_id字段的值不在DF2的accountproviderid列时,df1字段必须按如下方式聚合:

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52
account3        300000000003        NA              NA          NA                  12.03

account3 数据通过填充 DF2 的“na”列添加到新 DataFrame 的末尾。

任何帮助提前谢谢。

【问题讨论】:

    标签: python-3.x pyspark pyspark-dataframes


    【解决方案1】:
    from pyspark.sql import SparkSession   
    spark = SparkSession.builder.getOrCreate()
    
    df1 = spark.createDataFrame([
        [100000000001, 12.05, 'account1'], 
        [200000000001, 52.00, 'account2'], 
        [300000000003, 12.03, 'account3']], 
        schema=['line_item_usage_account_id',  'line_item_unblended_cost', 'name' ])
    
    df1.show()
    df1.printSchema()
    
    df2 = spark.createDataFrame([
        ['account1', 100000000001, 'cluster1', 111111, 11111111],
        ['account2', 200000000001, 'cluster2', 222222, 22222222]], 
        schema=['accountname', 'accountproviderid', 'clustername', 'app_pmo', 'app_costcenter'])
    
    df2.printSchema()
    df2.show()
    
    cols = ['name', 'line_item_usage_account_id', 'clustername', 'app_pmo', 'app_costcenter', 'line_item_unblended_cost']
    resDF = df1.join(df2, df1.line_item_usage_account_id == df2.accountproviderid, "leftouter").select(*cols).withColumnRenamed('name', 'accountname').withColumnRenamed('line_item_usage_account_id', 'accountproviderid').orderBy('accountname')
    
    resDF.printSchema()
     # |-- accountname: string (nullable = true)
     # |-- accountproviderid: long (nullable = true)
     # |-- clustername: string (nullable = true)
     # |-- app_pmo: long (nullable = true)
     # |-- app_costcenter: long (nullable = true)
    #  |-- line_item_unblended_cost: double (nullable = true)
    
    resDF.show()
    # +-----------+-----------------+-----------+-------+--------------+------------------------+
    # |accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
    # +-----------+-----------------+-----------+-------+--------------+------------------------+
    # |   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
    # |   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
    # |   account3|     300000000003|       null|   null|          null|                   12.03|
    # +-----------+-----------------+-----------+-------+--------------+------------------------+
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 2020-03-27
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2018-06-13
      相关资源
      最近更新 更多