如何在pyspark中的数据框之间进行连接答案

【问题标题】：How to make join between dataframes of pspark如何在pyspark中的数据框之间进行连接
【发布时间】：2021-03-08 16:48:52
【问题描述】：

我有两个DataFrame，分别叫DF1和DF2，每个DataFrame的内容如下：

df1:

line_item_usage_account_id  line_item_unblended_cost    name 
100000000001                12.05                       account1
200000000001                52                          account2
300000000003                12.03                       account3

df2:

accountname     accountproviderid   clustername     app_pmo     app_costcenter
account1        100000000001        cluster1        111111      11111111
account2        200000000001        cluster2        222222      22222222

我需要为字段 df1.line_item_usage_account_id 和 df2.accountproviderid 进行连接

当两个字段具有相同的 ID 时，必须添加 DF1 line_item_unblended_cost 列的值。而当DF1的line_item_usage_account_id字段的值不在DF2的accountproviderid列时，df1字段必须按如下方式聚合：

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52
account3        300000000003        NA              NA          NA                  12.03

account3 数据通过填充 DF2 的“na”列添加到新 DataFrame 的末尾。

任何帮助提前谢谢。

【问题讨论】：

标签： python-3.x pyspark pyspark-dataframes

【解决方案1】：

from pyspark.sql import SparkSession   
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    [100000000001, 12.05, 'account1'], 
    [200000000001, 52.00, 'account2'], 
    [300000000003, 12.03, 'account3']], 
    schema=['line_item_usage_account_id',  'line_item_unblended_cost', 'name' ])

df1.show()
df1.printSchema()

df2 = spark.createDataFrame([
    ['account1', 100000000001, 'cluster1', 111111, 11111111],
    ['account2', 200000000001, 'cluster2', 222222, 22222222]], 
    schema=['accountname', 'accountproviderid', 'clustername', 'app_pmo', 'app_costcenter'])

df2.printSchema()
df2.show()

cols = ['name', 'line_item_usage_account_id', 'clustername', 'app_pmo', 'app_costcenter', 'line_item_unblended_cost']
resDF = df1.join(df2, df1.line_item_usage_account_id == df2.accountproviderid, "leftouter").select(*cols).withColumnRenamed('name', 'accountname').withColumnRenamed('line_item_usage_account_id', 'accountproviderid').orderBy('accountname')

resDF.printSchema()
 # |-- accountname: string (nullable = true)
 # |-- accountproviderid: long (nullable = true)
 # |-- clustername: string (nullable = true)
 # |-- app_pmo: long (nullable = true)
 # |-- app_costcenter: long (nullable = true)
#  |-- line_item_unblended_cost: double (nullable = true)

resDF.show()
# +-----------+-----------------+-----------+-------+--------------+------------------------+
# |accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
# +-----------+-----------------+-----------+-------+--------------+------------------------+
# |   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
# |   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
# |   account3|     300000000003|       null|   null|          null|                   12.03|
# +-----------+-----------------+-----------+-------+--------------+------------------------+

【讨论】：