【Question title】:Pyspark Join data frame
【Posted】:2022-01-25 11:44:11
【Question description】:

I have two Spark DataFrames.

df1

id    product  price
0     x        100
1     y        120
2     z        110
3     x        150
4     x        100

and df2

id    unique_products 
0     x        
1     y        
2     z         

How can I get this result:

id    unique_products  prices
0     x                [100, 150, 100]                      
1     y                [120]
2     z                [110]

【Question comments】:

    Tags: python dataframe pyspark


    【Solution 1】:

    You can group df1 by product and apply collect_list to price, then join the result with df2 on the product name to get the id.

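    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Reuse the active session (e.g. in a shell or notebook) or create one.
    spark = SparkSession.builder.getOrCreate()

    data1 = [(0, "x", 100),
             (1, "y", 120),
             (2, "z", 110),
             (3, "x", 150),
             (4, "x", 100)]

    data2 = [(0, "x"), (1, "y"), (2, "z")]

    df1 = spark.createDataFrame(data1, ("id", "product", "price"))
    df2 = spark.createDataFrame(data2, ("id", "unique_products"))

    # Collect all prices per product, then rename the key column to match df2.
    df_prices = (
        df1.groupBy("product")
        .agg(F.collect_list("price").alias("prices"))
        .selectExpr("product as unique_products", "prices")
    )

    # Inner join on the product name to pick up the id from df2.
    df2.join(df_prices, ["unique_products"]).select("id", "unique_products", "prices").show()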
    
    

    Output

    +---+---------------+---------------+
    | id|unique_products|         prices|
    +---+---------------+---------------+
    |  0|              x|[100, 150, 100]|
    |  1|              y|          [120]|
    |  2|              z|          [110]|
    +---+---------------+---------------+
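
For readers who want to see what the group-then-join is doing without a Spark session, here is a minimal plain-Python sketch of the same two steps. The names `df1_rows`, `df2_rows`, `prices_by_product`, and `result` are made up for this illustration and are not part of the answer above. One difference to keep in mind: plain Python lists keep insertion order, whereas Spark's collect_list does not guarantee the order of collected values after a shuffle.

```python
from collections import defaultdict

# Same sample rows as df1 and df2, written as plain tuples (illustrative names).
df1_rows = [(0, "x", 100), (1, "y", 120), (2, "z", 110), (3, "x", 150), (4, "x", 100)]
df2_rows = [(0, "x"), (1, "y"), (2, "z")]

# Step 1: group by product and collect the prices (the collect_list step).
prices_by_product = defaultdict(list)
for _row_id, product, price in df1_rows:
    prices_by_product[product].append(price)

# Step 2: inner join on the product name to pick up the id from df2.
result = [
    (product_id, product, prices_by_product[product])
    for product_id, product in df2_rows
    if product in prices_by_product
]

for row in result:
    print(row)
```

If the order of prices inside each list matters in Spark, it would need to be enforced explicitly rather than relied on.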
    

    【Comments】:
