【Question title】: How to perform a join operation on PySpark DataFrames?
【Posted】: 2019-05-21 05:17:42
【Question description】:

I have two DataFrames, dd1 and dd2, and I want to join them.

dd1:

id name
 1  red
 2  green
 3  yellow
 4  black
 5  pink
 6  blue
 7  white
 8  grey

dd2:

  id  name1
   1  banana
   2  apple
   4  orange
   8  grapes
   9  leamon

I want the output to look like this, in the dd1 DataFrame:

id name     name1
 1  red     banana
 2  green   apple
 3  yellow  NULL
 4  black   orange
 5  pink    NULL 
 6  blue    NULL
 7  white   NULL
 8  grey    grapes

【Discussion】:

    Tags: dataframe pyspark apache-spark-sql


    【Solution 1】:

    You can try this code:

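    # `spark` is the SparkSession, predefined in the pyspark shell and in notebooks.
    df = spark.createDataFrame(
        [(1, 'red'), (2, 'green'), (3, 'yellow'), (4, 'black'), (5, 'pink'),
         (6, 'blue'), (7, 'white'), (8, 'grey')], ["id", "name"])
    df.show()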
    
    +---+------+
    | id|  name|
    +---+------+
    |  1|   red|
    |  2| green|
    |  3|yellow|
    |  4| black|
    |  5|  pink|
    |  6|  blue|
    |  7| white|
    |  8|  grey|
    +---+------+
    
    df1 = spark.createDataFrame(
        [(1, 'banana'), (2, 'apple'), (4, 'orange'), (8, 'grapes'), (9, 'leamon')],
        ["id1", "name1"])
    df1.show()
    
    +---+------+
    |id1| name1|
    +---+------+
    |  1|banana|
    |  2| apple|
    |  4|orange|
    |  8|grapes|
    |  9|leamon|
    +---+------+
    
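    # A left join keeps every row of df; rows with no match in df1 get null in name1.
    condition = [df.id == df1.id1]
    left_join = df.join(df1, condition, how='left')
    
    # Drop the duplicate key column from df1, then sort by id.
    left_join = left_join.drop("id1")
    left_join = left_join.orderBy("id")
    
    left_join.show()  # in a Databricks notebook, display(left_join) also works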
    
    +---+------+------+
    | id|  name| name1|
    +---+------+------+
    |  1|   red|banana|
    |  2| green| apple|
    |  3|yellow|  null|
    |  4| black|orange|
    |  5|  pink|  null|
    |  6|  blue|  null|
    |  7| white|  null|
    |  8|  grey|grapes|
    +---+------+------+
    
    

    【Discussion】:
