【问题标题】:join pysprak data to desire output将 pyspark 数据连接到所需的输出
【发布时间】:2022-01-04 03:57:02
【问题描述】:

我想加入两个数据集,如下所示:

数据集 1:

PIN           LOCATION
1234          Germany
2356          Poland
2894          England
3452          Bloomberg

数据集 2:

MAIL           STARTLOCATION      ENDLOCATION
ami@test.com        1234             2894
asd@test.com        2356             1234
cddv@test.com       3452             2894

输出应该是:

MAIL           STARTLOCATION      ENDLOCATION  LOCATION1       LOCATION2
ami@test.com        1234             2894       Germany         England
asd@test.com        2356             1234       poland          Germany
cddv@test.com       3452             2894      Bloomberg        England

试过了:

condi = [((df1.PIN == df2.STARTLOCATION) | (df1.PIN == df2.ENDLOCATION))]

joindata = df1.join(df2, on = condi, how = 'outer').select('*')

但它在LOCATION1LOCATION2 中给出NULL

【问题讨论】:

    标签: python-3.x apache-spark pyspark apache-spark-sql


    【解决方案1】:

    您必须加入两次才能获得所需的结果,因为您希望结果中有两个额外的列。

    df1.show()
    +----+---------+
    | PIN| LOCATION|
    +----+---------+
    |1234|  Germany|
    |2356|   Poland|
    |2894|  England|
    |3452|Bloomberg|
    +----+---------+
    
    df2.show()
    +-------------+-------------+-----------+
    |         MAIL|STARTLOCATION|ENDLOCATION|
    +-------------+-------------+-----------+
    | ami@test.com|         1234|       2894|
    | asd@test.com|         2356|       1234|
    |cddv@test.com|         3452|       2894|
    +-------------+-------------+-----------+
    
    df2.join(df1, df2.STARTLOCATION==df1.PIN)\
        .withColumnRenamed("LOCATION", "LOCATION1")\
        .drop("PIN")\
        .join(df1, df2.ENDLOCATION==df1.PIN)\
        .withColumnRenamed("LOCATION", "LOCATION2")\
        .drop("PIN")\
        .show()
    +-------------+-------------+-----------+---------+---------+
    |         MAIL|STARTLOCATION|ENDLOCATION|LOCATION1|LOCATION2|
    +-------------+-------------+-----------+---------+---------+
    |cddv@test.com|         3452|       2894|Bloomberg|  England|
    | ami@test.com|         1234|       2894|  Germany|  England|
    | asd@test.com|         2356|       1234|   Poland|  Germany|
    +-------------+-------------+-----------+---------+---------+
    

    【讨论】:

      【解决方案2】:

      如果你不介意使用sql语句来实现,可以试试:

      df1.createOrReplaceTempView('tmp1')
      df2.createOrReplaceTempView('tmp2')
      sql = """
          select a.MAIL,a.STARTLOCATION,a.ENDLOCATION,b.LOCATION as LOCATION1,c.LOCATION as LOCATION2
          from tmp2 a join tmp1 b on a.STARTLOCATION=b.PIN
              join tmp1 c on a.ENDLOCATION=c.PIN
      """
      df = spark.sql(sql)
      df.show(truncate=False)
      

      【讨论】:

        【解决方案3】:
        pyspark
        
        import findspark
        findspark.init()
        
        import pyspark
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import *
        from pyspark.sql.types import*
        import pandas as pd
        spark = SparkSession.builder.appName("Test").getOrCreate()
        
        df1=spark.createDataFrame([Row(PIN=1234, LOCATION='Germany'),Row(PIN=2356, LOCATION='Poland'), Row(PIN=2894, LOCATION='England'),Row(PIN=3452, LOCATION='Bloomberg')])
        
        df1.show()
        +----+---------+
        | PIN| LOCATION|
        +----+---------+
        |1234|  Germany|
        |2356|   Poland|
        |2894|  England|
        |3452|Bloomberg|
        +----+---------+
        
        df2=spark.createDataFrame([Row(MAIL='ami@test.com',STARTLOCATION=1234, ENDLOCATION=2894), Row(MAIL='asd@test.com',STARTLOCATION=2356, ENDLOCATION=1234),Row(MAIL='cddv@test.com',STARTLOCATION=3452, ENDLOCATION=2894)])
        
        df2.show()
        
        +-------------+-------------+-----------+
        |         MAIL|STARTLOCATION|ENDLOCATION|
        +-------------+-------------+-----------+
        | ami@test.com|         1234|       2894|
        | asd@test.com|         2356|       1234|
        |cddv@test.com|         3452|       2894|
        +-------------+-------------+-----------+
        
        df3= df2.join(df1, df1['PIN']==df2['STARTLOCATION'], 'inner')
        
        df3.show()
        
        +-------------+-------------+-----------+
        |         MAIL|STARTLOCATION|ENDLOCATION|
        +-------------+-------------+-----------+
        | ami@test.com|         1234|       2894|
        | asd@test.com|         2356|       1234|
        |cddv@test.com|         3452|       2894|
        +-------------+-------------+-----------+
        
        df3= df2.join(df1, df1['PIN']==df2['STARTLOCATION'], 'inner')
        
        df3.show()
        
        +-------------+-------------+-----------+----+---------+
        |         MAIL|STARTLOCATION|ENDLOCATION| PIN| LOCATION|
        +-------------+-------------+-----------+----+---------+
        |cddv@test.com|         3452|       2894|3452|Bloomberg|
        | asd@test.com|         2356|       1234|2356|   Poland|
        | ami@test.com|         1234|       2894|1234|  Germany|
        +-------------+-------------+-----------+----+---------+
        
        df4=df3.drop('PIN')
        df5=df4.withColumnRenamed('LOCATION', 'LOCATION1')
        df5.show()
        
        df6=df5.join(df1, df1['PIN']==df5['ENDLOCATION'], 'inner')
        
        final=df6.withColumnRenamed("LOCATION", "LOCATION2").drop('PIN')
        final.show()
        
        +-------------+-------------+-----------+---------+---------+
        |         MAIL|STARTLOCATION|ENDLOCATION|LOCATION1|LOCATION2|
        +-------------+-------------+-----------+---------+---------+
        |cddv@test.com|         3452|       2894|Bloomberg|  England|
        | ami@test.com|         1234|       2894|  Germany|  England|
        | asd@test.com|         2356|       1234|   Poland|  Germany|
        +-------------+-------------+-----------+---------+---------+
        
        
        

        【讨论】:

          猜你喜欢
          • 2018-01-31
          • 2015-11-16
          • 1970-01-01
          • 2018-06-23
          • 1970-01-01
          • 2018-06-13
          • 2023-03-22
          • 2021-10-11
          • 2017-08-13
          相关资源
          最近更新 更多