Spark/Pandas/SQL table joins: answers

【Question title】: Spark/Pandas/SQL table joins
【Posted】: 2021-10-15 00:48:25
【Question】:

I have the following two DataFrames and want to populate a final DataFrame by joining the two input DataFrames.

df1 (table1)

id    | code | name | location  | val | date
1000  | 1    | 'A'  | 'AABB'    | 1   | 2021-01-01
1000  | 2    | 'B'  | 'BBCC'    | 3   | 2021-01-01
1000  | 3    | 'C'  | 'CCDD'    | 4   | 2021-01-01
1000  | 4    | 'D'  | 'DDEE'    | 1   | 2021-01-01
2000  | 1    | 'E'  | 'EEFF'    | 5   | 2021-03-01
2000  | 2    | 'F'  | 'XXYY'    | 4   | 2021-03-01
2000  | 3    | 'G'  | 'YYZZ'    | 2   | 2021-03-01
2000  | 4    | 'H'  | 'ZZAA'    | 1   | 2021-03-01
2000  | 4    | 'I'  | 'IIII'    | 1   | 2021-03-01

df1.createOrReplaceTempView('df1')

df2 (table2)

id   | city | dist | state | count | tot_sum | date
1000 | null | null | null  | null  | null    | 2021-01-01
2000 | null | null | null  | null  | null    | 2021-03-01

df2.createOrReplaceTempView('df2')

df3 (table3), the desired result

id   | city   | dist   | state  | count | tot_sum | date
1000 | 'AABB' | 'BBCC' | 'CCDD' | 1     |  9      | 2021-01-01
2000 | 'EEFF' | 'XXYY' | 'YYZZ' | 2     |  13     | 2021-03-01

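Logic:

When code = 1, use location as city.

When code = 2, use location as dist.

When code = 3, use location as state.

When code = 4, count the records with that code for the id. For id 1000 there is only one record with code 4; for id 2000 there are two.

tot_sum is the sum of val over all records of the id. For id 1000 it is 1+3+4+1=9; for id 2000 it is 5+4+2+1+1=13.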
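The rules above can be sketched in plain Python as a quick check of the expected df3 values. This is only an illustration: the function name `flatten` and the `CODE_TO_COLUMN` mapping are hypothetical names introduced here, and the rows are copied from the df1 sample.

```python
# Rows copied from the df1 sample: (id, code, name, location, val, date)
rows = [
    (1000, 1, "A", "AABB", 1, "2021-01-01"),
    (1000, 2, "B", "BBCC", 3, "2021-01-01"),
    (1000, 3, "C", "CCDD", 4, "2021-01-01"),
    (1000, 4, "D", "DDEE", 1, "2021-01-01"),
    (2000, 1, "E", "EEFF", 5, "2021-03-01"),
    (2000, 2, "F", "XXYY", 4, "2021-03-01"),
    (2000, 3, "G", "YYZZ", 2, "2021-03-01"),
    (2000, 4, "H", "ZZAA", 1, "2021-03-01"),
    (2000, 4, "I", "IIII", 1, "2021-03-01"),
]

# Which output column each code feeds (codes 1-3 only; code 4 is counted instead).
CODE_TO_COLUMN = {1: "city", 2: "dist", 3: "state"}


def flatten(rows):
    """Collapse the per-code rows into one record per (id, date)."""
    out = {}
    for id_, code, _name, location, val, date in rows:
        rec = out.setdefault(
            (id_, date),
            {"id": id_, "city": None, "dist": None, "state": None,
             "count": 0, "tot_sum": 0, "date": date},
        )
        if code in CODE_TO_COLUMN:
            rec[CODE_TO_COLUMN[code]] = location
        elif code == 4:
            rec["count"] += 1
        # tot_sum covers every row of the id, not only the code-4 rows.
        rec["tot_sum"] += val
    return list(out.values())


expected_df3 = flatten(rows)
for rec in expected_df3:
    print(rec)
```

The two printed records match the df3 rows shown above (count 1 and tot_sum 9 for id 1000, count 2 and tot_sum 13 for id 2000).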

I tried something like the following, without success:


select d2.id as id,
       d2.date as date,
       CASE WHEN d1.code=1 then d1.location else null end as city,
       CASE WHEN d1.code=2 then d1.location else null end as dist,
       CASE WHEN d1.code=3 then d1.location else null end as state
FROM   df1 d1 join df2 d2 on d1.id=d2.id 




select d2.id,
       d2.date
       CASE WHEN d1.code=1 then state=d1.location,
       CASE WHEN d1.code=2 then dist=d1.location,
       CASE WHEN d1.code=3 then CityName=d1.location
FROM   df1 d1 join df2 d2 on d1.id=d2.id

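Any suggestions?

Note: I am looking for a SQL query (taking both input tables into account), a PySpark DataFrame solution, or a pandas DataFrame solution.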

DF1: [screenshot not preserved]

【Question comments】:

Tags: sql pandas dataframe apache-spark pyspark

【Solution 1】:

I don't really understand why you want to join with df2. All of your data is already in df1; this is just an aggregation.

Using SQL

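-- Aggregate df1 to one row per id, then join to df2 to keep only the ids of interest.
-- Note: select * also returns df2's own (all-null) city/dist/state/count/tot_sum columns.
select * from df2 inner join (
select id,
       -- second argument true = ignore nulls; without it first() may return null
       first(case when code=1 then location end, true) as city,
       first(case when code=2 then location end, true) as dist,
       first(case when code=3 then location end, true) as state,
       count(case when code=4 then 1 end) as count,   -- number of code-4 rows
       sum(val) as tot_sum,                           -- sum of val over all rows of the id
       date
from df1
group by id, date
) t on t.id = df2.id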

Using PySpark

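from pyspark.sql import functions as F

# One row per (id, date): pick the location for codes 1-3, count code-4 rows, sum val.
agg_df = df1.groupBy("id", "date").agg(
    F.first(F.when(F.col("code") == 1, F.col("location")), ignorenulls=True).alias("city"),
    F.first(F.when(F.col("code") == 2, F.col("location")), ignorenulls=True).alias("dist"),
    F.first(F.when(F.col("code") == 3, F.col("location")), ignorenulls=True).alias("state"),
    F.count(F.when(F.col("code") == 4, F.col("code"))).alias("count"),
    F.sum(F.col("val")).alias("tot_sum"),
)

# Join on id only; df2's own city/dist/state/count/tot_sum/date columns are dropped
# so they do not duplicate the aggregated ones.
df3 = agg_df.join(df2.select("id"), on="id")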
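The question also asks for a pandas variant, which none of the answers provides. The following is a minimal sketch of the same aggregation in pandas, assuming df1 holds the sample rows from the question and df2 is reduced to its non-null columns (id and date). The helper `location_for` and the temporary column `is_code4` are hypothetical names used only for illustration.

```python
import pandas as pd

# Sample data copied from the question; df2's all-null columns are omitted.
df1 = pd.DataFrame({
    "id": [1000] * 4 + [2000] * 5,
    "code": [1, 2, 3, 4, 1, 2, 3, 4, 4],
    "name": list("ABCDEFGHI"),
    "location": ["AABB", "BBCC", "CCDD", "DDEE", "EEFF", "XXYY", "YYZZ", "ZZAA", "IIII"],
    "val": [1, 3, 4, 1, 5, 4, 2, 1, 1],
    "date": ["2021-01-01"] * 4 + ["2021-03-01"] * 5,
})
df2 = pd.DataFrame({"id": [1000, 2000], "date": ["2021-01-01", "2021-03-01"]})


def location_for(code):
    # Keep location where the code matches and a missing value elsewhere;
    # the groupby "first" aggregation below skips missing values.
    return df1["location"].where(df1["code"] == code)


agg = (
    df1.assign(
        city=location_for(1),
        dist=location_for(2),
        state=location_for(3),
        is_code4=df1["code"].eq(4),
    )
    .groupby(["id", "date"], as_index=False)
    .agg(
        city=("city", "first"),
        dist=("dist", "first"),
        state=("state", "first"),
        count=("is_code4", "sum"),   # True counts as 1, so this counts code-4 rows
        tot_sum=("val", "sum"),      # sum of val over all rows of the id
    )
)

# Keep only the ids present in df2, then order the columns like df3 in the question.
df3 = df2[["id"]].merge(agg, on="id", how="inner")
df3 = df3[["id", "city", "dist", "state", "count", "tot_sum", "date"]]
print(df3)
```

Merging on `df2[["id"]]` rather than the full df2 avoids duplicate `date` columns with `_x`/`_y` suffixes.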

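【Comments】:

Thanks for the answer. DF1/Table1 contains a lot of other data that we do not want in the final result set. I did not post the full df1/table1 dataset (it has 100 columns, and I am only interested in these 6 to 7 columns). That is why DF3 is being created. I hope that clears it up.
@data_addict Well, once df1 is built correctly, you just need to join it with df2 on id ... see my edit. You were selecting the empty df2 columns.
I tested the above; it fills dist and state with nulls. Screenshot attached (in the question). Any suggestions, please?
@data_addict Show the content of df1 ... you probably have a problem there.
I have attached a screenshot of DF1 to the question. Could you take a look?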
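
One possible explanation for the nulls in dist and state, offered as an assumption since the screenshots are not available: Spark's `first` aggregate does not skip nulls unless asked to, so a `CASE WHEN` expression that is null on the first row of a group yields null. The plain-Python sketch below mimics that behaviour; the function `first` here is a stand-in written for illustration, not Spark's implementation.

```python
def first(values, ignore_nulls=False):
    """Mimic a 'first' aggregate: return the first value, optionally skipping None."""
    for v in values:
        if v is not None or not ignore_nulls:
            return v
    return None


# CASE WHEN code = 2 THEN location END, evaluated over the four rows of id 1000.
codes = [1, 2, 3, 4]
locations = ["AABB", "BBCC", "CCDD", "DDEE"]
dist_candidates = [loc if code == 2 else None for code, loc in zip(codes, locations)]

print(dist_candidates)                            # [None, 'BBCC', None, None]
print(first(dist_candidates))                     # None
print(first(dist_candidates, ignore_nulls=True))  # BBCC
```

This would also be consistent with city being populated while dist and state are not, if the code-1 row happens to come first in each group.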

【Solution 2】:

I think you just want aggregation:

select d2.id as id,
       d2.date as date,
       max(CASE WHEN d1.code=1 then d1.location end) as city,
       max(CASE WHEN d1.code=2 then d1.location end) as dist,
       max(CASE WHEN d1.code=3 then d1.location end) as state,
       -- count the code-4 rows (summing d1.val only matches when every such val is 1)
       sum(CASE WHEN d1.code=4 then 1 else 0 end) as cnt
from df1 d1 join
     df2 d2
     on d1.id=d2.id
group by d2.id, d2.date;
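
To check this kind of conditional aggregation without a Spark session, the query can be run against an in-memory SQLite database from Python. This is a sketch under assumptions: SQLite stands in for Spark SQL, df2 is reduced to its id and date columns, the code-4 rows are counted with `THEN 1` as the question's logic describes, and a `tot_sum` column is added.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE df1 (id INTEGER, code INTEGER, name TEXT, location TEXT, val INTEGER, date TEXT);
    CREATE TABLE df2 (id INTEGER, date TEXT);
""")

# Sample rows copied from the question.
con.executemany("INSERT INTO df1 VALUES (?, ?, ?, ?, ?, ?)", [
    (1000, 1, "A", "AABB", 1, "2021-01-01"),
    (1000, 2, "B", "BBCC", 3, "2021-01-01"),
    (1000, 3, "C", "CCDD", 4, "2021-01-01"),
    (1000, 4, "D", "DDEE", 1, "2021-01-01"),
    (2000, 1, "E", "EEFF", 5, "2021-03-01"),
    (2000, 2, "F", "XXYY", 4, "2021-03-01"),
    (2000, 3, "G", "YYZZ", 2, "2021-03-01"),
    (2000, 4, "H", "ZZAA", 1, "2021-03-01"),
    (2000, 4, "I", "IIII", 1, "2021-03-01"),
])
con.executemany("INSERT INTO df2 VALUES (?, ?)",
                [(1000, "2021-01-01"), (2000, "2021-03-01")])

query = """
    SELECT d2.id,
           MAX(CASE WHEN d1.code = 1 THEN d1.location END) AS city,
           MAX(CASE WHEN d1.code = 2 THEN d1.location END) AS dist,
           MAX(CASE WHEN d1.code = 3 THEN d1.location END) AS state,
           SUM(CASE WHEN d1.code = 4 THEN 1 ELSE 0 END)    AS cnt,
           SUM(d1.val)                                     AS tot_sum,
           d2.date
    FROM df1 d1
    JOIN df2 d2 ON d1.id = d2.id
    GROUP BY d2.id, d2.date
    ORDER BY d2.id
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
```

`MAX` works here because each group has at most one non-null location per code, and aggregate functions ignore nulls.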

【Comments】:

Thanks for the answer. Could you tell me how to populate tot_sum?
I was able to populate tot_sum: `select d2.id as id, d2.date as date, max(CASE WHEN d1.code=1 then d1.location end) as city, max(CASE WHEN d1.code=2 then d1.location end) as dist, max(CASE WHEN d1.code=3 then d1.location end) as state, sum(CASE WHEN d1.code=4 then d1.val else 0 end) as cnt, sum(val) as tot_sum from df1 d1 join df2 d2 on d1.id=d2.id group by d2.id, d2.date`. Thanks.

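【Solution 3】:

As I understand the question, you want to join df1 and df2 together.

Depending on the column you want to join on (I am assuming id), you can do it in SQL as follows: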

select t1.id, t1.code, t1.name,
t1.location, t1.val, t1.date, 
t2.city, t2.dist, t2.state, 
t2.count, t2.tot_sum from df1 as t1 
inner join df2 as t2 on t1.id=t2.id;

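【Comments】:

I want to join the two tables on the id column, but the join should be conditional. As I mentioned in the logic, if code=1 then the city column should be populated, and so on; the tables need to be joined along those lines. It is not a plain join. It is more like flattening the result.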