Join dataframes matching a column with a range determined by two columns in the other one with PySpark

【Question title】: Join dataframes matching a column with a range determined by two columns in the other one with PySpark
【Posted】: 2021-12-29 14:31:58
【Question】:

I have a df on the left like this:

+----+-----+
|  id|value|
+----+-----+
|   2|   xx|
|   4|   xx|
|  11|   xx|
|  14|   xx|
|  27|   xx|
|  28|   xx|
|  56|   xx|
|  55|   xx|
+----+-----+

And another one on the right like this:

+-----+---+----+
|start|end| ov |
+-----+---+----+
|    0|  9|   A|
|   10| 19|   B|
|   20| 29|   C|
|   30| 39|   D|
|   40| 49|   F|
+-----+---+----+

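I need to join the rows when the id of the first table falls within the start-end range of the second table. The output should look like this: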

+----+-----+----+
|  id|value| ov |
+----+-----+----+
|   2|   xx|   A|
|   4|   xx|   A|
|  11|   xx|   B|
|  14|   xx|   B|
|  27|   xx|   C|
|  28|   xx|   C|
|  56|   xx|    |
|  55|   xx|    |
+----+-----+----+

How can I achieve this result with PySpark?

【Question comments】:

Tags: python pyspark

【Solution 1】:

Use the between operator together with a left join.

Example:

# Using the DataFrame API
df.join(df1, (df['id'] >= df1['start']) & (df['id'] <= df1['end']), 'left').select(df['*'], df1['ov']).show(10, False)

# Using the Spark SQL API
df.createOrReplaceTempView("t1")
df1.createOrReplaceTempView("t2")
spark.sql("select t1.*, t2.ov from t1 left join t2 on t1.id between t2.start and t2.end").show()

# Sample output (illustrative data only; ids with no matching range get null)
#+---+-----+----+
#| id|value|  ov|
#+---+-----+----+
#|  2|   xx|   A|
#|  4|   xx|   A|
#| 55|   zz|null|
#+---+-----+----+

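【Comments】:

A quick side note: when the join condition is not an equality, Spark cannot use a hash or sort-merge join and effectively compares every pair of rows (a nested-loop join or Cartesian product between the dataframes). If one of the dataframes is large, it helps to expand the range dataframe on the right side of the join into one row per id value, so that the join becomes a plain equality join on id. Here f is pyspark.sql.functions and df1 is the range dataframe:

df1 = (
    df1
    .withColumn('id', f.explode(f.sequence(f.col('start'), f.col('end'))))
    .select('id', 'ov')
)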