【Question title】: Join based on a column value
【Posted】: 2020-07-28 21:58:40
【Question】:

I am using spark-sql-2.4.1v. How can I perform different joins depending on the value of a column?

Sample data

val data = List(
  ("20", "score", "school",  14 ,12),
  ("21", "score", "school",  13 , 13),
  ("22", "rate", "school",  11 ,14),
  ("21", "rate", "school",  13 ,12)
 )
val df = data.toDF("id", "code", "entity", "value1","value2")

+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school|    14|    12|
| 21|score|school|    13|    13|
| 22| rate|school|    11|    14|
| 21| rate|school|    13|    12|
+---+-----+------+------+------+

Depending on the value of the "code" column, I need to join with different tables.

val data1 = List(
  ("22", 11, "A"),
  ("22", 14, "B"),
  ("20", 13, "C"),
  ("21", 12, "C"),
  ("21", 13, "D")
)

val rateDs = data1.toDF("id", "map_code", "map_val")

val scoreDs = // scoreTable 

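If the "code" column value is "rate", I need to join with rateDs; if the "code" column value is "score", I need to join with scoreDs.

How can this be handled in Spark? Is there an optimal way to achieve it?

Expected result for the "rate" rows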

+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 22| rate|school|     A|     B|
| 21| rate|school|     D|     C|
+---+-----+------+------+------+

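【Comments】:

  • You can filter the data into two dataframes, join each one with its lookup dataframe, and union the results again
  • @koiralo Thanks, can a "when" clause be used?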
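
The filter, join, and union approach suggested in the first comment could look roughly like the sketch below. It is only an illustration under assumptions: the question never shows `scoreDs`, so its schema and rows here are hypothetical (copied from the shape of `rateDs`), and `joinSlice` is a helper name invented for this example. In spark-shell the `spark` session already exists, so the two session lines can be dropped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("filter-join-union").getOrCreate()
import spark.implicits._

val df = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("21", "rate", "school", 13, 12)
).toDF("id", "code", "entity", "value1", "value2")

val rateDs = List(
  ("22", 11, "A"), ("22", 14, "B"), ("20", 13, "C"), ("21", 12, "C"), ("21", 13, "D")
).toDF("id", "map_code", "map_val")

// Hypothetical: scoreDs is not shown in the question; same schema as rateDs is assumed.
val scoreDs = List(
  ("20", 14, "X"), ("20", 12, "Y"), ("21", 13, "Z")
).toDF("id", "map_code", "map_val")

// Keep only the rows with the given code, then look up value1 and value2
// in the matching table (hypothetical helper, not part of any Spark API).
def joinSlice(code: String, lookup: DataFrame): DataFrame =
  df.filter(col("code") === code).as("a")
    .join(lookup.as("b"),
      col("a.id") === col("b.id") && col("a.value1") === col("b.map_code"), "left")
    .join(lookup.as("c"),
      col("a.id") === col("c.id") && col("a.value2") === col("c.map_code"), "left")
    .select(col("a.id"), col("a.code"), col("a.entity"),
      col("b.map_val").as("value1"), col("c.map_val").as("value2"))

// union matches columns by position, so both slices must share the same column order.
val result = joinSlice("rate", rateDs).union(joinSlice("score", scoreDs))
result.show(false)
```

Each slice is joined only against its own lookup table, so the join conditions stay simple; the cost is one extra pass over `df` per distinct code.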

Tags: apache-spark apache-spark-sql


【Solution 1】:

For example, you can simply join twice

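// col and lit are used in the join conditions below.
import org.apache.spark.sql.functions.{col, lit}
// Needed for toDF; spark-shell imports this automatically.
import spark.implicits._

val data = List(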
  ("20", "score", "school",  14 , 12),
  ("21", "score", "school",  13 , 13),
  ("22", "rate", "school",  11 , 14),
  ("21", "rate", "school",  13 , 12)    
 )
val df = data.toDF("id", "code", "entity", "value1","value2")

val data1 = List(
  ("22", 11 ,"A"),
  ("22", 14 ,"B"),
  ("20", 13 ,"C"),
  ("21", 12 ,"C"),
  ("21", 13 ,"D")
)
val rateDF = data1.toDF("id", "map_code","map_val")

df.as("a")
  .join(rateDF.as("b"),
       col("a.code") === lit("rate") 
        && col("a.id") === col("b.id") 
        && col("a.value1") === col("b.map_code"), "inner")
  .join(rateDF.as("c"),
       col("a.code") === lit("rate") 
        && col("a.id") === col("c.id") 
        && col("a.value2") === col("c.map_code"), "inner")
  .select(col("a.id"), col("a.code"), col("a.entity"), col("b.map_val").as("value1"), col("c.map_val").as("value2"))
  .show(false)

+---+----+------+------+------+
|id |code|entity|value1|value2|
+---+----+------+------+------+
|22 |rate|school|A     |B     |
|21 |rate|school|D     |C     |
+---+----+------+------+------+

Well, this looks a bit messy, but I have no better idea for handling multiple columns...
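
One possible way to avoid writing one join per value column by hand is to fold over the column names. This is only a sketch and not the answerer's method: `valueCols` and the `m_` prefixed column names are hypothetical, and it handles the "rate" rows only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("fold-joins").getOrCreate()
import spark.implicits._

val df = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("21", "rate", "school", 13, 12)
).toDF("id", "code", "entity", "value1", "value2")

val rateDF = List(
  ("22", 11, "A"), ("22", 14, "B"), ("20", 13, "C"), ("21", 12, "C"), ("21", 13, "D")
).toDF("id", "map_code", "map_val")

// Hypothetical list of the columns whose codes should be replaced by map_val.
val valueCols = Seq("value1", "value2")

val rateRows = df.filter(col("code") === "rate")

// For each value column: left join the lookup table on (id, code),
// then replace the numeric column with the looked-up map_val.
val mapped = valueCols.foldLeft(rateRows) { (acc, c) =>
  // Rename the lookup columns so they cannot clash with columns already in acc.
  val lookup = rateDF.select(
    col("id").as("m_id"), col("map_code").as("m_code"), col("map_val").as("m_val"))
  acc
    .join(lookup, col("id") === col("m_id") && col(c) === col("m_code"), "left")
    .drop("m_id", "m_code", c)
    .withColumnRenamed("m_val", c)
}

// Restore the original column order.
val result = mapped.select("id", (Seq("code", "entity") ++ valueCols): _*)
result.show(false)
```

This still performs one join per value column; it only removes the repetition from the code.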

【Comments】:

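  • Thanks, can a "when" clause be used? I guess this join will affect performance.
  • That is not recommended when joining tables.
  • What does coalesce("b.value1", "c.value1") do here??
  • When code = rate, b.value1 is not null and c.value1 is null; when code = score it is the other way around. So coalesce collects these two results into one column, but that is up to you; this is just an example.
  • After joining on the condition col("a.id") === col("b.id"), "left"), I need to loop over multiple column values, i.e. "value_1" & "value_2" from "df" in "rateDs" .... How can this be handled?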
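
The coalesce idea discussed in the comments above can be sketched as follows. The column names differ from the `b.value1` / `c.value1` mentioned in the comment because that version of the answer is not shown; here the looked-up column is `map_val`, only `value1` is mapped, and `scoreDs` is a hypothetical table with the same schema as `rateDs`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

val spark = SparkSession.builder().master("local[*]").appName("coalesce-join").getOrCreate()
import spark.implicits._

val df = List(
  ("20", "score", "school", 14, 12),
  ("21", "score", "school", 13, 13),
  ("22", "rate", "school", 11, 14),
  ("21", "rate", "school", 13, 12)
).toDF("id", "code", "entity", "value1", "value2")

val rateDs = List(
  ("22", 11, "A"), ("22", 14, "B"), ("20", 13, "C"), ("21", 12, "C"), ("21", 13, "D")
).toDF("id", "map_code", "map_val")

// Hypothetical: scoreDs is not shown in the question; same schema as rateDs is assumed.
val scoreDs = List(
  ("20", 14, "X"), ("20", 12, "Y"), ("21", 13, "Z")
).toDF("id", "map_code", "map_val")

// Left join both lookup tables; the code condition ensures that each row
// can match at most one of them, so the other side is always null.
val result = df.as("a")
  .join(rateDs.as("r"),
    col("a.code") === lit("rate")
      && col("a.id") === col("r.id")
      && col("a.value1") === col("r.map_code"), "left")
  .join(scoreDs.as("s"),
    col("a.code") === lit("score")
      && col("a.id") === col("s.id")
      && col("a.value1") === col("s.map_code"), "left")
  .select(col("a.id"), col("a.code"), col("a.entity"),
    // coalesce returns the first non-null argument, i.e. whichever side matched.
    coalesce(col("r.map_val"), col("s.map_val")).as("value1"))

result.show(false)
```

A `when(col("a.code") === "rate", col("r.map_val")).otherwise(col("s.map_val"))` expression in the select would pick the same column here; either way, both joins are still executed.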