【发布时间】:2021-10-20 10:58:03
【问题描述】:
我正在创建一个函数,它将连接键和条件作为参数并动态连接两个数据帧。
我了解 Spark Scala Dataframe join done the following ways:
1) join(right: Dataset[_]): DataFrame
2) join(right: Dataset[_], usingColumn: String): DataFrame
3) join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
4) join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
5) join(right: Dataset[_], joinExprs: Column): DataFrame
6) join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
连接键/usingColumns 参数将是列名列表。
condition/joinExprs - 不知道如何传递它,但它可以是像"df2(colname) == 'xyz'"这样的字符串
Based on this post,我想出了以下内容。它负责连接键列表,但我怎样才能添加条件呢? (注意:为简单起见,我在这里使用了相同的数据框)
%scala
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined","dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
val empDF2 = emp.toDF(empColumns:_*)
val join_keys = Seq("emp_id","name") // this will be a parameter
val joinExprs = join_keys.map{case (c1) => empDF(c1) === empDF2(c1)}.reduce(_ && _)
// How do I add to joinExprs, another joinExpr like "empDF2(dept_id) == 10" here?
empDF.join(empDF2,joinExprs,"inner").show(false)
【问题讨论】:
标签: scala dataframe apache-spark join