Filter a DataFrame based on another DataFrame in Scala: answers

[Question title]: Filter dataframe based on another data frame scala
[Posted]: 2018-06-25 17:37:02
[Question description]:

Currently I am doing:

val DF = sqlSession.sql("select itemIdDig as itemId, "
      + "title, "
      + "timestamp as time "
      + "from itemTable ")

val tempDF = sqlSession.sql("select itemIdDig as itemId "
      + "from itemTable "
      + "group by itemIdDig HAVING count(*) >= 10 ").rdd.map(r => r(0)).collect()


// keep rows whose itemId is not in tempDF
DF.filter(!col("itemId").isin(tempDF  : _*)).toDF

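But this is slow. Can someone suggest a better way to achieve this? Basically, I am looking for the rows whose itemId is not in tempDF (I tried using group by because it gives me unique itemIds, but I want to keep the duplicates).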

[Comments]:

Tags: sql scala apache-spark dataframe apache-spark-sql

[Solution 1]:

Just use a left anti join (the negation of a semi-join):

// tempDF must be a DataFrame here, i.e. built without .rdd.map(...).collect()
DF.join(tempDF, Seq("itemId"), "leftanti")
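
A minimal sketch of what the left anti join does, assuming a local SparkSession and a few made-up rows. The names `spark`, `items`, `frequent` and the sample data are illustrative and not from the question, and the threshold is 3 instead of 10 so that the tiny sample triggers it. The join keeps every row of the left DataFrame whose key has no match on the right, duplicates included, and the keys are never collected to the driver.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("anti-join-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up stand-in for itemTable: itemId 1 occurs 3 times, 2 twice, 3 once.
val items = Seq(
  (1, "a", 100L), (1, "a", 101L), (1, "a", 102L),
  (2, "b", 200L), (2, "b", 201L),
  (3, "c", 300L)
).toDF("itemId", "title", "time")

// Keys to exclude, kept as a DataFrame (no .collect()).
val frequent = items
  .groupBy("itemId")
  .count()
  .filter(col("count") >= 3)
  .select("itemId")

// Left anti join: rows of `items` whose itemId does NOT appear in `frequent`.
val result = items.join(frequent, Seq("itemId"), "leftanti")
result.show()
```

With this sample, `result` holds the two rows for itemId 2 and the single row for itemId 3, so duplicates on the left side survive the join.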

[Discussion]:

Care to explain the semi-join?
I think it is "left_anti". Your way gives me a result of a different size than my way.
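
One possible cause of such a size difference, offered as an assumption since the thread does not show the data, is null keys. A negated `isin` drops rows whose itemId is null, because the comparison evaluates to null rather than true, while a left anti join keeps them, because a null key matches nothing on the right. The sketch below uses made-up data and hypothetical names (`left`, `excluded`) to show the two behaviours side by side. It also uses the `"left_anti"` spelling, which recent Spark versions accept alongside `"leftanti"`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-key-sketch")
  .getOrCreate()
import spark.implicits._

// Option[Int] lets one row carry a null itemId.
val left = Seq[(Option[Int], String)](
  (Some(1), "a"), (Some(2), "b"), (None, "c")
).toDF("itemId", "title")

// The single key to exclude.
val excluded = Seq(1).toDF("itemId")

// NOT IN: for the null row the predicate is null, so filter drops it.
val viaIsin = left.filter(!col("itemId").isin(1))

// Left anti join: the null key matches nothing, so the row is kept.
val viaAntiJoin = left.join(excluded, Seq("itemId"), "left_anti")

println(s"isin: ${viaIsin.count()} rows, anti join: ${viaAntiJoin.count()} rows")
```

If the null-key rows should be excluded as well, adding a filter such as `col("itemId").isNotNull` before the join would make the two approaches agree on this sample.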