[Posted]: 2018-06-25 17:37:02
[Question]:
Currently I am doing:
import org.apache.spark.sql.functions.col

// all rows, duplicates included
val DF = sqlSession.sql("select itemIdDig as itemId, "
  + "title, "
  + "timestamp as time "
  + "from itemTable ")
// itemIds that occur at least 10 times, collected to the driver
val tempDF = sqlSession.sql("select itemIdDig as itemId "
  + "from itemTable "
  + "group by itemIdDig HAVING count(*) >= 10 ").rdd.map(r => r(0)).collect()
// keep the rows of DF whose itemId is not in tempDF
DF.filter(!col("itemId").isin(tempDF : _*)).toDF
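But this is very slow. Can anyone suggest a better way to achieve this? Basically, I am looking for the rows whose itemId is not in tempDF. (I tried using group by because it gives me the unique itemIds, but I want to keep the duplicate rows.)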
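For illustration, one commonly suggested way to express "rows whose itemId is not in tempDF" without collecting the ids to the driver is a left anti join. The following is only a sketch under assumptions: the object name `AntiJoinSketch`, the helper `dropFrequentItems`, and the sample data are hypothetical, and whether it is actually faster depends on the data and the cluster.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AntiJoinSketch {
  // Hypothetical helper: keep every row (duplicates included) whose itemId
  // occurs fewer than `minCount` times in `df`.
  def dropFrequentItems(df: DataFrame, minCount: Int): DataFrame = {
    // itemIds that occur at least `minCount` times; this stays a DataFrame,
    // nothing is collected to the driver
    val frequentIds = df.groupBy("itemId").count()
      .filter(col("count") >= minCount)
      .select("itemId")
    // left_anti keeps only the left-side rows that have no match on the right
    df.join(frequentIds, Seq("itemId"), "left_anti")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for itemTable:
    // itemId 1 appears 10 times, itemId 2 appears twice
    val rows = Seq.fill(10)((1L, "a", 100L)) ++ Seq((2L, "b", 200L), (2L, "b", 201L))
    rows.toDF("itemIdDig", "title", "timestamp").createOrReplaceTempView("itemTable")

    val df = spark.sql(
      "select itemIdDig as itemId, title, timestamp as time from itemTable")

    dropFrequentItems(df, 10).show()
    spark.stop()
  }
}
```

One difference from the `isin` filter is worth checking against real data: rows with a null itemId are dropped by `!col("itemId").isin(...)`, whereas the anti join keeps them, because a null key never matches.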
[Discussion]:
Tags: sql scala apache-spark dataframe apache-spark-sql