【问题标题】:Which join will Spark choose when all the selection criteria are not met?当不满足所有选择条件时,Spark 将选择哪个连接?
【发布时间】:2021-03-29 01:25:04
【问题描述】:

我们知道在Spark中有三种join——Broadcast Join、Shuffle Join和Sort-Merge Join:

  • 小表join大表时,使用Broadcast Join;
  • 当小表大于 BroadcastJoinThreshold 时,使用 Shuffle Join;
  • 当大表join,且join key可以排序时,使用Sort-Merge Join;

如果有两个大表连接,连接键无法排序怎么办? Spark 会选择哪种连接类型?

【问题讨论】:

  • 请详细说明连接键无法排序...代码...

标签: apache-spark join apache-spark-sql


【解决方案1】:

Spark 3.0 及更高版本支持这些类型的连接:

  • 广播哈希联接 (BHJ)
  • 随机散列连接
  • 随机排序合并连接 (SMJ)
  • 广播嵌套循环连接 (BNLJ)
  • 笛卡尔积加入

SparkStrategies.scala 的源代码中最好地概述了他们的选择:

  /**
   * Select the proper physical plan for join based on join strategy hints, the availability of
   * equi-join keys and the sizes of joining relations. Below are the existing join strategies,
   * their characteristics and their limitations.
   *
   * - Broadcast hash join (BHJ):
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *     BHJ usually performs faster than the other join algorithms when the broadcast side is
   *     small. However, broadcasting tables is a network-intensive operation and it could cause
   *     OOM or perform badly in some cases, especially when the build/broadcast side is big.
   *
   * - Shuffle hash join:
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *
   * - Shuffle sort merge join (SMJ):
   *     Only supported for equi-joins and the join keys have to be sortable.
   *     Supported for all join types.
   *
   * - Broadcast nested loop join (BNLJ):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports all the join types, but the implementation is optimized for:
   *       1) broadcasting the left side in a right outer join;
   *       2) broadcasting the right side in a left outer, left semi, left anti or existence join;
   *       3) broadcasting either side in an inner-like join.
   *     For other cases, we need to scan the data multiple times, which can be rather slow.
   *
   * - Shuffle-and-replicate nested loop join (a.k.a. cartesian product join):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports only inner like joins.
   */
object JoinSelection extends Strategy with PredicateHelper { ...

如上所述,应用选择的结果不仅取决于表的大小和键的可排序性,还取决于连接类型(INNERLEFT/RIGHTFULL)和连接键条件(等值与非等值/θ)。总体而言,在您的情况下,您可能会查看 Shuffle Hash 或 Broadcast Nested Loop。

【讨论】:

    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2017-07-15
    • 2017-10-22
    • 2019-04-24
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多