【Question title】: When are Keys Not Sortable in Sort Merge Join in Spark?
【Posted】: 2022-02-05 22:11:32
【Question】:

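When I read about Sort Merge Join, it says this is the most preferred join strategy in Spark after Broadcast join, but only if the join keys are sortable. My question is: when would a join key not be sortable? I thought any data type could be sorted. Can you help me understand a scenario in which a key cannot be sorted?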

【Comments】:

    Tags: apache-spark join optimization pyspark bigdata


    【Answer 1】:

    See https://www.waitingforcode.com/apache-spark-sql/sort-merge-join-spark-sql/read. It is a great site.

    Not all types are sortable. One example is CalendarIntervalType.

    Quote:

    "for not sortable keys the sort merge join" should "not be used" in {
    import sparkSession.implicits._
    // Here we explicitly define the schema. Thanks to that we can show
    // the case when sort-merge join won't be used, i.e. when the key is not sortable
    // (there are other cases - when broadcast or shuffle joins can be chosen over sort-merge
    //  but it's not shown here).
    // Globally, a "sortable" data type is:
    // - NullType, one of AtomicType
    // - StructType having all fields sortable
    // - ArrayType typed to sortable field
    // - User Defined DataType backed by a sortable field
    // The method checking sortability is   org.apache.spark.sql.catalyst.expressions.RowOrdering.isOrderable
    // As  you see, CalendarIntervalType is not included in any of above points,
    // so even if the data structure is the same (id + login for customers, id + customer id + amount for orders)
    // with exactly the same number of rows, the sort-merge join won't be applied here.
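
The rule listed in the quoted comment is recursive, so it can be modelled in a few lines. The following is a minimal sketch in plain Python, not Spark's implementation: the classes and the `is_orderable` function are hypothetical stand-ins for Spark's data types and for `RowOrdering.isOrderable`, written only to mirror the bullet points above.

```python
from dataclasses import dataclass
from typing import Tuple


class DataType:
    """Base class of a toy type model (hypothetical, not Spark's classes)."""


@dataclass(frozen=True)
class NullType(DataType):
    pass


@dataclass(frozen=True)
class AtomicType(DataType):
    name: str  # e.g. "int", "string", "timestamp"


@dataclass(frozen=True)
class CalendarIntervalType(DataType):
    pass


@dataclass(frozen=True)
class ArrayType(DataType):
    element_type: DataType


@dataclass(frozen=True)
class StructType(DataType):
    fields: Tuple[DataType, ...]


@dataclass(frozen=True)
class UserDefinedType(DataType):
    sql_type: DataType  # the underlying type that backs the UDT


def is_orderable(dt: DataType) -> bool:
    """Apply the rule from the quoted comment, one case per bullet."""
    if isinstance(dt, (NullType, AtomicType)):
        return True
    if isinstance(dt, StructType):
        return all(is_orderable(f) for f in dt.fields)
    if isinstance(dt, ArrayType):
        return is_orderable(dt.element_type)
    if isinstance(dt, UserDefinedType):
        return is_orderable(dt.sql_type)
    # Anything not covered above (here: CalendarIntervalType) is not sortable.
    return False
```

In this model a single non-sortable field is enough to make the whole key non-sortable: `is_orderable(StructType((AtomicType("int"), CalendarIntervalType())))` evaluates to `False`, while a struct of atomic fields is accepted.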
    

    That is an old post; as of v3 it seems this type can be compared. https://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/types/CalendarIntervalType.html

    But it proves the point.

    Also, what about non-equi joins?
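
To make the dependency on ordering and equality concrete, here is a minimal sketch of a sort-merge inner join in plain Python. It is an illustration of the general algorithm, not Spark's implementation; the function name `sort_merge_join` and its `left_key` / `right_key` parameters are assumptions chosen for this example.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Inner-join two lists of rows on keys that support < and ==."""
    # Sort phase: this step is impossible if the keys have no ordering.
    left = sorted(left, key=left_key)
    right = sorted(right, key=right_key)

    out = []
    i = j = 0
    # Merge phase: advance whichever side currently has the smaller key.
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1
        elif rk < lk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right_key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left_key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append((left[i], r))
                i += 1
            j = j_end
    return out


customers = [(2, "bob"), (1, "ann"), (3, "cy")]          # (id, login)
orders = [(10, 1, 5.0), (11, 3, 7.5), (12, 1, 2.0), (13, 4, 1.0)]  # (id, customer id, amount)

joined = sort_merge_join(customers, orders,
                         left_key=lambda c: c[0],
                         right_key=lambda o: o[1])
```

The merge loop skips rows by comparing keys with `<` and only emits output when the keys are equal. A join condition that is not an equality (for example `a.x < b.y`) gives the loop nothing to match on, which is why this strategy is tied to equi-joins with sortable keys.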

    【Comments】:
