【问题标题】:Spark Physical Plan false/true for Exchange partitioning用于 Exchange 分区的 Spark 物理计划 false/true
【发布时间】:2021-04-09 17:25:35
【问题描述】:
repartitionedDF.explain

为物理计划显示此内容

== Physical Plan ==
Exchange hashpartitioning(purchase_month#25, 10), false, [id=#6]
+- LocalTableScan [item#23, price#24, purchase_month#25]

我注意到在某些情况下假也可以是真的。

这意味着什么?我知道一次,但忘记了。

【问题讨论】:

    标签: apache-spark sql-execution-plan


    【解决方案1】:

    经过一番挖掘,我相信它指的是noUserSpecifiedNumPartition 变量。如果您进行重新分区,此布尔变量将为false,因为您指定了分区数。否则为true。尝试做一个简单的orderBy,我认为你应该得到true

    我发现了这个

    println(df.repartition('series).orderBy('series).queryExecution.executedPlan.prettyJson)
    

    灵感来自this answer。它给出的输出(仅截断相关部分):

    {
      "class" : "org.apache.spark.sql.execution.exchange.ShuffleExchangeExec",
      "num-children" : 1,
      "outputPartitioning" : [ {
        "class" : "org.apache.spark.sql.catalyst.plans.physical.RangePartitioning",
        "num-children" : 1,
        "ordering" : [ 0 ],
        "numPartitions" : 200
      }, {
        "class" : "org.apache.spark.sql.catalyst.expressions.SortOrder",
        "num-children" : 1,
        "child" : 0,
        "direction" : {
          "object" : "org.apache.spark.sql.catalyst.expressions.Ascending$"
        },
        "nullOrdering" : {
          "object" : "org.apache.spark.sql.catalyst.expressions.NullsFirst$"
        },
        "sameOrderExpressions" : {
          "object" : "scala.collection.immutable.Set$EmptySet$"
        }
      }, {
        "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
        "num-children" : 0,
        "name" : "series",
        "dataType" : "string",
        "nullable" : true,
        "metadata" : { },
        "exprId" : {
          "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
          "id" : 16,
          "jvmId" : "35ee1aa5-f899-4fca-a8a6-a06c3eaabe5c"
        },
        "qualifier" : [ ]
      } ],
      "child" : 0,
      "noUserSpecifiedNumPartition" : true
    }, {
      "class" : "org.apache.spark.sql.execution.exchange.ShuffleExchangeExec",
      "num-children" : 1,
      "outputPartitioning" : [ {
        "class" : "org.apache.spark.sql.catalyst.plans.physical.HashPartitioning",
        "num-children" : 1,
        "expressions" : [ 0 ],
        "numPartitions" : 200
      }, {
        "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
        "num-children" : 0,
        "name" : "series",
        "dataType" : "string",
        "nullable" : true,
        "metadata" : { },
        "exprId" : {
          "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
          "id" : 16,
          "jvmId" : "35ee1aa5-f899-4fca-a8a6-a06c3eaabe5c"
        },
        "qualifier" : [ ]
      } ],
      "child" : 0,
      "noUserSpecifiedNumPartition" : false
    }
    

    truefalse 与物理计划很好地对应:

    df.repartition('series).orderBy('series).explain
    == Physical Plan ==
    *(1) Sort [series#16 ASC NULLS FIRST], true, 0
    +- Exchange rangepartitioning(series#16 ASC NULLS FIRST, 200), true, [id=#192]
       +- Exchange hashpartitioning(series#16, 200), false, [id=#190]
          +- FileScan csv [series#16,timestamp#17,value#18] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/tmp/df.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<series:string,timestamp:string,value:string>
    

    【讨论】:

    • 问个简单的问题,你知道[...].queryExecution.executedPlan.prettyJson 是否可用于pyspark?
    • @Kafels 你可以使用df._jdf.queryExecution().executedPlan().prettyJson()访问java对象
    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2021-11-03
    • 1970-01-01
    • 2016-09-27
    • 2020-12-10
    相关资源
    最近更新 更多