【问题标题】:Weird Behavior of GeoMesa HBase QueryGeoMesa HBase 查询的奇怪行为
【发布时间】:2020-06-09 17:58:44
【问题描述】:

我有一个关于 HBase 查询的问题。我看到很多数据被扫描以进行小型空间查询。我在 OSMNodes 表上触发了地理空间查询。下面是查询和表的详细信息。我看到 HBase 上的读取请求总数 (5,553,421,708) 并看到大多数区域和区域服务器上的请求。我们为什么要为这个查询扫描每个区域(整个表)?

**Query**:

"DWITHIN(geometry, POINT(-122.332426 47.607282), 50, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31' AND nextTimestamp > '2020-05-27 16:59:31'"
**Table Schema:**

geomesa-hbase describe-schema -c atlas -f OSMNodes
INFO  Describing attributes of feature 'OSMNodes'
geometry           | Point     (Spatio-temporally indexed)
ingestionTimestamp | Timestamp (Spatio-temporally indexed)
nextTimestamp      | Timestamp 
serializerVersion  | String    
featurePayload     | String    

User data:
  geomesa.index.dtg    | ingestionTimestamp
  geomesa.indices      | z3:6:3:geometry:ingestionTimestamp,id:4:3:
  geomesa.stats.enable | true
  geomesa.z.splits     | 60
**Query Plan (through GeoMesa Cli):**
geomesa-hbase explain -c atlas -f OSMNodes -q "DWITHIN(geometry, POINT(-122.332426 47.607282), 50, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31' AND nextTimestamp > '2020-05-27 16:59:31'"

Planning 'OSMNodes' (DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00) AND nextTimestamp > 2020-05-27T16:59:31+00:00
  Original filter: (DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31') AND nextTimestamp > '2020-05-27 16:59:31'
  Hints: bin[false] arrow[false] density[false] stats[false] sampling[none]
  Sort: none
  Transforms: none
  Strategy selection:
    Query processing took 17ms for 1 options
    Filter plan: FilterPlan[Z3Index(geometry,ingestionTimestamp)[DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00][nextTimestamp > 2020-05-27T16:59:31+00:00]]
    Strategy selection took 1ms for 1 options
  Strategy 1 of 1: Z3Index(geometry,ingestionTimestamp)
    Strategy filter: Z3Index(geometry,ingestionTimestamp)[DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00][nextTimestamp > 2020-05-27T16:59:31+00:00]
    Geometries: FilterValues(List(POLYGON ((-122.3317610175119 47.607282, -122.33177379496394 47.60715226835226, -122.33181163628976 47.607027522218985, -122.33187308726842 47.606912555524126, -122.33195578637329 47.606811786373285, -122.33205655552413 47.60672908726843, -122.33217152221899 47.60666763628976, -122.33229626835225 47.606629794963936, -122.332426 47.606617017511894, -122.33255573164774 47.606629794963936, -122.33268047778101 47.60666763628976, -122.33279544447586 47.60672908726843, -122.33289621362671 47.606811786373285, -122.33297891273158 47.606912555524126, -122.33304036371024 47.607027522218985, -122.33307820503606 47.60715226835226, -122.3330909824881 47.607282, -122.33307820503606 47.60741173164774, -122.33304036371024 47.60753647778101, -122.33297891273158 47.60765144447587, -122.33289621362671 47.60775221362671, -122.33279544447586 47.60783491273157, -122.33268047778101 47.60789636371024, -122.33255573164774 47.60793420503606, -122.332426 47.6079469824881, -122.33229626835225 47.60793420503606, -122.33217152221899 47.60789636371024, -122.33205655552413 47.60783491273157, -122.33195578637329 47.60775221362671, -122.33187308726842 47.60765144447587, -122.33181163628976 47.60753647778101, -122.33177379496394 47.60741173164774, -122.3317610175119 47.607282))),true,false)
    Intervals: FilterValues(List((-∞,2020-05-27T16:59:31Z]),true,false)
    Plan: ScanPlan
      Tables: atlas_OSMNodes_z3_geometry_ingestionTimestamp_v6
      Ranges (7440): [%00;%0a;E$A%08;%00;%00;%00;%00;%00;::%00;%0a;E$A%0c;], [%01;%0a;E$A%08;%00;%00;%00;%00;%00;::%01;%0a;E$A%0c;], [%02;%0a;E$A%08;%00;%00;%00;%00;%00;::%02;%0a;E$A%0c;], [%03;%0a;E$A%08;%00;%00;%00;%00;%00;::%03;%0a;E$A%0c;], [%04;%0a;E$A%08;%00;%00;%00;%00;%00;::%04;%0a;E$A%0c;]
      Scans (120): ['%0a;ElA%98;%00;%00;%00;%00;%00;::'%0a;Ema%8c;], [:%0a;ElA%98;%00;%00;%00;%00;%00;:::%0a;Ema%8c;], [%14;::%14;%0a;ElA%8c;], [(%0a;ElA%98;%00;%00;%00;%00;%00;::(%0a;Ema%8c;], [%12;%0a;ElA%98;%00;%00;%00;%00;%00;::%12;%0a;Ema%8c;]
      Column families: d
      Remote filters: MultiRowRangeFilter, Z3HBaseFilter[(epoch,2629:2629),(zt,0:2009670),(zxy,335934:1603233:335941:1603248)], CqlFilter[(DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00) AND nextTimestamp > 2020-05-27T16:59:31+00:00]
    Plan creation took 135ms
  Query planning took 433ms

其他查询 [1]

在进行实验时,我添加了较低的时间戳,它将延迟从 2-3 小时减少到了 10-20 分钟。

geomesa-hbase explain -c atlas -f OSMNodes -q "DWITHIN(geometry, POINT(-122.332426 47.607282), 50, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31' AND ingestionTimestamp >= '2019-05-27 16:59:31' AND nextTimestamp > '2020-05-27 16:59:31'"


Planning 'OSMNodes' ((DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00) AND ingestionTimestamp >= 2019-05-27T16:59:31+00:00) AND nextTimestamp > 2020-05-27T16:59:31+00:00
  Original filter: ((DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND ingestionTimestamp <= '2020-05-27 16:59:31') AND ingestionTimestamp >= '2019-05-27 16:59:31') AND nextTimestamp > '2020-05-27 16:59:31'
  Hints: bin[false] arrow[false] density[false] stats[false] sampling[none]
  Sort: none
  Transforms: none
  Strategy selection:
    Query processing took 24ms for 1 options
    Filter plan: FilterPlan[Z3Index(geometry,ingestionTimestamp)[DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND (ingestionTimestamp >= 2019-05-27T16:59:31+00:00 AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00)][nextTimestamp > 2020-05-27T16:59:31+00:00]]
    Strategy selection took 2ms for 1 options
  Strategy 1 of 1: Z3Index(geometry,ingestionTimestamp)
    Strategy filter: Z3Index(geometry,ingestionTimestamp)[DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND (ingestionTimestamp >= 2019-05-27T16:59:31+00:00 AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00)][nextTimestamp > 2020-05-27T16:59:31+00:00]
    Geometries: FilterValues(List(POLYGON ((-122.3317610175119 47.607282, -122.33177379496394 47.60715226835226, -122.33181163628976 47.607027522218985, -122.33187308726842 47.606912555524126, -122.33195578637329 47.606811786373285, -122.33205655552413 47.60672908726843, -122.33217152221899 47.60666763628976, -122.33229626835225 47.606629794963936, -122.332426 47.606617017511894, -122.33255573164774 47.606629794963936, -122.33268047778101 47.60666763628976, -122.33279544447586 47.60672908726843, -122.33289621362671 47.606811786373285, -122.33297891273158 47.606912555524126, -122.33304036371024 47.607027522218985, -122.33307820503606 47.60715226835226, -122.3330909824881 47.607282, -122.33307820503606 47.60741173164774, -122.33304036371024 47.60753647778101, -122.33297891273158 47.60765144447587, -122.33289621362671 47.60775221362671, -122.33279544447586 47.60783491273157, -122.33268047778101 47.60789636371024, -122.33255573164774 47.60793420503606, -122.332426 47.6079469824881, -122.33229626835225 47.60793420503606, -122.33217152221899 47.60789636371024, -122.33205655552413 47.60783491273157, -122.33195578637329 47.60775221362671, -122.33187308726842 47.60765144447587, -122.33181163628976 47.60753647778101, -122.33177379496394 47.60741173164774, -122.3317610175119 47.607282))),true,false)
    Intervals: FilterValues(List([2019-05-27T16:59:31Z,2020-05-27T16:59:31Z]),true,false)
    Plan: ScanPlan
      Tables: atlas_OSMNodes_z3_geometry_ingestionTimestamp_v6
      Ranges (404100): [%00;%0a;4$A%08;%00;%00;%00;%00;%00;::%00;%0a;4$A%0c;], [%01;%0a;4$A%08;%00;%00;%00;%00;%00;::%01;%0a;4$A%0c;], [%02;%0a;4$A%08;%00;%00;%00;%00;%00;::%02;%0a;4$A%0c;], [%03;%0a;4$A%08;%00;%00;%00;%00;%00;::%03;%0a;4$A%0c;], [%04;%0a;4$A%08;%00;%00;%00;%00;%00;::%04;%0a;4$A%0c;]
      Scans (4080): [2%0a;/la%08;%00;%00;%00;%00;%00;::2%0a;0da%9c;], [%18;%0a;$mA%08;%00;%00;%00;%00;%00;::%18;%0a;%eA%9c;], [%03;%0a;@%e%08;%00;%00;%00;%00;%00;::%03;%0a;@me%9c;], [%0f;%0a;%eE%08;%00;%00;%00;%00;%00;::%0f;%0a;&-E%9c;], ['%0a;Bda%08;%00;%00;%00;%00;%00;::'%0a;C,a%9c;]
      Column families: d
      Remote filters: MultiRowRangeFilter, Z3HBaseFilter[(epoch,2577:2629),(zt,1410483:2097151,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0:2009670),(zxy,335934:1603233:335941:1603248)], CqlFilter[(DWITHIN(geometry, POINT (-122.332426 47.607282), 50.0, meters) AND (ingestionTimestamp >= 2019-05-27T16:59:31+00:00 AND ingestionTimestamp <= 2020-05-27T16:59:31+00:00)) AND nextTimestamp > 2020-05-27T16:59:31+00:00]
    Plan creation took 475ms
  Query planning took 813ms

造成这么大差异的原因是什么?

【问题讨论】:

    标签: hbase geomesa


    【解决方案1】:

    首先,感谢您提供索引信息和查询解释器输出。这有助于我们更容易地回答。

    使用 z3 索引时,如果日期范围只有一个上限(或类似的下限),则涉及索引空间中的范围。对于每个拆分,将扫描相同模式的 z3 范围,因此拥有 60 个拆分将导致需要扫描大量范围,并且这些范围可能会分布在 HBase 集群中。

    有一些可能的尝试: 1. 以更少的范围重新摄取 2. 添加一个 z2(空间)索引来帮助处理这些类型的查询(空间谓词将返回少量记录,然后将被进一步过滤) 3. 弄清楚是否可以添加时间下限(诚然,这可能是不可能的。在某些用例中,它确实有意义。)

    【讨论】:

    • 如果我运行 OtherQuery[1] 而不是上面的一个,摄取时间戳总是 ingestionTimestamp >= '2019-05-27 16:59:31' 怎么办?它会提高性能吗?查询计划附在上面。
    • ingestionTimestamp = '2019-05-27 16:59:31' 在逻辑上意味着 ingestionTimestamp 正好是 '2020-05- 27 16:59:31'。如果您想按这样的确切日期进行查询,您可以考虑在 ingestionTimestamp 上有一个属性索引。 (您可以将 z2 作为复合“二级”索引。这样您就可以尽可能快地查询准确的时间和几何形状。
    • 除此之外,我猜 OtherQuery[1] 的性能可能会更快,因为它应该扫描更少的数据。也就是说,与扫描次数相比,范围数非常高。这意味着可能会扫描一些不必要的数据。 GeoMesa 试图避免创建过多的扫描。它加入范围以实现这一目标。在这些情况下,进行 60 次拆分可能会导致速度减慢。也就是说,这只是一个可供试验的变量!
    猜你喜欢
    • 2016-02-24
    • 2010-11-26
    • 2012-09-24
    • 2012-06-05
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多