AWS SDK CloudSearch pagination: answers

【Question title】: AWS SDK CloudSearch pagination
【Posted】: 2015-07-22 05:57:28
【Question】:

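I'm using the PHP AWS SDK to communicate with CloudSearch. According to this post, pagination can be done with either the cursor or the start parameter. But when you have more than 10,000 hits, you can't use start.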

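When using start, I can specify ['start' => 1000, 'size' => 100] to skip the first 10 pages directly.
How can I get to the 1000th page (or any other arbitrary page) using cursor? Is there perhaps some way to calculate this parameter?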

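【Comments on the question】:

I just found this through a search and I'm still working on it, but my temporary fix for pagination is to fetch blocks of 10,000 with no fields, so I only get the document IDs back. Once I have worked out the last block of 10,000 IDs that I need for my page offset, I slice the result array so that it returns only a smaller set of results. I also have a caching layer here, so subsequent calls find the cursor results already cached.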
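
The slicing step described in this comment could look roughly like the sketch below. The helper name slicePage and the stand-in ID block are assumptions made for illustration; they are not part of the commenter's code.

```php
<?php
// Hypothetical helper: $ids is one cached block of document IDs whose first
// entry sits at absolute offset $blockStart in the full result set.
function slicePage(array $ids, int $blockStart, int $targetOffset, int $pageSize): array
{
    // Convert the absolute offset into a position inside this block.
    return array_slice($ids, $targetOffset - $blockStart, $pageSize);
}

$block = range(20000, 29999);                 // stand-in IDs for offsets 20000..29999
$page  = slicePage($block, 20000, 24980, 20); // the 20 entries for offsets 24980..24999
```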

Tags: php amazon-web-services pagination amazon-cloudsearch

【Solution 1】:

I'd love there to be a better way, but here goes...

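One thing I discovered about cursors is that they return the same value for repeated search requests against the same data set, so don't think of them as sessions. As long as your data isn't being updated, you can effectively cache the pagination work and share it across multiple users.

I came up with this solution and tested it against more than 75,000 records.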

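1) Determine whether your start falls below the 10K limit. If it does, use a regular non-cursor search. Otherwise, once you are past 10K, first perform a search with the initial cursor, a size of 10K, and return set to _no_fields. This gives us our starting offset. Returning no fields cuts down the amount of data we have to consume, and we don't need these IDs anyway.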
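
As a rough sketch of step 1, the decision and the first seek request could be built like this. The function names and the 'query' key are assumptions for illustration (the answer uses a custom client, so the exact request keys may differ).

```php
<?php
// Hypothetical helper: true when the requested offset is at or beyond the
// 10,000 threshold used in this answer, so a cursor seek is needed.
function needsCursorSeek(array $request): bool
{
    return isset($request['start']) && $request['start'] >= 10000;
}

// Hypothetical helper: turn a normal page request into the first seek request.
function buildInitialSeekRequest(array $request): array
{
    // The start offset and facets are not needed while seeking.
    unset($request['start'], $request['facet']);
    $request['cursor'] = 'initial';
    $request['return'] = '_no_fields'; // document IDs only, no field data
    $request['size']   = 10000;        // first chunk is always 10K
    return $request;
}

$request = ['query' => 'example', 'start' => 32000, 'size' => 20];
if (needsCursorSeek($request)) {
    $seek = buildInitialSeekRequest($request);
}
```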

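2) Work out your target offset and plan how many iterations are needed to position the cursor just before the target page of results. I then iterate and cache the results, using my request as the cache hash.

For my iterations I start with a 10K chunk, then drop to 5K chunks, then 1K chunks as I get "closer" to the target offset. This means later pagination can reuse a previous cursor that is closer to the last chunk.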

For example, this might look like:

Fetch 10000 records (initial cursor)
Fetch 5000 records
Fetch 5000 records
Fetch 5000 records
Fetch 5000 records
Fetch 1000 records
Fetch 1000 records

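This gets me to the block around the 32,000 offset mark. If I then need to reach 33,000, I can use my cached results to get the cursor that returned the previous 1,000 and continue from that offset...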

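Fetch 10000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 1000 records (cached)
Fetch 1000 records (cached)
Fetch 1000 records (work done using the cached cursor)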
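
The two chunk sequences above can be reproduced with a small planner. This is a simplified sketch, not a line-for-line copy of the loop in the code further down; the function name planSeekChunks is an assumption.

```php
<?php
// Hypothetical planner: list the chunk sizes needed to seek to $target.
// Returns an empty list when $target is below 10,000 (plain start works there).
function planSeekChunks(int $target): array
{
    if ($target < 10000) {
        return [];
    }
    $chunks = [];
    $offset = 0;
    while ($offset < $target) {
        $remaining = $target - $offset;
        if ($offset === 0) {
            $size = 10000;       // the first request always takes 10K
        } elseif ($remaining >= 5000) {
            $size = 5000;
        } elseif ($remaining >= 1000) {
            $size = 1000;
        } else {
            $size = $remaining;  // final partial step onto the exact offset
        }
        $chunks[] = $size;
        $offset  += $size;
    }
    return $chunks;
}

echo implode(',', planSeekChunks(32000)), "\n"; // 10000,5000,5000,5000,5000,1000,1000
echo implode(',', planSeekChunks(33000)), "\n"; // 10000,5000,5000,5000,5000,1000,1000,1000
```

Because the plan for 33,000 starts with the same seven chunks as the plan for 32,000, the first seven requests are identical and can be served from the cache.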

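3) Now that we are in the "neighborhood" of the target result offset, you can start specifying page sizes that land just before your destination. Then perform the final search to get the actual page of results.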

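4) If you add or remove documents from your index, you will need a mechanism for invalidating your previously cached results. I did this by storing a timestamp of when the index was last updated and using it as part of the cache key generation routine.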

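The important part is the caching. You should build a cache mechanism that uses the request array as the cache hash key, so that entries are easy to create and look up.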
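
One possible way to derive such a key, combining the request array with the index timestamp from step 4, is sketched below. The function name cacheKey and the choice of SHA-256 over JSON are assumptions; any stable serialization would do.

```php
<?php
// Hypothetical key builder: the same request and the same index timestamp
// always give the same key; a newer timestamp invalidates every old entry.
function cacheKey(array $request, int $indexUpdatedAt): string
{
    ksort($request); // top-level parameter order must not change the key
    return hash('sha256', json_encode($request) . '|' . $indexUpdatedAt);
}

$a = cacheKey(['cursor' => 'initial', 'size' => 10000], 1437544648);
$b = cacheKey(['size' => 10000, 'cursor' => 'initial'], 1437544648);
$c = cacheKey(['size' => 10000, 'cursor' => 'initial'], 1437544700);
// $a and $b are equal; $c differs because the index timestamp changed.
```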

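With an unseeded cache this approach is SLOW, but if you can warm the cache and only expire it when the indexed documents change (and then warm it again), your users won't be able to tell.

This code idea works for 20 items per page. I'd love to work on it and see how I could code it smarter and more efficiently, but the concept is there...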

// Build $request here and set $request['start'] to be the offset you want to reach

// Craft getCache() and setCache() functions or methods for cache handling.

// have $cloudSearchClient as your client

if(isset($request['start']) === true and $request['start'] >= 10000)
{
  $originalRequest = $request;
  $cursorSeekTarget = $request['start'];
  $cursorSeekAmount = 10000; // first one should be 10K since there's no pagination under this
  $cursorSeekOffset = 0;
  $request['return'] = '_no_fields';
  $request['cursor'] = 'initial';
  unset($request['start'],$request['facet']);
  // While there is outstanding work to be done...
  while( $cursorSeekAmount > 0 )
  {
    $request['size'] = $cursorSeekAmount;
    // first hit the local cache
    if(empty($result = getCache($request)) === true)
    {
      $result = $cloudSearchClient->Search($request);
      // store the results in the cache
      setCache($request,$result);
    }
    if(empty($result) === false and empty( $hits = $result->get('hits') ) === false and empty( $hits['hit'] ) === false )
    {
      // prepare the next request with the cursor
      $request['cursor'] = $hits['cursor'];
    }
    $cursorSeekOffset = $cursorSeekOffset + $request['size'];
    if($cursorSeekOffset >= $cursorSeekTarget)
    {
      $cursorSeekAmount = 0; // Finished, no more work
    }
    // the first request needs to get 10K, but after that only get 5K
    elseif($cursorSeekAmount >= 10000 and ($cursorSeekTarget - $cursorSeekOffset) > 5000)
    {
      $cursorSeekAmount = 5000;
    }
    elseif(($cursorSeekOffset + $cursorSeekAmount) > $cursorSeekTarget)
    {
      $cursorSeekAmount = $cursorSeekTarget - $cursorSeekOffset;
      // if we still need to seek more than 5K records, limit it back again to 5K
      if($cursorSeekAmount > 5000)
      {
        $cursorSeekAmount = 5000;
      }
      // if we still need to seek more than 1K records, limit it back again to 1K
      elseif($cursorSeekAmount > 1000)
      {
        $cursorSeekAmount = 1000;
      }
    }
  }
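  // Restore aspects of the original request (the actual page of results)
  $request['size'] = isset($originalRequest['size']) ? $originalRequest['size'] : 20; // default: 20 items
  if(isset($originalRequest['facet']) === true)
  {
    // only restore facets if the original request had any
    $request['facet'] = $originalRequest['facet'];
  }
  unset($request['return']); // get the default returns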
  if(empty($result = getCache($request)) === true)
  {
    $result = $cloudSearchClient->Search($request);
    setCache($request,$result);
  }
}
else
{
  // No cursor required
  $result = $cloudSearchClient->Search( $request );
}

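Note that this was done with a custom AWS client rather than the official SDK classes, but the request and search structures should be comparable.

【Comments】:

I have also just found that there is no need to cache the results of the seek operations. Caching only the cursor from each response is enough for subsequent iterations.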
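
A minimal sketch of that cursor-only variant follows. It assumes a hypothetical $search callable that returns an array shaped like ['hits' => ['cursor' => ..., 'hit' => [...]]], and it uses a plain PHP array as a stand-in for a real cache; neither is part of the original code.

```php
<?php
// Walk the planned chunk sizes, storing only the cursor each step returns.
// Returns the cursor positioned at the target, or null if the results ran out.
function seekCursor(callable $search, array &$cursorCache, array $request, array $chunkSizes): ?string
{
    unset($request['start'], $request['facet']);
    $request['cursor'] = 'initial';
    $request['return'] = '_no_fields';
    foreach ($chunkSizes as $size) {
        $request['size'] = $size;
        $key = md5(json_encode($request));
        if (!isset($cursorCache[$key])) {
            $result = $search($request);
            if (empty($result['hits']['hit'])) {
                return null; // seeked past the end of the result set
            }
            $cursorCache[$key] = $result['hits']['cursor'];
        }
        $request['cursor'] = $cursorCache[$key];
    }
    return $request['cursor'];
}

// Fake search for demonstration: encodes the absolute offset in the cursor.
$calls = 0;
$fakeSearch = function (array $r) use (&$calls): array {
    $calls++;
    $pos = $r['cursor'] === 'initial' ? 0 : (int) substr($r['cursor'], 1);
    return ['hits' => ['cursor' => 'c' . ($pos + $r['size']), 'hit' => [['id' => 'x']]]];
};

$cache  = [];
$first  = seekCursor($fakeSearch, $cache, ['query' => 'q', 'start' => 16000], [10000, 5000, 1000]);
$second = seekCursor($fakeSearch, $cache, ['query' => 'q', 'start' => 17000], [10000, 5000, 1000, 1000]);
// The second seek reuses three cached cursors and issues only one new search.
```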