totalEstimatedMatches behavior with Microsoft (Bing) Cognitive search API (v5)

[Question title]: totalEstimatedMatches behavior with Microsoft (Bing) Cognitive search API (v5)
[Posted]: 2019-03-08 09:32:20
[Question]:

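I recently converted some Bing Search API v2 code to v5 and it works, but I'm curious about the behavior of "totalEstimatedMatches". Here's an example to illustrate my question:

A user on our site searches for a particular term. The API query returns 10 results (our page size setting) with totalEstimatedMatches set to 21, so we indicate 3 pages of results and let the user page through.

When they get to page 3, totalEstimatedMatches returns 22 rather than 21. It seems odd that with such a small result set it wouldn't already know there are 22, but I can live with that. All results are displayed correctly.

Now if the user pages back from page 3 to page 2, the value of totalEstimatedMatches is 21 again. This strikes me as a little surprising, because once the result set has been paged through, the API probably ought to know that there are 22 results and not 21.

I've been a professional software developer since the 80s, so I get that this is one of those details tied to the design of the API. Apparently it isn't caching the exact number of results, or whatever. I just don't remember that kind of behavior from the V2 search API (which I realize was 3rd-party code). It was pretty reliable about the number of results.

Does this strike anyone besides me as a little unexpected?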

[Comments]:

Bump ^. I've noticed similar behavior when using the OR operator in my q=... parameter.

Tags: bing-api microsoft-cognitive

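[Answer 1]:

It turns out this is why the response JSON field totalEstimatedMatches includes the word ...Estimated... and isn't just called totalMatches:

"...search engine index does not support an accurate estimation of total match."

Taken from: News Search API V5 paging results with offset and count

As one might expect, the fewer results you get back, the larger the percentage error you're likely to see in the totalEstimatedMatches value. Likewise, the more complex your query (for example, a compound query such as ../search?q=(foo OR bar OR foobar)&..., which is really 3 searches packed into 1), the more variation this value seems to exhibit.

That said, I've managed (at least preliminarily) to compensate for this by setting offset == totalEstimatedMatches and creating a simple equivalency-checking function.

Here's a trivial example in python: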

while True:
    if original_totalEstimatedMatches < new_totalEstimatedMatches:
        # The estimate grew, so remember it (plain assignment; ints have no .copy()).
        original_totalEstimatedMatches = new_totalEstimatedMatches

        # set_new_offset_and_call_api() is a func that does what it says.
        new_totalEstimatedMatches = set_new_offset_and_call_api()
    else:
        break
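
For anyone who wants something they can run, here is a minimal self-contained sketch of the same idea. It assumes a hypothetical `fetch_estimate(offset)` callable (my own name, not part of any Bing SDK) that sends the search request at the given offset and returns the totalEstimatedMatches field of the response. The `max_calls` cap is also my addition, so that an estimate that keeps growing cannot loop forever. The fake fetcher only mimics the 21/22 behavior described in the question.

```python
from typing import Callable


def settle_total_estimate(fetch_estimate: Callable[[int], int],
                          max_calls: int = 20) -> int:
    """Re-query with offset == the latest estimate until it stops growing.

    fetch_estimate is a hypothetical callable: given an offset, it performs
    the search request and returns totalEstimatedMatches from the response.
    """
    best = fetch_estimate(0)
    for _ in range(max_calls - 1):
        newer = fetch_estimate(best)
        if newer <= best:
            # The estimate did not grow, so treat the current value as settled.
            break
        best = newer
    return best


# Stand-in for the real API, for demonstration only: the estimate is 21
# until the offset reaches 21, then it becomes 22.
def fake_fetch_estimate(offset: int) -> int:
    return 21 if offset < 21 else 22


settled = settle_total_estimate(fake_fetch_estimate)
print(settled)
```

Each probe costs one API call, so in practice it is probably worth running this once per query and reusing the result.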

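[Comments]:

So in other words, you're stashing the value yourself in code.
Yep. Unfortunately, since the number Bing provides is only an estimate, the responsibility for getting an exact value seems to fall on the consumer/intermediary.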
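
One way to stash the value, sketched under assumptions: keep the highest estimate seen so far for each query, so the page count shown to the user does not shrink when they page backwards. `EstimateCache` and its methods are hypothetical names for illustration. Taking the maximum assumes the estimate only undershoots, as in the question; if it can also overshoot, a different rule would be needed.

```python
import math


class EstimateCache:
    """Remembers the highest totalEstimatedMatches seen for each query string."""

    def __init__(self, page_size: int = 10):
        self.page_size = page_size
        self._highest = {}  # query string -> highest estimate seen so far

    def update(self, query: str, estimate: int) -> int:
        """Record a fresh estimate and return the value to show the user."""
        self._highest[query] = max(self._highest.get(query, 0), estimate)
        return self._highest[query]

    def page_count(self, query: str) -> int:
        """Number of pages implied by the highest estimate seen so far."""
        return math.ceil(self._highest.get(query, 0) / self.page_size)


cache = EstimateCache(page_size=10)
# Estimates as the user visits page 1, page 2, page 3, then back to page 2.
shown = [cache.update("foo", est) for est in (21, 21, 22, 21)]
print(shown, cache.page_count("foo"))
```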

[Answer 2]:

Revisiting the API, I've come up with a way to paginate efficiently without having to use the "totalEstimatedMatches" return value:

class ApiWorker(object):
    def __init__(self, q):
        self.q = q
        self.offset = 0
        self.result_hashes = set()
        self.finished = False

    def calc_next_offset(self, resp_urls):
        before_adding = len(self.result_hashes)
        self.result_hashes.update(hash(i) for i in resp_urls)  # <== abuse of set operations.
        after_adding = len(self.result_hashes)
        if after_adding == before_adding:  # <== then we either got a bunch of duplicates or we're getting very few results back.
            self.finished = True  # same flag that page_through_results() checks
        else:
            self.offset += len(resp_urls)  # advance by the number of results just received

    def page_through_results(self, *args, **kwargs):
        while not self.finished:
            new_resp_urls = ...<call_logic>...
            self.calc_next_offset(new_resp_urls)
            ...<save logic>...
        print(f'All unique results for q={self.q} have been obtained.')

This ^ will stop paginating as soon as a response made up entirely of duplicates is obtained.
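
Since the call and save logic are elided above, here is a runnable sketch of the same stopping rule with the API call injected as a parameter. `fetch_urls(offset, count)` is a hypothetical callable standing in for the real request, and `max_pages` is a safety cap I added. The fake fetcher assumes that requests past the end of the result set repeat the last page; that is only a demonstration assumption, not documented Bing behavior.

```python
from typing import Callable, List


def page_until_duplicates(fetch_urls: Callable[[int, int], List[str]],
                          count: int = 10,
                          max_pages: int = 100) -> List[str]:
    """Collect unique URLs, stopping at the first page that adds nothing new.

    fetch_urls(offset, count) is a hypothetical callable that performs the
    search request and returns the result URLs for that page.
    """
    seen = set()
    ordered = []
    offset = 0
    for _ in range(max_pages):
        urls = fetch_urls(offset, count)
        added = 0
        for url in urls:
            if url not in seen:
                seen.add(url)
                ordered.append(url)
                added += 1
        if added == 0:
            # Empty page or nothing but duplicates: assume we are done.
            break
        offset += len(urls)
    return ordered


# Stand-in for the real API, for demonstration only.
ALL_URLS = [f"https://example.com/{i}" for i in range(22)]


def fake_fetch_urls(offset: int, count: int) -> List[str]:
    if offset >= len(ALL_URLS):
        return ALL_URLS[-count:]  # pretend the API repeats the last page
    return ALL_URLS[offset:offset + count]


unique_urls = page_until_duplicates(fake_fetch_urls)
print(len(unique_urls))
```

This version stores the URLs themselves rather than `hash(i)`, which avoids the (small) chance of two different URLs sharing a hash and keeps the results in arrival order.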
