Matching by part of a URL

【Question title】: Matching by part url
【Posted】: 2016-10-07 15:46:35
【Question】:

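I have two indices: one contains "document" objects with _id=<url of the document>, e.g. http://site/folder/document_name.doc; the other contains "folder" objects with _id=<url of the folder>, e.g. http://site/folder

In my node.js script I need to match documents to folders, i.e. search for all folder URLs, then for each folder search for all documents whose URL starts with the folder URL.

I can't seem to construct the right query to return all documents whose _id starts with http://site/folder

Any ideas?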

【Comments】:

Tags: elasticsearch

【Solution 1】:

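I think a better solution is not to use _id for this problem.

Instead, index a field named path (or whatever name you like; the mapping below calls it url), and consider using the Path Hierarchy Tokenizer with some creative token filters.

This way you let Elasticsearch/Lucene tokenize the URLs.

For example, https://site/folder is tokenized into two tokens: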

site
site/folder

You can then find any file or folder contained under site by searching for the right token: site.

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "http_dropper": {
          "type": "pattern_replace",
          "pattern": "^https?:/{0,}(.*)",
          "replacement": "$1"
        },
        "empty_dropper": {
          "type": "length",
          "min": 1
        },
        "qs_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)[?].*",
          "replacement": "$1"
        },
        "trailing_slash_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)/+$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "url": {
          "tokenizer": "path_hierarchy",
          "filter": [
            "http_dropper",
            "qs_dropper",
            "trailing_slash_dropper",
            "empty_dropper",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "type" : {
      "properties": {
        "url" : {
          "type": "string",
          "analyzer": "url"
        },
        "type" : {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

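You may or may not want the trailing_slash_dropper that I added. It might also be worthwhile to put the lowercase token filter in there, but that could make some URL tokens fundamentally incorrect (e.g. mysite.com/bucket/AaDsaAe31AcxX may really care about the case of those characters). You can take the analyzer for a test drive with the _analyze endpoint: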

GET /test/_analyze?analyzer=url&text=http://test.com/text/a/?value=xyz&abc=value

Note: I'm using Sense, so it does the URL encoding for me. This will produce three tokens:

{
  "tokens": [
    {
      "token": "test.com",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text/a",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
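To make it easier to see where those three tokens come from, here is a rough node.js sketch that replays the analyzer chain locally. It is only an approximation written for illustration, not the Lucene implementation: pathHierarchy and analyzeUrl are hypothetical helper names, and the three regular expressions are copied from the filter settings above.

```javascript
// Approximates the path_hierarchy tokenizer with the default "/" delimiter:
// one token per delimiter boundary, each token extending the previous one.
function pathHierarchy(text) {
  const tokens = [];
  for (let i = 1; i < text.length; i++) {
    if (text[i] === "/") tokens.push(text.slice(0, i));
  }
  if (text.length > 0) tokens.push(text);
  return tokens;
}

// The three pattern_replace filters, in the order the "url" analyzer lists them.
const filters = [
  (t) => t.replace(/^https?:\/{0,}(.*)/, "$1"), // http_dropper
  (t) => t.replace(/(.*)[?].*/, "$1"),          // qs_dropper
  (t) => t.replace(/(.*)\/+$/, "$1"),           // trailing_slash_dropper
];

function analyzeUrl(text) {
  const seen = new Set();
  const out = [];
  for (const raw of pathHierarchy(text)) {
    const token = filters.reduce((t, f) => f(t), raw);
    // empty_dropper (length >= 1) followed by unique
    if (token.length >= 1 && !seen.has(token)) {
      seen.add(token);
      out.push(token);
    }
  }
  return out;
}

console.log(analyzeUrl("http://test.com/text/a/?value=xyz&abc=value"));
// → [ 'test.com', 'test.com/text', 'test.com/text/a' ]
```

The raw tokens "http:" and "http:/" become empty strings after http_dropper and are removed by empty_dropper, and the last raw token collapses to the same text as the one before it, which is why unique is in the chain.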

Tying it together:

POST /test/type
{
  "type" : "dir",
  "url" : "https://site"
}

POST /test/type
{
  "type" : "dir",
  "url" : "https://site/folder"
}

POST /test/type
{
  "type" : "file",
  "url" : "http://site/folder/document_name.doc"
}

POST /test/type
{
  "type" : "file",
  "url" : "http://other/site/folder/document_name.doc"
}

POST /test/type
{
  "type" : "file",
  "url" : "http://other_site/folder/document_name.doc"
}

POST /test/type
{
  "type" : "file",
  "url" : "http://site/mirror/document_name.doc"
}

GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "url": "http://site/folder"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "type": "file"
          }
        }
      ]
    }
  }
}

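It's important to test this so you can see what matches and in what order. Naturally, it finds the document you expect (and puts it at the top!), but it also finds some other documents that you may not expect, such as http://site/mirror/document_name.doc, because it shares the base token: site. There are a number of strategies you can use to exclude those documents, if excluding them is important.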
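If it helps to see why the mirror document sneaks in, the matching can be replayed locally in node.js. This is a simplified sketch and not what Elasticsearch runs: urlTokens is a hypothetical shortcut that only approximates the analyzer for plain URLs like the ones indexed above, and scoring is ignored entirely. It assumes the match query uses its default behavior of matching any of the analyzed query tokens.

```javascript
// Hypothetical shortcut for the "url" analyzer: strip scheme, query string and
// trailing slashes, then emit every path prefix as a token.
function urlTokens(url) {
  const stripped = url
    .replace(/^https?:\/*/, "")
    .replace(/[?].*$/, "")
    .replace(/\/+$/, "");
  const parts = stripped.split("/");
  return parts
    .map((_, i) => parts.slice(0, i + 1).join("/"))
    .filter((t) => t.length > 0);
}

// The "file" documents indexed above.
const files = [
  "http://site/folder/document_name.doc",
  "http://other/site/folder/document_name.doc",
  "http://other_site/folder/document_name.doc",
  "http://site/mirror/document_name.doc",
];

// match query: the query text is analyzed too, and any shared token is a hit.
function matchAnyToken(urls, queryUrl) {
  const wanted = new Set(urlTokens(queryUrl));
  return urls.filter((u) => urlTokens(u).some((t) => wanted.has(t)));
}

// Exact-token comparison: the document must contain this one token.
function hasExactToken(urls, token) {
  return urls.filter((u) => urlTokens(u).includes(token));
}

console.log(matchAnyToken(files, "http://site/folder"));
// → [ 'http://site/folder/document_name.doc', 'http://site/mirror/document_name.doc' ]
console.log(hasExactToken(files, "site/folder"));
// → [ 'http://site/folder/document_name.doc' ]
```

The query text "http://site/folder" analyzes to site and site/folder, so anything carrying the token site qualifies. Requiring the single token site/folder is one way to narrow the result to the folder's own documents.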

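You can take advantage of the tokenization to perform Google-like result filtering, such as how you can search a specific domain through Google:

match query site:elastic.co

You could then (manually) parse out site:elastic.co and use elastic.co as the bounding URL: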

{
  "term" : {
    "url" : "elastic.co"
  }
}

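Note that this is not the same as searching for the URL. You are explicitly saying "only include documents that contain this exact token in their url". You can go further with site:elastic.co/blog and so on, because that exact token exists. However, it's important to note that if you try site:elastic.co/blog/, no documents will be found, because that token cannot exist given the token filters.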
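Because a term query is not analyzed, the caller has to normalize the value the same way the analyzer would. As a hedged sketch of the manual parsing step in node.js (siteFilter and folderSearchBody are hypothetical names, and the normalization only approximates the token filters above):

```javascript
// Hypothetical helper: pulls "site:<value>" out of a query string and turns it
// into a term filter on the url field. The scheme, query string and trailing
// slashes are stripped so the value lines up with an indexed token.
function siteFilter(query) {
  const match = /(?:^|\s)site:(\S+)/.exec(query);
  if (!match) return null;
  const token = match[1]
    .replace(/^https?:\/*/, "")
    .replace(/[?].*$/, "")
    .replace(/\/+$/, "");
  return token.length > 0 ? { term: { url: token } } : null;
}

// Hypothetical helper: search body that keeps only files under one folder.
function folderSearchBody(folderUrl) {
  const filter = siteFilter("site:" + folderUrl);
  if (!filter) throw new Error("not a usable folder URL: " + folderUrl);
  return {
    query: {
      bool: {
        filter: [filter, { term: { type: "file" } }],
      },
    },
  };
}

console.log(JSON.stringify(siteFilter("match query site:elastic.co/blog/")));
// → {"term":{"url":"elastic.co/blog"}}
console.log(JSON.stringify(folderSearchBody("http://site/folder")));
```

With this normalization, site:elastic.co/blog/ and site:elastic.co/blog produce the same filter, which sidesteps the trailing-slash problem. The body returned by folderSearchBody would be sent to GET /test/_search like the query shown earlier.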

【Comments】:

Great answer, thanks. I've used it successfully!