【Question title】: Elasticsearch Intersection Query
【Posted】: 2020-03-07 02:03:28
【Question】:

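I want to get the words that a list of users have in common, sorted by total count.

Example: I have an index of the words used by each user.

Documents: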

[
  {
    user_id: 1,
    word: 'food',
    count: 2
  },
  {
    user_id: 1,
    word: 'thor',
    count: 1
  },
  {
    user_id: 1,
    word: 'beer',
    count: 7
  },
  {
    user_id: 2,
    word: 'summer',
    count: 12
  },
  {
    user_id: 2,
    word: 'thor',
    count: 4
  },
  {
    user_id: 2,
    word: 'beer',
    count: 2
  },
  ..otheruserdetails..
]

Input: user_ids: [1, 2]

Desired output:

[
  {
    'word': 'beer',
    'total_count': 9
  },
  {
    'word': 'thor',
    'total_count': 5
  }
]

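What I have so far:

  1. Fetch all documents whose user_id is in the list of user_ids (bool should query).
  2. Process the documents in the application layer.
    • Loop over each keyword
      • Check whether the keyword exists for every user_id
      • If it does, sum the counts
      • Otherwise, discard it and move on to the next keyword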
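
For reference, the application-layer steps above can be sketched in a few lines of Python. This is only an illustration of the logic that should move into Elasticsearch; the function name `common_words` is made up, and the sample data assumes the second 'beer' document belongs to user 2, which is what the desired output implies.

```python
from collections import defaultdict

def common_words(docs, user_ids):
    """Return words used by every user in user_ids, sorted by total count (descending)."""
    wanted = set(user_ids)
    users_by_word = defaultdict(set)  # word -> set of users that used it
    total_by_word = defaultdict(int)  # word -> summed count across those users
    for doc in docs:
        if doc["user_id"] not in wanted:
            continue  # ignore users outside the requested list
        users_by_word[doc["word"]].add(doc["user_id"])
        total_by_word[doc["word"]] += doc["count"]
    # Keep only the words that every requested user has used (the "intersection").
    common = [
        {"word": word, "total_count": total_by_word[word]}
        for word, users in users_by_word.items()
        if users == wanted
    ]
    return sorted(common, key=lambda row: row["total_count"], reverse=True)

docs = [
    {"user_id": 1, "word": "food", "count": 2},
    {"user_id": 1, "word": "thor", "count": 1},
    {"user_id": 1, "word": "beer", "count": 7},
    {"user_id": 2, "word": "summer", "count": 12},
    {"user_id": 2, "word": "thor", "count": 4},
    {"user_id": 2, "word": "beer", "count": 2},  # assumption: this row is user 2
]
result = common_words(docs, [1, 2])
print(result)
```

Every matching document has to be pulled into memory first, which is exactly the cost the question is trying to avoid.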

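However, this is not feasible: the number of word documents will grow very large and the application layer will not keep up. Is there a way to move this into the ES query?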

【Comments】:

  • LMK if "intersection" is not the right word for this. I have been searching for a solution using the word "intersection".

Tags: elasticsearch elasticsearch-aggregation elasticsearch-query


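【Solution 1】:

You can use a Terms aggregation together with a Value Count aggregation.

A terms aggregation can be thought of as a "group by". The output gives a list of unique user IDs, the words under each user, and a count for each word. Note that value_count counts the documents per word; it does not sum the count field.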

{
  "from": 0, 
  "size": 10, 
  "query": {
    "terms": {
      "user_id": [
        "1",
        "2"
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field": "user_id",
        "size": 10
      },
      "aggs": {
        "words": {
          "terms": {
            "field": "word.keyword",
            "size": 10
          },
          "aggs": {
            "word_count": {
              "value_count": {
                "field": "word.keyword"
              }
            }
          }
        }
      }
    }
  }
}

Result:

    "hits" : [
      {
        "_index" : "index89",
        "_type" : "_doc",
        "_id" : "gFRzr3ABAWOsYG7t2tpt",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 1,
          "word" : "thor",
          "count" : 1
        }
      },
      {
        "_index" : "index89",
        "_type" : "_doc",
        "_id" : "flRzr3ABAWOsYG7t0dqI",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 1,
          "word" : "food",
          "count" : 2
        }
      },
      {
        "_index" : "index89",
        "_type" : "_doc",
        "_id" : "f1Rzr3ABAWOsYG7t19ps",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 2,
          "word" : "thor",
          "count" : 4
        }
      },
      {
        "_index" : "index89",
        "_type" : "_doc",
        "_id" : "gVRzr3ABAWOsYG7t8NrR",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 1,
          "word" : "food",
          "count" : 2
        }
      },
      {
        "_index" : "index89",
        "_type" : "_doc",
        "_id" : "glRzr3ABAWOsYG7t-Npj",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 1,
          "word" : "thor",
          "count" : 1
        }
      },
      {
        "_index" : "index89",
        "_type" : "_doc",
        "_id" : "g1Rzr3ABAWOsYG7t_9po",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 2,
          "word" : "thor",
          "count" : 4
        }
      }
    ]
  },
  "aggregations" : {
    "users" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 1,
          "doc_count" : 4,
          "words" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "food",
                "doc_count" : 2,
                "word_count" : {
                  "value" : 2
                }
              },
              {
                "key" : "thor",
                "doc_count" : 2,
                "word_count" : {
                  "value" : 2
                }
              }
            ]
          }
        },
        {
          "key" : 2,
          "doc_count" : 2,
          "words" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "thor",
                "doc_count" : 2,
                "word_count" : {
                  "value" : 2
                }
              }
            ]
          }
        }
      ]
    }
  }
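
To make the shape of that response easier to read, here is a rough plain-Python analogue of the nested terms + value_count aggregation. It is a sketch, not what Elasticsearch runs internally; `words_per_user` is a hypothetical helper, and the (user_id, word) pairs are copied from the hits above.

```python
from collections import Counter, defaultdict

# (user_id, word) pairs taken from the "hits" section of the response above.
hits = [
    (1, "thor"),
    (1, "food"),
    (2, "thor"),
    (1, "food"),
    (1, "thor"),
    (2, "thor"),
]

def words_per_user(pairs):
    """Group by user_id, then by word, counting documents in each group."""
    grouped = defaultdict(Counter)
    for user_id, word in pairs:
        grouped[user_id][word] += 1  # one per document; the count field is ignored
    return {user_id: dict(counter) for user_id, counter in grouped.items()}

buckets = words_per_user(hits)
print(buckets)
```

The numbers match the doc_count and word_count values in the buckets above: they are per-user document counts, so the result still has to be intersected and summed elsewhere.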

【Comments】:

    【Solution 2】:

    You can filter on the users and then aggregate, like this:

    {
      "size": 0,
      "aggs": {
        "words_stats": {
          "filter": {
            "terms": {
              "user_id": [
                "1",
                "2"
              ]
            }
          }, 
          "aggs": {
            "words": {
              "terms": {
                "field": "word.keyword"
              },
              "aggs": {
                "total_count": {
                  "sum": {
                    "field": "count"
                  }
                }
              }
            }
          }
        }
      }
    }
    

    The result will be:

    {
      "key" : "beer",
      "doc_count" : 2,
      "total_count" : {
        "value" : 9.0
      }
    },
    {
      "key" : "thor",
      "doc_count" : 2,
      "total_count" : {
        "value" : 5.0
      }
    },
    {
      "key" : "food",
      "doc_count" : 1,
      "total_count" : {
        "value" : 2.0
      }
    },
    {
      "key" : "summer",
      "doc_count" : 1,
      "total_count" : {
        "value" : 12.0
      }
    }
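
If you want the output in the shape the question asks for, the buckets can be post-processed on the client. The sketch below assumes the buckets sit under `aggregations.words_stats.words.buckets`, following the aggregation names in the query; `to_output` and its `min_docs` parameter are hypothetical. Keep in mind that doc_count counts documents, not distinct users, so the filter is only an intersection test when each user has at most one document per word.

```python
def to_output(response, min_docs=2):
    """Turn terms buckets into [{'word', 'total_count'}], sorted by total_count."""
    buckets = response["aggregations"]["words_stats"]["words"]["buckets"]
    rows = [
        {"word": bucket["key"], "total_count": int(bucket["total_count"]["value"])}
        for bucket in buckets
        if bucket["doc_count"] >= min_docs  # drop words seen in too few documents
    ]
    return sorted(rows, key=lambda row: row["total_count"], reverse=True)

# Buckets copied from the result above, wrapped in the assumed response path.
response = {
    "aggregations": {
        "words_stats": {
            "words": {
                "buckets": [
                    {"key": "beer", "doc_count": 2, "total_count": {"value": 9.0}},
                    {"key": "thor", "doc_count": 2, "total_count": {"value": 5.0}},
                    {"key": "food", "doc_count": 1, "total_count": {"value": 2.0}},
                    {"key": "summer", "doc_count": 1, "total_count": {"value": 12.0}},
                ]
            }
        }
    }
}
output = to_output(response)
print(output)
```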
    

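    【Comments】:

    • I used this as a reference. Because I wanted the "intersection", I had to use min_doc_count to check that a word exists for more than one user, and order on total_count for the sorting.

    【Solution 3】: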

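    Here is what I ended up doing:

    I used the answers from @Rakesh Chandru and @jaspreet chahal as a reference and came up with this. This query handles both the intersection and the sorting.

    Flow: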

    • Filter by user ID
    • Group by keyword with a terms aggregation (the word field in the example)
    • Order by the aggregated (summed) count

    {
        size: 0, // return only the aggregations, not the matching documents
        query: {
            terms: { user_id: user_ids } // filter by user_ids
        },
        aggs: {
            group_by_keyword: {
                terms: {
                    field: "keyword", // group by keyword (word in the example)
                    min_doc_count: 2, // keep keywords found in at least 2 documents
                    order: { agg_count: "desc" }, // order by the summed count
                    size // maximum number of buckets to return
                },
                aggs: {
                    agg_count: {
                        sum: {
                            field: "count" // sum the count field per keyword
                        }
                    }
                }
            }
        }
    }
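
The same request body can be built for any number of users. The following is a minimal sketch: `build_common_words_query` is a hypothetical helper, the default field `word.keyword` is borrowed from the other answers, and setting min_doc_count to the number of users is my own generalization of the hard-coded 2. Since min_doc_count counts documents rather than distinct users, this is a true intersection only if each (user_id, word) pair is stored in a single document.

```python
import json

def build_common_words_query(user_ids, size=10, field="word.keyword"):
    """Build a search body: filter by users, group by word, order by summed count."""
    unique_ids = sorted(set(user_ids))
    return {
        "size": 0,  # aggregations only, no hits
        "query": {"terms": {"user_id": unique_ids}},
        "aggs": {
            "group_by_keyword": {
                "terms": {
                    "field": field,
                    # assumption: one document per (user, word), so N docs means N users
                    "min_doc_count": len(unique_ids),
                    "order": {"agg_count": "desc"},
                    "size": size,
                },
                "aggs": {"agg_count": {"sum": {"field": "count"}}},
            }
        },
    }

body = build_common_words_query([1, 2], size=5)
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the search request body with whatever client you already use.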
    

    【Comments】:
