Counting string occurrences with ArangoDB AQL

【Question title】: Counting string occurrences with ArangoDB AQL
【Posted】: 2020-01-14 15:11:09
【Question description】:

To count the number of objects that contain a specific attribute value, I can do the following:

FOR t IN thing
  COLLECT other = t.name = "Other" WITH COUNT INTO otherCount
  FILTER other != false
  RETURN otherCount

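But how can I count three other occurrences in the same query, without resorting to subqueries that run over the same dataset multiple times?

I have tried something like this: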

FOR t IN thing
  COLLECT 
    other = t.name = "Other",
    some = t.name = "Some",
    thing = t.name = "Thing"
  WITH COUNT INTO count
  RETURN {
   other, some, thing,
   count
  }

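But I can't make sense of the results: I must be approaching this the wrong way?

【Question discussion】:

Do you actually want to count how often certain phrases occur within a larger string value of a single attribute? Your query doesn't look like it does anything of the sort, and I'm actually surprised that it isn't invalid (i.e. that x = y = z doesn't raise a syntax error).
Yes, I want to count how often different phrases occur as substrings of a single attribute. If exact values were sufficient, the following would be enough: FOR t IN thing COLLECT name = t.name WITH COUNT INTO count RETURN { name, count }.

Tags: arangodb aql

【Solution 1】:

Split and count

You can split the string by the phrase and subtract 1 from the number of resulting parts. This works for any substring, which on the other hand means that it does not respect word boundaries.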

LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(SPLIT(t.name, "Some"))-1
  LET Other = LENGTH(SPLIT(t.name, "Other"))-1
  LET Thing = LENGTH(SPLIT(t.name, "Thing"))-1
  RETURN {
   Some, Other, Thing
}

Result:

[
  {
    "Some": 3,
    "Other": 2,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]

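You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.

Collect words

The TOKENS() function can be used to split the input into an array of words, which can then be grouped and counted. Note that I changed the input slightly. An input of "SomeSome" will not be counted, because "somesome" != "some" (this variant is word-based rather than substring-based).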
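
As a minimal sketch of that case-insensitive variant, the query below lowercases each name once and splits it by the lowercased phrases. The variable name haystack is only illustrative, and the sample data is copied from the example above.

```aql
LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  // Lowercase the input once so all comparisons ignore case
  LET haystack = LOWER(t.name)
  RETURN {
    // Number of parts minus one equals the number of separators found
    Some: LENGTH(SPLIT(haystack, LOWER("Some"))) - 1,
    Other: LENGTH(SPLIT(haystack, LOWER("Other"))) - 1,
    Thing: LENGTH(SPLIT(haystack, LOWER("Thing"))) - 1
  }
```

This is still substring matching, so "brOther" and "SomeSome" keep contributing to the counts.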

LET things = [
    {name: "Here are SOME some and Some Other Things. More Other!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]
LET whitelist = TOKENS("Some Other Things", "text_en")

FOR t IN things
  LET whitelisted = (FOR w IN TOKENS(t.name, "text_en") FILTER w IN whitelist RETURN w)
  LET counts = MERGE(FOR w IN whitelisted
    COLLECT word = w WITH COUNT INTO count
    RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    some: counts.some || 0,
    other: counts.other || 0,
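    // Note: text_en applies stemming, so "Things" is likely tokenized as "thing";
    // that would explain why this key stays at 0 in the result below.
    things: counts.things || 0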
  }

Result:

[
  {
    "name": "Here are SOME some and Some Other Things. More Other!",
    "some": 3,
    "other": 2,
    "things": 0
  },
  {
    "name": "There are no such substrings in here.",
    "some": 0,
    "other": 0,
    "things": 0
  },
  {
    "name": "some-Other-here-though!",
    "some": 1,
    "other": 1,
    "things": 0
  }
]

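This does use a subquery for the COLLECT; otherwise it would count the total number of occurrences across the entire input.

The whitelist step is not strictly necessary; you could also let it count all words. For larger input strings, skipping the words you are not interested in may save some memory.

If you want to match words exactly, you may need to create a separate Analyzer for the language with stemming disabled. You can also turn off normalization ("accent": true, "case": "none"). An alternative is to use REGEX_SPLIT() on typical whitespace and punctuation characters for simpler tokenization, but that depends on your use case.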
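
The REGEX_SPLIT() alternative might look roughly like the sketch below. The character class is an assumption about what counts as "typical" whitespace and punctuation and will need adjusting for real data. The whitelist values are hypothetical, and matching is exact and case-sensitive because no Analyzer is involved.

```aql
LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "some-Other-here-though!"}
]
// Hypothetical whitelist of exact, case-sensitive words
LET whitelist = ["Some", "Other", "Things"]

FOR t IN things
  // Split on runs of whitespace and a few common punctuation characters
  LET words = REGEX_SPLIT(t.name, "[\\s.,;:!?-]+")
  // Group the whitelisted words per document and merge the counts into one object
  LET counts = MERGE(
    FOR w IN words
      FILTER w IN whitelist
      COLLECT word = w WITH COUNT INTO count
      RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    Some: counts.Some || 0,
    Other: counts.Other || 0,
    Things: counts.Things || 0
  }
```

Because nothing is stemmed or normalized here, "SomeSome" and "brOther" are treated as separate words and do not count toward "Some" or "Other".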

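Other solutions

I don't think it is possible to use COLLECT without a subquery to count occurrences for each input object independently, unless you want total counts.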
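
For the total-count case, a sketch along these lines should work without any subquery: COLLECT AGGREGATE sums the per-document split counts in a single pass. It reuses the sample data and the split-and-count expression from the first example.

```aql
LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  // No grouping key: every document is aggregated into one result row
  COLLECT AGGREGATE
    Some = SUM(LENGTH(SPLIT(t.name, "Some")) - 1),
    Other = SUM(LENGTH(SPLIT(t.name, "Other")) - 1),
    Thing = SUM(LENGTH(SPLIT(t.name, "Thing")) - 1)
  RETURN { Some, Other, Thing }
```

Going by the per-document results of the split-and-count example, the totals should come out as 3 for Some, 3 for Other, and 1 for Thing.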

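Splitting is a bit of a hack, but you could replace SPLIT() with REGEX_SPLIT() and wrap the search phrase in \b so that it only matches when there is a word boundary on both sides. It should then match whole words only (more or less):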

LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b"))-1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b"))-1
  LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b"))-1
  RETURN {
   Some, Other, Thing
}

Result:

[
  {
    "Some": 1,
    "Other": 1,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]
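
If case should be ignored as well, REGEX_SPLIT() accepts an optional third argument that makes the match case-insensitive. The following is a sketch of that variant under the same word-boundary assumptions, not something taken from the original answer.

```aql
LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  // Third argument true = case-insensitive matching
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bsome\\b", true)) - 1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bother\\b", true)) - 1
  LET Things = LENGTH(REGEX_SPLIT(t.name, "\\bthings\\b", true)) - 1
  RETURN {
    Some, Other, Things
  }
```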

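A more elegant solution would be to use ArangoSearch for word counting, but it has no feature that lets you retrieve how often a word occurs. It probably keeps track of this internally already (Analyzer feature "frequency"), but it is definitely not exposed at this point.

【Discussion】:

Thanks for your insights and examples!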