【问题标题】:How to improve the retrieve Query performance in ArangoDB 2.7如何提高 ArangoDB 2.7 中的检索查询性能
【发布时间】:2016-01-26 02:06:49
【问题描述】:

我是 python 和 ArangoDB 的初学者。我已将数据存储在 ArangoDB 中的单个集合名称“DSP”上。 我的查询是:

for k in 
    (for t in DSP return [t.data])
        for z in k
           for p in z
              filter p.name == "name" || 
                     p.content == "pdf" ||
                     p.content == "xml" ||
                     p.name == "Book"
              return p

以及其中存储的json数据: 格式如

{"data": [{"content": "Java", "type": "string", "name": "name", "key": 1}, {"content": "D:/Java", "type": "string", "name": "location", "key": 1}, {"content": "File folder", "type": "string", "name": "type", "key": 1}, {"content": 1896038645, "type": "int", "name": "size", "key": 1}, {"content": 7, "type": "string", "name": "child_folder_count", "key": 1}, {"content": 7, "type": "string", "name": "child_file_count", "key": 1}, {"content": "parse_dir.py", "type": "string", "name": "name", "key": 101}, {"content": "D:/Java/parse_dir.py", "type": "string", "name": "location", "key": 101}, {"content": "py", "type": "string", "name": "mime-type", "key": 101}, {"content": 4032, "type": "string", "name": "size", "key": 101}, {"content": "Wed Dec 30 21:36:32 2015", "type": "string", "name": "created_date", "key": 101}, {"content": "Wed Dec 30 21:42:38 2015", "type": "string", "name": "modified_date", "key": 101}, {"content": "result.json", "type": "string", "name": "name", "key": 102}, {"content": "D:/Java/result.json", "type": "string", "name": "location", "key": 102}, {"content": "json", "type": "string", "name": "mime-type", "key": 102}, {"content": 1134450, "type": "string", "name": "size", "key": 102}, {"content": "Wed Dec 30 21:36:45 2015", "type": "string", "name": "created_date", "key": 102}, {"content": "Wed Dec 30 21:36:45 2015", "type": "string", "name": "modified_date", "key": 102}, {"content": "rmi1.rar", "type": "string", "name": "name", "key": 103}, {"content": "D:/Java/rmi1.rar", "type": "string", "name": "location", "key": 103}, {"content": "rar", "type": "string", "name": "mime-type", "key": 103}, {"content": 165116, "type": "string", "name": "size", "key": 103}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 103}, {"content": "Tue Aug 30 16:18:34 2011", "type": "string", "name": "modified_date", "key": 103}, {"content": "servlet.rar", "type": "string", "name": "name", "key": 104}, {"content": "D:/Java/servlet.rar", "type": "string", "name": "location", "key": 104}, {"content": "rar", "type": "string", "name": "mime-type", "key": 104}, {"content": 782, "type": "string", "name": "size", "key": 104}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 104}, {"content": "Tue Aug 30 16:18:30 2011", "type": "string", "name": "modified_date", "key": 104}, {"content": "crawler projects", "type": "string", "name": "name", "key": 2}, {"content": "D:/Java/crawler projects", "type": "string", "name": "location", "key": 2}, {"content": "File folder", "type": "string", "name": "type", "key": 2}, {"content": 1886842316, "type": "int", "name": "size", "key": 2}, {"content": 5, "type": "string", "name": "child_folder_count", "key": 2}, {"content": 5, "type": "string", "name": "child_file_count", "key": 2}, {"content": ".metadata", "type": "string", "name": "name", "key": 3}, {"content": "D:/Java/crawler projects/.metadata", "type": "string", "name": "location", "key": 3}, {"content": "File folder", "type": "string", "name": "type", "key": 3}, {"content": 10131546, "type": "int", "name": "size", "key": 3}, {"content": 2, "type": "string", "name": "child_folder_count", "key": 3}, {"content": 2, "type": "string", "name": "child_file_count", "key": 3}, {"content": ".lock", "type": "string", "name": "name", "key": 301}, {"content": "D:/Java/crawler projects/.metadata/.lock", "type": "string", "name": "location", "key": 301}, {"content": "", "type": "string", "name": "mime-type", "key": 301}, {"content": 0, "type": "string", "name": "size", "key": 301}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 301}, {"content": "Mon May 30 12:21:45 2011", "type": "string", "name": "modified_date", "key": 301}, {"content": ".log", "type": "string", "name": "name", "key": 302}, {"content": "D:/Java/crawler projects/.metadata/.log", "type": "string", "name": "location", "key": 302}, {"content": "", "type": "string", "name": "mime-type", "key": 302}, {"content": 598, "type": "string", "name": "size", "key": 302}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 302}, {"content": "Mon May 30 15:29:18 2011", "type": "string", "name": "modified_date", "key": 302}, {"content": "version.ini", "type": "string", "name": "name", "key": 303}, {"content": "D:/Java/crawler projects/.metadata/version.ini", "type": "string", "name": "location", "key": 303}, {"content": "ini", "type": "string", "name": "mime-type", "key": 303}, {"content": 26, "type": "string", "name": "size", "key": 303}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 303}, {"content": "Mon May 30 15:29:18 2011", "type": "string", "name": "modified_date", "key": 303}, {"content": ".mylyn", "type": "string", "name": "name", "key": 4}, {"content": "D:/Java/crawler projects/.metadata/.mylyn", "type": "string", "name": "location", "key": 4}, {"content": "File folder", "type": "string", "name": "type", "key": 4}, {"content": 920, "type": "int", "name": "size", "key": 4}, {"content": 1, "type": "string", "name": "child_folder_count", "key": 4}, {"content": 1, "type": "string", "name": "child_file_count", "key": 4}, {"content": ".tasks.xml.zip", "type": "string", "name": "name", "key": 401}, {"content": "D:/Java/crawler projects/.metadata/.mylyn/.tasks.xml.zip", "type": "string", "name": "location", "key": 401}, {"content": "zip", "type": "string", "name": "mime-type", "key": 401}, {"content": 250, "type": "string", "name": "size", "key": 401}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 401}, {"content": "Mon May 30 12:23:18 2011", "type": "string", "name": "modified_date", "key": 401}, {"content": "repositories.xml.zip", "type": "string", "name": "name", "key": 402}, {"content": "D:/Java/crawler projects/.metadata/.mylyn/repositories.xml.zip", "type": "string", "name": "location", "key": 402}, {"content": "zip", "type": "string", "name": "mime-type", "key": 402}, {"content": 420, "type": "string", "name": "size", "key": 402}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 402}, {"content": "Mon May 30 12:23:18 2011", "type": "string", "name": "modified_date", "key": 402}, {"content": "tasks.xml.zip", "type": "string", "name": "name", "key": 403}, {"content": "D:/Java/crawler projects/.metadata/.mylyn/tasks.xml.zip", "type": "string", "name": "location", "key": 403}, {"content": "zip", "type": "string", "name": "mime-type", "key": 403}, {"content": 250, "type": "string", "name": "size", "key": 403}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 403}, {"content": "Mon May 30 15:31:16 2011", "type": "string", "name": "modified_date", "key": 403}, {"content": "contexts", "type": "string", "name": "name", "key": 5}, {"content": "D:/Java/crawler projects/.metadata/.mylyn/contexts", "type": "string", "name": "location", "key": 5}, {"content": "File folder", "type": "string", "name": "type", "key": 5}, {"content": 0, "type": "int", "name": "size", "key": 5}, {"content": 0, "type": "string", "name": "child_folder_count", "key": 5}]

当我添加大约 100 个 json 文档的 json 文档时,每个文档大约 15 MB,或者添加更多 n 更多过滤条件。查询需要超过 1 分钟的时间,并且有时浏览器没有响应。

我在 Intel core i3 2.4 GHz、4 GB RAM 和 160GB SATA 硬盘上做这个实验。

请告诉我,首先,如何提高查询的性能?我是否需要更改我的存储结构或更改我的查询语法。以及如何对具有相同key的多个文档进行join操作,例如“获取xml类型文档的名称”。

【问题讨论】:

    标签: python-2.7 arangodb aql


    【解决方案1】:

    应该有几种方法可以提高此查询的性能:

    • 通过子查询从集合 DSP 中选择所有文档,然后对其进行迭代 (for k in (for t in DSP return [t.data]) for z in k for p in z filter p.name == "name" ...) 可能比直接使用文档效率低。尝试将 4 个 FOR 循环和子查询替换为 FOR k IN DSP FOR p IN k.data FILTER p.name == "name" ...)

    • 如果您查看查询的explain 输出,它将显示不会使用任何索引。如果您在集合中有很多文档并且只想通过查询检索其中的一些文档,那么索引将有助于提高性能。我建议在data[*].name 上使用数组索引,在data[*].content 上使用数组索引。您可以像这样设置它们:db.DSP.ensureIndex({ type: "hash", fields: [ "data[*].name" ] }); db.DSP.ensureIndex({ type: "hash", fields: [ "data[*].content" ] });。注意:这些类型的索引需要 ArangoDB 2.8。有了这些索引,查询也可以简化为:FOR p in DSP FILTER "name" IN p.data[*].name || "Book" IN p.data[*].name || "pdf" IN p.data[*].content...。请注意,索引只会帮助您快速找到包含搜索数据的文档,而不是包含它的文档部分。

    • 调整文档结构可能会有所帮助。您当前的结构似乎在每个文档中包含多个 contentname 值,例如[ {"content": "Java", "type": "string", "name": "name", "key": 1}, {"content": "D:/Java", "type": "string", "name": "location", "key": 1} ]。看起来每个文档只有一个 data 属性,它是这些结构的数组。您可以尝试将每个数组值保存为单独的文档,而不是使用此结构。例如,{"content": "Java", "type": "string", "name": "name", "key": 1} 将成为自己的文档,{"content": "D:/Java", "type": "string", "name": "location", "key": 1} 将成为另一个文档等。这似乎是明智的,因为您的子结构似乎已经具有 key 属性,并且几个数组值似乎引用相同key 值。转换将允许将可能非常大的文档拆分为更小的块,这不仅会使 AQL 运行得更快(因为它在访问文档时需要解压的数据要少得多),而且还可以让您摆脱返回结果时所有嵌套循环并定位到相关的内部数组值。

    如果您调整文档结构,您的查询可以大大简化为仅FOR p IN DSP FILTER "name" IN p.data[*].name || "Book" IN p.data[*].name || "pdf" IN p.data[*].content ... RETURN p,如果使用索引应该很快。

    【讨论】:

    • 谢谢,stj... 我正在使用 2.7 版本的 ArangoDB,它不允许在健全的集合上创建多个哈希索引。当我在 FOR k IN DSP FOR p IN k.data filter p.name == "modified_date" || 中使用查询时.type == "string" 返回p格式,时间减半..
    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2023-01-13
    • 2012-10-28
    • 1970-01-01
    • 2017-05-14
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多