快速序列化 trie答案

【问题标题】：Fast serialization of a trie快速序列化 trie
【发布时间】：2014-04-25 04:47:16
【问题描述】：

我的应用程序的一部分同时使用了trie 到chunk 字词。例如，["Summer", "in", "Los", "Angeles"] 变为 ["Summer", "in", "Los Angeles"]。

现在，这个 trie 在应用程序启动时从 a large database 填充，以 SQL 形式存储在本地。这需要很长时间，大约15s。我想减少应用程序的启动时间，所以我研究了序列化 Trie。不幸的是，pickling 太慢了 - 比从数据库中加载所有内容都慢。

有没有更快的方法来序列化我的 trie？

Trie 类如下所示：

class Trie:
    def __init__(self):
        self.values = set()
        self.children = dict()

    def insert(self, key, value):
        """Insert a (key,value) pair into the trie.  
        The key should be a list of strings.
        The value can be of arbitrary type."""
        current_node = self
        for key_part in key:
            if key_part not in current_node.children:
                current_node.children[key_part] = Trie()
            current_node = current_node.children[key_part]
        current_node.values.add(value)

    def retrieve(self, key):
        """Returns either the value stored at the key, or raises KeyError."""
        current_node = self
        for key_part in key:
            current_node = current_node.children[key_part]
        return current_node.values

有什么方法可以改变它，使它更可序列化？

【问题讨论】：

我曾经做过这样的事情来节省内存 (stackoverflow.com/questions/2574357/…)，但是使用像 mongoDB 这样的优化数据库和像 Lucene 这样的索引 API，我会避免构建一个新的结构来索引和检索东西。跨度>
+1 for MongoDB，我实际上正在考虑放弃关系数据库。

标签： python serialization nlp pickle trie

【解决方案1】：

我知道我没有给出 python 答案，但这仍然可能有用：

创建、压缩和存储 trie 确实是一项艰巨的任务。我花了很多时间思考自动建议的数据结构，据我所知，最优雅的解决方案是由 Giuseppe Ottaviano 和 partly described in my blog article

尽管在 python 中实现 Ottaviano as described in his paper 的整个解决方案没有意义，但您仍然可以按照基本思想将完整的 trie 存储为一大块内存并且只有引用到接下来要跳的位置。

通过这种方式，您可以轻松地将这个数组或内存块序列化到硬盘上。我对 python 并不完全确定，但我认为这个操作应该可以工作并且比序列化数据结构要快得多。

我知道存在 Ottavianos 工作的 c 实现，您甚至可以使用 python c 绑定。

【讨论】：

【解决方案2】：

我最终将 trie 存储在 MongoDB 中。

存在网络开销，但如果数据库位于本地主机上，这还不错。

【讨论】：