Python faster alternative to dictionary? [duplicate]

【Question title】: Python faster alternative to dictionary? [duplicate]
【Posted】: 2014-11-08 03:10:42
【Question】:

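I am building a simple sentiment mining system using a Naive Bayes classifier.

To train my classifier, I have a text file in which each line contains a list of tokens (generated from a tweet) and the associated sentiment (0 for negative, 4 for positive).

For example: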

0 @ switchfoot http : //twitpic.com/2y1zl - Awww , that 's a bummer . You shoulda got David Carr of Third Day to do it . ; D
0 spring break in plain city ... it 's snowing
0 @ alydesigns i was out most of the day so did n't get much done
0 some1 hacked my account on aim now i have to make a new one
0 really do n't feel like getting up today ... but got to study to for tomorrows practical exam ...

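What I want to do is, for each token, count how many times it occurs in positive tweets and how many times it occurs in negative tweets. I then plan to use these counts to compute probabilities. I am using the built-in dictionary to store the counts: the key is the token and the value is an integer array of size 2.

The problem is that this code starts out very fast but keeps getting slower. By the time it has processed about 200,000 tweets it is extremely slow, roughly 1 tweet per second. Since my training set has 1.6 million tweets, this is far too slow. My code looks like this: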

def compute_counts(infile):
    f = open(infile)
    counts = {}
    i = 0
    for line in f:
        i = i + 1
        print(i)
        words = line.split(' ')
        for word in words[1:]:
            word = word.replace('\n', '').replace('\r', '')
            if words[0] == '0':
                if word in counts.keys():
                    counts[word][0] += 1
                else:
                    counts[word] = [1, 0]
            else:
                if word in counts.keys():
                    counts[word][1] += 1
                else:
                    counts[word] = [0, 1]
    return counts

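What can I do to speed this up? Is there a better data structure?

Edit: This is not a duplicate. The question is not about something faster than dict in general, but about this particular use case.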

【Question comments】:

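Use counts[key] = counts.get(key, default) instead of checking whether the key exists. (dict.get takes the default as a positional argument, not as a default= keyword.)
You could use two collections.Counter objects instead of a dictionary of lists.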
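
The two-Counter suggestion above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name compute_counts_with_counters and the second sample line (label 4) are made up, and line.split() is used in place of the question's line.split(' '), so empty tokens and trailing newlines are dropped automatically.

```python
from collections import Counter

def compute_counts_with_counters(lines):
    """Count tokens per sentiment using one Counter per class."""
    negative = Counter()
    positive = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        label, words = tokens[0], tokens[1:]
        # As in the question: label '0' is negative, anything else counts as positive.
        target = negative if label == '0' else positive
        target.update(words)
    return negative, positive

# First line is taken from the question's sample; the second is invented.
sample = [
    "0 spring break in plain city ... it 's snowing",
    "4 spring is lovely",
]
negative, positive = compute_counts_with_counters(sample)
print(negative["spring"], positive["spring"])
```

A Counter returns 0 for a token it has never seen, so no existence check is needed when reading the counts back.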

Tags: python performance dictionary nlp

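【Solution 1】:

Don't use if word in counts.keys(). In Python 2 that builds a list of the keys and scans it sequentially, which is exactly what a dict is supposed to avoid.

Just write if word in counts.

Or use a defaultdict. https://docs.python.org/2/library/collections.html#collections.defaultdict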
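
A minimal sketch of the defaultdict approach, assuming the same input format as in the question. The function name compute_counts_defaultdict and the second sample line are hypothetical, and the function takes any iterable of lines (an open file object would also work) rather than a filename.

```python
from collections import defaultdict

def compute_counts_defaultdict(lines):
    # Every unseen token starts out as [negative_count, positive_count] == [0, 0].
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        # Index 0 for label '0' (negative), index 1 otherwise, as in the question.
        index = 0 if tokens[0] == '0' else 1
        for word in tokens[1:]:
            counts[word][index] += 1
    return counts

# First line is taken from the question's sample; the second is invented.
sample = [
    "0 spring break in plain city ... it 's snowing",
    "4 spring is lovely",
]
counts = compute_counts_defaultdict(sample)
print(counts["spring"])
```

Note that indexing a defaultdict with a missing key inserts that key, so use word in counts or counts.get(word) when you only want to look something up.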

【Comments】:

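In Python 2, dict.keys() creates a list, and that operation can be as expensive as the search itself. It is not the dictionary that is slow.
defaultdict works like a charm. Earlier it took me about 4 hours to process 200k lines, but now all 1.6 million lines finish in under a minute. Thanks!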
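
The cost described in these comments can be seen with a small timing sketch. It is an illustration only: the vocabulary size and token names are made up, and because Python 3's keys() returns a view with fast membership tests, the Python 2 behaviour is emulated here by building a list explicitly.

```python
import timeit

# Hypothetical vocabulary, sized only for illustration.
counts = {"token%d" % i: [0, 0] for i in range(100000)}

def lookup_via_key_list(word):
    # Emulates Python 2's dict.keys(): build a list of keys, then scan it linearly.
    return word in list(counts.keys())

def lookup_via_dict(word):
    # Hash-based membership test directly on the dict.
    return word in counts

missing = "not-a-token"
slow = timeit.timeit(lambda: lookup_via_key_list(missing), number=20)
fast = timeit.timeit(lambda: lookup_via_dict(missing), number=20)
print("list scan: %.6fs, dict lookup: %.6fs" % (slow, fast))
```

The list-based check does work proportional to the number of distinct tokens on every lookup, which would explain why the original loop slowed down as the dictionary grew.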