Inconsistent sorting with sort(): answers

【Question title】: Inconsistent sorting with sort()
【Posted】: 2016-10-27 12:45:36
【Question】:

I have the following function to count the words in a string and extract the top n:

Function:

from collections import Counter

def count_words(s, n):
    """Return the n most frequently occurring words in s."""

    # Split words into list
    wordlist = s.split()

    # Count words
    counts = Counter(wordlist)

    # Get top n words
    top_n = counts.most_common(n)

    # Sort by count (descending); break ties alphabetically by word
    top_n.sort(key=lambda x: (-x[1], x[0]))

    return top_n

So it sorts by number of occurrences and, in case of a tie, alphabetically. Take the following examples:

print(count_words("cat bat mat cat cat mat mat mat bat bat cat", 3))

works (shows [('cat', 4), ('mat', 4), ('bat', 3)])

print(count_words("betty bought a bit of butter but the butter was bitter", 3))

does not work (shows [('butter', 2), ('a', 1), ('bitter', 1)], but it should contain betty rather than bitter, since they are tied and be... sorts before bi...)

print(count_words("betty bought a bit of butter but the butter was bitter", 6))

works (shows [('butter', 2), ('a', 1), ('betty', 1), ('bitter', 1), ('but', 1), ('of', 1)], with betty before bitter as expected)

What is causing this (word length, perhaps?), and how can I fix it?

【Comments】:

You can fix it by calling .most_common() with no argument, then sorting and slicing the result, rather than passing n to most_common.

Tags: python sorting counter

【Solution 1】:

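The problem is not the sort call but most_common. Counter is implemented as a hash table, so the order it uses is arbitrary. When you ask for most_common(n), it returns the n most common words, and if there are ties at the cutoff, it simply decides arbitrarily which ones to return!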
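To make the cause concrete, here is a minimal sketch using the sentence from the question. The variable names (text, tied, truncated) are illustrative only and not from the original post. It shows that nine words are tied at a count of 1, while most_common(3) has room for only two of them, so the later sort can only reorder whatever happened to survive the cut.

```python
from collections import Counter

text = "betty bought a bit of butter but the butter was bitter"
counts = Counter(text.split())

# Every word except 'butter' occurs exactly once, so all of them tie.
tied = sorted(word for word, c in counts.items() if c == 1)

# most_common(3) keeps 'butter' plus only two of the tied words,
# picked without regard to alphabetical order.
truncated = counts.most_common(3)

# Sorting afterwards can only reorder those three survivors;
# it cannot bring back a word that was already dropped.
truncated.sort(key=lambda kv: (-kv[1], kv[0]))

print(len(tied), tied)
print(truncated)
```

Which two tied words survive depends on the Python version and on the internal ordering of the Counter, so the second printed line may differ between runs or interpreters.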

The simplest way to solve this is to avoid most_common and sort the list of items directly:

top_n = sorted(counts.items(), key=lambda x: (-x[1], x[0]))[:n]
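As a rough sketch of how that one-liner fits into the full function (the name count_words_sorted is made up here to keep it apart from the question's count_words), the whole item list is ranked first and only then cut to n:

```python
from collections import Counter

def count_words_sorted(s, n):
    """Return the n most frequent words in s, ties broken alphabetically."""
    counts = Counter(s.split())
    # Rank every (word, count) pair first, then cut down to n.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

print(count_words_sorted("betty bought a bit of butter but the butter was bitter", 3))
# [('butter', 2), ('a', 1), ('betty', 1)]
```

Because the tie-break is applied before anything is discarded, betty now wins over bitter as the question expected.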

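【Discussion】:

Please don't use sorted() and then slice. If the number of buckets is large, this wastes a lot of time. Use heapq.nsmallest() and perform only O(NlogK) steps rather than O(NlogN).
@MartijnPieters I'm well aware of that, and of heapq as well. But I don't think it matters much, because it is only a change in a logarithmic factor, so you would need exponentially large data for it to make a difference... and with exponentially large data you would run into plenty of other problems anyway, especially when using something like @987654332@, which as a hash table wastes quite a bit of space...
You really don't need a huge hash to see the difference add up; see this quick timing test. That is picking the top 3 out of 53 letter counts, and heapq already wins. heapq is never slower. When you go up to 1000 unique words, the difference is striking.
Note that N is multiplied by either log N or log K. log2(1000) is still 9.97, whereas log2(3) is only 1.58. That means the heapq.nsmallest() approach is more than twice as fast.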
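The arithmetic in the last comment can be checked with a few lines. This is only a back-of-the-envelope comparison of the logarithmic factors (base 2, with N = 1000 and K = 3 taken from the comment); it ignores constant factors, so it is not a prediction of measured running time.

```python
import math

N = 1000  # unique words (buckets), as in the comment
K = 3     # number of top items requested

full_sort_factor = math.log2(N)  # log factor in O(N log N)
heap_factor = math.log2(K)       # log factor in O(N log K)

print(round(full_sort_factor, 2))  # 9.97
print(round(heap_factor, 2))       # 1.58
print(round(full_sort_factor / heap_factor, 1))  # 6.3
```

The ratio of the log factors is larger than the measured speed-up reported in the timing test, which is expected: real timings also include per-item overhead such as calling the key function.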

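【Solution 2】:

You asked for the top 3, so you cut the data down before the items had been picked according to your specific sort order.

Rather than have most_common() pre-sort and then re-sort the result yourself, use heapq to pick items by your custom criteria (provided n is smaller than the actual number of buckets):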

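import heapq
from collections import Counter

def count_words(s, n):
    """Return the n most frequently occurring words in s."""
    counts = Counter(s.split())
    # Highest count first; ties broken alphabetically by word
    key = lambda kv: (-kv[1], kv[0])
    if n >= len(counts):
        # Everything is wanted, so a plain sort is enough
        return sorted(counts.items(), key=key)
    return heapq.nsmallest(n, counts.items(), key=key)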

On Python 2, you may want to use iteritems() rather than items() in the calls above.

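This re-creates the Counter.most_common() method, but with an updated key. As in the original, using heapq keeps this bound to O(NlogK) performance rather than O(NlogN) (where N is the number of buckets and K is the number of top elements you want to see).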
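As a quick sanity check that swapping in heapq changes only the cost and not the result, the sketch below compares heapq.nsmallest against a full sort followed by a slice on randomly generated words. The seed, the word length and the values of n are arbitrary choices made for this illustration.

```python
import heapq
import random
import string
from collections import Counter

random.seed(0)  # arbitrary seed, only to make the run repeatable
words = ["".join(random.choices(string.ascii_lowercase, k=3)) for _ in range(5000)]
counts = Counter(words)
key = lambda kv: (-kv[1], kv[0])

matches = []
for n in (1, 3, 10, 100):
    via_heap = heapq.nsmallest(n, counts.items(), key=key)
    via_sort = sorted(counts.items(), key=key)[:n]
    matches.append(via_heap == via_sort)

print(all(matches))
```

Since every word appears only once among the items, the key (-count, word) is unique per item, so both approaches have exactly one valid ordering to return.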

Demo:

>>> count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)
[('cat', 4), ('mat', 4), ('bat', 3)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 3)
[('butter', 2), ('a', 1), ('betty', 1)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 6)
[('butter', 2), ('a', 1), ('betty', 1), ('bit', 1), ('bitter', 1), ('bought', 1)]

And a quick performance comparison (on Python 3.6.0b1):

>>> from collections import Counter
>>> from heapq import nsmallest
>>> from random import choice, randrange
>>> from timeit import timeit
>>> from string import ascii_letters
>>> sentence = ' '.join([''.join([choice(ascii_letters) for _ in range(randrange(3, 15))]) for _ in range(1000)])
>>> counts = Counter(sentence)  # count letters
>>> len(counts)
53
>>> key = lambda kv: (-kv[1], kv[0])
>>> timeit('sorted(counts.items(), key=key)[:3]', 'from __main__ import counts, key', number=100000)
2.119404911005404
>>> timeit('nsmallest(3, counts.items(), key=key)', 'from __main__ import counts, nsmallest, key', number=100000)
1.9657367869949667
>>> counts = Counter(sentence.split())  # count words
>>> len(counts)
1000
>>> timeit('sorted(counts.items(), key=key)[:3]', 'from __main__ import counts, key', number=10000)  # note, 10 times fewer
6.689963405995513
>>> timeit('nsmallest(3, counts.items(), key=key)', 'from __main__ import counts, nsmallest, key', number=10000)
2.902360848005628

【Discussion】:

【Solution 3】:

You can fix it by calling .most_common() with no argument, then sorting and slicing the result, rather than passing n to most_common:

from collections import Counter

def count_words(s, n):
    """Return the n most frequently occurring words in s."""

    # Split words into list
    wordlist = s.split()

    # Count words
    counts = Counter(wordlist)

    # Get all (word, count) pairs, ordered by frequency
    top = counts.most_common()

    # Sort by count (descending); break ties alphabetically by word
    top.sort(key=lambda x: (-x[1], x[0]))

    return top[:n]

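【Discussion】:

This entirely undoes the optimization that most_common() contains; there is no point in having most_common() sort everything and then throwing most of it away.
Good point, heapq is the better approach, but I think this helps in understanding the problem.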