【Question title】: Getting the keys of items with the least counts from a list of tuples of key-value pairs - Python
【Posted】: 2018-02-12 11:16:04
【Question】:

The input is an unsorted list of tuples:

x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

The goal is to find the last n keys with the least counts, i.e. the desired output:

['orh', 'si', 'tai', 'titlo', 'da']

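I tried doing this by:

- first converting the list of tuples to a dictionary
- casting the dictionary into a Counter
- then taking the [-n:] list of tuples from Counter.most_common()
- casting the [-n:] list of tuples into a dictionary
- getting the keys and then casting them into a list

i.e.

from collections import Counter

n = 5
list(dict(Counter(dict(x)).most_common()[-n:]).keys())

Is there a simpler way to get the same output?

I could also do this: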

from operator import itemgetter
output, *_ = zip(*sorted(x, key=itemgetter(1))[n:])
list(output)

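But now I have just swapped Counter.most_common for sorted and itemgetter. I still need zip(*list) to extract the keys by unpacking the first value of each tuple from the zipped result.

There must be a simpler way.

Note

Note that the question is not asking to sort, but to extract the first elements of the tuples in the given original list. The extraction criterion is the last n items that have the lowest value in the second element.

The answers from the possible duplicate linked still require a step to unpack the sorted list of tuples and to extract the top n of the first elements.

【Comments】:

Possible duplicate
How are the values 'kup', 'gor', 'beer', 'hor', 'jia' related to your input tuples?
Your code does not produce your desired output.
@Lomtrur, after sorting by value there are still more steps to get the last/first few keys =)
@Goyo, RomanPerekhrest, pardon the copy+paste mistake. Corrected the output.

Tags: python list dictionary counter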

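【Answer 1】:

The goal is to find the last n keys with the least counts

Given this definition of the goal, neither of your two solutions fits. In the one with Counter you use dict, which makes the order of the keys undefined, so you will not get the last keys but some n keys with the least value. The second solution has incorrect slicing, and if it is fixed it returns the first n keys with the least value.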
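To make the second point concrete, here is a minimal sketch (the names `pairs` and `first_n` are illustrative and not from the question) of what the question's second attempt returns once the slice is changed to [:n]. Because sorted is stable, ties keep their input order, so the slice picks the earliest keys with the smallest count rather than the last ones:

```python
from operator import itemgetter

# Same data as in the question.
pairs = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38), ('tiao', 2),
         ('tut', 1), ('poh', 6), ('micheal', 1), ('orh', 1), ('horlick', 3),
         ('si', 1), ('tai', 1), ('titlo', 1), ('siew', 17), ('da', 1),
         ('halia', 2)]
n = 5

# Stable sort by count, then take the first n entries.
first_n = [key for key, _ in sorted(pairs, key=itemgetter(1))[:n]]
print(first_n)  # ['herr', 'dapao', 'cino', 'tut', 'micheal']
```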

Taking into account that the sorted implementation is stable, it can be rewritten like this to fit the goal:

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))

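But it is a better idea to use heapq, which is the stdlib tool for questions like "the n smallest/greatest values from an iterable" (as Martijn Pieters pointed out, nlargest and nsmallest are also stable and the docs do say so, but in an implicit way). This is especially true if the real list you have to deal with is big (for small n it should be faster than sorted, as the docs describe).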

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

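You can improve performance even further, but at the cost of order (sorting stability), i.e. you get some n keys with the least value instead of the last n keys with the least value. To do that you need to keep a "heapified" list and use it as your internal data structure instead of a plain list (if you don't change the list and need the bottom n only once, it will not give a performance benefit). You can push to and pop from the list, for example: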

_p2_heap = None

def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])

    return [item[1] for item in heapq.nsmallest(n, _p2_heap)]

Here is the complete module you can use to benchmark the solutions.

import heapq
from collections import Counter  

l = [
    ('herr', 1), ('dapao', 1),
    ('cino', 1), ('o', 38),
    ('tiao', 2), ('tut', 1),
    ('poh', 6), ('micheal', 1),
    ('orh', 1), ('horlick', 3),
    ('si', 1), ('tai', 1),
    ('titlo', 1), ('siew', 17),
    ('da', 1), ('halia', 2)
]
n = 5    

def author_1():
    return list(dict(Counter(dict(l)).most_common()[-n:]).keys())

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

_p2_heap = None    
def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])

    return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]

Here are the timeit results:

$ python -m timeit -s "import tst" "tst.author_1()"
100000 loops, best of 3: 7.72 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100000 loops, best of 3: 3.7 usec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
100000 loops, best of 3: 5.51 usec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
100000 loops, best of 3: 3.96 usec per loop

But if we make l = l * 1000, the difference becomes noticeable:

$ python -m timeit -s "import tst" "tst.author_1()"
1000 loops, best of 3: 263 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100 loops, best of 3: 2.72 msec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
1000 loops, best of 3: 1.65 msec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
1000 loops, best of 3: 767 usec per loop

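【Comments】:

Your prop_2 test is fatally flawed because _p2_heap is a global that is reused across the timed runs. The heapq.nsmallest() function already creates such a heap, and the only reason prop_2 is faster is that the global lets it cache state between test runs.
That said, this is definitely a heapq problem, where heapq.nsmallest gives you an O(K log N) solution (where K is the size of the input and N is the number of items requested). Sorting gives you an O(K log K) solution, which heapq will easily beat.
@MartijnPieters If you pay enough attention to the answer, you may notice it says that you can improve performance if you keep a "heapified" list, which you can push to and pop from. The test is clearly there to show that using it as the internal data structure (what you call cached) has a performance benefit. The heapq docs also say of nsmallest and nlargest that if repeated usage of these functions is required, you should consider turning the iterable into an actual heap. That is what was demonstrated.
I did misunderstand what you meant by "keep a heapified list". Please expand on that to make clear that you only get the performance improvement when you need to produce the top N multiple times while the input sequence is being updated.
Actually, looking at the implementation, I had forgotten that nlargest and nsmallest already add a counter to break ties, which makes these implementations stable. nlargest() counts down, nsmallest() counts up. You would only get an unstable sort if you used the heapsort function from the documentation.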
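The stability point from the comments can be checked with a small sketch. The toy data and the names `pairs` and `smallest` are made up for illustration; the only library behaviour relied on is that heapq.nsmallest with a key is documented as equivalent to sorted(iterable, key=key)[:n]:

```python
import heapq
from operator import itemgetter

# Toy data with several ties on the count.
pairs = [('a', 1), ('b', 2), ('c', 1), ('d', 1), ('e', 1)]

# Ties are returned in input order, exactly as a stable sort would give.
smallest = heapq.nsmallest(3, pairs, key=itemgetter(1))
print(smallest)  # [('a', 1), ('c', 1), ('d', 1)]
print(smallest == sorted(pairs, key=itemgetter(1))[:3])  # True
```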

【Answer 2】:

Just use a heap; it will give you the desired output.

import heapq

x = [('herr', 1),
('dapao', 1),
('cino', 1),
('o', 38),
('tiao', 2),
('tut', 1),
('poh', 6),
('micheal', 1),
('orh', 1),
('horlick', 3),
('si', 1),
('tai', 1),
('titlo', 1),
('siew', 17),
('da', 1),
('halia', 2)]

heap = [(item[1],-index,item[0]) for index, item in enumerate(x)]
heapq.heapify(heap)

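# nsmallest returns the latest-indexed ties first, so reverse to restore input order.
print([item[2] for item in heapq.nsmallest(5, heap)][::-1])

heapq.nsmallest(n, iterable, key=None) has a key parameter, which you can use instead of storing -index in the tuples as I did.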
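As a hedged sketch of how the key parameter could do the same tie-breaking without building an intermediate list of tuples: the lambda and the names `picked` and `result` below are illustrative choices, not part of the answer.

```python
import heapq

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38), ('tiao', 2),
     ('tut', 1), ('poh', 6), ('micheal', 1), ('orh', 1), ('horlick', 3),
     ('si', 1), ('tai', 1), ('titlo', 1), ('siew', 17), ('da', 1),
     ('halia', 2)]
n = 5

# Each element is (index, (name, count)); order by count, then prefer
# later positions by negating the index.
picked = heapq.nsmallest(n, enumerate(x),
                         key=lambda pair: (pair[1][1], -pair[0]))

# The picks come out latest-first, so reverse to get input order.
result = [item[0] for _, item in picked][::-1]
print(result)  # ['orh', 'si', 'tai', 'titlo', 'da']
```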

【Comments】:

This is definitely a heapq problem.

【Answer 3】:

[k for k,v in sorted(x, key=lambda x: x[1])[:n]]

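x is the list of (key, count) tuples and n is the number of keys required.

You can also adjust the sort criterion to include the keys themselves, if their order matters: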

[k for k,v in sorted(x, key=lambda x: (x[1], x[0]))[:n]]
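As a quick check of what the tuple key does on the question's data (a sketch; the name `by_count_then_name` is illustrative): ties on the count are broken alphabetically by key, which is a different selection from the "last n" order the question asks for.

```python
x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38), ('tiao', 2),
     ('tut', 1), ('poh', 6), ('micheal', 1), ('orh', 1), ('horlick', 3),
     ('si', 1), ('tai', 1), ('titlo', 1), ('siew', 17), ('da', 1),
     ('halia', 2)]
n = 5

# Sort by count first, then alphabetically by key to break ties.
by_count_then_name = [k for k, v in sorted(x, key=lambda t: (t[1], t[0]))[:n]]
print(by_count_then_name)  # ['cino', 'da', 'dapao', 'herr', 'micheal']
```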

【Comments】:

【Answer 4】:

Edit, @alvas:

mi = min(x, key =lambda x:x[1])[1]
r = [a[0] for a in x if a[1] == mi][-5:]

will produce your desired output.

You can use this:

sorted(x, key=lambda x: x[1])

Please refer to this (possible duplicate):

Sort a list of tuples by 2nd item (integer value)

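【Comments】:

This sorts the values in key order; he can loop over sorted(x, key=lambda x: x[1])[-n:] to get the last n keys.
@vaultah Edited. This produces the output alvas expects.
@vaultah Can you provide the complete list? I will check it. Also, this gives those with the least count; if alvas needs them in ascending order, he will have to go with the second approach, sorting.
@YuvrajJaiswal No, both solutions produce wrong answers.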

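【Answer 5】:

If you don't want to reinvent the wheel, you can use pandas. Performance should be good, since it is based on NumPy, which uses C under the hood instead of pure Python.

Short answer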

df = pd.DataFrame(x, columns=['name', 'count'])
df = df.sort_values(by='count', kind='mergesort', ascending=False).tail(n)
print(df['name'].tolist())

Result

['orh', 'si', 'tai', 'titlo', 'da']

Extended working example with comments

import pandas as pd

n = 5
x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

# Put the data in a dataframe.
df = pd.DataFrame(x, columns=['name', 'count'])

# Get the last n rows having the smallest 'count'.
# Mergesort is used instead of quicksort (default) since a stable sort is needed
# to get the *last* n smallest items instead of just *any* n smallest items.
df = df.sort_values(by='count', kind='mergesort', ascending=False).tail(n)

# Print the 'name' column as a list (since a list is what you asked for).
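print(df['name'].tolist())

【Comments】:

【Answer 6】:

[i[0] for i in sorted(reversed(x), key=lambda t: t[1])[:n]][::-1]

Almost exactly the same as @Stacksonstacks's answer, except that this one actually gives you the "desired output" (if you use n = 5). The trailing [::-1] restores the original order of the selected keys.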

【讨论】：

【Answer 7】:

You don't really need any imports for this task; you can also do it the following way:

x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

n = 5
result = [name[0] for name in sorted(x, key=lambda i: i[1], reverse=True)[-n:]]
print(result)

Output:

['orh', 'si', 'tai', 'titlo', 'da']

【Comments】:

【Answer 8】:

Here is my suggestion:

n = 5
output=[]

# Search and store the n least numbers
leastNbs = [a[1] for a in sorted(x, key=lambda x: x[1])[:n]]

# Iterate over the list of tuples starting from the end
# in order to find the tuples including one of the n least numbers
for name, nb in reversed(x):  # separate loop variable so x is not overwritten
    if nb in leastNbs:
        output.append(name)  # Store the string in output
        print(name)

# Keep only the n last strings (starting from the end)
output = list(reversed(output[:n]))

print(output)

【Comments】:

【Answer 9】:

Here is a clean, simple approach that doesn't use Python idioms:

m = x[0][1]
l = []

for elem in x:
    if m > elem[1]:
        l = [elem[0]]
        m = elem[1]
    elif m == elem[1]:
        l.append(elem[0])

print(l[-5:])

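This is a bit like a fusion of a minimum search and a filter. m stores the minimum value found so far, and l stores the list of elements that have that minimum value. When you find a lower value, you reset them both.

This can be modified to hold only 5 elements, so that no slicing is needed at the end.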
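One way that modification might look is sketched below, assuming a collections.deque with maxlen is an acceptable container. The function name `last_n_with_min` is made up for illustration. Like the loop above, it only considers keys whose count equals the overall minimum, so it can return fewer than n keys.

```python
from collections import deque

def last_n_with_min(pairs, n):
    """Return up to n of the last keys whose count equals the minimum."""
    smallest = None
    keys = deque(maxlen=n)  # silently drops the oldest key once full
    for key, count in pairs:
        if smallest is None or count < smallest:
            # New minimum found: start over with this key only.
            smallest = count
            keys.clear()
            keys.append(key)
        elif count == smallest:
            keys.append(key)
    return list(keys)

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38), ('tiao', 2),
     ('tut', 1), ('poh', 6), ('micheal', 1), ('orh', 1), ('horlick', 3),
     ('si', 1), ('tai', 1), ('titlo', 1), ('siew', 17), ('da', 1),
     ('halia', 2)]
print(last_n_with_min(x, 5))  # ['orh', 'si', 'tai', 'titlo', 'da']
```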

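【Comments】:

【Answer 10】:

Pure Python solution

Since we are trying to find the n elements in order from smallest to largest, we can't simply filter out those that don't have the smallest second element. We also have the second aim of trying to maintain order, which rules out merely sorting on the second element of each tuple.

The filtering pass in my solution has complexity O(n), which is the best you can do for that step, since we are creating a new list that depends on a pre-existing list. Note that the sorted call that precedes it is O(n log n).

It works by creating a set (unordered) of the first element of each of the first n tuples of x, after x has been reversed ([::-1]) and then sorted on the second element. There is a neat trick here: because we slice before converting to a set, order is still respected among the tuples with equal second elements.

Now, the neat thing about using a set is that lookup is O(1) on average, because elements are stored according to their hashes, so calling __contains__ is much faster than with a list.

Finally, we need a list comprehension to do the final filtering of x: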

>>> n = 5
>>> s = {i[0] for i in sorted(x[::-1], key=lambda t: t[1])[:n]}
>>> [i for i, _ in x if i in s]
['orh', 'si', 'tai', 'titlo', 'da']

A further test shows that it also works with n = 11:

['herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da', 'halia']
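One caveat: filtering by key name assumes that the keys in x are unique. A hedged variant that selects by position instead, so repeated keys would not be over-selected, could look like this (the function name `last_n_least` is illustrative, and this is a sketch of the same idea rather than the answer's own code):

```python
def last_n_least(pairs, n):
    """Keys of the n entries with the smallest counts, preferring later
    entries on ties, returned in their original order."""
    # Rank positions by count, then by descending index (later first).
    ranked = sorted(range(len(pairs)), key=lambda i: (pairs[i][1], -i))
    keep = set(ranked[:n])
    return [name for i, (name, _) in enumerate(pairs) if i in keep]

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38), ('tiao', 2),
     ('tut', 1), ('poh', 6), ('micheal', 1), ('orh', 1), ('horlick', 3),
     ('si', 1), ('tai', 1), ('titlo', 1), ('siew', 17), ('da', 1),
     ('halia', 2)]
print(last_n_least(x, 5))  # ['orh', 'si', 'tai', 'titlo', 'da']
```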

【Comments】:

@vaultah How so?
@vaultah Oh! My bad, I didn't realise the OP meant the least in order of smallest, rather than literally the minimum.

【Answer 11】:

Using a list comprehension and sorted:

[key for key,value in sorted(x, key=lambda y: y[1], reverse=True)][-n:]

or

[key for key,value in sorted(reversed(x), key=lambda y: y[1])][:n][::-1]

where n is the number of keys you want in the result. Note that the latter, with [::-1], is more expensive because it slices the list once more to reverse it.

from timeit import default_timer

def timeit(method, *args, **kwargs):
    start = default_timer()
    result = method(*args, **kwargs)
    end = default_timer()
    print('%s:\n(timing: %fs)\n%s\n' % (method.__name__, (end - start), result))

def with_copy(x, n):
    return [key for key,value in sorted(reversed(x), key=lambda y: y[1])][:n][::-1]

def without_copy(x, n):
    return [key for key,value in sorted(x, key=lambda y: y[1], reverse=True)][-n:]

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38), ('tiao', 2),
     ('tut', 1), ('poh', 6), ('micheal', 1), ('orh', 1), ('horlick', 3),
     ('si', 1), ('tai', 1), ('titlo', 1), ('siew', 17), ('da', 1),
     ('halia', 2)]
n = 5
timeit(with_copy, x, n)
timeit(without_copy, x, n)
n = 11
timeit(with_copy, x, n)
timeit(without_copy, x, n)

Results for `n = 5`:

with_copy:
(timing: 0.000026s)
['orh', 'si', 'tai', 'titlo', 'da']

without_copy:
(timing: 0.000018s)
['orh', 'si', 'tai', 'titlo', 'da']

Results for `n = 11`:

with_copy:
(timing: 0.000019s)
['halia', 'herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da']

without_copy:
(timing: 0.000013s)
['halia', 'herr', 'dapao', 'cino', 'tut', 'micheal', 'orh', 'si', 'tai', 'titlo', 'da']
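Single-run timings at the microsecond scale are noisy, so the gap above may not be reproducible. A possible way to compare the two functions more robustly is the standard timeit module; the repeat and number values below are arbitrary choices for this sketch, and no timings are quoted because they depend on the machine.

```python
import timeit

def with_copy(x, n):
    return [key for key, value in sorted(reversed(x), key=lambda y: y[1])][:n][::-1]

def without_copy(x, n):
    return [key for key, value in sorted(x, key=lambda y: y[1], reverse=True)][-n:]

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38), ('tiao', 2),
     ('tut', 1), ('poh', 6), ('micheal', 1), ('orh', 1), ('horlick', 3),
     ('si', 1), ('tai', 1), ('titlo', 1), ('siew', 17), ('da', 1),
     ('halia', 2)]
n = 5
runs = 10_000

for func in (with_copy, without_copy):
    # Best of 5 repeats, each making `runs` calls.
    best = min(timeit.repeat(lambda: func(x, n), repeat=5, number=runs))
    print('%s: %.3f usec per call' % (func.__name__, best / runs * 1e6))
```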

【Comments】:

【Answer 12】:

No sorting is needed in this solution.

Short solution:

import numpy as np 
n = 5
x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

x = np.array(x)  # make the list a numpy array
names = x[:, 0]   
numbers = x[:, 1].astype(int)
least_count = np.take(names, np.where(numbers == np.min(numbers)))[0][-n:]
print(least_count)

Output of the above solution:
```
['orh' 'si' 'tai' 'titlo' 'da']
```

Explanation of the solution with comments

import numpy as np 

x = [('herr', 1),
 ('dapao', 1),
 ('cino', 1),
 ('o', 38),
 ('tiao', 2),
 ('tut', 1),
 ('poh', 6),
 ('micheal', 1),
 ('orh', 1),
 ('horlick', 3),
 ('si', 1),
 ('tai', 1),
 ('titlo', 1),
 ('siew', 17),
 ('da', 1),
 ('halia', 2)]

x = np.array(x)  # make the list a numpy array
# ==========================================
# split the array into names and numbers
# ==========================================
names = x[:, 0]   
numbers = x[:, 1].astype(int)

mini = np.min(numbers)  # find the minimum in the numbers array
idx = np.where(numbers == mini)   # Find the indices where minimum occurs in the numbers array
least_count = np.take(names, idx)[0] # Use the indices found from numbers array in the above line to access names array
print(least_count)
least_count = least_count.tolist()  # to convert the numpy array to list
n = 5   # say n is 5
print(least_count[-n:]) # now you can do simple slicing to extract the last n element

Output of the above explanation:

['herr' 'dapao' 'cino' 'tut' 'micheal' 'orh' 'si' 'tai' 'titlo' 'da']
['orh', 'si', 'tai', 'titlo', 'da']

【Comments】:
