Counting repeated (duplicate) values in a list using a dictionary

【Question title】: Counting repeated (duplicated) in a list using dictionary
【Posted】: 2017-01-13 02:17:47
【Question】:

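I am writing a program to identify duplicate values, and their counts, in a specific column (named "StrId") of an Excel spreadsheet. Besides finding the duplicates, I also need to know how many times each value is repeated.

The Excel data is processed into a list of dictionaries (one dictionary per row), with the headers as keys and the cell data as values, e.g. [{'StrId' : 1, 'ProjId' : 358}][{'StrId' : 2, 'ProjId' : 984...}] and so on.

My plan is to first pick out the "StrId" key in each dictionary and collect its values in a list, then build another dictionary from that list that counts each value, so I can single out the values that occur more than once.

Here is my code. At the moment it raises a "KeyError" on the first value and then stops.

Any help would be greatly appreciated. Thanks in advance.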

from openpyxl import load_workbook
workbook = load_workbook('./fullallreadyconversionxmlclean4.xlsx')
sheet = workbook['Full-All']
headers = ["StrId", "ProjectId", "TweetText", "Label"]

excel_data = []
for row_num, row in enumerate(sheet):
    if row_num is 0:
        continue
    row_data = {}
    for col_num, cell in enumerate(row):
        if col_num > len(headers) - 1:
            continue
        key = headers[col_num]
        value = cell.value
        row_data[key] = value
    excel_data.append(row_data)    


for row in excel_data:
    for key in row:    
        if key is 'StrId':
            value = row[key]
            list_ids = []
            list_ids.append(value)

            dup_dic = {}           
            for  value in list_ids:
                if value in list_ids:
                    dup_dic[value] +=1
                else:
                    dup_dic[value] =1                

                print dup_dic

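【Comments】:

What is the output of print value?
Is ProjId relevant here, or are you trying to find, e.g., how many times 1 appears in the StrId column?
print value shows the list of values for the key. However, after I put the values into list_ids, they show up with a "u" prefix (for unicode). I don't know why.
ProjId is not relevant, but it is part of the information converted from Excel into the list of dicts.
Why is there a dict inside a sublist? Can there be more than one?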

Tags: python excel dictionary duplicates

【Solution 1】:

Here is a possible solution:

from collections import defaultdict

excel_data = [
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 1, 'ProjId': 358},
]

output = defaultdict(int)

for row in excel_data:
    if 'StrId' in row:
        output[row['StrId']] += 1

print(output)

If you have any questions about the code above, take a look at collections.defaultdict.
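For context, the KeyError in the question most likely comes from incrementing a key that is not in the dictionary yet. The following is a minimal sketch of the difference between a plain dict and defaultdict(int); the list `str_ids` and its values are made up for illustration and are not the asker's data.

```python
from collections import defaultdict

# Hypothetical sample values standing in for the 'StrId' column.
str_ids = [2, 2, 1, 2, 1]

# Plain dict: incrementing a key that is not present raises KeyError.
plain = {}
try:
    plain[str_ids[0]] += 1
except KeyError as err:
    error_key = err.args[0]  # the key that was missing

# defaultdict(int): a missing key starts at int() == 0, so += 1 works.
counts = defaultdict(int)
for value in str_ids:
    counts[value] += 1

print(dict(counts))
```

With a plain dict, the same effect can be had with `plain[value] = plain.get(value, 0) + 1`.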

【Discussion】:

This answer works, although it does not sort by the number of repetitions. Thank you very much.
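If ordering by repetition count is wanted, one option is to sort the (value, count) pairs afterwards. This is only a sketch: the `output` dict below holds hypothetical counts shaped like the result of the loop in the answer, not real data.

```python
from collections import Counter

# Hypothetical counts, shaped like the result of the counting loop.
output = {2: 5, 1: 4, 7: 1}

# Sort (value, count) pairs by count, highest first.
by_count = sorted(output.items(), key=lambda pair: pair[1], reverse=True)

# Counter.most_common() returns the pairs ordered by count as well.
most_common = Counter(output).most_common()

print(by_count)
```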

【Solution 2】:

You can use Python's Counter for this. I am assuming your excel_data is a list of lists, each containing one dictionary; if that is not the case, let me know.

from collections import Counter

excel_data = [
    [{'StrId': 1, 'ProjId': 358}],
    [{'StrId': 2, 'ProjId': 984}],
    [{'StrId': 2, 'ProjId': 984}],
    [{'StrId': 2, 'ProjId': 984}],
]

# create a list of all values
flattened_values = [list_dict[0]['StrId'] for list_dict in excel_data]

# pass them to counter to get a dict of value to count
counter = Counter(flattened_values)  # Counter({2: 3, 1: 1})

# use dictionary comprehension to create a dict from this counter with only
# values with count > 1 to find duplicates
repetitions = {
    val: count for val, count in counter.items() if count > 1
}  # {2: 3}

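【Discussion】:

I got a KeyError: 0 message. Can you tell me why?
That means your data is not structured the way you posted it. Instead of a list of lists, it is probably a list of dicts? Please post the actual structure of your data.
I have added the code that structures the data to the question. I think it is the list of dictionaries mentioned in my post. Thank you very much for your help!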
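A short sketch of why KeyError: 0 would appear if the data is a flat list of dicts, as the code in the question suggests. The rows below are assumed sample data: indexing a dict with [0] looks up the key 0 rather than the first element, and dropping that index lets the same Counter approach work.

```python
from collections import Counter

# Assumed shape: a flat list of dicts, one dict per spreadsheet row.
excel_data = [
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
]

# excel_data[0] is a dict, so [0] looks up the *key* 0, which is absent.
try:
    excel_data[0][0]
except KeyError as err:
    missing_key = err.args[0]

# Without the extra [0], the Counter approach applies unchanged.
counter = Counter(row['StrId'] for row in excel_data)
repetitions = {val: count for val, count in counter.items() if count > 1}

print(repetitions)
```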

【Solution 3】:

If the sublists can contain more than one dictionary, you can flatten them with itertools.chain:

from collections import Counter
excel_data = [
    [{'StrId': 1, 'ProjId': 358},{'StrId': 5, 'ProjId': 358}],
    [{'StrId': 2, 'ProjId': 984},{'StrId': 3, 'ProjId': 358}],
    [{'StrId': 2, 'ProjId': 984}],
    [{'StrId': 2, 'ProjId': 984}],
]

from itertools import chain
from operator import itemgetter

print(Counter(map(itemgetter("StrId"), chain(*excel_data))))

But you appear to have a list of dicts, so you can drop the chain:

from collections import Counter
from operator import itemgetter

print(Counter(map(itemgetter("StrId"), excel_data)))

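Do not use is when comparing strings: is checks the identity of an object. Use == instead, i.e. if key == 'StrId', although it would make more sense to simply look the key up directly rather than loop over every key. Also give your variables better names; row is not a great name for a dict.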
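A small sketch of both points, using hypothetical rows in the shape described in the question. The names `records`, `key_name`, and `str_ids` are chosen here for illustration only.

```python
# Hypothetical rows in the shape described in the question.
records = [
    {'StrId': 1, 'ProjId': 358},
    {'StrId': 2, 'ProjId': 984},
    {'StrId': 2, 'ProjId': 984},
]

# == compares values. `is` compares object identity, so whether it is
# True for two equal strings depends on interpreter details such as
# string interning and should not be relied on.
key_name = ''.join(['Str', 'Id'])  # an equal string built at runtime
same_value = key_name == 'StrId'

# Direct lookup instead of looping over every key of each record.
str_ids = [record['StrId'] for record in records]

print(str_ids)
```

If some rows might lack the key, `record.get('StrId')` returns None instead of raising KeyError.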

【Discussion】:

Thanks, Padraic. Your code is the best solution to this problem. I will follow your advice as well. Thanks again.