How do I count the number of words spoken by each character in a dialogue and store the count in a dictionary? (Answers)

【Question title】: How do I count the number of words spoken by each character in a dialogue and store the count in a dictionary?
【Posted】: 2021-07-02 01:43:00
【Question】:

I'm trying to count the number of words spoken by the characters "Michael" and "Jim" in the dialogue below and store the counts in a dictionary like {"Michael:":15, "Jim:":10}.

string = "Michael: All right Jim. Your quarterlies look very good. How are things at the library? Jim: Oh, I told you. I couldn’t close it. So… Michael: So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? Jim: Actually, you called me in here, but yeah. Michael: All right. Well, let me show you how it’s done."

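My idea was to create a dictionary with the character names as keys and zero as the initial values, split the string on " ", use the keys as reference points to count the list elements that fall between character names, and then store those word counts as the values. This is the code I have so far: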

dict = {"Michael:" : 0,
        "Jim:" : 0}

list = string.split(" ")

indices = [i for i, x in enumerate(list) if x in dict.keys()]
nums = []
for i in range(1,len(indices)):
    nums.append(indices[i] - indices[i-1])
print(nums)

The result is a list that prints as [15, 10, 15, 9].
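
A note on why the numbers look this way: each difference runs from one name token to the next, so it counts the name itself plus the words after it, and the last speaker's line is never measured because no name follows it. Below is a minimal sketch of one way to repair the index-difference approach, assuming the same single-space tokenisation. The helper name `segment_lengths` and the shortened sample are illustrative and not part of the original post.

```python
def segment_lengths(text, names):
    """Return (name, word_count) pairs, one per spoken line, in order."""
    words = text.split(" ")
    # Positions of the name tokens such as "Michael:" or "Jim:".
    indices = [i for i, w in enumerate(words) if w in names]
    # Sentinel: treat the end of the text as one more boundary so the
    # last speaker's line is measured too.
    bounds = indices + [len(words)]
    # Subtract 1 so the name token itself is not counted as a word.
    return [(words[start], end - start - 1)
            for start, end in zip(bounds, bounds[1:])]


sample = "Michael: All right Jim. Jim: Oh, I told you. Michael: Fine."
print(segment_lengths(sample, {"Michael:", "Jim:"}))
# [('Michael:', 3), ('Jim:', 4), ('Michael:', 1)]
```

Summing the pairs per name would then give the dictionary values.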

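I think I need help with the following:

- A better approach, if there is one
- A way to count the words a character speaks when that line is the last line of the dialogue
- A way to update the dictionary automatically with the number of words each character speaks

The last point is crucial, because I'm trying to replicate this process for the quotes of an entire episode.

Thanks in advance!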

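【Question comments】:

Don't use built-in names as variable names.
@Sujay means that string is a standard library module, so using it as a variable name makes the module unavailable (yes, you could import string as still_available_string).
@JLPeyret, the same goes for list and dict.
Right, noted; I had simply reused the OP's string definition.
@beginnerprogrammerforever Good... accepting an answer or upvoting the ones you find helpful is the usual way to thank people here.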
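
To illustrate the shadowing problem raised in these comments, here is a small sketch (the function name `shadowing_demo` is made up for illustration). Once `list` is rebound to a list object, calling `list(...)` in that scope fails.

```python
def shadowing_demo():
    # Rebinding the name hides the built-in list type inside this function.
    list = "a b c".split(" ")
    try:
        list("abc")          # now calls a list object, not the built-in
    except TypeError as exc:
        return str(exc)
    return "no error"


print(shadowing_demo())  # 'list' object is not callable
```

Outside the function the built-in is untouched, which is why the problem is easy to miss until the name is reused at module level.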

Tags: python string dictionary parsing

【Solution 1】:

Loop over the words, incrementing the count for the current speaker as you go.

dialogue_dict = {"Michael:" : 0, "Jim:" : 0}

words = string.split(" ")
current_character = None
for word in words:
    if word in dialogue_dict:
        current_character = word
    elif current_character:
        dialogue_dict[current_character] += 1

By the way, don't use list and dict as variable names; doing so shadows the built-ins of the same name.
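
As a rough sketch, the same loop can be wrapped in a function so it can be reused for a whole episode. The name `count_words` and the short sample dialogue are illustrative assumptions, and speakers are still matched as exact tokens such as "Jim:".

```python
def count_words(text, names):
    """Count words per speaker; names are tokens like 'Jim:'."""
    counts = {name: 0 for name in names}
    current = None
    # split() without arguments also tolerates repeated spaces and newlines.
    for word in text.split():
        if word in counts:
            current = word          # a new speaker starts here
        elif current is not None:
            counts[current] += 1    # attribute the word to the current speaker
    return counts


sample = "Michael: All right Jim. Jim: Oh, I told you. Michael: Fine."
print(count_words(sample, ["Michael:", "Jim:"]))
# {'Michael:': 4, 'Jim:': 4}
```

Because the running speaker persists until the next name appears, the last line of the dialogue needs no special handling.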

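【Comments】:

Thanks, Barmar. I have some follow-up questions to make sure I understand this clearly - 1. Why didn't you use ``` if word in dialogue_dict.keys(): ```? Shouldn't we be looking only at the keys?
When a dict is used as an iterable, it yields only its keys, and a membership test checks the keys as well. So in dialogue_dict and in dialogue_dict.keys() are equivalent.
You can see the same thing when you write for key in dialogue_dict: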
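
A quick check of this behaviour, reusing the `dialogue_dict` from the answer:

```python
dialogue_dict = {"Michael:": 0, "Jim:": 0}

# Membership tests look at keys, with or without .keys().
print("Jim:" in dialogue_dict)          # True
print("Jim:" in dialogue_dict.keys())   # True
print(0 in dialogue_dict)               # False: 0 is a value, not a key

# Iterating over the dict yields its keys in insertion order.
print(list(dialogue_dict))              # ['Michael:', 'Jim:']
```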

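【Solution 2】:

- Use a regex to split by character names, keeping the character separators, then iterate over the character/line pairs in chunks of 2.
- Use collections.defaultdict(int) so that a new character automatically starts at 0, and add the number of words from splitting the current line.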

string_ = "Michael: All right Jim. Your quarterlies look very good. How are things at the library? Jim: Oh, I told you. I couldn’t close it. So… Michael: So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? Jim: Actually, you called me in here, but yeah. Michael: All right. Well, let me show you how it’s done."

import re
from collections import defaultdict

#This assumes a character name has no blanks and is followed by a `:`
pat = re.compile("([A-Z][a-z'-]+:)")

#splitting with a capturing group returns the delimiters (the character names) as well
li = [v for v in pat.split(string_) if v]

# split 2 by 2
def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

#use a defaultdict to start new characters at 0
#collections.Counter could also work
counter = defaultdict(int)

pairs = chunks(li,2)
for character, line in pairs:
    counter[character.rstrip(":")] += len(line.split())
 
print(f"{counter=}")

Output:

counter=defaultdict(<class 'int'>, {'Michael': 38, 'Jim': 17})
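
To make the splitting step concrete, here is a small sketch on a shortened dialogue (the sample text is an assumption for illustration). It shows what `re.split` returns when the pattern contains a capturing group, and it uses `collections.Counter`, which the code comment above mentions as an alternative to `defaultdict`.

```python
import re
from collections import Counter

pat = re.compile(r"([A-Z][a-z'-]+:)")
sample = "Jim: Oh, I told you. Michael: All right."

# The capturing group makes split() keep the matched names in the result.
parts = pat.split(sample)
print(parts)
# ['', 'Jim:', ' Oh, I told you. ', 'Michael:', ' All right.']

pieces = [p for p in parts if p]          # drop the leading empty string
counts = Counter()
# Even positions hold names, odd positions hold the lines they spoke.
for name, line in zip(pieces[0::2], pieces[1::2]):
    counts[name.rstrip(":")] += len(line.split())

print(dict(counts))  # {'Jim': 4, 'Michael': 2}
```

Slicing with `[0::2]` and `[1::2]` is just another way to walk the list two items at a time, in place of the `chunks` helper.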

【Comments】:

【Solution 3】:

We can do this with regular expressions. There is no need to supply the speaker names in advance.

import re

string = "Michael: All right Jim. Your quarterlies look very good. How are things at the library? Jim: Oh, I told you. I couldn’t close it. So… Michael: So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? Jim: Actually, you called me in here, but yeah. Michael: All right. Well, let me show you how it’s done."
dialog_count = {}

#extract speakers using regex
speakers = re.findall(r'\w+:',string)
#split sentences using regex
sentences = re.split(r'\w+:',string)
speakers = filter(lambda x: x.strip()!='' ,speakers)
sentences = filter(lambda x: x.strip()!='' ,sentences)

#remap each speaker to its sentence
dialogs = zip(list(speakers),list(sentences))

#count total words
for speaker,dialog in dialogs:
    dialog = dialog.split(" ")
    dialog = list(filter(lambda x: x.strip()!='',dialog))
    dialog_count[speaker] = dialog_count.get(speaker,0) + len(dialog)
print(dialog_count)

{'Michael:': 38, 'Jim:': 17}
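
One caveat, sketched below on a made-up sample: `\w+:` treats any word followed by a colon as a speaker, so a time such as 10:30 inside a line is picked up as well. Requiring a capitalised word, as in the second pattern, is one possible tightening. It rests on an assumption about how names are written and is not part of the original answer.

```python
import re

sample = "Jim: Meet me at 10:30. Michael: Fine."

# Any run of word characters before a colon counts as a "speaker".
loose = re.findall(r"\w+:", sample)
print(loose)   # ['Jim:', '10:', 'Michael:']

# Assumed convention: a name is one capitalised word followed by a colon.
strict = re.findall(r"\b[A-Z][a-z]+:", sample)
print(strict)  # ['Jim:', 'Michael:']
```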

【Comments】: