【Question title】: Counting the words a character said in a movie script
【Posted】: 2018-04-17 14:55:56
【Question】:

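With some help I have managed to detect the spoken words. Now I am looking to get the text spoken by a chosen person, so that I can type in MIA and get every word she says in the movie, like this: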

name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)

That way I can count the words later.

This is what the movie script looks like:

An awkward beat. They pass a wooden SALOON -- where a WESTERN
 is being shot. Extras in COWBOY costumes drink coffee on the
 steps.
                     Revision                        25.


                   MIA (CONT'D)
      I love this stuff. Makes coming to work
      easier.

                   SEBASTIAN
      I know what you mean. I get breakfast
      five miles out of the way just to sit
      outside a jazz club.

                   MIA
      Oh yeah?

                   SEBASTIAN
      It was called Van Beek. The swing bands
      played there. Count Basie. Chick Webb.
             (then,)
      It's a samba-tapas place now.

                   MIA
      A what?

                   SEBASTIAN
      Samba-tapas. It's... Exactly. The joke's on
      history.

【Comments】:

Tags: python python-3.x text count movie

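【Solution 1】:

I would first ask the user for all the names in the script, then ask which name they want the words for. I would search through the text word by word until I found the wanted name, then copy the following words into a variable until I hit a name that matches someone else in the script. Characters can of course say another character's name, but if you assume that speaker headings are all caps and on a line of their own, the text should be fairly easy to filter.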

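recording = False  # True while the chosen speaker is talking
spoken_text = ""

for word in script.split():  # iterate over words, not single characters
    if word == speaker and word.isupper(): # you may want to check that this is on its own line as well.
        recording = True
        continue  # do not record the speaker's name itself
    elif word in character_names and word.isupper():  # you may want to check that this is on its own line as well.
        recording = False

    if recording:
        spoken_text += word + " "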

【Comments】:

This is a rough algorithm and may need refining to deal with unwanted things such as (CONT'D).
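
The two refinements mentioned above (checking that the name is alone on its line, and ignoring (CONT'D)) could be sketched roughly like this. It is only an illustration: `words_spoken`, its parameters and the regex are assumptions, not part of the original answer, and Sebastian's line in the sample is made up to show a name appearing inside dialogue.

```python
import re

def words_spoken(script, speaker, character_names):
    """Collect the words said by `speaker`, switching only on heading lines."""
    recording = False
    spoken = []
    for line in script.splitlines():
        # Drop a trailing parenthetical such as (CONT'D) before comparing.
        heading = re.sub(r"\s*\([^)]*\)\s*$", "", line).strip()
        if heading in character_names:
            # The name is alone on its line, so treat it as a speaker change.
            recording = (heading == speaker)
            continue
        if recording:
            spoken.extend(line.split())
    return spoken

NAME = " " * 19     # indentation of speaker headings in the excerpt
DIALOGUE = " " * 6  # indentation of dialogue lines in the excerpt

sample = "\n".join([
    NAME + "MIA (CONT'D)",
    DIALOGUE + "I love this stuff. Makes coming to work",
    DIALOGUE + "easier.",
    "",
    NAME + "SEBASTIAN",
    DIALOGUE + "I know what you mean, MIA.",  # made-up line: a name inside dialogue
    "",
    NAME + "MIA",
    DIALOGUE + "Oh yeah?",
])

mia_words = words_spoken(sample, "MIA", {"MIA", "SEBASTIAN"})
print(len(mia_words), mia_words)
```

Parentheticals on their own line, such as (then,), would still be collected as words here, so a real script probably needs one more filter.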

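【Solution 2】:

I'll outline how to build a dict that gives you the number of words spoken by every speaker, followed by a version that approximates your existing implementation.

General use

If we define a word as any chunk of characters in a string split along ' ' (a space)...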

import re

speaker = ''     # current speaker
words = 0        # number of words on line
word_count = {}  # dict of speakers and the number of words they speak

for line in script.split('\n'):
    if re.match(r'^[ ]{19}[^ ]{1,}.*', line):  # name of speaker
        speaker = line.split(' (')[0][19:]
    if re.match(r'^[ ]{6}[^ ]{1,}.*', line):  # dialogue line
        words = len(line.split())
        if speaker in word_count:
            word_count[speaker] += words
        else:
            word_count[speaker] = words

If John Doe speaks 55 words, this produces a dict of the form {'JOHN DOE': 55}.

Sample output:

>>> word_count['MIA']

13

Your implementation

Here is a version of the procedure above that approximates your implementation.

import re

def wordsspoken(script, name):
    word_count = 0
    for line in script.split('\n'):
        if re.match(r'^[ ]{19}[^ ]{1,}.*', line):  # name of speaker
            speaker = line.split(' (')[0][19:]
        if re.match(r'^[ ]{6}[^ ]{1,}.*', line):  # dialogue line
            if speaker == name:
                word_count += len(line.split())
    print(word_count)

def main():
    name = input("Enter name:")
    wordsspoken(script, name)
    name1 = input("Enter another name:")
    wordsspoken(script, name1)

【Comments】:

I get this error: Traceback (most recent call last): File "/Users/*path*.py", line 19, in wordsspoken(script, name) File "/Users/*path*.py", line 13, if speaker == name: UnboundLocalError: local variable 'speaker' referenced before assignment. Do you know what to change?
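This happens if you give wordsspoken a script in which the first dialogue line is read before any speaker has been introduced, for example if you pass everything after MIA (CONT'D) instead of the whole script. This code does not account for dialogue without a speaker, but you can handle it by assigning a generic name or by discarding dialogue lines that have no speaker.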
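
Both options from the comment above can be sketched in one small variant. This is only an illustration under assumptions: `default_speaker` is a hypothetical parameter that is not in the original answer, and the function returns the count instead of printing it so the result is easier to check.

```python
import re

def wordsspoken(script, name, default_speaker=None):
    """Count the words on dialogue lines attributed to `name`.

    `default_speaker` (hypothetical) is credited with dialogue that appears
    before any speaker heading; with None such lines are simply discarded.
    """
    speaker = default_speaker  # defined up front, so no UnboundLocalError
    word_count = 0
    for line in script.split('\n'):
        if re.match(r'^[ ]{19}[^ ]{1,}.*', line):  # name of speaker
            speaker = line.split(' (')[0][19:]
        if re.match(r'^[ ]{6}[^ ]{1,}.*', line):   # dialogue line
            if speaker is not None and speaker == name:
                word_count += len(line.split())
    return word_count

NAME = " " * 19     # indentation of speaker headings in the excerpt
DIALOGUE = " " * 6  # indentation of dialogue lines in the excerpt

# Everything after "MIA (CONT'D)": the first dialogue has no heading.
fragment = "\n".join([
    DIALOGUE + "I love this stuff. Makes coming to work",
    DIALOGUE + "easier.",
    "",
    NAME + "SEBASTIAN",
    DIALOGUE + "I know what you mean. I get breakfast",
])

print(wordsspoken(fragment, "MIA"))                         # headless lines discarded
print(wordsspoken(fragment, "MIA", default_speaker="MIA"))  # headless lines credited to MIA
print(wordsspoken(fragment, "SEBASTIAN"))
```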

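【Solution 3】:

If you want to compute your counts in a single pass over the script (which I imagine could be quite long), you can just keep track of which character is speaking; set things up like a little state machine: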

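import re
from collections import Counter, defaultdict

# maps each speaker to a Counter of the words they say
words_spoken = defaultdict(Counter)
currently_speaking = 'Narrator'  # credited with anything before the first speaker line

for line in SCRIPT.split('\n'):
    # strip the (CONT'D) marker so a continued speech maps to the same name
    name = line.replace("(CONT'D)", '').strip()
    if re.match(r'^[A-Z]+$', name):  # a line holding only an upper-case name
        currently_speaking = name
    else:
        # every other line (dialogue, action, parentheticals) counts for the current speaker
        words_spoken[currently_speaking].update(line.split())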

You could use a more sophisticated regex to detect when the speaker changes, but this should do the trick.


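【Comments】:

In screenwriting, things other than (CONT'D) can appear in the parentheses after a character name in dialogue. I would change the line.replace statement to reflect that.
@Will The only problem seems to be that the dict shows each word only once, but in the end I need to count them.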
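
One possible way to handle any parenthetical after the name, not just (CONT'D), is to match the heading with a single regex instead of line.replace. This is a sketch under assumptions: `HEADING` and `count_words` are hypothetical names, names are taken to be upper-case words separated by single spaces (so something like MR. SMITH would not match), and the (O.S.) tag in the sample is made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed heading shape: upper-case words, optionally followed by one
# parenthetical such as (CONT'D), (V.O.) or (O.S.).
HEADING = re.compile(r"^([A-Z]+(?: [A-Z]+)*)\s*(?:\([^)]*\))?$")

def count_words(script, default_speaker='Narrator'):
    """Return a dict mapping each speaker to a Counter of their words."""
    words_spoken = defaultdict(Counter)
    currently_speaking = default_speaker
    for line in script.split('\n'):
        match = HEADING.match(line.strip())
        if match:
            currently_speaking = match.group(1)  # name without the parenthetical
        else:
            words_spoken[currently_speaking].update(line.split())
    return words_spoken

NAME = " " * 19     # indentation of speaker headings in the excerpt
DIALOGUE = " " * 6  # indentation of dialogue lines in the excerpt

sample = "\n".join([
    NAME + "MIA (CONT'D)",
    DIALOGUE + "I love this stuff. Makes coming to work",
    DIALOGUE + "easier.",
    "",
    NAME + "SEBASTIAN (O.S.)",  # hypothetical tag, not in the original script
    DIALOGUE + "I know what you mean. I get breakfast",
    "",
    NAME + "MIA",
    DIALOGUE + "Oh yeah?",
])

counts = count_words(sample)
print(sorted(counts))
```

As with the original loop, a one-word dialogue line in capitals would be mistaken for a heading, so checking the indentation as well may be safer.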
@duhaime pinged
Just replace set with Counter in the words_spoken initialization (I've just adjusted the post to reflect this).
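
To get from a per-word Counter to a single number, summing its values should be enough. A small sketch using the three lines of Sebastian's first speech from the excerpt; the variable names are illustrative only.

```python
from collections import Counter

sebastian = Counter()
for line in [
    "I know what you mean. I get breakfast",
    "five miles out of the way just to sit",
    "outside a jazz club.",
]:
    sebastian.update(line.split())

total_words = sum(sebastian.values())  # every occurrence counts
distinct_words = len(sebastian)        # each different word counts once
print(total_words, distinct_words, sebastian.most_common(1))
```

With a set only the distinct count is available, which is why the total could not be recovered before the switch to Counter.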

【Solution 4】:

There are some good ideas above. The following should work fine in both Python 2.x and 3.x:

import codecs
from collections import defaultdict

speaker_words = defaultdict(str)

with codecs.open('script.txt', 'r', 'utf8') as f:
  speaker = ''
  for line in f.read().split('\n'):
    # skip empty lines
    if not line.split():
      continue

    # speakers have their names in all uppercase
    first_word = line.split()[0]
    if (len(first_word) > 1) and all([char.isupper() for char in first_word]):
      # remove the (CONT'D) from a speaker string
      speaker = line.split('(')[0].strip()

    # check if this is a dialogue line
    elif len(line) - len(line.lstrip()) == 6:
      speaker_words[speaker] += line.strip() + ' '

# get a Python-version-agnostic input
try:
  prompt = raw_input  # Python 2
except NameError:     # raw_input does not exist in Python 3
  prompt = input

speaker = prompt('Enter name: ').strip().upper()
print(speaker_words[speaker])

Sample output:

Enter name: sebastian
I know what you mean. I get breakfast five miles out of the way just to sit outside a jazz club. It was called Van Beek. The swing bands played there. Count Basie. Chick Webb. It's a samba-tapas place now. Samba-tapas. It's... Exactly. The joke's on history.
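
Since the goal in the question was a word count, the collected string can simply be split on whitespace. A small sketch, assuming the sample output above is the value stored in speaker_words['SEBASTIAN']; `count_spoken_words` is a hypothetical helper, not part of the answer.

```python
def count_spoken_words(text):
    """Count whitespace-separated chunks; punctuation stays attached to words."""
    return len(text.split())

# The sample output above, assumed to be speaker_words['SEBASTIAN'].
sebastian_text = (
    "I know what you mean. I get breakfast five miles out of the way "
    "just to sit outside a jazz club. It was called Van Beek. The swing "
    "bands played there. Count Basie. Chick Webb. It's a samba-tapas "
    "place now. Samba-tapas. It's... Exactly. The joke's on history."
)

print(count_spoken_words(sebastian_text))
```

Hyphenated words such as samba-tapas count as one word under this definition.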

【Comments】: