【Question title】: Counting the words a character said in a movie script
【Posted】: 2018-04-17 14:55:56
【Question】:

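With some help I have already managed to extract the spoken words. Now I want to get the text spoken by a chosen person, so that I can type in MIA and get every single word she says in the movie, like this: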

name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)

That way I can count the words later.

This is what the movie script looks like:

An awkward beat. They pass a wooden SALOON -- where a WESTERN
 is being shot. Extras in COWBOY costumes drink coffee on the
 steps.
                     Revision                        25.


                   MIA (CONT'D)
      I love this stuff. Makes coming to work
      easier.

                   SEBASTIAN
      I know what you mean. I get breakfast
      five miles out of the way just to sit
      outside a jazz club.

                   MIA
      Oh yeah?

                   SEBASTIAN
      It was called Van Beek. The swing bands
      played there. Count Basie. Chick Webb.
             (then,)
      It's a samba-tapas place now.

                   MIA
      A what?

                   SEBASTIAN
      Samba-tapas. It's... Exactly. The joke's on
      history.

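【Comments】:

    Tags: python python-3.x text count movie


    【Answer 1】:

    I would first ask the user for all the names in the script, then ask which name they want the words for. I would search the text word by word until I found the wanted name, then copy the following words into a variable until I reached a name that matches someone else in the script. Characters can of course say another character's name in dialogue, but if you assume that speaker headings are always in all caps or on a line of their own, the text should be easy to filter.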

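    recording = False  # True while the chosen speaker is talking
    spoken_text = ""
    
    for word in script.split():  # iterate over words, not single characters
        if word == speaker and word.isupper(): # you may want to check that this is on its own line as well.
            recording = True
            continue  # do not record the speaker heading itself
        elif word in character_names and word.isupper():  # you may want to check that this is on its own line as well.
            recording = False
    
        if recording:
            spoken_text += word + " "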
    

    【Comments】:

    • This is a rough algorithm and will probably need refining to deal with unwanted items such as (CONT'D).
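
One way to apply the refinements mentioned above is to work line by line instead of word by word, so that a name only counts as a heading when it stands alone on its line. The following is a minimal sketch, not the answerer's code: the helper `words_spoken_by` and the pattern `PAREN` are hypothetical names, and it assumes any trailing parenthetical such as (CONT'D) can simply be dropped from a heading. Like the rough algorithm, it keeps recording until the next character heading, so action lines in between are still picked up.

```python
import re

# Hypothetical pattern: a trailing parenthetical such as (CONT'D) or (O.S.)
PAREN = re.compile(r"\s*\([^)]*\)\s*$")

def words_spoken_by(script, speaker, character_names):
    """Collect the words on the lines that follow `speaker`'s heading."""
    recording = False
    spoken = []
    for line in script.splitlines():
        stripped = line.strip()
        heading = PAREN.sub("", stripped)
        if heading in character_names:
            # A known name alone on its line switches the current speaker.
            recording = (heading == speaker)
            continue
        # Skip blank lines and parenthetical directions such as "(then,)".
        if recording and stripped and not stripped.startswith("("):
            spoken.extend(stripped.split())
    return spoken
```

`len(words_spoken_by(script, "MIA", {"MIA", "SEBASTIAN"}))` would then give a word count for MIA, under the same assumptions.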
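    【Answer 2】:

    I'll outline how to build a dict that gives you the number of words spoken by every speaker, and then a version that approximates your existing implementation.

    General use

    If we define a word as any chunk of characters in a string, split along ' ' (space)...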

    import re
    
    speaker = ''     # current speaker
    words = 0        # number of words on the current line
    word_count = {}  # dict of speakers and the number of words they speak
    
    for line in script.split('\n'):
        if re.match(r'^[ ]{19}[^ ]{1,}.*', line):  # name of speaker
            speaker = line.split(' (')[0][19:]
        if re.match(r'^[ ]{6}[^ ]{1,}.*', line):  # dialogue line
            words = len(line.split())
            if speaker in word_count:
                word_count[speaker] += words
            else:
                word_count[speaker] = words
    

    This produces a dict of the form {'JOHN DOE': 55} if John Doe speaks 55 words.

    Sample output:

    >>> word_count['MIA']
    
    13
    

    Your implementation

    Here is a version of the procedure above that approximates your implementation.

    import re
    
    def wordsspoken(script,name):
        word_count = 0
        for line in script.split('\n'):
            if re.match('^[ ]{19}[^ ]{1,}.*', line): # name of speaker
                speaker = line.split(' (')[0][19:]
            if re.match('^[ ]{6}[^ ]{1,}.*', line): # dialogue line
                if speaker == name:
                    word_count += len(line.split())
        print(word_count)
    
    def main():
        name = input("Enter name:")
        wordsspoken(script, name)
        name1 = input("Enter another name:")
        wordsspoken(script, name1)
    

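    【Comments】:

    • I'm getting this error: Traceback (most recent call last): File "/Users/*path*.py", line 19, in wordsspoken(script, name) File "/Users/*path*.py", line 13, if speaker == name: UnboundLocalError: local variable 'speaker' referenced before assignment. Do you know what to change?
    • That happens if you give wordsspoken a script in which a line of dialogue is read before any speaker has been introduced, for example if you pass everything after MIA (CONT'D) instead of the whole script. This code does not account for dialogue without a speaker, but you can handle it either by assigning a generic name or by discarding dialogue lines that have no speaker.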
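
The second option in the reply above (discarding dialogue that has no speaker yet) could be sketched as follows. This is an illustration rather than the answerer's code: `count_words_spoken` is a hypothetical name, it returns the count instead of printing it, and it assumes the same column layout as the answer (headings indented 19 spaces, dialogue indented 6).

```python
import re

def count_words_spoken(script, name):
    """Return the number of dialogue words attributed to `name`."""
    speaker = None  # no heading seen yet, so early dialogue matches nobody
    word_count = 0
    for line in script.split('\n'):
        if re.match(r'^ {19}\S', line):    # speaker heading (assumed column 19)
            speaker = line.split(' (')[0][19:].strip()
        elif re.match(r'^ {6}\S', line):   # dialogue line (assumed column 6)
            if speaker == name:            # lines before any heading are discarded
                word_count += len(line.split())
    return word_count
```

Initialising `speaker` before the loop is what removes the UnboundLocalError; using `None` rather than a generic name means orphaned dialogue is never attributed to anyone.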
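    【Answer 3】:

    If you want to compute your counts in a single pass over the script (which I imagine could be quite long), you can just keep track of which character is currently speaking, setting things up like a little state machine: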

    import re
    from collections import Counter, defaultdict
    
    words_spoken = defaultdict(Counter)
    currently_speaking = 'Narrator'
    
    for line in SCRIPT.split('\n'):
        name = line.replace('(CONT\'D)', '').strip()
        if re.match('^[A-Z]+$', name):
            currently_speaking = name
        else:
            words_spoken[currently_speaking].update(line.split())
    

    You could use a more sophisticated regex to detect when the speaker changes, but this should do the trick.


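    【Comments】:

    • In screenwriting, things other than (CONT'D) can appear in parentheses after the character name above a line of dialogue. I would change the line.replace statement to reflect that.
    • @Will The only problem seems to be that the dict shows each word once. But in the end I need to count them.
    • @duhaime pinged
    • Just replace set with Counter in the words_spoken initialization (I just tweaked the post to reflect this).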
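
To illustrate the two points raised in these comments, here is a hedged sketch that accepts any single parenthetical after the name and then totals the Counter for one speaker. The names `HEADING` and `count_by_speaker` are hypothetical, and the heading pattern is an assumption about how character names are written (capital letters, spaces, dots, apostrophes and hyphens only). A one-word all-caps dialogue line would still be mistaken for a heading, and directions such as "(then,)" are still counted as words, just as in the code above.

```python
import re
from collections import Counter, defaultdict

# Assumed heading shape: an upper-case name, optionally followed by one
# parenthetical such as (CONT'D), (O.S.) or (V.O.).
HEADING = re.compile(r"^([A-Z][A-Z .'-]*?)\s*(?:\([^)]*\))?$")

def count_by_speaker(script):
    """Map each speaker to a Counter of the words on their lines."""
    words_spoken = defaultdict(Counter)
    currently_speaking = 'Narrator'
    for line in script.split('\n'):
        match = HEADING.match(line.strip())
        if match:
            currently_speaking = match.group(1)
        else:
            words_spoken[currently_speaking].update(line.split())
    return words_spoken
```

With a Counter per speaker, `sum(count_by_speaker(script)['MIA'].values())` gives the total number of words, while `count_by_speaker(script)['MIA'].most_common(5)` lists the most frequent ones.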
    【Answer 4】:

    There are some good ideas above. The following should work fine in both Python 2.x and 3.x:

    import codecs
    from collections import defaultdict
    
    speaker_words = defaultdict(str)
    
    with codecs.open('script.txt', 'r', 'utf8') as f:
      speaker = ''
      for line in f.read().split('\n'):
        # skip empty lines
        if not line.split():
          continue
    
        # speakers have their names in all uppercase
        first_word = line.split()[0]
        if (len(first_word) > 1) and all([char.isupper() for char in first_word]):
          # remove the (CONT'D) from a speaker string
          speaker = line.split('(')[0].strip()
    
        # check if this is a dialogue line
        elif len(line) - len(line.lstrip()) == 6:
          speaker_words[speaker] += line.strip() + ' '
    
    # get a Python-version-agnostic input
    try:
      prompt = raw_input
    except NameError:  # raw_input does not exist in Python 3
      prompt = input
    
    speaker = prompt('Enter name: ').strip().upper()
    print(speaker_words[speaker])
    

    Sample output:

    Enter name: sebastian
    I know what you mean. I get breakfast five miles out of the way just to sit outside a jazz club. It was called Van Beek. The swing bands played there. Count Basie. Chick Webb. It's a samba-tapas place now. Samba-tapas. It's... Exactly. The joke's on history.
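
Since the question ultimately asks for a word count, the collected text can be counted by splitting on whitespace. A minimal sketch, assuming a "word" is any whitespace-separated token (so punctuation stays attached); `count_words` is a hypothetical helper, and the sample string is copied from the beginning of the output above.

```python
def count_words(text):
    """Count whitespace-separated tokens in a block of dialogue."""
    return len(text.split())

# First two sentences of the sample output above.
sebastian = ("I know what you mean. I get breakfast five miles out of the way "
             "just to sit outside a jazz club.")
print(count_words(sebastian))
```

In the answer's own code the equivalent call would be `count_words(speaker_words[speaker])`.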
    

    【Comments】:
