Uncover a dialogue of a movie script to count the words spoken by characters

【Question title】: Uncover a dialogue of a movie script to count the words spoken by characters
【Posted】: 2018-09-26 08:40:34
【Question description】:

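I'm working on a project about the significance of women in film. To that end, I'm analyzing movie scripts to get the ratio of words spoken by the main male character to those spoken by the main female character.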

I'm struggling to filter out the NAMES and the stage directions.

I thought about using regular expressions, but I don't get along with them.

For example:

Mia works, photos of Hollywood icons on the wall behind her, as --

                        CUSTOMER #1
           This doesn't taste like almond milk.

                        MIA
           Don't worry, it is. I know sometimes it --

                        CUSTOMER #1
           Can I see the carton?

 Mia hands it over. The Customer looks.

                        CUSTOMER #1 (CONT'D)
           I'll have a black coffee.

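I don't know how to handle the blank lines that follow the spoken text. Any ideas on how to reduce a full movie script to a dialogue-only script, in which I can count words and work with the data?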

from nltk.tokenize import word_tokenize

f = open("/...//La_la_land_script.txt", "r")
script = f.read()

Here I load the movie script into Python.

def deletebraces (str):
    klammerauf = str.find('(')
    klammerzu = str.find(')')

    while (klammerauf != -1 and klammerzu != -1):

            if (klammerauf<klammerzu):
                str = str[:klammerauf] + str[klammerzu+1:]

            klammerauf = str.find('(')
            klammerzu = str.find(')')
    return str

This function removes all parentheses, along with the text inside them.

def removing(list):
    for i in list:
        if i == '?':
            list.remove('?')
        if i == '!':
            list.remove('!')
        if i == '.':
            list.remove('.')
        if i == ',':
            list.remove(',')
        if i == '...':
            list.remove('...')
    return list

This function removes all other symbols.

def countingwords(list):
    woerter = 0
    for i in list:
        woerter = woerter + 1
    return woerter;

This function counts the words.

script = deletebraces(script)

def wordsspoken(script, name):

    a = 0
    e = 0
    all = -len(name)-1

    if script.find(name)==-1:
        print("This character does not speak")

This checks whether a character with that name exists.

    else:
        while(a != -1 and e != -1):

            a = script.find(name+'\n            ') + len(name)
            print(a)
            temp = script[a:]
            t = temp.split("\n")

            text = t[1]

            print(text)
            textlist = word_tokenize(text)

            removing(textlist)                

            more = countingwords(textlist)

            all = all + more

            script = script[a+e:]
            a = script.find(name +'\n           ')
            temp = script[a:]
            e = temp.find(' \n')

This is where I try to find the character's lines, but it doesn't work at all.

    print(name + " sagt " + str(all) + " Wörter.")

f.close()


name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)

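【Question discussion】:

Do you want to remove the speaker identifiers and end up with a text file that contains only the spoken lines?
Can you share your code with us? What have you tried so far, and what exact problem are you facing? Since you're new here, could you take a look at how to ask a question: stackoverflow.com/help/how-to-ask? You could also take the tour to learn what this site is about (hint: we won't write code for you, and this is not a forum, it's a Q&A site).
@rsm Sure, I'll take a look! Adding the code.
Great :) Now, could you improve your question further and format your code so it is more readable (i.e. by removing unnecessary blank lines and making sure the indentation is OK? Look at where script = deletebraces(script) starts).
Also, can you describe your code? What does it do, and why? There are functions that remove some characters, count words, and so on. How do they relate to your problem of reading the spoken text? And once again: what exact problem are you facing?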

Tags: python text nltk movie

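【Solution 1】:

As @AdrianMcCarthy points out, the whitespace in the file contains all the information you need to parse the spoken lines. Here is one way to approach the task in Python: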

# script.txt contains the sample text you posted
with open('script.txt', 'r', encoding='utf8') as f:
  # read the file content
  text = f.read()

# store all the clean text that's accumulated
spoken_text = ''

# split the file into a list of strings, with each line a member in the list
for line in text.split('\n'):

  # split the line into a list of words in the line
  words = line.split()

  # if there are no words, do nothing
  if not words:
    continue

  # if this line is a person identifier (first word all uppercase), do nothing
  # note: this also skips dialogue lines whose first word is all caps, e.g. "OK"
  if len(words[0]) > 1 and all(i.isupper() for i in words[0]):
    continue

  # if there's a good amount of whitespace to the left, this is a spoken line
  if len(line) - len(line.lstrip()) > 4:
    spoken_text += line.strip() + ' '

print(spoken_text)
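
To get from the extracted text to the word count the question is after, one option is a small regex-based counter instead of NLTK. This is only a sketch: the helper name `count_words` and the token pattern are assumptions rather than part of the answer, and the pattern deliberately treats an apostrophe as part of a word so that "doesn't" counts once.

```python
import re

# Hypothetical helper (not from the answer): counts runs of letters, digits
# and apostrophes, so punctuation such as "--", "?" or "." is ignored.
def count_words(text):
    return len(re.findall(r"[A-Za-z0-9']+", text))

# Two dialogue lines taken from the sample in the question
spoken_text = "This doesn't taste like almond milk. Don't worry, it is. I know sometimes it -- "
print(count_words(spoken_text))
```

Counting matches directly avoids having to strip punctuation tokens from a list afterwards.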

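【Discussion】:

This helped me a lot! Now, for example, I'm looking for a way to get only the text spoken by MIA. Where should I ask about that?
@tgtrmr I'd open a new question for that task. If you do, feel free to ping me in a comment and I'll respond!
How do I ping you in another question? The same way, with @duhaime?
@tgtrmr Exactly! You'll probably need to do it in a comment. I'm not sure whether a reference that uses the @username syntax in the question body pings the user...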

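【Solution 2】:

In screenplay format, the indentation carries most of the information. Tokenizing will likely collapse the whitespace, which loses the information you need.

I would use the indentation to separate stage directions from dialogue before tokenizing. If a line starts in the first column, it's a stage direction. If it starts near the middle of the line and is in all caps, it's the name of the character who is about to speak. The lines after the character name that are indented (but not as far as the character name) are dialogue.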
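
The rules above could be sketched roughly as follows. The thresholds `NAME_INDENT` and `DIALOGUE_INDENT` and the function name `words_per_character` are assumptions picked to fit the sample in the question (names indented about 24 spaces, dialogue about 11); a real script may need different values. Parentheticals inside dialogue are not handled here.

```python
import re
from collections import defaultdict

# Assumed thresholds, chosen to fit the sample in the question
NAME_INDENT = 20
DIALOGUE_INDENT = 5

def words_per_character(script_text):
    counts = defaultdict(int)
    speaker = None
    for line in script_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no information here
        indent = len(line) - len(line.lstrip(" "))
        if indent >= NAME_INDENT and stripped == stripped.upper():
            # Character name; drop a trailing parenthetical such as (CONT'D)
            speaker = re.sub(r"\s*\([^)]*\)\s*$", "", stripped)
        elif indent >= DIALOGUE_INDENT and speaker is not None:
            # Dialogue line: count runs of letters, digits and apostrophes
            counts[speaker] += len(re.findall(r"[A-Za-z0-9']+", stripped))
        else:
            speaker = None  # a stage direction ends the current speech
    return dict(counts)

# The sample from the question, rebuilt with explicit indentation
sample = "\n".join([
    "Mia works, photos of Hollywood icons on the wall behind her, as --",
    "",
    " " * 24 + "CUSTOMER #1",
    " " * 11 + "This doesn't taste like almond milk.",
    "",
    " " * 24 + "MIA",
    " " * 11 + "Don't worry, it is. I know sometimes it --",
    "",
    " " * 24 + "CUSTOMER #1",
    " " * 11 + "Can I see the carton?",
    "",
    " Mia hands it over. The Customer looks.",
    "",
    " " * 24 + "CUSTOMER #1 (CONT'D)",
    " " * 11 + "I'll have a black coffee.",
])
print(words_per_character(sample))  # {'CUSTOMER #1': 16, 'MIA': 8}
```

Because the name check depends on indentation as well as capitalization, a dialogue line that happens to be all caps is still counted as speech.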

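Dialogue sometimes has minor stage directions embedded in it (for example, to indicate whether the character is whispering or shouting). These are usually indented more deeply than the dialogue itself and enclosed in parentheses.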
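
A minimal sketch of filtering out such embedded directions, assuming each one sits on its own line and is fully wrapped in parentheses. The helper names and the "(whispering)" line are made up for illustration; the check relies on the parentheses alone, and the deeper indentation could be added as a second condition.

```python
def is_parenthetical(line):
    # True for lines such as "(whispering)" regardless of their indentation
    stripped = line.strip()
    return stripped.startswith("(") and stripped.endswith(")")

def dialogue_only(lines):
    # Keep non-empty lines that are not stand-alone parentheticals
    return [line.strip() for line in lines
            if line.strip() and not is_parenthetical(line)]

# Hypothetical speech; "(whispering)" is invented for this example
speech = [
    " " * 11 + "Don't worry, it is.",
    " " * 17 + "(whispering)",
    " " * 11 + "I know sometimes it --",
]
print(dialogue_only(speech))
```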

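Post-production scripts (e.g. those for editors, sound editors, publication, etc.) are usually very strict about these rules, so the indentation will be very reliable. Early drafts and spec scripts contain mistakes more often, but Hollywood is fairly standardized on one screenwriting software package, so I'd expect anything modern to still be highly reliable. Note that half-hour sitcom scripts are usually double-spaced but otherwise follow the same formatting rules as TV dramas and films.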
