Trying to use output of one function to influence the next function to count words in text file

【Question title】: Trying to use output of one function to influence the next function to count words in text file
【Posted】: 2015-06-05 08:14:28
【Question】:

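I am trying to use a function to count the number of words in a text file, after the text has been "cleaned up" so that it contains only letters and single spaces. So I have a first function that cleans up the text, and then a second function that should return the length of the first function's result (the cleaned text). Here are the two functions.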

def cleanUpWords(file):
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return onlyAlpha

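So words is the text of the file without double spaces, hyphens, or line feeds. Then I take out all the numbers and return the cleaned-up text as onlyAlpha. Now, if I write return len(onlyAlpha.split()) instead of return onlyAlpha, it gives me the correct number of words in the file (I know because I have the answer). But if I keep return onlyAlpha and try to split the work into two functions, the word count comes out wrong. Here is what I mean (this is my word counting function):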

def numWords(newWords):
    '''Function finds the amount of words in the text file by returning
the length of the cleaned up version of words from cleanUpWords().'''
    return len(newWords.split())

I define newWords in main(), where newWords = cleanUpWords(harper). harper is a variable that holds the result of another function that reads the file (that part is beside the point).

def main():
    harper = readFile("Harper's Speech.txt")    #readFile function reads
    newWords = cleanUpWords(harper)
    print(numWords(harper), "Words.")

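Given all this, please tell me why it gives a different answer when I split it into two functions.

For reference, here is the version that counts the words correctly but does not separate the cleanup from the counting. numWords now both cleans and counts, which is not what I want.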

def numWords(file):
    '''Function finds the amount of words in the text file by returning
the length of the cleaned up version of words from cleanUpWords().'''
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return len(onlyAlpha.split())

def main():
    harper = readFile("Harper's Speech.txt")
    print(numWords(harper), "Words.")

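I hope I have given enough information.

【Comments】:

As I expected, a quick test gives the same result from both formulations. Can you cut out the file handling part and provide a sample input that fails?

Tags: python file python-3.x io count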

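【Solution 1】:

The problem is quite simple: you split it into two functions, but you completely ignore the result of the first function and instead count the words in the text as it was before the cleanup!

Change your main function to this and it should work.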

def main():
    harper = readFile("Harper's Speech.txt")
    newWords = cleanUpWords(harper)
    print(numWords(newWords), "Words.") # use newWords here!

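Also, your cleanUpWords function could be improved a bit. It can still leave double or triple spaces in the text, and it could be made shorter. Alternatively, you could use regular expressions: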

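import re
def cleanUpWords(string):
    # Replace every non-letter with a space, then collapse runs of whitespace.
    # Raw strings keep "\s" from being read as a string escape sequence.
    only_alpha = re.sub(r"[^a-zA-Z]", " ", string)
    single_spaces = re.sub(r"\s+", " ", only_alpha)
    return single_spaces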

Or you could first filter out all the illegal characters, then split the text into words and join them back together with a single space.

def cleanUpWords(string):
    only_alpha = ''.join(c for c in string if c.isalpha() or c == ' ')
    single_spaces = ' '.join(only_alpha.split())
    return single_spaces

Example, using an input for which your original function would leave some double spaces:

>>> s = "text with    triple spaces and other \n sorts \t of strange ,.-#+ stuff and 123 numbers"
>>> cleanUpWords(s)
'text with triple spaces and other sorts of strange stuff and numbers'

(Of course, if you intend to split the text into words anyway, double spaces are not a problem.)
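To make the differences between the three cleanup approaches concrete, here is a small side-by-side sketch. The function names clean_original, clean_regex and clean_filter and the sample string are hypothetical, chosen only for this illustration; the function bodies follow the code shown in this thread.

```python
import re

def clean_original(text):
    # The question's approach: a single pass of replacements, then a filter.
    words = text.replace("-", " ").replace("  ", " ").replace("\n", " ")
    return "".join(c for c in words if c.isalpha() or c == " ")

def clean_regex(text):
    # Regex approach: non-letters become spaces, whitespace runs are collapsed.
    only_alpha = re.sub(r"[^a-zA-Z]", " ", text)
    return re.sub(r"\s+", " ", only_alpha)

def clean_filter(text):
    # Filter approach: drop non-letters (except spaces), then split and rejoin.
    only_alpha = "".join(c for c in text if c.isalpha() or c == " ")
    return " ".join(only_alpha.split())

# Hypothetical sample input, not taken from the asker's file.
sample = "Harper's  speech - 2015\n\tfinal   draft"

for clean in (clean_original, clean_regex, clean_filter):
    cleaned = clean(sample)
    print(clean.__name__, repr(cleaned), len(cleaned.split()))
```

On this sample the original function still leaves runs of spaces, and the regex variant counts one word more than the other two, because the apostrophe in "Harper's" is turned into a space rather than dropped. Note also that the filter variant deletes newlines and tabs instead of replacing them, so two words separated only by a newline would be fused into one. Which behaviour is right depends on how the expected word count was defined.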

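【Discussion】:

Thank you so much, this is really helpful. But why did the change you made to main() work, and what was I doing wrong?
@DamianConnors Well, you did print(numWords(harper), "Words."), i.e. you counted the words in harper, which is the text of the file before any cleanup was done. Note that the cleanUpWords function does not modify the harper string (that would not even be possible, since strings are immutable) but returns a different string, newWords. So, to count the words in the cleaned string, you have to use numWords(newWords) instead of numWords(harper).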
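The point about immutability can be checked with a short sketch. It assumes a small made-up string in place of the file contents (the real "Harper's Speech.txt" is not available here), and cleanUpWords is a condensed version of the function from the question.

```python
def cleanUpWords(text):
    # Same logic as the question's function: keep only letters and spaces.
    words = text.replace("-", " ").replace("  ", " ").replace("\n", " ")
    return "".join(c for c in words if c.isalpha() or c == " ")

def numWords(text):
    return len(text.split())

# Hypothetical sample standing in for the text returned by readFile().
harper = "Budget 2015 - a plan\nfor jobs"
newWords = cleanUpWords(harper)

# The call returned a new string; harper itself is untouched.
print(harper == "Budget 2015 - a plan\nfor jobs")  # True
print(numWords(harper))    # 7: "2015" and "-" are counted as words
print(numWords(newWords))  # 5: only the alphabetic words remain
```

Passing harper to numWords counts the raw text, so the number and the lone hyphen inflate the result; passing newWords counts the cleaned text.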