How to print the frequency of each unique word from a string with a for loop in Python

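【Question title】: How to print frequency of each unique word from a string with for loop in python
【Posted】: 2019-03-14 23:27:18
【Question】:

The paragraph contains spaces and random punctuation, which I removed with .replace in a for loop. Then I turned the paragraph into a list with .split() to get ['the', 'title', 'etc']. I then wrote two functions: one that counts how often a given word occurs and, because I do not want every word counted repeatedly, another that builds a list of unique words. However, I need to write a for loop that prints each word and how many times it appears, with output like this: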

The word The appears 2 times in the paragraph.
The word titled appears 1 times in the paragraph.
The word track appears 1 times in the paragraph.

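I am also having trouble understanding the essence of for loops. I read that we should only use for loops for counting and while loops for anything else, but a while loop can also be used for counting.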

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  """


for r in ((",", ""), ("!", ""), (".", ""), ("  ", "")):
    paragraph = paragraph.replace(*r)

paragraph_list = paragraph.split()


def count_words(word, word_list):

    word_count = 0
    for i in range(len(word_list)):
        if word_list[i] == word:
            word_count += 1
    return word_count

def unique(word):
    result = []
    for f in word:
        if f not in result:
            result.append(f)
    return result
unique_list = unique(paragraph_list)

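【Comments on the question】:

If you intend to split the data, I don't think you want to remove the spaces.
There are two spaces at the beginning and at the end, so ("  ", "") only removes the double spaces. Yes, sorry, I should have mentioned that.
You forgot to remove the quotes and the newlines.
set() creates a set and then discards it. You need to assign it to a variable to use it. You could also return the set itself from unique(). There is no need to convert it to a list, since you can iterate over a set and look up elements in it just as well.
def unique(word): result = [] for f in word: if f not in result: result.append(f) return result seems to work without set()

Tags: python string python-3.x loops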

【Solution 1】:

It is better to use re together with dict.get and a default value:

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  c c c c c c c ccc"""

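import re

word_count = {}
# Raw string so that \? and \. reach the regex engine unchanged.
for w in re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower()):
    word_count[w] = word_count.get(w, 0) + 1
word_count.pop('', None)  # drop the empty strings produced between adjacent separators

for k, v in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(k, v))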

Output:

The word the appears 4 time(s) in the paragraph
The word titled appears 1 time(s) in the paragraph
The word track appears 1 time(s) in the paragraph
...

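How to handle Chuu’s is debatable. I decided not to split on ’, but you can add it later if you like.

Update:

The following line splits paragraph.lower() with a regular expression. The advantage is that you can describe several separators at once: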

re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower())

About this line:

word_count[w] = word_count.get(w, 0) + 1

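word_count is a dictionary. The advantage of get is that you can supply a default value for the case where w is not yet in the dictionary. In effect, the line increments the count for the word w, starting from 0 the first time the word is seen.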
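
To make the role of the default value concrete, here is a minimal sketch that compares get with an explicit membership test. The list words is a made-up sample for illustration and does not come from the question.

```python
words = ["the", "track", "the", "video", "the"]  # hypothetical sample input

# Version 1: explicit membership test before incrementing.
counts_explicit = {}
for w in words:
    if w in counts_explicit:
        counts_explicit[w] += 1
    else:
        counts_explicit[w] = 1

# Version 2: dict.get returns 0 when w is not yet a key,
# so the first occurrence stores 0 + 1 = 1.
counts_get = {}
for w in words:
    counts_get[w] = counts_get.get(w, 0) + 1

print(counts_get)  # {'the': 3, 'track': 1, 'video': 1}
```

Both loops build the same dictionary; get simply folds the "first time seen" case into one expression.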

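【Discussion】:

I wonder why I got downvoted. Thanks for any feedback, I would be happy to improve the answer.
Thanks, this is very useful. I had not read about dictionaries in Python. I did not know you could do this so easily without functions, using just a for loop. But what is happening here? word_count[w] = word_count.get(w, 0) + 1 and re.split(' |,|“|”|!|\?|\.|\n', paragraph.lower())

【Solution 2】: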

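Note that your sample text is simple, but punctuation rules can be complex or not followed correctly. What about a text containing 2 adjacent spaces (yes, it is incorrect, but it happens often)? What if the author is more used to French and writes spaces before and after a colon or semicolon?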
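
As a small illustration of both cases, the sketch below compares split(' ') with the argument-less split(). The string text is a made-up sample, not taken from the question.

```python
text = "one  two ; three"  # hypothetical sample: double space, space before ';'

# split(' ') cuts at every single space, so adjacent spaces yield '' entries.
naive = text.split(' ')

# split() with no argument treats any run of whitespace as one separator.
robust = text.split()

print(naive)   # ['one', '', 'two', ';', 'three']
print(robust)  # ['one', 'two', ';', 'three']
```

Note that even with split(), the French-style spacing leaves ';' behind as a separate token, so punctuation still has to be removed before counting.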

I think the 's construct needs special handling. What about: """John has a bicycle. Mary says that her one is nicer than John's.""" IMHO the word John occurs twice here, whereas your algorithm will see 1 John and 1 Johns.
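
The difference can be checked with a short sketch. The helper count_words here is hypothetical and deliberately simple (it only drops '.' and apostrophes), and the regular expression is one possible way to strip a possessive 's, not the only one.

```python
import re

text = "John has a bicycle. Mary says that her one is nicer than John's."

def count_words(s):
    """Count words after dropping '.' and apostrophes (the naive approach)."""
    counts = {}
    for w in s.replace(".", "").replace("'", "").split():
        counts[w] = counts.get(w, 0) + 1
    return counts

naive = count_words(text)

# Strip a trailing possessive 's first, keeping the word it is attached to.
stripped = count_words(re.sub(r"(\w)'s\b", r"\1", text))

print(naive["John"], naive.get("Johns", 0))  # 1 1
print(stripped["John"])                      # 2
```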

In addition, since Unicode text is now common on web pages, you should be prepared to find high code point equivalents of spaces and punctuation:

“ U+201C LEFT DOUBLE QUOTATION MARK
” U+201D RIGHT DOUBLE QUOTATION MARK
’ U+2019 RIGHT SINGLE QUOTATION MARK
‘ U+2018 LEFT SINGLE QUOTATION MARK
  U+00A0 NO-BREAK SPACE
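
To find out which such characters a given text actually contains, something like the following sketch may help. The helper non_ascii_chars is a hypothetical name introduced here; unicodedata.name is from the standard library.

```python
import unicodedata

def non_ascii_chars(text):
    """Return the distinct non-ASCII characters of text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})

sample = "“Heart Attack” with Chuu’s own"  # fragment of the question's paragraph

for ch in non_ascii_chars(sample):
    # unicodedata.name gives the official Unicode character name.
    print("U+{:04X} {}".format(ord(ch), unicodedata.name(ch)))
```

For this fragment the loop reports U+2019, U+201C and U+201D, which are exactly the characters that a plain ASCII punctuation filter would miss.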

Also, according to this older question, the best way to remove punctuation is translate. The linked question uses Python 2 syntax, but in Python 3 you can do:

import re
import string

paragraph = paragraph.strip()                   # remove initial and terminal white spaces
paragraph = paragraph.translate(str.maketrans('“”’‘\xa0', '""\'\' '))  # fix high code punctuations
paragraph = re.sub(r"(\w)'s\b", r"\1", paragraph)  # remove 's but keep the word before it
paragraph = paragraph.translate(str.maketrans('', '', string.punctuation))  # remove punctuations
words = paragraph.split()
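
Putting the steps together with the for loop the question asks for might look like the sketch below. The function name to_words is hypothetical, and lowercasing the text is an assumption borrowed from Solution 1 so that "The" and "the" are counted together.

```python
import re
import string

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  """

def to_words(text):
    """Normalize punctuation as in the steps above and return lowercase words."""
    text = text.strip().lower()  # lowercasing is an assumption, see lead-in
    text = text.translate(str.maketrans('“”’‘\xa0', '""\'\' '))  # curly quotes to ASCII
    text = re.sub(r"(\w)'s\b", r"\1", text)  # drop possessive 's
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()

# Count each word with a plain for loop and a dictionary.
word_count = {}
for w in to_words(paragraph):
    word_count[w] = word_count.get(w, 0) + 1

# Print one line per unique word, in order of first appearance.
for word, n in word_count.items():
    print("The word {} appears {} time(s) in the paragraph.".format(word, n))
```

With this normalization, Chuu’s is counted as chuu and the quotation marks around Heart Attack no longer stick to the words.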

【Discussion】: