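【Question title】: How to print frequency of each unique word from a string with for loop in python
【Posted】: 2019-03-14 23:27:18
【Question】:

The paragraph contains spaces and random punctuation, which I removed with .replace inside a for loop. I then used .split() to turn the paragraph into a list such as ['the', 'title', 'etc']. After that I wrote two functions: count_words counts how often a word occurs, but I didn't want it to count every occurrence of every word, so I wrote another function that builds a unique list. However, I need to create a for loop that prints each word and how many times it appears, with output like this: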

The word The appears 2 times in the paragraph.
The word titled appears 1 times in the paragraph.
The word track appears 1 times in the paragraph.

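I am also struggling to understand the nature of for loops. I read that we should use for loops only for counting and while loops for everything else, but a while loop can also be used for counting.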

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  """


for r in ((",", ""), ("!", ""), (".", ""), ("  ", "")):
    paragraph = paragraph.replace(*r)

paragraph_list = paragraph.split()


def count_words(word, word_list):

    word_count = 0
    for i in range(len(word_list)):
        if word_list[i] == word:
            word_count += 1
    return word_count

def unique(word):
    result = []
    for f in word:
        if f not in result:
            result.append(f)
    return result
unique_list = unique(paragraph_list)

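【Comments】:

  • If you plan to split the data, I don't think you want to remove the spaces.
  • There are two spaces at the beginning and at the end, so ("  ", "") only removes the double spaces. Yes, sorry, I should have mentioned that.
  • You forgot to remove the quotation marks and the newlines.
  • set() creates a set and throws it away. You need to assign it to a variable to use it. You could also return the set itself from unique(). There is no need to convert it to a list, since you can iterate over a set and look up elements in it as well.
  • def unique(word): result = [] for f in word: if f not in result: result.append(f) return result seems to work without set()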
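As a rough sketch of the point raised in these comments: once the unique words are available (from a set, or from an order-preserving dict), a single for loop can print one line per word. The sample list `words` and the helper name `print_frequencies` are assumptions made for illustration and do not appear in the original post.

```python
def print_frequencies(word_list):
    """Hypothetical helper: print one line per unique word, return the lines."""
    lines = []
    # dict.fromkeys keeps the first-seen order of the words; a plain set()
    # would also give unique words, but without a guaranteed order.
    for word in dict.fromkeys(word_list):
        count = word_list.count(word)
        lines.append(
            "The word {} appears {} times in the paragraph.".format(word, count)
        )
    for line in lines:
        print(line)
    return lines


words = ["The", "titled", "track", "The"]  # assumed sample input
result = print_frequencies(words)
```

Calling list.count inside the loop rescans the list for every unique word, which is fine for a paragraph but slow for large texts; the dictionary approach in the first answer counts everything in one pass.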

Tags: python string python-3.x loops


【Solution 1】:

Better to use re together with get and a default value:

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  c c c c c c c ccc"""

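import re

word_count = {}
# Split on any of the listed delimiters; the raw string keeps \? and \. as regex escapes.
for w in re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower()):
    word_count[w] = word_count.get(w, 0) + 1  # 0 is the default for unseen words
word_count.pop('', None)  # drop the empty strings produced by adjacent delimiters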

for k, v in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(k, v))

Output:

The word the appears 4 time(s) in the paragraph
The word titled appears 1 time(s) in the paragraph
The word track appears 1 time(s) in the paragraph
...

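How to handle Chuu’s is open to discussion; I decided not to split it, but you can add that later if you like.

Update:

The following line splits paragraph.lower() using a regular expression. The advantage is that you can describe several delimiters at once: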

re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower())
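A small sketch of how that split behaves, using an assumed sample string rather than the paragraph from the question. Note that adjacent delimiters (for example a comma followed by a space) produce empty strings, which is why the answer removes the '' key afterwards.

```python
import re

sample = "Hello, world! Is this \u201cquoted\u201d?\nYes."  # assumed sample text

# Each alternative separated by | is one delimiter. ? and . are escaped
# because they are regex metacharacters.
tokens = re.split(r' |,|\u201c|\u201d|!|\?|\.|\n', sample.lower())

# Adjacent delimiters leave empty strings behind, so filter them out.
words = [t for t in tokens if t]
print(words)
```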

About this line:

word_count[w] = word_count.get(w, 0) + 1

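word_count is a dictionary. The advantage of get is that you can define a default value for the case where w is not yet in the dictionary. The line essentially updates the count for the word w.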
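To make the role of the default value concrete, here is a minimal sketch that compares the get version with an explicit membership test. The word list is an assumed sample.

```python
sample_words = ["the", "track", "the"]  # assumed sample input

word_count = {}
for w in sample_words:
    # get returns the stored count, or 0 when w has not been seen yet.
    word_count[w] = word_count.get(w, 0) + 1

# The same update written out without get:
explicit = {}
for w in sample_words:
    if w in explicit:
        explicit[w] += 1
    else:
        explicit[w] = 1

print(word_count)
```

Both loops build the same dictionary; get simply folds the if/else into one expression.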

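【Comments】:

  • I wonder why I was downvoted. Feedback is appreciated; I would be happy to improve the answer.
  • Thanks, this is really useful. I hadn't read about dictionaries in Python. I didn't know you could do this so easily without functions, using just a for loop. But what is happening here? word_count[w] = word_count.get(w, 0) + 1 and re.split(' |,|“|”|!|\?|\.|\n', paragraph.lower())
【Solution 2】: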

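Note that your sample text is simple, but punctuation rules can be complex or may not be followed correctly. What about text containing 2 adjacent spaces (yes, it is incorrect, but it happens often)? What if the writer is more used to French and puts a space before and after a colon or semicolon?

I think the 's construct needs special handling. What about: """John has a bicycle. Mary says that her one is nicer that John's.""" IMHO the word John appears twice here, whereas your algorithm will see 1 John and 1 Johns.

In addition, since Unicode text is now common on web pages, you should be prepared to find the high-code equivalents of spaces and punctuation: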

“ U+201C LEFT DOUBLE QUOTATION MARK
” U+201D RIGHT DOUBLE QUOTATION MARK
’ U+2019 RIGHT SINGLE QUOTATION MARK
‘ U+2018 LEFT SINGLE QUOTATION MARK
  U+00A0 NO-BREAK SPACE
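As a quick check of the characters listed above, the standard unicodedata module can report their official names, and string.punctuation shows why they need separate handling: it covers ASCII punctuation only. This is an illustrative sketch; the variable names are assumptions.

```python
import string
import unicodedata

special = "\u201c\u201d\u2019\u2018\xa0"  # the five characters listed above

for ch in special:
    print("U+{:04X} {}".format(ord(ch), unicodedata.name(ch)))

names = [unicodedata.name(ch) for ch in special]

# None of the curly quotes is part of string.punctuation, so stripping
# ASCII punctuation alone leaves them in the text.
missed = [ch for ch in special if ch not in string.punctuation]
```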

Also, according to this older question, the best way to remove punctuation is translate. The linked question uses Python 2 syntax, but in Python 3 you can do:

import re
import string

paragraph = paragraph.strip()                   # remove initial and terminal white spaces
paragraph = paragraph.translate(str.maketrans('“”’‘\xa0', '""\'\' '))  # fix high code punctuations
paragraph = re.sub(r"(\w)'s\b", r"\1", paragraph)  # remove 's but keep the word itself
paragraph = paragraph.translate(str.maketrans('', '', string.punctuation))  # remove punctuations
words = paragraph.split()
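The steps above can be combined with a dictionary count into one runnable sketch. The function name `count_words_normalized` is hypothetical, and the sketch assumes re.sub for dropping the possessive 's and str.maketrans('', '', string.punctuation) for deleting ASCII punctuation; the input is the John example from this answer, written with a curly apostrophe.

```python
import re
import string


def count_words_normalized(text):
    """Hypothetical helper: normalize text and return a word -> count dict."""
    text = text.strip()
    # Map curly quotes and the no-break space to their ASCII equivalents.
    text = text.translate(
        str.maketrans("\u201c\u201d\u2019\u2018\xa0", "\"\"'' ")
    )
    # Drop a possessive 's at the end of a word, keeping the word itself.
    text = re.sub(r"(\w)'s\b", r"\1", text)
    # Delete all remaining ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words_normalized(
    "John has a bicycle. Mary says that her one is nicer that John\u2019s."
)
for word, n in counts.items():
    print("The word {} appears {} time(s) in the paragraph".format(word, n))
```

With this normalization John is counted twice and no separate Johns entry appears. Contractions such as it's would also lose their 's, which may or may not be what you want.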

【Comments】:
