Count unique words in a txt file: answers

【Question title】: Count unique words in txt file
【Posted】: 2019-09-11 12:26:26
【Question description】:

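I have an input file containing one of Shakespeare's sonnets (sonnet.txt). I need to write short code that counts the number of unique words in the sonnet. My code must remove punctuation and ignore lowercase/uppercase.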

Contents of sonnet.txt

How heavy do I journey on the way,
When what I seek, my weary travel's end,
Doth teach that ease and that repose to say,
Thus far the miles are measured from thy friend!
The beast that bears me, tired with my woe,
Plods dully on, to bear that weight in me,
As if by some instinct the wretch did know
His rider loved not speed being made from thee.
The bloody spur cannot provoke him on,
That sometimes anger thrusts into his hide,
Which heavily he answers with a groan,
More sharp to me than spurring to his side;
For that same groan doth put this in my mind,
My grief lies onward, and my joy behind.

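I'm using the set() function and storing the result in the variable unique_words. The end goal is to get the size of that set with len(unique_words).

However, my code drops any word that is followed by punctuation (e.g. ',' ';' '!'). I tried using the filter function to remove non-alphabetic characters, but I still lose the words that are followed by punctuation.

Is there a different string method that can be combined with filter() to get the desired output?

Thanks in advance for your help.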

unique_words = set()

sonnet = open("sonnet.txt", "r")

for line in sonnet:
    line = [word.lower() for word in line.split()]
    line = [word for word in filter(str.isalpha, line)]
    unique_words.update(line)

sonnet.close()

print("{} unique words".format(len(unique_words)))

The result of the first comprehension is

['how', 'heavy', 'do', 'i', 'journey', 'on', 'the', 'way,']

But after the second comprehension, this is the output I get:

['how', 'heavy', 'do', 'i', 'journey', 'on', 'the']

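【Question discussion】:

Your code does exactly what it says on the tin: you're using filter, which... well, filters the result to exclude elements that aren't .isalpha. So it filters out everything, including whitespace, and the result would be a set of characters (which isn't actually what you describe, so I'm not sure how you got those results).
Try using the replace method on each line of text to replace apostrophes, full stops, etc. with nothing (i.e. the empty string ""). Then lowercase all the characters and put the words in a list.
Hi @jun, thanks for your suggestion! I used replace on all the characters I wanted to get rid of, and it worked :)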
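
The replace suggestion in the comments above could look roughly like the following. This is only a sketch: the helper name strip_chars and the particular set of unwanted characters are assumptions made for illustration, not something taken from the thread.

```python
def strip_chars(line, unwanted=",;!.'"):
    """Remove each unwanted character from the line using str.replace."""
    for ch in unwanted:
        line = line.replace(ch, "")
    return line

line = "When what I seek, my weary travel's end,"
words = strip_chars(line).lower().split()
print(words)
# ['when', 'what', 'i', 'seek', 'my', 'weary', 'travels', 'end']
```

Note that removing the apostrophe turns "travel's" into "travels", which may or may not be what the assignment expects.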

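Tags: python

【Solution 1】:

str.isalpha returns True only if all the characters in the string are alphabetic.

Input - 'Mike' Output - True
Input - 'charlie mike' Output - False
Input - 'charlie!,' Output - False

In your case, isalpha applied to 'way,' returns False, so filter drops the whole word. It is better to remove the punctuation first, using string.punctuation, and not use filter at all.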
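
To see the behavior on the first line of the sonnet, here is a small sketch (the variable names kept and dropped are only for illustration):

```python
# One line of the sonnet, lowercased and split on whitespace as in the question.
words = "How heavy do I journey on the way,".lower().split()

# str.isalpha is applied to each whole word, so "way," fails the test
# because of its trailing comma and is removed by filter.
kept = list(filter(str.isalpha, words))
dropped = [w for w in words if not w.isalpha()]

print(kept)     # ['how', 'heavy', 'do', 'i', 'journey', 'on', 'the']
print(dropped)  # ['way,']
```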

import string
unique_words = set()

sonnet = open("sonnet.txt", "r")

for line in sonnet:
    line ="".join([c for c in line if c not in string.punctuation])
    line = [word.lower() for word in line.split()]
    unique_words.update(line)

sonnet.close()

print("{} unique words".format(len(unique_words)))

If you need both 'My' and 'my' to count as separate unique words, don't use word.lower().

【Discussion】:

【Solution 2】:

I would rather do it like this:

import re
from collections import Counter

# Read the whole sonnet into one string
with open("sonnet.txt", "r") as f:
    text = f.read()

words = re.findall(r'\w+', text)
counter = Counter(words)
print(len(counter))   # prints 95

If I convert all the words to lowercase with:

words = [w.lower() for w in words]

before counting, the result is 90.

【Discussion】:

【Solution 3】:

Staying as close as possible to your example, but fixing the problem:
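
One thing worth knowing about the \w+ pattern is that it stops at apostrophes, so "travel's" is counted as two tokens. The sketch below shows this on one line of the sonnet; the second pattern is a hypothetical alternative of my own, not part of the answer, that keeps an internal apostrophe inside the word.

```python
import re

line = "When what I seek, my weary travel's end,".lower()

# \w+ matches runs of letters, digits and underscores, so the apostrophe
# splits "travel's" into "travel" and "s".
split_tokens = re.findall(r"\w+", line)
print(split_tokens)
# ['when', 'what', 'i', 'seek', 'my', 'weary', 'travel', 's', 'end']

# Assumed alternative: letters, optionally joined by internal apostrophes.
whole_tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", line)
print(whole_tokens)
# ['when', 'what', 'i', 'seek', 'my', 'weary', "travel's", 'end']
```

Which behavior is right depends on how the assignment defines a word.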

unique_words = set()

sonnet = open("sonnet.txt", "r")

for line in sonnet:
    words = ''.join(filter(lambda x: x.isalpha() or x.isspace(), line)).split()
    unique_words.update(words)

sonnet.close()

print("{} unique words".format(len(unique_words)))

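You don't only want to check .isalpha(); you also want to keep the whitespace, so the two tests are combined in a lambda function to use filter the way you intended. The resulting filter iterator is then turned back into a string by ''.join(generator), and the line is split (on the whitespace in it).

For clarity, the result is named words instead of overwriting the loop variable line, and the words are then added to the result set.

Output: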

94 unique words
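
The code in this answer does not lowercase the words, while the question asks to ignore case. A minimal sketch of the same per-character filter applied to a single line, with .lower() added as an assumption about what the asker wants:

```python
line = "Thus far the miles are measured from thy friend!\n"

# Keep letters and whitespace only; every other character is dropped.
cleaned = "".join(filter(lambda ch: ch.isalpha() or ch.isspace(), line))

# Lowercase before splitting so that "Thus" and "thus" count as one word.
words = cleaned.lower().split()
print(words)
# ['thus', 'far', 'the', 'miles', 'are', 'measured', 'from', 'thy', 'friend']
```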

【Discussion】:

Hi @Grismar, thanks for showing me the lambda function. I didn't know about it, but now it's in my toolbox :)

【Solution 4】:

import string

l = []
with open("sonnet.txt","r") as f:
    s = f.read().strip()
    l = l + s.translate(str.maketrans('', '', string.punctuation)).split()

print(len(set(l)))

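The punctuation removal is taken from this post. I have treated words that differ only in case as different words. If we want to ignore case instead, we can simply change this line:

s = f.read().strip() to s = f.read().strip().lower()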

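【Discussion】:

Hi Amit, if only my assignment allowed me to import string :(
Hi, string.punctuation is just the string !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
@bravocharliemike If you can't use string.punctuation, you can replace it with the corresponding string value '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
Or you can write your own custom string containing all the punctuation you want to ignore.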
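
Putting the last two comments together, a version without any import might look like the sketch below. The names PUNCTUATION and unique_words and the ignore_case parameter are assumptions made for this example; the character set can be trimmed or extended as needed.

```python
# Assumed custom set of characters to strip (same characters as
# string.punctuation, written out by hand so no import is needed).
PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

def unique_words(text, ignore_case=True):
    """Return the set of distinct words in text with punctuation removed."""
    if ignore_case:
        text = text.lower()
    # Third argument to str.maketrans lists characters to delete.
    table = str.maketrans("", "", PUNCTUATION)
    return set(text.translate(table).split())

sample = "My grief lies onward, and my joy behind."
print(len(unique_words(sample)))                     # 7
print(len(unique_words(sample, ignore_case=False)))  # 8
```

For the whole file, the same function could be called on the result of f.read() inside a with open("sonnet.txt") block.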