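【Posted】: 2019-09-11 12:26:26
【Question】:
I have an input file (sonnet.txt) containing one of Shakespeare's sonnets. I need to write short code that counts the number of unique words in the sonnet. My code must remove punctuation and ignore differences between lowercase and uppercase.
Contents of sonnet.txt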
How heavy do I journey on the way,
When what I seek, my weary travel's end,
Doth teach that ease and that repose to say,
Thus far the miles are measured from thy friend!
The beast that bears me, tired with my woe,
Plods dully on, to bear that weight in me,
As if by some instinct the wretch did know
His rider loved not speed being made from thee.
The bloody spur cannot provoke him on,
That sometimes anger thrusts into his hide,
Which heavily he answers with a groan,
More sharp to me than spurring to his side;
For that same groan doth put this in my mind,
My grief lies onward, and my joy behind.
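I am using the set() function and storing the result in the variable unique_words. The end goal is to compute the size of that set with len(unique_words).
However, my code drops words that are followed by punctuation (e.g. ',' ';' '!'). I tried using the filter function to remove non-alphabetic characters, but I still lose the words that are followed by punctuation.
Is there a different string method that can be combined with filter() to get the desired output?
Thanks in advance for your help.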
unique_words = set()
sonnet = open("sonnet.txt", "r")
for line in sonnet:
    line = [word.lower() for word in line.split()]
    line = [word for word in filter(str.isalpha, line)]
    unique_words.update(line)
sonnet.close()
print("{} unique words".format(len(unique_words)))
The result of the first comprehension is
['how', 'heavy', 'do', 'i', 'journey', 'on', 'the', 'way,']
But after the second comprehension, this is the output I get:
['how', 'heavy', 'do', 'i', 'journey', 'on', 'the']
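The difference between the two lists can be reproduced in isolation. The sketch below assumes only the word list shown above; the names kept, stripped and kept_after_strip are illustrative. str.isalpha is True only when every character in a string is a letter, so 'way,' is rejected as a whole. One possible string method to pair with filter() is str.strip with string.punctuation, which trims punctuation from both ends of each word before the test.

```python
import string

words = ['how', 'heavy', 'do', 'i', 'journey', 'on', 'the', 'way,']

# str.isalpha tests the whole string, so the trailing comma makes
# 'way,' fail and filter() discards the entire word.
kept = list(filter(str.isalpha, words))

# Trim punctuation from both ends of each word before filtering.
stripped = [w.strip(string.punctuation) for w in words]
kept_after_strip = list(filter(str.isalpha, stripped))

print(kept)
print(kept_after_strip)
```

Note that str.strip only touches the ends of a string, so a word with an interior apostrophe such as "travel's" would still fail str.isalpha and be dropped under this approach.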
【Comments】:
-
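Your code does exactly what it says on the tin: you are using filter, which... well, filters the results to exclude elements that are not .isalpha. So it filters out everything, including spaces - the result is a set of characters (which is not actually what you described; not sure how you got those results).
-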
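Try using the replace method on each line of text to replace apostrophes, periods, etc. with an empty string (i.e. ""). Then lowercase all the characters in the string and put the words into a list.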
-
Hi @jun, thanks for your suggestion! I used replace on all the characters I wanted to get rid of, and it worked :)
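The replace-based suggestion above might look roughly like the following. This is a minimal sketch, not the asker's final code: the helper name count_unique_words and the two-line sample are assumptions for illustration, and string.punctuation is used as the set of characters to delete, which covers ASCII punctuation only.

```python
import string

def count_unique_words(lines):
    """Collect distinct words, ignoring case and ASCII punctuation."""
    unique_words = set()
    for line in lines:
        line = line.lower()
        # Delete each punctuation character, as suggested in the comment.
        for ch in string.punctuation:
            line = line.replace(ch, "")
        # Split only after cleaning, so 'way,' contributes 'way'.
        unique_words.update(line.split())
    return unique_words

# Hypothetical sample: the first two lines of the sonnet.
sample = [
    "How heavy do I journey on the way,",
    "When what I seek, my weary travel's end,",
]
words = count_unique_words(sample)
print("{} unique words".format(len(words)))
```

Since a file object yields its lines when iterated, the same function should also accept the result of open("sonnet.txt") directly, ideally inside a with block so the file is closed automatically. Deleting the apostrophe turns "travel's" into "travels"; whether that should count as its own word is a judgment call the question leaves open.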
Tags: python