Stripping punctuation and finding unique words in Python

【Question title】: Stripping punctuation and finding unique words in Python
【Posted】: 2020-02-26 20:20:36
【Question】:

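My assignment is this:

Write a program that displays a list of all the unique words found in the file uniq_words.txt. Print your results in alphabetical order and in lowercase. Hint: store the words as the elements of a set; remove punctuation by using string.punctuation from the string module.

The code I have so far is: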

def main():
    import string

    with open('uniq_words.txt') as content:
        new = sorted(set(content.read().split()))
        for i in new:
            while i in string.punctuation:
                new.discard(i)
                print(new)

main()

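If I run the code like this, it goes into an infinite loop, printing the unique words over and over. My set still contains words such as "value." or "never/have". How do I remove the punctuation using string.punctuation? Or am I approaching this from the wrong direction? Any advice is much appreciated!

Edit: The link did not help me, because the method given there does not work on a list.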

【Comments】:

Does this answer your question? Best way to strip punctuation from a string
@GiftZwergrapper I had actually read that post before, but I don't think it solves my problem.

Tags: python if-statement while-loop infinite-loop

【Solution 1】:

My solution:

import string
with open('sample_string.txt') as content:
    sample_string = content.read()
    print(sample_string)
    # Sample string: containing punctuation! As well as CAPITAL LETTERS and duplicates duplicates.
    # Remove every punctuation character, then lowercase the text.
    sample_string = sample_string.translate(str.maketrans('', '', string.punctuation)).lower()
    # split() with no argument splits on any whitespace, including newlines.
    out = sorted(set(sample_string.split()))
    print(out)
    # ['and', 'as', 'capital', 'containing', 'duplicates', 'letters', 'punctuation', 'sample', 'string', 'well']
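
One caveat with deleting punctuation outright is that a token such as never/have collapses into neverhave. Below is a minimal sketch, assuming you would rather treat punctuation as a word separator. The helper name unique_words is hypothetical, and the text is passed in as a string instead of being read from a file:

```python
import string

def unique_words(text):
    """Return the sorted, lowercased unique words in text (hypothetical helper)."""
    # Map every punctuation character to a space so "never/have" becomes two words.
    to_spaces = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = text.translate(to_spaces).lower()
    # split() with no argument splits on any run of whitespace, including newlines.
    return sorted(set(cleaned.split()))

print(unique_words("A value. Never/have a VALUE!\nnever"))
# ['a', 'have', 'never', 'value']
```

The trade-off is that apostrophes are treated as separators too, so a word like don't would be split into don and t. Whether that is acceptable depends on the input file.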

【Discussion】:

Hi, thanks for your comment. I currently get this error when I try to run the code: AttributeError: 'list' object has no attribute 'translate'.
This is probably related to how you read the file. Try this on the line above the one that raises the error: sample_text = ' '.join(sample_text), which should turn your list into a string.
`def main(): import string with open('uniq_words.txt') as content: new1 = content.read().split() new2= ' '.join(new1) nopunc = new2.translate(str.maketrans('', '', string.punctuation)).lower() out = sorted(list(set(nopunc,split(" ")))) main()` Is this what you meant?

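【Solution 2】:

These are really two tasks, so let's split this into two questions. I will deal with your question about stripping punctuation, since you have shown your own effort on that one. For the question of determining the unique words, please open a new question (before posting it, search Stack Overflow for similar questions; I am fairly sure you will find something useful!)

You correctly noticed that you end up in an infinite loop. That is because your while condition stays true once i is a punctuation character: removing i from new does not change i itself. You can avoid this with a simple if condition. In fact, your code mixes up the concepts of while and if, and your scenario is tailor-made for an if statement. I suspect you thought you needed a while loop because you had iteration in mind, but the for loop already iterates over the words in new. So the fix would be: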

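# Assumes new is a set (sorted() returns a list, which has no discard()).
for i in list(new):  # iterate over a copy so the set can be modified safely
    if i in string.punctuation:
        new.discard(i)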
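
Note that `i in string.punctuation` is a substring test against the punctuation string, so it only matches tokens made up of punctuation and leaves a word like value. untouched. Here is a small sketch of the difference, using str.strip to trim punctuation from the ends of each word (the sample words are made up for illustration):

```python
import string

words = {"value.", "never/have", "!", "Hello,"}

# Only the bare "!" passes the membership test; "value." does not.
only_punct = {w for w in words if w in string.punctuation}

# strip() with string.punctuation trims punctuation from both ends of each word.
trimmed = {w.strip(string.punctuation).lower() for w in words}
trimmed.discard("")  # "!" becomes an empty string once stripped

print(only_punct)
# {'!'}
print(sorted(trimmed))
# ['hello', 'never/have', 'value']
```

Trimming only touches the ends of a word, so never/have keeps its slash. Punctuation inside a word needs a character-level approach.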

However, another, more "pythonic" way is to use a list comprehension instead of a for loop:

import string

with open("uniq_words.txt") as content:
    # Keep every character of the file that is not a punctuation character.
    stripped_content = "".join([
        x
        for x in content.read()
        if x not in string.punctuation
    ])
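
To get from the stripped text to the sorted list the assignment asks for, the same character filter can feed a set. This is a rough sketch, not a definitive implementation: the helper name unique_words_from is hypothetical, and it accepts any iterable of lines so that it can be tried without the actual file:

```python
import io
import string

def unique_words_from(lines):
    """Collect sorted, lowercased unique words from an iterable of lines (hypothetical helper)."""
    words = set()
    for line in lines:
        # Keep only the characters that are not punctuation, as in the comprehension above.
        stripped = "".join(ch for ch in line if ch not in string.punctuation)
        words.update(stripped.lower().split())
    return sorted(words)

# io.StringIO stands in for an open file here; open("uniq_words.txt") works the same way.
sample = io.StringIO("Hello, world!\nhello again.\n")
print(unique_words_from(sample))
# ['again', 'hello', 'world']
```

Iterating line by line avoids reading the whole file into memory at once, which only matters for large inputs.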

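【Discussion】:

Your more "pythonic" way does not work. The code has invalid syntax.
@GiftZwergrapper You are right, I was a bit too quick and that was a typo. I have updated it and it works now.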