How to remove special characters from txt files using Python: answers

【Question title】: How to remove special characters from txt files using Python
【Posted】: 2012-08-07 18:38:40
【Question】:

from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print "There are" ,sum(map(countwords, filelist)), "words in the files. " "From directory",pattern
import os
uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
print "There are" ,len(uniquewords), "unique words in the files." "From directory", pattern

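This is my code so far. It counts the unique words and the total number of words in D:\report\shakeall\*.txt.

The problem is that this code treats, for example, code, code. and code! as different words, so the result is not an accurate count of unique words.

I would like either to remove the special characters from the 42 text files using a Windows text editor,

or to write an exception rule that solves the problem.

If I go with the latter, how should I write the code?

Should it modify the text files directly, or should it make an exception so that special characters are not counted?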

【Comments】:

How to format code on SO
You can just write set() instead of set([]).

Tags: python

【Solution 1】:

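import re
# Read the whole file, then replace every character that is not a letter,
# digit, newline, or period with a space.
with open('a.txt') as fh:
    text = fh.read()
new_str = re.sub(r'[^a-zA-Z0-9\n.]', ' ', text)
with open('b.txt', 'w') as fh:
    fh.write(new_str)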

It replaces every character that is not a letter, digit, newline, or period with a space.
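
To make the effect on the words from the question concrete, here is a minimal Python 3 sketch that applies the substitution to an in-memory string instead of a file. The sample text and the variable names are made up for illustration. Because the answer's pattern keeps periods, `code.` still differs from `code`; the second pattern, which also replaces periods, is an assumed variant and not part of the answer.

```python
import re

sample = "code code. code!"

# Pattern from the answer: keeps letters, digits, newlines and periods.
kept_periods = re.sub(r'[^a-zA-Z0-9\n.]', ' ', sample)
# Variant (an assumption, not part of the answer): also replace periods.
no_periods = re.sub(r'[^a-zA-Z0-9\n]', ' ', sample)

print(set(kept_periods.split()))  # contains both 'code' and 'code.'
print(set(no_periods.split()))    # contains only 'code'
```

Whether periods should be kept depends on the data: keeping them preserves tokens such as `3.14`, while dropping them merges `code.` with `code`.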

【Comments】:

You shouldn't use str as a variable name, because it is a built-in class.

【Solution 2】:

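I'm quite new to this and I doubt this is very elegant, but one option is to take your strings after reading them in and run them through string.translate() to strip the punctuation. Here is the Python documentation for it, for version 2.7 (which I think you are using).

As for the actual code, it might look something like this (but perhaps someone better than me can confirm or improve it):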

import string
fileString = fileString.translate(None, string.punctuation)

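Here "fileString" is the string read in by your open(fp). "None" is supplied in place of a translation table (which is normally used to change some characters into others), and the second argument, string.punctuation (a Python string constant containing all punctuation characters), is the set of characters that will be removed from your string.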
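
The two-argument call described here works only on Python 2 byte strings. As a hedged sketch for Python 3, where `str.translate` takes a single table, the table can be built with `str.maketrans`, whose third argument lists the characters to delete. The names `DELETE_PUNCTUATION`, `strip_punctuation` and `file_string` are illustrative and do not come from the answer.

```python
import string

# Third argument of str.maketrans: characters mapped to None, i.e. deleted.
DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return text.translate(DELETE_PUNCTUATION)

file_string = "code code. code! (code)"
print(strip_punctuation(file_string).split())
```

Deleting punctuation outright also joins words that are separated only by punctuation (`a,b` becomes `ab`), so replacing with spaces may suit some texts better.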

If the above doesn't work, you could modify it as follows:

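import string
inChars = string.punctuation
outChars = ' ' * len(inChars)  # maketrans needs two strings of equal length
translateTable = string.maketrans(inChars, outChars)
fileString = fileString.translate(translateTable)  # punctuation becomes spaces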

With a quick search I found a few answers to similar questions. I'll link them here too, in case you can get more out of them.

Removing Punctuation From Python List Items

Remove all special characters, punctuation and spaces from string

Strip Specific Punctuation in Python 2.x

Finally, if what I've said is completely wrong, please comment and I'll delete it, so that others don't try it and get frustrated.

【Comments】:

【Solution 3】:

import re

Then replace

[uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]

with

[uniquewords.add(re.sub('[^a-zA-Z0-9]*$', '', x)) for x in open(os.path.join(root,name)).read().split()]

This strips all trailing non-alphanumeric characters from each word before adding it to the set.
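
As a rough sketch of how this idea could be taken a little further in Python 3, the version below strips non-alphanumeric characters from both ends of each token and skips tokens that become empty (for example `--`). The names `EDGE_JUNK`, `normalize` and `unique_words` are hypothetical helpers introduced for illustration. Stripping leading characters as well is an assumption that goes beyond what the answer does.

```python
import os
import re

# Leading or trailing runs of non-alphanumeric characters.
EDGE_JUNK = re.compile(r'^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$')

def normalize(word):
    """Strip non-alphanumeric characters from both ends of a word."""
    return EDGE_JUNK.sub('', word)

def unique_words(directory):
    """Collect the distinct normalized words in every file under directory."""
    found = set()
    for root, dirs, files in os.walk(directory):
        for name in files:
            with open(os.path.join(root, name)) as fh:
                for token in fh.read().split():
                    word = normalize(token)
                    if word:  # tokens such as "--" normalize to ""
                        found.add(word)
    return found
```

With the directory from the question, `len(unique_words("D:\\report\\shakeall"))` would give the unique word count. Characters inside a word, such as the apostrophe in `don't`, are left alone, and the text files themselves are never modified.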

【Comments】: