Python: read from file and remove non-ASCII characters

【Question title】: Python read from file and remove non-ascii characters
【Posted】: 2014-12-09 17:36:08
【Question】:

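I have the following program, which reads a file word by word and writes each word to another file, but without the non-ASCII characters from the first file.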

import unicodedata
import codecs
infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')


for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")

infile.close()
outfile.close()

The only problem I'm facing is that with this code the newlines do not show up in the second file (d_parsed). Any clues?

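【Comments】:

What is the problem? It works fine.
It does not break the line where outfile.write("\n") is called.
If you are on Windows and the text editor you use to view the file does not recognize \n as a line separator, it may look as if there is no newline at the end of each line.
As a side note: this code does not remove non-ASCII characters. It drops the bytes that cannot be decoded using the UTF-8 encoding.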

Tags: python encoding character-encoding utf

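【Solution 1】:

From the docs for codecs.open:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

I assume you are on Windows, where the newline sequence is actually '\r\n'. A file opened in text mode converts \n to \r\n automatically on write, but this does not happen with codecs.open.

Simply write "\r\n" instead of "\n" and it should work fine, at least on Windows.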
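A minimal sketch of the point above, assuming Python 3 and throwaway files in a temp directory (the helper names `write_lines` and `read_raw` and the file names are made up for illustration). It shows that codecs.open writes the line ending exactly as given, so the file contains "\r\n" only when the code writes it explicitly:

```python
import codecs
import os
import tempfile


def write_lines(path, lines, newline):
    """Write each line followed by `newline`, exactly as given."""
    # codecs.open opens the underlying file in binary mode,
    # so "\n" is never translated to "\r\n" on the way out.
    with codecs.open(path, "w", encoding="utf-8") as outfile:
        for line in lines:
            outfile.write(line + newline)


def read_raw(path):
    """Return the raw bytes of the file so the line endings are visible."""
    with open(path, "rb") as f:
        return f.read()


tmpdir = tempfile.mkdtemp()
lf_path = os.path.join(tmpdir, "demo_lf.txt")      # hypothetical file names
crlf_path = os.path.join(tmpdir, "demo_crlf.txt")

write_lines(lf_path, ["one", "two"], "\n")
write_lines(crlf_path, ["one", "two"], "\r\n")

lf_bytes = read_raw(lf_path)
crlf_bytes = read_raw(crlf_path)
print(repr(lf_bytes))
print(repr(crlf_bytes))
```

Hard-coding "\r\n" ties the output to Windows conventions, which is why this is only reasonable when the target platform is known.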

【Comments】:

【Solution 2】:

codecs.open() does not support universal newlines, e.g., it does not translate \r\n to \n while reading on Windows.

Use io.open() instead:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)

By the way, if you want to remove non-ASCII characters, you should use ascii instead of utf-8 as the output encoding.
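To illustrate the difference between the two encodings, here is a small sketch, assuming Python 3 and a made-up sample byte string. Decoding as utf-8 with errors='ignore' only drops bytes that are invalid UTF-8, while encoding to ascii with errors='ignore' drops every non-ASCII character:

```python
# Sample input (an assumption for illustration): "café" encoded as UTF-8,
# followed by a stray byte 0xFF that is not valid UTF-8.
raw = b"caf\xc3\xa9 \xff ok"

# utf-8 + errors="ignore": only the undecodable byte 0xFF is dropped,
# the non-ASCII character "é" survives.
decoded = raw.decode("utf-8", errors="ignore")

# ascii + errors="ignore": every character outside ASCII is dropped.
ascii_only = decoded.encode("ascii", errors="ignore").decode("ascii")

print(repr(decoded))
print(repr(ascii_only))
```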

If the input encoding is ASCII-compatible (such as utf-8), then you can open the file in binary mode and use bytes.translate() to remove the non-ASCII characters:

#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))
with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))

It does not normalize whitespace the way the first code example does.
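If whitespace normalization is wanted as well, one possible sketch (assuming Python 3 and an in-memory sample line instead of a file) is to split and rejoin the bytes after translate(). The variable names here are illustrative:

```python
# All byte values 0x80..0xFF; in UTF-8 every byte of a non-ASCII
# character falls in this range, so deleting them removes the whole character.
nonascii = bytes(range(0x80, 0x100))

# Sample line (an assumption): two non-ASCII letters and a double space.
line = "na\u00efve  caf\u00e9\n".encode("utf-8")

# Delete the non-ASCII bytes; whitespace is left untouched.
stripped = line.translate(None, nonascii)

# Collapse runs of whitespace to single spaces and restore the newline.
normalized = b" ".join(stripped.split()) + b"\n"

print(repr(stripped))
print(repr(normalized))
```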

【Comments】:

bytes.translate() - very nice

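【Solution 3】:

Open the csv file with codecs using the ascii encoding and errors='ignore'; the non-ASCII characters are then dropped as the file is read.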

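import codecs

# errors='ignore' silently drops any bytes that are not valid ASCII
reader = codecs.open("example.csv", 'r', encoding='ascii', errors='ignore')
for reading in reader:
    print(reading)  # print the current line, not the reader object
reader.close()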

【Comments】: