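【Posted】: 2016-04-08 06:42:05
【Question】:
I need to normalize text from the Italian wiki using python3 and nltk, but I have run into a problem. Most words come out fine, but some words, or more precisely some characters, are mapped incorrectly.
For example: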
'fruibilit\xe3','n\xe2\xba','citt\xe3'
I am sure the problem lies with characters such as à, è, etc.
Code:
# coding: utf8
import os
from nltk import corpus, word_tokenize, ConditionalFreqDist
it_sw_plus = corpus.stopwords.words('italian') + ['doc', 'https']
#it_folder_names = ['AA', 'AB', 'AC', 'AD', 'AE', 'AF']
it_path = os.listdir('C:\\Users\\1\\projects\\i')
it_corpora = []
def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw_plus and token.isalpha():
            token = token.lower().encode('utf8')
            norm_tokens.append(token)
    return norm_tokens
for folder_name in it_path:
    path_to_files = 'C:\\Users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path)
        raw_text = text_file.read().decode('utf8')
        norm_tokens = normalize(raw_text)
        it_corpora.append(norm_tokens)
print(it_corpora)
How can I fix this? I am running on Win7 (Russian locale).
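One possible direction, sketched under the assumption that the script runs on Python 3 and the wiki files are UTF-8: keep every token as `str` and drop the `.encode('utf8')` / `.decode('utf8')` calls, since Python 3 strings are already Unicode text. To stay self-contained, the sketch below uses `str.split` as a stand-in for `word_tokenize` and a hypothetical `sample_stopwords` set in place of `corpus.stopwords.words('italian')`; both are illustrative assumptions, not part of the original code.

```python
def normalize(raw_text, stopwords, tokenize=str.split):
    """Return lowercased alphabetic tokens that are not stopwords."""
    norm_tokens = []
    for token in tokenize(raw_text):
        token = token.lower()  # stays a str; no .encode() call
        if token.isalpha() and token not in stopwords:
            norm_tokens.append(token)
    return norm_tokens


# Hypothetical stopword set, standing in for corpus.stopwords.words('italian').
sample_stopwords = {'la', '\xe8'}  # '\xe8' is the character è

tokens = normalize('La citt\xe0 \xe8 fruibilit\xe0', sample_stopwords)

# ascii() keeps the output printable even on a console with a limited code page.
print([ascii(t) for t in tokens])
```

Lowercasing before the stopword check is a small design choice here: it lets capitalized stopwords such as "La" be filtered as well.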
When I try this code:
import io
with open('C:\\Users\\1\\projects\\i\\AA\\wiki_00', 'r', encoding='utf8') as fin:
    for line in fin:
        print (line)
In PowerShell:
<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">
Armonium
Traceback (most recent call last):
  File "i.py", line 5, in <module>
    print (line)
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to <undefined>
In the Python command line:
<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">
Armonium
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\1\projects\i.py", line 5, in <module>
    print (line)
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to <undefined>
When I try a request:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 90: character maps to <undefined>
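The tracebacks above are raised from `encodings\cp866.py`, which suggests the failure happens while `print` encodes text for the console rather than while the file is read: U+2013 (an en dash) has no mapping in cp866. A minimal sketch of one workaround, assuming it is acceptable to show unencodable characters as `?` on screen; the helper name `console_safe` is hypothetical.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    # Default to the console's own encoding; fall back to UTF-8 if unknown.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, errors='replace').decode(encoding)


line = 'Armonium \u2013 strumento'

# Simulate the cp866 console from the traceback explicitly.
print(console_safe(line, 'cp866'))  # prints: Armonium ? strumento
```

This only changes what is displayed; the text held in memory and anything written to a UTF-8 file is unaffected.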
【Comments】:
-
Can you put the original file online? Otherwise it is hard to know what the file's real encoding is.
-
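open() with the encoding= parameter only works in Python 3! In Python 2, use import io; io.open(filename, 'r', encoding='utf8')
-
Try installing this: pypi.python.org/packages/source/w/win_unicode_console/…, then correct your code to use io.open for Python 2, or just use Python 3, and see whether the text is read and printed correctly.
-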
alvas, sorry, my mistake. But when I use python3 in PowerShell, it raises the same kind of error as in the Python command line. I have also tried the code in Python IDLE, and there it works fine.
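Since the code works in IDLE but fails in the console, one way to check that reading is correct independently of the console is to write the result to a UTF-8 file instead of printing it. The sketch below assumes Python 3 and uses `io.open` with an explicit encoding, as suggested in the comments; the helper names `read_utf8` and `write_utf8` and the temporary sample file are hypothetical and only there to make the example runnable.

```python
import io
import os
import tempfile


def read_utf8(path):
    """Read a whole file as text, decoding it explicitly as UTF-8."""
    with io.open(path, 'r', encoding='utf8') as fin:
        return fin.read()


def write_utf8(path, text):
    """Write text to a file, encoding it explicitly as UTF-8."""
    with io.open(path, 'w', encoding='utf8') as fout:
        fout.write(text)


# Round-trip a sample containing accented letters and an en dash.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    write_utf8(sample_path, 'citt\xe0 \u2013 fruibilit\xe0\n')
    round_trip = read_utf8(sample_path)

# ascii() avoids sending non-encodable characters to the console.
print(ascii(round_trip))
```

If the round-tripped text matches, the remaining problem is most likely in how the console displays the text, not in how the file is decoded.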
Tags: python-3.x file-io encoding io nltk