【Question title】: Python tokenization UnicodeDecodeError
【Posted】: 2016-05-18 03:48:39
【Question】:

I am trying to tokenize some documents, but I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6: ordinal not in range(128)

import nltk
import pandas as pd

df = pd.DataFrame(pd.read_csv('status2.csv'))
documents = df['status']

result = [nltk.word_tokenize(sent) for sent in documents]

I assumed it was a Unicode problem, so I added:

documents = unicode(documents, 'utf-8')

That produced a different error:

TypeError: coercing to Unicode: need string or buffer, Series found

print documents

1      Brandon Cachia ,All I know is that,you're so n...
2      Melissa Zejtunija:HAM AND CHEESE BIEX INI??? *...
3                         .........Where is my mind?????
4      Having a philosophical discussion with Trudy D...

【Question comments】:

    Tags: python nlp


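    【Solution 1】:

    unicode() operates on a string or a byte buffer, but documents is a pandas Series, so the conversion has to be applied to each element rather than to the Series as a whole.

    Perhaps decode each element before tokenizing it: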

    result = [nltk.word_tokenize(unicode(sent, 'utf-8')) for sent in documents]
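
The difference between decoding the container and decoding its elements can be shown without pandas or NLTK. The following is only a minimal sketch: `raw_statuses` is made-up sample data standing in for the `status` column, and `decode_each` is a hypothetical helper, not part of any library. It assumes the column holds UTF-8 encoded bytes, which is what the 0xef byte in the error message suggests but does not prove.

```python
# -*- coding: utf-8 -*-

# Made-up sample data (an assumption): UTF-8 byte strings like those a
# Python 2 read of the CSV would yield, plus one value that is already text.
raw_statuses = [
    b"Where is my mind?????",
    b"Hello \xef\xbc\x81",  # byte 0xef sits at position 6
    u"already text",
]


def decode_each(items, encoding="utf-8"):
    """Hypothetical helper: decode byte strings one by one, leaving
    values that are not bytes (text, NaN, ...) untouched."""
    return [s.decode(encoding) if isinstance(s, bytes) else s for s in items]


decoded = decode_each(raw_statuses)

# Decoding the same bytes with the ASCII codec fails in the same way
# as the error reported in the question.
failure_position = None
try:
    raw_statuses[1].decode("ascii")
except UnicodeDecodeError as err:
    failure_position = err.start  # index of the first undecodable byte
```

Each item in `decoded` is then text and can be passed to `nltk.word_tokenize` one at a time, as in the list comprehension above.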
    

    【Comments】: