【Question title】:Find a way to work with the Porter stemmer and encoding in Python
【Posted】:2017-04-16 03:47:22
【Question】:

I am trying to read a file and store the stemmed tokens of its text using PorterStemmer, but I get this error.

        tokens=preprocessTokens(line)
      File "/home/fl/git/KNN/preprocessDoc.py", line 20, in preprocessTokens
        line=line+' '+ps.stem(w)
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/porter.py", line 664, in stem
        stem = self._step1a(stem)
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/porter.py", line 289, in _step1a
        if word.endswith('ies') and len(word) == 4:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

To work around it, I added these two lines to my code, after which the error was ignored:

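    import sys
    # Python 2 only: reload(sys) restores setdefaultencoding, which site.py removes at startup.
    reload(sys)
    sys.setdefaultencoding('ISO-8859-15')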

However, for some files I get the following error. I then tried changing the encoding to 'utf-8' and got the same error.

        tokens=preprocessTokens(line.encode('ascii',errors='ignore'))
      File "/home/fl/git/KNN/preprocessDoc.py", line 20, in preprocessTokens
        line=line+' '+ps.stem(w)
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/porter.py", line 665, in stem
        stem = self._step1b(stem)
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/porter.py", line 376, in _step1b
        lambda stem: (self._measure(stem) == 1 and
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
        if suffix == '*d' and self._ends_double_consonant(word):
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
        word[-1] == word[-2] and
    IndexError: string index out of range

【Comments on the question】:

    Tags: python-2.7 encoding ascii porter-stemmer


    【Solution 1】:

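    The error message points to a problem with character values greater than 127. As a workaround, I go through the input string character by character, AND each character with 127 (i.e. c & 127), and put the result back into the string. In other words, I rebuild the string so that every character is forced below 128, into the ASCII range, and then continue processing that input string. This is the solution I found for my problem.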
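
The masking workaround described above could be sketched roughly as follows. This is only an illustration, not the answerer's actual code: the function name `force_ascii` and the sample input are hypothetical, and the sketch assumes the line is available as a byte string (a `str` in Python 2, `bytes` in Python 3).

```python
def force_ascii(raw):
    """Clear the high bit of every byte so each value falls in 0..127.

    `force_ascii` is a hypothetical helper name, not part of NLTK.
    """
    # bytearray yields integers on both Python 2 and Python 3,
    # so `b & 0x7F` works the same way on either version.
    masked = bytearray(b & 0x7F for b in bytearray(raw))
    return masked.decode('ascii')


# Sample input (an assumption): UTF-8 bytes containing 0xc3 0xa9 for "e acute".
line = u'caf\u00e9 stems'.encode('utf-8')
print(force_ascii(line))  # prints "cafC) stems"
```

Note that masking maps each non-ASCII byte onto an unrelated ASCII character (0xc2 becomes `B`, for example) instead of removing it, so the tokens passed on to the stemmer are altered rather than cleaned.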

    【Comments】:
