NLTK (python) and Greek encoding: answers

【Question title】: NLTK (python) and Greek encoding
【Posted】: 2014-02-23 04:58:16
【Question】:

I am trying to use the NLTK package on Greek text, but I have run into a serious encoding problem. My code is below:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os, string, re, nltk

def find_bigrams(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append((input_list[i], input_list[i+1]))
    # return after the loop so that every pair is collected
    return bigram_list

def get_nice_string(list_or_iterator):
   return "[" + " , ".join( str(x) for x in list_or_iterator) + "]"

def stripText(rawText):
    text = rawText
    rules = [
        {r'{[^)]*\}' : ''},             # remove curly brackets
        {r'\([^)]*\)' : ''},            # remove parentheses
        {r'^https?:\/\/.*[\r\n]*' : ''},# remove urls
        {r' +' : ' '},                  # remove multiple whitespaces
        {r'^\s+': ''},                  # remove whitespaces beginning
        {r'\.\.+' : '.'}                # remove multiple fullstops
    ]

    for rule in rules:
        for (k, v) in rule.items():
            regex = re.compile(k)
            text = regex.sub(v, text)

    sentenceClean = text.translate(string.maketrans('', ''), '{}[]|?"=\'')
    return sentenceClean

if __name__ == '__main__':
    f = open('C:\\Users\\Dimitris\\Desktop\\1.txt', 'r').readlines()

    newFile = open('C:\\Users\\Dimitris\\Desktop\\corpus.txt', 'w')
    newFile1 = open('C:\\Users\\Dimitris\\Desktop\\words.txt', 'w')

    words = ['jpg', 'jpeg', 'File', 'Image']

    for line in f:
        sentences = stripText(line)
        whitespaces = sentences.count(' ')
        if any(word in sentences for word in words):
            continue
        elif whitespaces < 20:
            continue
        else:
            newFile.write(sentences+'\n')

            b = nltk.word_tokenize(sentences)
            print get_nice_string(b)
            get_nice_string(nltk.bigrams(b))
            print get_nice_string(nltk.bigrams(b))

            newFile1.write(get_nice_string(b))


    newFile.close()
    newFile1.close()

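When I print the output of nltk.word_tokenize(sentences), the result looks like (('\xe5\xe3\xea\xfe\xec\xe9\xe1', '\xe3\xe9')), but if I use the get_nice_string() function to turn the list into a string, the result is normal Greek text. So far, so good.

But whether I use the find_bigrams() function or nltk.bigrams(), I get escaped strings like the one above (('\xe5\xe3\xea\xfe\xec\xe9\xe1', '\xe3\xe9')), even when I use get_nice_string() to turn the list into a string.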

I also tried opening the file with the codecs.open() function, like this:

f = codecs.open('C:\\Users\\Dimitris\\Desktop\\1.txt', 'r', 'utf-8').readlines()

But the problem persists.

Any ideas?

【Discussion】:

Tags: python python-2.7 encoding nlp nltk

【Solution 1】:

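First, NLTK's word_tokenize() may not be suitable for your Greek input; the default nltk.tokenize.word_tokenize() is built on the Treebank tokenizer, which follows the conventions of the English Penn Treebank, see https://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.treebank.TreebankWordTokenizer-class.html

I'm not sure whether you are getting the right tokens. Since Greek uses whitespace as a token delimiter, NLTK appears to work, but I would use str.split() instead: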

>>> from nltk import word_tokenize
>>> x = "Θέλεις να χορέψεις μαζί μου"
>>> for i in word_tokenize(x):
...     print i
... 
Θέλεις
να
χορέψεις
μαζί
μου
>>> for i in x.split():
...     print i
... 
Θέλεις
να
χορέψεις
μαζί
μου
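
One thing str.split() does not do is separate punctuation from the adjacent word. As a minimal sketch, assuming Python 3 and plain Unicode text, a small regex tokenizer can cover that case. The name simple_tokenize is hypothetical and is not part of NLTK; this is a stand-in for illustration, not a trained tokenizer.

```python
import re
import unicodedata

# Either a run of word characters or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    # Normalize to NFC so accented Greek letters are single code points
    # that \w can match.
    text = unicodedata.normalize("NFC", text)
    return TOKEN_RE.findall(text)

tokens = simple_tokenize("Θέλεις να χορέψεις μαζί μου;")
for token in tokens:
    print(token)
```

With this input the final question mark (written ";" in Greek) comes out as its own token, whereas str.split() would leave it attached to the last word.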

Rather than using the default word_tokenize() in NLTK, it would be better to retrain a punkt model with PunktTrainer: http://nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktTrainer

Next, regarding printing UTF-8 characters, see the question "byte string vs. unicode string. Python".
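
The escaped output in the question is consistent with this distinction: calling str() on a list or tuple formats each element with repr(), so byte strings inside a container show up as \x escapes even though each one prints fine on its own. The sketch below is written for Python 3 (the thread itself uses Python 2.7) and reuses the bytes from the question. The cp1253 codec is an assumption inferred from the byte values, not something the asker stated.

```python
# Bytes copied from the question's output.
raw_pair = (b'\xe5\xe3\xea\xfe\xec\xe9\xe1', b'\xe3\xe9')

# str() on a tuple formats each element with repr(), so the bytes stay escaped.
escaped = str(raw_pair)

# Decode each token to text once, as early as possible.
# Assumption: the file is in the Windows Greek code page (cp1253).
pair = tuple(token.decode('cp1253') for token in raw_pair)

# Join the elements yourself instead of calling str() on the container.
readable = "(" + ", ".join(pair) + ")"

print(escaped)
print(readable)
```

If the bytes really are cp1253, this would also explain why opening the file as 'utf-8' did not help: the codec passed to codecs.open() has to match the encoding the file was saved in.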

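Finally, on the problem of NLTK messing up your bigrams, I suggest using your own bigram code, since NLTK is mainly tested on English input rather than Greek input. Try: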

>>> x = "Θέλεις να χορέψεις μαζί μου"
>>> bigrams = zip(*[x.split()[i:] for i in range(2)])
>>> for i in bigrams:
...     print i[0], i[1]
... 
Θέλεις να
να χορέψεις
χορέψεις μαζί
μαζί μου
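
The same idea can be wrapped in small helpers so that the bigrams are also written out as readable text. This is a sketch assuming Python 3 and already-decoded text; bigrams and format_bigrams are hypothetical names, not NLTK functions.

```python
def bigrams(tokens):
    """Return adjacent token pairs as a list of tuples."""
    tokens = list(tokens)
    # Pair every token with its successor; zip stops at the shorter input.
    return list(zip(tokens, tokens[1:]))

def format_bigrams(pairs):
    """Build a display string by joining the tokens themselves.

    Calling str() on each tuple would fall back to repr() of its elements.
    """
    return "[" + " , ".join(" ".join(pair) for pair in pairs) + "]"

sentence = "Θέλεις να χορέψεις μαζί μου"
pairs = bigrams(sentence.split())
print(format_bigrams(pairs))
```

Joining the tokens inside each pair is the step that get_nice_string() in the question skips: it calls str() on each tuple rather than on the words inside it.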

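【Comments】:

Thanks for your reply! But my problem persists when I import the text from a .txt file; I have no problem in the terminal. Also, I can see the words once they are decoded: >>> print '\xce\x9a\xce\xac\xce\xb9\xcf\x81\xce\xbf'.decode('utf-8') gives Κάιρο. I'm afraid the problem is related to the fact that nltk.bigrams() returns tuples in a list, but I still can't find a solution.
You can use this: f = codecs.open('C:\\Users\\Dimitris\\Desktop\\1.txt', 'r', 'utf-8-sig').readlines()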
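
For context on that suggestion: the 'utf-8-sig' codec behaves like 'utf-8' but drops a leading byte order mark (BOM), which some Windows editors write at the start of a file. The sketch below assumes Python 3 and uses a temporary file instead of the asker's path. It only applies if the file really is UTF-8 with a BOM.

```python
import codecs
import io
import os
import tempfile

text = "Θέλεις να χορέψεις μαζί μου"

# Write a UTF-8 file that starts with a BOM, as some editors do.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as handle:
    handle.write(codecs.BOM_UTF8 + text.encode("utf-8"))

# Plain 'utf-8' keeps the BOM as an invisible first character.
with io.open(path, "r", encoding="utf-8") as handle:
    with_bom = handle.read()

# 'utf-8-sig' strips the BOM while decoding.
with io.open(path, "r", encoding="utf-8-sig") as handle:
    without_bom = handle.read()

os.remove(path)

print(repr(with_bom[0]))
print(without_bom == text)
```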