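【Question title】:Concordance Unicode characters in Unicode corpus in nltk
【Posted】:2014-01-12 18:42:08
【Question】:

I have Unicode phrases that I want to search for in a Unicode corpus with nltk. The problem is that I apparently have to convert the encoding for nltk, otherwise the concordance returns zero results, and I don't know how to do that. Here is my simple code: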

import nltk
f=open('word-freq-utf8-new.txt','rU')
text=f.read()
text1=text.split()
abst=nltk.Text(text1)
abst.concordance('سلام')

【Comments】:

    Tags: python python-2.7 unicode nltk


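    【Solution 1】:

    nltk does not yet work well with Unicode, although they are working on it. As a quick fix, you can subclass ConcordanceIndex and override its print_concordance method so that encoding and decoding happen at the right points for processing and for display. Here is a very quick workaround, assuming you have already imported nltk (I am using a sample passage of Unicode Greek text):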

    >>> import re
    >>> tokens = re.findall(ur'\w+', t.decode('utf-8'), flags=re.U)    # I did this to make sure I was working with a decoded text. If you are working with an encoded text, skip this. `t` is the equivalent of your `text`.
    
    >>> class ConcordanceIndex2(nltk.ConcordanceIndex):
        'Extends the ConcordanceIndex class.'
        def print_concordance(self, word, width=75, lines=25):
            half_width = (width - len(word) - 2) // 2
            context = width // 4 # approx number of words of context
    
            offsets = self.offsets(word)
            if offsets:
                lines = min(lines, len(offsets))
                print("Displaying %s of %s matches:" % (lines, len(offsets)))
                for i in offsets:
                    if lines <= 0:
                        break
                    left = (' ' * half_width +
                            ' '.join([x.decode('utf-8') for x in self._tokens[i-context:i]]))    # decoded here for display purposes
                    right = ' '.join([x.decode('utf-8') for x in self._tokens[i+1:i+context]])    # decoded here for display purposes
                    left = left[-half_width:]
                    right = right[:half_width]
                    print(' '.join([left, self._tokens[i].decode('utf-8'), right]))    # decoded here for display purposes
                    lines -= 1
            else:
                print("No matches")
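
The width arithmetic in `print_concordance` is easier to check in isolation. The sketch below pulls the same slicing into a standalone function, `format_concordance_line` (a hypothetical name, not part of nltk), and assumes the tokens are already Unicode strings, so that slicing counts characters rather than bytes. Unlike the method above, it clamps the left window at 0, which is my own addition to avoid a negative slice start near the beginning of the text.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def format_concordance_line(tokens, i, width=75):
    """Build one concordance line for tokens[i]; tokens must be Unicode strings."""
    word = tokens[i]
    half_width = (width - len(word) - 2) // 2
    context = width // 4  # approximate number of context words on each side
    # Pad the left side so matches near the start of the text still line up.
    # max(0, ...) is an assumption of this sketch, not taken from the answer.
    left = u' ' * half_width + u' '.join(tokens[max(0, i - context):i])
    right = u' '.join(tokens[i + 1:i + context])
    return u' '.join([left[-half_width:], word, right[:half_width]])


tokens = u'ΚΑΙΣΑΡΕΙΑΣ ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΙΣΤΟΡΙΑ'.split()
line = format_concordance_line(tokens, 0)
print(line)
```

Because the matched word sits at position 0 here, the left side consists only of padding, and the word starts at the same column it would for any other match of the same length.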
    

    If you are working with decoded text, you need to encode the tokens like this:

    >>> concordance_index = ConcordanceIndex2([x.encode('utf-8') for x in tokens], key=lambda s: s.lower())    # encoded here to match an encoded text
    >>> concordance_index.print_concordance(u'\u039a\u0391\u0399\u03a3\u0391\u03a1\u0395\u0399\u0391\u03a3'.encode('utf-8'))
    Displaying 1 of 1 matches:
                               ΚΑΙΣΑΡΕΙΑΣ ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΙΣΤΟΡΙΑ Euse
    

    Otherwise, you can do this:

    >>> concordance_index = ConcordanceIndex2(tokens, key=lambda s: s.lower())
    >>> concordance_index.print_concordance('\xce\x9a\xce\x91\xce\x99\xce\xa3\xce\x91\xce\xa1\xce\x95\xce\x99\xce\x91\xce\xa3')
    Displaying 1 of 1 matches:
                               ΚΑΙΣΑΡΕΙΑΣ ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΙΣΤΟΡΙΑ Euse
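
One common cause of a zero-match lookup is mixing byte strings and Unicode strings: for non-ASCII text the two do not compare equal, so the index finds nothing. A minimal sketch of the alternative, decoding once at the input boundary and keeping everything as Unicode afterwards, might look like the following. `build_offsets` is a hypothetical helper that only mimics a word-to-offsets lookup; it is not nltk's implementation, and the sample sentence is made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import re
from collections import defaultdict


def build_offsets(tokens, key=lambda s: s):
    """Map each (normalised) token to the list of positions where it occurs."""
    offsets = defaultdict(list)
    for position, token in enumerate(tokens):
        offsets[key(token)].append(position)
    return offsets


# Stand-in for the bytes read from a UTF-8 file (made-up sample text).
raw = 'سلام دنیا سلام'.encode('utf-8')

# Decode once at the boundary, then tokenise on Unicode word characters.
text = raw.decode('utf-8')
tokens = re.findall(r'\w+', text, flags=re.U)

index = build_offsets(tokens)
query = 'سلام'  # a Unicode string, the same type as the tokens
print(index[query])             # positions of the query in the token list
print(index.get(raw[:8], []))   # the same word as UTF-8 bytes finds nothing
```

The design choice is to decode exactly once, right after reading, so that the tokens, the index keys, and the query are all the same string type; encoding back to bytes is then only needed when writing output.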
    

    【Discussion】:
