SpaCy parenthesis tokenization: pairs of (LRB, RRB) not tokenized correctly

【Question title】: SpaCy Parenthesis tokenization: pairs of (LRB, RRB) not tokenized correctly
【Posted】: 2019-06-04 07:40:11
【Question】:

When the word after an RRB (right round bracket) is not separated from it by a space, the bracket is tokenized as part of that word.

In [34]: nlp("Indonesia (CNN)AirAsia ")                                                               
Out[34]: Indonesia (CNN)AirAsia 

In [35]: d=nlp("Indonesia (CNN)AirAsia ")                                                             

In [36]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]                                              
Out[36]: 
[('Indonesia', 'Indonesia', 'PROPN', 'NNP'),
 ('(', '(', 'PUNCT', '-LRB-'),
 ('CNN)AirAsia', 'CNN)AirAsia', 'PROPN', 'NNP')]

In [39]: d=nlp("(CNN)Police")                                                                         

In [40]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]                                              
Out[40]: [('(', '(', 'PUNCT', '-LRB-'), ('CNN)Police', 'cnn)police', 'VERB', 'VB')]

The expected result is:

In [37]: d=nlp("(CNN) Police")                                                                        

In [38]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]                                              
Out[38]: 
[('(', '(', 'PUNCT', '-LRB-'),
 ('CNN', 'CNN', 'PROPN', 'NNP'),
 (')', ')', 'PUNCT', '-RRB-'),
 ('Police', 'Police', 'NOUN', 'NNS')]

Is this a bug? Any suggestions for how to fix it?

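【Question comments】:

As a workaround, you can preprocess the corpus beforehand with a regex such as re.sub(r'\b\)\b', r'\g<0> ', txt)
@WiktorStribiżew Good to know! Thanks for the detailed answer. Do you think this is a bug that needs to be fixed?
I'm not sure it is a bug, since you can always extend the base functionality. There may be reasons for not adding this rule to the infixes.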
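The preprocessing workaround from the first comment can be sketched as follows. This is a minimal illustration using only the standard library; the helper name add_space_after_paren is hypothetical and not part of spaCy.

```python
import re

# Matches a ")" that has a word character (letter, digit, or "_") on both sides.
CLOSING_PAREN_BETWEEN_WORDS = re.compile(r"\b\)\b")


def add_space_after_paren(text: str) -> str:
    """Insert a space after each ")" that sits directly between two word characters.

    Hypothetical helper for illustration; run it on raw text before calling nlp().
    """
    # \g<0> re-emits the matched ")" and the literal space follows it.
    return CLOSING_PAREN_BETWEEN_WORDS.sub(r"\g<0> ", text)


fixed = add_space_after_paren("Indonesia (CNN)AirAsia ")
print(repr(fixed))

# A ")" already followed by a space is left alone.
unchanged = add_space_after_paren("(CNN) Police")
print(repr(unchanged))
```

Because the text is changed before tokenization, character offsets in the resulting Doc no longer line up with the original string, which may matter if annotations are tied to the raw text.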

Tags: python spacy

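【Solution 1】:

Use a custom tokenizer to add an r'\b\)\b' rule to the infixes. The regex matches a ) that is preceded by a word character (a letter, digit, or _, plus some rarer characters in Python 3) and followed by a character of the same kind.

You can customize this regex further; how far to go depends largely on the contexts in which you want ) to be matched.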
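As a rough illustration of such customization, the sketch below compares the word-boundary rule with a stricter variant that only fires between letters. The ASCII-only letter class and the helper name match_positions are assumptions made for this example; spaCy's built-in rules use broader character classes.

```python
import re

# The rule from the answer: ")" with a word character on each side.
WORD_BOUNDARY = re.compile(r"\b\)\b")

# Assumed stricter variant: ")" directly between two ASCII letters only,
# so digits and underscores next to the bracket no longer trigger a split.
LETTERS_ONLY = re.compile(r"(?<=[A-Za-z])\)(?=[A-Za-z])")


def match_positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


for sample in ["(CNN)AirAsia", "f(x_1)2", "(CNN) Police"]:
    print(sample,
          match_positions(WORD_BOUNDARY, sample),
          match_positions(LETTERS_ONLY, sample))
```

Either pattern string can be passed in the infix list in place of r"\b\)\b"; which one is appropriate depends on whether brackets next to digits should also be split off.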

See the full Python demo:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer(nlp):
    # Prepend the new rule: split on a ")" that sits between two word characters.
    # list(...) keeps this working whether Defaults.infixes is a tuple or a list.
    infixes = [r"\b\)\b"] + list(nlp.Defaults.infixes)
    infix_re = compile_infix_regex(infixes)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    # Rebuild the tokenizer with the default prefix/suffix rules and exceptions.
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Indonesia (CNN)AirAsia ")

print([(t.text, t.lemma_, t.pos_, t.tag_) for t in doc])

Output:

[('Indonesia', 'Indonesia', 'PROPN', 'NNP'), ('(', '(', 'PUNCT', '-LRB-'), ('CNN', 'CNN', 'PROPN', 'NNP'), (')', ')', 'PUNCT', '-RRB-'), ('AirAsia', 'AirAsia', 'PROPN', 'NNP')]

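【Comments】:

In fact, the model you start from in the example was not trained on text tokenized by the custom tokenizer, so attaching a custom tokenizer after training is likely to hurt its performance.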

【Solution 2】:

An alternative solution that does not require a custom tokenizer:

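import spacy
from spacy.lang.char_classes import (
    ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, HYPHENS,
    LIST_ELLIPSES, LIST_ICONS,
)
from spacy.util import compile_infix_regex

nlp = spacy.blank('en')

# Copy of the default infix rules, with one extra rule appended at the end.
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),

        # Additions to infix rules begin here

        # bracket between characters
        r"\b\)\b",
    ]
)

# Swap only the infix matcher on the existing tokenizer.
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer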

Then save this model and use it as the base model when training a new one.
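The save-and-reload step might look roughly like the sketch below. It assumes a blank English pipeline, appends the rule to the default infixes instead of listing them all, and uses a temporary directory as a stand-in for a real model path.

```python
import tempfile

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Append the bracket rule to the default infix patterns.
infixes = list(nlp.Defaults.infixes) + [r"\b\)\b"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# Assumption: a temporary directory stands in for the real output path.
with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)           # the tokenizer settings are saved with the pipeline
    reloaded = spacy.load(model_dir)  # load it back as the base for training
    tokens = [t.text for t in reloaded("Indonesia (CNN)AirAsia")]

print(tokens)
```

Checking the tokens after reloading is a quick way to confirm that the modified infix rule survived serialization before starting a training run on top of the saved pipeline.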

【Comments】:

Full example: spacy.io/usage/linguistic-features#native-tokenizer-additions