Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

【Question title】: Spacy custom tokenizer to include only hyphen words as tokens using Infix regex
【Posted】: 2018-06-24 17:45:00
【Question description】:

I want Spacy to treat hyphenated words, e.g. long-term, self-esteem, etc., as single tokens. After looking at some similar posts on StackOverflow, Github, the documentation and elsewhere, I wrote a custom tokenizer as follows:

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

So for this sentence: 'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.'

Now, the tokens after incorporating the custom Spacy Tokenizer are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', ';', 'become', 'a', 'profession', ',', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'

Earlier, before this change, the tokens were:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'

And the expected tokens should be:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'

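Summary: As you can see...

The hyphenated word is kept whole, and so are the other punctuation marks, except for the double quotes and apostrophes...
...but now the apostrophes and double quotes no longer show the earlier or the expected behaviour.
I have tried different permutations and combinations of the regex compiled for the infix, but have made no progress in fixing this issue.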

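【Question comments】:

To be clear, “medicine” is tokenized incorrectly at both ends: the leading double quote stays attached and the trailing one is split off, giving '“medicine', '”'. You want to fix that too.

Tags: regex nlp tokenize spacy linguistics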

【Solution 1】:

Using the default prefix_re and suffix_re gives me the expected output:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

# 'en' is a spaCy v2 shortcut link; spaCy v3 needs a full package name such as 'en_core_web_sm'
nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']

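If you want to dig into why your regexes did not behave like SpaCy's, here are links to the relevant source code:

Prefixes and suffixes are defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

They refer to the characters (e.g. quotes, hyphens, etc.) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g. compile_prefix_regex) are here: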

https://github.com/explosion/spaCy/blob/master/spacy/util.py
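
A quick check with plain `re` (no spaCy needed) suggests one reason the hand-written prefix and suffix patterns from the question behave differently: neither character class lists the curly quotes “ and ” that surround “medicine”. The `wide_*` patterns below are hypothetical, added only for contrast; they are not spaCy's default rules, which are built from the files linked above.

```python
import re

# Hand-written patterns copied from the question.
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')

# Hypothetical widened patterns that also list the curly quotes.
# Illustration only; these are NOT spaCy's default prefix/suffix rules.
wide_prefix_re = re.compile(r'''^[\[\("'“‘]''')
wide_suffix_re = re.compile(r'''[\]\)"'”’]$''')

word = '“medicine”'

question_prefix = prefix_re.search(word)   # no match: “ is not in the class
question_suffix = suffix_re.search(word)   # no match: ” is not in the class
wide_prefix = wide_prefix_re.search(word)  # matches the leading “
wide_suffix = wide_suffix_re.search(word)  # matches the trailing ”

print(question_prefix, question_suffix)
print(wide_prefix.group(), wide_suffix.group())
```

Maintaining such classes by hand gets tedious, which is presumably why reusing `nlp.Defaults.prefixes` and `nlp.Defaults.suffixes` is the simpler route.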

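【Comments】:

Nicholas, thank you so much! :) It works as expected now. As rightly pointed out, the issue was with the default prefix_re and suffix_re. Thanks also for sharing the links to the punctuation and character definitions (e.g. quotes, hyphens, etc.) and to the functions that compile them! They are very handy and help to cover all the corner cases, especially across other languages!
Your recommended regex splits "This can't be it." as follows; ['This', 'can', "'", 't', 'be', 'it', '.'] which is not what one (or at least I) would expect.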
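
The split reported in the comment above can be traced, as a rough sketch, by running the infix pattern on its own with plain `re`: the character class contains the apostrophe but no hyphen, so "can't" gets an infix match while "male-dominated" does not. The helper name `infix_hits` is made up for this illustration, and the check only shows where the regex matches, not everything spaCy's tokenizer does with those matches.

```python
import re

# Infix pattern copied from the question and the answer.
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')


def infix_hits(text):
    """Return (position, matched character) for every infix match."""
    return [(m.start(), m.group()) for m in infix_re.finditer(text)]


hyphen_hits = infix_hits("male-dominated")  # hyphen is not in the class
apostrophe_hits = infix_hits("can't")       # apostrophe is in the class

print(hyphen_hits)
print(apostrophe_hits)
```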
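Your recommended regex solves all the problems presented, but, as I mentioned above, it creates further ones.
Personally, I have tried many ways to make sure that intra-hyphen words do not get split apart, but I always end up creating other problems with sentence or token splitting.
For example; infixes = tuple([r"(n[o']t|'\w{1,2})\b", r"(?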