Prevent spaCy tokenizer from splitting on a specific character

【Question title】: Prevent Spacy tokenizer from splitting on specific character
【Posted】: 2021-03-15 10:20:00
【Question】:

When tokenizing a sentence with spaCy, I want it not to split into tokens on /.

Example:

import en_core_web_lg
nlp = en_core_web_lg.load()
for i in nlp("Get 10ct/liter off when using our App"):
    print(i)

Output:

Get
10ct
/
liter
off
when
using
our
App

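I want the result to be Get, 10ct/liter, off, when, ...

I was able to find out how to add more ways for spaCy to split text into tokens, but not how to disable a specific splitting rule.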

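【Question comments】:

Do you just want to split the text into tokens on whitespace?
@WiktorStribiżew No, I just want it not to split on /; everything else is fine. Splitting only on whitespace would produce lower-quality tokens.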

Tags: python nlp tokenize spacy

【Solution 1】:

I suggest using a custom tokenizer; see "Modifying existing rule sets" in the spaCy documentation:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_trf")
text = "Get 10ct/liter off when using our App"
# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp(text)
print([t.text for t in doc])
## =>  ['Get', '10ct/liter', 'off', 'when', 'using', 'our', 'App']

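Note the commented-out line #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),: I simply removed the / character from the [:<>=/] character class. The original rule splits on / whenever it sits between a letter or digit on the left and a letter on the right.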
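To see why dropping / from the character class is enough, the two versions of the rule can be compared with the standard re module alone. This is only a sketch: ALPHA here is a simplified ASCII stand-in (an assumption) for spaCy's much larger ALPHA class, and infix_spans is a hypothetical helper, not part of spaCy.

```python
import re

# Assumption: ASCII letters only; spaCy's real ALPHA class covers many more scripts.
ALPHA = "a-zA-Z"

# The default-style rule (with "/") and the modified rule (without "/").
original_rule = re.compile(r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA))
modified_rule = re.compile(r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA))


def infix_spans(pattern, text):
    """Hypothetical helper: return (offset, matched character) for each infix match."""
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


word = "10ct/liter"
original_hits = infix_spans(original_rule, word)
modified_hits = infix_spans(modified_rule, word)

print(original_hits)  # [(4, '/')]  -> the tokenizer would split here
print(modified_hits)  # []          -> no infix found, the token stays whole

# The other characters in the class are unaffected by the change.
other_hits = infix_spans(modified_rule, "a=b")
print(other_hits)  # [(1, '=')]
```

The lookbehind and lookahead only test the neighbouring characters, so the match is the single infix character itself, which is what the tokenizer splits around.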

If you still need '12/ct' to be split into three tokens, add another line below the r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA) line:

r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA),
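The effect of this extra rule can be checked in isolation with plain re. The sketch below again assumes a simplified ASCII ALPHA, and split_on_infix is a hypothetical helper that only mimics infix splitting; the real spaCy tokenizer also applies prefix, suffix, and exception rules.

```python
import re

# Assumption: ASCII letters only, standing in for spaCy's ALPHA class.
ALPHA = "a-zA-Z"

# Split on "/" only when a digit is directly to the left and a letter to the right.
digit_slash_rule = re.compile(r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA))


def split_on_infix(pattern, text):
    """Hypothetical helper: cut text around every infix match, keeping the infix."""
    pieces = []
    last = 0
    for m in pattern.finditer(text):
        if m.start() > last:
            pieces.append(text[last:m.start()])
        pieces.append(m.group())
        last = m.end()
    if last < len(text):
        pieces.append(text[last:])
    return pieces


digit_case = split_on_infix(digit_slash_rule, "12/ct")
letter_case = split_on_infix(digit_slash_rule, "10ct/liter")

print(digit_case)   # ['12', '/', 'ct']
print(letter_case)  # ['10ct/liter']
```

Because the lookbehind accepts digits only, '10ct/liter' is left alone (the character before / is a letter) while '12/ct' is still split.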

【Comments】:

Note that I am using spaCy 3.0.1, hence the en_core_web_trf model. You can still use your en_core_web_lg model.
Works great, thanks for the explanation, Wiktor!