我建议使用自定义标记器,请参阅Modifying existing rule sets:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
nlp = spacy.load("en_core_web_trf")
text = "Get 10ct/liter off when using our App"
# Modify tokenizer infix patterns
infixes = (
LIST_ELLIPSES
+ LIST_ICONS
+ [
r"(?<=[0-9])[+\-\*^](?=[0-9-])",
r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
#r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp(text)
print([t.text for t in doc])
## => ['Get', '10ct/liter', 'off', 'when', 'using', 'our', 'App']
注意注释#r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), 行,我只是从[:<>=/] 字符类中取出/ 字符。此规则在 / 处拆分,位于字母/数字和字母之间。
如果您仍需要将'12/ct' 拆分为三个标记,则需要在r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA) 行下方添加另一行:
r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA),