How to tokenize compound words? Answers

【Question title】: How to tokenize compound words?
【Posted】: 2021-01-12 01:04:27
【Question】:

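Given original list elements such as ["southnorth"], I want to add a space based on the list ["south", "north", "island"]. As long as the list we tokenize against contains both 'south' and 'north', the list should change from ['southnorth'] to ['south', 'north'].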

However, if the list is ["south", "island"], then ["southnorth"] should stay as it is.

My idea is as follows:

import re

list1 = ['southnorth']
# list2 = ['south', 'north', 'island']
list2 = ['south', 'island']

str1 = " ".join(list1)
str2 = " ".join(list2)

# Get the alternators to apply regex:
list_compound = sorted(list1 + list2, key=len)
alternators = '|'.join(map(re.escape, list_compound))
# Capture group over the alternation, as used by the substitutions below.
regex = re.compile(r'({})'.format(alternators), re.IGNORECASE)

# Insert a space after every match.
str1_split = regex.sub(r'\1 ', str1)

str2_split = regex.sub(r'\1 ', str2)

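However, the approach above fails, because I need to respect the order of the sequence. For example, to split ["southnorth"], I need to make sure the other list contains both "south" and "north". Otherwise, the element should stay as it is.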
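
To make the failure concrete, here is a simplified reproduction. It assumes the pattern is a capture group over the alternation of the list entries, as in the re.sub calls above; the variable names are illustrative only.

```python
import re

compound = "southnorth"
tokens = ["south", "island"]  # "north" is deliberately missing

# Capture group over the alternation: (south|island)
pattern = re.compile("({})".format("|".join(map(re.escape, tokens))), re.IGNORECASE)

# A space is inserted after every match, whether or not the rest is a token.
result = pattern.sub(r"\1 ", compound).strip()
print(result)  # south north
```

The compound is split even though "north" is not in the list, which is exactly the case that should have been left untouched.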

【Comments】:

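Can the combined string have more than two parts?
What if your string is southwestnorth? Do you want the output to be southwest north or south westnorth?
I would keep southwestnorth in its original form, because it should only be tokenized when south and north are directly adjacent.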

Tags: python python-3.x regex tokenize

【Solution 1】:

Not the prettiest solution, and probably not the best one, but here is a trivial brute-force attempt:

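def tokenize(word, tokens):
    """Return word with spaces between tokens, or unchanged if any piece is not a token."""
    tokenized_word = word
    # Put a space after every occurrence of each known token.
    for t in tokens:
        tokenized_word = tokenized_word.replace(t, f"{t} ").strip()

    # If any resulting piece is not a known token, give up and keep the original.
    for w in tokenized_word.split(" "):
        if w.strip() not in tokens:
            return word

    return tokenized_word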


tokens = ["south", "north", "island"]

assert tokenize("south", tokens) == "south"
assert tokenize("southnorth", tokens) == "south north"
assert tokenize("islandsouthnorth", tokens) == "island south north"
assert tokenize("southwestnorth", tokens) == "southwestnorth"
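
As an alternative sketch, not part of the answer above, the same rule ("split only when the whole word is a run of known tokens") can be written as a small backtracking segmentation instead of chained str.replace calls. The function name split_compound and the choice to return a list are assumptions made for illustration; matching is case-insensitive here, mirroring the re.IGNORECASE flag in the question.

```python
from functools import lru_cache


def split_compound(word, tokens):
    """Return the list of tokens that make up `word`, or [word] if it cannot be split."""
    token_set = {t.lower() for t in tokens if t}

    @lru_cache(maxsize=None)
    def segment(start):
        # Returns a tuple of pieces covering word[start:], or None if impossible.
        if start == len(word):
            return ()
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece.lower() in token_set:
                rest = segment(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    parts = segment(0)
    # Fall back to the original word when no full segmentation exists.
    return list(parts) if parts else [word]


print(split_compound("southnorth", ["south", "north", "island"]))
print(split_compound("southnorth", ["south", "island"]))
```

Because the search backtracks, it also copes with tokens that are prefixes of other tokens: with ["south", "southwest", "north"], the word "southwestnorth" comes back as ["southwest", "north"], whereas the replace-based version leaves it unsplit.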

【Comments】: