How to search for a pattern of strings through a list of strings and return the respective indexes? Answers

【Question title】: How to search for a pattern of strings through a list of strings and return the respective indexes?
【Posted】: 2020-04-03 18:58:25
【Question】:

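I have thought of a few ways to approach this problem, but none of them seems adequate to me. Let me explain:

Suppose we have the following list of strings (a sequence of PoS tags produced by part-of-speech tagging):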

['PROPN', 'AUX', 'ADV', 'VERB', 'SCONJ', 'PROPN', 'AUX', 'NOUN', 'CCONJ', 'PROPN', 'AUX', 'NOUN', 'PUNCT']

My goal is to find the following pattern in the list:

PROPN - AUX - (anything in between) - PUNCT

and to return these two possible results:

[0,1,2,3,4,5,6,7,8,9,10,11,12] and [9,10,11,12]

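I know that one possible approach is to concatenate all the strings in the list and use a regular expression in Python, but that approach has a problem:

The matched indexes refer only to character positions in the concatenated string, and converting them afterwards into word positions in the original list is (in my opinion) not adequate. It is important to preserve the integrity of the tokenization done in the initial list.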
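The mismatch described above can be seen in a minimal sketch. It assumes the tags are joined with single spaces (the question does not say which separator would be used), and the variable names are only illustrative:

```python
import re

tags = ['PROPN', 'AUX', 'ADV', 'VERB', 'SCONJ', 'PROPN', 'AUX',
        'NOUN', 'CCONJ', 'PROPN', 'AUX', 'NOUN', 'PUNCT']

# Assumption: tags are concatenated with a single space as separator.
joined = ' '.join(tags)

# Plain regex search over the concatenated string.
match = re.search(r'PROPN AUX .*PUNCT', joined)

# The span is expressed in characters, not in list positions.
char_span = match.span()
print(char_span)   # (0, 66)
print(len(tags))   # 13
```

The match covers characters 0 to 66, while the list itself only has positions 0 to 12, so the regex result cannot be used directly as a list of token indexes.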

I would appreciate it if anyone could suggest a way to solve this problem.

Thanks in advance.

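【Comments】:

Please take the intro tour again, especially How to Ask. Pattern searching is a good topic in computing, but we expect to see a specific coding problem. Asking for help designing your particular solution is outside the scope of Stack Overflow.
Please do not put language tags in the title.
Look into regular expression theory. This problem is the same as finding the pattern ab.*c in a string.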
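One way to act on the ab.*c remark while keeping token positions intact is to encode every distinct tag as a single character, so that a character offset in the encoded string equals an index in the original list. The sketch below is only an illustration under that assumption: the helper names `encode_tags` and `regex_token_spans` are hypothetical, and the use of Unicode Private Use Area characters as the alphabet is an arbitrary choice.

```python
import re

def encode_tags(tags):
    """Map each distinct tag to one character, so that the character
    offset in the encoded string equals the token index."""
    alphabet = {}
    encoded = []
    for tag in tags:
        if tag not in alphabet:
            # Assumption: Private Use Area code points serve as the alphabet.
            alphabet[tag] = chr(0xE000 + len(alphabet))
        encoded.append(alphabet[tag])
    return ''.join(encoded), alphabet

def regex_token_spans(tags, head, tail):
    """Return token-index lists for head .* tail, one per head position."""
    text, alphabet = encode_tags(tags)
    # A tag that never occurs in `tags` cannot match anything.
    if any(t not in alphabet for t in head + tail):
        return []
    head_re = ''.join(re.escape(alphabet[t]) for t in head)
    tail_re = ''.join(re.escape(alphabet[t]) for t in tail)
    # The lookahead lets matches overlap; `.*` is greedy, so each head
    # position yields only its longest match.
    pattern = re.compile('(?=(' + head_re + '.*' + tail_re + '))', re.DOTALL)
    return [list(range(m.start(1), m.end(1))) for m in pattern.finditer(text)]

tags = ['PROPN', 'AUX', 'ADV', 'VERB', 'SCONJ', 'PROPN', 'AUX',
        'NOUN', 'CCONJ', 'PROPN', 'AUX', 'NOUN', 'PUNCT']
spans = regex_token_spans(tags, ['PROPN', 'AUX'], ['PUNCT'])
print([(s[0], s[-1]) for s in spans])  # [(0, 12), (5, 12), (9, 12)]
```

Because `.*` is greedy, a list containing several occurrences of the tail would produce only the longest span for each head position, not every head and tail combination.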

Tags: python regex algorithm nlp

【Solution 1】:

Here is a possible solution in Python:

def match_indexes(probelist, head=['PROPN', 'AUX'], tail=['PUNCT']):
    """Return every run of indexes in probelist that starts with the
    tags in head and ends with the tags in tail.
    """
    result = []
    step = len(head)
    # Last position at which tail can still start.
    last = len(probelist) - len(tail)
    # Only search if head and tail fit into the list together.
    if step + len(tail) <= len(probelist):
        for i in range(last):
            if probelist[i:i + step] == head:
                # Try every possible tail position after the matched head.
                for j in range(i + step, last + 1):
                    if probelist[j:j + len(tail)] == tail:
                        result.append(list(range(i, j + len(tail))))
    return result

Note that all arguments are passed as lists. You can use this as a starting point.
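The list requirement matters because the function compares list slices with `==`, and in Python a list never compares equal to a tuple or a string. A small sketch of this, with a hypothetical helper `as_tag_list` (not part of the answer) that normalizes arguments before they are passed on:

```python
tags = ['PROPN', 'AUX', 'PUNCT']

# A slice of a list is a list, so only a list can compare equal to it.
print(tags[0:2] == ['PROPN', 'AUX'])   # True
print(tags[0:2] == ('PROPN', 'AUX'))   # False

def as_tag_list(value):
    """Hypothetical helper: turn a single tag or any iterable of tags
    into a list of tags."""
    if isinstance(value, str):
        # A bare string is treated as one tag, not as a sequence of letters.
        return [value]
    return list(value)

print(as_tag_list('NOUN'))             # ['NOUN']
print(as_tag_list(('PROPN', 'AUX')))   # ['PROPN', 'AUX']
```

Calling `match_indexes(test, head=as_tag_list('NOUN'))` would then behave the same as passing `head=['NOUN']`.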

Sample output:

test = ['PROPN', 'AUX', 'ADV', 'VERB', 'SCONJ', 'PROPN', 'AUX', \
          'NOUN', 'CCONJ', 'PROPN', 'AUX', 'NOUN', 'PUNCT']

print(match_indexes(test))
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], [5, 6, 7, 8, 9, 10, 11, 12], [9, 10, 11, 12]]

print(match_indexes(test, head=['NOUN']))
[[7, 8, 9, 10, 11, 12], [11, 12]]

print(match_indexes(test, head=['NOUN'], tail=["PROPN"]))
[[7, 8, 9]]

print(match_indexes(test, head=['PROPN', 'NOUN'], tail=["PROPN"])) 
[]

test = []
print(match_indexes(test))
[]
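The default call returns three spans, while the question lists only [0..12] and [9..12] as expected results. One possible reading, and it is only an assumption, is that the longest and the shortest span are wanted. Under that assumption, the result list could be post-processed with a hypothetical helper such as:

```python
def longest_and_shortest(spans):
    """Hypothetical post-processing step: keep only the longest and the
    shortest span from a list of index lists."""
    if not spans:
        return []
    longest = max(spans, key=len)
    shortest = min(spans, key=len)
    # Avoid returning the same span twice when only one length exists.
    if longest == shortest:
        return [longest]
    return [longest, shortest]

# The three spans returned by the default call in the sample output.
spans = [list(range(0, 13)), list(range(5, 13)), list(range(9, 13))]
picked = longest_and_shortest(spans)
print([(s[0], s[-1]) for s in picked])  # [(0, 12), (9, 12)]
```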

【Comments】:

Thank you very much, @silver. This solution does exactly what I need.