【Question title】: How to find a multi-word string within a string and label it in Python?
【Posted】: 2019-04-27 17:15:17
【Question】:

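For example, given the sentence "The corporate balance sheets data are available on an annual basis", I need to label "corporate balance sheets", which is a substring found in that sentence.

So, the pattern I need to find is:

"corporate balance sheets"

The given string is:

"The corporate balance sheets data are available on an annual basis".

The output label sequence I want is:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

There are a lot of sentences (more than 2 GB) and a lot of patterns I need to find. I have no idea how to do this efficiently in Python. Can anyone suggest a good algorithm?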

【Question comments】:

    Tags: python nlp string-matching preprocessor labeling


    【Solution 1】:

    Using list comprehensions and split:

    import re

    search_word = 'corporate balance sheets'
    # Escape the phrase and anchor it on word boundaries so only the whole phrase matches
    p = re.compile(r'\b' + re.escape(search_word) + r'\b')
    sentence = "The corporate balance sheets data are available on an annual basis"

    # One label per word of the search phrase
    lst = [1] * len(search_word.split())
    # Replace each match with a single placeholder token, then expand it back into 1s
    vect = [lst if item == '__match_word' else [0] for item in p.sub('__match_word', sentence).split()]
    flattened = [val for sublist in vect for val in sublist]
    print(flattened)
    

    Output:

     [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    

    Sentence = "The corporate balance sheets data are available on an annual basis sheets"

    Output:

    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    

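    【Comments】:

    • Doesn't this fail when words from search_word appear in the sentence but not as an exact match?
    • For instance, when "sheets" appears on its own at the end of the sentence rather than together with "corporate balance".
    • If the sentence is "The corporate balance data are available on an annual basis sheets", the output is [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
    • @rokrokss, can you give an example that doesn't work?
    • As in the example in the previous comment, the final "sheets" should not be labeled. The algorithm should only label a complete match of the pattern, from its first word to its last.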
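One way to address the concern raised in these comments is to keep the regex but map each match's character span back to word positions, instead of substituting a placeholder. The following is only a minimal sketch, assuming words are separated by whitespace; `label_phrase` is a hypothetical helper name that does not appear in the answer above.

```python
import re

def label_phrase(phrase, sentence):
    """Return one 0/1 label per whitespace-separated word of `sentence`."""
    # Character span of every word in the sentence
    spans = [(m.start(), m.end()) for m in re.finditer(r'\S+', sentence)]
    labels = [0] * len(spans)
    # Allow any run of whitespace between the words of the phrase, and
    # require that the match is not glued to other non-space characters
    body = r'\s+'.join(re.escape(w) for w in phrase.split())
    pattern = re.compile(r'(?<!\S)' + body + r'(?!\S)')
    for m in pattern.finditer(sentence):
        # Label every word whose span lies inside the matched span
        for k, (start, end) in enumerate(spans):
            if start >= m.start() and end <= m.end():
                labels[k] = 1
    return labels

print(label_phrase("corporate balance sheets",
                   "The corporate balance sheets data are available on an annual basis"))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(label_phrase("corporate balance sheets",
                   "The corporate balance data are available on an annual basis sheets"))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Because the label list is built from the original word spans, its length always equals the number of words in the sentence, and a stray "sheets" at the end is left unlabeled.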
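    【Solution 2】:

    Since all the words of the substring must match, you can compare each window of the sentence against the pattern's words and update the corresponding indices as you iterate over the sentence: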

    def encode(sub, sent):
        subwords, sentwords = sub.split(), sent.split()
        n = len(subwords)
        res = [0] * len(sentwords)
        # Slide a window of n words over the sentence (also works for one-word patterns)
        for i in range(len(sentwords) - n + 1):
            # Label the window only when every word matches, in order
            if sentwords[i:i + n] == subwords:
                res[i:i + n] = [1] * n
        return res
    
    
    sub = "corporate balance sheets"
    sent = "The corporate balance sheets data are available on an annual basis"
    print(encode(sub, sent))
    # [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    
    sent = "The corporate balance data are available on an annual basis sheets"
    print(encode(sub, sent))
    # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
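The question also mentions many patterns and more than 2 GB of sentences. One possible way to extend the window approach, sketched here under the assumption of whitespace tokenization, is to group the patterns by their first word so that each sentence position only checks the patterns that could start there, and to process the input lazily one line at a time. The names `build_index`, `encode_many`, and `encode_lines` are hypothetical and are not part of the answer above.

```python
from collections import defaultdict

def build_index(patterns):
    """Group tokenized patterns by their first word."""
    index = defaultdict(list)
    for pattern in patterns:
        words = pattern.split()
        if words:
            index[words[0]].append(words)
    return index

def encode_many(index, sent):
    """Label every word of `sent` that belongs to a full match of any pattern."""
    sentwords = sent.split()
    res = [0] * len(sentwords)
    for i, word in enumerate(sentwords):
        # Only patterns that start with this word can match at position i
        for words in index.get(word, ()):
            n = len(words)
            if sentwords[i:i + n] == words:
                res[i:i + n] = [1] * n
    return res

def encode_lines(index, lines):
    """Lazily yield one label list per line; `lines` may be an open file."""
    for line in lines:
        yield encode_many(index, line)

index = build_index(["corporate balance sheets", "annual basis"])
print(encode_many(index, "The corporate balance sheets data are available on an annual basis"))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
```

Passing an open file object to `encode_lines` keeps only one line in memory at a time. If many patterns share the same first word, a word-level trie or an Aho-Corasick automaton is a common alternative to this simple index.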
    

    【Comments】:
