【Question title】: Terminology matching in text
【Posted】: 2017-09-03 00:43:44
【Question description】:

I have a list of terms as follows:

a   
abc
a abc
a a abc
abc

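I want to match these terms in a text and replace each one with a name such as "term1", "term2". When several terms could match at the same place, the longest match should be treated as the correct one.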

Text: I have a and abc maybe abc again and also a a abc.
Output: I have term1 and term2 maybe term2 again and also a term3.

So far I have used the code below, but it does not find the longest match:

for x in terms:
    if x in text:
        pass  # do something with the matched term
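As a small illustration of why a plain `in` test is not enough here, the sketch below compares a substring check with a word-boundary regex on the sample text. The variable names `substring_hit` and `standalone_hits` are made up for this example.

```python
import re

text = "I have a and abc maybe abc again and also a a abc."

# A plain substring test also succeeds inside other words such as "have".
substring_hit = "a" in "have"

# Word boundaries (\b) restrict the match to standalone occurrences of "a".
standalone_hits = re.findall(r"\ba\b", text)

print(substring_hit)         # True
print(len(standalone_hits))  # 3
```

The substring test says nothing about where a term starts or ends, so it cannot tell the standalone "a" from the "a" inside "have", and it cannot prefer "a a abc" over "a".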

【Question comments】:

    Tags: python-2.7 pattern-matching


    【Solution 1】:

    You can use re.sub, replacing the longest terms first:

    import re
    
    words = ["a", 
    "abc",
    "a abc",
    "a a abc"
    ]
    
    test_str = "I have a and abc maybe abc again and also a a abc."
    
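    # Substitute the longest terms first so a shorter term cannot claim part of a longer one.
    for word in sorted(words, key=len, reverse=True):
        term = "term%i" % (words.index(word) + 1)
        # re.escape guards against terms that contain regex metacharacters.
        test_str = re.sub(r"\b%s\b" % re.escape(word), term, test_str)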
    
    print(test_str)
    

    This gives the result you presumably expected (the example in your question contains a mistake):

    Input: I have a and abc maybe abc again and also a a abc.
    Output: I have term1 and term2 maybe term2 again and also term4.
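To make the role of the length-based ordering concrete, here is a minimal sketch that wraps the same loop in a function and runs it both ways. The function name `replace_terms` and its `longest_first` flag are assumptions introduced for illustration; they are not part of the original answer.

```python
import re


def replace_terms(text, words, longest_first=True):
    """Replace each term with "termN", where N is its 1-based position in words."""
    # reverse=True puts the longest terms first; reverse=False puts the shortest first.
    for word in sorted(words, key=len, reverse=longest_first):
        name = "term%d" % (words.index(word) + 1)
        text = re.sub(r"\b%s\b" % re.escape(word), name, text)
    return text


words = ["a", "abc", "a abc", "a a abc"]
text = "I have a and abc maybe abc again and also a a abc."

print(replace_terms(text, words))
# I have term1 and term2 maybe term2 again and also term4.
print(replace_terms(text, words, longest_first=False))
# I have term1 and term2 maybe term2 again and also term1 term1 term2.
```

With the shortest terms first, every standalone "a" is consumed before "a a abc" gets a chance to match, which is why the order matters.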
    

    【Comments】:

      【Solution 2】:

      Or use re.sub with a replacement function:

      import re
      
      text = 'I have a and abc maybe abc again and also a a abc'
      words = ['a', 'abc', 'a abc', 'a a abc']
      # Longest alternatives first: the regex engine takes the first alternative that matches.
      regex = re.compile(r'\b(?:%s)\b' % '|'.join(re.escape(w) for w in sorted(words, key=len, reverse=True)))
      
      
      def replacer(m):
          print('replacing : %s' % m.group(0))
          return 'term%d' % (words.index(m.group(0)) + 1)
      
      print(regex.sub(replacer, text))
      

      Result:

      replacing : a
      replacing : abc
      replacing : abc
      replacing : a a abc
      I have term1 and term2 maybe term2 again and also term4
      

      Or use an anonymous replacer (a lambda):

      print(regex.sub(lambda m: 'term%d' % (words.index(m.group(0)) + 1), text))
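A possible variation on the single-pass approach, sketched under the assumption that the term list may be long or contain duplicates (as the list in the question does): look up term numbers in a dictionary instead of calling `words.index()` for every match. The names `replace_terms_single_pass` and `term_ids` are hypothetical.

```python
import re


def replace_terms_single_pass(text, words):
    # Map each term to its 1-based position; the first occurrence wins,
    # which mirrors what words.index() returns for duplicated terms.
    term_ids = {}
    for position, word in enumerate(words, 1):
        term_ids.setdefault(word, position)

    # Longest terms first, because the first matching alternative is used.
    ordered = sorted(term_ids, key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(w) for w in ordered))

    # Every match is exactly one of the terms, so the dictionary lookup cannot fail.
    return pattern.sub(lambda m: "term%d" % term_ids[m.group(0)], text)


words = ["a", "abc", "a abc", "a a abc", "abc"]
text = "I have a and abc maybe abc again and also a a abc"
result = replace_terms_single_pass(text, words)
print(result)
# I have term1 and term2 maybe term2 again and also term4
```

Because the text is scanned only once, already inserted names such as "term4" are never examined again, which the loop-based approach cannot guarantee if a term happens to look like a replacement name.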
      

      【Comments】:
