String matching and replacement with case sensitivity in Python: answers

【Question title】: string matching replace with case sensitive in python
【Posted】: 2011-07-29 15:04:07
【Question description】:

I am new to Python and am trying out some new things. I have two lists in a dictionary. For example:

List1:                              List2:
Anterior                            cord
cuneate nucleus                     Medulla oblongata
nucleus                             Spinal cord
Intermediolateral nucleus           Spinal 
                                    sksdsj
british                             7

I have some lines of text like these:

<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal.</s>
<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>
<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>

I need to return the lines that contain strings from both list1 and list2. So I tried the following code:

result = ""
if list1 in line and list2 in line:
    i1 = re.sub('(?i)(\s+)(%s)(\s+)'%list1, '\\1<e1>\\2</e1>\\3', line)
    i2 = re.sub('(?i)(\s+)(%s)(\s+)'%list2, '\\1<e2>\\2</e2>\\3', i1)
    result = result + i2 + "\n"
    continue

But I get the following result:

<s id="5239778-2">The name refers collectively to the <e1>cuneate nucleus</e1> and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate <e1>nucleus</e1> is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal.</s>
<s id="1053949-16">The <e1>Anterior</e1> <e2>cord</e2> syndrome results from injury to the <e1>anterior</e1> part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>
<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>

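Here, only result line 4 has the matched strings I want from both lists. However, I do not want the lines that match only one string or none at all (for example, result lines 1 and 3). Also, when strings from both lists match, they should be tagged (see result line 2, for example).

Any kind of help would be appreciated.

【Discussion】:

Tags: python string-matching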

【Solution 1】:

Basically, you want to put some words in <e1> tags and other words in <e2> tags. Is that right?

If so, something like this will do:

#!/usr/bin/python

from __future__ import print_function
import re

text = '''\
<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal cord.</s>
<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>'''

list1 = ('Anterior', 'cuneate nucleus', 'Intermediolateral nucleus')
list2 = ('cord', 'Medulla oblongata', 'Spinal cord')

# wrap phrases in \b so that they match whole words only;
# re.escape guards against regex metacharacters inside a phrase
re1 = re.compile("(%s)" % "|".join(r"\b%s\b" % re.escape(i) for i in list1), re.IGNORECASE)
re2 = re.compile("(%s)" % "|".join(r"\b%s\b" % re.escape(i) for i in list2), re.IGNORECASE)

for line in text.split("\n"):
    line = re1.sub(r"<e1>\1</e1>", line)
    line = re2.sub(r"<e2>\1</e2>", line)
    print(line)

Output:

<s id="5239778-2">The name refers collectively to the <e1>cuneate nucleus</e1> and gracile nucleus, which are present at the junction between the <e2>spinal cord</e2> and the <e2>medulla oblongata</e2>.</s>
<s id="3691284-1">In the <e2>medulla oblongata</e2>, the arcuate nucleus is a group of neurons located on the <e1>anterior</e1> surface of the medullary pyramids.</s>
<s id="21120-99"><e1>Anterior</e1> horn cells, motoneurons located in the <e2>spinal cord</e2>.</s>
<s id="1053949-16">The <e1>Anterior</e1> <e2>cord</e2> syndrome results from injury to the <e1>anterior</e1> part of the <e2>spinal cord</e2>, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the <e2>spinal cord</e2>.</s>
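The loop above tags every line, whereas the question asks to keep only the lines that contain a phrase from both lists. One possible way to combine the two is sketched below. The helper names `build_pattern` and `tag_if_both` are hypothetical and not part of the answer, and sorting the phrases longest-first is an added assumption so that a short entry such as "Spinal" cannot shadow "Spinal cord".

```python
import re


def build_pattern(phrases):
    """Compile a case-insensitive, whole-word alternation of the phrases."""
    # Longest phrases first: "Spinal cord" must be tried before "Spinal".
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(%s)\b" % alternation, re.IGNORECASE)


def tag_if_both(line, re1, re2):
    """Return the tagged line if both patterns match, otherwise None."""
    if not (re1.search(line) and re2.search(line)):
        return None
    line = re1.sub(r"<e1>\1</e1>", line)
    return re2.sub(r"<e2>\1</e2>", line)


if __name__ == "__main__":
    re1 = build_pattern(("Anterior", "cuneate nucleus", "Intermediolateral nucleus"))
    re2 = build_pattern(("cord", "Medulla oblongata", "Spinal cord", "Spinal"))
    samples = [
        "Anterior horn cells, motoneurons located in the spinal cord.",
        "The arcuate nucleus is a group of neurons.",
    ]
    for sample in samples:
        tagged = tag_if_both(sample, re1, re2)
        if tagged is not None:  # skip lines that do not match both lists
            print(tagged)
```

Lines for which `tag_if_both` returns None are simply left out of the result. Note that this sketch does not guard against a list2 phrase occurring inside text already wrapped by a list1 match.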

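【Discussion】:

I want to put exactly the strings from list1 in <e1> tags, and the strings from list2 that match the line in <e2> tags.
Please consider that my lists also contain numeric strings that I want to match. So I always need to skip the <s id="697"> part.
You expanded the question after I gave you an answer. Accept this answer, then post a new question.
Updated the answer. I just added "\b" to match whole words.
What I mean is that for each line, I will get a single string from each of the two lists for each query!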
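The concern about numeric strings is that a pattern such as `\b7\b` also matches the 7 inside `id="69-7"`. A minimal sketch of one way around this, assuming every line has the `<s ...>text</s>` shape shown in the question: split off the wrapper and substitute only in the body. The names `S_LINE` and `tag_body` are hypothetical.

```python
import re

# Opening <s ...> tag, body text, closing </s> tag.
S_LINE = re.compile(r"^(<s\b[^>]*>)(.*)(</s>)$")


def tag_body(line, pattern, tag):
    """Wrap matches of `pattern` in <tag>...</tag>, leaving the <s ...> wrapper untouched."""
    m = S_LINE.match(line)
    if m is None:
        return line  # not in the assumed shape; leave the line unchanged
    opening, body, closing = m.groups()
    body = pattern.sub(lambda mo: "<%s>%s</%s>" % (tag, mo.group(0), tag), body)
    return opening + body + closing


if __name__ == "__main__":
    number = re.compile(r"\b7\b")
    line = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
    print(tag_body(line, number, "e2"))
```

With this split, the `id` attribute is never seen by the substitution, so only the 7 in "studio 7 album" is tagged.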

【Solution 2】:

How about this:

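# list1 and list2 are not defined in this snippet; the values from the first answer are assumed
list1 = ('Anterior', 'cuneate nucleus', 'Intermediolateral nucleus')
list2 = ('cord', 'Medulla oblongata', 'Spinal cord')

result = ""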
lines = ['<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>',
'<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>',
'<s id="21120-99">Anterior horn cells, motoneurons located in the spinal cord.</s>',
'<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>']

for line in lines:
    # Keep a line only if it contains at least one item from each list.
    # Note: `in` is a case-sensitive substring test.
    if any(item1 in line for item1 in list1) and any(item2 in line for item2 in list2):
        result = result + line + '\n'
print(result)
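The substring test above is case-sensitive, so an entry such as "Medulla oblongata" does not match "medulla oblongata" in the text. A small sketch of a case-insensitive variant, assuming plain substring matching is still acceptable (it will also match inside longer words); `contains_any` and `lines_matching_both` are hypothetical names.

```python
def contains_any(line, phrases):
    """True if any phrase occurs in line, ignoring case (plain substring test)."""
    lowered = line.lower()
    return any(p.lower() in lowered for p in phrases)


def lines_matching_both(lines, list1, list2):
    """Keep only the lines that contain a phrase from each list."""
    return [ln for ln in lines
            if contains_any(ln, list1) and contains_any(ln, list2)]


if __name__ == "__main__":
    sample = [
        "In the medulla oblongata, on the anterior surface.",
        "Anterior horn cells.",
    ]
    for ln in lines_matching_both(sample, ("Anterior",), ("Medulla oblongata",)):
        print(ln)
```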

【Discussion】: