Discover identically adjacent strings with regex and Python: answers

【Question title】: Discover identically adjacent strings with regex and python
【Posted】: 2015-11-19 08:39:58
【Question description】:

Consider this text:

...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,

genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast

beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow

(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)

beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...

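I want to parse this text with Python and keep only the strings that appear exactly twice and are adjacent. For example, an acceptable result would be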

bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne

because the pattern is that each string is immediately followed by an identical copy of itself, like this:

bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne

So, how can I search for adjacent, identical strings with a regex? I am testing my attempts here. Thanks!

【Question discussion】:

Tags: python regex regex-negation regex-lookarounds

【Solution 1】:

You can use the following regex:

(\b.+)\1

See demo
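As a minimal sketch of how this pattern behaves, the snippet below runs it over a few lines copied from the question. It assumes Python 3; the names sample, pattern, and terms are illustrative only and do not come from the original answer.

```python
import re

# A few lines taken from the question's sample text.
sample = (
    "bedeubedeu France The Provençal name for tripe\n"
    "bee balmbee balm Bergamot\n"
    "beechmastbeechmast Beech nut\n"
    "beefbeef The meat of the animal known as a cow\n"
)

# Group 1 captures a run of characters starting at a word boundary;
# \1 requires an identical copy immediately after it.
pattern = re.compile(r"(\b.+)\1")

# Keep only the captured half of each doubled string.
terms = [m.group(1) for m in pattern.finditer(sample)]
print(terms)
```

Because . does not match a newline by default, a match never spans two lines of the input.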

Or, to match and capture only the unique substring part:

(\b.+)(?=\1)

Another demo

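The word boundary \b anchors the match at a word boundary, so the captured text cannot start in the middle of a word. .+ then matches one or more characters other than a newline (in single-line, i.e. DOTALL, mode . also matches newlines), and the backreference \1 matches exactly the same character sequence that (\b.+) captured.

With the (?=\1) lookahead version, the matched text does not include the repeated part, because a lookahead does not consume text and the text it checks is not part of the match.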
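The difference between the two patterns can be seen by comparing the whole match, group(0), for each. This is a small sketch assuming Python 3; the variable names are illustrative.

```python
import re

line = "beech nutbeech nut A small nut from the beech tree,"

# Backreference version: the repeated copy is consumed by the match.
consuming = re.search(r"(\b.+)\1", line)

# Lookahead version: the repeated copy is only checked, not consumed.
lookahead = re.search(r"(\b.+)(?=\1)", line)

print(repr(consuming.group(0)))  # whole match, both copies
print(repr(lookahead.group(0)))  # whole match, first copy only
print(lookahead.end())           # index where the lookahead match stops
```

In both cases group(1) holds the same captured text; only the extent of the overall match differs.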

Update

See the Python demo (Python 2 syntax):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
p = re.compile(ur'(\b.+)\1')
test_str = u"zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
for i in p.finditer(test_str):
    print i.group(1).encode('utf-8')

Output:

zyme
abbrühen
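The demo above uses Python 2 syntax (the ur'' prefix and the print statement), which Python 3 rejects. A rough Python 3 equivalent, offered as a sketch rather than as part of the original answer, might look like this; the names pattern and found are illustrative.

```python
import re

# In Python 3 every str is Unicode, so a plain raw string is enough.
pattern = re.compile(r"(\b.+)\1")

test_str = (
    "zymezyme Yeast, the origin of the word enzyme, as the first enzymes "
    "were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\n"
    "abbrühenabbrühen"
)

# Collect the captured half of every doubled string.
found = [m.group(1) for m in pattern.finditer(test_str)]
for term in found:
    print(term)  # print handles Unicode text directly; no .encode() needed
```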

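【Discussion】:

Thank you very much for the correct answer. I wonder whether there is also a way to get just half of the string with the regex (since that would give the desired result), to save a second pass over the data for the final output. Thanks again, stribizhev.
Sorry, I think I should have posted this from the start: (\b.+)(?=\1). Is that right?
No need for so many thanks, an upvote is enough :) By the way, you should post what you have tried, because I can see you tried something.
I now see that when I search this data with the suggested regex: zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM I get [['zyme'], [], [], [' ', ' ']], i.e. it also picks up the commas. I am using this code: reg = re.compile(r"(\b.+)(?=\1)") for line in textfile: matches += [(reg.findall(line))] textfile.close(). Do you think this can be improved?
Why is 'abbrühenabbrühen' parsed as 'abbr\xc3\xbchen'? How can I avoid these special characters being parsed this way?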
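On the last comment: the escaped form is most likely the same word shown as raw UTF-8 bytes, which is how a byte string is displayed when it is printed inside a list. A short sketch, assuming Python 3 and illustrative variable names, shows that the two forms round-trip:

```python
word = "abbrühen"

# Encoding to UTF-8 turns ü (U+00FC) into the two bytes C3 BC.
encoded = word.encode("utf-8")
print(repr(encoded))

# Decoding the bytes recovers the original text.
decoded = encoded.decode("utf-8")
print(decoded)
```

Under this reading nothing was parsed incorrectly; decoding the bytes, or working with Unicode strings throughout, gives the readable form.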