String not matching correct string in alternators using findall: answers

【Question title】: String not matching correct string in alternators using findall
【Posted】: 2020-06-23 21:21:23
【Question】:

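I am using re.findall to tokenize strings that do not always have to be split after a single word (a token can be a compound term). I get the tokens the way I want, except that the dots included in the regex pattern are not preserved.

For example, consider the following code: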

import re
all_domain=['com edu','.com edu','inc.', '.com', 'inc', 'com', '.edu', 'edu']
all_domain.sort(key=len, reverse=True)
domain_alternators = '|'.join(all_domain)

print(domain_alternators)
regex = re.compile(r'\b({}|[a-z-A-Z]+)\b'.format(domain_alternators))
print(regex)
#re.compile('\\b(.com edu|com edu|inc.|.com|.edu|inc|com|edu|[a-z-A-Z]+)\\b')

name= 'BASIC SCHOOL DISTRICT .COM'
result=regex.findall(name.lower())

It should return ['basic', 'school', 'district', '.com'], because .com has higher priority in the alternation (.com comes before com in the list of alternatives):

.com edu|com edu|inc.|.com|.edu|inc|com|edu

How can I get ['basic', 'school', 'district', '.com'] instead of ['basic', 'school', 'district', 'com']?

Thanks

【Question comments】:

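When you have a string like .com preceded by a space, there is no \b before the ., because both the space and the dot are \W characters. From the docs: \b is defined as the boundary between a \w and a \W character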
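The comment above can be checked directly. The following is a minimal sketch; the sample string is an assumption chosen to mirror the tail of the question's input.

```python
import re

text = "district .com"

# \b needs a \w character on exactly one side of the position.
# Between the space and the dot both neighbours are \W, so no boundary exists
# there and r"\b\.com" cannot match.
no_match = re.search(r"\b\.com", text)

# Between "." and "c" there is a boundary, so the bare "com" alternative matches.
bare = re.search(r"\bcom\b", text)

print(no_match)      # None
print(bare.group())  # com
```

This is why the original pattern returns com: the only place \b can anchor is after the dot.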

Tags: python regex findall

【Solution 1】:

You should:

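Escape the alternatives so that . matches only a literal dot (i.e. use '|'.join(map(re.escape,all_domain)))
Use unambiguous word boundaries, (?<!\w) on the left and (?!\w) on the right, because the meaning of \b depends on context. See "Regular Expression Word Boundary and Special Characters" and "regex to match word boundary beginning with special characters"; there are many more questions like these.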

Use

import re
all_domain=['com edu','.com edu','inc.', '.com', 'inc', 'com', '.edu', 'edu']
all_domain.sort(key=len, reverse=True)
domain_alternators = '|'.join(map(re.escape,all_domain)) # <-- HERE
regex = re.compile(r'(?<!\w)({}|[a-z-A-Z]+)(?!\w)'.format(domain_alternators))  # <-- HERE

name= 'BASIC SCHOOL DISTRICT .COM'
result=regex.findall(name.lower())
print(result) # => ['basic', 'school', 'district', '.com']

See the Python demo
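If the same tokenizer is needed for several inputs, the fix could be wrapped in a helper along the following lines. This is only a sketch: build_tokenizer is a hypothetical name, not part of the answer, and the second input string is an assumed example that exercises the compound tokens.

```python
import re

def build_tokenizer(terms):
    """Hypothetical helper: compile a tokenizer that prefers the given terms."""
    # Longest first, so '.com edu' is tried before '.com' and 'com'.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(map(re.escape, ordered))
    # [a-z-A-Z]+ is the fallback kept from the original pattern
    # (ASCII letters plus a hyphen).
    return re.compile(r"(?<!\w)({}|[a-z-A-Z]+)(?!\w)".format(alternatives))

tokenizer = build_tokenizer(
    ["com edu", ".com edu", "inc.", ".com", "inc", "com", ".edu", "edu"]
)

first = tokenizer.findall("BASIC SCHOOL DISTRICT .COM".lower())
second = tokenizer.findall("ACME INC. .COM EDU".lower())

print(first)   # ['basic', 'school', 'district', '.com']
print(second)  # ['acme', 'inc.', '.com edu']
```

Sorting by length still matters after escaping: alternation is tried left to right, so the longer compound term has to appear before its shorter prefixes.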

【Comments】: