Match all names with exactly 5 digits at the end

【Question title】: Match all names with exactly 5 digits at the end
【Posted】: 2014-11-09 18:37:05
【Question】:

I have a text file like this:

john123:
1
2
coconut_rum.zip

bob234513253:
0
jackdaniels.zip
nowater.zip 
3

judy88009:
dontdrink.zip
9

tommi54321:
dontdrinkalso.zip
92

...

I have millions of entries like this.

I want to extract the names that end in a number of exactly 5 digits. I tried this:

matches = re.findall(r'\w*\d{5}:',filetext2)

But it gives me names that end in at least 5 digits:

['bob234513253:', 'judy88009:', 'tommi54321:']

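Q1: How do I find the names with exactly 5 digits?

Q2: I want to collect the zip files associated with each of these 5-digit names. How can I do that with a regular expression?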

【Question discussion】:

Tags: python regex file

【Solution 1】:

That is because \w also matches digit characters:

>>> import re
>>> re.match(r'\w*', '12345')
<_sre.SRE_Match object at 0x021241E0>
>>> re.match(r'\w*', '12345').group()
'12345'
>>>

You need to be more specific and tell Python that you only want letters before the digits:

matches = re.findall(r'[A-Za-z]*\d{5}:',filetext2)
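One caveat worth checking: re.findall tries every starting position, and [A-Za-z]* can match zero letters, so an unanchored pattern may still pick up the last five digits of a longer run. The sketch below uses the sample data from the question to compare the unanchored pattern with a variant anchored at the start of each line. The anchored variant (`^`, `re.MULTILINE`, and `+` instead of `*`) is an assumption on my part that names always start a line and contain at least one letter; it is not part of the original answer.

```python
import re

# Sample data taken from the question.
filetext2 = """john123:
1
2
coconut_rum.zip

bob234513253:
0
jackdaniels.zip
nowater.zip
3

judy88009:
dontdrink.zip
9

tommi54321:
dontdrinkalso.zip
92
"""

# Unanchored: the search may start in the middle of bob's digits.
unanchored = re.findall(r'[A-Za-z]*\d{5}:', filetext2)

# Anchored (assumption: every name starts a line and has at least one letter).
anchored = re.findall(r'^[A-Za-z]+\d{5}:', filetext2, re.MULTILINE)

print(unanchored)  # ['13253:', 'judy88009:', 'tommi54321:']
print(anchored)    # ['judy88009:', 'tommi54321:']
```

The re.match calls used further down are not affected, because re.match only tries the start of the string it is given.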

As for your second question, you can use something like this:

import re
# Dictionary to hold the results
results = {}
# Break-up the file text to get the names and their associated data.
# filetext2.split('\n\n') breaks it up into individual data blocks (one per person).
# Mapping to str.splitlines breaks each data block into single lines.
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    # See if the name matches our pattern.
    if re.match(r'[A-Za-z]*\d{5}:', name):
        # Add the name and the relevant data to the dictionary.
        # [:-1] gets rid of the colon on the end of the name.
        # The list comprehension gets only the file names from the data.
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]

Or, without all the comments:

import re
results = {}
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    if re.match(r'[A-Za-z]*\d{5}:', name):
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]

Here is a demonstration:

>>> import re
>>> filetext2 = '''\
... john123:
... 1
... 2
... coconut_rum.zip
...
... bob234513253:
... 0
... jackdaniels.zip
... nowater.zip
... 3
...
... judy88009:
... dontdrink.zip
... 9
...
... tommi54321:
... dontdrinkalso.zip
... 92
... '''
>>> results = {}
>>> for name, *data in map(str.splitlines, filetext2.split('\n\n')):
...     if re.match(r'[A-Za-z]*\d{5}:', name):
...         results[name[:-1]] = [x for x in data if x.endswith('.zip')]
...
>>> results
{'tommi54321': ['dontdrinkalso.zip'], 'judy88009': ['dontdrink.zip']}
>>>

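Keep in mind that reading the entire file contents at once is not very efficient. Instead, you should consider writing a generator function that yields one data block at a time. You can also improve performance by precompiling the regex pattern.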
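A minimal sketch of both suggestions, assuming blocks are separated by blank lines as in the question. The names `NAME_RE`, `iter_blocks`, and `collect_zips` are hypothetical, and `io.StringIO` stands in for a real file object here.

```python
import io
import re

# Compiled once and reused for every block (hypothetical name).
NAME_RE = re.compile(r'[A-Za-z]*\d{5}:')

def iter_blocks(lines):
    """Yield one block (a list of stripped, non-empty lines) at a time."""
    block = []
    for line in lines:
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            # A blank line closes the current block.
            yield block
            block = []
    if block:
        # Last block when the input does not end with a blank line.
        yield block

def collect_zips(lines):
    """Map each matching name (without the colon) to its .zip entries."""
    results = {}
    for name, *data in iter_blocks(lines):
        if NAME_RE.fullmatch(name):
            results[name[:-1]] = [x for x in data if x.endswith('.zip')]
    return results

# io.StringIO stands in for open('somefile') in this sketch.
sample = io.StringIO("john123:\n1\ncoconut_rum.zip\n\njudy88009:\ndontdrink.zip\n9\n")
print(collect_zips(sample))  # {'judy88009': ['dontdrink.zip']}
```

Because a file object is itself an iterator over lines, passing an open file to `collect_zips` should keep only one block in memory at a time.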

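【Discussion】:

You should probably wrap parentheses around everything from the start of the string to just before the colon, so that the colon is not included in the username string.
Thanks. How can I use this regex to build a list of the zip files under each of these usernames?
@new_coder - Sorry for the delay; something important came up. My edited post answers your second question.
Hi. One more thing. What if I don't want to hard-code the number 5? Something like this -----------------> if re.match('[A-Za-z]*\d{ num}:', name): #where num = 5. Can that be done?
@new_coder - You can use string formatting to insert whatever number you want: '[A-Za-z]*\d{{{num}}}:'.format(num=5) produces '[A-Za-z]*\d{5}:'. Note that you need the extra curly braces because {...} denotes a format field.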
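The formatting trick from the last comment can be sketched as a small helper. The function name `name_pattern` is hypothetical; only `str.format` and `re.compile` are used.

```python
import re

def name_pattern(num):
    """Build the pattern for names that end in exactly `num` digits."""
    # '{{' and '}}' are literal braces for str.format; '{num}' is the field.
    return re.compile(r'[A-Za-z]*\d{{{num}}}:'.format(num=num))

print(name_pattern(5).pattern)                    # [A-Za-z]*\d{5}:
print(bool(name_pattern(5).match('judy88009:')))  # True
print(bool(name_pattern(5).match('john123:')))    # False
print(bool(name_pattern(3).match('john123:')))    # True
```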

【Solution 2】:

import re

results = {}

with open('datazip') as f:
    records = f.read().split('\n\n')

for record in records:
    lines = record.split()
    if not lines:
        # skip empty records, e.g. from trailing blank lines
        continue
    header = lines[0]

    # note that you need a raw string
    if re.match(r"[^\d]\d{5}:", header[-7:]):

        # in general multiple hits are possible, so put them into a list
        results[header] = [l for l in lines[1:] if l[-3:]=="zip"]

print(results)

Output

{'tommi54321:': ['dontdrinkalso.zip'], 'judy88009:': ['dontdrink.zip']}

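I tried to keep it as simple as possible. If your input is very long, you should follow iCodez's suggestion and implement a generator that yields one record at a time. For the regex match, I tried a small optimization: searching only the last 7 characters of the header.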
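To see why checking only the last 7 characters is enough, here is a small sketch. The helper name `ends_in_exactly_five_digits` and the compiled `TAIL_RE` are hypothetical. The slice must be one non-digit, then five digits, then the colon.

```python
import re

# One non-digit, exactly five digits, then the trailing colon.
TAIL_RE = re.compile(r"[^\d]\d{5}:")

def ends_in_exactly_five_digits(header):
    # header[-7:] is the last 7 characters (or the whole header if shorter).
    return TAIL_RE.match(header[-7:]) is not None

print(ends_in_exactly_five_digits('judy88009:'))     # True:  'y88009:'
print(ends_in_exactly_five_digits('bob234513253:'))  # False: '513253:' starts with a digit
print(ends_in_exactly_five_digits('john123:'))       # False: 'ohn123:' has only 3 digits
```

Note that a header made up of five digits and nothing else, such as '12345:', is rejected by this check because no non-digit character precedes the digits.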

Addendum: a simple implementation of a record generator

import re

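def records(f):
    """Yield one record (a list of non-empty lines) at a time."""
    record = []
    for l in f:
        l = l.strip()
        if l:
            record.append(l)
        elif record:
            # a blank line ends the current record; repeated blank lines are skipped
            yield record
            record = []
    # yield the last record if the file does not end with a blank line
    if record:
        yield record

results = {}
with open('datazip') as f:
    for record in records(f):
        head = record[0]
        if re.match(r"[^\d]\d{5}:", head[-7:]):
            results[head] = [r for r in record[1:] if r[-3:]=="zip"]
print(results)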

【Discussion】:

【Solution 3】:

You need to restrict the regex to the end of the word with \b, so that it does not match when more digits follow:

[a-zA-Z]+\d{5}\b

For an example, see http://regex101.com/r/oC1yO6/1

The regex will match

judy88009:

tommi54321:

The Python code would look like this:

>>> re.findall(r'[a-zA-Z]+\d{5}\b', x)
['judy88009', 'tommi54321']
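A self-contained version of the call above, assuming `x` holds the header lines from the question. The second pattern, with capture groups and a leading \b, is my own addition to split the name from its number; it is not part of the original answer.

```python
import re

# Header lines from the question's sample data.
x = "john123:\nbob234513253:\njudy88009:\ntommi54321:\n"

# \b after \d{5} fails when a sixth digit follows, so bob is rejected.
names = re.findall(r'[a-zA-Z]+\d{5}\b', x)

# Variant with groups (an assumption, not from the answer): letters and digits separately.
pairs = re.findall(r'\b([a-zA-Z]+)(\d{5})\b', x)

print(names)  # ['judy88009', 'tommi54321']
print(pairs)  # [('judy', '88009'), ('tommi', '54321')]
```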

【Discussion】:
