【Question title】: Differences in re.findall and re.finditer -- bug in Python 2.7 re module?
【Posted】: 2013-12-23 20:22:59
【Question】:

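While demonstrating Python's regular expression features, I wrote a small program to compare the return values of re.search(), re.findall() and re.finditer(). I'm aware that re.search() finds only one match per line, and that re.findall() returns only the matched substrings, without any position information. However, I was surprised to find that the matched substring can differ between the three functions.

The code (available on GitHub):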

#! /usr/bin/env python
# -*- coding: utf-8 -*-

# License: CC-BY-NC-SA 3.0

import re
import codecs

# download kate_chopin_the_awakening_and_other_short_stories.txt
# from Project Gutenberg:
# http://www.gutenberg.org/ebooks/160.txt.utf-8
# with wget:
# wget http://www.gutenberg.org/ebooks/160.txt.utf-8 -O kate_chopin_the_awakening_and_other_short_stories.txt


# match for something o'clock, with valid numerical time or
# any English word with proper capitalization

oclock = re.compile(r"""
                    (
                          [A-Z]?[a-z]+ # word with at most 1 capital letter
                        | 1[012]       # 10,11,12
                        | [1-9]        # 1,2,3,4,5,6,7,8,9
                    )
                    \s
                    o'clock""",
                    re.VERBOSE)

path = "kate_chopin_the_awakening_and_other_short_stories.txt"

print
print "re.search()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')

with  codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        atime = oclock.search(line)
        if  atime:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                            atime.start(),
                                            atime.end(),
                                            atime.group())


print
print "re.findall()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')
with  codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        times = oclock.findall(line)
        if times:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                            '',
                                            '',
                                            ' '.join(times))


print
print "re.finditer()"
print
print u"{:>6} {:>6} {:>6}\t{}".format("Line","Start","End","Match")
print u"{:=>6} {:=>6} {:=>6}\t{}".format('','','','=====')
with  codecs.open(path,mode='r',encoding='utf-8') as f:
    for lineno, line in enumerate(f):
        times = oclock.finditer(line)
        for m in times:
            print u"{:>6} {:>6} {:>6}\t{}".format(lineno,
                                            m.start(),
                                            m.end(),
                                            m.group())

And the output (tested on Python 2.7.3 and 2.7.5):

re.search()

  Line  Start    End    Match
====== ====== ======    =====
   248      7     21    eleven o'clock
  1520     24     35    one o'clock
  1975     21     33    nine o'clock
  2106      4     16    four o'clock
  4443     19     30    ten o'clock

re.findall()

  Line  Start    End    Match
====== ====== ======    =====
   248                  eleven
  1520                  one
  1975                  nine
  2106                  four
  4443                  ten

re.finditer()

  Line  Start    End    Match
====== ====== ======    =====
   248      7     21    eleven o'clock
  1520     24     35    one o'clock
  1975     21     33    nine o'clock
  2106      4     16    four o'clock
  4443     19     30    ten o'clock

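What am I missing here? Why doesn't re.findall() return the o'clock part?

【Comments】:

  • A hard read for a simple question (I was trying to work out whether my own question would be a duplicate). Couldn't the lengthy example code be boiled down to three lines or so, with a simple sample text, so that readers can see the problem clearly?

Tags: python regex python-2.7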


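【Answer 1】:

According to the re.findall documentation:

... If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

Your pattern contains exactly one group, so findall returns a list of what that group matched, not the whole match.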


>>> import re
>>> re.findall('abc', 'abc')
['abc']
>>> re.findall('a(b)c', 'abc')
['b']
>>> re.findall('a(b)(c)', 'abc')
[('b', 'c')]
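
The same rule accounts for the difference seen in the question: calling group() with no argument on a match object returns group 0 (the whole match), whereas findall with one capturing group returns group 1. A minimal sketch, assuming a made-up sample sentence instead of the Gutenberg text and a one-line version of the question's pattern (it should run on Python 2.7 and 3):

```python
import re

# One-line version of the question's pattern, with its single capturing group.
oclock = re.compile(r"([A-Z]?[a-z]+|1[012]|[1-9])\so'clock")

# Hypothetical sample text, not taken from the book.
line = "We met at eleven o'clock and left at 3 o'clock."

# finditer yields match objects: group(0) is the whole match,
# group(1) is only what the parenthesised part matched.
whole_matches = [m.group(0) for m in oclock.finditer(line)]
group_one = [m.group(1) for m in oclock.finditer(line)]

# findall with one capturing group returns the group-1 strings.
found = oclock.findall(line)

print(whole_matches)
print(group_one)
print(found)
```

Here found and group_one hold the same strings, which is why the findall table in the question lacks the o'clock part.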

Use the non-capturing version of the parentheses:

>>> re.findall('a(?:b)c', 'abc')
['abc']
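
Applied to the pattern from the question, this could look like the sketch below. Only the opening parenthesis changes to (?: and the sample sentence is an assumption made up for illustration, not a line from the book:

```python
import re

# The question's verbose pattern, with the group made non-capturing.
oclock = re.compile(r"""
    (?:                    # non-capturing group: no longer affects findall
          [A-Z]?[a-z]+     # word with at most one capital letter
        | 1[012]           # 10, 11, 12
        | [1-9]            # 1 to 9
    )
    \s
    o'clock""", re.VERBOSE)

# Hypothetical sample text.
line = "We met at eleven o'clock and left at 3 o'clock."

# With no capturing group left, findall returns the whole matches.
times = oclock.findall(line)
print(times)
```

If the captured word is still needed elsewhere, another option is to keep the original group and read m.group(0) from finditer.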
