【Question Title】: Match more than 2 spaces with PyParsing
【Posted】: 2013-07-29 03:36:50
【Question】:

I have the following string:

date                Not Important                         value    NotImportant2
11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90

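I only need to extract the date and the value.

If I use the standard approach for matching multiple words, pyparsing matches the last number in the "Not Important" column as the "value".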

    anything = pp.Forward()
    anything << anyword + (value | anything)
    myParser = date + anything

I think the best approach is to force pyparsing to match at least 2 spaces, but I really don't know how. Any suggestions?

【Comments】:

Tags: python regex pattern-matching match pyparsing

【Solution 1】:

Description

To match 2 or more whitespace characters, you can use \s{2,}
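
As a small illustration of that quantifier on its own, the sketch below uses Python's standard re module (not pyparsing) to split one row on runs of two or more whitespace characters. The variable names are made up for this example, and it assumes that no field contains two consecutive spaces internally.

```python
import re

# One data row from the question: columns are separated by 2+ spaces,
# while words inside a column are separated by a single space.
row = "11.11.13            useless . useless,21 useless 2        14.21    asmdakldm"

# Split wherever two or more whitespace characters occur in a row.
fields = re.split(r"\s{2,}", row.strip())

# The date is the first field and the value is the third one.
date, value = fields[0], fields[2]
print(date, value)
```

Single spaces inside the "Not Important" column are left alone, so the stray numbers in that column stay inside their own field instead of being mistaken for the value.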

The expression below will:

capture the date field
capture the second-to-last field

^(\d{2}\.\d{2}\.\d{2})[^\r\n]*\s(\S+)\s{2,}\S+\s*(?:[\r\n]|\Z)

Example

Live Demo

Sample text

date                Not Important                         value    NotImportant2
11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90

Matches

[0][0] = 11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
[0][1] = 11.11.13
[0][2] = 14.21

[1][0] = 21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90
[1][1] = 21.12.12
[1][2] = 41
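
For readers who want to try the expression from Python, here is a minimal sketch using the standard re module. It assumes the MULTILINE flag so that ^ anchors at the start of every line, and it assumes plain "\n" line endings; the names sample, pattern and rows are illustrative only.

```python
import re

# Sample text from the question, joined with "\n" line endings.
sample = "\n".join([
    "date                Not Important                         value    NotImportant2",
    "11.11.13            useless . useless,21 useless 2        14.21    asmdakldm",
    "21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90",
])

# Group 1 is the date, group 2 is the second-to-last field.
pattern = re.compile(
    r"^(\d{2}\.\d{2}\.\d{2})[^\r\n]*\s(\S+)\s{2,}\S+\s*(?:[\r\n]|\Z)",
    re.MULTILINE,
)

rows = [(m.group(1), m.group(2)) for m in pattern.finditer(sample)]
for date, value in rows:
    print(date, value)
```

The header line is skipped automatically because it does not start with a dd.mm.yy date.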

【Comments】:

【Solution 2】:

This sample text is columnar, so pyparsing is a bit of overkill here. You could just write:

# sample is the list of input lines (built further below)
fieldslices = [slice(0, 8),        # date column
               slice(58, 58 + 8),  # value column
              ]

for line in sample:
    date, value = (line[x] for x in fieldslices)
    print(date, value.strip())

This gives:

date     value
11.11.13 14.21
21.12.12 41

But since you specifically asked for a pyparsing solution, for something this columnar you can use the GoToColumn class:

from pyparsing import *

dateExpr = Regex(r'(\d\d\.){2}\d\d').setName("date")
realNum = Regex(r'\d+\.\d*').setName("real").setParseAction(lambda t:float(t[0]))
intNum = Regex(r'\d+').setName("integer").setParseAction(lambda t:int(t[0]))
valueExpr = realNum | intNum

patt = dateExpr("date") + GoToColumn(59) + valueExpr("value")

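GoToColumn is similar to SkipTo, but instead of advancing to the next instance of an expression, it advances to a particular column number (where column numbers are 1-based, not 0-based as in string slicing).

Now here is the parser applied to your sample text: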
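
To make the 1-based versus 0-based distinction concrete, the following sketch rebuilds the first data row from assumed column widths (20, 38 and 9 characters, read off the sample text) and compares the slice index with the column number. It uses plain string methods only and does not call pyparsing.

```python
# Rebuild the first data row from the column widths seen in the sample
# (these widths are an assumption taken from the sample layout).
line = (
    "11.11.13".ljust(20)
    + "useless . useless,21 useless 2".ljust(38)
    + "14.21".ljust(9)
    + "asmdakldm"
)

start = line.index("14.21")  # 0-based index, as used by slice(58, 58 + 8)
column = start + 1           # 1-based column, as passed to GoToColumn(59)

print(start, column, line[start:start + 8].strip())
```

This is why the slicing version uses 58 while the parser uses 59 for the same position.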

# Normally, input would be from some text file
# infile = open(sourcefile)
# but for this example, create iterator from the sample 
# text instead
sample = """\
date                Not Important                         value    NotImportant2
11.11.13            useless . useless,21 useless 2        14.21    asmdakldm
21.12.12            fmpaosmfpoamsp 4                      41       ajfa9si90
""".splitlines()

infile = iter(sample)

# skip header line
next(infile) 

for line in infile:
    result = patt.parseString(line)
    print(result.dump())
    print()

Prints:

['11.11.13', 'useless . useless,21 useless 2        ', 14.210000000000001]
- date: 11.11.13
- value: 14.21

['21.12.12', 'fmpaosmfpoamsp 4                      ', 41]
- date: 21.12.12
- value: 41

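Note that the values have already been converted from strings to int or float; you could write a parse action of your own to convert your dd.mm.yy dates to Python datetimes. Note also how the associated results names are defined; these let you access the individual fields by name, for example result.date.

I also notice that, to get a sequence of one or more elements, you used the following construct: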
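
A date-converting parse action could look roughly like the sketch below. The function name to_date is hypothetical, and the function is called directly with a plain list standing in for pyparsing's tokens; it is assumed that it could be attached in the same way as the numeric conversions above, for example with dateExpr.setParseAction(to_date).

```python
from datetime import datetime

def to_date(tokens):
    """Convert a matched 'dd.mm.yy' string (tokens[0]) to a datetime.date."""
    return datetime.strptime(tokens[0], "%d.%m.%y").date()

# Called directly here; a plain list stands in for the parsed tokens.
converted = to_date(["11.11.13"])
print(converted.isoformat())
```

Two-digit years are interpreted by strptime's %y rule, so 13 becomes 2013.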

anything = pp.Forward()
anything << anyword + (value | anything)

While this does work, it creates a recursive expression with significant run-time overhead. pyparsing provides an iterative equivalent, OneOrMore:

anything = OneOrMore(anyword)

Or, if you prefer the newer '*' operator form:

anything = anyword*(1,)

Please browse the pyparsing API documentation, which is included in pyparsing's source distribution, or online at http://packages.python.org/pyparsing/.

Welcome to Pyparsing!

【Comments】: