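How can I process this text file and parse what I need?

【Question title】: How can I process this text file and parse what I need?
【Posted】: 2009-08-07 20:06:29
【Question description】:

I am trying to parse the output of Python's doctest module and store it in an HTML file.

My output looks something like this: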

**********************************************************************
File "example.py", line 16, in __main__.factorial
Failed example:
    [factorial(n) for n in range(6)]
Expected:
    [0, 1, 2, 6, 24, 120]
Got:
    [1, 1, 2, 6, 24, 120]
**********************************************************************
File "example.py", line 20, in __main__.factorial
Failed example:
    factorial(30)
Expected:
    25252859812191058636308480000000L
Got:
    265252859812191058636308480000000L
**********************************************************************
1 items had failures:
   2 of   8 in __main__.factorial
***Test Failed*** 2 failures.

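Each failure is preceded by a line of asterisks, which delimits the individual test failures.

What I'd like to do is strip out the filename and method that failed, as well as the expected and actual results. Then I'd like to create an HTML document from this (or store it in a text file and do a second round of parsing).

How can I do this using just Python or some combination of UNIX shell utilities?

EDIT: I came up with the following shell script, which matches each block the way I want, but I'm not sure how to redirect each sed match to its own file.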

python example.py | sed -n '/.*/,/^\**$/p' > `mktemp error.XXX`

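【Question comments】:

If you strip out the file, method, expected, and actual results, what is left?
Well, I just can't manage to parse them out into separate pieces; so far I can only grab the whole block at once, not the individual fields.

Tags: python parsing shell doctest

【Solution 1】:

You could write a Python program to pick this output apart, but the better approach may be to look into modifying doctest so that it outputs the report you want in the first place. From the documentation for doctest.DocTestRunner: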

                                  ... the display output
can be also customized by subclassing DocTestRunner, and
overriding the methods `report_start`, `report_success`,
`report_unexpected_exception`, and `report_failure`.
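
A minimal sketch of that approach, assuming Python 3 and that only failures (not unexpected exceptions) need to be captured. The names `CollectingRunner`, `collect_failures`, and the toy `factorial` below are hypothetical and exist only for illustration; `report_failure`, `DocTestFinder`, and the `test` and `example` attributes come from the standard doctest module.

```python
import doctest


class CollectingRunner(doctest.DocTestRunner):
    """Hypothetical runner that records failures instead of printing them."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.collected = []

    def report_failure(self, out, test, example, got):
        # example.lineno is 0-based within the docstring, and test.lineno
        # (where the docstring starts in the file) may be None if unknown.
        if test.lineno is None:
            lineno = None
        else:
            lineno = test.lineno + example.lineno + 1
        self.collected.append({
            "file": test.filename,
            "line": lineno,
            "name": test.name,
            "example": example.source.strip(),
            "expected": example.want.strip(),
            "got": got.strip(),
        })


def collect_failures(obj, name):
    """Run the doctests found on `obj` and return the recorded failures."""
    runner = CollectingRunner(verbose=False)
    finder = doctest.DocTestFinder()
    # Supplying globs explicitly keeps the sketch independent of how the
    # enclosing module was loaded.
    for test in finder.find(obj, name=name, globs={obj.__name__: obj}):
        runner.run(test)
    return runner.collected


def factorial(n):
    """Toy function whose doctest expectation is wrong on purpose.

    >>> factorial(3)
    7
    """
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
```

Calling `collect_failures(factorial, "factorial")` returns a list of dictionaries, one per failed example, so no text parsing is needed afterwards. Unexpected exceptions would need a similar override of `report_unexpected_exception`.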

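【Comments】:

I will definitely take a look at this!

【Solution 2】:

Here is a quick and dirty script that parses the output into tuples containing the relevant information: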

import sys
import re

stars_re = re.compile('^[*]+$', re.MULTILINE)
file_line_re = re.compile(r'^File "(.*?)", line (\d*), in (.*)$')

doctest_output = sys.stdin.read()
chunks = stars_re.split(doctest_output)[1:-1]

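for chunk in chunks:
    chunk_lines = chunk.strip().splitlines()
    m = file_line_re.match(chunk_lines[0])
    if m is None:
        # Skip any chunk that does not start with a 'File "...", line N, in ...' header.
        continue

    filename, line, module = m.groups()
    # Assumes one-line example, expected, and got sections, as in the sample output.
    failed_example = chunk_lines[2].strip()
    expected = chunk_lines[4].strip()
    got = chunk_lines[6].strip()

    # The extra parentheses keep this a single tuple under Python 3 as well.
    print((filename, line, module, failed_example, expected, got))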
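
The script above reads the fields at fixed line offsets, so it relies on each section being exactly one line long. As a rough sketch of a more tolerant variant, the function below walks each block and collects lines under the "Failed example:", "Expected:", and "Got:" labels, so multi-line values survive. The name `parse_failures` and the dictionary keys are hypothetical, and other doctest labels such as "Expected nothing" or "Exception raised:" are not handled here.

```python
import re

# A separator is a line consisting only of asterisks.
STARS_RE = re.compile(r'^\*{5,}$', re.MULTILINE)
HEADER_RE = re.compile(r'^File "(.*?)", line (\d+), in (.*)$')

# Map each section label in the doctest output to a record key.
SECTION_LABELS = {
    "Failed example:": "example",
    "Expected:": "expected",
    "Got:": "got",
}


def parse_failures(text):
    """Return one dict per failure block found in doctest's text output."""
    failures = []
    for chunk in STARS_RE.split(text):
        lines = chunk.strip("\n").splitlines()
        if not lines:
            continue
        header = HEADER_RE.match(lines[0])
        if header is None:
            # Not a failure block, e.g. the trailing summary.
            continue
        record = {
            "file": header.group(1),
            "line": int(header.group(2)),
            "name": header.group(3),
            "example": [],
            "expected": [],
            "got": [],
        }
        current = None
        for line in lines[1:]:
            if line in SECTION_LABELS:
                current = SECTION_LABELS[line]
            elif current is not None:
                # doctest indents section bodies by four spaces.
                body = line[4:] if line.startswith("    ") else line.strip()
                record[current].append(body)
        for key in ("example", "expected", "got"):
            record[key] = "\n".join(record[key])
        failures.append(record)
    return failures
```

With the captured output in a string, `parse_failures(sys.stdin.read())` (after `import sys`) would give a list that is easy to feed into an HTML template.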

【Comments】:

【Solution 3】:

I wrote a quick parser in pyparsing to do it.

from pyparsing import *

str = """
**********************************************************************
File "example.py", line 16, in __main__.factorial
Failed example:
    [factorial(n) for n in range(6)]
Expected:
    [0, 1, 2, 6, 24, 120]
Got:
    [1, 1, 2, 6, 24, 120]
**********************************************************************
File "example.py", line 20, in __main__.factorial
Failed example:
    factorial(30)
Expected:
    25252859812191058636308480000000L
Got:
    265252859812191058636308480000000L
**********************************************************************
"""

quote = Literal('"').suppress()
comma = Literal(',').suppress()
in_ = Keyword('in').suppress()
block = OneOrMore("**").suppress() + \
        Keyword("File").suppress() + \
        quote + Word(alphanums + ".") + quote + \
        comma + Keyword("line").suppress() + Word(nums) + comma + \
        in_ + Word(alphanums + "._") + \
        LineStart() + restOfLine.suppress() + \
        LineStart() + restOfLine + \
        LineStart() + restOfLine.suppress() + \
        LineStart() + restOfLine + \
        LineStart() + restOfLine.suppress() + \
        LineStart() + restOfLine  

# Named `report` rather than `all` so the built-in all() is not shadowed.
report = OneOrMore(Group(block))

result = report.parseString(str)

for section in result:
    print(section)

which gives:

['example.py', '16', '__main__.factorial', '    [factorial(n) for n in range(6)]', '    [0, 1, 2, 6, 24, 120]', '    [1, 1, 2, 6, 24, 120]']
['example.py', '20', '__main__.factorial', '    factorial(30)', '    25252859812191058636308480000000L', '    265252859812191058636308480000000L']

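【Comments】:

Why does the str text have 3 " marks before and after it? Sorry, my Python really isn't that good.
Triple quotes just denote a text string that can span multiple lines.

【Solution 4】:

This is probably one of the least elegant Python scripts I've ever written, but it should have the framework to do what you want without resorting to UNIX utilities and separate scripts to create the HTML. It's untested, but it should only need minor tweaking to work.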

import html
import os

# Make sure the output directory exists before writing to it (Python 3).
os.makedirs('processed', exist_ok=True)

# Prefixes of the lines worth keeping: the header lines and the indented bodies.
KEEP_PREFIXES = ("File", "Fail", "Expe", "Got", " ")

# Walk the current directory and ignore anything that isn't a .txt file.
for thisFile in os.listdir('.'):
    if not thisFile.endswith(".txt"):
        continue

    # Read in the text, then split it into a list of lines.
    with open(thisFile, 'r') as infile:
        yourList = infile.read().split('\n')

    compiledText = ''
    for i in yourList:
        # A row of asterisks starts a new failure report.
        if i.startswith("*****"):
            compiledText += "\n\n--- New Report ---\n"
        elif i.startswith(KEEP_PREFIXES):
            compiledText += i + '\n'

    # Insert your HTML template below. The report text is escaped so that
    # characters such as < and > display literally, and wrapped in <pre>
    # so that the line breaks survive.
    htmlText = ('<html>...\n <body> \n <pre>' + html.escape(compiledText) +
                '</pre></body>... </html>')

    # Write out one HTML file per input file.
    with open(os.path.join('processed', thisFile + '.html'), 'w') as outfile:
        outfile.write(htmlText)
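
If the fields have already been separated, the HTML step could build a table rather than a block of text. The following is only a sketch under that assumption: `render_report` and `COLUMNS` are hypothetical names, and the input is assumed to be an iterable of six-item sequences ordered as file, line, test name, failed example, expected, got. `html.escape` is the standard-library function for escaping markup characters.

```python
import html

# Column headings, in the same order as the items of each failure.
COLUMNS = ("File", "Line", "Test", "Failed example", "Expected", "Got")


def render_report(failures):
    """Return an HTML page with one table row per failure.

    `failures` is an iterable of 6-item sequences ordered like COLUMNS.
    """
    head = "".join("<th>%s</th>" % html.escape(name) for name in COLUMNS)
    rows = []
    for failure in failures:
        # Escape every value so that output such as "a < b" renders literally.
        cells = "".join(
            "<td><pre>%s</pre></td>" % html.escape(str(value))
            for value in failure
        )
        rows.append("<tr>%s</tr>" % cells)
    return (
        "<html>\n<body>\n<table>\n<tr>%s</tr>\n%s\n</table>\n</body>\n</html>\n"
        % (head, "\n".join(rows))
    )
```

The returned string can then be written to a file in the same way as `htmlText` above.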

【Comments】: