Text extraction line splitting across multiple lines with Python

【Question title】: Text extraction line splitting across multiple lines with Python
【Posted】: 2016-01-23 17:21:11
【Question】:

I have the following code:

f = open('./dat.txt', 'r')
array = []
for line in f:
    # if "1\t\"Overall evaluation" in line:
    #   words = line.split("1\t\"Overall evaluation")
    #   print words[0]
    number = int(line.split(':')[1].strip('"\n'))
    print number

This extracts the trailing integer from each line of my data, which looks like this:

299 1   "Overall evaluation: 3
Invite to interview: 3
Strength or novelty of the idea (1): 4
Strength or novelty of the idea (2): 3
Strength or novelty of the idea (3): 3
Use or provision of open data (1): 4
Use or provision of open data (2): 3
""Open by default"" (1): 2
""Open by default"" (2): 3
Value proposition and potential scale (1): 4
Value proposition and potential scale (2): 2
Market opportunity and timing (1): 4
Market opportunity and timing (2): 4
Triple bottom line impact (1): 4
Triple bottom line impact (2): 2
Triple bottom line impact (3): 2
Knowledge and skills of the team (1): 3
Knowledge and skills of the team (2): 4
Capacity to realise the idea (1): 4
Capacity to realise the idea (2): 3
Capacity to realise the idea (3): 4
Appropriateness of the budget to realise the idea: 3"
299 2   "Overall evaluation: 3
Invite to interview: 3
Strength or novelty of the idea (1): 3
Strength or novelty of the idea (2): 2
Strength or novelty of the idea (3): 4
Use or provision of open data (1): 4
Use or provision of open data (2): 3
""Open by default"" (1): 3
""Open by default"" (2): 2
Value proposition and potential scale (1): 4
Value proposition and potential scale (2): 3
Market opportunity and timing (1): 4
Market opportunity and timing (2): 3
Triple bottom line impact (1): 3
Triple bottom line impact (2): 2
Triple bottom line impact (3): 1
Knowledge and skills of the team (1): 4
Knowledge and skills of the team (2): 4
Capacity to realise the idea (1): 4
Capacity to realise the idea (2): 4
Capacity to realise the idea (3): 4
Appropriateness of the budget to realise the idea: 2"

364 1   "Overall evaluation: 3
Invite to interview: 3
...

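I also need to get the "record identifier", which in the example above is 299 for the first two instances and 364 for the next one.

The commented-out code above, if I remove the last lines and use it on its own, like this: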

f = open('./dat.txt', 'r')
array = []
for line in f:
    if "1\t\"Overall evaluation" in line:
        words = line.split("1\t\"Overall evaluation")
        print words[0]
    # number = int(line.split(':')[1].strip('"\n'))
    # print number

can grab the record identifier.

But I can't put the two together.

Ideally, what I want is the following:

368

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2

and so on for all records.

How can I combine the two script components above to achieve this?

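【Comments】:

You look like an experienced user who should know that this isn't the way to handle data in Python. Instead, I'd suggest you work with dictionaries.
Looks can be deceiving. What do you mean?
I mean that the structure of the dat.txt file isn't well suited to parsing. You should try to get it (wherever it comes from) in a properly structured form, for example as a dictionary, so that the only thing you need to do is pass in the key you want (the record identifier, as you call it).
I know I'm not solving your problem, but I'm trying to help in a different way. Is this a data log you generated yourself, or one you received from outside your application?
You might be able to get Python data types from an Excel file. Check this, for example: stackoverflow.com/questions/28774960/…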
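
To illustrate the dictionary suggestion above, here is a minimal Python 3 sketch. It rests on an assumption the asker has not confirmed: that dat.txt is a tab-separated export in which the third column is a quoted multi-line cell, with embedded quotes doubled (as in ""Open by default""). If that holds, the standard csv module can reassemble each cell without any manual line tracking. The function name load_scores and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_scores(fileobj):
    """Map each record identifier to a list of {criterion: score} dicts, one per review."""
    records = defaultdict(list)
    # Assumption: tab-delimited rows of (identifier, review number, quoted cell).
    for row in csv.reader(fileobj, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        identifier, cell = row[0], row[2]
        scores = {}
        for entry in cell.splitlines():
            # Split on the last colon so the criterion name stays intact.
            name, sep, value = entry.rpartition(":")
            if sep and value.strip().isdigit():
                scores[name.strip()] = int(value)
        records[identifier].append(scores)
    return dict(records)


# Hypothetical sample in the assumed layout.
sample = io.StringIO(
    '299\t1\t"Overall evaluation: 3\n'
    'Invite to interview: 3\n'
    '""Open by default"" (1): 2"\n'
    '299\t2\t"Overall evaluation: 4\n'
    'Invite to interview: 1"\n'
    '364\t1\t"Overall evaluation: 2\n'
    'Invite to interview: 5"\n',
    newline="",
)
data = load_scores(sample)
print(data["299"][0]["Overall evaluation"])
print(sum(data["299"][1].values()))
```

For a real file, open('./dat.txt', newline='') would be passed in place of the StringIO object; the newline='' argument is what the csv documentation recommends so that line breaks inside quoted cells are preserved.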

Tags: python

【Solution 1】:

Regular expressions are the ticket. You can do this with two patterns. Something like this:

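import re

with open('./dat.txt') as fin:
    for line in fin:
        # A record starts with: <identifier> <review number> "Overall evaluation: <score>
        # \s+ accepts either tabs or spaces between the fields.
        ma = re.match(r'^(\d+)\s+\d+\s+"Overall evaluation', line)
        if ma:
            print("record identifier %r" % ma.group(1))
            # no `continue` here: this line also carries the first score
        # The last line of a record ends with a closing quote, so allow one.
        ma = re.search(r': (\d+)"?\s*$', line)
        if ma:
            print(ma.group(1))
            continue
        print("unrecognized line: %s" % line)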

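Note: the final print statement isn't part of your requirements, but whenever I debug regular expressions I always add some kind of catch-all to help spot lines the patterns fail to match. Once I have my patterns right, I remove the catch-all.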
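
To get from the two patterns to the layout the question asks for (each identifier once, then one "=a+b+..." line per review), the scores have to be collected per record before printing. Below is a minimal Python 3 sketch of one way to do that. The patterns assume fields separated by tabs or spaces and a closing quote on the last line of each record; the names HEADER, SCORE and format_records, and the sample lines, are illustrative and not part of the original answer.

```python
import re

# Assumed layout: <identifier> <review number> "Overall evaluation: <score>
HEADER = re.compile(r'^(\d+)\s+\d+\s+"Overall evaluation')
# A score ends the line, optionally followed by the cell's closing quote.
SCORE = re.compile(r':\s*(\d+)"?\s*$')


def format_records(lines):
    """Return each identifier once, followed by one '=a+b+...' line per review."""
    reviews = []  # (identifier, [scores]) pairs in file order
    for line in lines:
        header = HEADER.match(line)
        if header:
            reviews.append((header.group(1), []))
        score = SCORE.search(line)
        if score and reviews:
            reviews[-1][1].append(score.group(1))

    out = []
    previous_id = None
    for identifier, scores in reviews:
        if identifier != previous_id:
            out.append(identifier)
            previous_id = identifier
        out.append("=" + "+".join(scores))
    return out


# Hypothetical sample lines in the assumed layout.
sample = [
    '299\t1\t"Overall evaluation: 3\n',
    'Invite to interview: 3\n',
    '""Open by default"" (1): 2"\n',
    '299\t2\t"Overall evaluation: 4\n',
    'Invite to interview: 1"\n',
    '\n',
    '364\t1\t"Overall evaluation: 2\n',
    'Invite to interview: 5"\n',
]
for row in format_records(sample):
    print(row)
```

With a real file, format_records(open('./dat.txt')) would take the place of the sample list. Lines that match neither pattern (blank lines, for instance) are silently ignored here, so keeping a catch-all print during debugging is still worthwhile.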

【Comments】: