Parsing file structure based on a specific pattern: answers

【Question title】: Parsing file structure based on a specific pattern
【Posted】: 2016-03-11 07:46:00
【Question】:

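I have a text file with many lines. The lines come in the order name, location, website, followed by "END" to mark the end of one person's profile, then name, location, website again, and so on. I need to add each name to a dictionary as a key, with the rest (location, website) as its value.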

So if I have a file:

name1
location1
website1
END
name2
location2
website2
END
name3
location3
website3
END

the result would be:

dict = {'name1': ['location1','website1'],
        'name2': ['location2', 'website2'], 
        'name3': ['location3', 'website3']}

Edit: the value should be a list, sorry about that.

I don't know how to approach this. Can someone point me in the right direction?

【Question comments】:

{'name1': 'location1','website1', 'name2': 'location2', 'website2', 'name3': 'location3', 'website3'} is not a valid dictionary.

Tags: python parsing dictionary

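【Solution 1】:

First of all, there seems to be a misunderstanding about the structure of a dictionary, or more generally of associative containers, and that misunderstanding underlies this question.

A dictionary is structured like this, in Python-like syntax: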

{
   key : whatever_value1,
   another_key: whatever_value2,
   # ...
}

Second, if you strip the trailing digits from

name1
location1
website1

for each END-delimited entry of that file, you naturally arrive at a struct-like ADT, i.e.

class Whatever(object):
    def __init__(self, name, location, website):
        self.name = name
        self.location = location
        self.website = website

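(Your mileage may vary as to the class name.)

You can therefore use a Python dict that maps a key (probably the name attribute of your record) to a reference to an instance of that type.

To process the input file, you read it line by line until you hit END, and then commit a Whatever instance to the dictionary using (for example) its name as the key.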
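The steps described above could be sketched roughly as follows. This is only an illustration, not the answerer's own code: the helper name `parse_profiles` is hypothetical, an `io.StringIO` object stands in for a real open file, and the sketch assumes every entry has exactly three fields before END.

```python
import io


class Whatever(object):
    """Record type from the answer above: one instance per END-delimited entry."""

    def __init__(self, name, location, website):
        self.name = name
        self.location = location
        self.website = website


def parse_profiles(lines):
    """Hypothetical helper: map each record's name to a Whatever instance."""
    profiles = {}
    fields = []
    for raw in lines:
        line = raw.strip()
        if line == "END":
            # Assumption: exactly three fields were collected for this entry.
            name, location, website = fields
            profiles[name] = Whatever(name, location, website)
            fields = []
        elif line:
            fields.append(line)
    return profiles


# Stand-in for an open file object; any iterable of lines works.
sample = io.StringIO(
    "name1\nlocation1\nwebsite1\nEND\nname2\nlocation2\nwebsite2\nEND\n"
)
profiles = parse_profiles(sample)
print(profiles["name2"].location)  # prints: location2
```

With a real file, the call would presumably look like `parse_profiles(f)` inside a `with open(...) as f:` block.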

【Comments】:

【Solution 2】:

Since "END" separates each section, itertools.groupby can split the file at END; we only need to build our key/value pairs as we iterate over the groupby object.

from itertools import groupby
from collections import OrderedDict

with open("test.txt") as f:
    d = OrderedDict((next(v), list(v))
             for k, v in groupby(map(str.rstrip, f), key=lambda x: x[:3] != "END") if k)

Output:

OrderedDict([('name1', ['location1', 'website1']),
             ('name2', ['location2', 'website2']),
             ('name3', ['location3', 'website3'])])

Or use a regular for loop, changing the key each time END is hit and storing each section's lines in a tmp list:

from collections import OrderedDict

with open("test.txt") as f:
    # itertools.imap for python2
    data = map(str.rstrip, f)
    d, tmp, k = OrderedDict(), [], next(data)
    for line in data:
        if line == "END":
            d[k] = tmp
            k, tmp = next(data, ""), []
        else:
            tmp.append(line)

The output will be the same:

OrderedDict([('name1', ['location1', 'website1']),
             ('name2', ['location2', 'website2']),
             ('name3', ['location3', 'website3'])])

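Both code samples work for sections of any length, not just three lines.

【Comments】:

【Solution 3】:

Already answered, but you can shorten things by using Python's own dict and list comprehensions: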

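with open(file, 'r') as f:  # file is the file name
    # Split the whole text on END, then split each chunk into its three lines.
    triplets = [data.strip().split('\n') for data in f.read().strip().split('END') if data]
    d = {name: [location, site] for name, location, site in triplets}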

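【Comments】:

This creates multiple copies of the data in memory.
@PadraicCunningham Yes. Your answer is certainly more efficient. However, I think that in terms of brevity and readability it is instructive at this level.
This is very inefficient: you create a list only to throw it away, you call tuple on a list for no reason, and there must always be exactly three elements or the code fails.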
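One way to relax the exactly-three-elements restriction raised in the comments is extended unpacking (Python 3). The sketch below is an assumption for illustration, not the answerer's code: `parse_sections` is a hypothetical name, the sample text is made up, and it splits only on lines that consist of END alone. It still holds the whole input in memory, so the efficiency concern remains.

```python
import re


def parse_sections(text):
    """Hypothetical helper: the first line of each END-delimited section is the key."""
    # Split only where a line is exactly END, so a name containing "END" is left intact.
    chunks = re.split(r"^END$", text, flags=re.MULTILINE)
    sections = (chunk.strip().split("\n") for chunk in chunks)
    # Skip empty chunks, such as the one after the final END.
    return {name: rest for name, *rest in sections if name}


sample = "name1\nlocation1\nwebsite1\nEND\nname2\nlocation2\nEND\n"
d = parse_sections(sample)
print(d)
```

Here the second section has only one value line, and it simply ends up as a one-element list.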

【Solution 4】:

You can take four-line slices from the file one at a time, without loading it all into memory. One way to do that is with islice from itertools.

from itertools import islice
data = dict()
with open('file.path') as f:
    while True:
        # Read the next four lines: name, location, website, END.
        batch = tuple(x.strip() for x in islice(f, 4))
        if not batch:
            break
        name, location, website, end = batch
        data[name] = (location, website)

Verification:

> from pprint import pprint
> pprint(data)

{'name1': ('location1', 'website1'),
 'name2': ('location2', 'website2'),
 'name3': ('location3', 'website3')}
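The loop above assumes every batch holds exactly four lines, so a trailing blank line or a short record would raise an unpacking error. A slightly more defensive sketch, under the same fixed-layout assumption, checks each batch before unpacking it. The function name `read_batches` and the in-memory sample are illustrative only, and values are stored as lists to match the question.

```python
import io
from itertools import islice


def read_batches(lines):
    """Hypothetical helper: consume four lines at a time; the fourth must be END."""
    lines = iter(lines)  # islice must keep advancing the same iterator
    data = {}
    while True:
        batch = [x.strip() for x in islice(lines, 4)]
        if not any(batch):
            break  # end of input, or only blank lines in this batch
        if len(batch) != 4 or batch[-1] != "END":
            raise ValueError("malformed record: %r" % (batch,))
        name, location, website, _end = batch
        data[name] = [location, website]
    return data


# Stand-in for an open file object, with a trailing blank line.
sample = io.StringIO(
    "name1\nlocation1\nwebsite1\nEND\nname2\nlocation2\nwebsite2\nEND\n\n"
)
data = read_batches(sample)
print(data)
```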

【Comments】:

【Solution 5】:

If you are guaranteed to always get the data in this format, you can do something like this:

data = {}  # named data rather than dict, to avoid shadowing the built-in
name = None
location = None
website = None
count = 0
with open(file, 'r') as f:  # where file is the file name
    for each in f:
        each = each.strip()  # drop the trailing newline so the 'END' comparison works
        count += 1
        if count == 1:
            name = each
        elif count == 2:
            location = each
        elif count == 3:
            website = each
        elif count == 4 and each == 'END':
            count = 0  # Forgot to reset to 0 when it got to four... my bad.
            data[name] = (location, website)  # Adding to the dictionary as a tuple since you need to have key -> value not key -> value1, value2
        else:
            print("Well, something went amiss %i  %s" % (count, each))

【Comments】: