How to make a nested dictionary from a text file in Python? Answers

【Question title】: How to make a nested dictionary from a text file in python?
【Posted】: 2021-10-12 19:26:48
【Question】:

I have a text file structured like this:

SOURCE: RCM
DESTINATIONS BEGIN
JCK SF3
DESTINATIONS END
SOURCE: TRO
DESTINATIONS BEGIN
GFN SF3
SYD SF3 DH4
DESTINATIONS END

I am trying to build a nested dictionary so that the result looks like this:

handout_routes = {
'RCM': {'JCK': ['SF3']},
'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}
}

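This is just a sample of the data, but when reading it we can assume the following: The first line starts with SOURCE: followed by a three-letter IATA airport code. The line after each line starting with SOURCE: is DESTINATIONS BEGIN. There are one or more lines between DESTINATIONS BEGIN and DESTINATIONS END. Every DESTINATIONS BEGIN line has a corresponding DESTINATIONS END line. The lines between DESTINATIONS BEGIN and DESTINATIONS END start with a three-letter IATA airport code, followed by one or more three-character alphanumeric aircraft codes. Each code is separated by a single space. The line after DESTINATIONS END either starts with SOURCE: or you have reached the end of the file.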

This is what I have tried so far:

with open ("file_path", encoding='utf-8') as text_data:
    answer = {}
    for line in text_data:
        line = line.split()
        if not line:  # empty line?
            continue
        answer[line[0]] = line[1:]
    print(answer)

But it returns data like this:

{'SOURCE:': ['WYA'], 'DESTINATIONS': ['END'], 'KZN': ['146'], 'DYU': ['320']}

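I think the problem is the way I structured the code that reads the file. My code is probably too simple for what needs to be done with the file. Any help would be appreciated. Thanks.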

【Question comments】:
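A note on why the attempt in the question collapses into a flat dictionary: every line is stored under its first token, so each SOURCE: line overwrites the previous one, and the same happens with DESTINATIONS. The sketch below replays that loop on the sample data from the question. The sample is inlined as a string instead of being read from a file, and the name flat_parse is made up for this illustration.

```python
# Sample data from the question, inlined so the sketch is self-contained.
sample = """\
SOURCE: RCM
DESTINATIONS BEGIN
JCK SF3
DESTINATIONS END
SOURCE: TRO
DESTINATIONS BEGIN
GFN SF3
SYD SF3 DH4
DESTINATIONS END
"""


def flat_parse(text):
    """Replay the question's loop: key each line by its first token."""
    answer = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:  # skip empty lines
            continue
        # A repeated first token (e.g. 'SOURCE:') replaces the earlier entry,
        # so only the last SOURCE and the last DESTINATIONS line survive.
        answer[tokens[0]] = tokens[1:]
    return answer


result = flat_parse(sample)
print(result)
```

On the sample this prints {'SOURCE:': ['TRO'], 'DESTINATIONS': ['END'], 'JCK': ['SF3'], 'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}. The RCM source is gone and the destinations are no longer attached to any source, which is why the parser needs to remember the current source between lines.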

Tags: python dictionary dictionary-comprehension

【Solution 1】:

Here is a program I wrote that works well:

def unpack(text):
  """Parse the file contents into {source: {destination: [aircraft codes]}}."""
  contents: dict = {}
  source = None

  for line in text.splitlines():
    line = line.strip()

    if not line:
      # Empty line, so we ignore it.
      continue

    if line.startswith('DESTINATIONS'):
      # The BEGIN/END marker lines carry no data, so we ignore them.
      continue

    if line.startswith('SOURCE'):
      # The airport code is the last whitespace-separated token on the line.
      source = line.split()[-1]
      contents.setdefault(source, {})
    else:
      # First token is the destination, the rest are aircraft codes.
      idx, *data = line.split()
      contents[source][idx] = data

  return contents


with open('file.txt') as file:
  handout_routes = unpack(file.read())
  print(handout_routes)

【Comments】:

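This put me on the right track, but it only returns {'AER': {}}. Maybe I implemented your code incorrectly? What does it give you?
That's odd. For me it returns {'RCM': {'JCK': ['SF3']}, 'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}}, which is exactly the dictionary you said it should return. Can you show me the file you are trying to open?
Sure, it's a .dat file. It's a large data set, so how can I show it to you?
Never mind, the mistake was in my implementation. This works great! Thanks!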

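【Solution 2】:

I know there is already an accepted answer, but the approach I use can actually help you find formatting errors in the file, rather than just ignoring the extra bits: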

from tokenize import TokenInfo, tokenize, ENCODING, ENDMARKER, NEWLINE, NAME
from typing import Callable, Generator

class TripParseException(Exception):
    pass

def assert_token_string(token:TokenInfo, expected_string: str):
    if token.string != expected_string:
        raise TripParseException("Unable to parse trip file: expected {}, found {} in line {} ({})".format(
            expected_string, token.string, str(token.start[0]), token.line
        ))

def assert_token_type(token:TokenInfo, expected_type: int):
    if token.type != expected_type:
        raise TripParseException("Unable to parse trip file: expected type {}, found type {} in line {} ({})".format(
            expected_type, token.type, str(token.start[0]), token.line
        ))

def parse_destinations(token_stream: Generator[TokenInfo, None, None])->dict:
    destinations = dict()
    assert_token_string(next(token_stream), "DESTINATIONS")
    assert_token_string(next(token_stream), "BEGIN")
    assert_token_type(next(token_stream), NEWLINE)
    current_token = next(token_stream)
    while(current_token.string != "DESTINATIONS"):
        assert_token_type(current_token, NAME)
        destination = current_token.string
        plane_codes = list()
        current_token = next(token_stream)
        while(current_token.type != NEWLINE):
            assert_token_type(current_token, NAME)
            plane_codes.append(current_token.string)
            current_token = next(token_stream)
        destinations[destination] = plane_codes
        # current token is NEWLINE, get the first token on the next line.
        current_token = next(token_stream)


    # Just parsed "DESTINATIONS", expecting "DESTINATIONS END"
    assert_token_string(next(token_stream), "END")
    assert_token_type(next(token_stream), NEWLINE)
    return destinations

def parse_trip(token_stream: Generator[TokenInfo, None, None]):
    current_token = next(token_stream)
    if(current_token.type == ENDMARKER):
        return None, None
    assert_token_string(current_token, "SOURCE")
    assert_token_string(next(token_stream), ":")
    tok_origin = next(token_stream)
    assert_token_type(tok_origin, NAME)
    assert_token_type(next(token_stream), NEWLINE)
    destinations = parse_destinations(token_stream)

    return tok_origin.string, destinations

def parse_trips(readline: Callable[[], bytes]) -> dict:
    token_gen = tokenize(readline)
    assert_token_type(next(token_gen), ENCODING)
    trips = dict()
    while(True):
        origin, destinations = parse_trip(token_gen)
        if(origin is not None and destinations is not None):
            trips[origin] = destinations
        else:
            break

    return trips

Your usage would then look like this:

import pprint

with open("trips.dat", "rb") as trips_file:
    trips = parse_trips(trips_file.readline)
    pprint.pprint(
        trips
    )

This produces the expected result:

{'RCM': {'JCK': ['SF3']}, 'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}}

This approach is also more flexible if you later want to put additional information in the file.

【Comments】:
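One caveat with reusing Python's tokenizer: it classifies tokens by Python's own rules, so an all-digit aircraft code such as 146 or 320 (both appear in the output shown in the question) is a NUMBER token rather than a NAME, and the NAME check would reject it. If the same kind of format checking is wanted without that restriction, a plain line-based parser is one possible alternative. The following is only a minimal sketch under the format assumptions listed in the question. The names RouteFormatError and parse_routes are hypothetical, and it does not check every rule (for example, it does not insist that BEGIN directly follows SOURCE).

```python
class RouteFormatError(ValueError):
    """Hypothetical error type used only by this sketch."""


def parse_routes(lines):
    """Build {source: {destination: [aircraft codes]}} and report bad lines."""
    routes = {}
    source = None      # current SOURCE code, None when no SOURCE is pending
    in_block = False   # True between DESTINATIONS BEGIN and DESTINATIONS END

    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines

        if line.startswith("SOURCE:"):
            if in_block:
                raise RouteFormatError(f"line {lineno}: SOURCE inside a DESTINATIONS block")
            source = line[len("SOURCE:"):].strip()
            if not source:
                raise RouteFormatError(f"line {lineno}: SOURCE without an airport code")
            routes.setdefault(source, {})
        elif line == "DESTINATIONS BEGIN":
            if source is None or in_block:
                raise RouteFormatError(f"line {lineno}: unexpected DESTINATIONS BEGIN")
            in_block = True
        elif line == "DESTINATIONS END":
            if not in_block:
                raise RouteFormatError(f"line {lineno}: DESTINATIONS END without BEGIN")
            in_block = False
            source = None  # the next block must start with a new SOURCE line
        else:
            if not in_block:
                raise RouteFormatError(f"line {lineno}: unexpected data line {line!r}")
            dest, *codes = line.split()
            if not codes:
                raise RouteFormatError(f"line {lineno}: destination {dest} has no aircraft codes")
            routes[source][dest] = codes

    if in_block:
        raise RouteFormatError("file ended before DESTINATIONS END")
    return routes
```

Because a file object iterates over its lines, it could be called as parse_routes(open_file) inside a with block, for example with the trips.dat file opened in text mode.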

【Solution 3】:

from itertools import takewhile
import re


def destinations(lines):
    # Yield [destination, code, ...] for each line of one DESTINATIONS block.
    # next(lines, '') avoids a StopIteration if the file ends early.
    if next(lines, '').startswith('DESTINATIONS BEGIN'):
        dest = takewhile(lambda l: not l.startswith('DESTINATIONS END'), lines)
        yield from map(str.split, dest)


def sources(lines):
    source = re.compile(r'SOURCE:\s*(\w+)')
    # Stop at the first line that is not a SOURCE header, or at end of file.
    while m := source.match(next(lines, '')):
        yield (m.group(1),
               {dest: crafts for dest, *crafts in destinations(lines)})


with open('file_path', encoding='utf-8') as f:
    handout_routes = dict(sources(f))
print(handout_routes)

【Comments】:
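Since this answer comes without an explanation, here is a note on the part that may not be obvious: both generators pull from the same line iterator, and takewhile also consumes the first line that fails its test. That means the DESTINATIONS END line is swallowed along with the block, so the next line read is the following SOURCE: header. The sketch below shows this on a few lines taken from the question's sample, held in a list here instead of a file purely for illustration.

```python
from itertools import takewhile

# A shared iterator over lines, standing in for the open file object.
lines = iter([
    "DESTINATIONS BEGIN",
    "JCK SF3",
    "DESTINATIONS END",
    "SOURCE: TRO",
])

next(lines)  # skip the DESTINATIONS BEGIN header

# takewhile stops at DESTINATIONS END and discards that line.
block = list(takewhile(lambda l: not l.startswith("DESTINATIONS END"), lines))

# The shared iterator is now positioned on the next SOURCE header.
following = next(lines)

print(block)      # ['JCK SF3']
print(following)  # SOURCE: TRO
```

In the answer, the dictionary comprehension fully consumes each destinations block before the loop asks for the next line, which is what keeps the two generators in step.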