【Question title】: How to make a nested dictionary from a text file in Python?
【Posted】: 2021-10-12 19:26:48
【Question】:

I have a text file structured like this:

SOURCE: RCM
DESTINATIONS BEGIN
JCK SF3
DESTINATIONS END
SOURCE: TRO
DESTINATIONS BEGIN
GFN SF3
SYD SF3 DH4
DESTINATIONS END

I am trying to create a nested dictionary, so that the result looks like this:

handout_routes = {
'RCM': {'JCK': ['SF3']},
'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}
}

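This is only a sample of the data, but when reading the file we can assume the following:

• The first line starts with SOURCE: followed by a three-letter IATA airport code.
• The line after each line that starts with SOURCE: is DESTINATIONS BEGIN.
• There are one or more lines between DESTINATIONS BEGIN and DESTINATIONS END.
• Every DESTINATIONS BEGIN line has a corresponding DESTINATIONS END line.
• The lines between DESTINATIONS BEGIN and DESTINATIONS END start with a three-letter IATA airport code, followed by one or more three-character alphanumeric aircraft codes. The codes are separated by single spaces.
• The line after DESTINATIONS END either starts with SOURCE: or the end of the file has been reached.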

This is what I have tried so far:

with open("file_path", encoding='utf-8') as text_data:
    answer = {}
    for line in text_data:
        line = line.split()
        if not line:  # empty line?
            continue
        answer[line[0]] = line[1:]
    print(answer)

But the data it returns looks like this:

{'SOURCE:': ['WYA'], 'DESTINATIONS': ['END'], 'KZN': ['146'], 'DYU': ['320']}

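I think the problem is the way I have structured the code that reads the file. Any help would be appreciated. My code is probably too simple for what needs to be done with the file. Thanks.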
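The symptom can be reproduced on the sample data from this question. The sketch below wraps the same loop in a function (`flat_parse` and `SAMPLE` are names made up for this illustration) and shows that every line writes into one flat dictionary keyed on its first word, so each later `SOURCE:` and `DESTINATIONS` line overwrites the earlier one and no nesting is ever created.

```python
import io

# Sample data copied from the question.
SAMPLE = """\
SOURCE: RCM
DESTINATIONS BEGIN
JCK SF3
DESTINATIONS END
SOURCE: TRO
DESTINATIONS BEGIN
GFN SF3
SYD SF3 DH4
DESTINATIONS END
"""


def flat_parse(lines):
    """Same logic as the attempt above: one flat dict keyed on the first word."""
    answer = {}
    for line in lines:
        fields = line.split()
        if not fields:  # empty line
            continue
        # Repeated first words ('SOURCE:', 'DESTINATIONS') overwrite each other.
        answer[fields[0]] = fields[1:]
    return answer


result = flat_parse(io.StringIO(SAMPLE))
print(result)
```

Only the last `SOURCE:` and the last `DESTINATIONS` line survive, which matches the shape of the output shown above.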

【Comments】:

    Tags: python dictionary dictionary-comprehension


    【Solution 1】:

    Here is a program I wrote that works well:

    def unpack(file):
      contents: dict = {}
      source: str  # assigned when a SOURCE line is read

      for line in file.split('\n'):

        if line[:12] == 'DESTINATIONS':
          # BEGIN/END marker lines carry no data, so we ignore them
          pass

        elif not line:
          # empty line, so we ignore it
          pass

        elif line[:6] == 'SOURCE':
          # 'SOURCE: RCM' -> 'RCM'
          source = line.rpartition(' ')[-1]
          if source not in contents:
            contents[source] = {}

        else:
          # first field is the destination, the rest are aircraft codes
          idx, *data = line.split(' ')
          contents[source][idx] = list(data)

      return contents
          
    
    with open('file.txt') as file:
      handout_routes = unpack(file.read())
      print(handout_routes)
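To try the line-by-line idea without a file on disk, here is a minimal self-contained sketch that runs on the sample from the question. It is a variant, not the code above: `parse_routes` and `SAMPLE` are hypothetical names, and it assumes fields are separated by whitespace, so it uses `str.split()` with no argument (which also tolerates trailing spaces).

```python
# Sample data copied from the question.
SAMPLE = """\
SOURCE: RCM
DESTINATIONS BEGIN
JCK SF3
DESTINATIONS END
SOURCE: TRO
DESTINATIONS BEGIN
GFN SF3
SYD SF3 DH4
DESTINATIONS END
"""


def parse_routes(text):
    """Build {source: {destination: [aircraft, ...]}} from the text format."""
    routes = {}
    source = None  # most recent SOURCE code seen so far
    for raw in text.splitlines():
        fields = raw.split()
        if not fields or fields[0] == "DESTINATIONS":
            continue  # blank lines and BEGIN/END markers carry no data
        if fields[0] == "SOURCE:":
            source = fields[1]
            routes.setdefault(source, {})
        else:
            if source is None:
                raise ValueError(f"destination line before any SOURCE: {raw!r}")
            destination, *aircraft = fields
            routes[source][destination] = aircraft
    return routes


handout_routes = parse_routes(SAMPLE)
print(handout_routes)
```

For a real file, the text could be passed in the same way as above, for example `parse_routes(open_file.read())`.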
    

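    【Comments】:

    • This put me on the right track, but it only returns {'AER': {}}. Maybe I implemented your code incorrectly? What does it return for you?
    • That's strange. For me it returns {'RCM': {'JCK': ['SF3']}, 'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}}, exactly the dictionary you said it should return. Can you show me the file you are trying to open?
    • Sure, it's a .dat file. It's a large dataset; how should I show it to you?
    • Never mind, the mistake was in my implementation. This works great! Thanks!
    【Solution 2】:

    I know there is already an accepted answer, but the approach I used can actually help you find formatting errors in the file, rather than just ignoring the extra bits: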

    from tokenize import TokenInfo, tokenize, ENCODING, ENDMARKER, NEWLINE, NAME
    from typing import Callable, Generator
    
    class TripParseException(Exception):
        pass
    
    def assert_token_string(token:TokenInfo, expected_string: str):
        if token.string != expected_string:
            raise TripParseException("Unable to parse trip file: expected {}, found {} in line {} ({})".format(
                expected_string, token.string, str(token.start[0]), token.line
            ))
    def assert_token_type(token:TokenInfo, expected_type: int):
        if token.type != expected_type:
            raise TripParseException("Unable to parse trip file: expected type {}, found type {} in line {} ({})".format(
                expected_type, token.type, str(token.start[0]), token.line
            ))
    
    def parse_destinations(token_stream: Generator[TokenInfo, None, None])->dict:
        destinations = dict()
        assert_token_string(next(token_stream), "DESTINATIONS")
        assert_token_string(next(token_stream), "BEGIN")
        assert_token_type(next(token_stream), NEWLINE)
        current_token = next(token_stream)
        while(current_token.string != "DESTINATIONS"):
            assert_token_type(current_token, NAME)
            destination = current_token.string
            plane_codes = list()
            current_token = next(token_stream)
            while(current_token.type != NEWLINE):
                assert_token_type(current_token, NAME)
                plane_codes.append(current_token.string)
                current_token = next(token_stream)
            destinations[destination] = plane_codes
            # current token is NEWLINE, get the first token on the next line.
            current_token = next(token_stream)
    
    
        # Just parsed "DESTINATIONS", expecting "DESTINATIONS END"
        assert_token_string(next(token_stream), "END")
        assert_token_type(next(token_stream), NEWLINE)
        return destinations
    
    def parse_trip(token_stream: Generator[TokenInfo, None, None]):
        current_token = next(token_stream)
        if(current_token.type == ENDMARKER):
            return None, None
        assert_token_string(current_token, "SOURCE")
        assert_token_string(next(token_stream), ":")
        tok_origin = next(token_stream)
        assert_token_type(tok_origin, NAME)
        assert_token_type(next(token_stream), NEWLINE)
        destinations = parse_destinations(token_stream)
    
        return tok_origin.string, destinations
    
    def parse_trips(readline: Callable[[], bytes]) -> dict:
        token_gen = tokenize(readline)
        assert_token_type(next(token_gen), ENCODING)
        trips = dict()
        while(True):
            origin, destinations = parse_trip(token_gen)
            if(origin is not None and destinations is not None):
                trips[origin] = destinations
            else:
                break
    
        return trips
    

    Your implementation would then look like this:

    import pprint
    
    with open("trips.dat", "rb") as trips_file:
        trips = parse_trips(trips_file.readline)
        pprint.pprint(
            trips
        )
    

    This produces the expected result:

    {'RCM': {'JCK': ['SF3']}, 'TRO': {'GFN': ['SF3'], 'SYD': ['SF3', 'DH4']}}

    This is also more flexible if you later want to add other information to the file.
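One thing that may be worth checking before adopting this approach: `tokenize` applies Python's own lexical rules, so an aircraft code made only of digits (the output in the question shows codes such as 146 and 320) comes back as a NUMBER token rather than NAME, and a check like `assert_token_type(current_token, NAME)` would reject it. The sketch below only inspects token types; `token_kinds` is a hypothetical helper name used for illustration.

```python
import io
from token import tok_name
from tokenize import tokenize


def token_kinds(text):
    """Return (type name, string) pairs for `text`, skipping the ENCODING token."""
    readline = io.BytesIO(text.encode("utf-8")).readline
    return [(tok_name[tok.type], tok.string) for tok in tokenize(readline)][1:]


# Codes that look like Python identifiers are NAME tokens.
letters = token_kinds("SYD SF3 DH4\n")
# An all-digit code is lexed as a NUMBER token instead.
digits = token_kinds("KZN 146\n")

print(letters[:3])
print(digits[:2])
```

Codes that begin with a digit and continue with letters are likewise not guaranteed to come back as a single NAME token, so for data like that a plain `str.split()` per line may be the simpler choice.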

    【Comments】:

      【Solution 3】:
      from itertools import takewhile
      import re
      
      
      def destinations(lines):
          if next(lines).startswith('DESTINATIONS BEGIN'):
              dest = takewhile(lambda l: not l.startswith('DESTINATIONS END'), lines)
              yield from map(str.split, dest)
      
      
      def sources(lines):
          source = re.compile(r'SOURCE:\s*(\w+)')
          while m := source.match(next(lines, '')):
              yield (m.group(1),
                     {dest: crafts for dest, *crafts in destinations(lines)})
      
      
      with open('file_path', encoding='utf-8') as lines:
          handout_routes = dict(sources(lines))
      print(handout_routes)
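This answer comes without commentary, so here is a small sketch of the mechanism it appears to rely on: `itertools.takewhile` pulls items from the one shared iterator and also consumes the first item that fails the predicate. After a destinations block has been read, the `DESTINATIONS END` line is therefore already gone and the next `next(lines)` call lands on the following `SOURCE:` line. The list of lines is made-up sample input.

```python
from itertools import takewhile

# A shared iterator over lines, as a file object would provide.
lines = iter([
    "DESTINATIONS BEGIN",
    "JCK SF3",
    "DESTINATIONS END",
    "SOURCE: TRO",
])

next(lines)  # skip the BEGIN marker

# Collect lines until the END marker; the marker itself is consumed and dropped.
block = list(takewhile(lambda l: not l.startswith("DESTINATIONS END"), lines))

# The iterator is now positioned on the line after the END marker.
remaining = next(lines)

print(block)
print(remaining)
```

The block must be fully consumed before the next source is read, which the dict comprehension inside `sources` takes care of by building the inner dictionary eagerly.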
      

      【Comments】:
