【Question Title】: Python: nested key-value data parsing
【Posted】: 2016-02-29 09:13:12
【Question】:

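I'm trying to write a Python script that can parse log entries of the following kind, which consist of keys and values. Each key may or may not have another nested set of key-value pairs as its value. An example is shown below. The nesting depth varies with the log I receive, so the parser has to be dynamic. Each level of nesting is, however, enclosed in curly braces.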

The string of keys and values I will have looks like this:

   Countries =     {
    "USA" = 0;
    "Spain" = 0;
    Connections = 1;
    Flights =         {
        "KLM" = 11;
        "Air America" = 15;
        "Emirates" = 2;
        "Delta" = 3;
    };
    "Belgium" = 1;
    "Czech Republic" = 0;
    "Netherlands" = 1;
    "Hungary" = 0;
    "Luxembourg" = 0;
    "Italy" = 0;

};

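The data above can also contain multiple levels of nesting. I want to write a function that parses it into an array (or a similar data structure) so that I can get the value of a specific key, for example: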

    print countries.belgium
          value should be printed as 1

Similarly,

    print countries.flights.delta
          value should be printed as 3.

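Note that not every key in the input is quoted (for example Connections and Flights).

Any pointers on where I could start? Is there a Python library that can already do this kind of parsing?

【Comments】:

    Tags: python parsing nested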


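    【Solution 1】:

    I've created a sample Python script that does the job; feel free to adapt it. It converts your format into a nested dictionary and copes with any nesting depth.

    Take a look here (Pastebin code):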

    import re
    import ast
    
    data = """ { Countries = { USA = 1; "Connections" = { "1 Flights" = 0; "10 Flights" = 0; "11 Flights" = 0; "12 Flights" = 0; "13 Flights" = 0; "14 Flights" = 0; "15 Flights" = 0; "16 Flights" = 0; "17 Flights" = 0; "18 Flights" = 0; "More than 25 Flights" = 0; }; "Single Connections" = 0; "No Connections" = 0; "Delayed" = 0; "Technical Fault" = 0; "Others" = 0; }; }"""
    
    
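    def arrify(string):
        # Rewrite the log syntax as a Python dict literal: "=" -> ":", ";" -> ",".
        string = string.replace("=", " : ")
        string = string.replace(";", " , ")
        string = string.replace("\"", "")
        stringDict = string.split()
        newArr = []
        quoteClosed = True
        for i, splitStr in enumerate(stringDict):
            # The first token is assumed to be the opening "{".
            if i > 0 and not isDelim(splitStr):
                # Open a quote on the first word that follows a delimiter.
                if isDelim(newArr[i-1]) and quoteClosed:
                    splitStr = "\"" + splitStr
                    quoteClosed = False
    
                # Close the quote on the last word before the next delimiter
                # (or at the end of the input, which avoids an IndexError).
                isLast = i + 1 >= len(stringDict)
                if (isLast or isDelim(stringDict[i+1])) and not quoteClosed:
                    splitStr += "\""
                    quoteClosed = True
    
            newArr.append(splitStr)
    
        newString = " ".join(newArr)
        newDict = ast.literal_eval(newString)
        return normalizeDict(newDict)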
    
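    def isDelim(string):
        # Exact match against the four delimiter tokens; a substring test
        # against "{:,}" would also accept "" or ":,".
        return str(string) in ("{", ":", ",", "}")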
    
    
    def normalizeDict(dic):
        for key, value in dic.items():
            if type(value) is dict:
                dic[key] = normalizeDict(value)
                continue
            dic[key] = normalize(value)
        return dic
    
    def normalize(string):
        # Convert numeric strings to int; leave everything else unchanged.
        try:
            return int(string)
        except ValueError:
            return string
    
    print(arrify(data))
    

    Result for the sample data (the key order depends on the Python version):

    {'Countries': {'USA': 1, 'Technical Fault': 0, 'No Connections': 0, 'Delayed': 0, 'Connections': {'17 Flights': 0, '10 Flights': 0, '11 Flights': 0, 'More than 25 Flights': 0, '14 Flights': 0, '15 Flights': 0, '12 Flights': 0, '18 Flights': 0, '16 Flights': 0, '1 Flights': 0, '13 Flights': 0}, 'Single Connections': 0, 'Others': 0}}
    

    You can then read the values just as you would from a normal dict :) Hope it helps...
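
The question asks for access in the style of `countries.flights.delta`. A minimal sketch of how the nested dict could be wrapped to allow that is shown below. `AttrDict` is a hypothetical helper, not part of the answer's script, and the `parsed` dict is written out by hand here to stand in for the parser's result.

```python
class AttrDict(dict):
    """Hypothetical helper: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        # Match case-insensitively so that .belgium finds the key "Belgium".
        for key, value in self.items():
            if key.lower() == name.lower():
                return AttrDict(value) if isinstance(value, dict) else value
        raise AttributeError(name)


# Stand-in for the parsed result of the data in the question.
parsed = {
    "Countries": {
        "USA": 0,
        "Connections": 1,
        "Flights": {"KLM": 11, "Air America": 15, "Emirates": 2, "Delta": 3},
        "Belgium": 1,
    }
}

countries = AttrDict(parsed["Countries"])
print(countries["Flights"]["Delta"])  # plain dict access, prints 3
print(countries.belgium)              # attribute-style access, prints 1
print(countries.flights.delta)        # nested attribute access, prints 3
```

Keys that contain spaces, such as "Air America", are not valid attribute names, so they still have to be read with the usual `countries.flights["Air America"]` indexing.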

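    【Comments】:

    • You really need to include the code in your answer. Just linking to it is not enough.
    • @richmondwang, exactly what I wanted. However, this time my dynamic string is as follows, and it gives me a syntax error:
    • What data did you pass? @user2605278
    • Ah. It's because of the numeric values at the start of the keys. I'll modify it.
    • Just wrap your data in { data_string } so you won't get a parse error :)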
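    【Solution 2】:

    Iterate over the data and check whether each element is itself a set of key-value pairs; if it is, call the function recursively. Something like this: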

    def parseNestedData(data):
        if isinstance(data, dict):
            # Recurse into every value of a nested dict.
            for k in data.keys():
                parseNestedData(data.get(k))
        else:
            # Leaf value: print it.
            print(data)
    

    Output:

    >>> Countries =     {
    "USA" : 0,
    "Spain" : 0,
    "Connections" : 1,
    "Flights" :         {
        "KLM" : 11,
        "Air America" : 15,
        "Emirates" : 2,
        "Delta" : 3,
    },
    "Belgium" : 1,
    "Czech Republic" : 0,
    "Netherlands" : 1,
    "Hungary" : 0,
    "Luxembourg" : 0,
    "Italy" :0
    };
    
    >>> Countries
    {'Connections': 1,
    'Flights': {'KLM': 11, 'Air America': 15, 'Emirates': 2, 'Delta': 3},
     'Netherlands': 1,
    'Italy': 0,
    'Czech Republic': 0,
    'USA': 0,
    'Belgium': 1,
    'Hungary': 0,
    'Luxembourg': 0, 'Spain': 0}
    >>> parseNestedData(Countries)
    1
    11
    15
    2
    3
    1
    0
    0
    0
    1
    0
    0
    0
    

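    【Comments】:

    • Thanks Himanshu. How can I get the value for, say, Czech Republic (it should just give me 0)?
    • Does this also need some preprocessing? Not all keys are enclosed in double quotes, for example Connections.
    • If you know the Czech Republic key exists at the first level, just do data.get('Czech Republic')
    • Any key present in data must be immutable, i.e. it can be of type string, integer, or tuple. A bare Connections is not valid, which is why I edited the question.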
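
The `data.get('Czech Republic')` suggestion in the comments only covers keys at the first level. When the depth of a key is not known in advance, the same recursive idea can return a value instead of printing it. The sketch below assumes the data is already a nested dict; `find_key` is a hypothetical name, not something from the answer.

```python
def find_key(data, wanted):
    """Return the first value stored under `wanted` at any depth, or None."""
    if not isinstance(data, dict):
        return None
    if wanted in data:
        return data[wanted]
    # Not at this level: search every nested dict in turn.
    for value in data.values():
        found = find_key(value, wanted)
        if found is not None:  # compare with None so that a value of 0 counts
            return found
    return None


countries = {
    "USA": 0,
    "Connections": 1,
    "Flights": {"KLM": 11, "Air America": 15, "Emirates": 2, "Delta": 3},
    "Czech Republic": 0,
}

print(find_key(countries, "Czech Republic"))  # prints 0
print(find_key(countries, "Delta"))           # prints 3
```

One limitation of this sketch is that it cannot tell a missing key apart from a key whose stored value is `None`.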
    【Solution 3】:

    Defining a class to process and store the information can give you something like this:

    import re
    
    class datastruct():
        def __init__(self,data_in):
            # Capture the body of the "Flights = { ... };" block.
            flights = re.findall(r'(?:Flights\s=\s*\{)([\s"A-Z=0-9;a-z]*)};',data_in)
            flight_dict = {}
            for flight in flights[0].split(';')[0:-1]:
                key,val = self.split_data(flight)
                flight_dict[key] = val
    
            # Every quoted key with a one- or two-digit value, flights included.
            countries = re.findall(r'("[A-Za-z]+\s?[A-Za-z]*"\s=\s[0-9]{1,2})',data_in)
            countries_dict = {}
            for country in countries:
                key,val = self.split_data(country)
                # Skip entries already collected as flights.
                if key not in flight_dict:
                    countries_dict[key]=val
    
            connections = re.findall(r'(?:Connections\s=\s)([0-9]*);',data_in)
            self.country= countries_dict
            self.flight = flight_dict
            self.connections = int(connections[0])
    
        def split_data(self,data2):
            # Split '"key" = value' into a clean key and an integer value.
            item = data2.split('=')
            key = item[0].strip().strip('"')
            val = int(item[1].strip())
            return key,val
    

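    Note that the regular expressions may need adjusting if the data is not exactly what I have assumed below. The data can be set up and referenced as follows: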

    raw_data = 'Countries =     {    "USA" = 0;    "Spain" = 0;    Connections = 1;    Flights =         {        "KLM" = 11;        "Air America" = 15;        "Emirates" = 2;        "Delta" = 3;    };    "Belgium" = 1;    "Czech Republic" = 0;    "Netherlands" = 1;    "Hungary" = 0;    "Luxembourg" = 0;    "Italy" = 0;};'
    
    flight_data = datastruct(raw_data)
    print("No. Connections:",flight_data.connections)
    print("Country 'USA':",flight_data.country['USA'],'\n')
    print("Flight 'KLM':",flight_data.flight['KLM'],'\n')
    
    for country in flight_data.country.keys():
        print("Country: {0} -> {1}".format(country,flight_data.country[country]))
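
The regular expressions above are tied to the key names Flights and Connections and to a single level of nesting. As an alternative that is not part of this answer, a small tokenizer with a recursive parser can follow the braces to any depth. The sketch below assumes the syntax shown in the question (`key = value;` entries, blocks in `{ }`, keys optionally quoted); the names `TOKEN_RE`, `tokenize`, `parse_block` and `parse_log` are made up for illustration.

```python
import re

# A token is a quoted string, a punctuation mark, or a bare word/number.
TOKEN_RE = re.compile(r'"([^"]*)"|([{}=;])|([^\s{}=;"]+)')


def tokenize(text):
    """Yield (kind, value) pairs; kind is 'str', 'punct' or 'word'."""
    for quoted, punct, word in TOKEN_RE.findall(text):
        if punct:
            yield ("punct", punct)
        elif word:
            yield ("word", word)
        else:
            yield ("str", quoted)


def parse_block(tokens):
    """Parse `key = value;` entries until a closing brace or end of input."""
    result = {}
    for kind, key in tokens:
        if (kind, key) == ("punct", "}"):
            break
        if (kind, key) == ("punct", ";"):
            continue  # separator after a value or after a nested block
        if kind == "punct":
            raise ValueError("unexpected %r where a key was expected" % key)
        if next(tokens, None) != ("punct", "="):
            raise ValueError("expected '=' after key %r" % key)
        kind, value = next(tokens, (None, None))
        if (kind, value) == ("punct", "{"):
            result[key] = parse_block(tokens)  # nested block: recurse
        elif kind == "word" and re.fullmatch(r"-?[0-9]+", value):
            result[key] = int(value)           # unquoted integer
        elif kind in ("word", "str"):
            result[key] = value                # anything else stays a string
        else:
            raise ValueError("missing value for key %r" % key)
    return result


def parse_log(text):
    # tokenize() returns a generator, so the recursion shares one token stream.
    return parse_block(tokenize(text))


sample = '''
Countries =     {
    "USA" = 0;
    Connections = 1;
    Flights =         {
        "KLM" = 11;
        "Air America" = 15;
    };
    "Czech Republic" = 0;

};
'''

data = parse_log(sample)
print(data["Countries"]["Flights"]["Air America"])  # prints 15
print(data["Countries"]["Connections"])             # prints 1
```

Because the parser only looks at braces, `=` and `;`, it needs no changes when new keys or deeper levels appear in the log. Quoted values are deliberately kept as strings, and malformed input such as an unterminated quote is not handled in this sketch.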
    

    【Comments】:
