【Question title】: python: how to merge dict in list of dicts based on value
【Posted】: 2018-05-04 05:12:54
【Question】:

I have a list of dicts, where each dict has 3 keys: name, url, and location.
Only the value of 'name' can repeat across dicts; the 'url' and 'location' values are always distinct throughout the list.

Example:

[
{"name":"A1", "url":"B1", "location":"C1"}, 
{"name":"A1", "url":"B2", "location":"C2"}, 
{"name":"A2", "url":"B3", "location":"C3"},
{"name":"A2", "url":"B4", "location":"C4"}, ...
]  

I want to group them by the value of 'name', as shown below.

Expected:

[
{"name":"A1", "url":"B1, B2", "location":"C1, C2"},
{"name":"A2", "url":"B3, B4", "location":"C3, C4"},
]

(The actual list contains more than 2,000 dicts.)

I would be glad to get this solved.
Any suggestions/answers would be appreciated.

Thanks in advance.

Tags: python dictionary grouping


【Solution 1】:

Since your dataset is relatively small, I don't think time complexity is a big concern here, so you could consider the following code.

from collections import defaultdict
given_data = [
    {"name":"A1", "url":"B1", "location":"C1"}, 
    {"name":"A1", "url":"B2", "location":"C2"}, 
    {"name":"A2", "url":"B3", "location":"C3"},
    {"name":"A2", "url":"B4", "location":"C4"},
] 
# Group the original dicts by their 'name' value.
D = defaultdict(list)
for item in given_data:
    D[item['name']].append(item)
result = []
for name, items in D.items():
    # Join with ', ' to match the expected output in the question.
    urls = ", ".join(pp['url'] for pp in items)
    locations = ", ".join(pp['location'] for pp in items)
    result.append({'name': name, 'url': urls, 'location': locations})
print(result)

    【Solution 2】:

    With an auxiliary grouping dict (Python >= 3.5):

    data = [
        {"name":"A1", "url":"B1", "location":"C1"}, 
        {"name":"A1", "url":"B2", "location":"C2"}, 
        {"name":"A2", "url":"B3", "location":"C3"},
        {"name":"A2", "url":"B4", "location":"C4"}
    ]
    
    groups = {}
    for d in data:
        if d['name'] not in groups:
            groups[d['name']] = {'url': d['url'], 'location': d['location']}
        else:
            groups[d['name']]['url'] += ', ' + d['url']
            groups[d['name']]['location'] += ', ' + d['location']
    result = [{**{'name': k}, **v} for k, v in groups.items()]
    
    print(result)
    

    Output:

    [{'name': 'A1', 'url': 'B1, B2', 'location': 'C1, C2'}, {'name': 'A2', 'url': 'B3, B4', 'location': 'C3, C4'}]
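
The same single-pass idea can be wrapped in a function that does not hardcode 'url' and 'location'. This is only a sketch: the name `merge_by_key` and its `key` and `sep` parameters are hypothetical, it assumes every dict has the same keys and string values, and it relies on dicts preserving insertion order (Python 3.7+) for the order of the result.

```python
def merge_by_key(rows, key="name", sep=", "):
    """Merge dicts that share the same `key` value, joining the other values with `sep`."""
    groups = {}
    for row in rows:
        k = row[key]
        if k not in groups:
            # First time this key value is seen: copy the row so the input is not mutated.
            groups[k] = dict(row)
        else:
            merged = groups[k]
            for field, value in row.items():
                if field != key:
                    merged[field] = merged[field] + sep + value
    return list(groups.values())


data = [
    {"name": "A1", "url": "B1", "location": "C1"},
    {"name": "A1", "url": "B2", "location": "C2"},
    {"name": "A2", "url": "B3", "location": "C3"},
    {"name": "A2", "url": "B4", "location": "C4"},
]
result = merge_by_key(data)
print(result)
```

Copying the first row of each group keeps the input list untouched, which matters if `data` is reused afterwards.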
    

      【Solution 3】:

      Where res is:

      [{'location': 'C1', 'name': 'A1', 'url': 'B1'},
       {'location': 'C2', 'name': 'A1', 'url': 'B2'},
       {'location': 'C3', 'name': 'A2', 'url': 'B3'},
       {'location': 'C4', 'name': 'A2', 'url': 'B4'}]
      

      You can use a defaultdict to process the data and unpack the result in a list comprehension:

      from collections import defaultdict
      
      result = defaultdict(lambda: defaultdict(list))
      
      for items in res:
          result[items['name']]['location'].append(items['location'])
          result[items['name']]['url'].append(items['url'])
      
      final = [
          {'name': name, **{inner_names: ' '.join(inner_values) for inner_names, inner_values in values.items()}}
          for name, values in result.items()
      ]
      

      final is:

      In [57]: final
      Out[57]:
      [{'location': 'C1 C2', 'name': 'A1', 'url': 'B1 B2'},
       {'location': 'C3 C4', 'name': 'A2', 'url': 'B3 B4'}]
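
To make the intermediate step easier to follow, here is a small sketch that prints what the nested defaultdict holds before it is joined into strings. The variable `plain` is introduced only for display and is not part of the answer above.

```python
from collections import defaultdict

res = [
    {'location': 'C1', 'name': 'A1', 'url': 'B1'},
    {'location': 'C2', 'name': 'A1', 'url': 'B2'},
    {'location': 'C3', 'name': 'A2', 'url': 'B3'},
    {'location': 'C4', 'name': 'A2', 'url': 'B4'},
]

# Outer key: the name; inner key: the field; inner value: list of collected strings.
result = defaultdict(lambda: defaultdict(list))
for items in res:
    result[items['name']]['location'].append(items['location'])
    result[items['name']]['url'].append(items['url'])

# Convert the nested defaultdicts to plain dicts, just so they print cleanly.
plain = {name: dict(values) for name, values in result.items()}
print(plain)
```

Because the values are kept as lists until the final comprehension, the separator is a one-word change: joining with ', ' instead of ' ' would give the format shown in the question.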
      

        【Solution 4】:

        Following @Yaroslav Surzhikov's comment, here is a solution using itertools.groupby:

        from itertools import groupby
        
        dicts = [
            {"name":"A1", "url":"B1", "location":"C1"},
            {"name":"A1", "url":"B2", "location":"C2"},
            {"name":"A2", "url":"B3", "location":"C3"},
            {"name":"A2", "url":"B4", "location":"C4"},
        ]
        
        def merge(dicts):
            new_list = []
            # groupby only groups consecutive items, so sort by 'name' first
            ordered = sorted(dicts, key=lambda x: x['name'])
            for key, group in groupby(ordered, lambda x: x['name']):
                items = list(group)
                new_list.append({
                    'name': key,
                    'url': ', '.join(item.get('url', '') for item in items),
                    'location': ', '.join(item.get('location', '') for item in items),
                })
            return new_list
        
        print(merge(dicts))
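
One thing to keep in mind with this approach: itertools.groupby only groups consecutive items that share a key, so the input has to be ordered by name (as the sample data happens to be) or sorted by it. A minimal illustration on a made-up list of names:

```python
from itertools import groupby

names = ["A1", "A2", "A1"]  # not sorted: the two "A1" entries are not adjacent

# Without sorting, "A1" shows up as two separate groups.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(names)]
print(unsorted_groups)  # [('A1', 1), ('A2', 1), ('A1', 1)]

# Sorting first brings equal keys together, so each name forms a single group.
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(names))]
print(sorted_groups)  # [('A1', 2), ('A2', 1)]
```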
        

          【Solution 5】:

          Something like this? A small deviation: I prefer to store the urls and locations in a list inside resDict rather than in a concatenated str.

          myDict = [
          {"name":"A1", "url":"B1", "location":"C1"}, 
          {"name":"A1", "url":"B2", "location":"C2"}, 
          {"name":"A2", "url":"B3", "location":"C3"},
          {"name":"A2", "url":"B4", "location":"C4"}
          ]
          
          resDict = []
          
          def getKeys(d):
              arr = []
              for row in d:
                  arr.append(row["name"])
              ret = list(set(arr))
              return ret
          
          def filteredDict(d, k):
              arr = []
              for row in d:
                  if row["name"] == k:
                      arr.append(row)
              return arr
          
          def compressedDictRow(rowArr):
              urls = []
              locations = []
              name = rowArr[0]['name']
          
              for row in rowArr:
                 urls.append(row['url'])
                 locations.append(row['location'])
              return {"name":name,"urls":urls, "locations":locations}
          
          keys = getKeys(myDict)
          
          for key in keys:
              rowArr = filteredDict(myDict,key)
              row = compressedDictRow(rowArr)
              resDict.append(row)
          print(resDict)
          

          Output (printed on a single line; wrapped here for readability):

          [
              {'name': 'A2', 'urls': ['B3', 'B4'], 'locations': ['C3', 'C4']}, 
              {'name': 'A1', 'urls': ['B1', 'B2'], 'locations': ['C1', 'C2']}
          ]
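
The code above rescans the whole list once per distinct name, and `list(set(arr))` does not preserve the original order of the names (hence A2 before A1 in the output). A possible single-pass variant that keeps the same list-based result is sketched below; the function name `compress` is hypothetical, and the result order relies on dicts preserving insertion order (Python 3.7+).

```python
def compress(rows):
    """Group rows by 'name' in one pass, collecting urls and locations into lists."""
    merged = {}
    for row in rows:
        # setdefault creates the entry on first sight of a name and returns it afterwards.
        entry = merged.setdefault(
            row["name"], {"name": row["name"], "urls": [], "locations": []}
        )
        entry["urls"].append(row["url"])
        entry["locations"].append(row["location"])
    return list(merged.values())


rows = [
    {"name": "A1", "url": "B1", "location": "C1"},
    {"name": "A1", "url": "B2", "location": "C2"},
    {"name": "A2", "url": "B3", "location": "C3"},
    {"name": "A2", "url": "B4", "location": "C4"},
]
compressed = compress(rows)
print(compressed)
```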
          

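            【Solution 6】:

            Here is a variant (even I find it hard to read; it feels like scratching the right side of my head with my left hand, but at this point I don't know how to make it shorter) that uses itertools: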

            >>> import itertools
            >>> import pprint
            >>>
            >>> pprint.pprint(initial_list)
            [{'location': 'C1', 'name': 'A1', 'url': 'B1'},
             {'location': 'C2', 'name': 'A1', 'url': 'B2'},
             {'location': 'C3', 'name': 'A2', 'url': 'B3'},
             {'location': 'C4', 'name': 'A2', 'url': 'B4'}]
            >>>
            >>> NAME_KEY = "name"
            >>>
            >>> final_list = [list(itertools.accumulate(group_list, func=lambda x, y: {key: x[key] if key == NAME_KEY else " ".join([x[key], y[key]]) for key in x}))[-1] \
            ...     for group_list in [list(group[1]) for group in itertools.groupby(sorted(initial_list, key=lambda x: x[NAME_KEY]), key=lambda x: x[NAME_KEY])]]
            >>>
            >>> pprint.pprint(final_list)
            [{'location': 'C1 C2', 'name': 'A1', 'url': 'B1 B2'},
             {'location': 'C3 C4', 'name': 'A2', 'url': 'B3 B4'}]
            

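            Rationale:

            • Group the dicts in the initial list by the value of the name key (itertools.groupby)
              • For this to work, the list must first be sorted by that same value (sorted)
            • For each such group of dicts, compute their "sum" (itertools.accumulate)
              • The func argument "sums" 2 dicts, key by key:
                • If the key is name, just take the value from the 1st dict (it is the same in both anyway)
                • Otherwise, concatenate the 2 values (strings) with a space in between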
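
To see what the "sum" step does on its own, the sketch below runs itertools.accumulate over a single group and prints every intermediate result. The helper name `merge_pair` and the third row (B5/C5) are made up for illustration; the one-liner above only keeps the last element of this sequence.

```python
from itertools import accumulate


def merge_pair(x, y, name_key="name"):
    # Keep the name from the first dict; join every other value with a space.
    return {key: x[key] if key == name_key else " ".join([x[key], y[key]]) for key in x}


group = [
    {"name": "A1", "url": "B1", "location": "C1"},
    {"name": "A1", "url": "B2", "location": "C2"},
    {"name": "A1", "url": "B5", "location": "C5"},  # hypothetical extra row
]

# accumulate yields the running result after each item; the last one is the full merge.
steps = list(accumulate(group, func=merge_pair))
for step in steps:
    print(step)
```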

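            Notes:

            • The dicts must stay homogeneous (all of them must have the same structure, i.e. the same keys)
            • Only the name key is hardcoded (however, if you add other keys whose values are not strings, you will have to adjust func as well)
            • It could be split up for readability
            • Not sure about the lambdas (performance-wise)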
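
As one way the expression might be split for readability, here is a sketch that names each step. It swaps `list(itertools.accumulate(...))[-1]` for `functools.reduce`, which returns only the final merged dict; the function names `merge_pair` and `merge_groups` are hypothetical and not part of the answer above.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

NAME_KEY = "name"


def merge_pair(x, y):
    # Keep the name from the first dict; join every other value with a space.
    return {key: x[key] if key == NAME_KEY else " ".join([x[key], y[key]]) for key in x}


def merge_groups(rows):
    get_name = itemgetter(NAME_KEY)
    ordered = sorted(rows, key=get_name)  # groupby needs equal names to be adjacent
    # reduce folds each group into a single dict, pair by pair.
    return [reduce(merge_pair, group) for _, group in groupby(ordered, key=get_name)]


initial_list = [
    {"location": "C1", "name": "A1", "url": "B1"},
    {"location": "C2", "name": "A1", "url": "B2"},
    {"location": "C3", "name": "A2", "url": "B3"},
    {"location": "C4", "name": "A2", "url": "B4"},
]
final_list = merge_groups(initial_list)
print(final_list)
```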
