How to do this aggregation in Python

【Question title】: How to do this aggregation in Python
【Posted】: 2015-03-11 03:09:19
【Question】:

I have the following JSON response.

{
"2015-03-08": {
"www.ndtv.com": [
{
"traffic": 100,
"name": "Server1"
}
],
"www.profit.ndtv.com": [
{
"traffic": 49.69,
"name": "Server1"
},
{
"traffic": 50.31,
"name": "Server2"
}
]
},
"2015-03-03": {
"www.ndtv.com": [
{
"traffic": 100,
"name": "Server1"
}
],
"www.profit.ndtv.com": [
{
"traffic": 50.11,
"name": "Server1"
},
{
"traffic": 49.89,
"name": "Server2"
},
{
"traffic": 0,
"name": "Server3"
}
]
},
"2015-03-05": {
"www.ndtv.com": [
{
"traffic": 100,
"name": "Server1"
}
],
"www.profit.ndtv.com": [
{
"traffic": 50.36,
"name": "Server1"
},
{
"traffic": 49.64,
"name": "Server2"
}
]
},
"2015-03-04": {
"www.ndtv.com": [
{
"traffic": 100,
"name": "Server1"
}
],
"www.profit.ndtv.com": [
{
"traffic": 50.79,
"name": "Server1"
},
{
"traffic": 49.21,
"name": "Server2"
}
]
},
"2015-03-07": {
"www.ndtv.com": [
{
"traffic": 100,
"name": "Server1"
}
],
"www.profit.ndtv.com": [
{
"traffic": 51.48,
"name": "Server1"
},
{
"traffic": 48.52,
"name": "Server2"
}
]
},
"2015-03-06": {
"www.ndtv.com": [ ],
"www.profit.ndtv.com": [
{
"traffic": 50.96,
"name": "Server1"
},
{
"traffic": 49.04,
"name": "Server2"
}
]
}
}

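I need to aggregate the data for each day. For example, for 2015-03-08 I want to add up all the traffic for Server1, so in my example it would be (100+49.69)/2. I divide by 2 because Server1 occurs twice, and I want to store the result under the parent domain. In this case the output would be: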

{
"2015-03-08": {
"www.ndtv.com": [
{
"traffic": 74.85,
"name": "Server1"
},
{
"traffic": 50.31,
"name": "Server2"
}
]
}
}

I am confused about how to do this in Python.

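【Comments】:

It would help if your JSON were correct. As posted it is badly broken, because the keys are not quoted, so Python's json module cannot load it. Can you fix the source of that very broken JSON, or do you need to hack a fix...?
I need to fix it on my side, because I cannot modify the source that produces the JSON response.
Added the correct JSON.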

Tags: python json

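【Solution 1】:

Added: the OP has now edited the Q to show correct JSON, but implies that the pseudo-JSON he actually has to deal with is the broken kind he showed before, with unquoted keys; so I am leaving the start of the A alone, since it may help him deal with the actually broken JSON.

First, you need a hack to fix the badly broken JSON you show: the keys are not quoted! Per the JSON standard they must be (and they must be for Python's json module to be able to load the JSON into a Python data structure). Fortunately, if the example you show is representative, the breakage is quite systematic and can be fixed.

Assuming x is the string you show: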

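import json
import re

# Wrap every bare key (a run of non-space, non-colon characters before ':') in quotes.
# This assumes no key is quoted yet, as in the originally posted broken input.
z = re.sub(r'([^:\s]+):', r'"\1":', x)
y = json.loads(z)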

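Now y holds the data structure you need.

So now your task is easier (depending on your exact specs; for example, I will assume that the shortest domain in each day's sub-dictionary is the one you want to aggregate under, and that the order of the servers does not matter. Of course these are guesses, and you need to explain your specs more precisely :-).

With these guesses...: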
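If the real input mixes quoted and unquoted keys, the substitution above would quote the already quoted keys a second time. A minimal sketch of a more tolerant variant follows; the helper name quote_bare_keys and the sample string are hypothetical, and it assumes that keys always follow a `{` or a `,` and that no string value contains a comma followed by a bare word and a colon.

```python
import json
import re

# Hypothetical sample that mixes quoted and unquoted keys.
broken = '''
{
2015-03-06: {
"www.ndtv.com": [ ],
www.profit.ndtv.com: [
{
traffic: 50.96,
name: "Server1"
}
]
}
}
'''

# A bare key: it follows '{' or ',', does not start with a double quote,
# contains no whitespace or JSON punctuation, and is followed by ':'.
BARE_KEY = re.compile(r'([{,]\s*)([^"\s:{}\[\],]+)\s*:')


def quote_bare_keys(text):
    """Quote unquoted object keys and leave already quoted keys alone."""
    return BARE_KEY.sub(r'\1"\2":', text)


data = json.loads(quote_bare_keys(broken))
print(data["2015-03-06"]["www.profit.ndtv.com"])
```

Because keys that already start with a double quote never match, running the helper on valid JSON should leave it unchanged.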

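import collections

res = {}
for d in y:
    dd = y[d]  # all domains recorded for day d
    # Guess: the shortest domain name is the parent domain to aggregate under.
    dom = min(dd, key=len)
    res[d] = {dom: []}
    # Collect every traffic value seen for each server, across all domains.
    serv_traf = collections.defaultdict(list)
    for subdom in dd:
        for ddd in dd[subdom]:
            serv_traf[ddd['name']].append(ddd['traffic'])
    # Store each server's average traffic under the parent domain.
    for serv in serv_traf:
        traf = serv_traf[serv]
        restraf = sum(traf) / len(traf)
        res[d][dom].append({'name': serv, 'traffic': restraf})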

should do what you want. For instance, with your example broken x,

import pprint
pprint.pprint(res)

shows:

{'2015-03-03': {'www.ndtv.com': [{'name': 'Server1', 'traffic': 75.055},
                                 {'name': 'Server2', 'traffic': 49.89},
                                 {'name': 'Server3', 'traffic': 0}]},
 '2015-03-04': {'www.ndtv.com': [{'name': 'Server1', 'traffic': 75.395},
                                 {'name': 'Server2', 'traffic': 49.21}]},
 '2015-03-05': {'www.ndtv.com': [{'name': 'Server1', 'traffic': 75.18},
                                 {'name': 'Server2', 'traffic': 49.64}]},
 '2015-03-06': {'www.ndtv.com': [{'name': 'Server1', 'traffic': 50.96},
                                 {'name': 'Server2', 'traffic': 49.04}]},
 '2015-03-07': {'www.ndtv.com': [{'name': 'Server1', 'traffic': 75.74},
                                 {'name': 'Server2', 'traffic': 48.52}]},
 '2015-03-08': {'www.ndtv.com': [{'name': 'Server1', 'traffic': 74.845},
                                 {'name': 'Server2', 'traffic': 50.31}]}}

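This seems to be at least an approximation of what you have in mind.

Of course, to convert it back to JSON you can use json.dumps(res), which will inevitably give you correct JSON... hopefully you do not need to break it again to imitate the broken kind you started with?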
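To check the arithmetic for a single day in isolation, the same logic can be wrapped in a function. This is only a sketch under the same guesses as above (the shortest domain is the parent, server order is irrelevant); the name aggregate_day is hypothetical, and Python 3 division is assumed.

```python
from collections import defaultdict


def aggregate_day(day_data):
    """Average each server's traffic across all domains of one day."""
    # Guess: the shortest domain name is the parent domain.
    parent = min(day_data, key=len)
    per_server = defaultdict(list)
    for entries in day_data.values():
        for entry in entries:
            per_server[entry["name"]].append(entry["traffic"])
    # Sort by server name only to make the output order predictable.
    return {
        parent: [
            {"name": name, "traffic": sum(values) / len(values)}
            for name, values in sorted(per_server.items())
        ]
    }


# The 2015-03-08 entry from the question.
day = {
    "www.ndtv.com": [{"traffic": 100, "name": "Server1"}],
    "www.profit.ndtv.com": [
        {"traffic": 49.69, "name": "Server1"},
        {"traffic": 50.31, "name": "Server2"},
    ],
}
result = aggregate_day(day)
print(result)
```

Server1 appears twice, so its value is (100 + 49.69) / 2, while Server2 appears once and keeps its single value of 50.31.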
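As a small illustration of that last step, here is a sketch that serializes one day of the result and loads it back. The literal res below is a hand-copied sample of the output shown earlier, and the indent and sort_keys arguments are only there to make the text easier to read.

```python
import json

# Hand-copied sample: one day of the aggregated result.
res = {
    "2015-03-08": {
        "www.ndtv.com": [
            {"name": "Server1", "traffic": 74.845},
            {"name": "Server2", "traffic": 50.31},
        ]
    }
}

# json.dumps always emits quoted keys, so the text is valid JSON.
text = json.dumps(res, indent=2, sort_keys=True)
print(text)

# Loading the text back should reproduce the same structure.
round_tripped = json.loads(text)
```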

【Comments】: