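【Posted】: 2021-09-09 18:00:59
【Question】:
I am analyzing semi-structured data, so I had to flatten XML and JSON files into pandas DataFrames. Now that the analysis is done and I have made improvements such as removing null values and fixing some data errors, I need to generate an XML or JSON file again (depending on the format the user supplied).
This is what I use to flatten the XML: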
import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd

def flatten_xml(node, key_prefix=()):
    """
    Walk an XML node, generating tuples of key parts and values.
    """
    # Copy tag content if any
    text = (node.text or '').strip()
    if text:
        yield key_prefix, text
    # Copy attributes
    for attr, value in node.items():
        yield key_prefix + (attr,), value
    # Recurse into children
    for child in node:
        yield from flatten_xml(child, key_prefix + (child.tag,))

def dictify_key_pairs(pairs, key_sep='.'):
    """
    Dictify key pairs from flatten_xml, taking care of duplicate keys.
    """
    out = {}
    # Group by candidate key.
    key_map = defaultdict(list)
    for key_parts, value in pairs:
        key_map[key_sep.join(key_parts)].append(value)
    # Figure out the final dict with suffixes if required.
    for key, values in key_map.items():
        if len(values) == 1:  # No need to suffix keys.
            out[key] = values[0]
        else:  # More than one value for this key.
            for suffix, value in enumerate(values, 1):
                out[f'{key}{key_sep}{suffix}'] = value
    return out

# Parse XML with etree
tree = et.parse('NCT00571389.xml').iter()
# Generate flat rows out of the root nodes in the tree
rows = [dictify_key_pairs(flatten_xml(row)) for row in tree]
df = pd.DataFrame(rows)
This is what I use to flatten the JSON:
import json
import pandas as pd

def flatten_json(nested_json, exclude=('',)):
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            # Dict keys are joined with '.'
            for a in x:
                if a not in exclude:
                    flatten(x[a], name + a + '.')
        elif isinstance(x, list):
            # List items are keyed by their position
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing separator from the key
            out[name[:-1]] = x

    flatten(nested_json)
    return out

with open('employee_data.json') as f:
    this_dict = json.load(f)

# Flatten every record stored under the first top-level key
first_key = list(this_dict.keys())[0]
df = pd.DataFrame([flatten_json(x) for x in this_dict[first_key]])
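What I need to know is how to go from the DataFrame back to the original structure of the file. Can anyone help?
Edit: here is a sample of the JSON file I am working with: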
{"features": [{"candidate": {"first_name": "Margaret", "last_name": "Mcdonald", "skills": ["skLearn", "Java", "R", "SQL", "Spark", "C++"], "state": "AL", "specialty": "Database", "experience": "Mid", "relocation": "no"}}, {"candidate": {"first_name": "Michael", "last_name": "Carter", "skills": ["TensorFlow", "R", "Spark", "MongoDB", "C++", "SQL"], "state": "AR", "specialty": "Statistics", "experience": "Senior", "relocation": "yes"}}]}
These are the columns I get after flattening:
candidate.first_name
candidate.last_name
candidate.skills.0
candidate.skills.1
candidate.skills.2
candidate.skills.3
candidate.skills.4
candidate.skills.5
candidate.state
candidate.specialty
candidate.experience
candidate.relocation
candidate.skills.6
candidate.skills.7
candidate.skills.8
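Given column names like the ones above, one way to get back to nested JSON is to split each name on the separator and regroup the parts. The following is only a minimal sketch, not a tested solution for the real file: `unflatten_row` and `_dicts_to_lists` are hypothetical helper names, the tiny DataFrame is made-up sample data in the same shape, and the top-level key `"features"` is passed in by hand because the flattening step discarded it. It assumes that a group whose keys are all digits was originally a list, and that NaN cells are padding for shorter lists.

```python
import json
import pandas as pd

def _dicts_to_lists(node):
    """Turn dicts whose keys are all digits back into lists (assumed to be list indices)."""
    if not isinstance(node, dict):
        return node
    converted = {k: _dicts_to_lists(v) for k, v in node.items()}
    if converted and all(k.isdigit() for k in converted):
        return [converted[k] for k in sorted(converted, key=int)]
    return converted

def unflatten_row(flat, sep="."):
    """Rebuild one nested record from a dict of dotted column names."""
    root = {}
    for key, value in flat.items():
        if pd.isna(value):  # missing cell, e.g. padding for a shorter list
            continue
        parts = key.split(sep)
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return _dicts_to_lists(root)

# Made-up sample in the same shape as the flattened columns
df = pd.DataFrame([
    {"candidate.first_name": "Margaret",
     "candidate.skills.0": "skLearn", "candidate.skills.1": "Java"},
    {"candidate.first_name": "Michael",
     "candidate.skills.0": "TensorFlow"},
])

records = [unflatten_row(row) for row in df.to_dict(orient="records")]
document = {"features": records}  # top-level key restored manually
print(json.dumps(document, indent=2))
```

If a list item was deleted during cleaning, the remaining items are simply compacted in index order. Integer columns may come back as NumPy integer types, which `json.dumps` does not serialize on its own, so those values would need converting first.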
【Discussion】:
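- Have you checked the pandas.DataFrame.to_json method? pandas.pydata.org/docs/reference/api/…
- @FlavioMoraes It doesn't work, because my columns look like candidate.first_name or candidate.skills.0, so I need to split them and then join the keys that share the same prefix.
- Can you provide an example of the JSON? It would make it easier to see what it looks like before and after flattening.
- @FlavioMoraes Hello, I just added an example.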
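For the XML side of the question, the same split-and-regroup idea from the comments could look roughly like the sketch below. It is only an approximation under stated assumptions: `unflatten_to_xml` is a hypothetical helper, the root tag has to be supplied by the caller, and the sample dict is invented. The flattening is lossy (attributes and child elements end up with the same kind of key), so this sketch writes every key part as a child element and treats a trailing numeric part, as produced by `dictify_key_pairs`, as a repeated sibling.

```python
import xml.etree.ElementTree as et

def unflatten_to_xml(flat, root_tag, key_sep="."):
    """Rebuild an element tree from one flat row of dotted keys (lossy sketch)."""
    root = et.Element(root_tag)
    for key, value in flat.items():
        if value is None or value != value:  # skip None and NaN cells
            continue
        parts = key.split(key_sep) if key else []
        # A trailing number is assumed to be a duplicate-key suffix
        repeat = bool(parts) and parts[-1].isdigit()
        if repeat:
            parts = parts[:-1]
        node = root
        for depth, part in enumerate(parts):
            is_leaf = depth == len(parts) - 1
            # Repeated leaves always get a new sibling; otherwise reuse a match
            child = None if (is_leaf and repeat) else node.find(part)
            if child is None:
                child = et.SubElement(node, part)
            node = child
        node.text = str(value)
    return root

# Invented row in the style produced by dictify_key_pairs
flat = {
    "id": "NCT1",
    "condition.1": "A",
    "condition.2": "B",
    "sponsor.name": "X",
}
root = unflatten_to_xml(flat, "study")
xml_text = et.tostring(root, encoding="unicode")
print(xml_text)
```

Restoring attributes, or repeated parent elements that each hold several children, would need extra information that the flat column names do not carry, for example a naming convention applied at flattening time.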