【Question Title】: Python: Compare & Count Dictionary Structures Across Thousands of Dictionaries/XMLs/JSON
【Posted】: 2018-09-18 15:10:43
【Question】:

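I am parsing thousands of XML files into dictionaries and storing their structures as JSON.

They share much of the same structure, but use an unknown number of different tag-naming schemes. Across these thousands of files, tag names are abbreviated in many different ways.

I need to find out how many different tags are used to describe each piece of information so that I can parse them all correctly.

To do this, I want to build one master dictionary of the XMLs/dictionaries that contains every variant of each tag name, ideally along with how often each one occurs across the thousands of XMLs/dictionaries.

Here is a small sample from one of the dictionaries: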

{
    "Header": {
        "Ts": {},
        "PeriodEndDt": {},
        "PreparedBy": {
            "PreparerID": {},
            "PreparerFirmName": {
                "BusinessNameLine1Txt": {}
            },
            "PreparerAddress": {
                "AddLn1Txt": {},
                "CityName": {},
                "StateAbbreviationCd": {},
                "ZIPCd": {}
            }
        },
        "FormTypeCd": {},
        "PeriodBeginDt": {},
        "Filer": {
            "UniqueID": {},
            "BusinessName": {
                "BusinessNameLine1Txt": {}
            },
            "BusinessNameControlTxt": {},
            "PhoneNum": {},
            "USAddress": {
                "AddressLine1Txt": {},
                "CityNm": {},
                "StateAbbreviationCd": {},
                "ZIPCd": {}
            }
        },

        "FormData": {
            "FormCodeType": {
                "BizType": {},
                "AssetsAtEOY": {},
                "AccountingMethod": {},
                "RevenueAndExpenses": {
                    "ScheduleBNotReqd": {},
                    "DivsRevAndExpenses": {},
                    "DivsNetInvstIncomeAmt": {},
                    "NetGainSaleAstRevAndExpnssAmt": {},
                    "RevsOvrExpenses": {},
                    "NetInvestmentIncomeAmt": {}
                },
                "BalanceSheetGroup": {
                    "CashInvstBOYAmt": {},
                    "CashInvstEOYAmt": {},
                    "CashInvstEOYFMVAmt": {},
                    "OtherInvestmentsBOYAmt": {},
                    "OtherInvestmentsEOYAmt": {},
                    "CapitalStockEOYAmt": {},
                    "TotalLiabilitiesNetAstEOYAmt": {}
                },
                "ChangeNetAssetsFundGroup": {
                    "NetAssettFundBalancesBOYAmt": {},
                    "ExcessRevExpensesAmt": {},
                    "OtherIncreasesAmt": {},
                    "SubtotalAmt": {},
                    "OtherDecreasesAmt": {},
                    "TotNetAstOrFundBalancesEOYAmt": {}
                },
                "CapGainsLossTxInvstIncmDetail": {
                    "CapGainsLossTxInvstIncmGrp": {
                        "PropertyDesc": {},
                        "HowAcquiredCd": {},
                        "GrossSalesPriceAmt": {},
                        "GainOrLossAmt": {},
                        "GainsMinusExcessOrLossesAmt": {}
                    },
                    "StatementsRegardingActyGrp": {
                        "LegislativePoliticalActyInd": {},
                        "MoreThan100SpentInd": {}
                    },
                    "PhoneNum": {},
                    "LocationOfBooksUSAddress": {
                        "AddressLine1Txt": {},
                        "CityNm": {},
                        "StateAbbreviationCd": {},
                        "ZIPCd": {}
                    },
                    "CorporateDirectorsGrp": {
                        "DirectorsGrp": {
                            "PersonNm": {},
                            "USAddress": {
                                "AddressLine1Txt": {},
                                "CityNm": {},
                                "StateAbbreviationCd": {},
                                "ZIPCd": {}
                            },
                            "EmpPrograms": {
                                "EmployeeBenefitGroupNum": {},
                                "GroupType": {
                                    "GroupElement": {},
                                    "GroupCharacter": {
                                        "GroupNames": {}
                                    }
                                }

                            },
                            "EmpOffice1": {},
                            "EmpOffice2": {},
                            "EmpOffice3": {},
                            "EmpOffice4": {}
                        }


                    }
                }
            }
        }
    }
}
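One way to compare naming variants in structures like the sample above is to flatten each nested dict into dotted paths and tally them. The sketch below is only an illustration: `iter_paths`, `doc_a` and `doc_b` are hypothetical names, and the two tiny dicts stand in for real parsed files.

```python
from collections import Counter

def iter_paths(structure, prefix=""):
    """Yield a dotted path for every key in a nested structure dict."""
    for tag, children in structure.items():
        path = f"{prefix}.{tag}" if prefix else tag
        yield path
        # Empty dicts are leaves, so the recursion yields nothing for them
        yield from iter_paths(children, path)

# Two hypothetical structure dicts that spell the same field differently
doc_a = {"Header": {"Filer": {"USAddress": {"CityNm": {}}}}}
doc_b = {"Header": {"Filer": {"USAddress": {"CityName": {}}}}}

path_counts = Counter()
for doc in (doc_a, doc_b):
    path_counts.update(iter_paths(doc))

# Paths that appear in only some documents point to naming variants
for path, count in sorted(path_counts.items()):
    print(count, path)
```

Sorting the paths puts siblings such as `CityNm` and `CityName` next to each other, which makes the variants easy to spot by eye.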

The code I currently use to create the dictionary/JSON looks like this:

import xml.etree.ElementTree as ET

strip_ns = lambda xx: str(xx).split('}', 1)[1]
tree = ET.parse('xmlpath.xml')
root = tree.getroot()


tierdict = {}
for tier1 in root:
    tier1var = strip_ns(tier1.tag)
    tierdict[tier1var] = {}
    for tier2 in tier1:
        tier2var = strip_ns(tier2.tag)
        tierdict[tier1var][tier2var] = {}
        for tier3 in tier2:
            tier3var = strip_ns(tier3.tag)
            tierdict[tier1var][tier2var][tier3var] = {}
            for tier4 in tier3:
                tier4var = strip_ns(tier4.tag)
                tierdict[tier1var][tier2var][tier3var][tier4var] = {}
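The nested loops above stop at four levels, while the sample dictionary goes deeper. A recursive version can follow the tree to any depth. This is a minimal sketch, assuming the same "tag name maps to a dict of child tags" layout; `build_structure` and the inline `sample` document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def strip_ns(tag):
    """Drop a leading '{namespace}' from a tag, if there is one."""
    return tag.split('}', 1)[-1]

def build_structure(element):
    """Return {child_tag: {grandchild_tag: ...}} for any nesting depth."""
    # Iterating an Element yields its direct children
    return {strip_ns(child.tag): build_structure(child) for child in element}

# Hypothetical document, only for illustration
sample = """
<Return xmlns="http://example.com/ns">
  <Header>
    <Ts/>
    <Filer><USAddress><CityNm/></USAddress></Filer>
  </Header>
</Return>
"""
root = ET.fromstring(sample)
tierdict = build_structure(root)
print(tierdict)
```

As in the loop version, repeated sibling tags collapse into a single key. Using `[-1]` instead of `[1]` in `strip_ns` also keeps tags without a namespace from raising an `IndexError`.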

The output I would like to see looks something like this:

{
    "Header": {
        "Header.Count": 5672,
        "Ts": {
            "Ts.Count": 3365
            },
        "Ss": {
            "Ss.Count": 2328
            },
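A layout like the one above could be produced by merging each per-file structure dict into a master dict. The following is a rough sketch under that assumption: `add_to_master` and the three small `docs` are hypothetical, and the `"<tag>.Count"` key naming simply mirrors the desired output.

```python
def add_to_master(master, structure):
    """Merge one structure dict into master, bumping '<tag>.Count' per tag."""
    for tag, children in structure.items():
        # Create the entry with a zero count the first time a tag is seen
        entry = master.setdefault(tag, {f"{tag}.Count": 0})
        entry[f"{tag}.Count"] += 1
        add_to_master(entry, children)
    return master

# Hypothetical structure dicts using two spellings of the same field
docs = [
    {"Header": {"Ts": {}}},
    {"Header": {"Ts": {}}},
    {"Header": {"Ss": {}}},
]

master = {}
for doc in docs:
    add_to_master(master, doc)

print(master)
```

Because each structure dict holds a tag at most once per parent, the counts here mean "number of files that contain this tag at this position", not the number of element occurrences.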

【Comments】:

    Tags: python json xml dictionary elementtree


    【Solution 1】:

    I would probably do a recursive search over the elements you want, defined as follows:

    def get_elements(json_entry, child_elements=()):
        # Base case: no siblings left to process
        if not child_elements:
            return json_entry

        el, other_children = child_elements[0], child_elements[1:]

        # getchildren() is lxml's (deprecated) API; list(el) gives the same list
        children = el.getchildren()
        rec = json_entry.get(el.tag, {})
        # Count every occurrence, whether or not the element has children
        count = rec.pop("Count", 0) + 1
        # Recurse into the existing record so nested counts accumulate
        json_entry[el.tag] = {"Count": count, **get_elements(rec, children)}

        return get_elements(json_entry, other_children)
    

    That way, you only need to pass in the children of the XML's root element:

    from lxml import etree
    
    with open("myxml.xml", "r") as fh:
        tree = etree.parse(fh)
    
    root = tree.getroot()
    
    root_children = root.getchildren()
    
    child_recs = get_elements({}, root_children)
    
    # child_recs then looks like this, for example:
    # {'tagOne': {'Count': 1}, 'tagTwo': {'Count': 1, 'tagThree': {'Count': 1}, 'tagFour': {'Count': 1, 'tagFive': {'Count': 1}}}}
    

    If you want to wrap the root element around it, do this:

    master_lookup = {root.tag: {"Count": 1, **child_recs}}
    

    This can easily be extended to a for loop over many files:

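    import os
    
    master_lookup = {}
    
    # os.walk yields (dirpath, dirnames, filenames) tuples, not file paths;
    # `path` is the directory that holds the XML files
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            if not name.endswith(".xml"):
                continue
            tree = etree.parse(os.path.join(dirpath, name))
            root = tree.getroot()
            root_entry = master_lookup.get(root.tag, {"Count": 0})
            # One more file with this root tag
            root_count = root_entry.pop("Count") + 1
            master_lookup[root.tag] = {"Count": root_count,
                                       **get_elements(root_entry, root.getchildren())}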
    

    Something like that.
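For readers without lxml, the same idea can be sketched with the standard library alone. The version below walks children by iterating the element directly rather than calling `getchildren()`, which was removed from `xml.etree.ElementTree` in Python 3.9. `count_children` and the in-memory `documents` are hypothetical stand-ins, not part of the answer above.

```python
import xml.etree.ElementTree as ET

def count_children(entry, element):
    """Add the tags below `element` to `entry`, accumulating 'Count' values."""
    for child in element:  # iterating an Element yields its direct children
        rec = entry.setdefault(child.tag, {"Count": 0})
        rec["Count"] += 1
        count_children(rec, child)
    return entry

# Hypothetical in-memory documents standing in for the files on disk
documents = [
    "<root><tagOne/><tagTwo><tagThree/></tagTwo></root>",
    "<root><tagOne/><tagTwo><tagFour/></tagTwo></root>",
]

master_lookup = {}
for text in documents:
    root = ET.fromstring(text)
    root_rec = master_lookup.setdefault(root.tag, {"Count": 0})
    root_rec["Count"] += 1
    count_children(root_rec, root)

print(master_lookup)
```

Mutating one shared dict across all documents is what lets the counts accumulate. Note that a real tag literally named `Count` would collide with the counter key, so a different key name may be needed for such data.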

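    【Discussion】:

    • Thanks for the detailed reply! Where do you get getchildren() from? Everywhere I've looked, it isn't part of Python 3.
    • getchildren is a method of Element (part of etree), so every Element in the tree has it. It returns a list of all child elements (an empty list if there are none), which is why the unpacking doesn't raise an IndexError.
    • I'm running it on Python 3.6, if that matters.
    • Thanks again for the help. I'm still working on getting this to run. Right now I must be doing something wrong in the loop: the final value of master_lookup seems to come from only one file, every count is 1, and various fields from the other files don't appear in that dictionary.
    • So the loop over the files isn't 100% correct. I'll make an edit.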