【Question title】: Parsing XML file to CSV with as little hard-coding as possible
【Posted】: 2021-02-14 00:17:07
【Question description】:

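I want to parse an XML file and convert it to CSV while hard-coding as few tags as possible.

I need these specific column names to be hard-coded: 'InfoGroup', 'InfoRegister', 'RegisterType', 'Measures', 'Description', 'GeneratedOn'

InfoGroup is the content of the name tag, e.g. RecordingSystem, Ports, etc.

InfoRegister is the sub-name located inside the row tag, e.g. closedFileCount, processedFileCount, etc.

RegisterType is the name of the tag that carries that sub-name (in the XML below: usage, lwGauge, hwGauge, waterMark, counter).

Measures is just the measures tag.

Description is just the description tag.

GeneratedOn is the content of the generatedOn tag, e.g. sessmgr, rtpportal, etc.

If the XML contains any other or new tags, I would like them to be added to the CSV automatically.

My current implementation is almost entirely hard-coded, and I cannot get it to work properly. Please run the code with my XML to see what the CSV actually looks like.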

<?xml version="1.0" encoding="UTF-8"?>

<infoconfig xmlns="urn:nortel:namespaces:mcp:oms" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:nortel:namespaces:mcp:oms OMSchema.xsd" >

        <group>
                <name>RecordingSystem</name>
                <row>
                        <package>com.nortelnetworks.mcp.ne.base.recsystem.fw.system</package>
                        <class>RecSysFileOMRow</class>
                        <usage name="closedFileCount" hasThresholds="true">
                                <measures>
                                        closed file count
                                </measures>
                                <description>
                                        This register counts the number
                                        of closed files in the spool directory of a
                                        particular stream and a particular system.
                                        Files in the spool directory store the raw
                                        OAM records where they are sent to the
                                        Element Manager for formatting.
                                </description>
                                <notes>
                                        Minor and major alarms
                                        when the value of closedFileCount
                                        exceeds certain thresholds. Configure
                                        the threshold values for minor and major
                                        alarms for this OM through engineering
                                        parameters for minorBackLogCount and
                                        majorBackLogCount, respectively. These
                                        engineering parameters are grouped under
                                        the parameter group of Log, OM, and
                                        Accounting for the logs’ corresponding
                                        system.
                                </notes>
                        </usage>
                        <usage name="processedFileCount" hasThresholds="true">
                                <measures>
                                        Processed file count
                                </measures>
                                <description>
                                        The register counts the number
                                        of processed files in the spool directory of
                                        a particular stream and a particular system.
                                        Files in the spool directory store the raw
                                        OAM records and then send the records to
                                        the Element Manager for formatting.
                                </description>
                        </usage>
                </row>
                <documentation>
                        <description>
                                Rows of this OM group provide a count of the number of files contained
                                within the directory (which is the OM row key value).
                        </description>
                        <rowKey>
                                The full name of the directory containing the files counted by this row.
                        </rowKey>
                </documentation>
                <generatedOn>
                        <all/>
                </generatedOn>
        </group>
        <group traffic="true">
                <name>Ports</name>
                <row>
                        <package>com.nortelnetworks.ims.cap.mediaportal.host</package>
                        <class>PortsOMRow</class>
                        <usage name="rtpMpPortUsage">
                                <measures>
                                        BCP port usage
                                </measures>
                                <description>
                                        Meter showing number of ports in use.
                                </description>
                        </usage>
                        <lwGauge name="connMapEntriesLWM">
                                <measures>
                                        Lowest simultaneous port usage
                                </measures>
                                <description>
                                        Lowest number of
                                        simultaneous ports detected to be in
                                        use during the collection interval
                                </description>
                        </lwGauge>
                        <hwGauge name="connMapEntriesHWM">
                                <measures>
                                        Highest simultaneous port usage
                                </measures>
                                <description>
                                        Highest number of
                                        simultaneous ports detected to be in
                                        use during the collection interval.
                                </description>
                        </hwGauge>
                        <waterMark name="connMapEntries">
                                <measures>
                                        Connections map entries
                                </measures>
                                <description>
                                        Meter showing the number of connections in the host
                                        CPU connection map.
                                </description>
                                <bwg lwref="connMapEntriesLWM" hwref="connMapEntriesHWM"/>
                        </waterMark>
                        <counter name="portUsageSampleCnt">
                                <measures>
                                    Usage sample count
                                </measures>
                                <description>
                                    The number of 100-second samples taken during the
                                    collection interval contributing to the average report.
                                </description>
                        </counter>
                        <counter name="sampledRtpMpPortUsage">
                                <measures>
                                    In-use ports usage
                                </measures>
                                <description>
                                    Provides the sum of the in-use ports every 100 seconds.
                                </description>
                        </counter>
                        <precollector>
                                <package>com.nortelnetworks.ims.cap.mediaportal.host</package>
                                <class>PortsOMCenturyPrecollector</class>
                                <collector>centurySecond</collector>
                        </precollector>
                </row>
                <documentation>
                        <description>
                        </description>
                        <rowKey>
                        </rowKey>
                </documentation>
                <generatedOn>
                        <list>
                            <ne>sessmgr</ne>
                            <ne>rtpportal</ne>
                        </list>
                </generatedOn>
        </group>
        <group traffic="true">
            <name>SASIPPBXTrunkGroupCallMgmt</name>
            <row>
                <package>com.nortelnetworks.ims.cap.svc.sippbx.fsm</package>
                <class>StandAloneSipPbxTrunkGroupOMRow</class>
                <hwGauge name="callAttemptsHighForOrigination">
                    <measures></measures>
                    <description></description>
                </hwGauge>
                <waterMark name="callAttemptsForOrigination">
                    <measures> Number of Call attempts </measures>
                    <description>> This counter will keep track of incoming call attempts of Trunk Group  to or from a SIPPBX node </description>
                    <bwg lwref="callAttemptsLowForOrigination" hwref="callAttemptsHighForOrigination"/>
                </waterMark>
                <lwGauge name="callAttemptsLowForOrigination">
                    <measures></measures>
                    <description></description>
                </lwGauge>  
                <hwGauge name="callAttemptsHighForTermination">
                    <measures></measures>
                    <description></description>
                </hwGauge>
                <waterMark name="callAttemptsForTermination">
                    <measures> Number of Call attempts </measures>
                    <description>> This counter will keep track of outgoing call attempts of Trunk Group  to or from a SIPPBX node </description>
                    <bwg lwref="callAttemptsLowForTermination" hwref="callAttemptsHighForTermination"/>
                </waterMark>
                <lwGauge name="callAttemptsLowForTermination">
                    <measures></measures>
                    <description></description>
                </lwGauge>  
                <hwGauge name="activeCallsHighForOrigination">
                    <measures></measures>
                    <description></description>
                </hwGauge>
                <waterMark name="activeCallsForOrigination">
                    <measures> Number of Incoming Active calls </measures>
                    <description>> This counter will keep track of incoming active call of Trunk Group  to or from a SIPPBX node </description>
                    <bwg lwref="activeCallsLowForOrigination" hwref="activeCallsHighForOrigination"/>
                </waterMark>
                <lwGauge name="activeCallsLowForOrigination">
                    <measures></measures>
                    <description></description>
                </lwGauge>  
                <hwGauge name="activeCallsHighForTermination">
                    <measures></measures>
                    <description></description>
                </hwGauge>
                <waterMark name="activeCallsForTermination">
                    <measures> Number of Outgoing Active calls </measures>
                    <description>> This counter will keep track of outgoing call active call of Trunk Group  to or from a SIPPBX node </description>
                    <bwg lwref="activeCallsLowForTermination" hwref="activeCallsHighForTermination"/>
                </waterMark>
                <lwGauge name="activeCallsLowForTermination">
                    <measures></measures>
                    <description></description>
                </lwGauge>  
                <counter name="deniedCallsDueToCapacityForOrigination">
                    <measures>Number of Denied Calls due to capacity </measures>
                    <description>This counter will keep track denied for incoming call attempts of Trunk Group  to or from a SIPPBX node </description>
                </counter>
                <counter name="deniedCallsDueToCapacityForTermination">
                    <measures>Number of Denied Calls due to capacity </measures>
                    <description>This counter will keep track denied for outgoing call attempts of Trunk Group  to or from a SIPPBX node </description>
                </counter>
                <counter name="failoverRouteCallAttempts">
                    <measures>Number of FailOverRoute Call  attempts </measures>
                    <description>This counter will keep track of FailOverRoute Call attempts of Trunk Group  for a SIPPBX node </description>
                </counter>
            </row>
            <documentation>
                <description></description>
                <rowKey></rowKey>
            </documentation>
            <generatedOn>
                <list>
                    <ne>sessmgr</ne>
                </list>
            </generatedOn>
        </group>
       
</infoconfig>

from bs4 import BeautifulSoup
import re
import csv


def extract_data_from_report3():
    # Note: 'lxml' is BeautifulSoup's HTML parser, so tag names are lowercased
    # (item.name yields 'lwgauge' rather than 'lwGauge').
    with open('infoconfig.xml', 'r') as xmlfile:
        soup = BeautifulSoup(xmlfile, 'lxml')

    with open('data2.csv', 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(['InfoGroup:InfoRegister', 'InfoGroup', 'InfoRegister', 'RegisterType', 'Measures', 'Description', 'GeneratedOn'])

        # Every element inside <row> that carries a name attribute is a register.
        for item in soup.select('row [name]'):
            group_name = item.find_previous('name').text
            # Fall back to '' (not the string 'None') when <description> is missing.
            desc = getattr(item.find('description'), 'text', '')
            desc = re.sub(r'\s{2,}', ' ', desc)
            generatedOn = ','.join(ne.get_text(strip=True) for ne in item.find_parent('group').select('ne'))

            writer.writerow([group_name + ':' + item['name'], group_name, item['name'], item.name, item.find('measures').get_text(strip=True), desc, generatedOn])

        print("File successfully converted to CSV")

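[Screenshot of the problem]

Any help would be greatly appreciated.

【Comments】:

  • Sorry, I have not understood your requirements yet. I will take a look tomorrow.
  • @yazz I appreciate it, thank you. Let me know if you need me to clarify anything.

Tags: xml csv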


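【Solution 1】:

I still do not understand the rule for the other new tags you mentioned, but I have rewritten the code following your current logic. We can discuss further on that basis until it produces the result you want.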

from simplified_scrapy import SimplifiedDoc, utils


def extract_data_from_report3():
    # Fixed columns; columns for child tags are appended as they are discovered.
    header = [
        'InfoGroup:InfoRegister', 'InfoGroup', 'InfoRegister', 'RegisterType', 'GeneratedOn'
    ]
    fixed_count = len(header)
    datas = []
    doc = SimplifiedDoc(utils.getFileContent('infoconfig.xml'))
    groups = doc.selects('group')
    for group in groups:
        name = group.select('name>text()')
        # <generatedOn> holds either an empty tag such as <all/> or a list of <ne> entries.
        first = group.select('generatedOn').child
        if not first.child:
            generatedOn = first.tag
        else:
            generatedOn = ','.join(first.selects('ne>text()'))

        # Children of <row> that have a name attribute are the registers.
        RegisterTypes = group.row.children.containsReg('.+', attr='name')
        for registerType in RegisterTypes:
            extra = {}
            for c in registerType.children:
                if c['tag'] not in header:
                    header.append(c['tag'])
                extra[c['tag']] = c.text

            datas.append([
                '{}:{}'.format(name, registerType['name']), name,
                registerType['name'], registerType['tag'], generatedOn, extra])

    rows = [header]
    for data in datas:
        row = data[:-1]
        extra = data[-1]
        # Fill the dynamic columns in header order; tags a register lacks become None.
        for i in range(fixed_count, len(header)):
            row.append(extra.get(header[i]))
        rows.append(row)

    utils.save2csv('data.csv', rows, newline='')


extract_data_from_report3()
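
For comparison, the same idea (fixed columns plus columns discovered from the child tags) can be sketched with only the Python standard library. This is an illustrative sketch, not the answer's tested method: the helper names `local`, `clean`, `first_child`, `generated_on`, `collect` and `convert` are made up for this example, and it assumes the layout shown in the question, where registers are direct children of `<row>` that carry a `name` attribute.

```python
import csv
import xml.etree.ElementTree as ET

# Columns that are always present; everything else is discovered from the XML.
FIXED_COLUMNS = ["InfoGroup:InfoRegister", "InfoGroup", "InfoRegister",
                 "RegisterType", "GeneratedOn"]


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def clean(text):
    """Collapse whitespace runs; a missing text node becomes ''."""
    return " ".join((text or "").split())


def first_child(elem, name):
    """First direct child with the given local name, or None."""
    return next((c for c in elem if local(c.tag) == name), None)


def generated_on(group):
    """Comma-joined <ne> values, else the tag of the first child (e.g. 'all')."""
    node = first_child(group, "generatedOn")
    if node is None:
        return ""
    names = [clean(e.text) for e in node.iter() if local(e.tag) == "ne"]
    if names:
        return ",".join(names)
    return local(node[0].tag) if len(node) else ""


def collect(root):
    """Return (header, records); new child tags extend the header on the fly."""
    header, records = list(FIXED_COLUMNS), []
    for group in (g for g in root if local(g.tag) == "group"):
        name_node = first_child(group, "name")
        group_name = clean(name_node.text) if name_node is not None else ""
        gen = generated_on(group)
        for row in (r for r in group if local(r.tag) == "row"):
            # Only children with a name attribute are treated as registers.
            for reg in (e for e in row if "name" in e.attrib):
                record = {
                    "InfoGroup:InfoRegister": f"{group_name}:{reg.get('name')}",
                    "InfoGroup": group_name,
                    "InfoRegister": reg.get("name"),
                    "RegisterType": local(reg.tag),
                    "GeneratedOn": gen,
                }
                for child in reg:
                    column = local(child.tag)
                    if column not in header:
                        header.append(column)
                    record[column] = clean(child.text)
                records.append(record)
    return header, records


def convert(xml_path, csv_path):
    """Parse xml_path and write one CSV line per named register."""
    header, records = collect(ET.parse(xml_path).getroot())
    with open(csv_path, "w", newline="", encoding="utf-8") as f_out:
        # restval fills columns that a given register does not have.
        writer = csv.DictWriter(f_out, fieldnames=header, restval="")
        writer.writeheader()
        writer.writerows(records)
```

Calling `convert('infoconfig.xml', 'data.csv')` would write the file. Because the header is only complete after the whole document has been scanned, the records are collected first and written afterwards; `csv.DictWriter` then leaves a cell empty wherever a register lacks a tag.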

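【Comments】:

  • Thanks for the implementation, @yazz, but it does not seem to pick up all of the data. I have updated my XML with another group at the end, so when you run your code against it you will see that some rows are missing.
  • What I mean by new tags is, for example, a new tag that I add to the XML under an existing tag. I would like the script to find it and create a new column and row for it, without hard-coding.
  • @joe_sanders I have updated the answer; please try again.
  • Thanks, but some columns are empty even though they should contain data. Please see the screenshot I added for the missing columns. Thanks again, @yazz.
  • @joe_sanders Sorry, I made some mistakes. It has been fixed now; please try again.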
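
On the empty cells mentioned above: in the sample XML several registers have empty `<measures>` and `<description>` elements, and `<bwg>` carries its data only in attributes, so a column filled from element text alone will be blank for them. Whether that explains the cells missing in the screenshot is not known. As a minimal sketch using the standard library, a hypothetical helper `cell_value` (not part of the answer) could fall back to the attributes when an element has no text:

```python
import xml.etree.ElementTree as ET


def cell_value(elem):
    """Text if the element has any, otherwise its attributes as 'key=value' pairs."""
    text = " ".join((elem.text or "").split())
    if text:
        return text
    # Sort the keys so the output does not depend on attribute order.
    return ";".join(f"{key}={elem.attrib[key]}" for key in sorted(elem.attrib))
```

An element with neither text nor attributes still yields an empty string, which matches the empty `<measures></measures>` entries in the third group.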