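[Question title]: Extracting a tidy data frame from a highly nested XML file
[Posted]: 2022-01-22 16:40:17
[Question description]:

I have a complex, multiply nested XML file that I am trying to extract data from and convert into a data frame for subsequent plotting, analysis, and so on. A solution in either R or Python is fine, but I have never worked with XML files before and am struggling to understand how to extract the data I need (I am reading up on XPath syntax, which is new to me).

I have tried the R packages XML, xml2 and xmltools, and I have also tried Python's ElementTree. Most of the examples I worked through use much simpler XML files, and I have not figured out how to extend the logic to my own case; my attempts ended in a mess.

The structure of the XML file is: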

 ├── XMLFILE
 ├── DATASET
 └── GROUPDATA
  └── GROUP
    ├── METHODDATA
    ├── SAMPLELISTDATA
      ├── SAMPLE
        ├── USERDATA
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
      └── SAMPLE
        ├── USERDATA
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
        ├── COMPOUND
          ├── METHOD
          ├── USERDATA
          └── PEAK
            └── ISPEAK
    └── CALIBRATIONDATA
      ├── COMPOUND
        ├── RESPONSE
        └── CURVE
          └── RESPONSEFACTOR
      └── COMPOUND
        ├── RESPONSE
        └── CURVE
          ├── CALIBRATIONCURVE
          └── DETERMINATION

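I only care about what is in the SAMPLELISTDATA section. Also, I have shown only 2 SAMPLEs, each with 2 COMPOUNDs, but the real file has many of both. Every tag in the tree also has many attributes, and those are what I need to extract data from.

The actual XML is huge, but here is a (somewhat) minimal example: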

<QUANDATASET description="" version="1">
    <XMLFILE filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\quandata.xml" modifieddate="20 Dec 2021" modifiedtime="15:53:06"/>
    <DATASET filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\211220_MAA_Jack.qld" modifieddate="20 Dec 2021" modifiedtime="15:50:10" creationdate="20 Dec 2021" creationtime="14:18:02"/>
    <GROUPDATA count="1">
        <GROUP id="1" name="MAA_JACK">
            <METHODDATA id="1" filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\MethDB\MAA_Jack.mdb" modifieddate="20 Dec 2021" modifiedtime="14:04:55" creationdate="20 Dec 2021" creationtime="14:04:55"/>
            <SAMPLELISTDATA filename="C:\Masslynx  Projects\Polyphenols_Dev.PRO\SampleDB\MAA_211220.SPL" modifieddate="20 Dec 2021" modifiedtime="09:55:58" count="12">
                <SAMPLE id="1" groupid="1" name="MAA_211220_01" createdate="20-Dec-21" createtime="10:00:08" type="Analyte" desc="'Umbilicalis' laver filtrate 7D7" dilutionfac="0.0000000000" extractvolume="0.0000000000" initamount="0.0000000000" injectvolume="2.0000000000" job="MAA_211220" sampleid="" samplenumber="1" stdconc="0.0000000000" stockdilutionfac="0.0000000000" subjecttext="" subjecttime="0.0000000000" userdilutionfac="0.0000000000" vial="1:A,1" inletmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAA_Dev_17" msmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAAs SIR5.EXP" prerunmethodname="" postrunmethodname="" switchmethodname="" hplcmethodname="" tunemethodname="C:\Masslynx  Projects\Histamine_QDA_Dev.PRO\ACQUDB\Default.ipr" fractionlynxname="" instrument="ACQ-QDA#KAD3691" lab="" conditions="" submitter="" task="" user="" reinjections="0" text="'Umbilicalis' laver filtrate 7D7">
                    <COMPOUND id="1" sampleid="1" groupid="1" name="Palythine" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="514" foundrt="1.7100000381" foundrrt="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" area="89222.9220000000" height="1567686.0000000000" response="89222.9220000000" pkflags="MM!" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="20-Dec-21" modifiedtime="14:22:50" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="1.6399999857" endrt="1.7532999516" startht="-10476.0000000000" endht="-10476.0000000000" absresponse="89222.9220000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="11.0944900513" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="318_322" peaks="0" pkwidth="3.0210000000" pksigma="6.3800000000" pkskew="-0.1190000000" pkkurt="-0.4500000000" heightdivarea="17.5704400266" baselinewidth="6.7979979515" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="141303.1146768486" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.0700000003" peaktailwidth="0.0430000015" peakasymmetryvalue="0.6190000176" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="0.0000000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" 
upperbound4="0.0000000000" nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="318_322" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Palythine" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="1" groupid="1"/>
                    </COMPOUND>
                    <COMPOUND id="14" sampleid="1" groupid="1" name="Porphyra 334 SIR" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="161" foundrt="3.3292999268" foundrrt="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" area="2140861.2500000000" height="16134221.0000000000" response="2140861.2500000000" pkflags="bb" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="" modifiedtime="" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="3.1303999424" endrt="3.7107000351" startht="3651.8000000000" endht="16670.4000000000" absresponse="2140861.2500000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="334.2170715332" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="347.1" peaks="0" pkwidth="7.7870000000" pksigma="3.2770000000" pkskew="0.6590000000" pkkurt="1.4860000000" heightdivarea="7.5363225898" baselinewidth="34.8180055618" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="48274.6764729440" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.2000000030" peaktailwidth="0.3799999952" peakasymmetryvalue="1.8999999762" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="6160.2280000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="347.1" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Porphyra 334 SIR" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="1" groupid="1"/>
                    </COMPOUND>
                    <USERDATA sampleid="1" groupid="1"/>
                </SAMPLE>
                <SAMPLE id="2" groupid="1" name="MAA_211220_02" createdate="20-Dec-21" createtime="10:11:04" type="Analyte" desc="'Umbilicalis' laver filtrate 3D9" dilutionfac="0.0000000000" extractvolume="0.0000000000" initamount="0.0000000000" injectvolume="2.0000000000" job="MAA_211220" sampleid="" samplenumber="2" stdconc="0.0000000000" stockdilutionfac="0.0000000000" subjecttext="" subjecttime="0.0000000000" userdilutionfac="0.0000000000" vial="1:A,2" inletmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAA_Dev_17" msmethodname="C:\Masslynx  Projects\Polyphenols_Dev.PRO\ACQUDB\MAAs SIR5.EXP" prerunmethodname="" postrunmethodname="" switchmethodname="" hplcmethodname="" tunemethodname="C:\Masslynx  Projects\Histamine_QDA_Dev.PRO\ACQUDB\Default.ipr" fractionlynxname="" instrument="ACQ-QDA#KAD3691" lab="" conditions="" submitter="" task="" user="" reinjections="0" text="'Umbilicalis' laver filtrate 3D9">
                    <COMPOUND id="1" sampleid="2" groupid="1" name="Palythine" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="517" foundrt="1.7200000286" foundrrt="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" area="69654.0080000000" height="1250121.0000000000" response="69654.0080000000" pkflags="MM!" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="20-Dec-21" modifiedtime="14:24:57" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="1.6000000238" endrt="1.7599999905" startht="0.0000000000" endht="10847.0340000000" absresponse="69654.0080000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="4.1693286896" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="318_322" peaks="0" pkwidth="3.0090000000" pksigma="6.4940000000" pkskew="-0.4530000000" pkkurt="0.7820000000" heightdivarea="17.9475817099" baselinewidth="9.5999979973" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="299837.4781816338" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.1199999973" peaktailwidth="0.0399999991" peakasymmetryvalue="0.3330000043" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="0.0000000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="1.7500000000" predrrt="0.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="318_322" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Palythine" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="2" groupid="1"/>
                    </COMPOUND>
                    <COMPOUND id="14" sampleid="2" groupid="1" name="Porphyra 334 SIR" type="" cas="" stdconc="0.0000000000">
                        <PEAK foundscan="162" foundrt="3.3459000587" foundrrt="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" area="1934833.8750000000" height="14881056.0000000000" response="1934833.8750000000" pkflags="bb" analconc="0.0000000000" empc="0.0000000000" bsanalconc="0.0000000000" conccalc="NaN" modifieddate="" modifiedtime="" modifiedtext="" modifieduser="" peakmass="0.0000000000" startrt="3.1800999641" endrt="3.7107000351" startht="5267.0000000000" endht="16324.8000000000" absresponse="1934833.8750000000" rrtref="0" quanratio="0.0000000000" quanratiopred="1.0000000000" quanratiowin="0.0000000000" ionratio="0.0000000000" ionratiopred="0.0000000000" ionratiowin="0" ionratioflag="0" chromnoise="208.7208557129" detectionthreshold="0.0000000000" detectionflag="0" quanthreshold="0.0000000000" quanflag="0" snlodflag="0" snloqflag="0" rrf="0.0000000000" chromtrace="347.1" peaks="0" pkwidth="7.5160000000" pksigma="3.2120000000" pkskew="0.6470000000" pkkurt="1.3920000000" heightdivarea="7.6911285213" baselinewidth="31.8360042572" peakquality="n/a" peakqualitydesc="" peakqualityref="N" replimflag="0" maxreplimflag="0" recovlimflag="0" matrixblankflag="0" solventblankflag="0" devflag="0" devflagmidconc="0" devflaglowconc="0" qcsignoiseflag="0" qcionratioflag="0" qcrettimeflag="0" qcpeakshapeflag="0" signoise="71296.4497446734" signoiseflag="0" cdflag="0" stddevflag="0" rtflag="0" peakasymmetry="0" peakfrontwidth="0.1669999957" peaktailwidth="0.3639999926" peakasymmetryvalue="2.1860001087" percrecovery="0.0000000000" symflag="" percsym="0.0000000000" belowrl="1" chromnoisehgt="5185.1130000000" concdevperc="0.0000000000" lowerbound1="0.0000000000" lowerbound2="0.0000000000" lowerbound3="0.0000000000" lowerbound4="0.0000000000" mediumbound1="0.0000000000" mediumbound2="0.0000000000" mediumbound3="0.0000000000" mediumbound4="0.0000000000" upperbound1="0.0000000000" upperbound2="0.0000000000" upperbound3="0.0000000000" upperbound4="0.0000000000" 
nosolflag="0" peakmissing="0" peaksinc="0" toxconc1="0.0000000000" toxconc2="0.0000000000" toxconc3="0.0000000000" toxconc4="0.0000000000" toxfactor1="0.0000000000" toxfactor2="0.0000000000" toxfactor3="0.0000000000" toxfactor4="0.0000000000" toxlod1="0.0000000000" toxlod2="0.0000000000" toxlod3="0.0000000000" toxlod4="0.0000000000" toxloq1="0.0000000000" toxloq2="0.0000000000" toxloq3="0.0000000000" toxloq4="0.0000000000" userfactor="1.0000000000" userrf="0.0000000000" picsforward="0" picsreverse="0" iFIT="N/A" iFITnorm="N/A" iFITconfidence="N/A" foundmass="N/A" mDamasserror="N/A" ppmmasserror="N/A" iFitflag="0" iFitnormflag="0" iFitconfflag="0" mDaerrorflag="0" ppmerrorflag="0">
                            <ISPEAK area="" height="" foundrt="" absresponse=""/>
                        </PEAK>
                        <METHOD rref="0.0000000000" predrt="3.3099999428" predrrt="1.0000000000" userfactor="0.0000000000" userrf="0.0000000000" quantrace="347.1" secondarytrace="" useabsmasswin="1" chromasswinabs="1.0000000000" chromasswinppm="10.0000000000" stockconcfactor="0.0000000000" calibref="Porphyra 334 SIR" replim="0.0000000000" replimflag="0" maxreplim="0.0000000000" maxreplimflag="0" minrecovlim="0.0000000000" maxrecovlim="100.0000000000" recovlimflag="0" maxstddev="0.0000000000" signoiseflag="0" mincoeffdet="0.5000000000" cdflag="0" minpeakwidth="0.0000000000" peakwidthtol="0.0000000000" peakwidthflag="0" blanklevel="0.0000000000" stddevflag="0" rtupper="0.0000000000" rtlower="0.0000000000" rtflag="0"/>
                        <USERDATA sampleid="2" groupid="1"/>
                    </COMPOUND>
                    <USERDATA sampleid="2" groupid="1"/>
                </SAMPLE>
            </SAMPLELISTDATA>
            <CALIBRATIONDATA filename="C:\Masslynx  Projects\Caffeine.PRO\CurveDB\Meth1.cdb" modifieddate="25 Sep 2015" modifiedtime="00:20:14" count="2">
                <COMPOUND id="1" name="Compound A ( 430.5 )">
                    <RESPONSE type="External Std" ref="" rah="Area"/>
                    <CURVE type="RF" origin="" weighting="" axistrans="">
                        <RESPONSEFACTOR cc="15552.5556000000" stddev="2208.2674143620" percrelsd="0.1319874310"/>
                    </CURVE>
                </COMPOUND>
                <COMPOUND id="2" name="Compound B ( 458.5 )">
                    <RESPONSE type="Internal Std" ref="1" rah="Area * ( IS Conc. / IS Area )"/>
                    <CURVE type="Linear" origin="Exclude" weighting="1/x" axistrans="None">
                        <CALIBRATIONCURVE curve="0.012594 * x + 0.005516"/>
                        <DETERMINATION rsquared="0.9741537568"/>
                    </CURVE>
                </COMPOUND>
            </CALIBRATIONDATA>
        </GROUP>
    </GROUPDATA>
</QUANDATASET>

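What I would like to end up with is a single data frame (in R or Python/pandas) in which each row holds all the data (attributes) associated with one SAMPLE/COMPOUND pair. In my example above there are 2 samples with 2 compounds each, so the data frame should have 4 rows, with many columns holding all the attributes of every node and child node associated with each pair.

A list of data frames (one per sample) would also work, but the sample name would need to be associated with each data frame in that list, so I think one big data frame is probably easier.

Any help, insights, hints, or suggestions would be greatly appreciated.

[Question discussion]:

  • It would be easier to help if we had the XML file and examples of what you have already tried
  • I have provided the XML code. It is a shortened version with fewer records, but it has the complete file structure. I am working on MREs of all the main approaches I have tried and will add them once they are done.
  • What is the expected output? (based on the XML you posted)
  • Please post the expected output data frame

Tags: python r pandas xml xpath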


[Solution 1]:

Here is what I managed to do, based on a previous question I asked.

## Load dplyr, which provides %>%, tibble() and bind_rows()
library(dplyr)

## Import the data
# Here, test.xml is the XML you provided
data <- xml2::read_xml("Z:/temp/test.xml")

## Isolate SAMPLELISTDATA as a list
## (QUANDATASET -> GROUPDATA -> GROUP -> SAMPLELISTDATA)
data_list2 <- xml2::as_list(data)[[1]][[3]][[1]][[2]]

## Create the empty output tibble
output_desired <- tibble(foundscan = character(),
                         area = character(),
                         height = character())


## Function to get the attributes of one node
fusion_et_gestion <- function(y) {

  ## We choose the attributes we want to keep
  ## (exact = TRUE prevents partial matching on attribute names)
  foundscan <- attr(y, "foundscan", exact = TRUE)
  area <- attr(y, "area", exact = TRUE)
  height <- attr(y, "height", exact = TRUE)

  ## Output as tibble (zero columns when the node lacks these attributes)
  tibble(foundscan = foundscan,
         area = area,
         height = height)
}


## Using for loops here, but purrr::map(1:2, ...) would be faster for your real data
## j indexes the SAMPLEs, i indexes the COMPOUNDs within each SAMPLE
for (j in 1:2) {
  for (i in 1:2) {
    test <- data_list2[[j]][[i]] %>%
      purrr::map_dfr(fusion_et_gestion)

    output_desired <- bind_rows(output_desired, test)
  }
}


## Output:

# A tibble: 4 x 3
  foundscan area               height             
  <chr>     <chr>              <chr>              
1 514       89222.9220000000   1567686.0000000000 
2 161       2140861.2500000000 16134221.0000000000
3 517       69654.0080000000   1250121.0000000000 
4 162       1934833.8750000000 14881056.0000000000
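Since the question also accepts Python, the same three-column extraction can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration, not the method used above: the XML string is an abbreviated copy of the question's sample (same nesting, most attributes dropped), and the names XML, WANTED and rows are made up for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's XML: same nesting, most attributes dropped.
XML = """
<QUANDATASET>
  <GROUPDATA>
    <GROUP id="1">
      <SAMPLELISTDATA>
        <SAMPLE id="1" name="MAA_211220_01">
          <COMPOUND id="1" name="Palythine">
            <PEAK foundscan="514" area="89222.9220000000" height="1567686.0000000000"/>
          </COMPOUND>
          <COMPOUND id="14" name="Porphyra 334 SIR">
            <PEAK foundscan="161" area="2140861.2500000000" height="16134221.0000000000"/>
          </COMPOUND>
        </SAMPLE>
        <SAMPLE id="2" name="MAA_211220_02">
          <COMPOUND id="1" name="Palythine">
            <PEAK foundscan="517" area="69654.0080000000" height="1250121.0000000000"/>
          </COMPOUND>
          <COMPOUND id="14" name="Porphyra 334 SIR">
            <PEAK foundscan="162" area="1934833.8750000000" height="14881056.0000000000"/>
          </COMPOUND>
        </SAMPLE>
      </SAMPLELISTDATA>
    </GROUP>
  </GROUPDATA>
</QUANDATASET>
"""

# Attributes to keep from each PEAK element (hypothetical selection).
WANTED = ("foundscan", "area", "height")

root = ET.fromstring(XML)

# The path is relative to the root element (QUANDATASET) and spells out
# every level down to PEAK, so COMPOUNDs under CALIBRATIONDATA are ignored.
rows = [
    {key: peak.get(key) for key in WANTED}
    for peak in root.findall("./GROUPDATA/GROUP/SAMPLELISTDATA/SAMPLE/COMPOUND/PEAK")
]

for row in rows:
    print(row)
```

The values stay strings, as in the tibble above. If pandas is available, passing rows to pandas.DataFrame should give a comparable 4 x 3 table.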

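However:

  1. If you want to keep the attributes of the ISPEAK node, you need to add a line inside the fusion_et_gestion function and specify the level of y, i.e. y[[1]]. Make sure the name you give that attribute is not reused elsewhere in the function (ISPEAK has its own area and height attributes).
  2. I did not find a way to include all the attributes other than typing them one by one. Since there are 196 of them, one idea would be to add another function inside fusion_et_gestion that retrieves every attribute name and its value. This could be done with map(list_of_attribute, function_to_get_values).

To get the list of attributes, you can: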
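One way around typing the attributes one by one, sketched here in Python rather than R, is to copy every attribute of each element into the row and prefix the column name with the element's tag. The prefix scheme (SAMPLE.name, PEAK.area, ISPEAK.area, and so on) and the helper name prefixed are assumptions made for this example, and the sketch assumes each tag occurs at most once inside a COMPOUND, as in the sample file. The XML is again an abbreviated copy of the question's data.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's XML (most attributes dropped).
XML = """
<QUANDATASET>
  <GROUPDATA>
    <GROUP id="1">
      <SAMPLELISTDATA>
        <SAMPLE id="1" name="MAA_211220_01">
          <COMPOUND id="1" name="Palythine">
            <PEAK foundscan="514" area="89222.9220000000"><ISPEAK area=""/></PEAK>
            <METHOD predrt="1.7500000000"/>
          </COMPOUND>
          <COMPOUND id="14" name="Porphyra 334 SIR">
            <PEAK foundscan="161" area="2140861.2500000000"><ISPEAK area=""/></PEAK>
            <METHOD predrt="3.3099999428"/>
          </COMPOUND>
        </SAMPLE>
        <SAMPLE id="2" name="MAA_211220_02">
          <COMPOUND id="1" name="Palythine">
            <PEAK foundscan="517" area="69654.0080000000"><ISPEAK area=""/></PEAK>
            <METHOD predrt="1.7500000000"/>
          </COMPOUND>
          <COMPOUND id="14" name="Porphyra 334 SIR">
            <PEAK foundscan="162" area="1934833.8750000000"><ISPEAK area=""/></PEAK>
            <METHOD predrt="3.3099999428"/>
          </COMPOUND>
        </SAMPLE>
      </SAMPLELISTDATA>
      <CALIBRATIONDATA>
        <COMPOUND id="1" name="Compound A ( 430.5 )"/>
      </CALIBRATIONDATA>
    </GROUP>
  </GROUPDATA>
</QUANDATASET>
"""


def prefixed(element, prefix):
    """Return the element's attributes with '<prefix>.' prepended to each name."""
    return {f"{prefix}.{name}": value for name, value in element.attrib.items()}


root = ET.fromstring(XML)
rows = []

# Only SAMPLEs under SAMPLELISTDATA are visited, so the COMPOUNDs that
# live under CALIBRATIONDATA never produce a row.
for sample in root.iterfind(".//SAMPLELISTDATA/SAMPLE"):
    for compound in sample.findall("COMPOUND"):
        row = prefixed(sample, "SAMPLE")
        row.update(prefixed(compound, "COMPOUND"))
        # iter() yields the compound itself first, then all its descendants.
        for node in compound.iter():
            if node is not compound:
                row.update(prefixed(node, node.tag))
        rows.append(row)

print(len(rows), "rows;", len(rows[0]), "columns in the first row")
```

Each dict in rows is one SAMPLE/COMPOUND pair, so pandas.DataFrame(rows) (if pandas is installed) should yield the single wide data frame described in the question. Prefixing by tag keeps PEAK.area and ISPEAK.area from overwriting each other.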

data %>% 
    xml2::xml_find_all("//*") %>% 
    purrr::map(~names(xml2::xml_attrs(.))) %>%
    unlist() %>% 
    unique()
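For comparison, a rough Python counterpart of the pipeline above, using only the standard library. The XML below is a heavily abbreviated stand-in for the question's file, and the variable name attribute_names is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Heavily abbreviated stand-in for the question's XML.
XML = """
<QUANDATASET description="" version="1">
  <GROUPDATA count="1">
    <GROUP id="1" name="MAA_JACK">
      <SAMPLELISTDATA count="12">
        <SAMPLE id="1" name="MAA_211220_01">
          <COMPOUND id="1" sampleid="1" name="Palythine">
            <PEAK foundscan="514" area="89222.9220000000"/>
          </COMPOUND>
        </SAMPLE>
      </SAMPLELISTDATA>
    </GROUP>
  </GROUPDATA>
</QUANDATASET>
"""

root = ET.fromstring(XML)

# root.iter() walks every element in document order (like "//*" plus the
# root itself); dict.fromkeys drops duplicates while keeping the order in
# which each attribute name was first seen.
attribute_names = list(
    dict.fromkeys(name for element in root.iter() for name in element.attrib)
)

print(attribute_names)
```

To restrict the list to one kind of element, root.iter("PEAK") could replace root.iter().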
