【Question title】: Pandas read xml not working properly for single tag xml
【Posted】: 2021-06-16 23:20:07
【Question description】:

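I am using the pandas_read_xml package to read XML files and process them into pandas DataFrames. In the vast majority of cases the package works absolutely fine for my purposes. However, when reading a URL whose XML contains a given tag only once, the DataFrame output is a bit off. Let me illustrate this with the following two examples.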

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 =  pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_1 = pdx.fully_flatten(df_1)

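The resulting df_1 has 163 rows and 31 columns, where each row corresponds to a unique security. This is the result I want. However, when I try to read an XML file in which the tag "invstOrSec" occurs only once, the output is a bit odd.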

# Example 2
url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2  = pdx.read_xml(url_2,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_2 = pdx.fully_flatten(df_2)

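The resulting df_2 has 6 rows and 19 columns. I really cannot understand why it contains 6 rows when it should contain 1. I have observed that this behavior only occurs when the tag "invstOrSec" appears exactly once. Any help would be much appreciated. Please let me know if my question is unclear.

【Comments】:

    Tags: python pandas xml-parsing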


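    【Solution 1】:

    First of all, thank you for the feedback! I wrote pandas-read-xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be glad to know that there is a development version of pandas read_xml that should be released soon! (https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html)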

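    As for your current predicament, it is a consequence of how XML is structured (and one of the things I dislike about it). Unlike JSON, which can return a single element inside a list, an XML structure with only one occurrence of a tag is interpreted as a single value rather than a list.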
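
    To make the single-value-versus-list behavior concrete, here is a minimal sketch that uses only the standard library. The `to_dict` helper is hypothetical: it is a simplified imitation of how dict-based XML converters commonly behave, not the actual code of xmltodict or pandas_read_xml, and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET


def to_dict(elem):
    """Convert an element to nested dicts (hypothetical helper).

    A child tag that occurs once maps to a single value;
    a child tag that repeats is collected into a list.
    """
    children = list(elem)
    if not children:
        return elem.text
    out = {}
    for child in children:
        value = to_dict(child)
        if child.tag in out:
            # Second or later occurrence: promote the entry to a list.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out


many = ET.fromstring(
    "<invstOrSecs>"
    "<invstOrSec><name>A</name></invstOrSec>"
    "<invstOrSec><name>B</name></invstOrSec>"
    "</invstOrSecs>"
)
one = ET.fromstring(
    "<invstOrSecs><invstOrSec><name>A</name></invstOrSec></invstOrSecs>"
)

rows_many = to_dict(many)["invstOrSec"]  # a list of two row dicts
rows_one = to_dict(one)["invstOrSec"]    # a single dict, not a list
```

    Iterating over `rows_one` yields field names rather than rows, which is likely why a single-tag file ends up with one row per field instead of one row in total.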

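    Basically, if there is only one "row" tag, then the "column" tags are now treated as rows... I am not making much sense, am I? Let me explain using your example.

    I suggest you use this: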

    # Import package
    import pandas_read_xml as pdx
    from pandas_read_xml import fully_flatten
    
    # Example 1
    url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
    df_1 =  pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec']).pipe(fully_flatten)
    
    # Example 2
    url_2 = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
    df_2  = pdx.read_xml(url_2,['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
    df_2
    

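    What is the difference?

    In Example 1, you already expect multiple inner tags. So passing root_tag_list=['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'] returns a list behind the scenes. The fully_flatten step first explodes that list into rows.

    In Example 2, if you use the same root_tag_list, pandas does not read in a list. Instead, it reads a dictionary that corresponds to a single row. In effect, it treats the tags that should be columns as rows. So instead, I would pass the tag one level above as the root tag, then transpose, then fully_flatten.

    Yes... I know... it is a workaround. But then again, I did not create pandas-read-xml hoping to solve every problem. It was always meant as a stopgap until pandas natively supports reading XML (which looks like it is coming soon).

    Let me know how it goes!

    Edit:

    As for how to make the XML-to-DataFrame conversion switch depending on whether the XML has a single "row" tag or several, I have the following two options.

    In the multi-row case, the read produces a DataFrame with an integer index (row numbers), whereas in the single-row case the index consists of strings, which are the names that should have been columns. So one strategy is to detect this and redo the read accordingly. (You could probably avoid the duplicate download with a smarter approach.)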

    # Import package
    import pandas as pd
    import pandas_read_xml as pdx
    from pandas_read_xml import fully_flatten
    
    # Example 3
    
    dfs = []
    url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
    
    for url_component in url_components:
        url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'])
        if 0 not in temp.index:
            temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True)
        else:
            temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
        dfs.append(temp)
    
    df = pd.concat(dfs, ignore_index=True).pipe(fully_flatten)
    
    df
    

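    The other option is to use the underlying tools. There is no magic behind pandas_read_xml; it uses a package called xmltodict. Read the XML, convert it to dicts, convert those to pandas, then flatten. The only drawback is that because the tag name "invstOrSec" is retained, it becomes a prefix of the column names. You should be able to remove it easily.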

    # Import package
    import pandas as pd
    import pandas_read_xml as pdx
    import xmltodict
    from pandas_read_xml import fully_flatten
    
    # Example 4
    
    url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
    xmldicts = []
    
    for url_component in url_components:
        url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
        xml = pdx.read_xml_from_url(url)
        xmldicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])
        
    df = pd.DataFrame.from_dict(xmldicts).pipe(fully_flatten)
    
    df
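
    If you would rather normalize the parsed dicts yourself before building the DataFrame, a small helper can wrap a lone dict in a list so that every file contributes a list of rows. This is only a sketch: `as_rows` is a hypothetical helper, and the two dicts below are made-up stand-ins for what a dict-based parser might return for a single-tag and a multi-tag file.

```python
def as_rows(value):
    """Return a list of row dicts whether the parser gave None, one dict, or a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Made-up stand-ins for parsed 'invstOrSecs' content (assumption, not real filings).
single = {"invstOrSec": {"name": "Fund A", "cusip": "999999999"}}
multiple = {"invstOrSec": [{"name": "Fund B"}, {"name": "Fund C"}]}

records = []
for parsed in (single, multiple):
    # Each file adds zero or more rows, regardless of how many tags it had.
    records.extend(as_rows(parsed.get("invstOrSec")))
```

    The resulting `records` list could then be passed to `pd.DataFrame(records)`. xmltodict also documents a `force_list` argument to `parse` that may serve the same purpose, though I have not tested it against these filings.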
    

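    Hope that helps!

    Edit:

    So, I have updated the package (now version 0.2.0). pandas_read_xml should now treat the root tag as rows in the resulting pandas DataFrame by default, so there is no need to distinguish between XML that sometimes has a single "row" and sometimes has several.

    If this is a problem in other cases, there is a new argument, root_is_rows, which defaults to True but can be set to False.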

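    【Comments】:

    • Just to add... there does not seem to be a "standard" way of structuring XML. You may need to adapt this for each new data source you encounter.
    • Thank you for the detailed explanation. The approach you suggest works. However, it requires me to first identify which URLs have a single instance of "invstOrSecs" and which have multiple, before converting them to a DataFrame with one of the two methods. I have a few thousand URLs to parse, and I am currently doing this in a for loop. Do you know whether I can define an argument that lets me distinguish the single and multiple cases, so that I can still parse them in a for loop?
    • Hey. So there are a few different options here. But before that, I am still not sure whether there is a "clean" way to implement something in the package itself, so for now the following workarounds will have to do. Edit: I will reply as an answer instead of trying to do it in the comments.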
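    【Solution 2】:

    Indeed, in the upcoming Pandas 1.3, read_xml will let you migrate parsed nodes into a DataFrame. However, because XML can have many dimensions beyond the 2D of rows and columns, the documentation notes:

    This method is best designed to import shallow XML documents

    Therefore, any nested elements will not be picked up immediately, as shown here with about 20 columns. Note the required use of namespaces because of the default namespace in the document.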
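
    As a small illustration of why the namespace mapping is required, the sketch below uses the standard library's ElementTree on a made-up document that declares the same default namespace. The sample XML and the variable names are assumptions for illustration; only the namespace URI and tag names come from the filings above.

```python
import xml.etree.ElementTree as ET

# Made-up sample that mimics the default namespace of the filings.
xml_text = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData>
    <invstOrSecs>
      <invstOrSec><name>Fund A</name><cusip>999999999</cusip></invstOrSec>
    </invstOrSecs>
  </formData>
</edgarSubmission>"""

root = ET.fromstring(xml_text)
ns = {"edgar": "http://www.sec.gov/edgar/nport"}

# An unprefixed name only matches elements that are in no namespace.
without_ns = root.findall(".//invstOrSec")

# Mapping a prefix to the default namespace URI makes the search match.
with_ns = root.findall(".//edgar:invstOrSec", ns)
```

    Here `without_ns` is empty while `with_ns` holds the one element. The prefix `edgar` is arbitrary; what matters is that it maps to the URI declared in the document.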

    Pandas 1.3+

    import pandas as pd

    url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
    df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                     namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
    
    print(df)
    #                                                   name  lei                                              title      cusip  ...  fairValLevel  securityLending  assetCat debtSec
    # 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           3.0              NaN      None     NaN
    # 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...           2.0              NaN  ABS-CBDO     NaN
    # 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           3.0              NaN        EP     NaN
    # 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN              NaN      None     NaN
    # 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN              NaN      None     NaN
    # ..                                                 ...  ...                                                ...        ...  ...           ...              ...       ...     ...
    # 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN              NaN      None     NaN
    # 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN              NaN      None     NaN
    # 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN              NaN      None     NaN
    # 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN              NaN      None     NaN
    # 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN              NaN      None     NaN
    
    # [163 rows x 20 columns]
    
    
    url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
    df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                     namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
    
    print(df)
    #                                        name  lei                                     title      cusip  ...  invCountry  isRestrictedSec fairValLevel securityLending
    # 0  Salient Private Access Master Fund, L.P.  NaN  Salient Private Access Master Fund, L.P.  999999999  ...          US                Y          NaN             NaN
    
    # [1 rows x 18 columns]
    

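    Fortunately, read_xml supports XSLT (a special-purpose language for transforming XML documents) with the default lxml parser. With XSLT you can flatten the nodes you need for migration and retrieve 32 columns.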

    xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                           xmlns:edgar="http://www.sec.gov/edgar/nport">
        <xsl:output method="xml" indent="yes" />
        <xsl:strip-space elements="*"/>
    
        <xsl:template match="@*|node()">
            <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
        </xsl:template>
    
        <xsl:template match="edgar:invstOrSec">
            <xsl:copy>
                <xsl:apply-templates select="*|*/*"/>
            </xsl:copy>
        </xsl:template>
    
    </xsl:stylesheet>
    """
    
    url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
    df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
                     stylesheet=xsl)
    print(df)
    #                                                   name  lei                                              title      cusip  ...  annualizedRt  isDefault  areIntrstPmntsInArrs  isPaidKind
    # 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           NaN       None                  None        None
    # 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...        0.0624          N                     N           N
    # 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           NaN       None                  None        None
    # 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN       None                  None        None
    # 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN       None                  None        None
    # ..                                                 ...  ...                                                ...        ...  ...           ...        ...                   ...         ...
    # 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN       None                  None        None
    # 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN       None                  None        None
    # 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN       None                  None        None
    # 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN       None                  None        None
    # 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN       None                  None        None
    
    # [163 rows x 32 columns]
    

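    Pandas (before 1.3)

    Achieving the same result with the XPath approach requires more steps, since you have to handle the URL request and the XML parsing yourself to build the DataFrame. Specifically, create a list of dictionaries from the transformed, parsed XML and pass it to the DataFrame constructor. The code below uses the same XSLT and the same namespace-aware XPath as above.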

    import lxml.etree as lx
    import pandas as pd
    import urllib.request as rq
    
    url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
    
    xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                           xmlns:edgar="http://www.sec.gov/edgar/nport">
        <xsl:output method="xml" indent="yes" />
        <xsl:strip-space elements="*"/>
    
        <xsl:template match="@*|node()">
            <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
        </xsl:template>
    
        <xsl:template match="edgar:invstOrSec">
            <xsl:copy>
                <xsl:apply-templates select="*|*/*"/>
            </xsl:copy>
        </xsl:template>
    
    </xsl:stylesheet>
    """
    
    content = rq.urlopen(url)
    
    # LOAD XML AND XSL
    doc = lx.fromstring(content.read())
    style = lx.fromstring(xsl)
    
    # INITIALIZE AND TRANSFORM ORIGINAL DOC
    transformer = lx.XSLT(style)
    result = transformer(doc)
    
    # RUN XPATH PARSING ON FLATTER XML
    data = [{node.tag.split('}')[1]:node.text for node in inv.xpath("*")
            } for inv in result.xpath("//edgar:invstOrSec", 
                                     namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]
    
    # BIND DATA FOR DATA FRAME
    df = pd.DataFrame(data)
    
    print(df)
    #                                                   name  lei                                              title  ... isDefault areIntrstPmntsInArrs  isPaidKind
    # 0                                       Tastemade Inc.  N/A                                     Tastemade Inc.  ...       NaN                  NaN         NaN
    # 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  N/A  Regatta XV Funding Ltd., Subordinated Note, Pr...  ...         N                    N           N
    # 2                Hired, Inc., Series C Preferred Stock  N/A              Hired, Inc., Series C Preferred Stock  ...       NaN                  NaN         NaN
    # 3                      WESTVIEW CAPITAL PARTNERS II LP  N/A                    WESTVIEW CAPITAL PARTNERS II LP  ...       NaN                  NaN         NaN
    # 4                       VOYAGER CAPITAL FUND III, L.P.  N/A                     VOYAGER CAPITAL FUND III, L.P.  ...       NaN                  NaN         NaN
    # ..                                                 ...  ...                                                ...  ...       ...                  ...         ...
    # 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  N/A              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  ...       NaN                  NaN         NaN
    # 159                       ALLOY MERCHANT PARTNERS L.P.  N/A                       ALLOY MERCHANT PARTNERS L.P.  ...       NaN                  NaN         NaN
    # 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  ...       NaN                  NaN         NaN
    # 161                   ABRY ADVANCED SECURITIES FUND LP  N/A                   ABRY ADVANCED SECURITIES FUND LP  ...       NaN                  NaN         NaN
    # 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  ...       NaN                  NaN         NaN
    
    # [163 rows x 32 columns]
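
    If lxml is not available, a rough approximation of the same flattening can be sketched with the standard library alone. This is not the XSLT approach above but a swapped-in technique: the hypothetical `flatten_row` helper collects leaf children and leaf grandchildren of each row into one dict and drops the container elements themselves, so the column set may differ slightly from the XSLT output. The sample XML is made up, and the nesting of `annualizedRt` and `isDefault` under `debtSec` is an assumption.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sec.gov/edgar/nport"


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten_row(row):
    """Collect leaf children and leaf grandchildren of a row into one flat dict."""
    record = {}
    for child in row:
        grandchildren = list(child)
        if grandchildren:
            # Container element: lift its leaf children to the row level.
            for grandchild in grandchildren:
                if len(grandchild) == 0:
                    record[local_name(grandchild.tag)] = grandchild.text
        else:
            record[local_name(child.tag)] = child.text
    return record


# Made-up sample; the nesting under debtSec is an assumption.
xml_text = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData>
    <invstOrSecs>
      <invstOrSec>
        <name>Fund A</name>
        <cusip>999999999</cusip>
        <debtSec>
          <annualizedRt>0.0624</annualizedRt>
          <isDefault>N</isDefault>
        </debtSec>
      </invstOrSec>
    </invstOrSecs>
  </formData>
</edgarSubmission>"""

root = ET.fromstring(xml_text)
# iter() always yields a sequence, so one row and many rows are handled alike.
rows = [flatten_row(row) for row in root.iter(f"{{{NS}}}invstOrSec")]
```

    The `rows` list can be passed to `pd.DataFrame(rows)`; a file with a single `invstOrSec` simply produces a one-element list.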
    
    

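    【Comments】:

    • Thanks @Parfait. The Pandas 1.3 solution is great. Pardon my ignorance, but is it possible to install the development version of Pandas 1.3? pip install pandas==1.3 cannot find the version.
    • Great to hear from the module's original author! For the development version, you need to pip install from the pandas git repo after cloning it locally. See the instructions on the home page of the repo. Note: not recommended for Python newcomers.
    • This is beautiful T^T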