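Indeed, in the upcoming Pandas 1.3, read_xml will allow you to migrate parsed nodes into data frames. However, because XML can have many dimensions beyond the 2D of rows and columns, as noted:
This method is best designed to import shallow XML documents
Therefore, nested elements are not immediately picked up, as shown here with about 20 columns. Note the required use of namespaces, because the document declares a default namespace.
Pandas 1.3+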
import pandas as pd

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... fairValLevel securityLending assetCat debtSec
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... 3.0 NaN None NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 2.0 NaN ABS-CBDO NaN
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... 3.0 NaN EP NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN NaN None NaN
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN NaN None NaN
# .. ... ... ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN NaN None NaN
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN NaN None NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN NaN None NaN
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN NaN None NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN NaN None NaN
# [163 rows x 20 columns]
url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... invCountry isRestrictedSec fairValLevel securityLending
# 0 Salient Private Access Master Fund, L.P. NaN Salient Private Access Master Fund, L.P. 999999999 ... US Y NaN NaN
# [1 rows x 18 columns]
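The shallow-parsing behavior can be reproduced offline on a tiny document. The sketch below is only an illustration: the XML, its namespace URI `http://example.com/fund`, and the prefix `f` are all made up, and the input is wrapped in `StringIO` because newer pandas versions discourage passing literal XML strings.

```python
from io import StringIO

import pandas as pd

# Hypothetical document: the namespace URI and all values are invented for illustration.
xml = """<root xmlns="http://example.com/fund">
  <invstOrSec>
    <name>Alpha Fund LP</name>
    <cusip>999999999</cusip>
  </invstOrSec>
  <invstOrSec>
    <name>Beta Note</name>
    <cusip>111111AA1</cusip>
    <debtSec><maturityDt>2031-10-25</maturityDt><annualizedRt>0.0624</annualizedRt></debtSec>
  </invstOrSec>
</root>"""

# The prefix "f" is arbitrary; it only has to map to the document's default namespace URI.
shallow = pd.read_xml(StringIO(xml), xpath="//f:invstOrSec",
                      namespaces={"f": "http://example.com/fund"})

# Only direct children of each matched node become columns;
# the fields nested under debtSec (maturityDt, annualizedRt) are not picked up.
print(shallow.columns.tolist())
```

Without the `namespaces` mapping, an unprefixed XPath such as `//invstOrSec` would not match anything here, since every element lives in the default namespace.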
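Fortunately, read_xml supports XSLT (a special-purpose language for transforming XML documents) with the default lxml parser. With XSLT, you can flatten the nodes needed for migration and retrieve 32 columns.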
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
                 stylesheet=xsl)
print(df)
# name lei title cusip ... annualizedRt isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... NaN None None None
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 0.0624 N N N
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... NaN None None None
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN None None None
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN None None None
# .. ... ... ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN None None None
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN None None None
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN None None None
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN None None None
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN None None None
# [163 rows x 32 columns]
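To see what the `*|*/*` selection does in isolation, here is a minimal sketch on a made-up document (the namespace URI `http://example.com/fund`, the prefix `f`, and all values are assumptions for illustration). The stylesheet has the same shape as the one above: an identity template plus a template that copies children and grandchildren of each row node up to the same level.

```python
from io import StringIO

import pandas as pd

# Hypothetical document: the namespace URI and all values are invented for illustration.
xml = """<root xmlns="http://example.com/fund">
  <invstOrSec>
    <name>Alpha Fund LP</name>
    <cusip>999999999</cusip>
  </invstOrSec>
  <invstOrSec>
    <name>Beta Note</name>
    <cusip>111111AA1</cusip>
    <debtSec><maturityDt>2031-10-25</maturityDt><annualizedRt>0.0624</annualizedRt></debtSec>
  </invstOrSec>
</root>"""

# Identity transform, except that each row node also receives copies of its grandchildren.
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:f="http://example.com/fund">
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="f:invstOrSec">
    <xsl:copy><xsl:apply-templates select="*|*/*"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""

flat = pd.read_xml(StringIO(xml), xpath="//f:invstOrSec",
                   namespaces={"f": "http://example.com/fund"},
                   stylesheet=StringIO(xsl))

# The nested fields now appear as columns alongside the direct children.
print(flat.columns.tolist())
```

Note that the transform only reaches one level down. Deeper nesting would need a longer selection (for example `*|*/*|*/*/*`), and sibling containers that reuse the same child tag name would collide in a single column.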
Pandas < 1.3
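Achieving the same result with a manual XPath approach requires more steps, since you have to handle the URL request and the XML parsing yourself to build the data frame. Specifically, create a list of dictionaries from the transformed, parsed XML and pass it to the DataFrame constructor. The code below uses the same XSLT and the same namespace-aware XPath as above.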
import lxml.etree as lx
import pandas as pd
import urllib.request as rq
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
content = rq.urlopen(url)
# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)
# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)
# RUN XPATH PARSING ON FLATTER XML
data = [
    {node.tag.split('}')[1]: node.text for node in inv.xpath("*")}
    for inv in result.xpath("//edgar:invstOrSec",
                            namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
]
# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)
print(df)
# name lei title ... isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. N/A Tastemade Inc. ... NaN NaN NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... N/A Regatta XV Funding Ltd., Subordinated Note, Pr... ... N N N
# 2 Hired, Inc., Series C Preferred Stock N/A Hired, Inc., Series C Preferred Stock ... NaN NaN NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP N/A WESTVIEW CAPITAL PARTNERS II LP ... NaN NaN NaN
# 4 VOYAGER CAPITAL FUND III, L.P. N/A VOYAGER CAPITAL FUND III, L.P. ... NaN NaN NaN
# .. ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. N/A ARCLIGHT ENERGY PARTNERS FUND V, L.P. ... NaN NaN NaN
# 159 ALLOY MERCHANT PARTNERS L.P. N/A ALLOY MERCHANT PARTNERS L.P. ... NaN NaN NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... ... NaN NaN NaN
# 161 ABRY ADVANCED SECURITIES FUND LP N/A ABRY ADVANCED SECURITIES FUND LP ... NaN NaN NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... ... NaN NaN NaN
# [163 rows x 32 columns]
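Two details of the manual route are worth a small sketch. First, `node.tag.split('}')[1]` raises an IndexError for a tag that has no namespace, and `lxml.etree.QName(node).localname` is one way around that. Second, the DataFrame constructor keeps every value as a string (hence `N/A` rather than `NaN` in the output above), so numeric columns need an explicit conversion. The example below is an illustration on a made-up document, not the code above: it skips the XSLT step and applies the `*|*/*` union directly in XPath, and the namespace URI and values are assumptions.

```python
import lxml.etree as lx
import pandas as pd

# Hypothetical document: the namespace URI and all values are invented for illustration.
xml = """<root xmlns="http://example.com/fund">
  <invstOrSec>
    <name>Alpha Fund LP</name>
    <cusip>999999999</cusip>
  </invstOrSec>
  <invstOrSec>
    <name>Beta Note</name>
    <cusip>111111AA1</cusip>
    <debtSec><maturityDt>2031-10-25</maturityDt><annualizedRt>0.0624</annualizedRt></debtSec>
  </invstOrSec>
</root>"""

ns = {"f": "http://example.com/fund"}
doc = lx.fromstring(xml)

# One dict per row node: children and grandchildren that carry text,
# keyed by local tag name (QName handles tags with or without a namespace).
rows = [
    {lx.QName(node).localname: node.text
     for node in inv.xpath("*|*/*")
     if node.text and node.text.strip()}
    for inv in doc.xpath("//f:invstOrSec", namespaces=ns)
]

manual = pd.DataFrame(rows)

# Values arrive as strings; convert numeric columns explicitly.
manual["annualizedRt"] = pd.to_numeric(manual["annualizedRt"])
print(manual.dtypes)
```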