Parse/load huge XML file with Spark

【Question title】: Parse/load huge XML file with Spark
【Posted】: 2018-01-16 11:42:20
【Question description】:

I have an XML file with the following structure.

<?xml version="1.0" encoding="utf-8"?>
<SomeRoottag>
 <row Id="47513849" PostTypeId="1" />
 <row Id="4751323" PostTypeId="4" />
 <row Id="475546" PostTypeId="1" />
 <row Id="47597" PostTypeId="2" />
</SomeRoottag>

I parse the file and save it as a Hive table with the following code.

from pyspark.sql.functions import explode

df = sqlContext.read.format('xml').option("rowTag", "SomeRoottag").load("/tmp/xmlfile.xml")
flat = df.withColumn("rows2", explode(df.row)).select("rows2.*")
flat.write.format("parquet").saveAsTable("xml_table")

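Everything works with my test data (10 MB), but it fails when I load the large file (>50 GB). It looks like the Spark JVM tries to load the whole file and fails, because the JVM is only 20 GB in size.

What is the best way to handle a file like this?

Update:

If I do the following, I get no data: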

df = (sqlContext.read.format('xml').option("rowTag", "row").load("/tmp/someXML.xml"))
df.printSchema()
df.show()

Output:

root

++
||
++
++

【Comments】:

Tags: python xml apache-spark hive

【Solution 1】:

Don't use SomeRoottag as the rowTag. It tells Spark to treat the whole document as a single row. Instead:

df = (sqlContext.read.format('xml')
    .option("rowTag", "row")
    .load("/tmp/xmlfile.xml"))

Now there is no need to explode either:

df.write.format("parquet").saveAsTable("xml_table")
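
To see why the choice of rowTag matters, it can help to count how many records each tag would yield on the sample document. The sketch below uses only the Python standard library and does not reflect how spark-xml works internally; count_records is a hypothetical helper introduced purely for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from the question (XML declaration omitted).
SAMPLE = """<SomeRoottag>
 <row Id="47513849" PostTypeId="1" />
 <row Id="4751323" PostTypeId="4" />
 <row Id="475546" PostTypeId="1" />
 <row Id="47597" PostTypeId="2" />
</SomeRoottag>"""

def count_records(xml_text, row_tag):
    """Count elements named row_tag, including the root element itself."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter(row_tag))

counts = {tag: count_records(SAMPLE, tag) for tag in ("SomeRoottag", "row")}
print(counts)
```

With the root tag there is exactly one record, and it holds the entire file, so a single row has to be materialized in one place. With row, each element becomes its own small record.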

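Edit:

Given your edit, you are hit by a known bug. See "Self-closing tags are not supported as top-level rows" #92. There doesn't seem to be any progress on fixing it at the moment, so you may have to:

Make a PR yourself to fix it.

Parse the file manually. If each element always sits on a single line, this is easy to do with a udf.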

from pyspark.sql.functions import col, udf
from lxml import etree

@udf("struct<id: string, postTypeId: string>")
def parse(s):
    try:
        attrib = etree.fromstring(s).attrib
        return attrib.get("Id"), attrib.get("PostTypeId")
    except Exception:
        # Unparseable line: return null for this row instead of failing the job.
        return None

(spark.read.text("/tmp/someXML.xml")
    .where(col("value").rlike("^\\s*<row "))
    .select(parse("value").alias("value"))
    .select("value.*")
    .show())

# +--------+----------+
# |      id|postTypeId|
# +--------+----------+
# |47513849|         1|
# | 4751323|         4|
# |  475546|         1|
# |   47597|         2|
# +--------+----------+
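
The per-line parsing done inside the udf can be checked locally without a Spark session. This is a minimal sketch, assuming each record sits on its own line as in the sample. It swaps lxml for the standard library's xml.etree.ElementTree, and parse_row is a hypothetical helper name that does not appear in the original answer.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as the rlike filter in the Spark snippet: lines starting with "<row ".
ROW_LINE = re.compile(r"^\s*<row ")

def parse_row(line):
    """Return (Id, PostTypeId) for a single-line <row .../> element, else None."""
    if not ROW_LINE.match(line):
        return None
    try:
        attrib = ET.fromstring(line.strip()).attrib
    except ET.ParseError:
        return None  # malformed element, or one that spans several lines
    return attrib.get("Id"), attrib.get("PostTypeId")

sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<SomeRoottag>',
    ' <row Id="47513849" PostTypeId="1" />',
    ' <row Id="4751323" PostTypeId="4" />',
    '</SomeRoottag>',
]

parsed = [r for r in map(parse_row, sample) if r is not None]
print(parsed)
```

Lines that are not a complete row element, such as the XML declaration, the root tags, or a truncated row, come back as None and are dropped. The where filter and the try/except play the same role in the Spark version.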

【Comments】:

Worked for me! I was able to process the Wikipedia dump XML files with this approach.