【Question title】: How to parse XML files in Apache Spark?
【Posted】: 2016-08-17 19:15:28
【Question description】:

How can I parse an XML file in Apache Spark when the file contains a list of nodes that all share the same element name?

Sample file:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="CGImap 0.4.0 (25361 thorn-02.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
 <bounds minlat="48.8306100" minlon="2.3310900" maxlat="48.8337900" maxlon="2.3389100"/>
 <node id="430785" visible="true" version="8" changeset="24482318" timestamp="2014-08-01T14:24:53Z" user="dhuyp" uid="1779584" lat="48.8340725" lon="2.3309196"/>
 <node id="661209" visible="true" version="6" changeset="9914127" timestamp="2011-11-22T21:46:44Z" user="lapinos03" uid="33634" lat="48.8337517" lon="2.3333992"/>
 <node id="24912996" visible="true" version="2" changeset="806076" timestamp="2009-03-14T10:38:25Z" user="Goon" uid="24657" lat="48.8302268" lon="2.3338015">
  <tag k="crossing" v="uncontrolled"/>
  <tag k="highway" v="traffic_signals"/>
 </node>
 <node id="24912994" visible="true" version="5" changeset="5904801" timestamp="2010-09-28T15:32:01Z" user="maouth-" uid="322872" lat="48.8301333" lon="2.3309869">
  <tag k="highway" v="mini_roundabout"/>
 </node>
</osm>


Tags: xml apache-spark pyspark


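【Solution 1】:

As mentioned in another answer, spark-xml from Databricks is one way to read XML, but there is currently a bug in spark-xml that prevents you from importing self-closing elements. To work around it, you can import the whole XML document as a single row (using the root element as the row tag) and then explode the nested nodes: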

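val pathToYourData = "Z:/test.xml"
// Read the whole document as one row by using the root element <osm> as the row tag.
val osm = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "osm").load(pathToYourData)
// The repeated <node> elements arrive as an array column; explode it into one row per node.
val nodes = osm.selectExpr("explode(node) as node")
// Expand the node struct so that each attribute becomes its own column.
nodes.select("node.*").show()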
/*
+------+----------+--------+----------+---------+--------------------+-------+---------+--------+--------+--------------------+
|#VALUE|@changeset|     @id|      @lat|     @lon|          @timestamp|   @uid|    @user|@version|@visible|                 tag|
+------+----------+--------+----------+---------+--------------------+-------+---------+--------+--------+--------------------+
|  null|  24482318|  430785|48.8340725|2.3309196|2014-08-01T14:24:53Z|1779584|    dhuyp|       8|    true|                null|
|  null|   9914127|  661209|48.8337517|2.3333992|2011-11-22T21:46:44Z|  33634|lapinos03|       6|    true|                null|
|  null|    806076|24912996|48.8302268|2.3338015|2009-03-14T10:38:25Z|  24657|     Goon|       2|    true|[[null,crossing,u...|
|  null|   5904801|24912994|48.8301333|2.3309869|2010-09-28T15:32:01Z| 322872|  maouth-|       5|    true|[[null,highway,mi...|
+------+----------+--------+----------+---------+--------------------+-------+---------+--------+--------+--------------------+
*/
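
The `tag` column in the output above is still a nested array. As a minimal sketch of how it could be flattened into one row per tag, the helper below explodes it a second time. The name `nodeTags` and the `prefix` parameter are hypothetical and not part of spark-xml. The default prefix `@` is an assumption that matches the column names shown above; the prefix spark-xml uses depends on the library version and on its `attributePrefix` option.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper: turns the exploded `nodes` DataFrame (a single struct
// column named "node") into one row per (node id, tag key, tag value).
// `prefix` is the attribute prefix used in the column names (assumed "@" here).
def nodeTags(nodes: DataFrame, prefix: String = "@"): DataFrame =
  nodes
    // Backticks keep the prefix character from being parsed as syntax.
    .select(
      col(s"node.`${prefix}id`").as("id"),
      explode(col("node.tag")).as("tag") // nodes without tags yield no rows
    )
    .select(
      col("id"),
      col(s"tag.`${prefix}k`").as("key"),
      col(s"tag.`${prefix}v`").as("value")
    )
```

Calling `nodeTags(nodes).show()` after the code above would then list each tag next to the id of the node it belongs to. Because `explode` skips null arrays, nodes that have no `<tag>` children do not appear in the result.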


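    【Solution 2】:

    Use https://github.com/databricks/spark-xml

    // pathTOyourDATA is a placeholder for the path to your XML file.
    // rowTag names the XML element that becomes one row; replace "result" with the element in your file.
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "result")
      .load(pathTOyourDATA)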
    
