【Question title】: Converting an XML string to a Spark DataFrame in Databricks
【Posted】: 2020-08-03 14:15:36
【Question description】:

How can I build a Spark DataFrame from a string that contains XML?

If the XML is saved in a file, I can do it easily:

dfXml = (sqlContext.read.format("xml")
           .options(rowTag='my_row_tag')
           .load(xml_file_name))

However, as I said, I have to build the DataFrame from a string containing regular XML.

Thanks,

Mauro

【Comments】:

    Tags: python dataframe apache-spark databricks


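    【Solution 1】:

    You can parse an XML string without the spark-xml connector. With the UDF below, you can convert the XML string to JSON and then transform the JSON as needed.

    I took sample XML strings and stored them in a catalog.xml file, one XML document per line.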

    /tmp> cat catalog.xml
    <?xml version="1.0"?><catalog><book id="bk101"><author>Gambardella, Matthew</author><title>XML Developer's Guide</title><genre>Computer</genre><price>44.95</price><publish_date>2000-10-01</publish_date><description>An in-depth look at creating applications with XML.</description></book></catalog>
    <?xml version="1.0"?><catalog><book id="bk102"><author>Ralls, Kim</author><title>Midnight Rain</title><genre>Fantasy</genre><price>5.95</price><publish_date>2000-12-16</publish_date><description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description></book></catalog>
    
    
    

    Note that the code below is in Scala; the same logic can be implemented in Python.

    scala> val df = spark.read.textFile("/tmp/catalog.xml")
    df: org.apache.spark.sql.Dataset[String] = [value: string]
    
    scala> import org.json4s.Xml.toJson
    import org.json4s.Xml.toJson
    
    scala> import org.json4s.jackson.JsonMethods.{compact, parse}
    import org.json4s.jackson.JsonMethods.{compact, parse}
    
    scala> :paste
    // Entering paste mode (ctrl-D to finish)
    
    implicit class XmlToJson(data: String) {
        def json(root: String) = compact {
          toJson(scala.xml.XML.loadString(data)).transformField {
            case (field,value) => (field.toLowerCase,value)
          } \ root.toLowerCase
        }
        def json = compact(parse(data))
      }
    
    val parseUDF = udf { (data: String,xmlRoot: String) => data.json(xmlRoot.toLowerCase)}
    
    
    // Exiting paste mode, now interpreting.
    
    defined class XmlToJson
    parseUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
    
    scala> val json = df.withColumn("value",parseUDF($"value",lit("catalog")))
    json: org.apache.spark.sql.DataFrame = [value: string]
    
    scala> val json = df.withColumn("value",parseUDF($"value",lit("catalog"))).select("value").map(_.getString(0))
    json: org.apache.spark.sql.Dataset[String] = [value: string]
    
    scala> val bookDF = spark.read.json(json).select("book.*")
    bookDF: org.apache.spark.sql.DataFrame = [author: string, description: string ... 5 more fields]
    
    scala> bookDF.printSchema
    root
     |-- author: string (nullable = true)
     |-- description: string (nullable = true)
     |-- genre: string (nullable = true)
     |-- id: string (nullable = true)
     |-- price: string (nullable = true)
     |-- publish_date: string (nullable = true)
     |-- title: string (nullable = true)
    
    
    scala> bookDF.show(false)
    +--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
    |author              |description                                                                                                         |genre   |id   |price|publish_date|title                |
    +--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
    |Gambardella, Matthew|An in-depth look at creating applications with XML.                                                                 |Computer|bk101|44.95|2000-10-01  |XML Developer's Guide|
    |Ralls, Kim          |A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.|Fantasy |bk102|5.95 |2000-12-16  |Midnight Rain        |
    +--------------------+--------------------------------------------------------------------------------------------------------------------+--------+-----+-----+------------+---------------------+
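
Since the question is tagged python, here is a minimal sketch of what the `XmlToJson` helper above might look like in Python, using only the standard library. The function names `element_to_value` and `xml_to_json` are hypothetical, and the mapping is a simplification that does not reproduce every rule of json4s: attributes become ordinary fields, repeated tags become lists, and text on elements that also carry attributes or children is ignored.

```python
import json
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Convert an element to a dict, or to its text if it is a plain leaf."""
    children = list(elem)
    if not children and not elem.attrib:
        return elem.text or ""
    # Attributes become ordinary fields (like the `id` column above).
    # Simplification: text on such elements is ignored in this sketch.
    node = {name.lower(): value for name, value in elem.attrib.items()}
    for child in children:
        key = child.tag.lower()
        value = element_to_value(child)
        if key in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[key], list):
                node[key] = [node[key]]
            node[key].append(value)
        else:
            node[key] = value
    return node


def xml_to_json(xml_string, root):
    """Return the content under the `root` tag as a compact JSON string."""
    tree = ET.fromstring(xml_string)
    doc = {tree.tag.lower(): element_to_value(tree)}
    # Field names are lower-cased, so the root lookup is lower-cased too.
    return json.dumps(doc[root.lower()], separators=(",", ":"))
```

In PySpark, a function like this could presumably be wrapped with `pyspark.sql.functions.udf` (return type `StringType()`), and the resulting JSON strings passed to `spark.read.json` as an RDD of strings, mirroring the Scala session above.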
    
    

    【Comments】:

      【Solution 2】:

      In Scala, the `XmlReader` class from the spark-xml library can be used to convert an RDD[String] to a DataFrame:

          import com.databricks.spark.xml.XmlReader
          val result = new XmlReader().xmlRdd(spark, rdd)
      

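      If your input is a DataFrame, it can easily be converted to an RDD[String] (for a single string column, `df.as[String].rdd` after `import spark.implicits._`).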
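
`XmlReader` is a Scala API. For the Python case in the question, where the XML is a single string held on the driver, a rough stand-in is to extract one dict per row element with the standard library. This is a different technique from `XmlReader` and is not part of spark-xml; the function name `xml_rows` is hypothetical, and the sketch assumes flat rows whose direct children hold plain text.

```python
import xml.etree.ElementTree as ET


def xml_rows(xml_string, row_tag):
    """Return one flat dict per `row_tag` element found in the XML string."""
    root = ET.fromstring(xml_string)
    rows = []
    # iter() walks the whole tree, including the root element itself.
    for elem in root.iter(row_tag):
        row = dict(elem.attrib)          # attributes become columns
        for child in elem:
            row[child.tag] = child.text  # direct children become columns
        rows.append(row)
    return rows
```

The resulting list could then be handed to `spark.createDataFrame`, for example `spark.createDataFrame(xml_rows(xml_string, "my_row_tag"))`. Some Spark versions warn when inferring a schema from dicts, in which case converting each dict to a `pyspark.sql.Row` is an option. All values stay strings, as in the schema printed in Solution 1.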

      【Comments】:
