【Posted】: 2015-12-09 20:37:49
【Question】:
I am trying to concatenate the attributes of an XML element in Scala, using a comma as the delimiter.
scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21
scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23
scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25
scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="Hello" column2="there" column3="how" column4="are you?" />", "<record column1=...."
scala> val elem = fltrLines.map{ scala.xml.XML.loadString _ }
elem: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[34] at map at <console>:27
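This is where I need to join the attributes with commas: column1, then a comma, then column2, then a comma, then column3, and so on. I would also like to be able to change the order, for example column3, column1, column2.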
scala> val attr = elem.map(_.attributes("column1"))
attr: org.apache.spark.rdd.RDD[Seq[scala.xml.Node]] = MapPartitionsRDD[35] at map at <console>:29
This is what it looks like right now:
scala> attr.take(1)
res17: Array[String] = Array(Hello)
This is what I need:
scala> attr.take(1)
res17: Array[String] = Array(Hello, there, how, are you?)
Or this, if I prefer a different order:
scala> attr.take(1)
res17: Array[String] = Array(are you?, there, Hello)
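One possible way to get there, shown as a minimal sketch outside Spark so it can be tried on a single line first. It assumes scala.xml is on the classpath (as it is in the shell session above). The names joinAttributes, order and sample are hypothetical and only used for illustration. Each attribute is looked up with the `\ "@name"` selector, which yields an empty string when the attribute is missing, and the values are then joined with mkString.

```scala
import scala.xml.XML

// Hypothetical helper: parse one line, read the attributes in the requested
// order, and join their text values with ", ".
// A missing attribute contributes an empty string.
def joinAttributes(line: String, order: Seq[String]): String = {
  val elem = XML.loadString(line)
  order.map(name => (elem \ s"@$name").text).mkString(", ")
}

// Sample line taken from the fltrLines.take(5) output in the question.
val sample =
  """<record column1="Hello" column2="there" column3="how" column4="are you?" />"""

val inOrder   = joinAttributes(sample, Seq("column1", "column2", "column3", "column4"))
val reordered = joinAttributes(sample, Seq("column4", "column2", "column1"))

println(inOrder)   // Hello, there, how, are you?
println(reordered) // are you?, there, Hello
```

In the Spark shell the same helper could presumably be applied per line, for example fltrLines.map(line => joinAttributes(line, Seq("column1", "column2", "column3", "column4"))), which should give an RDD[String] with one comma-joined string per record. Changing the order is then only a matter of changing the sequence of attribute names.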
【Discussion】:
Tags: scala apache-spark