Scala XML: Parsing parent text that may contain other XML elements

【Question title】: Scala XML: Parsing parent text that may contain other XML elements within
【Posted】: 2017-04-03 18:39:48
【Question description】:

I need to parse an XML node whose text may or may not contain other XML elements:

Input

<!-- i. no xml nested within text -->
<my.element>
  This is some text. It's simple and easy.
</my.element>

<!-- ii. xml nested within text -->
<my.element>
  This is some text. It comes with <other.xml ID="1234" type="x" ...>1</other.xml> xml element nested within it's text.
</my.element>

Output

# i.
"This is some text. It's simple and easy."

# ii.
"This is some text. It comes with [1] other xml element nested within its text."

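The question is how to reliably separate the text of `my.element` from the text of the nested element `other.xml`.

Each element is loaded as a scala.xml.NodeSeq, and the inner XML seems to be more or less ignored (that is, I cannot apply logic to it by label). The best I can do is get text, which duplicates the inner element's text: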

foo.text
String = 
"This is some text. It comes with
    1
    1
    other xml element nested within its text."

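This is a simplified example. In practice I am working with a terabyte or more of data and have to handle a large, variable number of potentially nested XML elements. Some need their text extracted and merged as above, some can be ignored, some need different formatting, and so on.

This is related to Spark because I need the solution to be serializable and to run at scale with Spark.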

【Question discussion】:

What does the apache-spark tag have to do with this question?
@TzachZohar I need the solution to be serializable so that I can run it at scale with Spark.

Tags: xml scala apache-spark

【Solution 1】:

I'm no expert in scala-xml, but I would pre-normalize that XML block so that you end up with something like this:

<!-- i. no xml nested within text -->
<my.element>
  This is some text. It's simple and easy.
</my.element>

<!-- ii. xml nested within text -->
<my.element>
  <textblock>This is some text. It comes with</textblock> <other.xml ID="1234" type="x" ...>1</other.xml><textblock> xml element nested within it's text.</textblock>
</my.element>

Then you can simply find your element and call (xmlObj \ "textblock").text
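As a rough illustration of that lookup, here is a minimal sketch. It assumes the input has already been normalized into `textblock` wrappers (the normalization step itself is not shown in this answer), and that scala-xml is on the classpath. `TextblockDemo` and its member names are made up for the example; `\` is scala-xml's child-projection operator on `NodeSeq`.

```scala
import scala.xml.{Elem, XML}

object TextblockDemo {
  // Already-normalized input, following the shape suggested above.
  // The elided attributes ("...") from the original sample are left out.
  val normalized: Elem = XML.loadString(
    """<my.element><textblock>This is some text. It comes with</textblock> <other.xml ID="1234" type="x">1</other.xml><textblock> xml element nested within it's text.</textblock></my.element>"""
  )

  // Text that belongs to the parent: concatenation of all direct <textblock> children.
  def parentText: String = (normalized \ "textblock").text

  // Text that belongs to the nested element, selected by its label.
  def nestedText: String = (normalized \ "other.xml").text
}
```

With this shape, `TextblockDemo.parentText` contains only the parent's own text and `TextblockDemo.nestedText` only the nested element's text, so the two can be processed separately.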

【Discussion】:

【Solution 2】:

This is what I get when I call text:

scala> val foo = <my.element>This is some text. It comes with <other.xml ID="1234" type="x">1</other.xml> xml element nested within it's text. </my.element>
foo: scala.xml.Elem = <my.element>This is some text. It comes with <other.xml ID="1234" type="x">1</other.xml> xml element nested within it's text. </my.element>

scala> foo.text
res0: String = "This is some text. It comes with 1 xml element nested within it's text. "

You can use the child method to get the collection of children inside my.element:

scala> foo.child
res4: Seq[scala.xml.Node] = ArrayBuffer(This is some text. It comes with , <other.xml ID="1234" type="x">1</other.xml>,  xml element nested within it's text. )

Then filter (using collect) for the text nodes rather than the XML elements:

scala> foo.child.collect { case scala.xml.Text(data) => data }
res6: Seq[String] = ArrayBuffer("This is some text. It comes with ", " xml element nested within it's text. ")

I'm not sure exactly what you are trying to do, but this does what you asked for. I hope it helps.
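If the goal is the bracketed output from the question, the same child-based idea can be extended to pattern match on each child instead of discarding the elements. The following is only a sketch under a few assumptions: `NestedText`, `renderChild`, `render` and `renderString` are hypothetical names, the rule "wrap `other.xml` in brackets, drop every other nested element" is my reading of the question, and whitespace is collapsed to single spaces. The object holds no state, so it should be usable from inside a Spark closure, but that has not been tested against Spark here.

```scala
import scala.xml.{Atom, Elem, Node, XML}

object NestedText {
  // Per-child rule (assumed for illustration): <other.xml> becomes "[text]",
  // any other nested element is dropped, plain text is kept unchanged.
  // Add further `case` lines here for elements that need different formatting.
  private def renderChild(node: Node): String = node match {
    case e: Elem if e.label == "other.xml" => s"[${e.text.trim}]"
    case _: Elem                            => ""
    case a: Atom[_]                         => a.text // Text, PCData, ...
    case _                                  => ""     // comments, processing instructions
  }

  // Render the direct children of `parent`, then collapse runs of whitespace
  // (indentation and newlines from the source XML) into single spaces.
  def render(parent: Node): String =
    parent.child.map(renderChild).mkString.replaceAll("\\s+", " ").trim

  // Convenience wrapper for records that arrive as raw XML strings.
  def renderString(xml: String): String = render(XML.loadString(xml))
}
```

Calling `NestedText.renderString` on the second sample from the question (without the elided attributes) yields the parent text with `[1]` in place of the nested element. Matching on `e.label` is what lets you apply different logic per nested element type.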

【Discussion】:
