How do I extract a multilingual string between XML tags

【Question title】: How do I extract a multilingual string in between an xml tag
【Posted】: 2016-10-07 09:57:39
【Question】:

I am trying to extract the text between XML tags. The text between the tags is multilingual. For example:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

I tried googling it and found some regular expressions, but none of them worked. Here is one I tried:

String str = "<string xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">"+
    "तुम्हारा नाम क्या है"+"</string>";

final Pattern pattern = Pattern.compile("<String xmlns="+
    "http://schemas.microsoft.com/2003/10/Serialization/"+">(.+?)</string>");

final Matcher matcher = pattern.matcher(str);
matcher.find();
System.out.println(matcher.group(1));

The given String has this format:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
    तुम्हारा नाम क्या है
</string>

The expected output is:

तुम्हारा नाम क्या है

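It gives me an error.

【Comments】:

For one thing, regular expressions are case-sensitive. Your pattern will only match String [...] with an uppercase "S".
Keep in mind: you cannot parse XML or HTML with regular expressions. For the theory, see stackoverflow.com/questions/6751105/…, and for fun, see stackoverflow.com/questions/1732348/… ...
Adding to Jägermeister's point: stackoverflow.com/questions/701166/…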
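
The case-sensitivity point from the first comment can be reproduced in isolation. This is a minimal sketch, not the asker's exact code: the class name `CaseMismatchDemo`, the helper `matches`, and the shortened sample text are assumptions made for illustration. It also shows why the question's code fails with an exception: `matcher.find()` returns false, and calling `group(1)` after a failed match throws `IllegalStateException`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseMismatchDemo {
    // Hypothetical helper: true if the regex is found anywhere in the input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input =
            "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">hello</string>";

        // Capital 'S' in the pattern never matches the lowercase tag name.
        System.out.println(matches("<String [^>]*>(.+?)</string>", input));
        // With the lowercase tag name the pattern is found.
        System.out.println(matches("<string [^>]*>(.+?)</string>", input));

        // Ignoring the result of find(), as the question's code does:
        Matcher m = Pattern.compile("<String [^>]*>(.+?)</string>").matcher(input);
        m.find(); // returns false here
        try {
            m.group(1);
        } catch (IllegalStateException e) {
            // group() is only valid after a successful match
            System.out.println("group(1) failed: " + e.getMessage());
        }
    }
}
```

Checking the return value of `find()` before calling `group(1)` turns the exception into an ordinary "no match" branch.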

Tags: java regex xml

【Solution 1】:

This pattern matches the expected part, and $1 gives you the expected result:

/<string .*?>(.*?)<\/string>/
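
In Java, the same pattern might be applied roughly as follows. This is a sketch under a few assumptions: the class name `RegexExtract` and the method `extract` are made up for illustration, `Pattern.DOTALL` is added because the sample text sits on its own line, and the result is trimmed to drop the surrounding whitespace. Inside a Java string literal the `/` needs no escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // The answer's pattern as a Java string literal.
    // DOTALL (an addition here) lets '.' match the line breaks around the text.
    private static final Pattern STRING_TAG =
        Pattern.compile("<string .*?>(.*?)</string>", Pattern.DOTALL);

    // Returns the trimmed text of the first <string ...> element, or null if none is found.
    static String extract(String xml) {
        Matcher m = STRING_TAG.matcher(xml);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String xml =
            "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(extract(xml));
    }
}
```

Because `.` matches any character, Devanagari or other non-Latin text needs no special handling in the pattern itself.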

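But I strongly recommend that you stop using regex for this! Find an HTML parser for Java and simply grab the content of the <string> tag.

【Discussion】:

【Solution 2】:

Do not use regular expressions to parse XML. It will work in a few cases, but it will eventually fail. For a full explanation, see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?".

The simplest way to extract an element's string content is to use XPath: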

String contents =
    XPathFactory.newInstance().newXPath().evaluate(
        "//*[local-name()='string']",
        new InputSource(new StringReader(str)));
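
For completeness, here is a self-contained sketch of the XPath approach with the imports it needs. The class name `XPathExtract` and the method `extract` are assumptions for illustration, and the `trim()` call is an addition, since the element's text in the question is surrounded by line breaks and indentation.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathExtract {
    // Returns the trimmed text of the first element whose local name is "string".
    static String extract(String xml) {
        try {
            String contents = XPathFactory.newInstance().newXPath().evaluate(
                "//*[local-name()='string']",   // ignores the element's namespace
                new InputSource(new StringReader(xml)));
            return contents.trim();
        } catch (XPathExpressionException e) {
            // Also raised when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<string xmlns=\"http://schemas.microsoft.com/2003/10/Serialization/\">\n"
            + "    तुम्हारा नाम क्या है\n"
            + "</string>";
        System.out.println(extract(xml));
    }
}
```

Matching on `local-name()` avoids having to register the `http://schemas.microsoft.com/2003/10/Serialization/` namespace with the XPath object.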

【Discussion】: