Pattern matching by regular expression for special characters in XML

【Question title】: Pattern matching by regular expression for special characters in XML
【Posted】: 2014-02-08 21:07:42
【Question】:

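I am trying to collect all values that contain special characters from an XML string, because in C# both XmlDocument and XDocument throw an exception when they read XML containing unescaped special characters.

Say I have this XML string: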

<root>\n\t<childone>\n\t\t<attributeone name=\"aa\">aa</attributeone>\n\t\t<attributetwo adds=\"ba\">ab&\"'<</attributetwo>\n\t\t<attributeone name=\"aa\">&</attributeone>\n\t</childone>\n</root>

I am using the following snippet to get only the values that contain special characters, for example ab&"'

string pat = @"(>)([&\""\'<]+)(<)(/)";
Match match = Regex.Match(input, pat, RegexOptions.IgnoreCase);

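But it does not capture anything. So what is the best way to capture all values that contain special characters and store them in a string array or list? My input is an XML string with special characters. In some cases it may contain no newline or tab characters between tags, and some of the XML runs to more than 17,000 lines. After capturing, I need to replace those special characters with their escaped equivalents (& to &amp;). Please help me find a good way to solve this. (It only captures strings made up of the characters specified in the "pat" string, such as "&&&" or "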

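【Comments】:

As many will point out, parsing XML with regular expressions is usually a bad idea: it quickly gets out of hand, is hard to maintain, and makes possible errors hard to foresee, especially if your input varies. There are plenty of well-written XML parsers out there, and it is better to pick one and use it than to do this job with regex.
Agree with Nit, you should try to avoid regex. You said "because XmlDocument and XDocument throw an exception reading the XML" - I suggest fixing the application that produces the clearly invalid XML input string, for example by having it escape special characters correctly. XML is no good unless it can be read by an ordinary XML parser.
@Astrotrain I understand the situation you describe, but unfortunately I don't have much access to the source of the XML. I only receive it from the source as an input string, and I don't know where or by whom it was created.

Tags: c# xml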

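【Solution 1】:

I reformatted your XML fragment to make it more readable. You can clearly see that the XML is invalid (which we already knew, because XmlDocument failed to parse it). Evidently the content of attributetwo is meant to be ab&"'<, but the XML is invalid because of the "&" (which should be "&amp;") and the last "<" (which should be "&lt;"):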

<root>\n
\t<childone>\n
\t\t<attributeone name=\"aa\">aa</attributeone>\n
\t\t<attributetwo adds=\"ba\">ab&\"'<</attributetwo>\n
\t\t<attributeone name=\"aa\">&</attributeone>\n
\t</childone>\n
</root>
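To confirm the diagnosis before attempting any repair, the fragment can be fed to the parser and the failure caught. This is only a minimal sketch: `ParseCheck` and `IsWellFormed` are hypothetical names introduced here, not part of the original post. `XDocument.Parse` and `XmlException` are the standard .NET APIs.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class ParseCheck
{
    // Returns true when the text parses as XML, false when the parser rejects it.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for input such as a bare "&" or a stray "<" in element content.
            return false;
        }
    }
}
```

For example, `ParseCheck.IsWellFormed("<a>&</a>")` is false, while `ParseCheck.IsWellFormed("<a>&amp;</a>")` is true.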

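I still think you should try to turn this string into valid XML so that you can parse it. This could be one way to do it (this example requires that "{" and "}" are not used in the actual XML string, although you could use any two unused characters):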

using System;
using System.Text.RegularExpressions;

class Program
{
    private const string BrokenXml = 
        "<root>\n" +
        "\t<childone>\n" +
        "\t\t<attributeone name=\"aa\">aa</attributeone>\n" +
        "\t\t<attributetwo adds=\"ba\">ab&\"'<</attributetwo>\n" +
        "\t\t<attributeone name=\"aa\">&</attributeone>\n" +
        "\t<empty />\n" +
        "\t</childone>\n" +
        "</root>";

    // Matches an opening tag with 0 or more attributes, and captures everything within "<...>" as Groups[1].
    // Unescaped regex looks like: <(\w+(?:\s+\w+="[^"]*")*)>
    // (The attribute group is repeated with "*" so that tags with several attributes also match.)
    private static Regex OpenTagRegex = new Regex("<(\\w+(?:\\s+\\w+=\"[^\"]*\")*)>");

    // Matches a close tag and captures everything within "<...>" as Groups[1].
    private static Regex CloseTagRegex = new Regex("<(/\\w+)>");

    // Matches an empty tag and captures everything within "<...>" as Groups[1].
    private static Regex EmptyTagRegex = new Regex("<(\\w+\\s*/)>");

    public static void Main(string[] args)
    {
        // Replace the angle brackets (<>) of every recognized tag with curly braces ({})
        string step1 = OpenTagRegex.Replace(BrokenXml, ReplaceMatch);
        string step2 = CloseTagRegex.Replace(step1, ReplaceMatch);
        string step3 = EmptyTagRegex.Replace(step2, ReplaceMatch);

        //Fix the remaining special characters with their xml entity counterparts:
        string step4 = step3.Replace("&", "&amp;");
        string step5 = step4.Replace("<", "&lt;");
        string step6 = step5.Replace(">", "&gt;");

        //Convert from curly braces xml back to regular xml
        string result = step6.Replace("{", "<").Replace("}", ">");

        Console.WriteLine(result);

        Console.WriteLine("Press enter to exit...");
        Console.ReadLine();
    }

    /// <summary>
    /// Matches the MatchEvaluator signature.
    /// </summary>
    private static string ReplaceMatch(Match match)
    {
        string contentWithoutAngularBrackets = match.Groups[1].Value;
        return "{" + contentWithoutAngularBrackets + "}";
    }
}
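Once the string has been repaired, it can be handed to a regular parser, and the values the question asks about can be collected from the parsed tree rather than with a second regex. The following is only a sketch under some assumptions: it merges the three tag patterns above into one, it assumes the input contains no "{" or "}" and no entities that are already escaped (an existing "&amp;" would be escaped twice), and the names `XmlRepair`, `Repair`, and `ValuesWithSpecialChars` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class XmlRepair
{
    // Single pattern covering open, close, and empty tags (an assumption, merged from
    // the three regexes above). Groups[1] is the text between "<" and ">".
    private static readonly Regex TagRegex =
        new Regex("<(/?\\w+(?:\\s+\\w+=\"[^\"]*\")*\\s*/?)>");

    // Same placeholder trick as above: shield real tags, escape what is left, restore tags.
    public static string Repair(string brokenXml)
    {
        string shielded = TagRegex.Replace(brokenXml, m => "{" + m.Groups[1].Value + "}");
        string escaped = shielded.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
        return escaped.Replace("{", "<").Replace("}", ">");
    }

    // Parses the repaired text and returns the values of leaf elements
    // that contain at least one special character.
    public static List<string> ValuesWithSpecialChars(string brokenXml)
    {
        char[] special = { '&', '"', '\'', '<', '>' };
        return XDocument.Parse(Repair(brokenXml))
            .Descendants()
            .Where(e => !e.HasElements)
            .Select(e => e.Value)
            .Where(v => v.IndexOfAny(special) >= 0)
            .ToList();
    }
}
```

Called on the broken string from the question, `XmlRepair.ValuesWithSpecialChars` yields the two values `ab&"'<` and `&`. One limitation of any approach along these lines: a stray "<" that happens to be followed by something that looks like a tag (for example "a<b>") is indistinguishable from a real tag and will not be escaped.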

【Comments】: