【Question title】: Dealing with invalid XML hexadecimal characters
【Posted】: 2011-12-31 12:39:37
【Question】:

I am trying to send an XML document over the wire, but I am receiving the following exception:

"MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
   at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
   at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
   at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
   at System.Xml.XmlRawWriter.WriteValue(String value)
   at System.Xml.XmlWellFormedWriter.WriteValue(String value)
   at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
   --- End of inner exception stack trace ---

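I have no control over what I attempt to send, because the string is gathered from an email. How can I encode my string so that it is valid XML while retaining the illegal characters?

I would like to retain the original characters in some way or another.

【Question comments】:

  • It depends on whether the illegal character is something like x0 that XML cannot handle at all, or something like < that only needs escaping.

Tags: c# xml .net-3.5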


【Solution 1】:

The following code removes XML-invalid characters from a string and returns a new string without them:

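public static string CleanInvalidXmlChars(string text)
{
     // From the XML 1.0 spec, valid chars:
     // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
     // .NET regexes match UTF-16 code units and \x takes exactly two hex digits,
     // so \uXXXX escapes are used. [#x10000-#x10FFFF] appears in a string as
     // well-formed surrogate pairs: those are kept, lone surrogates are removed.
     string re = @"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\uD800-\uDFFF]"
               + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
               + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
     return Regex.Replace(text, re, "");
}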

【Comments】:

【Solution 2】:
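// Encode as UTF-8 rather than ASCII: ASCII would turn every non-ASCII character into '?'
byte[] toEncodeAsBytes
            = System.Text.Encoding.UTF8.GetBytes(toEncode);
      string returnValue
            = System.Convert.ToBase64String(toEncodeAsBytes);

is one way to do it. The receiver has to Base64-decode the value to get the original characters back.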
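
Since the question asks to keep the original characters, a round trip may make the Base64 idea clearer. This is only a minimal sketch: the class and method names (`Base64Text`, `Encode`, `Decode`) are hypothetical, and it assumes both sender and receiver agree that the element holds Base64-encoded UTF-8 text.

```csharp
using System;
using System.Text;

public static class Base64Text
{
    // Hypothetical helper: turns any well-formed string into XML-safe Base64 text.
    public static string Encode(string text)
    {
        // UTF-8 keeps control characters such as U+0002 as ordinary bytes.
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
    }

    // Hypothetical helper: restores the original string on the receiving side.
    public static string Decode(string base64)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

The encoded value contains only letters, digits, `+`, `/` and `=`, so the XML writer never sees the illegal character. The trade-off is that the body is no longer human-readable inside the XML.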

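【Comments】:

    【Solution 3】:

    The following solution removes any invalid XML characters, and I think it does so about as efficiently as possible. In particular, it does not allocate a new StringBuilder or a new string unless it has already determined that the string contains an invalid character. The hot spot therefore ends up being a single for loop over the characters, and the check is usually no more than two greater-than/less-than numeric comparisons per character. If no invalid character is found, it simply returns the original string. This is particularly useful when the vast majority of strings are fine to begin with: it is best to get those in and out as quickly as possible (with no wasted allocations, etc.).

    -- Update --

    See below for how you can also directly write an XElement that contains any of these invalid characters, although it uses this code.

    Some of this code was influenced by Mr. Tom Bogle's solution here. See also the helpful information in superlogical's post on that same thread. All of those, however, always instantiate a new StringBuilder and string.

    Usage: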

        string xmlStrBack = XML.ToValidXmlCharactersString("any string");
    

    Test:

        public static void TestXmlCleanser()
        {
            string badString = "My name is Inigo Mont\u001Eoya"; // the bad char U+001E sits between 'Mont' and 'oya'
            string goodString = "My name is Inigo Montoya!";
    
            string back1 = XML.ToValidXmlCharactersString(badString); // fixes it
            string back2 = XML.ToValidXmlCharactersString(goodString); // returns same string
    
            XElement x1 = new XElement("test", back1);
            XElement x2 = new XElement("test", back2);
            XElement x3WithBadString = new XElement("test", badString);
    
            string xml1 = x1.ToString();
            string xml2 = x2.ToString();
    
            string xmlShouldFail = x3WithBadString.ToString(); // expected to throw: invalid character
        }
    

    // --- Code --- (I have these methods in a static utility class called XML)

        /// <summary>
        /// Determines if any invalid XML 1.0 characters exist within the string,
        /// and if so it returns a new string with the invalid chars removed, else 
        /// the same string is returned (with no wasted StringBuilder allocated, etc).
        /// </summary>
        /// <param name="s">Xml string.</param>
        /// <param name="startIndex">The index to begin checking at.</param>
        public static string ToValidXmlCharactersString(string s, int startIndex = 0)
        {
            int firstInvalidChar = IndexOfFirstInvalidXMLChar(s, startIndex);
            if (firstInvalidChar < 0)
                return s;
    
            startIndex = firstInvalidChar;
    
            int len = s.Length;
            var sb = new StringBuilder(len);
    
            if (startIndex > 0)
                sb.Append(s, 0, startIndex);
    
            for (int i = startIndex; i < len; i++)
                if (IsLegalXmlChar(s[i]))
                    sb.Append(s[i]);
    
            return sb.ToString();
        }
    
        /// <summary>
        /// Gets the index of the first invalid XML 1.0 character in this string, else returns -1.
        /// </summary>
        /// <param name="s">Xml string.</param>
        /// <param name="startIndex">Start index.</param>
        public static int IndexOfFirstInvalidXMLChar(string s, int startIndex = 0)
        {
            if (s != null && s.Length > 0 && startIndex < s.Length) {
    
                if (startIndex < 0) startIndex = 0;
                int len = s.Length;
    
                for (int i = startIndex; i < len; i++)
                    if (!IsLegalXmlChar(s[i]))
                        return i;
            }
            return -1;
        }
    
        /// <summary>
        /// Indicates whether a given character is valid according to the XML 1.0 spec.
        /// This code represents an optimized version of Tom Bogle's on SO: 
        /// https://stackoverflow.com/a/13039301/264031.
        /// </summary>
        public static bool IsLegalXmlChar(char c)
        {
            if (c > 31 && c <= 55295)
                return true;
            if (c < 32)
                return c == 9 || c == 10 || c == 13;
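            return (c >= 57344 && c <= 65533) || c > 65535;
            // The final comparison (c > 65535) can never be true for a 16-bit char; it is only
            // useful if c is widened to an int code point (e.g. for UTF-32). No upper bound of
            // 1114111 is checked because Char.ConvertToUtf32 would already have thrown for
            // anything larger.
            // Note: surrogate code units (55296-57343) are rejected here, so characters stored
            // as surrogate pairs are removed as well.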
        }
    

    ======== ======== ========

    Writing XElement.ToString directly

    ======== ======== ========

    First, the usage of this extension method:

    string result = xelem.ToStringIgnoreInvalidChars();
    

    -- Fuller test --

        public static void TestXmlCleanser()
        {
            string badString = "My name is Inigo Mont\u001Eoya"; // the bad char U+001E sits between 'Mont' and 'oya'
    
            XElement x = new XElement("test", badString);
    
            string xml1 = x.ToStringIgnoreInvalidChars();                               
            //result: <test>My name is Inigo Montoya</test>
    
            string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
            //result: <test>My name is Inigo Mont&#x1E;oya</test>
        }
    

    --- Code ---

        /// <summary>
        /// Writes this XML to string while allowing invalid XML chars to either be
        /// simply removed during the write process, or else encoded into entities, 
        /// instead of having an exception occur, as the standard XmlWriter.Create 
        /// XmlWriter does (which is the default writer used by XElement).
        /// </summary>
        /// <param name="xml">XElement.</param>
        /// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
        /// <param name="indent">Indent setting.</param>
        /// <param name="indentChar">Indent char (leave null to use default)</param>
        public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
        {
            if (xml == null) return null;
    
            StringWriter swriter = new StringWriter();
            using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars)) {
    
                // -- settings --
                // unfortunately writer.Settings cannot be set, is null, so we can't specify: bool newLineOnAttributes, bool omitXmlDeclaration
                writer.Formatting = indent ? Formatting.Indented : Formatting.None;
    
                if (indentChar != null)
                    writer.IndentChar = (char)indentChar;
    
                // -- write --
                xml.WriteTo(writer); 
            }
    
            return swriter.ToString();
        }
    

    -- This uses the following XmlTextWriter --

    public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
    {
        public bool DeleteInvalidChars { get; set; }
    
        public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
        {
            DeleteInvalidChars = deleteInvalidChars;
        }
    
        public override void WriteString(string text)
        {
            if (text != null && DeleteInvalidChars)
                text = XML.ToValidXmlCharactersString(text);
            base.WriteString(text);
        }
    }
    

    【Comments】:

      【Solution 4】:

      Worked for me:

      XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Encoding = Encoding.UTF8, CheckCharacters = false };
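
      For context, here is a hedged sketch of how such settings might be used end to end. The helper names (`UncheckedXml`, `WriteBody`, `ReadBody`) and the `Body` element are assumptions for illustration, and it presumes you create the writer and the reader yourself, which may not be possible when a library such as EWS owns the writer. With character checking off, the output can contain references to characters that XML 1.0 forbids, so a strict parser on the other end may still reject it.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class UncheckedXml
{
    // Hypothetical helper: writes <Body>text</Body> without character checking.
    public static string WriteBody(string text)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true, CheckCharacters = false };
        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString("Body", text);
        }
        return sb.ToString();
    }

    // Hypothetical helper: reads the element text back. The reader also needs
    // CheckCharacters = false, otherwise it rejects the character reference.
    public static string ReadBody(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```

      This keeps the original characters only as long as both sides are equally lenient; if the consumer is not under your control, removing or Base64-encoding the characters is the safer route.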
      

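      【Comments】:

      【Solution 5】:

      I'm on the receiving end of @parapurarajkumar's solution, where the illegal characters are loaded into XmlDocument correctly but break XmlWriter when I try to save the output.

      My context

      I'm using Elmah to look at exception/error logs from a website. Elmah returns the state of the server at the time of the exception as a large XML document. For our reporting engine, I pretty-print the XML with XmlWriter.

      During a website attack, I noticed that some XML wasn't parsing and I was receiving this '.', hexadecimal value 0x00, is an invalid character. exception.

      NON-RESOLUTION: I converted the document to a byte[] and sanitized it of 0x00, but it found none.

      When I scanned the XML document, I found the following: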

      ...
      <form>
      ...
      <item name="SomeField">
         <value
           string="C:\boot.ini&#x0;.htm" />
       </item>
      ...
      

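      There was the nul byte, encoded as the HTML entity &#x0; !!!

      RESOLUTION: To fix the encoding, I replaced the &#x0; value before loading it into my XmlDocument, because loading it creates the nul byte and it is then difficult to sanitize it out of the object. Here is my entire process: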

      XmlDocument xml = new XmlDocument();
      details.Xml = details.Xml.Replace("&#x0;", "[0x00]");  // in my case I wanted to see it, otherwise just replace with ""
      xml.LoadXml(details.Xml);
      
      string formattedXml = null;
      
      // I stuff this all in a helper function, but put it in-line for this example
      StringBuilder sb = new StringBuilder();
      XmlWriterSettings settings = new XmlWriterSettings {
          OmitXmlDeclaration = true,
          Indent = true,
          IndentChars = "\t",
          NewLineHandling = NewLineHandling.None,
      };
      using (XmlWriter writer = XmlWriter.Create(sb, settings)) {
          xml.Save(writer);
          formattedXml = sb.ToString();
      }
      

      LESSON LEARNED: if your incoming data is HTML-encoded on entry, sanitize for illegal bytes using the associated HTML entity.
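
      The same idea can be generalized beyond &#x0;. The sketch below is an assumption-laden illustration, not part of the original answer: `XmlEntityCleaner` and `ReplaceIllegalCharRefs` are hypothetical names, and it only handles numeric character references (decimal or lowercase-x hexadecimal) in the raw XML text before it is loaded.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class XmlEntityCleaner
{
    // Matches numeric character references such as &#x0; or &#31;
    private static readonly Regex NumericRef =
        new Regex(@"&#(?:x(?<hex>[0-9A-Fa-f]+)|(?<dec>[0-9]+));");

    // Hypothetical helper: replaces references to characters that XML 1.0 forbids
    // and leaves every other reference untouched.
    public static string ReplaceIllegalCharRefs(string xml, string replacement)
    {
        return NumericRef.Replace(xml, m =>
        {
            long code;
            bool parsed = m.Groups["hex"].Success
                ? long.TryParse(m.Groups["hex"].Value, NumberStyles.AllowHexSpecifier,
                                CultureInfo.InvariantCulture, out code)
                : long.TryParse(m.Groups["dec"].Value, NumberStyles.None,
                                CultureInfo.InvariantCulture, out code);
            return parsed && IsLegalCodePoint(code) ? m.Value : replacement;
        });
    }

    // The Char production of the XML 1.0 spec, applied to a code point.
    private static bool IsLegalCodePoint(long c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }
}
```

      Called as `XmlEntityCleaner.ReplaceIllegalCharRefs(details.Xml, "[0x00]")` before `LoadXml`, this would cover the &#x0; case above as well as other control-character references such as &#x1E;.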

      【Comments】:

        【Solution 6】:

        Another way to remove incorrect XML characters in C# is to use the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):

        public static string RemoveInvalidXmlChars(string content)
        {
           return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
        }
        

        .Net Fiddle - https://dotnetfiddle.net/v1TNus

        For example, the vertical tab character (\v) is valid UTF-8 but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.
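
        One caveat worth illustrating: the filter above looks at one UTF-16 code unit at a time, so it may also drop characters outside the Basic Multilingual Plane (emoji, for instance), which .NET stores as surrogate pairs. The following is a sketch of a pair-aware variant; the class and method names are hypothetical, and it relies on `char.IsSurrogatePair` rather than on any assumption about how `XmlConvert.IsXmlChar` treats surrogates.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper; XmlConvert.IsXmlChar requires .NET Framework 4.0 or later.
    public static string RemoveInvalidXmlCharsKeepPairs(string content)
    {
        if (string.IsNullOrEmpty(content)) return content;

        var sb = new StringBuilder(content.Length);
        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (char.IsSurrogatePair(content, i))
            {
                // A well-formed pair encodes a code point in #x10000-#x10FFFF,
                // which XML 1.0 allows, so keep both code units.
                sb.Append(c).Append(content[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            // Anything else (control characters, lone surrogates, U+FFFE, U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

        Unlike the LINQ one-liner, this always allocates a StringBuilder, so it trades a little speed for keeping supplementary characters intact.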

        【Comments】:

          【Solution 7】:

          There is a generic solution that works nicely:

          public class XmlTextTransformWriter : System.Xml.XmlTextWriter
          {
              public XmlTextTransformWriter(System.IO.TextWriter w) : base(w) { }
              public XmlTextTransformWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { }
              public XmlTextTransformWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { }
          
              public Func<string, string> TextTransform = s => s;
          
              public override void WriteString(string text)
              {
                  base.WriteString(TextTransform(text));
              }
          
              public override void WriteCData(string text)
              {
                  base.WriteCData(TextTransform(text));
              }
          
              public override void WriteComment(string text)
              {
                  base.WriteComment(TextTransform(text));
              }
          
              public override void WriteRaw(string data)
              {
                  base.WriteRaw(TextTransform(data));
              }
          
              public override void WriteValue(string value)
              {
                  base.WriteValue(TextTransform(value));
              }
          }
          

          Once this is in place, you can create your override of it as follows:

          public class XmlRemoveInvalidCharacterWriter : XmlTextTransformWriter
          {
              public XmlRemoveInvalidCharacterWriter(System.IO.TextWriter w) : base(w) { SetTransform(); }
              public XmlRemoveInvalidCharacterWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { SetTransform(); }
              public XmlRemoveInvalidCharacterWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { SetTransform(); }
          
              void SetTransform()
              {
                  TextTransform = XmlUtil.RemoveInvalidXmlChars;
              }
          }
          

          where XmlUtil.RemoveInvalidXmlChars is defined as follows:

              public static string RemoveInvalidXmlChars(string content)
              {
                  if (content.Any(ch => !System.Xml.XmlConvert.IsXmlChar(ch)))
                      return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
                  else
                      return content;
              }
          

          【Comments】:

            【Solution 8】:

            Can't you sanitize the string with:

            System.Net.WebUtility.HtmlDecode()
            

            ?

            【Comments】:
