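【Question title】: Split a comma-separated string with both quoted and unquoted strings [duplicate]
【Posted】: 2011-04-16 03:25:44
【Question description】:

I need to split the following comma-separated string. The problem is that some of the content is inside quotes and contains commas that should not be used in the split.

String: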

111,222,"33,44,55",666,"77,88","99"

Desired output:

111  
222  
33,44,55  
666  
77,88  
99  

I tried this:

(?:,?)((?<=")[^"]+(?=")|[^",]+)   

But it treats the comma between "77,88" and "99" as a hit, and I get the following output:

111  
222  
33,44,55  
666  
77,88  
,  
99  

【Question discussion】:

    Tags: c# regex


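    【Solution 1】:

    Depending on your needs, you may not be able to use a CSV parser, and may in fact want to reinvent the wheel!

    You can do it with a fairly simple regular expression: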

    (?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)
    

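    This does the following:

    (?:^|,) = a match expression for "the start of the line/string, or a comma"

    (\"(?:[^\"]+|\"\")*\"|[^,]*) = a numbered capture group that chooses between 2 alternatives:

    1. stuff inside quotes
    2. stuff between commas

    This should give you the output you are looking for.

    Example code in C#: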

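    // Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
    static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

    public static string[] SplitCSV(string input)
    {
      List<string> list = new List<string>();
      foreach (Match match in csvSplit.Matches(input))
      {
        // Each match is one field plus its leading comma (if any). An empty
        // match is an empty field, so a single Add per match covers every case.
        // Quoted fields keep their surrounding quotes.
        list.Add(match.Value.TrimStart(','));
      }

      return list.ToArray();
    }

    private void button1_Click(object sender, RoutedEventArgs e)
    {
        // Join the fields for display; WriteLine on the array itself prints only its type name.
        Console.WriteLine(string.Join(Environment.NewLine, SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\"")));
    }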
    

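    Warning, per @MrE's comment: if a rogue newline turns up in a malformed CSV file and you end up with an unbalanced quote ("string), you will get catastrophic backtracking in this regex, and your system may crash (as our production system did). It is easy to reproduce in Visual Studio, where I found it crashes. A simple try/catch will not trap the problem either.

    You should use: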

    (?:^|,)(\"(?:[^\"])*\"|[^,]*)
    

    instead.
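As a rough illustration of the warning above, here is a sketch that uses the simplified pattern and also sets a match timeout, so that a pathological line fails fast instead of hanging. The class name `SafeCsvRegex`, the 250 ms limit, and the choice to rethrow as `FormatException` are assumptions of this sketch, not part of the original answer. The `Regex` constructor that takes a `TimeSpan` is available from .NET 4.5 onwards.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SafeCsvRegex
{
    // Simplified pattern from the answer; the timeout is an addition of this sketch.
    private static readonly Regex CsvSplit = new Regex(
        "(?:^|,)(\"(?:[^\"])*\"|[^,]*)",
        RegexOptions.Compiled,
        TimeSpan.FromMilliseconds(250));

    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        try
        {
            foreach (Match match in CsvSplit.Matches(input))
            {
                // Group 1 is the field without its leading comma.
                string field = match.Groups[1].Value;

                // Strip one pair of surrounding quotes, if present.
                if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
                {
                    field = field.Substring(1, field.Length - 2);
                }

                fields.Add(field);
            }
        }
        catch (RegexMatchTimeoutException)
        {
            // Matching took too long; report it instead of blocking the caller.
            throw new FormatException("CSV line could not be parsed within the time limit.");
        }

        return fields;
    }
}
```

Calling `SafeCsvRegex.Split("111,222,\"33,44,55\",666,\"77,88\",\"99\"")` should yield the six fields from the question with the quotes removed. Note that this pattern does not treat a doubled quote (`""`) as an escaped quote.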

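    【Discussion】:

    • It made a lot of sense without a code example, and none was given because I didn't know which language he was writing in. I have now included an example in C#.
    • Hmm, doesn't work for me... commas inside double quotes are still used to "split" my string.
    • The regex above fails on the following (assuming comma-separated): whatever,"""1,2,3,4,6(31/01/14)11(5) ,12 (MINIMUM 4 STAR WELS, TAPS , SHOWER & WC'S ),13,14,15,A""",another column, because it gets split into whatever"""1,2,3,4,6(31/01/14)11(5) ,12 (MINIMUM 4 STAR WELS, TAPS , SHOWER & WC'S ),13,14,15,A"""another column. Any ideas on an update to the regex?
    • I changed list.Add to list.Add(curr.TrimStart(',').TrimStart('"').TrimEnd('"'));
    • If a line starts with " but is missing the closing quote (i.e. a corrupted CSV file), this produces catastrophic backtracking (regular-expressions.info/catastrophic.html). (?:^|,)(\"(?:[^\"])*\"|[^,]*) covers it without that problem, and is simpler.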
    【Solution 2】:

    Quick and simple:

        public static string[] SplitCsv(string line)
        {
            List<string> result = new List<string>();
            StringBuilder currentStr = new StringBuilder("");
            bool inQuotes = false;
            for (int i = 0; i < line.Length; i++) // For each character
            {
                if (line[i] == '\"') // Quotes are closing or opening
                    inQuotes = !inQuotes;
                else if (line[i] == ',') // Comma
                {
                    if (!inQuotes) // If not in quotes, end of current string, add it to result
                    {
                        result.Add(currentStr.ToString());
                        currentStr.Clear();
                    }
                    else
                        currentStr.Append(line[i]); // If in quotes, just add it 
                }
                else // Add any other character to current string
                    currentStr.Append(line[i]); 
            }
            result.Add(currentStr.ToString());
            return result.ToArray(); // Return array of all strings
        }
    

    Given this string as input:

     111,222,"33,44,55",666,"77,88","99"
    

    It returns:

    111  
    222  
    33,44,55  
    666  
    77,88  
    99  
    

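    【Discussion】:

    • It would be most useful if you could explain the main parts of your approach in your code.
    • OK, I added comments and an example. I also optimized it using StringBuilder.
    • Nice work. This helped me a lot. Thanks.
    • Love this answer. The problem is similar to processing expressions (e.g. math with parentheses and operators), and this approach solves it in a direct, predictable and readable way. Unlike the regex solutions.
    • Caution: this solution can have a problem! If a column in your line has a "real" quote in the middle of its text (real meaning it belongs to the text and is not meant to carry data semantics), it does not work correctly. For example, given the line "C1","C2withA"Text","C3", the algorithm above does not give you the 3 strings C1, C2withA"Text and C3. The example is odd and should be avoided (perhaps through data cleansing rules?), but it can happen, and Excel does get the "right" result C1,C2......,C3.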
    【Solution 3】:

    I really like jimplode's answer, but I think a version with yield return is more useful, so here it is:

    public IEnumerable<string> SplitCSV(string input)
    {
        Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
    
        foreach (Match match in csvSplit.Matches(input))
        {
            yield return match.Value.TrimStart(',');
        }
    }
    

    It may be even more useful as an extension method:

    public static class StringHelper
    {
        public static IEnumerable<string> SplitCSV(this string input)
        {
            Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
    
            foreach (Match match in csvSplit.Matches(input))
            {
                yield return match.Value.TrimStart(',');
            }
        }
    }
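A possible variation on the extension method above, sketched under a few assumptions: the regex is built once in a static field instead of on every call, and the method is renamed `SplitCsvLazy` (a hypothetical name) to stress that fields are produced one at a time as the caller enumerates. The pattern is the one from the answer, so quoted fields still keep their quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvStringExtensions
{
    // Built once and reused, rather than constructed on every call.
    private static readonly Regex CsvSplit =
        new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

    public static IEnumerable<string> SplitCsvLazy(this string input)
    {
        foreach (Match match in CsvSplit.Matches(input))
        {
            // Yield one field at a time; surrounding quotes are left in place.
            yield return match.Value.TrimStart(',');
        }
    }
}
```

Because the method is an iterator, something like `line.SplitCsvLazy().First()` only does the work needed for the first field, while `line.SplitCsvLazy().ToList()` materializes all of them.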
    

      【Solution 4】:

      This regex works without having to loop over the values and call TrimStart(','), as the accepted answer does:

      ((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))
      

      Here is the implementation in C#:

      string values = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";
      
      MatchCollection matches = new Regex("((?<=\")[^\"]*(?=\"(,|$)+)|(?<=,|^)[^,\"]*(?=,|$))").Matches(values);
      
      foreach (var match in matches)
      {
          Console.WriteLine(match);
      }
      

      Output:

      111  
      222  
      33,44,55  
      666  
      77,88  
      99  
      

      【Discussion】:

      • The regex above fails on the following (assuming comma-separated): whatever,"""1,2,3,4,6(31/01/14)11(5) ,12 (MINIMUM 4 STAR WELS, TAPS , SHOWER & WC'S ),13,14,15,A""",another column, because it gets split into whatever"""1,2,3,4,6(31/01/14)11(5) ,12 (MINIMUM 4 STAR WELS, TAPS , SHOWER & WC'S ),13,14,15,A"""another column. Any ideas on an update to the regex?
      • @FreeCoder24 omegacoder.com/?p=542
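      【Solution 5】:

      None of these answers work when the string has a comma inside quotes (as in "value, 1") or escaped double quotes (as in "value ""1"""), both of which are valid CSV and should be parsed as value, 1 and value "1", respectively.

      This also works for tab-delimited formats if you pass in a tab instead of a comma as the delimiter.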

      public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
      {
          var currentString = new StringBuilder();
          var inQuotes = false;
          var quoteIsEscaped = false; //Store when a quote has been escaped.
          row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
          foreach (var character in row.Select((val, index) => new {val, index}))
          {
              if (character.val == delimiter) //We hit a delimiter character...
              {
                  if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
                  {
                      Console.WriteLine(currentString);
                      yield return currentString.ToString();
                      currentString.Clear();
                  }
                  else
                  {
                      currentString.Append(character.val);
                  }
              } else {
                  if (character.val != ' ')
                  {
                      if(character.val == '"') //If we've hit a quote character...
                      {
                          if(character.val == '\"' && inQuotes) //Does it appear to be a closing quote?
                          {
                              if (row[character.index + 1] == character.val) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
                              {
                                  quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
                              }
                              else if (quoteIsEscaped)
                              {
                                  quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
                                  currentString.Append(character.val);
                              }
                              else
                              {
                                  inQuotes = false;
                              }
                          }
                          else
                          {
                              if (!inQuotes)
                              {
                                  inQuotes = true;
                              }
                              else
                              {
                                  currentString.Append(character.val); //...It's a quote inside a quote.
                              }
                          }
                      }
                      else
                      {
                          currentString.Append(character.val);
                      }
                  }
                  else
                  {
                      if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
                      {
                          currentString.Append(character.val);
                      }
                  }
              }
          }
      }
      

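        【Solution 6】:

        A small update to the function provided by "Chad Hedgcock".

        The updates are:

        Line 26: character.val == '\"' - this can never be true because of the check made on line 24, i.e. character.val == '"'

        Line 28: if (row[character.index + 1] == character.val) - added !quoteIsEscaped to escape 3 consecutive quotes.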

        public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
        {
        var currentString = new StringBuilder();
        var inQuotes = false;
        var quoteIsEscaped = false; //Store when a quote has been escaped.
        row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
        foreach (var character in row.Select((val, index) => new {val, index}))
        {
            if (character.val == delimiter) //We hit a delimiter character...
            {
                if (!inQuotes) //Are we inside quotes? If not, we've hit the end of a cell value.
                {
                    //Console.WriteLine(currentString);
                    yield return currentString.ToString();
                    currentString.Clear();
                }
                else
                {
                    currentString.Append(character.val);
                }
            } else {
                if (character.val != ' ')
                {
                    if(character.val == '"') //If we've hit a quote character...
                    {
                        if(character.val == '"' && inQuotes) //Does it appear to be a closing quote?
                        {
                            if (row[character.index + 1] == character.val && !quoteIsEscaped) //If the character afterwards is also a quote, this is to escape that (not a closing quote).
                            {
                                quoteIsEscaped = true; //Flag that we are escaped for the next character. Don't add the escaping quote.
                            }
                            else if (quoteIsEscaped)
                            {
                                quoteIsEscaped = false; //This is an escaped quote. Add it and revert quoteIsEscaped to false.
                                currentString.Append(character.val);
                            }
                            else
                            {
                                inQuotes = false;
                            }
                        }
                        else
                        {
                            if (!inQuotes)
                            {
                                inQuotes = true;
                            }
                            else
                            {
                                currentString.Append(character.val); //...It's a quote inside a quote.
                            }
                        }
                    }
                    else
                    {
                        currentString.Append(character.val);
                    }
                }
                else
                {
                    if (!string.IsNullOrWhiteSpace(currentString.ToString())) //Append only if not new cell
                    {
                        currentString.Append(character.val);
                    }
                }
            }
        }
        }

          【Solution 7】:

          Regarding Jay's answer: if you use a second boolean, you can nest double quotes inside single quotes and vice versa.

          private string[] splitString(string stringToSplit)
          {
              char[] characters = stringToSplit.ToCharArray();
              List<string> returnValueList = new List<string>();
              string tempString = "";
              bool blockUntilEndQuote = false;
              bool blockUntilEndQuote2 = false;
              int characterCount = 0;
              foreach (char character in characters)
              {
                  characterCount = characterCount + 1;
          
                  if (character == '"' && !blockUntilEndQuote2)
                  {
                      if (blockUntilEndQuote == false)
                      {
                          blockUntilEndQuote = true;
                      }
                      else if (blockUntilEndQuote == true)
                      {
                          blockUntilEndQuote = false;
                      }
                  }
                  if (character == '\'' && !blockUntilEndQuote)
                  {
                      if (blockUntilEndQuote2 == false)
                      {
                          blockUntilEndQuote2 = true;
                      }
                      else if (blockUntilEndQuote2 == true)
                      {
                          blockUntilEndQuote2 = false;
                      }
                  }
          
                  if (character != ',')
                  {
                      tempString = tempString + character;
                  }
                  else if (character == ',' && (blockUntilEndQuote == true || blockUntilEndQuote2 == true))
                  {
                      tempString = tempString + character;
                  }
                  else
                  {
                      returnValueList.Add(tempString);
                      tempString = "";
                  }
          
                  if (characterCount == characters.Length)
                  {
                      returnValueList.Add(tempString);
                      tempString = "";
                  }
              }
          
              string[] returnValue = returnValueList.ToArray();
              return returnValue;
          }
          

            【Solution 8】:

            Original version

            Currently I use the following regex:

            public static Regex regexCSVSplit = new Regex(@"(?x:(
                  (?<FULL>
                    (^|[,;\t\r\n])\s*
                    ( (?<QUODAT> (?<QUO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<QUO>\s*)[,;\t\r\n])*)\k<QUO>) |
                      (?<QUODAT> (?<DAT> [^""',;\s\r\n]* )) )
                    (?=\s*([,;\t\r\n]|$))
                  ) |
                  (?<FULL>
                    (^|[\s\t\r\n])
                    ( (?<QUODAT> (?<QUO>[""'])(?<DAT> [^""',;\s\t\r\n]* )\k<QUO>) |
                      (?<QUODAT> (?<DAT> [^""',;\s\t\r\n]* )) )
                    (?=[,;\s\t\r\n]|$)
                  )
                ))", RegexOptions.Compiled);
            

            This solution can also handle quite messy cases, such as the following:

            This is how to get the results into an array:

            var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().
                  Select(x => x.Groups["DAT"].Value).ToArray();
            

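            See this example in action HERE

            Note: the regex contains two <FULL> blocks, each containing two <QUODAT> blocks separated by "or" (|). Depending on your task, you may need only one of them.

            Note: this regex gives us an array of strings and can handle a single line with or without <carriage return> and/or <line feed>.

            Simplified version

            The regex below already covers a lot of complex cases: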

            public static Regex regexCSVSplit = new Regex(@"(?x:(
                  (?<FULL>
                    (^|[,;\t\r\n])\s*
                    (?<QUODAT> (?<QUO>[""'])(?<DAT>([^,;\t\r\n]|(?<!\k<QUO>\s*)[,;\t\r\n])*)\k<QUO>)
                    (?=\s*([,;\t\r\n]|$))
                  )
                ))", RegexOptions.Compiled);
            

            See this example: HERE

            It can handle complex, simple and empty items as well:

            This is how to get the results into an array:

            var data = regexCSVSplit.Matches(line_to_process).Cast<Match>().
                  Select(x => x.Groups["DAT"].Value).ToArray();
            

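            The main rule here is that each item may contain anything except the sequence <quotation mark><separators><comma>, and each item should end with the same <quotation mark> it began with.

            • <quotation mark>: <">, <'>
            • <comma>: <,>, <;>, <tab>, <carriage return>, <line feed>

            Edit note: I added more explanation to make it easier to understand, and replaced the text "CO" with "QUO".

            【Discussion】:

            • This is actually the only solution that matches a simple "E", "Z" string.
            • @esskar I also added a simplified version, which still works for simple cases like "E", "Z" as well as for more complex ones.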
            【Solution 9】:

            Try this:

                   string s = @"111,222,""33,44,55"",666,""77,88"",""99""";
            
                   List<string> result = new List<string>();
            
                   var splitted = s.Split('"').ToList<string>();
                   splitted.RemoveAll(x => x == ",");
                   foreach (var it in splitted)
                   {
                       if (it.StartsWith(",") || it.EndsWith(","))
                       {
                           var tmp = it.TrimEnd(',').TrimStart(',');
                           result.AddRange(tmp.Split(','));
                       }
                       else
                       {
                           if(!string.IsNullOrEmpty(it)) result.Add(it);
                       }
                  }
                   //Results:
            
                   foreach (var it in result)
                   {
                       Console.WriteLine(it);
                   }
            

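            【Discussion】:

            • Your function cannot handle a quoted string that begins with a comma. string s = @"AAA,"",BBB,CCC"""; The string above should produce two tokens, but your function outputs three.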
            【Solution 10】:

            I know I'm a bit late to this, but for anyone searching, here is how I did what you are asking in C#:

            private string[] splitString(string stringToSplit)
                {
                    char[] characters = stringToSplit.ToCharArray();
                    List<string> returnValueList = new List<string>();
                    string tempString = "";
                    bool blockUntilEndQuote = false;
                    int characterCount = 0;
                    foreach (char character in characters)
                    {
                        characterCount = characterCount + 1;
            
                        if (character == '"')
                        {
                            if (blockUntilEndQuote == false)
                            {
                                blockUntilEndQuote = true;
                            }
                            else if (blockUntilEndQuote == true)
                            {
                                blockUntilEndQuote = false;
                            }
                        }
            
                        if (character != ',')
                        {
                            tempString = tempString + character;
                        }
                        else if (character == ',' && blockUntilEndQuote == true)
                        {
                            tempString = tempString + character;
                        }
                        else
                        {
                            returnValueList.Add(tempString);
                            tempString = "";
                        }
            
                        if (characterCount == characters.Length)
                        {
                            returnValueList.Add(tempString);
                            tempString = "";
                        }
                    }
            
                    string[] returnValue = returnValueList.ToArray();
                    return returnValue;
                }
            

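              【Solution 11】:

              Don't reinvent the CSV parser; try FileHelpers.

              【Discussion】:

              • For one-off CSV parsing, that solution looks rather cumbersome. According to the docs, "Next, you need to define a class that maps to the record in the source/destination file." So if I'm writing a one-off program to parse a CSV file, I have to define a class containing every field in the CSV file? No thanks.
              • Yes, this answer is rubbish. No explanation, a limited use case, and it doesn't solve the user's problem.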
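              【Solution 12】:

              I needed something a bit more robust, so I took from here and created this... This solution is a little less elegant and more verbose, but in my testing (with a sample of 1,000,000 rows) I found it to be 2 to 3 times faster. It also handles unescaped embedded quotes. Because of my solution's requirements, I used strings rather than chars for the delimiter and qualifier. I found it harder than I expected to find a good, generic CSV parser, so I hope this parsing algorithm helps someone.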

                  public static string[] SplitRow(string record, string delimiter, string qualifier, bool trimData)
                  {
                      // In-Line for example, but I implemented as string extender in production code
                      Func <string, int, int> IndexOfNextNonWhiteSpaceChar = delegate (string source, int startIndex)
                      {
                          if (startIndex >= 0)
                          {
                              if (source != null)
                              {
                                  for (int i = startIndex; i < source.Length; i++)
                                  {
                                      if (!char.IsWhiteSpace(source[i]))
                                      {
                                          return i;
                                      }
                                  }
                              }
                          }
              
                          return -1;
                      };
              
                      var results = new List<string>();
                      var result = new StringBuilder();
                      var inQualifier = false;
                      var inField = false;
              
                      // We add new columns at the delimiter, so append one for the parser.
                      var row = $"{record}{delimiter}";
              
                      for (var idx = 0; idx < row.Length; idx++)
                      {
                          // A delimiter character...
                          if (row[idx]== delimiter[0])
                          {
                              // Are we inside qualifier? If not, we've hit the end of a column value.
                              if (!inQualifier)
                              {
                                  results.Add(trimData ? result.ToString().Trim() : result.ToString());
                                  result.Clear();
                                  inField = false;
                              }
                              else
                              {
                                  result.Append(row[idx]);
                              }
                          }
              
                          // NOT a delimiter character...
                          else
                          {
                              // ...Not a space character
                              if (row[idx] != ' ')
                              {
                                  // A qualifier character...
                                  if (row[idx] == qualifier[0])
                                  {
                                      // Qualifier is closing qualifier...
                                      if (inQualifier && row[IndexOfNextNonWhiteSpaceChar(row, idx + 1)] == delimiter[0])
                                      {
                                          inQualifier = false;
                                          continue;
                                      }
              
                                      else
                                      {
                                          // ...Qualifier is opening qualifier
                                          if (!inQualifier)
                                          {
                                              inQualifier = true;
                                          }
              
                                          // ...It's a qualifier inside a qualifier.
                                          else
                                          {
                                              inField = true;
                                              result.Append(row[idx]);
                                          }
                                      }
                                  }
              
                                  // Not a qualifier character...
                                  else
                                  {
                                      result.Append(row[idx]);
                                      inField = true;
                                  }
                              }
              
                              // ...A space character
                              else
                              {
                                  if (inQualifier || inField)
                                  {
                                      result.Append(row[idx]);
                                  }
                              }
                          }
                      }
              
                      return results.ToArray<string>();
                  }
              

              Some test code:

                      //var input = "111,222,\"33,44,55\",666,\"77,88\",\"99\"";
              
                      var input =
                          "111, 222, \"99\",\"33,44,55\" ,      \"666 \"mark of a man\"\", \" spaces \"77,88\"   \"";
              
                      Console.WriteLine("Split with trim");
                      Console.WriteLine("---------------");
                      var result = SplitRow(input, ",", "\"", true);
                      foreach (var r in result)
                      {
                          Console.WriteLine(r);
                      }
                      Console.WriteLine("");
              
                      // Split 2
                      Console.WriteLine("Split with no trim");
                      Console.WriteLine("------------------");
                      var result2 = SplitRow(input, ",", "\"", false);
                      foreach (var r in result2)
                      {
                          Console.WriteLine(r);
                      }
                      Console.WriteLine("");
              
                      // Time Trial 1
                      Console.WriteLine("Experimental Process (1,000,000) iterations");
                      Console.WriteLine("-------------------------------------------");
                      var watch = Stopwatch.StartNew(); // declared here so the snippet compiles on its own
                      for (var i = 0; i < 1000000; i++)
                      {
                          var x1 = SplitRow(input, ",", "\"", false);
                      }
                      watch.Stop();
                      var elapsedMs = watch.ElapsedMilliseconds;
                      Console.WriteLine($"Total Process Time: {string.Format("{0:0.###}", elapsedMs / 1000.0)} Seconds");
                      Console.WriteLine("");
              

              Results:

              Split with trim
              ---------------
              111
              222
              99
              33,44,55
              666 "mark of a man"
              spaces "77,88"
              
              Split with no trim
              ------------------
              111
              222
              99
              33,44,55
              666 "mark of a man"
               spaces "77,88"
              
              Original Process (1,000,000) iterations
              -------------------------------
              Total Process Time: 7.538 Seconds
              
              Experimental Process (1,000,000) iterations
              --------------------------------------------
              Total Process Time: 3.363 Seconds
              

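              【Discussion】:

              • This approach is actually better. It is faster than the RegEx approach.
              • You should use the char type for the delimiter and the qualifier, since right now you only use the first character of each string.
              • There are two problems with this code. First, if 'trimData' is true, it preserves trailing spaces but not leading spaces. Second, it treats two consecutive quotes as a literal quote plus the terminating quote of the substring. If you are parsing CSV files saved by Excel, three consecutive quotes denote a literal quote plus the terminating closing quote. I heavily modified the code and posted it in an answer to a duplicate question.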
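To make the point about consecutive quotes concrete, here is a minimal sketch of a character-by-character splitter in which a doubled quote inside a quoted field becomes one literal quote, so three quotes in a row at the end of a field read as "literal quote, then closing quote". This is not the commenter's modified code (which is not reproduced in this thread); the class name `QuoteAwareSplitter` and its signature are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitter
{
    public static List<string> Split(string line, char delimiter = ',', char quote = '"')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];

            if (inQuotes)
            {
                if (c != quote)
                {
                    current.Append(c); // ordinary character, delimiters included
                }
                else if (i + 1 < line.Length && line[i + 1] == quote)
                {
                    current.Append(quote); // doubled quote -> one literal quote
                    i++;                   // skip the second quote of the pair
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == quote)
            {
                inQuotes = true; // opening quote
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString()); // end of field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // last field has no trailing delimiter
        return fields;
    }
}
```

With this sketch, the input `"value ""1""",x` splits into `value "1"` and `x`. It does not trim whitespace around fields and does not handle line breaks inside quoted fields.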
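              【Solution 13】:

              I once had to do something similar, and in the end I got stuck with regular expressions. The inability of regex to hold state makes it quite tricky - I ended up just writing a simple little parser.

              If you are doing CSV parsing, you should stick with a CSV parser - don't reinvent the wheel.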
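For a one-off split, one existing parser that needs no record class is `TextFieldParser` from the `Microsoft.VisualBasic.FileIO` namespace. This is a swapped-in suggestion, not something named in the answers above. The sketch below assumes the project can reference the Microsoft.VisualBasic assembly; the wrapper class `BuiltInCsv` is a hypothetical name.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class BuiltInCsv
{
    public static string[] SplitLine(string line)
    {
        using (var reader = new StringReader(line))
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // commas inside quotes stay in the field

            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

`BuiltInCsv.SplitLine("111,222,\"33,44,55\",666,\"77,88\",\"99\"")` should return the six fields from the question, with the surrounding quotes removed.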

                【Solution 14】:

                Here is my fastest implementation, based on raw pointer manipulation of the string:

                string[] FastSplit(string sText, char? cSeparator = null, char? cQuotes = null)
                    {            
                        string[] oTokens;
                
                        if (null == cSeparator)
                        {
                            cSeparator = DEFAULT_PARSEFIELDS_SEPARATOR;
                        }
                
                        if (null == cQuotes)
                        {
                            cQuotes = DEFAULT_PARSEFIELDS_QUOTE;
                        }
                
                        unsafe
                        {
                            fixed (char* lpText = sText)
                            {
                                #region Fast array estimation
                
                                char* lpCurrent      = lpText;                    
                                int   nEstimatedSize = 0;
                
                                while (0 != *lpCurrent)
                                {
                                    if (cSeparator == *lpCurrent)
                                    {
                                        nEstimatedSize++;
                                    }
                
                                    lpCurrent++;
                                }
                
                                nEstimatedSize++; // Add EOL char(s)
                                string[] oEstimatedTokens = new string[nEstimatedSize];
                
                                #endregion
                
                                #region Parsing
                
                                char[] oBuffer = new char[sText.Length];
                                int    nIndex  = 0;
                                int    nTokens = 0;
                
                                lpCurrent      = lpText;
                
                                while (0 != *lpCurrent)
                                {
                                    if (cQuotes == *lpCurrent)
                                    {
                                        // Quotes parsing
                
                                        lpCurrent++; // Skip quote
                                        nIndex = 0;  // Reset buffer
                
                                        while (
                                               (0       != *lpCurrent)
                                            && (cQuotes != *lpCurrent)
                                        )
                                        {
                                            oBuffer[nIndex] = *lpCurrent; // Store char
                
                                            lpCurrent++; // Move source cursor
                                            nIndex++;    // Move target cursor
                                        }
                
                                    } 
                                    else if (cSeparator == *lpCurrent)
                                    {
                                        // Separator char parsing
                
                                        oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex); // Store token
                                        nIndex                      = 0;                              // Skip separator and Reset buffer
                                    }
                                    else
                                    {
                                        // Content parsing
                
                                        oBuffer[nIndex] = *lpCurrent; // Store char
                                        nIndex++;                     // Move target cursor
                                    }
                
                                    lpCurrent++; // Move source cursor
                                }
                
                                // Recover pending buffer
                
                                if (nIndex > 0)
                                {
                                    // Store token
                
                                    oEstimatedTokens[nTokens++] = new string(oBuffer, 0, nIndex);
                                }
                
                                // Build final tokens list
                
                                if (nTokens == nEstimatedSize)
                                {
                                    oTokens = oEstimatedTokens;
                                }
                                else
                                {
                                    oTokens = new string[nTokens];
                                    Array.Copy(oEstimatedTokens, 0, oTokens, 0, nTokens);
                                }
                
                                #endregion
                            }
                        }
                
                        // Epilogue            
                
                        return oTokens;
                    }
                

                【Discussion】:

                  【Solution 15】:

                  Try this:

                      private string[] GetCommaSeperatedWords(string sep, string line)
                      {
                          List<string> list = new List<string>();
                          StringBuilder word = new StringBuilder();
                          int doubleQuoteCount = 0;
                          for (int i = 0; i < line.Length; i++)
                          {
                              string chr = line[i].ToString();
                              if (chr == "\"")
                              {
                                  if (doubleQuoteCount == 0)
                                      doubleQuoteCount++;
                                  else
                                      doubleQuoteCount--;
                  
                                  continue;
                              }
                              if (chr == sep && doubleQuoteCount == 0)
                              {
                                  list.Add(word.ToString());
                                  word = new StringBuilder();
                                  continue;
                              }
                              word.Append(chr);
                          }
                  
                          list.Add(word.ToString());
                  
                          return list.ToArray();
                      }
                  

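                  【Discussion】:

                    【Solution 16】:

                    This is Chad's answer rewritten with state-based logic. His answer failed for me when it encountered """BRAD""" as a field. That should return "BRAD", but it just ate all the remaining fields. While trying to debug it, I ended up rewriting it as state-based logic: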

                    enum SplitState { s_begin, s_infield, s_inquotefield, s_foundquoteinfield };
                    public static IEnumerable<string> SplitRow(string row, char delimiter = ',')
                    {
                        var currentString = new StringBuilder();
                        SplitState state = SplitState.s_begin;
                        row = string.Format("{0}{1}", row, delimiter); //We add new cells at the delimiter, so append one for the parser.
                        foreach (var character in row.Select((val, index) => new { val, index }))
                        {
                            //Console.WriteLine("character = " + character.val + " state = " + state);
                            switch (state)
                            {
                                case SplitState.s_begin:
                                    if (character.val == delimiter)
                                    {
                                        /* empty field */
                                        yield return currentString.ToString();
                                        currentString.Clear();
                                    } else if (character.val == '"')
                                    {
                                        state = SplitState.s_inquotefield;
                                    } else
                                    {
                                        currentString.Append(character.val);
                                        state = SplitState.s_infield;
                                    }
                                    break;
                                case SplitState.s_infield:
                                    if (character.val == delimiter)
                                    {
                                        /* field with data */
                                        yield return currentString.ToString();
                                        state = SplitState.s_begin;
                                        currentString.Clear();
                                    } else
                                    {
                                        currentString.Append(character.val);
                                    }
                                    break;
                                case SplitState.s_inquotefield:
                                    if (character.val == '"')
                                    {
                                        // could be end of field, or escaped quote.
                                        state = SplitState.s_foundquoteinfield;
                                    } else
                                    {
                                        currentString.Append(character.val);
                                    }
                                    break;
                                case SplitState.s_foundquoteinfield:
                                    if (character.val == '"')
                                    {
                                        // found escaped quote.
                                        currentString.Append(character.val);
                                        state = SplitState.s_inquotefield;
                                    }
                                    else if (character.val == delimiter)
                                    {
                                        // must have been last quote so we must find delimiter
                                        yield return currentString.ToString();
                                        state = SplitState.s_begin;
                                        currentString.Clear();
                                    }
                                    else
                                    {
                                        throw new Exception("Quoted field not terminated.");
                                    }
                                    break;
                                default:
                                    throw new Exception("unknown state:" + state);
                            }
                        }
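                        // Once the appended delimiter has been consumed, still being inside a quoted field means the closing quote is missing.
                        if (state == SplitState.s_inquotefield)
                        {
                            throw new Exception("Quoted field not terminated.");
                        }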
                    }
                    

                    This is many more lines of code than the other solutions, but it is easy to modify to handle additional edge cases.
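
To make the `"""BRAD"""` case concrete, here is a condensed, self-contained sketch of the same idea. It is not the code above: instead of a dedicated "found quote" state it uses a one-character lookahead, and the names `CsvRowSketch` and `Split` are hypothetical, chosen only for this illustration. It assumes a single row with no embedded newlines and treats `""` inside a quoted field as an escaped quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of the answer above.
public static class CsvRowSketch
{
    public static List<string> Split(string row, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < row.Length; i++)
        {
            char c = row[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);           // ordinary character inside quotes
                }
                else if (i + 1 < row.Length && row[i + 1] == '"')
                {
                    current.Append('"');         // "" inside quotes is an escaped quote
                    i++;                         // consume the second quote
                }
                else
                {
                    inQuotes = false;            // closing quote
                }
            }
            else if (c == '"' && current.Length == 0)
            {
                inQuotes = true;                 // opening quote at the start of a field
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());  // field boundary
                current.Clear();
            }
            else
            {
                current.Append(c);               // ordinary unquoted character
            }
        }

        if (inQuotes)
        {
            throw new FormatException("Quoted field not terminated.");
        }

        fields.Add(current.ToString());          // last field (may be empty)
        return fields;
    }
}
```

With this sketch, `CsvRowSketch.Split("111,222,\"33,44,55\",666,\"77,88\",\"99\"")` yields the six fields from the question, and `CsvRowSketch.Split("a,\"\"\"BRAD\"\"\",c")` yields `a`, `"BRAD"` (with the quotes kept) and `c`.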

                    【Discussion】:
