Splitting a string at all whitespace: answers

【Question title】: Splitting a string at all whitespace
【Posted】: 2019-05-18 18:36:16
【Question】:

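I need to split a string at all whitespace; the result should contain only the words themselves.

How can I do this in vb.net?

Tabs, newlines, etc. must all be split on!

This has been bugging me for quite a while now, because the syntax highlighter I wrote completely ignores the first word on every line except the very first line.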

【Comments】:

Possible duplicate; see also the use of StringSplitOptions there to remove the extra whitespace. stackoverflow.com/questions/6111298/…

Tags: .net vb.net

【Solution 1】:

String.Split() (with no parameters) splits on all whitespace, including LF/CR.
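
A minimal C# sketch of that behavior, for illustration only (the `SplitDemo` class, its method names, and the sample text are assumptions, not part of the original answer):

```csharp
using System;

public static class SplitDemo
{
    // Split() with no arguments binds to Split(params char[]) with an empty
    // array, which makes every white-space character a delimiter.
    public static string[] SplitOnWhitespace(string text)
    {
        return text.Split();
    }

    public static void Demo()
    {
        // Exactly one white-space character between words: space, tab, LF, CR.
        string[] words = SplitOnWhitespace("one two\tthree\nfour\rfive");
        Console.WriteLine(string.Join("|", words)); // one|two|three|four|five
    }
}
```

The VB.NET call is the same: `text.Split()`.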

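【Discussion】:

Why didn't they just make that an overload, lol? Thanks a lot!
Because it resolves to the Split(params char[]) overload with an empty array. The documentation for that overload mentions this behavior.
Note: as Johannes Rudolph mentions in his answer, String.Split will include empty elements if there are several whitespace characters in a row. That is why Rubens Farias's answer is better.
@ToolMakerSteve - to remove the empty elements: String.Split(new char[] {}, StringSplitOptions.RemoveEmptyEntries)
@Joe Good idea, thanks! Even simpler: line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)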

【Solution 2】:

Try this:

Regex.Split("your string here", "\s+")

【Discussion】:

That's C#. Without the @ you should be fine.

【Solution 3】:

If you want to avoid regular expressions, you can do this:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit"
    .Split()
    .Where(x => x != string.Empty)

The Visual Basic equivalent:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit" _
    .Split() _
    .Where(Function(X$) X <> String.Empty)

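The Where() matters: if your string contains several adjacent whitespace characters, it removes the empty strings that Split() produces.

At the time of writing, the currently accepted answer (https://stackoverflow.com/a/1563000/49241) does not take this into account.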
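
To make the effect of the filter concrete, here is a small sketch (the `EmptyEntryDemo` class and the sample input are illustrative assumptions). A space followed by a tab counts as two delimiters, so Split() alone emits an empty string between them:

```csharp
using System;
using System.Linq;

public static class EmptyEntryDemo
{
    // Plain Split(): adjacent white-space characters leave empty strings behind.
    public static string[] Raw(string text)
    {
        return text.Split();
    }

    // Split() followed by Where(): the empty strings are filtered out.
    public static string[] Filtered(string text)
    {
        return text.Split().Where(x => x != string.Empty).ToArray();
    }
}
```

For the input `"a \tb"`, `Raw` returns three elements (`"a"`, `""`, `"b"`) while `Filtered` returns only `"a"` and `"b"`.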

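【Discussion】:

Nice solution. Not only does it avoid the need for a Regex reference, it is also faster (see the post below). I would add that I don't think VB uses the lambda operator "=>", so the VB version is a little different; I think it is something like this: s.Split().Where(Function(x) x <> String.Empty)
Hey @u8it, I added a VB.NET version to this answer. I only saw your comment a few days after I edited the answer!
@Sree Your edit is incorrect. The Visual Basic version is not equivalent to the C# version, because it uses String.IsNullOrWhiteSpace() instead of comparing against String.Empty with the != operator. Could you fix it? I don't know the Visual Basic syntax.
@Adam I put a Not in front, as in Not String.IsNullOrWhiteSpace(X). The Not operator negates a Boolean value. Is that what you are referring to?
I also tried the edit with a sample string before posting it, and it works exactly as the OP asked. Am I missing something you said?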

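【Solution 4】:

String.Split() splits on every single whitespace character, so the result will usually contain empty strings. The Regex solution given by Rubens Farias is the right way to do it. I upvoted his answer, but I want to add a little by dissecting the regex:

\s is a character class that matches all whitespace characters.

To split the string correctly when it contains several whitespace characters between words, we need to add a quantifier (or repetition operator) to the specification so that it matches all of the whitespace between words. The right quantifier here is +, meaning "one or more" occurrences of the given specification. Although the syntax "\s+" is sufficient here, I prefer the more explicit "[\s]+".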
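
The difference the quantifier makes can be sketched as follows (the `RegexSplitDemo` class and the sample input are illustrative assumptions, not part of the original answer):

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // No quantifier: each white-space character is its own delimiter.
    public static string[] SingleWhitespace(string text)
    {
        return Regex.Split(text, @"\s");
    }

    // With "+": a whole run of white space is treated as one delimiter.
    public static string[] WhitespaceRun(string text)
    {
        return Regex.Split(text, @"\s+");
    }

    // The bracketed spelling matches exactly the same runs as \s+.
    public static string[] BracketedRun(string text)
    {
        return Regex.Split(text, @"[\s]+");
    }
}
```

For `"foo \t bar"`, `SingleWhitespace` returns four elements (two of them empty), while the other two methods return just `"foo"` and `"bar"`. Note that Regex.Split still yields an empty first or last element when the input starts or ends with whitespace, so indented lines may need a Trim() first.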

【Discussion】:

As usual, now we have two problems instead of one... ;-)

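【Solution 5】:

So, after seeing Adam Ralph's post, I suspected his solution would be faster than the Regex solution. I thought I'd share my test results, since I did find it to be faster.

There are really two factors at play (ignoring system variables): the number of substrings extracted (determined by the number of delimiters) and the total string length. The very simple scenario I plotted uses "A" as the substring, delimited by two whitespace characters (a space followed by a tab). This accentuates the effect of the number of substrings extracted. I then went on to do some multivariable testing to arrive at the following general equations for my OS.

Regex()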
t = (28.33*SSL + 572)(SSN/10^6)

Split().Where()
t = (6.23*SSL + 250)(SSN/10^6)

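where t is the execution time in milliseconds, SSL is the average substring length, and SSN is the number of delimited substrings in the string.

These equations can also be written as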

t = (28.33*SL + 572*SSN)/10^6

and

t = (6.23*SL + 250*SSN)/10^6

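where SL is the total string length (SL = SSL * SSN).

Conclusion: the Split().Where() solution is faster than Regex(). The number of substrings is the main factor, while string length plays a secondary role. The performance gain is roughly 2x and 5x on the respective coefficients.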
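
As a worked example, the two fitted equations can be evaluated directly. This sketch only restates the formulas above; the `TimingModel` class and its method names are made up for illustration, and the coefficients describe the answerer's machine, not a general property of .NET:

```csharp
using System;

public static class TimingModel
{
    // t = (28.33*SSL + 572) * (SSN / 10^6), in milliseconds
    public static double RegexMs(double ssl, double ssn)
    {
        return (28.33 * ssl + 572) * (ssn / 1e6);
    }

    // t = (6.23*SSL + 250) * (SSN / 10^6), in milliseconds
    public static double SplitWhereMs(double ssl, double ssn)
    {
        return (6.23 * ssl + 250) * (ssn / 1e6);
    }
}
```

For one million single-character substrings (SSL = 1, SSN = 10^6) the model predicts 600.33 ms for Regex() and 256.23 ms for Split().Where(), a ratio of about 2.3. With SSL = 100 the predictions are 3405 ms and 873 ms, a ratio of about 3.9, so the gap widens as the substrings get longer.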

Here is my test code (probably more material than necessary, but it was set up to collect the multivariable data I mentioned):

using System;
using System.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.Windows.Forms;
namespace ConsoleApplication1
{
    class Program
    {
        public enum TestMethods {regex, split};
        [STAThread]
        static void Main(string[] args)
        {
            //Compare TestMethod execution times and output result information
            //to the console at runtime and to the clipboard at program finish (so that data is ready to paste into analysis environment)
            #region Config_Variables
            //Choose test method from TestMethods enumerator (regex or split)
            TestMethods TestMethod = TestMethods.split;
            //Configure RepetitionString
            String RepetitionString =  string.Join(" \t", Enumerable.Repeat("A",100));
            //Configure initial and maximum count of string repetitions (final count may not equal max)
            int RepCountInitial = 100;
            int RepCountMax = 1000 * 100;

            //Step increment to next RepCount (calculated as 20% increase from current value)
            Func<int, int> Step = x => (int)Math.Round(x / 5.0, 0);
            //Execution count used to determine average speed (calculated to adjust down to 1 execution at long execution times)
            Func<double, int> ExecutionCount = x => (int)(1 + Math.Round(500.0 / (x + 1), 0));
            #endregion

            #region NonConfig_Variables
            string s; 
            string Results = "";
            string ResultInfo; 
            double ResultTime = 1;
            #endregion

            for (int RepCount = RepCountInitial; RepCount < RepCountMax; RepCount += Step(RepCount))
            {
                s = string.Join("", Enumerable.Repeat(RepetitionString, RepCount));
                ResultTime = Test(s, ExecutionCount(ResultTime), TestMethod);
                ResultInfo = ResultTime.ToString() + "\t" + RepCount.ToString() + "\t" + ExecutionCount(ResultTime).ToString() + "\t" + TestMethod.ToString();
                Console.WriteLine(ResultInfo); 
                Results += ResultInfo + "\r\n";
            }
            Clipboard.SetText(Results);
        }
        public static double Test(string s, int iMax, TestMethods Method)
        {
            switch (Method)
            {
                case TestMethods.regex:
                    return Math.Round(RegexRunTime(s, iMax),2);
                case TestMethods.split:
                    return Math.Round(SplitRunTime(s, iMax),2);
                default:
                    return -1;
            }
        }
        private static double RegexRunTime(string s, int iMax)
        {
            Stopwatch sw = new Stopwatch();
            sw.Restart();
            for (int i = 0; i < iMax; i++)
            {
                System.Collections.Generic.IEnumerable<string> ens = Regex.Split(s, @"\s+");
            }
            sw.Stop();
            return Math.Round(sw.ElapsedMilliseconds / (double)iMax, 2);
        }
        private static double SplitRunTime(string s,int iMax)
        {
            Stopwatch sw = new Stopwatch();
            sw.Restart();
            for (int i = 0; i < iMax; i++)
            {
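                // Note: Where() is lazily evaluated and ens is never enumerated, so only Split() itself is timed here.
                System.Collections.Generic.IEnumerable<string> ens = s.Split().Where(x => x != string.Empty);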
            }
            sw.Stop();
            return Math.Round(sw.ElapsedMilliseconds / (double)iMax, 2);
        }
    }
}

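【Discussion】:

A lot of effort, but neither solution is optimal. Just use str.Split((char[])null, StringSplitOptions.RemoveEmptyEntries) instead of filtering the empty strings out of the result afterwards.
That looks like a good option too. Have you compared it? I wonder what the performance comes down to, for example compiler optimization; maybe it ends up very similar in IL.
The post-filtering by the WhereIterator is definitely an extra expense. I created a quick performance test.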
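
A minimal sketch comparing the two approaches discussed in these comments (the `RemoveEmptyDemo` class and the sample input are illustrative assumptions). It checks only that the results match; it makes no claim about timing:

```csharp
using System;
using System.Linq;

public static class RemoveEmptyDemo
{
    // A null separator array means "split on white space"; the option drops
    // empty entries during the split itself.
    public static string[] WithOption(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // The Split().Where() variant filters the empty entries afterwards.
    public static string[] WithWhere(string text)
    {
        return text.Split().Where(x => x != string.Empty).ToArray();
    }
}
```

For `"  alpha \t beta\r\ngamma  "` both methods return `alpha`, `beta`, `gamma`, including on input with leading and trailing whitespace.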

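【Solution 6】:

I ended up using the solution Adam Ralph pointed out, together with the VB.NET comment below it from P57, with one odd exception: I found I had to add .ToList.ToArray at the end.

Like this: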

.Split().Where(Function(x) x <> String.Empty).ToList.ToArray

Without it, I kept getting "Unable to cast object of type 'WhereArrayIterator`1[System.String]' to type 'System.String[]'."
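
The error arises because Where() returns a lazily evaluated sequence rather than an array. A minimal C# sketch of the same situation (the `ToArrayDemo` class and its method names are made up for illustration; the exact iterator type name may vary between .NET versions):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToArrayDemo
{
    // Where() wraps the array in an iterator object; nothing is copied yet.
    public static IEnumerable<string> Lazy(string text)
    {
        return text.Split().Where(x => x != string.Empty);
    }

    // True only when the value really is a string array.
    public static bool IsStringArray(object value)
    {
        return value is string[];
    }

    // ToArray() enumerates the sequence and produces a real string[].
    public static string[] Materialize(string text)
    {
        return Lazy(text).ToArray();
    }
}
```

`IsStringArray(Lazy("a b"))` is false, which is why assigning the query to a String() variable fails, whereas `Materialize("a b")` returns a two-element array. A single .ToArray call is enough for this; the intermediate .ToList is not required.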

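【Discussion】:

I could only get this to work with: .Split().Where(Function(x) x <> String.Empty).ToArray
You're welcome. I suppose I should also have said at the time that this was with VS2013 and .NET 4.5.2, in case it is a recent change.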

【Solution 7】:

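Dim words As String = "This is a list of words, with: a bit of punctuation" + _
                          vbTab + "and a tab character." + vbNewLine
' vbNewLine is the two-character string CR+LF, and CChar() keeps only the first
' character of a string, so CR and LF are listed as separate delimiters here.
Dim split As String() = words.Split(New [Char]() {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf})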

【Discussion】: