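Word frequency in a large text file: answers

【Question title】: Word frequency in a large text file
【Posted】: 2015-05-16 19:56:12
【Question description】:

I am trying to read a large text file and output the distinct words in it along with their counts. I have made several attempts so far, and this is by far the fastest solution I have come up with.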

private static readonly char[] separators = { ' ' };

public IDictionary<string, int> Parse(string path)
{
    var wordCount = new Dictionary<string, int>();

    using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var streamReader = new StreamReader(fileStream))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                if (wordCount.ContainsKey(word))
                {
                    wordCount[word] = wordCount[word] + 1;
                }
                else
                {
                    wordCount.Add(word, 1);
                }
            }
        }
    }

    return wordCount;
}

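How I am measuring my solution

I have a 200MB text file whose total word count I know (from a text editor). I am using the Stopwatch class, counting the words to verify accuracy and measuring the time taken. So far it takes about 9 seconds.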
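The measurement described above could be sketched roughly as follows. This is only an illustration: `CountWords` and `TotalWords` are hypothetical helper names, and the input is a small in-memory array standing in for the 200MB file.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class MeasureSketch
{
    private static readonly char[] Separators = { ' ' };

    // Hypothetical stand-in for Parse(path): counts words in lines held in memory.
    public static IDictionary<string, int> CountWords(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            foreach (var word in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                int current;
                counts.TryGetValue(word, out current);
                counts[word] = current + 1;
            }
        }
        return counts;
    }

    // Sum of all counts; this is the number to compare with the editor's word total.
    public static long TotalWords(IDictionary<string, int> counts)
    {
        return counts.Values.Sum(c => (long)c);
    }

    public static void Main()
    {
        var lines = new[] { "a b a", "c  b a" };

        var stopwatch = Stopwatch.StartNew();
        var counts = CountWords(lines);
        stopwatch.Stop();

        Console.WriteLine("distinct={0} total={1} elapsed={2} ms",
            counts.Count, TotalWords(counts), stopwatch.ElapsedMilliseconds);
    }
}
```

Checking the summed total against a known figure guards against an optimization that is fast only because it silently drops words.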

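Other attempts

I tried to use multithreading via the TPL library. This involved batching multiple lines, sending each batch of lines to be processed in a separate task, and locking the read/write operations on the dictionary. However, this did not seem to give me any performance improvement.
It took about 30 seconds. I suspect the cost of locking the dictionary for reads/writes is too high to gain any performance.
I also had a look at the ConcurrentDictionary type, but as I understand it the AddOrUpdate method requires the calling code to handle the synchronization, and it brought no performance benefit.

I am sure there is a faster way to achieve this! Is there a better data structure for this problem?

Any suggestions or criticisms of my solution are welcome. I am trying to learn and improve here!

Cheers.

Update: here is the link to the test file I am using.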

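【Question discussion】:

What is your source file? 200MB of text could be an entire encyclopedia!
Re "batching multiple lines": it might be better to divide the whole file into as many parts as there are cores (n) and use n dictionaries without locking them. Merging them into one afterwards would probably be much faster, especially if there are many repeated words.
This looks like it could be solved efficiently with the map reduce paradigm. Here is an answer that explains how to apply map reduce to almost exactly what you are asking: stackoverflow.com/questions/12375761/…
This is exactly what I would have written, line by line :-) Wait... I would use TryGetValue instead of ContainsKey to reduce the number of dictionary accesses.
@pwee167 Out of curiosity, have you benchmarked how long the program takes just to read the 200MB file? Because 200MB / 9 sec is roughly 22MB/sec... that is already not bad.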
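The suggestion in the comments above (separate dictionaries that are never locked during counting, merged at the end) might look something like this sketch. It is an assumption on my part that `Parallel.ForEach` with task-local state is an acceptable stand-in for "one dictionary per core": the TPL creates one local dictionary per worker task, which is not necessarily exactly n. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class LocalDictionariesSketch
{
    private static readonly char[] Separators = { ' ' };

    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        var total = new Dictionary<string, int>();
        var gate = new object();

        Parallel.ForEach(
            lines,
            // One private dictionary per worker task, so counting needs no lock.
            () => new Dictionary<string, int>(),
            (line, state, local) =>
            {
                foreach (var word in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
                {
                    int current;
                    local.TryGetValue(word, out current);
                    local[word] = current + 1;
                }
                return local;
            },
            // Merge step: runs once per worker task, so the lock is taken rarely.
            local =>
            {
                lock (gate)
                {
                    foreach (var pair in local)
                    {
                        int current;
                        total.TryGetValue(pair.Key, out current);
                        total[pair.Key] = current + pair.Value;
                    }
                }
            });

        return total;
    }

    public static void Main()
    {
        var counts = Count(new[] { "a b", "b c", "a a" });
        foreach (var pair in counts)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}
```

Whether this beats the single-threaded loop depends on the input; only a measurement on the real file can tell.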

Tags: c# multithreading performance algorithm data-structures

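【Solution 1】:

The best short answer I can give is: measure, measure, measure. Stopwatch is nice for getting a feel for where time is spent, but eventually you will end up sprinkling large swaths of your code with it, or you will have to find a better tool for the purpose. I would suggest getting a dedicated profiler tool for this; there are many available for C# and .NET.

I managed to shave off about 43% of the total runtime in three steps.

First I measured your code and got this (profiler screenshot not preserved):

This seems to indicate that there are two hotspots here that we can try to combat:

String splitting (SplitInternal)
Dictionary maintenance (FindEntry, Insert, get_Item)

The last piece of the time is spent reading the file, and I really doubt we can gain much by changing that part of the code. Another answer here mentions using specific buffer sizes; I tried this and could not obtain measurable differences.

The first hotspot, string splitting, is fairly easy to deal with, but it involves rewriting a very simple call to string.Split into a bit more code. I rewrote the loop that processes one line to this: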

while ((line = streamReader.ReadLine()) != null)
{
    int lastPos = 0;
    for (int index = 0; index <= line.Length; index++)
    {
        if (index == line.Length || line[index] == ' ')
        {
            if (lastPos < index)
            {
                string word = line.Substring(lastPos, index - lastPos);
                // process word here
            }
            lastPos = index + 1;
        }
    }
}

Then I rewrote the processing of one word to this:

int currentCount;
wordCount.TryGetValue(word, out currentCount);
wordCount[word] = currentCount + 1;

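This relies on the following facts:

TryGetValue is cheaper than checking whether the word exists and then retrieving its current count
If TryGetValue fails to get the value (the key does not exist), it initializes the currentCount variable here to its default value, which is 0. This means we do not really need to check whether the word actually existed.
We can add new words to the dictionary through the indexer (it will either overwrite the existing value or add a new key+value to the dictionary)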
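The facts listed above can be checked in isolation with a small sketch like the one below. `Increment` is a hypothetical helper name used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class TryGetValueSketch
{
    // Adds one occurrence of word and returns the new count.
    public static int Increment(Dictionary<string, int> counts, string word)
    {
        int current;
        // When the key is missing, TryGetValue returns false and sets current to default(int), which is 0.
        counts.TryGetValue(word, out current);
        // The indexer setter adds the key if it is absent and overwrites the value otherwise.
        counts[word] = current + 1;
        return current + 1;
    }

    public static void Main()
    {
        var counts = new Dictionary<string, int>();
        Console.WriteLine(Increment(counts, "cat")); // prints 1 (key was added)
        Console.WriteLine(Increment(counts, "cat")); // prints 2 (value was overwritten)
        Console.WriteLine(counts.Count);             // prints 1
    }
}
```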

The final loop thus looks like this:

while ((line = streamReader.ReadLine()) != null)
{
    int lastPos = 0;
    for (int index = 0; index <= line.Length; index++)
    {
        if (index == line.Length || line[index] == ' ')
        {
            if (lastPos < index)
            {
                string word = line.Substring(lastPos, index - lastPos);
                int currentCount;
                wordCount.TryGetValue(word, out currentCount);
                wordCount[word] = currentCount + 1;
            }
            lastPos = index + 1;
        }
    }
}

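The new measurement shows this (profiler screenshot not preserved):

Details:

We went from 6876 ms to 5013 ms
We no longer spend time in SplitInternal, FindEntry and get_Item
We now spend time in TryGetValue and Substring instead

Here are the difference details (comparison screenshot not preserved):

As you can see, we lost more time than we gained in the new calls, which resulted in a net improvement.

However, we can do better. We are doing 2 dictionary lookups here, each of which involves calculating the hash code of the word and comparing it to keys in the dictionary. The first lookup is part of TryGetValue and the second is part of wordCount[word] = ....

We can remove the second dictionary lookup by storing a smarter data structure inside the dictionary, at the cost of more heap memory.

We can use Xanatos' trick of storing the count inside an object so that we can remove that second dictionary lookup: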

public class WordCount
{
    public int Count;
}

...

var wordCount = new Dictionary<string, WordCount>();

...

string word = line.Substring(lastPos, index - lastPos);
WordCount currentCount;
if (!wordCount.TryGetValue(word, out currentCount))
    wordCount[word] = currentCount = new WordCount();
currentCount.Count++;

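This only retrieves the count object from the dictionary; adding 1 for an extra occurrence does not involve the dictionary at all. The result of the method also changes, so that it returns this WordCount type as the dictionary's value instead of just an int.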
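Because the return type changes, callers have to read `Count` from each entry. A possible caller-side sketch is shown below; `ToPlainCounts` and `Top` are hypothetical helpers that are not part of the answer, and the input dictionary is built by hand instead of by parsing a file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class WordCount
{
    public int Count;
}

public static class WordCountResultSketch
{
    // Flattens the result back to plain ints for callers that expect the original shape.
    public static Dictionary<string, int> ToPlainCounts(IDictionary<string, WordCount> counts)
    {
        return counts.ToDictionary(pair => pair.Key, pair => pair.Value.Count);
    }

    // Returns the n most frequent words, most frequent first.
    public static List<KeyValuePair<string, int>> Top(IDictionary<string, WordCount> counts, int n)
    {
        return counts
            .OrderByDescending(pair => pair.Value.Count)
            .Take(n)
            .Select(pair => new KeyValuePair<string, int>(pair.Key, pair.Value.Count))
            .ToList();
    }

    public static void Main()
    {
        var counts = new Dictionary<string, WordCount>
        {
            { "the", new WordCount { Count = 3 } },
            { "cat", new WordCount { Count = 1 } }
        };

        foreach (var pair in Top(counts, 1))
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value); // prints "the: 3"
    }
}
```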

Net result: about 43% savings.

Final piece of code:

public class WordCount
{
    public int Count;
}

public static IDictionary<string, WordCount> Parse(string path)
{
    var wordCount = new Dictionary<string, WordCount>();

    using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
    using (var streamReader = new StreamReader(fileStream, Encoding.Default, false, 65536))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            int lastPos = 0;
            for (int index = 0; index <= line.Length; index++)
            {
                if (index == line.Length || line[index] == ' ')
                {
                    if (lastPos < index)
                    {
                        string word = line.Substring(lastPos, index - lastPos);
                        WordCount currentCount;
                        if (!wordCount.TryGetValue(word, out currentCount))
                            wordCount[word] = currentCount = new WordCount();
                        currentCount.Count++;
                    }
                    lastPos = index + 1;
                }
            }
        }
    }

    return wordCount;
}

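【Discussion】:

I implemented almost the same thing as this answer. The original code took about 30 seconds for me (my laptop is an old POS). With the optimizations here, minus the buffer sizes, I got about 19 seconds. Changing the buffer sizes to 64K saved about another 2 seconds.
Can you explain why you chose 65536 as the buffer size?
No, I just picked a reasonable power-of-2 number.

【Solution 2】:

Your approach seems to be in line with how most people would tackle it. You are right to notice that multithreading did not offer any significant gains: the bottleneck is most likely IO bound, and no matter what kind of hardware you have, you cannot read faster than your hardware supports.

If you are really looking for speed improvements (I doubt you will get any), you could try to implement a producer-consumer pattern, where one thread reads the file and other threads process the lines (and maybe then parallelise the checking of the words in a line). The trade-off here is that you are adding a lot more complex code in exchange for marginal improvements (only benchmarking can determine this).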

http://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem
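A minimal sketch of such a producer-consumer arrangement, assuming a `BlockingCollection<string>` as the queue between the reading thread and the workers. The names, the queue capacity of 10000 and the choice to give each worker a private dictionary are all illustrative assumptions, not something this answer prescribes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    private static readonly char[] Separators = { ' ' };

    public static Dictionary<string, int> Count(IEnumerable<string> lines, int consumers)
    {
        // Bounded queue, so the reader cannot run arbitrarily far ahead of the workers.
        using (var queue = new BlockingCollection<string>(10000))
        {
            // Producer: the only code that touches the line source.
            var producer = Task.Run(() =>
            {
                try { foreach (var line in lines) queue.Add(line); }
                finally { queue.CompleteAdding(); }
            });

            // Consumers: each one counts into its own private dictionary.
            var workers = Enumerable.Range(0, consumers).Select(_ => Task.Run(() =>
            {
                var local = new Dictionary<string, int>();
                foreach (var line in queue.GetConsumingEnumerable())
                {
                    foreach (var word in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
                    {
                        int current;
                        local.TryGetValue(word, out current);
                        local[word] = current + 1;
                    }
                }
                return local;
            })).ToArray();

            Task.WaitAll(workers);
            producer.Wait();

            // Merge the per-worker results on the calling thread.
            var total = new Dictionary<string, int>();
            foreach (var worker in workers)
            {
                foreach (var pair in worker.Result)
                {
                    int current;
                    total.TryGetValue(pair.Key, out current);
                    total[pair.Key] = current + pair.Value;
                }
            }
            return total;
        }
    }

    public static void Main()
    {
        var counts = Count(new[] { "a b", "b c", "a a" }, 2);
        foreach (var pair in counts)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}
```

For a real file, `File.ReadLines(path)` could be passed as `lines` so that only the producer reads from disk. As the answer says, whether the extra complexity pays off is something only a benchmark can show.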

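Edit: also take a look at ConcurrentDictionary

【Discussion】:

I had a look at ConcurrentDictionary and thought it might be useful, but from my research today I believe the AddOrUpdate method does not guarantee that dirty reads/writes will not occur.

【Solution 3】:

I gained quite a lot (from 25 seconds to 20 seconds on a 200MB file) simply by changing to: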

int cnt;

if (wordCount.TryGetValue(word, out cnt))
{
    wordCount[word] = cnt + 1;
}
else
....

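A variant based on ConcurrentDictionary<> and Parallel.ForEach (using the IEnumerable<> overload). Note that instead of an int I am using an InterlockedInt, which uses Interlocked.Increment to increment itself. Being a reference type, it works correctly with ConcurrentDictionary<>.GetOrAdd...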

public class InterlockedInt
{
    private int cnt;

    public int Cnt
    {
        get
        {
            return cnt;
        }
    }

    public void Increment()
    {
        Interlocked.Increment(ref cnt);
    }
}

public static IDictionary<string, InterlockedInt> Parse(string path)
{
    var wordCount = new ConcurrentDictionary<string, InterlockedInt>();

    Action<string> action = line2 =>
    {
        var words = line2.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (var word in words)
        {
            wordCount.GetOrAdd(word, x => new InterlockedInt()).Increment();
        }
    };

    IEnumerable<string> lines = File.ReadLines(path);
    Parallel.ForEach(lines, action);

    return wordCount;
}

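Note that using Parallel.ForEach is less efficient than directly using one thread per physical core (you can see how in the edit history). While both solutions take less than 10 seconds of wall-clock time on my PC, Parallel.ForEach uses 55 seconds of CPU time against 33 seconds for the Thread solution.

There is another trick that is worth about 5-10%: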

public static IEnumerable<T[]> ToBlock<T>(IEnumerable<T> source, int num)
{
    var array = new T[num];
    int cnt = 0;

    foreach (T row in source)
    {
        array[cnt] = row;
        cnt++;

        if (cnt == num)
        {
            yield return array;
            array = new T[num];
            cnt = 0;
        }
    }

    if (cnt != 0)
    {
        Array.Resize(ref array, cnt);
        yield return array;
    }
}

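You "group" the lines into packets (choose a number between 10 and 100) so that there is less inter-thread communication. The workers then have to do a foreach over the lines they receive.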
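One way the blocks could be consumed is sketched below. `ToBlock` is repeated from above so that the example is self-contained; the worker body and the use of `AddOrUpdate` on a `ConcurrentDictionary<string, int>` are my own assumptions about how the pieces fit together, not code taken from this answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BlockSketch
{
    private static readonly char[] Separators = { ' ' };

    // Same helper as above: groups the source into arrays of at most num items.
    public static IEnumerable<T[]> ToBlock<T>(IEnumerable<T> source, int num)
    {
        var array = new T[num];
        int cnt = 0;

        foreach (T row in source)
        {
            array[cnt] = row;
            cnt++;

            if (cnt == num)
            {
                yield return array;
                array = new T[num];
                cnt = 0;
            }
        }

        if (cnt != 0)
        {
            Array.Resize(ref array, cnt);
            yield return array;
        }
    }

    public static ConcurrentDictionary<string, int> Count(IEnumerable<string> lines, int blockSize)
    {
        var counts = new ConcurrentDictionary<string, int>();

        // Each work item handed to a worker is a whole block, not a single line.
        Parallel.ForEach(ToBlock(lines, blockSize), block =>
        {
            foreach (var line in block)
            {
                foreach (var word in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
                {
                    counts.AddOrUpdate(word, 1, (key, oldValue) => oldValue + 1);
                }
            }
        });

        return counts;
    }

    public static void Main()
    {
        var counts = Count(new[] { "a b", "b c", "a a", "d", "a" }, 2);
        foreach (var pair in counts)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}
```

With a file, the call would be along the lines of `Count(File.ReadLines(path), 50)`, where 50 is just one value from the suggested 10 to 100 range.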

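【Discussion】:

Nice. I will incorporate this improvement into the solution. I should have used it in the first place!
@pwee167 You do know that ++wordCount[word] works as well, right?
@YuvalItzchakov You would be back to the extra read... which would defeat the purpose!
@xanatos If you really care about it, that is a super tiny micro-optimization.
You are missing the else. It would be even better to drop the if entirely, since cnt defaults to 0 when the key is not found.

【Solution 4】:

With a 200MB text file, the following took a little over 5 seconds on my machine.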

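using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class Program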
{
    private static readonly char[] separators = { ' ' };
    private static List<string> lines;
    private static ConcurrentDictionary<string, int> freqeuncyDictionary;

    static void Main(string[] args)
    {
        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        string path = @"C:\Users\James\Desktop\New Text Document.txt";
        lines = ReadLines(path);
        ConcurrentDictionary<string, int> test = GetFrequencyFromLines(lines);

        stopwatch.Stop();
        Console.WriteLine(@"Complete after: " + stopwatch.Elapsed.TotalSeconds);
    }

    private static List<string> ReadLines(string path)
    {
        lines = new List<string>();
        using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
        {
            using (var streamReader = new StreamReader(fileStream))
            {
                string line;
                while ((line = streamReader.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
        }
        return lines;            
    }

    public static ConcurrentDictionary<string, int> GetFrequencyFromLines(List<string> lines)
    {
        freqeuncyDictionary = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(lines, line =>
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
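                // A single AddOrUpdate call adds the word with a count of 1 or increments its
                // existing count. A separate ContainsKey check followed by an indexer read and
                // write is not atomic, so parallel iterations could lose increments.
                freqeuncyDictionary.AddOrUpdate(word, 1, (key, oldValue) => oldValue + 1);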
            }
        });

        return freqeuncyDictionary;
    }
}

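【Discussion】:

I am trying to avoid reading the whole file into memory, because in other cases the files could be much larger.
OK. In your question you mentioned that you tried reading the lines in batches and then using threads, but that it ended up slower because of (you think) the locking on the dictionary. Did you try a concurrent dictionary in that scenario? Also, with the file you linked in the question this takes about 7 seconds, dropping to 6 when using TryGetValue.
I had a look at the ConcurrentDictionary type and initially thought it might be useful, but from my research today I believe the AddOrUpdate method does not guarantee protection against dirty reads/writes. That means I would need my own synchronization around incrementing the counts in the dictionary, which would not give me any performance gain. So I went back to the standard dictionary type.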
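For what it is worth, the concern above can be tested directly. As far as I know, the delegate passed to `AddOrUpdate` runs outside the dictionary's internal lock and may be invoked more than once, but the value it produces is only stored if the entry has not changed in the meantime, so a side-effect-free increment should not be lost. The sketch below (hypothetical names, a single key for simplicity) hammers one counter from many threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class AddOrUpdateSketch
{
    // Increments one key from many threads and returns the final count.
    public static int CountConcurrently(int increments)
    {
        var counts = new ConcurrentDictionary<string, int>();

        Parallel.For(0, increments, i =>
        {
            // The update delegate must be free of side effects because it can be retried.
            counts.AddOrUpdate("word", 1, (key, oldValue) => oldValue + 1);
        });

        return counts["word"];
    }

    public static void Main()
    {
        // Prints 100000 when no increment is lost.
        Console.WriteLine(CountConcurrently(100000));
    }
}
```

This says nothing about speed: contention on very frequent words can still make this slower than a plain Dictionary on one thread.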

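【Solution 5】:

If you want to count one specific word, you could use the function strtok (linked here) and compare each word with the one you are looking for. I do not think this is very expensive, but I have never tried it with a large file...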

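【Discussion】:

No, that is not what he is after. He wants to build a list of all words with their counts. Does strtok do anything besides splitting?
OK, I misunderstood the question. strtok splits and you compare afterwards, but that gets more complicated for multiple words. Sorry for the misunderstanding, and thanks for the correction.

【Solution 6】:

I would suggest setting your stream buffer sizes larger, and matching: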

    using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 8192))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, false, 8192))

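First, your code results in buffers that are too small for this kind of work. Second, since the reader's buffer is smaller than the stream's buffer, the data is first copied to the stream's buffer and then to the reader's buffer. This can be a performance destroyer for the type of work you are doing.

When the buffer sizes match, the stream's buffer will never be used; in fact, it will never be allocated.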
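Since another answer in this thread reports no measurable difference from changing buffer sizes, it is probably worth measuring on your own machine. Below is a rough harness under stated assumptions: `CountLines` is a hypothetical helper, the temporary file and its contents are made up for the example, and the sizes tried are arbitrary.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

public static class BufferSizeSketch
{
    // Reads the file line by line using the same buffer size for the stream and the reader.
    public static long CountLines(string path, int bufferSize)
    {
        long lines = 0;
        using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize))
        using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, false, bufferSize))
        {
            while (streamReader.ReadLine() != null)
            {
                lines++;
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Made-up test data; substitute the real 200MB file for a meaningful comparison.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, Enumerable.Repeat("the quick brown fox", 200000));

            foreach (int size in new[] { 4096, 8192, 65536 })
            {
                var stopwatch = Stopwatch.StartNew();
                long lines = CountLines(path, size);
                stopwatch.Stop();
                Console.WriteLine("buffer={0} lines={1} elapsed={2} ms",
                    size, lines, stopwatch.ElapsedMilliseconds);
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Repeated runs over the same file are served largely from the operating system's file cache, so the first run and later runs are not directly comparable.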

【Discussion】: