Calculating a file's mean value of data bytes: answers

【Question title】: Calculating a file's mean value of data bytes
【Posted】: 2021-05-22 21:29:17
【Question】:

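Just for fun, I am trying to calculate a file's mean value of data bytes, essentially replicating a feature available in an existing tool (ent). It is simply the sum of all bytes in the file divided by the file length. If the data is close to random, the result should be about 127.5. I am testing two ways of computing the mean: a simple for loop that works on an unordered_map, and std::accumulate applied directly to a string object.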
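For reference, the quantity described above and its expected value for uniformly distributed bytes can be written out as follows. The symbols N (file length in bytes) and b_i (the i-th byte, read as an unsigned value from 0 to 255) are introduced here for illustration only and do not appear in the original post.

```latex
\[
\bar{b} = \frac{1}{N}\sum_{i=1}^{N} b_i ,
\qquad
\mathbb{E}[b] = \frac{1}{256}\sum_{k=0}^{255} k = \frac{1}{256}\cdot\frac{255 \cdot 256}{2} = 127.5
\]
```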

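Benchmarking both methods shows that std::accumulate is much slower than the simple for loop. In addition, measured on my system, the accumulate approach is on average about 4 times faster with clang++ than with g++.

So here are my questions:

Why does the for loop method produce wrong output with g++ at around 2.5 GB of input, but not with clang++? My guess is that I am doing something wrong (probably UB) that just happens to work with clang++. (Solved; the code has been modified accordingly.)
Why is the std::accumulate method so much slower with g++ at the same optimization settings?

Thanks!

Compiler info (target is x86_64-pc-linux-gnu):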

clang version 11.1.0

gcc version 11.1.0 (GCC)

Build info:

g++ -Wall -Wextra -pedantic -O3 -DNDEBUG -std=gnu++2a main.cpp -o main-g

clang++ -Wall -Wextra -pedantic -O3 -DNDEBUG -std=gnu++20 main.cpp -o main-clang

Example file (with random data):

dd if=/dev/urandom iflag=fullblock bs=1G count=8 of=test-8g.bin (an 8 GB file of random data, as an example)

Code:

#include <chrono>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <numeric>
#include <stdexcept>
#include <string>
#include <unordered_map>

auto main(int argc, char** argv) -> int {
  using std::cout;

  std::filesystem::path file_path{};

  if (argc == 2) {
    file_path = std::filesystem::path(argv[1]);
  } else {
    return 1;
  }

  std::string input{};
  std::unordered_map<char, int> char_map{};

  std::ifstream istrm(file_path, std::ios::binary);
  if (!istrm.is_open()) {
    throw std::runtime_error("Could not open file");
  }

  const auto file_size = std::filesystem::file_size(file_path);
  input.resize(file_size);
  istrm.read(input.data(), static_cast<std::streamsize>(file_size));

  istrm.close();

  // store frequency of individual chars in unordered_map
  for (const auto& c : input) {
    if (!char_map.contains(c)) {
      char_map.insert(std::pair<char, int>(c, 1));
    } else {
      char_map[c]++;
    }
  }

  double sum_for_loop = 0.0;

  cout << "using for loop\n";
  // start stopwatch
  auto start_timer = std::chrono::steady_clock::now();

  // for loop method
  for (const auto& item : char_map) {
    sum_for_loop += static_cast<unsigned char>(item.first) * static_cast<double>(item.second);
  }

  // stop stopwatch
  cout << std::chrono::duration<double>(std::chrono::steady_clock::now() - start_timer).count() << " s\n";

  auto mean_for_loop = static_cast<double>(sum_for_loop) / static_cast<double>(input.size());

  cout << std::fixed << "sum_for_loop: " << sum_for_loop << " size: " << input.size() << '\n';
  cout << "mean value of data bytes: " << mean_for_loop << '\n';

  cout << "using accumulate()\n";
  // start stopwatch
  start_timer = std::chrono::steady_clock::now();

  // accumulate method, but is slow (much slower in g++)
  auto sum_accum =
      std::accumulate(input.begin(), input.end(), 0.0, [](auto current_val, auto each_char) { return current_val + static_cast<unsigned char>(each_char); });

  // stop stopwatch
  cout << std::chrono::duration<double>(std::chrono::steady_clock::now() - start_timer).count() << " s\n";

  auto mean_accum = sum_accum / static_cast<double>(input.size());

  cout << std::fixed << "sum_for_loop: " << sum_accum << " size: " << input.size() << '\n';
  cout << "mean value of data bytes: " << mean_accum << '\n';
}

Example output for a 2 GB file (clang++):

using for loop
2.024e-05 s
sum_for_loop: 273805913805 size: 2147483648
mean value of data bytes: 127.500814
using accumulate()
1.317576 s
sum_for_loop: 273805913805.000000 size: 2147483648
mean value of data bytes: 127.500814

Example output for a 2 GB file (g++):

using for loop
2.41e-05 s
sum_for_loop: 273805913805 size: 2147483648
mean value of data bytes: 127.500814
using accumulate()
5.269024 s
sum_for_loop: 273805913805.000000 size: 2147483648
mean value of data bytes: 127.500814

Example output for an 8 GB file (clang++):

using for loop
1.853e-05 s
sum_for_loop: 1095220441576 size: 8589934592
mean value of data bytes: 127.500440
using accumulate()
5.247585 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440

Example output for an 8 GB file (g++):

using for loop
7.5e-07 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440
using accumulate()
21.484348 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440

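【Comments on the question】:

Why are you using std::unordered_map? What is wrong with a std::array<std::size_t, 256>, initialized to 0, as your byte histogram?
Wouldn't it be simpler to just read the file and accumulate all bytes in an unsigned long long? As long as that does not overflow, you can divide by the file size at the end.
I am using the unordered_map to calculate the Shannon entropy, which is not part of this code snippet and works fine. I figured I would just reuse an existing variable to calculate the mean.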
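The suggestion in the comments to accumulate the bytes in an unsigned integer could look roughly like the sketch below. The names byte_sum and byte_mean are hypothetical, and the code assumes the input is smaller than about 2^64 / 255 bytes (roughly 7.2e16), so the 64-bit sum cannot wrap. An integer accumulator also lets the compiler reorder the additions, which it generally may not do for a double accumulator without flags such as -ffast-math; whether that narrows the g++/clang++ gap on a given machine has to be measured and is not claimed here.

```cpp
#include <cstdint>
#include <numeric>
#include <string_view>

// Sum all bytes of `data` in a 64-bit unsigned accumulator.
// Assumption: data.size() < 2^64 / 255, so the sum cannot wrap around.
inline std::uint64_t byte_sum(std::string_view data) {
  return std::accumulate(data.begin(), data.end(), std::uint64_t{0},
                         [](std::uint64_t acc, char c) {
                           // Cast first so bytes >= 0x80 are not sign-extended.
                           return acc + static_cast<unsigned char>(c);
                         });
}

// Mean byte value; returns 0.0 for empty input to avoid dividing by zero.
inline double byte_mean(std::string_view data) {
  if (data.empty()) {
    return 0.0;
  }
  return static_cast<double>(byte_sum(data)) / static_cast<double>(data.size());
}
```

In the question's program this would be called as byte_mean(input), since a std::string converts implicitly to std::string_view.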

Tags: c++ c++20 unordered-map accumulate

【Solution 1】:

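There are a number of problems with the code. The first, and the one causing the problem shown, is that sum_for_loop should be a double, not an unsigned long. The sum overflows what can be stored in an unsigned long, which gives you incorrect results when that happens.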

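The timer should be started after the cout; otherwise you include the output time in the computation time. Also, the elapsed time reported for the "for loop" does not include the time taken to build char_map.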
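One way to keep output out of the measured interval is to wrap only the computation in a small timing helper and print afterwards. The helper name timed is hypothetical and this is a minimal sketch, assuming the timed callable returns a value (it will not compile for a callable returning void).

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` and returns {result, elapsed seconds}.
// Only the call to `work` sits between the two clock reads, so any
// printing done by the caller is excluded from the measurement.
template <typename F>
auto timed(F&& work) {
  const auto start = std::chrono::steady_clock::now();
  auto result = std::forward<F>(work)();
  const auto stop = std::chrono::steady_clock::now();
  return std::pair{std::move(result),
                   std::chrono::duration<double>(stop - start).count()};
}
```

To compare the two methods on equal terms, the lambda passed for the "for loop" variant would contain both the construction of char_map and the summation over it, for example auto [sum, seconds] = timed([&] { /* build map, then sum */ return sum_value; });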

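You do not need the if when building char_map. If an entry is not found in the map, operator[] inserts it value-initialized to zero, so char_map[c]++ alone is enough. A better approach (since there are only 256 unique values) is to use an indexed vector (remember to cast the char to unsigned char).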
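A fixed-size table indexed by byte value might look like the sketch below. It uses std::array<std::size_t, 256>, as proposed in the question comments, rather than the vector mentioned above; the names ByteHistogram, count_bytes, and mean_from_histogram are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

using ByteHistogram = std::array<std::size_t, 256>;

// Count how often each byte value occurs in `data`.
inline ByteHistogram count_bytes(std::string_view data) {
  ByteHistogram counts{};  // value-initialized: every bucket starts at 0
  for (const char c : data) {
    // Cast to unsigned char so the index is always in 0..255.
    ++counts[static_cast<unsigned char>(c)];
  }
  return counts;
}

// Mean byte value computed from the histogram; 0.0 for an empty histogram.
inline double mean_from_histogram(const ByteHistogram& counts) {
  double weighted_sum = 0.0;
  std::size_t total = 0;
  for (std::size_t value = 0; value < counts.size(); ++value) {
    weighted_sum += static_cast<double>(value) * static_cast<double>(counts[value]);
    total += counts[value];
  }
  return total == 0 ? 0.0 : weighted_sum / static_cast<double>(total);
}
```

The same table can feed other per-byte statistics, since it holds the frequencies that the question's unordered_map stores.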

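【Comments】:

@Simog The computation static_cast<unsigned char>(item.first) * item.second can also overflow.
Indeed, I can fix it with sum_for_loop += static_cast<unsigned char>(item.first) * static_cast<double>(item.second);, so I guess the only remaining question is why std::accumulate is so slow with g++.
I just ran some benchmarks without the if statement, and it turned out to be much slower than with the if statement. That said, the char_map part is not what I am concerned about right now. I need to stick with unordered_map for other calculations I have to do, and that part is not where my performance problem lies (that is related to std::accumulate).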
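The overflow raised in the first comment comes from integer promotion: the unsigned char operand is promoted to int, so the product with an int count is evaluated in int, and signed overflow is undefined behavior. As rough arithmetic, assuming a 32-bit int and uniformly random data, a 2 GiB file gives about 2^31 / 256 = 8,388,608 occurrences per byte value, and 255 * 8,388,608 = 2,139,095,040 still fits below INT_MAX (2,147,483,647), while a 2.5 GB file gives about 10 million occurrences and a product near 2.5e9, which does not. The sketch below widens before multiplying; the helper names are hypothetical and count is assumed to be non-negative.

```cpp
#include <cstdint>
#include <limits>

// Weighted contribution of one byte value, widened to 64 bits before the
// multiplication so the product is never evaluated in int.
// Assumption: count >= 0.
inline std::uint64_t weighted(char byte_value, int count) {
  return static_cast<std::uint64_t>(static_cast<unsigned char>(byte_value)) *
         static_cast<std::uint64_t>(count);
}

// True if the same product would not fit in int, i.e. the unwidened
// expression static_cast<unsigned char>(byte_value) * count would overflow.
inline bool int_product_overflows(char byte_value, int count) {
  return weighted(byte_value, count) >
         static_cast<std::uint64_t>(std::numeric_limits<int>::max());
}
```

Casting one operand to double, as in the fix quoted in the comments, avoids the int multiplication in the same way.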