Reading multiple large CSV files into an array in parallel: answers

【Question title】: read multiple large csv files into array in parallel
【Posted】: 2017-12-11 15:47:30
【Question description】:

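With the help of some tutorials and some people on Stack Overflow, I managed to piece together this basic parallel (multi-file) CSV -> array reader. Can I make it any faster? I have read here and there about possibly preloading the files into memory, optimizing the threading somehow, or maybe doing some parts in CUDA (I have a little experience with that), but I don't know what the next step should be. Any suggestions for speeding it up?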

// parallel-matrix-multiply.cpp
// compile with: /EHsc
#include <windows.h>
#include <ppl.h>
#include <iostream>
#include <random>

using namespace concurrency;    
using namespace std;

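#include <fstream>
#include <sstream>
#include <string>  // std::to_string
#include <ctime>   // clock, CLOCKS_PER_SEC
#include <cstdio>  // printf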

int main()
{
    int numRows = 360;
    int numCols = 4096;

    int** data = new int*[numRows * 120];
    for (int i = 0; i < numRows * 120; i++) {
        data[i] = new int[numCols];
    }

    clock_t starttimetotal = clock();
    char comma; // Just a place holder to store the commas
    char newLine; // Just a place holder to store the newlines

    int m = 120; //120 files of same format

    Concurrency::parallel_for(0, m,
        [&numCols, &numRows, &comma, &newLine, &data](int i) {

        std::ifstream in("C:/codeoutput/output_" + std::to_string(i + 1) + ".txt");

        for (int row = 0; row < numRows; row++) {
            for (int col = 0; col < numCols; col++)
            {
                // Grab Data for the cell in (row,col)
                in >> data[i * 360 + row][col];
                // If this is not the last column grab the comma between the values
                if (col < numCols - 1) {
                    in >> comma;
                }
            }
            in >> newLine; // Grab the remaining newLine character
        }
        in.close();
    });

    clock_t stoptotal = clock();
    double elapsed = (double)(stoptotal - starttimetotal) * 1000.0 / CLOCKS_PER_SEC;
    printf("Time elapsed in ms: %f\n", elapsed);

    return 0;
}

【Question discussion】:

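This might be a better fit for Code Review. One simple thing you can do is switch from text files to binary files. You then save the text-parsing time and get rid of the if statement inside the loop, which should help performance.
You could also switch from int** (2D, area A) to int* (1D, length A) and do the row/column -> index calculation by hand. That reduces the number of main-memory lookups you need (and may let the CPU keep more of the data in cache).
A few things: you have a data race on comma and newLine. You could also move the allocation loop inside the parallel section. You could also use the underlying platform-specific I/O functions to pull the whole file into memory and then walk through that memory to extract the data you need.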
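
The first two suggestions above (a flat 1D array and binary files) can be sketched roughly as follows. This is only a minimal illustration, not the asker's code: `Matrix`, `save_binary` and `load_binary` are hypothetical names, and the file format (raw `int` values with no header) is an assumption that only works if whatever produces the files can be changed to write binary, and if writer and reader share the same `int` size and byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper type: row-major flat storage, one allocation
// instead of one allocation per row as with int**.
struct Matrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<int> cells; // rows * cols values, zero-initialized

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), cells(r * c) {}

    // Manual (row, col) -> index calculation.
    int& at(std::size_t r, std::size_t c) {
        assert(r < rows && c < cols);
        return cells[r * cols + c];
    }
};

// Write the raw ints in one call. Assumed format: no header, native
// int size and byte order, so it is not portable between machines.
bool save_binary(const Matrix& m, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(m.cells.data()),
              static_cast<std::streamsize>(m.cells.size() * sizeof(int)));
    out.flush();
    return static_cast<bool>(out);
}

// Read the raw ints back in one call: no text parsing and no
// per-cell branch. The matrix must already have the expected size.
bool load_binary(Matrix& m, const std::string& path) {
    const std::streamsize bytes =
        static_cast<std::streamsize>(m.cells.size() * sizeof(int));
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(m.cells.data()), bytes);
    return in.gcount() == bytes;
}
```

With this layout, file `i` of the 120 files would occupy rows `i * numRows` through `(i + 1) * numRows - 1` of one large `Matrix`, so each parallel task could read straight into its own slice without touching any shared variable.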

Tags: c++ arrays csv parallel-processing

【Solution 1】:

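Your inner loop is computing something that can be hoisted into the outer loop. That said, the bottleneck is most likely reading the files, so I'm not sure this will make a big difference.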

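    // Per-task locals: sharing one comma/newLine across parallel iterations is a data race
    char comma;
    char newLine;

    for (int row = 0; row < numRows; row++) {
        int rowIdx = i * numRows + row; // computed once per row instead of once per cell
        for (int col = 0; col < numCols - 1; col++)
        {
            // Grab data for the cell in (row, col) and the comma that follows it
            in >> data[rowIdx][col];
            in >> comma;
        }

        // Get last column + new line
        in >> data[rowIdx][numCols - 1];
        // Caution: operator>> on a char skips whitespace first, so this reads the
        // next non-whitespace character rather than the '\n' itself
        in >> newLine;
    }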

【Discussion】:

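I'll give it a try. col is undefined after the second loop. All 120 files (108 rows each, not 360) add up to about 200 MB, and I'm reading from an SSD. The whole process takes about 75 seconds on my PC; is reading from disk really the bottleneck?
My bad. I made a change that should fix it. Reading from disk, even with an SSD, will definitely be a bottleneck. You can mitigate that (with some significant code changes) by reading the whole file into memory first.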
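
Reading the whole file into memory first, as suggested in the last comment, might look roughly like the sketch below. It is a hedged illustration rather than a tested replacement for the code in the question: `slurp_file` and `parse_ints` are hypothetical names, it uses portable standard C++ instead of platform-specific I/O, and it assumes the files contain only integers separated by commas and newlines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file into one string with a single
// bulk read. Returns an empty string if the file cannot be opened.
std::string slurp_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamoff size = in.tellg(); // opened at end, so this is the size
    if (size <= 0) return {};
    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.seekg(0);
    in.read(&buffer[0], static_cast<std::streamsize>(size));
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}

// Hypothetical helper: parse integers from the in-memory text into `out`,
// writing at most `max_values` values. Returns how many were parsed.
// Only locals are used, so separate tasks can call it concurrently as
// long as each one writes to its own region of memory.
std::size_t parse_ints(const std::string& text, int* out, std::size_t max_values) {
    assert(out != nullptr || max_values == 0);
    const char* p = text.c_str(); // null-terminated, so strtol cannot overrun
    const char* end = p + text.size();
    std::size_t count = 0;
    while (p < end && count < max_values) {
        char* next = nullptr;
        const long value = std::strtol(p, &next, 10); // skips leading whitespace
        if (next == p) {
            ++p; // no number here (comma or other separator): skip one character
            continue;
        }
        out[count++] = static_cast<int>(value);
        p = next;
    }
    return count;
}
```

In the question's setup, each parallel task could call `slurp_file` on its own file and then `parse_ints` with a pointer to the start of that file's rows and `numRows * numCols` as the limit. Whether this helps depends on where the time actually goes, so it is worth timing the read and the parse separately before and after the change.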