Sort .csv in multidimensional arrays

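【Question title】: Sort .csv in multidimensional arrays
【Posted】: 2019-10-22 01:05:13
【Question】:

I am trying to read specific values (i.e. the value at coordinates X, Y) from a .csv file, and I am struggling to find the proper way to define multidimensional arrays within that .csv.

Here is an example of the layout of my .csv file: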

NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
1.23,NaN,2.34,3.45,NaN

NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
1.23,NaN,2.34,3.45,NaN

NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
NaN,NaN,1.23,2.34,9.99
1.23,NaN,2.34,3.45,NaN
1.23,NaN,2.34,3.45,NaN

...

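OK, in reality this file becomes very large. You can interpret rows = latitudes and columns = longitudes, so each block is an hourly measured coordinate map. The blocks usually have a size of row[361] column[720], and the time period can cover up to 20 years (= 24*365*20 blocks), just to give you an idea of the data size.

To structure this, I would like to scan through the .csv and define each block as a vector t, which I can access by choosing the desired time step t = 0, 1, 2, 3, ...

Then, within this block, I would like to go to a specific line (i.e. latitude) and define it as a vector longitudeArray.

The outcome should be the specified value at coordinates X, Y at time Z.

As you might guess, my coding experience is rather limited, which is why my actual question may be very simple: how can I arrange my vectors so that I am able to call any random value?

This is my code so far (sadly it is not much, because I don't know how to continue...):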

#include <fstream>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>
#include <algorithm>


using namespace std;


int main()
{   
  int longitude, latitude;                //Coordinates used to specify desired value
  int t;                                  //Each array is associated to a specific time t=0,1,2,3... (corresponds to hourly measured data)
  string value;                           

  vector<string> t;                       //Vector of each block
  vector<string> longitudeArray;          //Line of array, i.e. latitude    

  ifstream file("swh.csv");               //Open file
  if (!file.is_open())                    //Check if file is opened, if not print "File could..."
  {
     cout << "File could not open..." << endl;
     return 1;
  }

  while (getline(file, latitude, latitude.empty()))   //Scan .csv (vertically) and delimit every time a white line occurs
  {
     longitudeArray.clear();
     stringstream ss(latitude);

     while(getline(ss,value,',')         //Breaks line into comma delimited fields //Specify line number (i.e. int latitude) here??
     {
        latitudeArray.push_back(value); //Adds each field to the 1D array //Horizontal vector, i.e. latitude
     }
     t.push_back(/*BLOCK*/)              //Adds each block to a distinct vector t
  }
  cout << t(longitudeArray[5])[6] << endl;    //Output:   5th element of longitudeArray in my 6th block

  return 0;

}

If you have any hints, especially if there is a better way to handle large .csv files, I would be grateful.

P.S.: C++ is unavoidable for this project...

Tüdelüü, jtotheakob

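【Question comments】:

If you want to use some STL, I would lean towards a map keyed by time step. You then get random access rather than linear access in a vector. You could then have a 2D array at each position in the map.
Do you really want to use strings instead of numeric values? You say your file contains 45'537'984'000 values, which would be 169 GB of data even for floats, and much more for strings (about 8 times as much with gcc if SSO kicks in for every string).
@Yksisarvinen: Here each value would use 4 bytes if stored as a char array, whereas a double commonly uses 8 bytes, so size is a poor argument. But I agree with you that the data should be converted to double, and NaN to a double nan.
@SergeBallesta True, I was thinking of std::string in the code, but char[4] would be more memory efficient.
@Yksisarvinen: Old C programmers think in old C style ;-)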
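
The size estimate in the comments above can be reproduced with a few lines of arithmetic. This is only a sketch using the dimensions given in the question (361 x 720 values per block, 24*365*20 blocks); the function names totalValues and sizeInGiB are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Total number of values: one 361 x 720 block per hour for 20 years
// (figures taken from the question).
std::uint64_t totalValues() {
    return std::uint64_t{24} * 365 * 20 * 361 * 720;
}

// Storage needed, in GiB (2^30 bytes), for a given element size in bytes.
double sizeInGiB(std::uint64_t bytesPerValue) {
    assert(bytesPerValue > 0);
    const double gib = 1024.0 * 1024.0 * 1024.0;
    return static_cast<double>(totalValues() * bytesPerValue) / gib;
}
```

With 4-byte floats, sizeInGiB(4) comes out at a little under 170, which matches the figure quoted in the comments; with 8-byte doubles, sizeInGiB(8) is about twice that. Either way the full data set is unlikely to fit in memory on an ordinary machine.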

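Tags: c++ multidimensional-array vector read.csv

【Solution 1】:

As usual, you should first think in terms of data and data usage. Here you have floating point values (which can be NaN) that should be accessible as a 3D thing along latitude, longitude and time.

If you can accept simple (integer) indexes, the standard ways in C++ are raw arrays, std::array and std::vector. The rule of thumb then says: if the sizes are known at compile time, arrays (or std::array, if you want operations on whole arrays) are fine; otherwise go with vectors. And if unsure, std::vector is your workhorse.

So you will probably end up with a std::vector<std::vector<std::vector<double>>> data, which you would use as data[timeindex][latindex][longindex]. If everything is static, you could use a double data[NTIMES][NLATS][NLONGS], which you would access in more or less the same way. Beware that if the array is large, most compilers will choke if you declare it inside a function (including main), but it can be a global inside one compilation unit (C-ish, but still valid in C++).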
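
To make the nested-vector layout concrete, here is a small sketch. The alias Grid3D and the helper makeGrid are hypothetical names chosen for illustration, and presetting every cell to NaN is just one possible convention for "no measurement yet".

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical alias for the 3D container: data[timeindex][latindex][longindex].
using Grid3D = std::vector<std::vector<std::vector<double>>>;

// Build an ntimes x nlats x nlongs grid with every cell preset to NaN.
Grid3D makeGrid(std::size_t ntimes, std::size_t nlats, std::size_t nlongs) {
    assert(nlats > 0 && nlongs > 0);
    const double nan = std::numeric_limits<double>::quiet_NaN();
    return Grid3D(ntimes,
                  std::vector<std::vector<double>>(nlats,
                      std::vector<double>(nlongs, nan)));
}
```

After Grid3D data = makeGrid(2, 3, 4), an expression such as data[1][2][3] reads or writes the value at time step 1, latitude row 2, longitude column 3. Each inner vector is a separate allocation, which is simple to use but adds some overhead for very large grids.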

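So read the file line by line, feeding the values into your container. If you use statically defined arrays, just assign each new value at its position; if you use vectors, you can add new elements dynamically with push_back.

This is too far from your current code for me to show you more than trivial code.

The static (C-ish) version could contain: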

#include <cctype>
#include <iostream>

#define NTIMES (24*365*20)
#define NLATS 361
#define NLONGS 720

double data[NTIMES][NLATS][NLONGS];   // global, not a local: far too large for the stack
...
int time, lat, lon;                   // "long" is a keyword, hence "lon"
bool ok = true;                       // becomes false on the first failed read
for (time = 0; ok && time < NTIMES; time++) {
    for (lat = 0; ok && lat < NLATS; lat++) {
        for (lon = 0; ok && lon < NLONGS; lon++) {
            ok = static_cast<bool>(std::cin >> data[time][lat][lon]);
            // skip the separators (commas, whitespace) after the value
            while (ok && std::cin.good()) {
                int c = std::cin.peek();
                if (std::isspace(c) || (c == ',')) std::cin.get();
                else break;
            }
        }
    }
}
if (!ok) {
    //Not enough values or read error
    ...
}

A more dynamic version using vectors could be:

#include <cctype>
#include <iostream>
#include <vector>

using std::vector;

int ntimes = 0;
const int nlats=361;      // need not be compile-time constants
const int nlongs=720;     // ditto

vector<vector<vector<double>>> data;

int lat, lon;             // "long" is a keyword, hence "lon"
bool ok = true;           // becomes false on the first failed read

while (ok) {
    data.push_back(vector<vector<double>>());
    for(lat=0; ok && lat<nlats; lat++) {
        data[ntimes].push_back(vector<double>(nlongs));
        for(lon=0; ok && lon<nlongs; lon++) {
            ok = static_cast<bool>(std::cin >> data[ntimes][lat][lon]);
            // skip the separators (commas, whitespace) after the value
            while (ok && std::cin.good()) {
                int c = std::cin.peek();
                if (std::isspace(c) || (c == ',')) std::cin.get();
                else break;
            }
        }
    }
    if (!ok) {
        //End of input or read error: discard the incomplete block
        data.pop_back();
        break;
    }
    ntimes += 1;
}

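Be careful with NaN: formatted extraction with operator>> is not required to accept the text NaN, and with some standard libraries (GCC's libstdc++, for instance) it fails and sets failbit, so NaN fields may need explicit handling, e.g. with std::strtod. The code also does not check the number of fields per line. For that, read each line with std::getline and parse it with a std::istringstream.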
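
A line-based reader along those lines could look like the sketch below. It is only an illustration under a few assumptions: blocks are separated by blank lines as in the sample file, each field is converted with std::strtod (which accepts NaN), and the names parseLine and readBlocks are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Parse one comma-separated line into doubles.
// Returns false if a field does not start with a number or the field count is not nlongs.
bool parseLine(const std::string& line, std::size_t nlongs, std::vector<double>& out) {
    out.clear();
    std::istringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) {
        char* end = nullptr;
        double v = std::strtod(field.c_str(), &end);  // strtod understands "NaN"
        if (end == field.c_str()) return false;       // nothing could be converted
        out.push_back(v);
    }
    return out.size() == nlongs;
}

// Read blocks separated by blank lines; every block must have exactly nlats lines.
// Returns false on the first malformed line or incomplete block.
bool readBlocks(std::istream& in, std::size_t nlats, std::size_t nlongs,
                std::vector<std::vector<std::vector<double>>>& data) {
    assert(nlats > 0 && nlongs > 0);
    std::vector<std::vector<double>> block;
    std::vector<double> row;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find_first_not_of(" \t\r") == std::string::npos) {  // blank line
            if (!block.empty()) {                                    // end of a block
                if (block.size() != nlats) return false;
                data.push_back(block);
                block.clear();
            }
            continue;
        }
        if (!parseLine(line, nlongs, row)) return false;
        block.push_back(row);
    }
    if (!block.empty()) {                     // last block without a trailing blank line
        if (block.size() != nlats) return false;
        data.push_back(block);
    }
    return true;
}
```

Called as readBlocks(file, 361, 720, data) on an open std::ifstream, this fills data[t][lat][lon] block by block. Note that parseLine only checks that each field starts with a number; stricter validation of trailing characters is left out to keep the sketch short.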

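【Comments】:

【Solution 2】:

Thanks, I tried to transfer both versions into my code, but I could not get it to run. I guess my poor coding skills keep me from seeing what is obvious to everyone else. Could you name the additional libraries I might need? For std::isspace I do need #include <cctype>; is anything else missing that is not mentioned in my code above?

Could you also explain how if (std::isspace(c) || (c == ',')) std::cin.get(); works? As I understand it, it checks whether c (which is the input field?) is whitespace, and if so, the right-hand term automatically becomes "true" because of the ||? What follows from that?

Finally, is if (! std::cin) break used to stop the loop once the specified array[time][lat][long] has been reached?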
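
The separator-skipping loop asked about here can be tried in isolation on a string stream. The sketch below is only an illustration: skipSeparators and readValues are hypothetical helper names, and it tests in.good() before peeking so that reaching the end of the input is not mistaken for a read error. peek() looks at the next character without consuming it; if that character is whitespace or a comma, get() removes it, otherwise the loop stops with the next value still in the stream.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Consume any run of whitespace and commas that follows a value.
void skipSeparators(std::istream& in) {
    while (in.good()) {
        int c = in.peek();                              // look, but do not remove
        if (c == std::istream::traits_type::eof()) break;
        if (std::isspace(c) || c == ',') in.get();      // separator: remove it, look again
        else break;                                     // start of the next value: keep it
    }
}

// Read up to maxCount doubles; stops early when an extraction fails
// (for example at the end of the input).
std::vector<double> readValues(std::istream& in, std::size_t maxCount) {
    assert(maxCount > 0);
    std::vector<double> values;
    double v;
    while (values.size() < maxCount && in >> v) {
        values.push_back(v);
        skipSeparators(in);
    }
    return values;
}
```

For an std::istringstream holding "1.23, 2.34,\n3.45", readValues(in, 10) returns the three numbers. The stream's conversion to bool, as in if (! std::cin), only reports whether an earlier read failed; it knows nothing about array indexes, so in the answer's code it serves to leave the loops early when the input runs out or is malformed.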

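Anyway, thank you for your response. I really appreciate it, and I now have an idea of how to define my loops.

【Comments】:

【Solution 3】:

Thank you all again for your ideas. Unfortunately, I was not able to run the script... but my task has changed slightly, so reading very large arrays is no longer required.

However, I now have an idea of how to structure such operations and will most likely carry it over to my new task.

You may close this topic now ;)

Cheers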

jtothekaob

【Comments】: