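MATLAB data parse optimisation: answers

【Question title】: MATLAB data parse optimisation
【Posted】: 2016-04-14 16:00:50
【Question】:

I've been reading in a relatively large text file that contains columns of numbers interspersed with other text, but I only want the columns of numbers. There is also a lot of other text not shown here, and it does not appear at regular intervals.

File format: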

*** LOTS OF OTHER TEXT AND NUMBERS ***

  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter
   111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14
   112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13
   113  3.1371e-08  4.6175e-10  5.0506e-10  1.2020e-15  1.3419e-13  0.0000e+00  0:00:01   12
   114  3.0016e-08  4.4331e-10  4.7391e-10  1.0388e-15  1.1447e-13  0.0000e+00  0:00:01   11
   115  2.8702e-08  4.2111e-10  4.4778e-10  8.9904e-16  9.7680e-14  0.0000e+00  0:00:01   10
   116  2.7476e-08  4.1484e-10  4.2711e-10  7.7955e-16  8.3342e-14  0.0000e+00  0:00:01    9
   117  2.6436e-08  3.9556e-10  4.0601e-10  6.7890e-16  7.1113e-14  0.0000e+00  0:00:01    8
   118  2.5374e-08  3.8633e-10  3.8826e-10  5.9234e-16  6.0674e-14  0.0000e+00  0:00:00    7
   119  2.4292e-08  3.7473e-10  3.7584e-10  5.1814e-16  5.1786e-14  0.0000e+00  0:00:00    6
   120  2.3474e-08  3.5952e-10  3.5622e-10  4.5405e-16  4.4207e-14  0.0000e+00  0:00:00    5
   121  2.2612e-08  3.4485e-10  3.4159e-10  3.9910e-16  3.7707e-14  0.0000e+00  0:00:00    4
  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter
   122  2.1992e-08  3.4100e-10  3.2964e-10  3.5272e-16  3.2204e-14  0.0000e+00  0:00:00    3
   123  2.1592e-08  3.2444e-10  3.0170e-10  3.1487e-16  2.7500e-14  0.0000e+00  0:00:00    2
   124  2.1053e-08  3.3145e-10  2.9325e-10  2.8009e-16  2.3485e-14  0.0000e+00  0:00:00    1
   125  2.0390e-08  3.1502e-10  2.7534e-10  2.5433e-16  2.0053e-14  0.0000e+00  0:00:00    0
  step  flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
     1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10
Flow time = 5e-07s, time step = 1
799 more time steps

Updating solution at time levels N and N-1.
 done.


Writing data to output file.
Current time=0.000000  Position=-0.00000036409265555078  Velocity=0.000015  Net force=0.210322
Fluid force=-0.477050N, Stator force=0.200000N ,Spring force=-32.990534N ,Top force=0.000000N, Bottom force=33.007906N, External force=0.470000N

Next time=0.000001  Position=-0.00000036400170391852  Velocity=0.000182
Applying motion to dynamic zone.

*** CONTINUING TEXT AND NUMBERS ***

The lines I want are:

111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14
112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13

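My script works so far, but it takes about 80 s to get through the whole file.

I expect the colons in the time column of some of my files will cause me further trouble. Some files will have more or fewer columns containing different kinds of data, and some will have an extra set of lines at the end of the main block, for example: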

  step  flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
     1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10

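I don't want to pick up this data, but it can have a very similar (sometimes identical) format to the lines I do want.

The script's main job is to read each line and check whether the first few characters (based on the length of the iteration number) match what I expect (starting from 1, 2, 3, ..., n). I do it this way to drop the lines under "step ..." that I don't want. However, the file is about 180,000 lines long (and that is my shortest one), so you can imagine this is a bit slow.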

% read the raw data from the file
file = 'file.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fid = fclose(fid);
raw = raw{1,1};

% expression used for splitting the columns up
colExpr = '[\d\.e:\-\+]+';

% beginning number
iterNum = 1;

% loop through lines
for line = 1:length(raw);

    % convert to string for comparison
    iterStr = num2str(iterNum);
    thisLine = raw{line, 1};

    % if the right length and the right string,
    if length(iterStr) <= length(thisLine) && ...
            strcmp(thisLine(1:length(iterStr)), iterStr)

        % split the string
        result(iterNum,:) = regexp(thisLine,colExpr, 'match');

        iterNum = iterNum + 1;

    end

end

% convert to matrix
residuals = cellfun(@str2num, result);

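Using the profiler, I found that num2str() is the slowest part (20 s), followed by int2str() (10 s), but I can't see a way to read the data without calling it inside the loop.

Am I missing something that would help optimise this process?

Edit:

I've included a few more of the lines I don't want, plus one possible alternative format, to help with the answers.

【Comments】:

Can you show us what some of the "other text and numbers" look like? Is the number of text lines to ignore at least consistent?
None of it is particularly consistent, which is why I chose to look for what I want rather than ignore what I don't, so I'm not sure that would help?
You could also preprocess the file before loading it into MATLAB, removing the text lines and keeping only the numeric ones (something like grep, sed or awk can do this easily). You would then import the file into MATLAB very quickly with a single line: load -ascii file.txt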

Tags: matlab file-io

【Answer 1】:

Here is a different approach: first process the file externally, for example:

# only keep lines starting with a digit
$ grep '^\s*[0-9]' file.txt > file2.txt

On Windows, you can use findstr as the equivalent of grep:

C:\> findstr /R /c:"^[ \t]*[0-9]" file.txt > file2.txt

Now in MATLAB it is easy to load the resulting numeric data as a matrix:

>> load -ascii file2.txt
>> t = array2table(file2, 'VariableNames',...
    {'iter','continuity','xvelocity','yvelocity','k','epsilon','vf_vapour_ph'})
t = 
    iter    continuity    xvelocity     yvelocity        k          epsilon      vf_vapour_ph
    ____    __________    __________    _________    __________    __________    ____________
     1             0      6.2376e-07            0     0.0018988        2708.2    0           
     2             0         0.21656      0.23499     0.0097531       0.13395    0           
     3             0         0.11755      0.12824     0.0032109        0.1146    0           
     4             0        0.068112     0.072691    0.00089801      0.062219    0           
     5             0        0.043498     0.045244    0.00020248      0.025923    0           
     6        0.1938        0.029107     0.029029    4.8399e-05     0.0099171    0           
     7       0.13594        0.020037     0.019577    1.5502e-05     0.0043624    0           
     8      0.097518        0.013805     0.013249    5.1736e-06     0.0023341    0           
     9      0.070467       0.0098312    0.0091925    1.8272e-06     0.0012615    0           
    10      0.051538       0.0071181    0.0064673    7.2446e-07     0.0007012    0           
    11      0.038065       0.0052115    0.0046128    4.2786e-07    0.00040619    0           
    12      0.028369       0.0038465    0.0033381    2.8256e-07    0.00025864    0           
    13      0.021326        0.002857    0.0024454    1.9279e-07    0.00016126    0
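The table above has no time/iter column. If the filtered lines do carry a field such as 0:00:01, a plain numeric load may not cope with the colons. A minimal sketch of one way around that, assuming every kept line has the nine-field layout from the question (the sample text and the variable names txt, C and data are made up for illustration): textscan can skip the time field with %*s.

```matlab
% Two sample lines in the nine-field layout from the question
% (iter, six residuals, a time field containing colons, a trailing count)
txt = sprintf('%s\n', ...
    '111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14', ...
    '112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13');

% Seven numeric fields, one skipped text field (%*s), one numeric field
C = textscan(txt, '%f %f %f %f %f %f %f %*s %f');

% Each cell holds one numeric column; concatenate them into a matrix
data = [C{:}];
```

The same format string can be passed to textscan with a file identifier from fopen instead of a character vector. Lines with a different number of fields (such as the "step" rows) would misalign the columns, so they need to be filtered out first.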

【Comments】:

【Answer 2】:

Since you have already loaded the whole file into a cell array (raw), you can call regexp on it directly to remove the bad lines.

%// Find lines that contain your data
matches = regexp(raw, '^\s*\d(.*?\de[+\-]\d){6}');

%// Empty matches (header lines) should be removed
toremove = cellfun(@isempty, matches);
raw = raw(~toremove);

You can then use str2num combined with strjoin to convert the result into a numeric array.

data = reshape(str2num(strjoin(raw)), 7, []).';

The advantage of this answer is that it avoids any kind of loop or repeated function calls, which are known to slow MATLAB down.
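As a small illustration of how the filter behaves, here is a sketch on a hand-made cell array in the seven-column layout (the sample lines and the names keep and kept are assumptions for the demo). It swaps str2num for sscanf, which parses the numbers without evaluating the text as MATLAB code.

```matlab
% Hand-made sample: one header, two data rows, two unrelated text lines
raw = { ...
    'iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph'; ...
    '11  3.8065e-02  5.2115e-03  4.6128e-03  4.2786e-07  4.0619e-04  0.0000e+00'; ...
    '12  2.8369e-02  3.8465e-03  3.3381e-03  2.8256e-07  2.5864e-04  0.0000e+00'; ...
    'Flow time = 5e-07s, time step = 1'; ...
    '799 more time steps'};

% A line matches only if it starts with a digit and then contains six
% exponent-style numbers; non-matching lines give an empty result
matches = regexp(raw, '^\s*\d(.*?\de[+\-]\d){6}');
keep = ~cellfun(@isempty, matches);
kept = raw(keep);

% Join the kept lines into one string and parse every number in one call,
% then fold the flat vector back into rows of seven values
data = reshape(sscanf(strjoin(kept.'), '%f'), 7, []).';
```

Only the two data rows survive the filter here, so data is a 2-by-7 matrix. The line "799 more time steps" starts with a digit but is rejected because it has no exponent-style numbers.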

Update

An alternative version of @Pursuit's answer would look something like this:

numbers = cellfun(@(x)sscanf(x, '%f %f %f %f %f %f %f').', raw, 'uni', 0);
numbers = cat(1, numbers{:});

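【Comments】:

I think this is similar to what I originally planned, although there are other lines in the text file that start with a number, have different lengths, and are not relevant. Could those cause problems?
@LADransfield There is no loop, so it ought to be faster than your initial solution. Take a look at the latest regular expression; it should only detect the relevant lines.
@LADransfield It would also help us give better answers if you showed some of those lines. Maybe upload the file to an external site so we can run some benchmarks?
I've updated the question to include some of the other possible number formats and a few of the extra lines that recur after the main block of numbers. The main block is not always a set length or a set number of columns, although the iteration numbers always increase by one. The problem arises when the lines after the main block contain other information in a very similar format, which gets picked up by the approach above.

【Answer 3】:

I would try running sscanf on each line, and only use the lines that parse successfully.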

Note that if:

raw{11} = '11  3.8065e-02  5.2115e-03  4.6128e-03  4.2786e-07  4.0619e-04  0.0000e+00'
raw{12} = 'iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph'

then

>> sscanf(raw{11},'%f')
ans =
                        11
                  0.038065
                 0.0052115
                 0.0046128
                4.2786e-07
                0.00040619
                         0

and also:

>> sscanf(raw{12},'%f')
ans =
     []

To complete the idea, your code would look like this:

%% Read the file
file = 'dataFile.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fid = fclose(fid);
raw = raw{1,1};

%% Parse the file into the "residuals" variable

nextLine = 1; %This is the index of next line to insert

%Go through each line, one at a time
for ix = 1:length(raw)    
    %Parse the line with sscanf
    numbers = sscanf(raw{ix},'%f');

    if ~isempty(numbers)  %Skip any row that did not parse, otherwise ...
        %If you know the number of columns, you could replace "~isempty()" with "length()== "

        if nextLine == 1
            %If this is the first line of numbers, then initialize the
            %"residuals" variable.
            residuals= zeros(length(raw), length(numbers));
        end

        %Store the data, and increment "nextLine"
        residuals(nextLine,:) = numbers;
        nextLine = nextLine + 1;
    end
end

%Now, trim the excess allocation from "residuals"
residuals = residuals(1:(nextLine-1),:);

(Please let me know how the speed compares.)
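The question's comments note that the iteration numbers always increase by one, and that "step" rows look much like the wanted rows. A hedged sketch of how the sscanf loop could use both facts, comparing numbers rather than strings so that num2str is not needed (the sample lines, the column count nCols and the name expected are assumptions, not part of the original answer):

```matlab
% Hand-made sample: wanted rows 1..3, a "step" row and free text in between
raw = { ...
    'iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter'; ...
    '1  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14'; ...
    '2  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13'; ...
    '1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10'; ...
    'Flow time = 5e-07s, time step = 1'; ...
    '799 more time steps'; ...
    '3  3.1371e-08  4.6175e-10  5.0506e-10  1.2020e-15  1.3419e-13  0.0000e+00  0:00:01   12'};

nCols = 7;       % assumed: iteration number plus six residual columns
expected = 1;    % assumed: iteration numbers start at 1 and rise by one
residuals = zeros(numel(raw), nCols);   % preallocate, trimmed afterwards
nRows = 0;

for ix = 1:numel(raw)
    % sscanf stops at the first thing it cannot read as a number,
    % so text lines give [] and the time field ends the parse at its colon
    numbers = sscanf(raw{ix}, '%f');

    % Keep the row only if enough values parsed and the leading value
    % is the iteration expected next (numeric comparison, no num2str)
    if numel(numbers) >= nCols && numbers(1) == expected
        nRows = nRows + 1;
        residuals(nRows, :) = numbers(1:nCols).';
        expected = expected + 1;
    end
end

% Trim the unused preallocated rows
residuals = residuals(1:nRows, :);
```

In this sample the "step" row is skipped because its leading value is 1 while iteration 3 is expected. A "step" row whose number happens to equal the expected iteration would still slip through, so a stricter check on the exact number of parsed values may be needed for real files.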

【Comments】: