Reading float values from an ASCII file with a variable number of columns

【Question title】: Read float values from ASCII file with variable number of columns
【Posted】: 2014-06-14 19:48:19
【Question】:

I have ASCII files containing floating-point numbers. Most lines have 10 columns, but some lines have fewer. Here is an example:

* lat =   33.2813
  19.61  19.92  21.82  21.94  22.77  25.81  29.48  29.86  29.92  28.98
  27.94  25.78  23.68  23.37
* lat =   33.3438
  20.16  23.62  27.73  31.12  33.06  34.01  35.78  37.03  37.79  35.74
  34.12  31.83  33.98  28.57
* lat =   33.4063
  28.26  30.04  35.00  37.92  41.50  44.55  45.44  46.74  46.74  43.47
  37.67  35.67  35.67  31.64
* lat =   33.4688
  34.02  36.07  38.95  44.24  46.49  47.98  50.62  51.95  51.95  51.95
  48.31  41.03  38.01  34.58
* lat =   33.5313
  36.94  37.12  44.04  48.41  51.70  52.71  54.18  55.71  56.98  62.10
  57.26  49.05  44.18  41.50

Lines starting with * are comments.

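How can I read this file efficiently with numpy? (This is a toy example; my actual data files contain >> 1E6 values.) The numpy functions loadtxt / genfromtxt do not seem able to handle a variable number of columns: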

   In [25]: np.loadtxt(fn, comments="*", dtype=float)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-2419eebb6114> in <module>()
----> 1 np.loadtxt(fn, comments="*", dtype=float)

/usr/lib/pymodules/python2.7/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
    833             fh.close()
    834 
--> 835     X = np.array(X, dtype)
    836     # Multicolumn data are returned with shape (1, N, M), i.e.
    837     # (1, 1, M) for a single row - remove the singleton dimension there

ValueError: setting an array element with a sequence.

genfromtxt is more verbose, but it does not work either:

    In [27]: np.genfromtxt(fn, comments="*", dtype=float)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-6c6e8879e4b9> in <module>()
----> 1 np.genfromtxt(fn, comments="*", dtype=float)

/usr/lib/pymodules/python2.7/numpy/lib/npyio.pyc in genfromtxt(fname, dtype, comments, delimiter, skiprows, skip_header, skip_footer, converters, missing, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise)
   1636             # Raise an exception ?
   1637             if invalid_raise:
-> 1638                 raise ValueError(errmsg)
   1639             # Issue a warning ?
   1640             else:

ValueError: Some errors were detected !
    Line #2 (got 4 columns instead of 10)
    Line #5 (got 4 columns instead of 10)
    Line #8 (got 4 columns instead of 10)
    Line #11 (got 4 columns instead of 10)
    Line #14 (got 4 columns instead of 10)
    Line #17 (got 4 columns instead of 10)
    Line #20 (got 4 columns instead of 10)
    Line #23 (got 4 columns instead of 10)
    Line #26 (got 4 columns instead of 10)
    Line #29 (got 4 columns instead of 10)

There is a keyword argument invalid_raise, but setting it to False causes the lines with fewer than 10 values to be skipped.
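The skipping behaviour can be reproduced on a two-block excerpt of the sample. This is only a minimal sketch: io.StringIO stands in for the real file, and the warnings filter is there because genfromtxt reports each dropped line as a warning.

```python
import io
import warnings

import numpy as np

# Two-block excerpt of the sample data (an in-memory stand-in for the file).
sample = """\
* lat =   33.2813
  19.61  19.92  21.82  21.94  22.77  25.81  29.48  29.86  29.92  28.98
  27.94  25.78  23.68  23.37
* lat =   33.3438
  20.16  23.62  27.73  31.12  33.06  34.01  35.78  37.03  37.79  35.74
  34.12  31.83  33.98  28.57
"""

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the per-line conversion warnings
    arr = np.genfromtxt(io.StringIO(sample), comments="*", invalid_raise=False)

# Only the 10-column rows are kept; the 4-value rows are silently lost.
print(arr.shape)
```

So the call succeeds, but 8 of the 28 values in this excerpt never reach the array.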

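I would appreciate any help with this. I am happy to write my own file parser in Cython, but I have not been able to find information on efficient string -> float conversion in Cython ...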

【Comments】:

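If the lines with * are just comments, they should not be needed to separate different arrays. They look less like comments and more like delimiters marking where the next group of values starts. Otherwise, why not just use a generator over the file, skip the * lines, and append the data to one big array?
In other words, do you need the lat value from the * lines to organize, filter, or process the values listed below it? If not, why are the values kept in separate groups? If so, you need a different data structure, such as a Pandas DataFrame, a record array, or a dict, that associates each lat value with its array of values.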
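The dict option mentioned above could look roughly like this. It is a sketch only: read_lat_blocks is a hypothetical helper name, and it assumes every header has the form "* lat = <number>" as in the sample.

```python
import numpy as np

def read_lat_blocks(lines):
    """Map each latitude from a '* lat = ...' header to the values listed below it."""
    blocks = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("*"):
            # Header such as "* lat =   33.2813": keep the number after '='.
            current = float(line.split("=")[1])
            blocks[current] = []
        elif current is not None:
            # Data line: any number of whitespace-separated floats.
            blocks[current].extend(float(tok) for tok in line.split())
    return {lat: np.array(vals) for lat, vals in blocks.items()}

# Two-block excerpt of the sample data from the question.
sample = """\
* lat =   33.2813
  19.61  19.92  21.82  21.94  22.77  25.81  29.48  29.86  29.92  28.98
  27.94  25.78  23.68  23.37
* lat =   33.3438
  20.16  23.62  27.73  31.12  33.06  34.01  35.78  37.03  37.79  35.74
  34.12  31.83  33.98  28.57
"""

blocks = read_lat_blocks(sample.splitlines())
print(sorted(blocks))
```

For a real file, an open file handle can be passed in place of sample.splitlines(), since the function only iterates over lines.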

Tags: python numpy cython

【Solution 1】:

Here is an approach using the pandas parser. If you only want a numpy array, use df.values

In [239]: import pandas as pd

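In [240]: df = pd.read_csv('input.txt', header=None, skiprows=1, sep=r'\s+')  # sep=r'\s+' replaces the deprecated delim_whitespace=True

In [242]: df = df[df[0] != '*']  # filter out comment rows

In [245]: df = df.apply(pd.to_numeric, errors='coerce')  # convert_objects was removed from later pandas releases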

In [246]: df
Out[246]: 
        0      1      2      3      4      5      6      7      8      9
0   19.61  19.92  21.82  21.94  22.77  25.81  29.48  29.86  29.92  28.98
1   27.94  25.78  23.68  23.37    NaN    NaN    NaN    NaN    NaN    NaN
3   20.16  23.62  27.73  31.12  33.06  34.01  35.78  37.03  37.79  35.74
4   34.12  31.83  33.98  28.57    NaN    NaN    NaN    NaN    NaN    NaN
6   28.26  30.04  35.00  37.92  41.50  44.55  45.44  46.74  46.74  43.47
7   37.67  35.67  35.67  31.64    NaN    NaN    NaN    NaN    NaN    NaN
9   34.02  36.07  38.95  44.24  46.49  47.98  50.62  51.95  51.95  51.95
10  48.31  41.03  38.01  34.58    NaN    NaN    NaN    NaN    NaN    NaN
12  36.94  37.12  44.04  48.41  51.70  52.71  54.18  55.71  56.98  62.10
13  57.26  49.05  44.18  41.50    NaN    NaN    NaN    NaN    NaN    NaN
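If only the numbers matter, the NaN padding can be avoided by skipping the * lines in plain Python and handing all remaining tokens to numpy in one go. The following is a minimal sketch rather than a tuned parser: read_values is a hypothetical helper name, and the reshape assumes every block holds exactly 14 values, as in the sample.

```python
import numpy as np

def read_values(lines, comment="*"):
    """Return all numbers from the non-comment lines as one flat float array."""
    tokens = []
    for line in lines:
        if line.lstrip().startswith(comment):
            continue  # skip "* lat = ..." header lines
        tokens.extend(line.split())
    # numpy converts the numeric strings to floats in a single call.
    return np.array(tokens, dtype=float)

# Two-block excerpt of the sample data from the question.
sample = """\
* lat =   33.2813
  19.61  19.92  21.82  21.94  22.77  25.81  29.48  29.86  29.92  28.98
  27.94  25.78  23.68  23.37
* lat =   33.3438
  20.16  23.62  27.73  31.12  33.06  34.01  35.78  37.03  37.79  35.74
  34.12  31.83  33.98  28.57
"""

values = read_values(sample.splitlines())
per_lat = values.reshape(-1, 14)  # assumption: 14 values per lat block
print(values.shape, per_lat.shape)
```

For a file on disk, `with open(fn) as fh: values = read_values(fh)` does the same thing without loading the text into a string first.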

【Discussion】: