从 gromacs 文件中读取数据并将其写入 hdf5 文件格式答案

【问题标题】：Reading data from gromacs file and write it to the hdf5 file format从 gromacs 文件中读取数据并将其写入 hdf5 文件格式
【发布时间】：2021-11-02 09:54:11
【问题描述】：

我正在尝试从 .gro 文件中逐行读取数据，并希望将其写入数据为 .h5 文件格式。但得到 Typeerror："No conversion path ford type: type('<U7')"。我猜读取的数据是字符串格式的。我尝试使用 np.arrays 将其转换为数组，但它不起作用。谁能帮我解决这个问题？还是有更好的方法来读取数据？我无法使用np.loadtxt，因为数据大小约为 50GB。

.gro 文件的格式如下所示

Generated by trjconv : P/L=1/400 t=   0.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
Generated by trjconv : P/L=1/400 t=   10.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
Generated by trjconv : P/L=1/400 t=   20.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
Generated by trjconv : P/L=1/400 t=   30.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
Generated by trjconv : P/L=1/400 t=   40.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96

错误：

ValueError: Some errors were detected !
    Line #5 (got 7 columns instead of 6)
    Line #6 (got 1 columns instead of 6)
    Line #9 (got 7 columns instead of 6)
    Line #10 (got 1 columns instead of 6)
    Line #13 (got 7 columns instead of 6)
    Line #14 (got 1 columns instead of 6)
    Line #17 (got 7 columns instead of 6)
    Line #18 (got 1 columns instead of 6)

这是我的小代码：

import h5py
import numpy as np
# First step is to read .gro file
f = open('pep.gro', 'r')
data = f.readlines()
for line in data:
    reading = line.split()
    #print(type(reading))
    #dat = np.array(reading).astype(int)

# Next step is to write the data to .h5 file
with h5py.File('pep1.h5', 'w') as hdf:
    hdf.create_dataset('dataset1', data=reading)

【问题讨论】：

标签： python hdf5 h5py

【解决方案1】：

首先创建具有大量行 [shape=(1_000_000)] 的 HDF5 数据集，然后使用 maxshape 参数使其可扩展。 maxshape=(None,) 的值将允许无限行。我定义了一个简单的 dtype 来匹配您的数据。如果需要，可以自动为不同的文件格式创建匹配的 dtype。

您收到了 Unicode 错误，因为 h5py 不支持将字符串作为 Unicode 数据。（默认情况下，NumPy 从字符串创建 Unicode 数据。）解决此限制的方法是提前为数组定义一个 dtype（使用 'S#'，其中 NumPy 具有“

接下来使用np.genfromtxt 将您的数据直接读入 NumPy 数组。使用skip_header 和max_rows 参数进行增量读取。将dtype 参数包含在用于创建上述数据集的dtype 中。

为了测试增量读取，我将您的文件扩展为 54 行（用于 3 个读取循环）。出于性能原因，您需要使用更大的值来读取 50GB（将 incr 设置为您可以读入内存的值 - 从 100_000 行开始）。

以下代码：（修改为跳过前两行

import h5py
import numpy as np

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

# Next, create an empty .h5 file with the dtype
with h5py.File('pep1.h5', 'w') as hdf:
    ds= hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,)) 

    # Next read line 1 of .gro file
    f = open('pep.gro', 'r')
    data = f.readlines()
    ds.attrs["Source"]=data[0]
    f.close()

    # loop to read rows from 2 until end
    skip, incr, row0 = 2, 20, 0 
    read_gro = True
    while read_gro:
        arr = np.genfromtxt('pep.gro', skip_header=skip, max_rows=incr, dtype=gro_dt)
        rows = arr.shape[0]
        if rows == 0:
            read_gro = False 
        else:    
            if row0+rows > ds.shape[0] :
                ds.resize((row0+rows,))
            ds[row0:row0+rows] = arr
            skip += rows
            row0 += rows

【讨论】：

非常感谢！！我真的不明白你的意思 我将你的文件扩展到 54 行（用于 3 个读取循环）。除此之外，代码给出 Value error: Some errors were detected $\\$
** 第 2 行（得到 1 列而不是 8 列）第 3 行（得到 6 列而不是 8 列）**
我只在创建示例输入时包含了列数据。您的文件是否包含前 2 个“非数据”行？（不清楚您是否/如何阅读它们。）我修改了我的代码以处理前两行。第一行看起来很有趣，所以我将其作为属性添加到数据集。此外，我研究了循环逻辑，因此我可以正确处理奇数大小。您还可以通过len(data) 来调整数据集的大小并避免调整大小的逻辑。
是的，它确实包含 2 个“非数据”行。顺便说一句，它现在与skip = 2一起工作。我在阅读 .gro 文件时还有一个问题。如问题中所述，第一帧末尾有 11214 行 + 标题和另一行。之后，第二帧以 Generated by trjconv : P/L=1/400 t= 1000 开头，除了现在的 t=1000。是否有可能以某种方式跳过它们？
代码修改为跳过 1 组标题。标题和数据是否在同一个文件中重复？如果是这样，我可以使用第二个标题行上的计数作为在接下来的 2 个标题行之前读取的数据行数吗？如果是这样，您是想要 1 个数据集中的所有数据，还是每个标题后面的数据都需要一个单独的数据集？

【解决方案2】：

读取 2000 个时间步的过程类似，但步骤的顺序不同。此外，由于我们知道每个时间步的数据行数（来自第二个标题行），这简化了数据集的创建。我们可以使用它来调整每个数据集的大小，并避免创建必须调整大小的可扩展数据集。

我正在写一个新的答案来说明如何做到这一点。关于 dtype 和 Numpy Unicode 字符串的评论仍然适用。此代码适用于我创建的具有 2 个时间步长的测试文件。修改 n_steps= 以使用 2000 步。

import h5py
import numpy as np

n_steps = 2 # 2000
csv_file = 'pep2.gro'

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

# Open the file for reading and
# create an empty .h5 file with the dtype above   
with open(csv_file, 'r') as f, \
     h5py.File('pep2.h5', 'w') as hdf:

    data = f.readlines()
    skip= 2

    for step in range(n_steps):
        # Read text header line for THIS time step
        header = data[skip-2]
        # get number of data rows
        no_rows = int(data[skip-1]) 
        arr = np.genfromtxt(csv_file, skip_header=skip, max_rows=no_rows, dtype=gro_dt)
        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
            #create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
        skip += no_rows + 3

【讨论】：

谢谢！但仍然得到同样的错误！我更新了问题中的数据文件以及错误，只是为了显示数据的样子。我猜想读取下一个标题需要更多的列数。
通过查看错误，似乎空格算作列。
genfromtxt 的列分隔符是空格（默认情况下）。给出错误的行是否有额外的空格？如果是这样，您可能需要另一种方法来解析数据。
是的，他们有额外的空间。好的！我会尝试其他方法。感谢所有的努力！
在之前的评论中，您说“第一帧末尾还有一行”查看您的文件更新，我没有看到该行。根据评论，我阅读了标题标题（“由 trjconv 生成：P/L=1/400 t=0.00000”），然后是数据计数（11214），然后是 11214 行数据，然后跳过一行。 np.genfromtxt() 如果没有多余的行会抛出错误（因为行数已关闭。它将从标题中读取 7 个字段，从数据计数行中读取 1 个字段。这可能是问题吗？如果不是，请发布带有额外空格的示例数据行。