How to extract data from a file using only Python: answers

【Question title】: How to extract data from file using only python
【Posted】: 2021-06-13 06:57:18
【Question】:

I have a data file that looks like this:

# GROMACS
#
@    title "GROMACS Energies"
@    xaxis  label "Time (ps)"
@    yaxis  label "(K)"
@TYPE xy
@ view 0.15, 0.15, 0.75, 0.85
@ legend on
@ legend box on
@ legend loctype view
@ legend 0.78, 0.8
@ legend length 2
@ s0 legend "Temperature"
    0.000000  301.204895
    1.000000  299.083496
    2.000000  293.100250
    3.000000  301.090637
    4.000000  293.024811
    5.000000  297.068481
    6.000000  298.065125
    7.000000  300.354370
    8.000000  304.322693
    9.000000  297.093170
   10.000000  297.186615
   11.000000  298.112732
   12.000000  293.396545
   13.000000  295.803162
   14.000000  293.432037
   15.000000  298.306702
   16.000000  297.545715
   17.000000  294.283875
   18.000000  295.527771
   19.000000  297.193665

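What I want to do is extract all the data points below the @ s0 legend "Temperature" line and put them into a DataFrame, or just any data structure that Python can access easily. I am currently using a combination of awk and Python to do this.

I first run awk '/@ s0 legend/{flag=1; next} flag' temp.xvg > temp.dat to get a temp.dat file containing only the two data columns. Then I use pandas read_csv to load the data as columns for my analysis.

I would like to cut out the middleman of writing a temporary file to disk just to hand the data to Python. Is this possible? Can I extract the data columns with a simple Python script?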

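【Comments】:

Why the downvote?

Tags: python awk text-processing

【Solution 1】:

You can read lines from the file until you reach the legend line, then use read_csv on the remainder of the file. While reading the header lines, you can also extract the xaxis and yaxis labels to use as column names. For example: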

import pandas as pd
import re

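xaxis, yaxis = 'x', 'y'  # fallback column names if no axis labels are found

with open('test.dat', 'r') as f:
    # Consume the header, picking up the axis labels along the way
    for line in f:
        m = re.search(r'xaxis\s+label\s+"([^"]+)"', line)
        if m is not None:
            xaxis = m.group(1)
        m = re.search(r'yaxis\s+label\s+"([^"]+)"', line)
        if m is not None:
            yaxis = m.group(1)
        if line.startswith('@ s0 legend'):
            break
    # The file position is now just past the legend line; parse the rest.
    # sep=r'\s+' replaces delim_whitespace=True, which newer pandas deprecates.
    df = pd.read_csv(f, names=[xaxis, yaxis], sep=r'\s+')
# No explicit f.close() is needed: the with block closes the file.

print(df)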

Output:

    Time (ps)         (K)
0         0.0  301.204895
1         1.0  299.083496
2         2.0  293.100250
3         3.0  301.090637
4         4.0  293.024811
5         5.0  297.068481
6         6.0  298.065125
7         7.0  300.354370
8         8.0  304.322693
9         9.0  297.093170
10       10.0  297.186615
11       11.0  298.112732
12       12.0  293.396545
13       13.0  295.803162
14       14.0  293.432037
15       15.0  298.306702
16       16.0  297.545715
17       17.0  294.283875
18       18.0  295.527771
19       19.0  297.193665
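
If metadata lines might also appear after the legend line, a variation is to filter the lines first and hand the remainder to read_csv through an in-memory buffer. The following is only a sketch: the function name read_xvg and the default column names are made up for illustration, and it assumes every non-data line starts with '#' or '@'.

```python
import io

import pandas as pd


def read_xvg(lines, names=("time", "value")):
    """Build a DataFrame from xvg-style lines (hypothetical helper).

    Assumes every non-data line is blank or starts with '#' or '@'.
    """
    data = "\n".join(
        line.strip()
        for line in lines
        if line.strip() and not line.lstrip().startswith(("#", "@"))
    )
    # StringIO lets read_csv parse the filtered text without a temp file
    return pd.read_csv(io.StringIO(data), sep=r"\s+", names=list(names))


sample = [
    "# GROMACS\n",
    '@ s0 legend "Temperature"\n',
    "    0.000000  301.204895\n",
    "    1.000000  299.083496\n",
]
df = read_xvg(sample)
print(df)
```

For a real file, the same function can be given the open file object, for example with open('temp.xvg') as f: df = read_xvg(f).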

【Comments】:

【Solution 2】:

It's a one-liner in Python. For example:

file_data = [list(map(float,x.strip().split())) for x in open("filedata.txt","rt") if x.strip()[:1] not in "@#"]

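This reads the file, strips whitespace, discards the non-data lines, splits each string, and converts the fields to floats. The result is a list of data pairs.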
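
A list of pairs is often easier to work with once it is split into columns. The sketch below wraps the same filtering in a function (the name parse_pairs is just for illustration) so that it accepts any iterable of lines, then uses zip to transpose the pairs. It assumes every data line holds whitespace-separated numbers.

```python
def parse_pairs(lines):
    """Return lists of floats from lines, skipping blanks and '@'/'#' lines."""
    return [
        [float(field) for field in line.split()]
        for line in lines
        if line.strip() and line.strip()[0] not in "@#"
    ]


sample = [
    "# GROMACS\n",
    "@TYPE xy\n",
    "\n",
    "    0.000000  301.204895\n",
    "    1.000000  299.083496\n",
]
file_data = parse_pairs(sample)

# zip(*rows) transposes the list of pairs into one tuple per column
times, temps = zip(*file_data)
print(times)
print(temps)
```

With a real file, passing the open file object inside a with block (with open("filedata.txt") as f: file_data = parse_pairs(f)) also ensures the file gets closed.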

【Comments】:

【Solution 3】:

The equivalent program in Python is:

import re

found = False  # becomes True once the legend line has been seen
with open('temp.xvg', 'r') as file1:
    for line in file1:
        if not found:
            # Skip everything up to and including the legend line
            if re.search(r"^@ s0 legend", line):
                found = True
        else:
            # Print each data line without surrounding whitespace
            print(line.strip())

You can save this as program.py and then run:

python3 program.py

You will get this output:

0.000000  301.204895
1.000000  299.083496
2.000000  293.100250
3.000000  301.090637
4.000000  293.024811
5.000000  297.068481
6.000000  298.065125
7.000000  300.354370
8.000000  304.322693
9.000000  297.093170
10.000000  297.186615
11.000000  298.112732
12.000000  293.396545
13.000000  295.803162
14.000000  293.432037
15.000000  298.306702
16.000000  297.545715
17.000000  294.283875
18.000000  295.527771
19.000000  297.193665
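
To keep the values in Python instead of printing them, the same "skip until the marker" idea can be written with itertools. This is a sketch under the assumption that every line after the marker is either blank or a row of numbers; the function name rows_after_marker is made up for illustration.

```python
from itertools import dropwhile, islice


def rows_after_marker(lines, marker="@ s0 legend"):
    """Return rows of floats from the lines that follow the first marker line."""
    # dropwhile discards lines until one starts with the marker
    tail = dropwhile(lambda line: not line.startswith(marker), lines)
    # islice(tail, 1, None) then skips the marker line itself
    return [
        [float(field) for field in line.split()]
        for line in islice(tail, 1, None)
        if line.strip()
    ]


sample = [
    "# GROMACS\n",
    "@ legend length 2\n",
    '@ s0 legend "Temperature"\n',
    "    0.000000  301.204895\n",
    "    1.000000  299.083496\n",
]
rows = rows_after_marker(sample)
print(rows)
```

If the marker never appears, the function returns an empty list, which mirrors the awk command printing nothing.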

【Comments】: