Python: how to split positional data from a text file

【Question title】: Python: how to split positional data from textfile
【Posted】: 2021-05-16 21:06:01
【Question】:

I have a text file containing data that I am trying to read in Python:

OMEGA2    1.450E+00 1.500E+00 1.550E+00 1.600E+00 1.650E+00 1.700E+00
OMEGA2    1.800E+00 1.850E+00 1.900E+00 1.950E+00 2.000E+00 2.050E+00
F2REAL    1.146E+00 -1.015E+03-2.206E+03-2.618E+03-2.288E+03-1.400E+03
F2REAL    6.255E+00 -3.254E+02-8.150E+02-1.060E+03-9.749E+02-5.995E+02
F2REAL    1.754E+01 -1.530E+02-4.375E+02-5.932E+02-5.618E+02-3.536E+02
F2REAL    1.740E+01 -7.981E+01-2.525E+02-3.748E+02-3.891E+02-2.739E+02
OMEGA2    1.800E+00 1.850E+00 1.900E+00 1.950E+00 2.000E+00 2.050E+00

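Now I only want the values on lines that start with F2REAL, and from each such line I want to extract 6 values. Value 1 runs from index 11 to index 20, value 2 from index 21 to 30, ..., and value 6 from index 61 to 70.

I tried the following: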

file = 'file.txt'
STR1 = 'F2REAL'

def get_data():
    with open(file) as f:
        hyd_all = f.readlines()
        for line in hyd_all:
            if STR1 in line:
                print([float(line[10:19]),float(line[20:29])])

get_data()

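This does not read the E-power, because I get [1.146,-1.015,..]. How do I read the values correctly?
Is there a better way than writing 10:19,20:29,..60:69? All lines of interest have 6 columns, and a column always starts at 10*i.
I want to append each result to a matrix; in this example it would have 4 rows and 6 columns.

【Comments】:

A slice such as line[10:19] includes the first index but not the last one (that is, line[19] is not part of the slice). Use line[10:20] and so on.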
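
To make the comment above concrete, here is a minimal sketch using one record copied from the question's sample data. The variable names (line, too_short, full_field) are only illustrative. It suggests why the exponent seemed to disappear: the nine-character slice drops the last exponent digit, so float() sees E+0 instead of E+03.

```python
# One F2REAL record copied from the question's sample data.
line = "F2REAL    1.146E+00 -1.015E+03-2.206E+03-2.618E+03-2.288E+03-1.400E+03"

# A slice excludes its end index, so [20:29] is only nine characters wide
# and cuts off the last digit of the exponent.
too_short = line[20:29]   # '-1.015E+0'
full_field = line[20:30]  # '-1.015E+03'

print(repr(too_short), float(too_short))    # '-1.015E+0' -1.015
print(repr(full_field), float(full_field))  # '-1.015E+03' -1015.0
```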

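Tags: python

【Solution 1】:

The E-notation is just that: a notation. float() parses it correctly, and the value is merely displayed differently; the exponent only goes missing when the slice is too short and cuts it off.
You can build each row with a list comprehension.
Assuming you mean a NumPy matrix (otherwise just switch to a pandas DataFrame):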

import numpy as np


def get_data(path: str, target: str, width: int = 10):
    values = []
    with open(path, 'r') as f:
        for line in f.readlines():
            # 'F2REAL' should be at the beginning of the line not just anywhere
            if line.startswith(target):
                # map sequential fixed widths to float
                values.append([float(line[width*i:width*(i+1)]) for i in range(1, 7)])

    return np.asarray(values)
    

print(get_data('file.txt', 'F2REAL'))

Output:

[[ 1.146e+00 -1.015e+03 -2.206e+03 -2.618e+03 -2.288e+03 -1.400e+03]
 [ 6.255e+00 -3.254e+02 -8.150e+02 -1.060e+03 -9.749e+02 -5.995e+02]
 [ 1.754e+01 -1.530e+02 -4.375e+02 -5.932e+02 -5.618e+02 -3.536e+02]
 [ 1.740e+01 -7.981e+01 -2.525e+02 -3.748e+02 -3.891e+02 -2.739e+02]]
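
The array above is printed in scientific notation, but the stored values are ordinary floats. As a small sketch of that point in plain Python (the variable names are illustrative, and the format specifications are just examples), the same parsed number can be rendered in several ways:

```python
# Parse one field; float() understands the E notation directly.
value = float("-1.015E+03")

# The same number rendered three ways; only the text differs.
as_scientific = f"{value:.3e}"  # '-1.015e+03'
as_fixed = f"{value:.1f}"       # '-1015.0'
as_general = f"{value:g}"       # '-1015'

print(as_scientific, as_fixed, as_general)
```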

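【Comments】:

【Solution 2】:

Your file is in a fixed-width format. I suggest using the pandas library, which has a dedicated function for reading such files, read_fwf. read_fwf accepts a colspecs argument, a list of tuples in which each tuple holds the start and end of one column (the end is exclusive). Since your file has no header row, pass header=None and the columns will be numbered automatically (you can rename them afterwards, or add a header row to the file).

The function recognizes scientific notation (E+0...) and parses your numbers as actual numbers rather than strings. You can then change the display format of the numbers to whatever you like.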

import pandas as pd


colspecs = [(0, 6), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
df = pd.read_fwf("file.txt", header=None, colspecs=colspecs)

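If possible, I recommend pandas: it is a very powerful library that lets you do a lot with your data, such as plotting, querying, or statistical calculations. The code is also concise.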
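
The read_fwf call reads every record, OMEGA2 lines included. As a hedged sketch of how the F2REAL rows might then be selected, the example below reads a few sample records from an in-memory string that stands in for file.txt. The names sample and f2real are illustrative, and the column specification assumes every value occupies a 10-character field starting at index 10.

```python
import io

import pandas as pd

# Sample records from the question, standing in for file.txt.
sample = (
    "OMEGA2    1.450E+00 1.500E+00 1.550E+00 1.600E+00 1.650E+00 1.700E+00\n"
    "F2REAL    1.146E+00 -1.015E+03-2.206E+03-2.618E+03-2.288E+03-1.400E+03\n"
    "F2REAL    6.255E+00 -3.254E+02-8.150E+02-1.060E+03-9.749E+02-5.995E+02\n"
)

# One (start, end) pair per column; the end is exclusive, as with slicing.
colspecs = [(0, 6)] + [(10 * i, 10 * (i + 1)) for i in range(1, 7)]
df = pd.read_fwf(io.StringIO(sample), header=None, colspecs=colspecs)

# Keep only the F2REAL records and drop the label column.
f2real = df[df[0] == "F2REAL"].iloc[:, 1:].to_numpy()
print(f2real.shape)  # (2, 6)
```

With a real file, the path would be passed to read_fwf in place of the StringIO object.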

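【Comments】:

Thanks for your reply. I am familiar with pandas, but I do not want to use it here because I have a raw text file that contains many other lines of different sizes, comments, and so on.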

【Solution 3】:

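file = 'file.txt'
STR1 = 'F2REAL'

def parse_scientific(s):
    # Split e.g. '-1.015E+03' into mantissa and exponent.
    # (float(s) would also parse this notation without a helper.)
    mantissa, exp = s.strip().split('E')
    return float(mantissa) * 10 ** int(exp)

def get_data():
    rows = []
    with open(file) as f:
        for line in f:
            if line.startswith(STR1):
                # six 10-character fields, the first one starting at index 10
                rows.append([parse_scientific(line[offset*10:offset*10+10]) for offset in range(1, 7)])
    return rows

get_data() returns one list of six values per F2REAL line; use these rows to fill your matrix.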

【Comments】:

This does not work for me: AttributeError: 'generator' object has no attribute 'split'

【Solution 4】:

file = 'file.txt'
STR1 = 'F2REAL'

def get_data():
    with open(file) as f:
        hyd_all = f.readlines()
        for line in hyd_all:
            if STR1 in line:
                print(line[10:20],line[20:30],line[30:40],line[40:50],line[50:60],line[60:70])

get_data()

The result is as follows:

【Comments】: