Pandas: start and stop parsing after a delimiter keyword (answers)

【Question title】: Pandas: start and stop parsing after a delimiter keyword
【Posted】: 2018-01-26 14:38:06
【Question】:

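I'm a chemist working with potential energy distributions. The output is somewhat messy (some lines use more columns than others), and several analyses are written to the same file, so I would like to start and stop parsing when I see specific "keywords" or markers such as "***".

Here is a sample of my input: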

Average max. Potential Energy <EPm> = 41.291
TED Above 100 Factor TAF=0.011
Average coordinate population 1.000
s 1     1.00   STRE    4    7   NH    1.015024  f3554 100
s 2     1.00   STRE    2    1   CH    1.096447  f3127 13  f3126 13  f3073 37  f3073 34
s 3     1.00   STRE    2    5   CH    1.094347  f3127 38  f3126 36  f3073 12  f3073 11
s 4     1.00   STRE    6    8   CH    1.094349  f3127 36  f3126 38  f3073 11  f3073 13
s 5     1.00   STRE    2    3   CH    1.106689  f2950 48  f2944 46
s 6     1.00   STRE    6    9   CH    1.106696  f2950 47  f2944 47
s 7     1.00   STRE    6   10   CH    1.096447  f3127 12  f3126 13  f3073 33  f3073 38
s 8     1.00   STRE    4    2   NC    1.450644  f1199 43  f965 39
s 9     1.00   STRE    4    6   NC    1.450631  f1199 43  f965 39
s 10    1.00   BEND    7    4    6   HNC   109.30  f1525 12  f1480 42  f781 18
s 11    1.00   BEND    1    2    3   HCH   107.21  f1528 33  f1525 21  f1447 12
s 12    1.00   BEND    5    2    1   HCH   107.42  f1493 17  f1478 36  f1447 20
s 13    1.00   BEND    8    6   10   HCH   107.42  f1493 17  f1478 36  f1447 20
s 14    1.00   BEND    3    2    5   HCH   108.14  f1525 10  f1506 30  f1480 14  f1447 13
s 15    1.00   BEND    9    6    8   HCH   108.13  f1525 10  f1506 30  f1480 14  f1447 13
s 16    1.00   BEND   10    6    9   HCH   107.20  f1528 33  f1525 21  f1447 12
s 17    1.00   BEND    6    4    2   CNC   112.81  f383 85
s 18    1.00   TORS    7    4    2    1   HNCH  -172.65  f1480 10  f781 55
s 19    1.00   TORS    1    2    4    6   HCNC    65.52  f1192 27  f1107 14  f243 18
s 20    1.00   TORS    5    2    4    6   HCNC  -176.80  f1107 17  f269 35  f243 11
s 21    1.00   TORS    8    6    4    2   HCNC  -183.20  f1107 17  f269 35  f243 11
s 22    1.00   TORS    3    2    4    6   HCNC   -54.88  f1273 26  f1037 22  f243 19
s 23    1.00   TORS    9    6    4    2   HCNC    54.88  f1273 26  f1037 22  f243 19
s 24    1.00   TORS   10    6    4    2   HCNC   -65.52  f1192 30  f1107 18  f243 21
****
 9 STRE modes:
  1  2  3  4  5  6  7  8  9
 8 BEND modes:
 10 11 12 13 14 15 16 17
 7 TORS modes:
 18 19 20 21 22 23 24
 19 CH modes:
  2  3  4  5  6  7 11 12 13 14 15 16 18 19 20 21 22 23 24
 0 USER modes:


alternative coordinates 25 
k 10    1.00   BEND    7    4    2   HNC   109.30
k 11    1.00   BEND    1    2    4   HCN   109.41
k 12    1.00   BEND    5    2    4   HCN   109.82
k 13    1.00   BEND    8    6    4   HCN   109.82
k 14    1.00   BEND    3    2    1   HCH   107.21
k 15    1.00   BEND    9    6    4   HCN   114.58
k 16    1.00   BEND   10    6    8   HCH   107.42
k 18    1.00   TORS    7    4    2    5   HNCH   -54.98
k 18    1.00   TORS    7    4    2    3   HNCH    66.94
k 18    1.00   OUT     4    2    6    7   NCCH    23.30
k 19    1.00   OUT     2    3    5    1   CHHH    21.35
k 19    1.00   OUT     2    1    5    3   CHHH    21.14
k 19    1.00   OUT     2    3    1    5   CHHH    21.39
k 20    1.00   OUT     2    1    4    5   CHNH    21.93
k 20    1.00   OUT     2    5    4    1   CHNH    21.88
k 20    1.00   OUT     2    1    5    4   CHHN    16.36
k 21    1.00   TORS    8    6    4    7   HCNH    54.98
k 21    1.00   OUT     6   10    9    8   CHHH    21.39
k 22    1.00   OUT     2    1    4    3   CHNH    20.12
k 22    1.00   OUT     2    5    4    3   CHNH    19.59
k 23    1.00   TORS    9    6    4    7   HCNH   -66.94
k 23    1.00   OUT     6    8    4    9   CHNH    19.59
k 24    1.00   TORS   10    6    4    7   HCNH  -187.34
k 24    1.00   OUT     6    9    4   10   CHNH    20.32
k 24    1.00   OUT     6    8    4   10   CHNH    21.88

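I want to skip the first 3 lines (I know how to do that with skiprows=3), then stop parsing at "***" and fit the content into 11 columns with predefined names such as "tVib1", "%PED1", "tVib2", "%PED2", and so on.

After that, still within the same file, I want to start parsing again at the words "alternative coordinates", also into 11 columns.

This seems hard to achieve.

Any help is much appreciated.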

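【Comments】:

I just noticed that the columns are really messy. It looks as if some columns are randomly omitted in both segments. How do you know which value belongs to which of your 11 columns?
@Piinthesky, you are right, it is a mess, and it looks hard to write a good Python script to organise it. I know which columns are missing by comparing each line against the full set of 11 columns.
@Piinthesky, thanks for looking into this. Here is the link: 1drv.ms/f/s!AscFK8cOesFhiOI2tk2k_n_Ysu4SGA

Tags: python-3.x pandas csv dataframe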

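【Solution 1】:

For the provided .dd2 file I used a different strategy. The implicit assumptions are:
1) A line is used only if it starts with lower case - space - digit, or with at least five spaces, and is followed by at least one upper-case word
2) If missing, the first, the third and every f column are reused from the previous line
3) The third column contains the first upper-case word
4) If the distance between the first and the second upper-case word is smaller than the given variable max_col, NaN is inserted for the missing values
5) The f value columns start two columns after the second upper-case column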

import re
import pandas as pd
import numpy as np

def align_columns(file_name, col_names = ["ID", "N1", "S1", "N2", "N3", "N4", "N5", "S2", "N6"], max_col = 4):
    #max_col: number of columns between the two capitalised columns
    #column names for the first values N = number, S = string, F = f number, adapt to your needs
    #both optional parameters 

    #collect all data sets as a list of lists
    all_lines = []
    last_id, last_cat, last_fval = 0, 0, []

    #opening file to read
    for line_in in open(file_name, "r"):
        #use only lines that start either
        #with lower case - space - digit or at least five spaces
        #and have an upper case word in the line
        start_str = re.match(r"([a-z]\s\d|\s{5,}).*[A-Z]+", line_in)
        if not start_str:
            continue

        #split data columns into chunks using 2 or more whitespaces as a delimiter
        sep_items = re.split(r"\s{2,}", line_in.strip())
        #if ID is missing use the information from last line
        if not re.match(r"[a-z]\s\d", sep_items[0]):
            sep_items.insert(0, last_id)
            sep_items.insert(2, last_cat)
            sep_items.extend(last_fval)
        #otherwise keep the information in case it is missing from next line
        else:
            last_id = sep_items[0]
            last_cat = sep_items[2]

        #get index for the two columns with upper case words
        index_upper = [i for i, item in enumerate(sep_items) if item.isupper()]

        if len(index_upper) < 2 or index_upper[0] != 2 or index_upper[1] > index_upper[0] + max_col + 1:
            print("Irregular format, skipped line:")
            print(line_in)
            continue

        #get f values in case they are missing for next line
        last_fval = sep_items[index_upper[1] + 2:]

        #if not enough columns between the two capitalised columns, fill with NaN
        if index_upper[1] < 3 + max_col:
            fill_nan = [np.nan] * (3 + max_col - index_upper[1])
            sep_items[index_upper[1]:index_upper[1]] = fill_nan
        #append to list
        all_lines.append(sep_items)

    #create pandas dataframe from list
    df = pd.DataFrame(all_lines)
    #convert columns to float, if possible
    df = df.apply(pd.to_numeric, errors='ignore', downcast='float')
    #label columns according to col_names list and add f0, f1... at the end
    df.columns = [col_names[i] if i < len(col_names) else "f" + str(i - len(col_names)) for i in df.columns] 
    return df

#-----------------main script--------------
#use standard parameters of function
conv_file = align_columns("a1-91a.dd2")
print(conv_file)

#use custom parameters for labels and number of fill columns 
col_labels = ["X1", "Y1", "Z1", "A1", "A2", "A3", "A4", "A5", "A6", "Z2", "B1"]
conv_file2 = align_columns("a1-91a.dd2", col_labels, 6)
print(conv_file2)

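This is more flexible than the first solution. The number of f value columns is not limited to a fixed count.
The example shows how to use it both with the default parameters of the function definition and with custom parameters. It is certainly not the prettiest solution, and I would happily upvote anything more elegant. But it does at least work in my Python 3.5 environment. If there are any problems with the data file, please let me know.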
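
To make the splitting and padding step easier to follow, here is a minimal sketch that applies it to two lines from the sample input. The helper name split_and_pad is hypothetical and not part of the answer's function; it assumes every line contains two upper-case columns and leaves out the carry-over of values from the previous line.

```python
import re

import numpy as np


def split_and_pad(line, max_col=4):
    """Hypothetical helper: split one data line on runs of 2+ whitespace and
    pad the gap between the two upper-case columns with NaN."""
    items = re.split(r"\s{2,}", line.strip())
    # positions of the upper-case words, e.g. STRE and NH
    upper = [i for i, item in enumerate(items) if item.isupper()]
    # the second upper-case column should end up at index 3 + max_col
    missing = 3 + max_col - upper[1]
    if missing > 0:
        items[upper[1]:upper[1]] = [np.nan] * missing
    return items


stre = "s 1     1.00   STRE    4    7   NH    1.015024  f3554 100"
tors = "s 18    1.00   TORS    7    4    2    1   HNCH  -172.65  f1480 10  f781 55"

print(split_and_pad(stre))  # two NaN fillers are inserted before "NH"
print(split_and_pad(tors))  # already has four numbers, nothing is inserted
```

Because single spaces are not treated as separators, "s 1" and "f3554 100" each stay together as one item.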

P.S.: The approach for converting the relevant columns to float was provided by jezrael.

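【Discussion】:

What an elegant solution! This is amazing. Thank you for taking the time.
You're welcome. Please check your data carefully after importing. It is quite possible that I have not accounted for some anomaly in the data structure. But then again, scientists double-check their data at every step of processing. One last question: are these files generated by commercial scientific software?
It is generated by free software called VEDA: smmg.pl/software/veda. Once I started looking at it, I found the generated data is a mess, but colleagues in my research group use it a lot, so reducing the pain matters. Thanks again.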

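【Solution 2】:

This doesn't seem that hard: you have already described everything you want, and all that is left is to translate it into Python. First, you can read the whole file and store it as a list of lines: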

with open(filename,'r') as file_in:
    lines = file_in.readlines()

Then you can skip the first 3 lines and keep parsing until you find "***":

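ind = 3  # index of the first line after the three header lines
# find() returns -1 while the marker is absent, so loop until it appears
while ind < len(lines) and lines[ind].find('***') == -1:
    tmp = lines[ind]
    # ... do what you want with tmp ...
    ind = ind + 1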

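You can then carry on with whatever you need, replacing the argument of find("...") with whichever keyword you are looking for.

To process each line "tmp", you can use handy Python string methods such as tmp.split() and tmp.strip(), convert strings to numbers, and so on.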
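
Putting these steps together, here is a minimal sketch of the whole approach. The function names split_sections and to_frame and their parameters are hypothetical, and the inline sample is a shortened excerpt of the question's input. It only separates the two blocks and splits each line on runs of two or more spaces; it does not align the missing columns.

```python
import re

import pandas as pd


def split_sections(lines, skip=3, stop="***", restart="alternative coordinates"):
    """Hypothetical helper: return the lines before `stop` and after `restart`."""
    first, second = [], []
    state = "first"
    for line in lines[skip:]:
        if state == "first":
            if stop in line:
                state = "between"   # stop collecting at the marker
            else:
                first.append(line)
        elif state == "between":
            if restart in line:
                state = "second"    # resume after the keyword line
        elif line.strip():          # skip blank lines in the second block
            second.append(line)
    return first, second


def to_frame(block):
    """Split each line on 2+ whitespace; short rows are padded by pandas."""
    rows = [re.split(r"\s{2,}", line.strip()) for line in block]
    return pd.DataFrame(rows)


sample = """Average max. Potential Energy <EPm> = 41.291
TED Above 100 Factor TAF=0.011
Average coordinate population 1.000
s 1     1.00   STRE    4    7   NH    1.015024  f3554 100
s 5     1.00   STRE    2    3   CH    1.106689  f2950 48  f2944 46
****
 9 STRE modes:
  1  2  3  4  5  6  7  8  9

alternative coordinates 25
k 10    1.00   BEND    7    4    2   HNC   109.30
k 11    1.00   BEND    1    2    4   HCN   109.41
"""

first, second = split_sections(sample.splitlines())
df1, df2 = to_frame(first), to_frame(second)
print(df1)
print(df2)
```

For a real file, the list returned by file_in.readlines() can be passed in place of sample.splitlines().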

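【Discussion】:

【Solution 3】:

I wrote a first script based on the example you posted on SO. It is not very flexible: it assumes that the first three columns are filled with values, then aligns the two columns containing upper-case words and, where necessary, fills the four columns in between with NaN. The reason for filling with this value is that pandas functions such as .sum() or .mean() ignore it when computing a column's value.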

import re
import io 
import pandas as pd

#adapt this part to your needs    
#enforce to read 12 columns, N = number, S = string, F = f number
col_names = ["ID", "N1", "S1", "N2", "N3", "N4", "N5", "S2", "N6", "F1", "F2", "F3"]
#only import lines that start with these patterns
startline = ("s ", "k ")    
#number of columns between the two capitalised columns
max_col = 4                

#create temporary file like object to feed later to the csv reader
pan_wr = io.StringIO()

#opening file to read
for line in open("test.txt", "r"):
    #checking, if row should be ignored
    if line.startswith(startline):
        #find the text between the two capitalized columns
        col_betw = re.search(r"\s{2,}([A-Z]+.*)\s{2,}[A-Z]+\s{2,}", line).group(1)
        #determine, how many elements we have in this segment
        nr_col_betw = len(re.split(r"\s{2,}", col_betw.strip()))
        #test, if there are not enough numbers  
        if nr_col_betw <= max_col:
            #fill with NA, which is interpreted by pandas csv reader as NaN
            subst = col_betw + "   NA" * (max_col - nr_col_betw + 1) 
            line = line.replace(col_betw, subst, 1)
        #write into file like object the new line
        pan_wr.write(line)

#reset pointer for csv reader 
pan_wr.seek(0)

#csv reader creates data frame from file like object, splits at delimiter with more than one whitespace
#index_col: the first column is not treated as an index, names: name for columns
df = pd.read_csv(pan_wr, delimiter = r"\s{2,}", index_col = False, names = col_names, engine = "python")
print(df)

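This works well, but it cannot handle the .dd2 file you posted later. I'm currently testing a different approach for that. To be continued…

P.S.: I found conflicting information about using index_col = False with the csv reader. Some say you should now use index_col = None to prevent the first column from being converted into an index, but that did not work in my tests.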
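
As a small check of the point about NaN made at the start of this answer, the sketch below uses a made-up frame (the column names N4 and N5 are borrowed from col_names, the values are invented) to show that .sum() and .mean() skip NaN by default.

```python
import numpy as np
import pandas as pd

# invented values, with NaN standing in for the padded columns
demo = pd.DataFrame({"N4": [1.0, np.nan, 6.0], "N5": [np.nan, np.nan, 4.0]})

totals = demo.sum()    # NaN entries are skipped (skipna=True is the default)
means = demo.mean()    # the mean is taken over the non-NaN entries only
strict = demo["N4"].mean(skipna=False)  # NaN, because skipping is turned off

print(totals)
print(means)
print(strict)
```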

【Discussion】: