Split a txt file into two files using first column value in Python

【Question title】: Split a txt file into two files using first column value in Python
【Posted】: 2022-01-01 03:45:35
【Question】:

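I want to split the INPUT.txt file into two .txt files (header and data) based on the value in the first column. Lines before "H1000" should be saved to header.txt, and lines from "H1000" onward should be saved to data.txt.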

INPUT.txt

H0002   Version 78                                                                                                                      
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-81                                                                                                                       
H1000   State   WAAAA                                                                                                                       
H1002   Teno/Combno Z70/4000                                                                                                                        
H1003   Tener   Magn Reso NL    
H1004   LLD                                                                                     
D   AC056SCO1   NRM 11  12  6483516 25.98   0.4 1.35    0.25    0.51    0.01    0.06    0.1 56.23   2.29

The output files should be:

header.txt

H0002   Version 78                                                                                                                      
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-81

data.txt

H1000   State   WAAAA                                                                                                                       
H1002   Teno/Combno Z70/4000                                                                                                                        
H1003   Tener   Magn Reso NL    
H1004   LLD                                                                                     
D   AC056SCO1   NRM 11  12  6483516 25.98   0.4 1.35    0.25    0.51    0.01    0.06    0.1 56.23   2.29

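A few problems I am facing:

The position of "H1000" is dynamic across txt files. In another input file, the "H1000" position is different (see Input File 2). So my Python code first finds the position of H1000.
I use the position of H1000 to separate the header and data files. The logic does not work properly when separating the files.

My Python code: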

if path_txt.is_file():
        txt_files = [Path(path_txt)] 
    else:
        txt_files = list(Path(path_txt).glob("*.txt"))
    
    for fn in txt_files:
       with open(fn) as fd_read:
            for line in fd_read:
               h_value = line.split(maxsplit=1)[0]
               value = int(h_value[1:]) #Finding the position of H1000
                   
            splitLen = 5  # Position of H1000
            HeaderBase = 'Header.txt'  # Header.txt
            DataBase = 'Data.txt'  # Data.txt

            with open(fn, 'r') as fp:
                input_list = fp.readlines()
                # to skip empties: input_list = [l for l in fp if l.strip()]

            for i in range(0, len(input_list), splitLen):
                with open(HeaderBase, 'w') as fp:
                    fp.write(''.join(input_list[0:(i-1)])) #Header.txt
                with open(DataBase, 'w') as fp:
                    fp.write(''.join(input_list[i:]))   #Data.txt

None of my logic works. Any help is appreciated, as I am stuck on how to handle this.

Input File 2

H0002   Version 9                                                                                                                       
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-99                                                                                                                       
H0005   State   WAAAAA                                                                                                                      
H1000   Tene_no/Combined_rept_no    E79/38975                                                                                                                       
H1001   Tene_holder Magne Resources NL  
D   abc3SCO1    NORM    26  27  9483531 4.15    0.05    0.65    0.02    0.15    0   0.04    0.09    87.51   0.29

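The Python code and txt files are attached here

【Question comments】:

Your code has a lot of problems, but before fixing them I have a question: when a directory is passed, should the headers from all the .txt files go into the same header.txt? Same for data.txt.

Tags: python file-io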

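【Solution 1】:

Your code has a number of problems:

You never actually find the position of H1000. I don't see that written anywhere in the code.
You set the split length to 5, ignoring the position of H1000.
I don't understand your range() call. Are you jumping from start to end in steps of 5 lines?
For each step i, you write everything from the start of the document up to i into header.txt and the rest into data.txt. This means you write the whole document several times.
You convert path_txt to a Path object and then keep using it as if it were a string.

I don't know what to do when a directory is passed, because all the headers would end up in the same file and all the data in the same file, which I don't believe is what you want.

Fixed code for a single file: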
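
To illustrate the first two points, here is a minimal sketch that scans the lines once, records the index of the first line whose first column equals H1000, and slices the list a single time. The helper name `find_split_index` and the shortened sample lines are made up for illustration, and treating a file without the token as "all header" is an assumption.

```python
def find_split_index(lines, token="H1000"):
    """Return the index of the first line whose first column equals token, or None."""
    for i, line in enumerate(lines):
        fields = line.split(maxsplit=1)
        if fields and fields[0] == token:  # skip blank lines, compare first column
            return i
    return None


# Shortened sample lines, for illustration only
lines = [
    "H0002   Version 78\n",
    "H0003   Date_generated  5-Aug-81\n",
    "H1000   State   WAAAA\n",
    "D   AC056SCO1   NRM 11  12\n",
]

idx = find_split_index(lines)
if idx is None:
    idx = len(lines)  # assumption: no token means everything is header
header, data = lines[:idx], lines[idx:]
```

Because the index is looked up per file, the same code handles inputs where H1000 sits on a different line.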
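
One possible way around this, if separate outputs per input file are acceptable, is to derive the output names from each input file's name. This is only a sketch under that assumption: the `<stem>_header.txt` / `<stem>_data.txt` naming scheme and the helper names `collect_inputs` and `output_paths` are hypothetical, not something the question specifies.

```python
from pathlib import Path


def collect_inputs(path_txt):
    """Return [path] for a single file, or all *.txt files for a directory."""
    p = Path(path_txt)
    return [p] if p.is_file() else sorted(p.glob("*.txt"))


def output_paths(src, out_dir):
    """Return (header_path, data_path) named after the input file's stem."""
    src, out_dir = Path(src), Path(out_dir)
    return (out_dir / f"{src.stem}_header.txt", out_dir / f"{src.stem}_data.txt")
```

Writing the outputs to a different directory than the inputs avoids picking up the generated .txt files as inputs on the next run.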

SPLIT_TOKEN = "H1000"

def split_file(path, header_path="header.txt", data_path="data.txt"):
    """Split a file into a header and a data file upon encountering a token."""
    header = []
    data = []
    with open(path, "r") as f:
        for line in f:
            if line.startswith(SPLIT_TOKEN):
                data.append(line)  # The line with the token starts the data
                data.extend(f)     # Everything after it is data as well
                break
            header.append(line)    # Lines before the token are header lines

    with open(header_path, "w") as f:
        f.writelines(header)
    with open(data_path, "w") as f:
        f.writelines(data)
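
A variant of the function above, sketched here as an alternative rather than as part of the original answer, writes each line straight to its output file instead of collecting lists in memory. It also compares the whole first column to the token, so a line starting with something like H10001 would not trigger the split. The name `split_file_streaming` and the `token` parameter are made up for this sketch.

```python
def split_file_streaming(path, header_path="header.txt", data_path="data.txt",
                         token="H1000"):
    """Copy lines to header_path until the first column equals token, then to data_path."""
    with open(path) as src, \
            open(header_path, "w") as header, \
            open(data_path, "w") as data:
        out = header
        for line in src:
            fields = line.split(maxsplit=1)
            if out is header and fields and fields[0] == token:
                out = data  # switch once; every later line goes to the data file
            out.write(line)
```

If the token never appears, everything ends up in the header file and the data file is created empty.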

【Comments】: