【Question title】: Parse a text file, extracting values according to their index position
【Posted】: 2021-07-05 08:43:34
【Question】:

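Hi everyone, how are you? I hope you are well! How can I parse a text file by extracting specific values by their index position, append the values to a list, and then convert that list to a pandas DataFrame? So far I have been able to write the code further down. Sample text: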

Header: 0RCPF049100000084220210407
Body: 1927907801100032G 00sucesso
1067697546140032G 00sucesso
1053756666000032G 00sucesso
1321723368900032G 00sucesso
1037673956810032G 00sucesso

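For example, the first line is the header, and from it I only need the date, which sits at the following index positions: date_from_header = linhas[0][18:26]. The rest of the values are in the body.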

import csv
import pandas as pd

headers = ["data_mov", "chave_detalhe", "cpf_cliente", "cd_clube",
           "cd_operacao","filler","cd_retorno","tc_recusa"]

# This is the actual code
with open('RCPF0491.20210407.1609.txt', "r")as f:
  linhas = [linha.rstrip() for linha in f.readlines()]
  for i in range(0,len(linhas)):
     data_mov = linhas[0][18:26]
     chave_detalhe=linhas[1][0:1]
     cpf_cliente=linhas[1][1:12]
     cd_clube=linhas[1][12:16]
     cd_operacao=linhas[1][16:17]
     filler=linhas[1][17:40]
     cd_retorno=linhas[1][40:42]
     tx_recusa=linhas[1][42:100]
data = [data_mov, chave_detalhe, cpf_cliente, cd_clube, cd_operacao, filler, cd_retorno, tx_recusa]

The expected result looks like this:

data_mov chave_detalhe cpf_cliente cd_clube cd_operacao filler cd_retorno tx_recusa
'20210407' '1'         92790780110 '0032'   'G'        'blank space' '00'   'sucesso'
'20210407' '1'         92790780110 '0032'   'G'        'blank space' '00'   'sucesso'
'20210407' '1'         92790780110 '0032'   'G'        'blank space' '00'   'sucesso'

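【Comments】:

  • This question is a bit hard to follow. Could you post an example of filename.txt?
  • But already looking at your code: your for loop repeats the same thing over and over (reading lines 0 and 1 of filename.txt), because you never use the loop variable i inside the loop.
  • But I would expect your data to be csv or similar, and pandas has a function for reading that: read_csv. See: datacamp.com/community/tutorials/pandas-read-csv
  • @SamBob Thanks, I am trying to figure out how to loop over the file and extract all the values according to their index positions.
  • Ah, so you are trying to extract data_mov from the first line, and then "chave_detalhe", "cpf_cliente", "cd_clube", "cd_operacao", "filler", "cd_retorno", "tc_recusa" from every other line? Ignoring the first line for now, does stackoverflow.com/a/10851479/1581658 help with splitting the lines?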

Tags: python parsing etl


【Solution 1】:

Using stackoverflow.com/a/10851479/1581658:

def parse_file(filename):
    indices = [0,1,12,16,17,18,20] # list the indices to split on
    parsed_data = [] # returned array by line
    with open(filename) as f:
        header = next(f) #skip the header
        data_mov = header[18:26] # and get data_mov from header
        for line in f: #loop through lines
            #split each line by the indices
            parts = [data_mov] + [line.rstrip()[i:j] for i,j in zip(indices, indices[1:]+[None])]
            parsed_data.append(parts)
    return parsed_data

print(parse_file("filename.txt"))
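
The core of the answer above is the zip(indices, indices[1:]+[None]) pairing, which turns a list of start offsets into (start, stop) slice bounds. As a minimal sketch of just that step: split_at is a hypothetical helper name, and the sample string is modeled on a body line as it is displayed in the question (with the filler collapsed to one space), not on the real file.

```python
def split_at(line, indices):
    """Cut `line` into consecutive fields that start at each offset in `indices`."""
    # Pair every start offset with the next one; the last field gets None as its
    # stop, so it runs to the end of the line.
    bounds = zip(indices, indices[1:] + [None])
    return [line[start:stop] for start, stop in bounds]


indices = [0, 1, 12, 16, 17, 18, 20]
sample = "1927907801100032G 00sucesso"  # assumed sample, shaped like the question's body lines
fields = split_at(sample, indices)
print(fields)
```

Because each field ends exactly where the next one begins, this only works when the fields are contiguous; the offsets have to match the real column layout of the file.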

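【Comments】:

  • Thank you for your time. I made some adjustments and now it works fine! Best regards
【Solution 2】:

Thanks to SamBob for the help. In case anyone needs it, here is the final solution: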

import itertools
import pandas as pd

pd.options.display.width = 0

def parse_file(filename):
    indices=[0,1,12,16,17,18,42]  # list of indexes
    parsed_data = [] # return a list
    with open(filename) as f:
        header = next(f) 
        data_mov = header[18:26]
        # islice(f, 1, 100) skips the first line after the header and reads at most 99 lines
        for line in itertools.islice(f, 1, 100):
            # split the line according to the indices
            parts = [data_mov] + [line.rstrip()[i:j] for i,j in zip(indices, indices[1:]+[None])]
            parsed_data.append(parts)

    # convert to a DataFrame once, after all lines have been read
    cols = ['data_mov', 'chave_detalhe', 'cpf_cliente','cd_clube','cd_operacao','filler','cd_retorno','tx_recusa']
    df = pd.DataFrame(parsed_data, columns=cols)

    return df


df = parse_file("filename.txt")
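
A possible variation, sketched here under assumptions rather than taken from the thread: name each field with its own slice instead of keeping a bare list of offsets, so the layout reads like a record definition. The offsets are the ones the question lists for the real file (filler at 17:40, cd_retorno at 40:42, tx_recusa from 42). FIELDS, parse_records, and the padded sample text are hypothetical, and the sample assumes a 23-character filler.

```python
import io

# Field name -> slice, using the offsets given in the question for the real file.
FIELDS = {
    "chave_detalhe": slice(0, 1),
    "cpf_cliente": slice(1, 12),
    "cd_clube": slice(12, 16),
    "cd_operacao": slice(16, 17),
    "filler": slice(17, 40),
    "cd_retorno": slice(40, 42),
    "tx_recusa": slice(42, 100),
}


def parse_records(lines):
    """Return one dict per body line; the first line is treated as the header."""
    lines = iter(lines)
    header = next(lines, "")  # empty input simply yields no records
    data_mov = header[18:26]
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        record = {"data_mov": data_mov}
        for name, span in FIELDS.items():
            record[name] = line[span]
        records.append(record)
    return records


# Assumed sample: the question's values, padded out to the fixed-width layout.
pad = " " * 23
sample_text = "\n".join([
    "0RCPF049100000084220210407",
    "1927907801100032G" + pad + "00sucesso",
    "1067697546140032G" + pad + "00sucesso",
])
records = parse_records(io.StringIO(sample_text))
print(records[0])
```

The resulting list of dicts can be handed to pd.DataFrame(records) to get the same columns as above. pandas also provides read_fwf for fixed-width files, which may be worth checking before hand-rolling a parser.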
