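python - How to organize the information from multiple .txt files into one database on python pandas? Answers

【Question title】: How to organize the information from multiple .txt files into one database on python pandas?
【Posted】: 2021-01-13 14:04:40
【Question description】:

Linked above is a result file named "19031783_result.txt". Each .txt file contains statistical results that I want to organize into a database.

The database output for all result files should look like this:

So I have hundreds of result files that need to be merged into one database. The last three columns are derived from the defect-count limit of each bin: for example, Bin 1 has a limit of 10, Bin 2 a limit of 5, Bin 3 a limit of 3, and Bin 4 a limit of 0. "Perfect" means there are no defects, "good" means the counts are within spec, and "fail" means a limit is exceeded.

I don't have much experience with python and need guidance on how to create this database from the .txt files. Python is preferred because it can handle large amounts of data and is faster.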

import os
import pandas as pd
from glob import glob

stock_files = sorted(glob('*result.txt'))

stock_files

df = pd.concat([pd.read_csv(file, sep="\t").assign(filename = file) for file in stock_files], ignore_index = True)

df = pd.DataFrame() #this is the bit I am stuck on

This is my current output. I need to clean it up and turn it into the database shown in my Excel spreadsheet screenshot (2: https://i.stack.imgur.com/SebTl.png)

    Delaminated area fraction: 9.63722329310847E-06             filename  \
0   Bin1 Defect count with diameter between 1 µm a...  19031781_result.txt   
1   Bin2 Defect count with diameter between 76 µm ...  19031781_result.txt   
2   Bin3 Defect count with diameter between 301 µm...  19031781_result.txt   
3   Bin4 Defect count with diameter exceeding 1001...  19031781_result.txt   
4                                                 NaN  19031782_result.txt   
5                                                 NaN  19031782_result.txt   
6                                                 NaN  19031782_result.txt   
7                                                 NaN  19031782_result.txt   
8                                                 NaN  19031783_result.txt   
9                                                 NaN  19031783_result.txt   
10                                                NaN  19031783_result.txt   
11                                                NaN  19031783_result.txt

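【Comments】:

As far as I know, there is no single module that does this easily, so it is best done in several steps. Do you have any existing code that you are trying to get working?
I have added my starting code.

Tags: python python-3.x excel pandas dataframe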

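【Solution 1】:

Start simple - python may not be the right fit for this kind of basic data slurping and reshaping.

Use basic operating system commands to merge all the files into one file. I will demonstrate with bash and Windows CMD.

For bash - install WSL and your preferred Linux distribution. Use a bash script (almost a one-liner, if you don't make too many assumptions) to read the list of result files, cat each one, and prefix every line with the file name, storing the result in a single file "AllResultsFiles.txt"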

for fname in *_result.txt; do while IFS= read -r _line; do echo "$fname:$_line"; done < "$fname"; done > AllResultsFiles.txt

For windows - slightly more complicated, but it does the same thing. Store this in the file "C:\users\me\data\mergers.cmd":

@echo off
if exist AllResultsFilesWin.txt del /q AllResultsFilesWin.txt
FOR /F "tokens=* delims=" %%x in ('dir /b *_result.txt') DO ( 
           for /f "tokens=*" %%C in ('type %%x') do echo %%x:%%C
)  >> AllResultsFilesWin.txt

Then run it as follows

C:\users\me\data\> mergers.cmd

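This creates a single file containing the contents of all the files, delimited by colons (:) - every row carries the name of its original file in the first column

It can then easily be imported into a spreadsheet or a database with three columns, using the colon as the delimiter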

create table imported_results_statistics 
(Orig_filename varchar(100),
Metric varchar(200),
Value float -- float rather than int, because the delaminated area fraction is not a whole number
)
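
The same three-column import can also be sketched in pandas. This is only a rough illustration, assuming every merged line has the shape `filename:metric: value`, with the file name before the first colon and the numeric value after the last one. The helper name `load_merged` and the sample lines (including the bin count of 3) are made up for the example.

```python
import io
import pandas as pd

def load_merged(lines):
    """Split 'filename:metric: value' lines into three columns (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.strip()
        if line.count(":") < 2:
            continue  # skip blank or malformed lines
        orig_filename, rest = line.split(":", 1)  # first colon ends the file name
        metric, value = rest.rsplit(":", 1)       # last colon starts the value
        rows.append((orig_filename, metric.strip(), float(value)))
    return pd.DataFrame(rows, columns=["Orig_filename", "Metric", "Value"])

# Made-up sample in the assumed format; in practice pass open("AllResultsFiles.txt")
sample = io.StringIO(
    "19031781_result.txt:Delaminated area fraction: 9.63722329310847E-06\n"
    "19031781_result.txt:Bin1 Defect count with diameter between 1 um and 75 um: 3\n"
)
merged = load_merged(sample)
print(merged)
```

Splitting on the first and last colon, rather than on every colon, keeps the metric text intact even if a description happens to contain a colon of its own.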

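Once the data is imported into a database table, you can use SQL to build a new table - transposing each group of records per file name -

(sqlite is simple - but needs a few more steps)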

select orig_filename
     , max(case when substring(Metric,1,4) in ('Dela') then Value end) as percent_area
     , max(case when substring(Metric,1,4) in ('Bin1') then Value end) as Bin1
     , max(case when substring(Metric,1,4) in ('Bin2') then Value end) as Bin2
     , max(case when substring(Metric,1,4) in ('Bin3') then Value end) as Bin3
     , max(case when substring(metric,1,4) in ('Bin4') then Value end) as Bin4
  from imported_results_statistics
group by orig_filename
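
To illustrate the SQLite route, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes a table like the one above with a floating-point `Value` column; the sample rows and bin counts are invented, and `substr` is used because that is SQLite's long-standing spelling of `substring`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table imported_results_statistics "
    "(Orig_filename varchar(100), Metric varchar(200), Value float)"
)

# Invented sample rows for one file; real rows would come from the merged file
rows = [
    ("19031781_result.txt", "Delaminated area fraction", 9.63722329310847e-06),
    ("19031781_result.txt", "Bin1 Defect count", 3),
    ("19031781_result.txt", "Bin2 Defect count", 1),
    ("19031781_result.txt", "Bin3 Defect count", 0),
    ("19031781_result.txt", "Bin4 Defect count", 0),
]
conn.executemany("insert into imported_results_statistics values (?, ?, ?)", rows)

# Conditional aggregation: one output row per file, one column per metric
pivot_sql = """
select Orig_filename
     , max(case when substr(Metric,1,4) = 'Dela' then Value end) as percent_area
     , max(case when substr(Metric,1,4) = 'Bin1' then Value end) as Bin1
     , max(case when substr(Metric,1,4) = 'Bin2' then Value end) as Bin2
     , max(case when substr(Metric,1,4) = 'Bin3' then Value end) as Bin3
     , max(case when substr(Metric,1,4) = 'Bin4' then Value end) as Bin4
  from imported_results_statistics
 group by Orig_filename
"""
pivoted = conn.execute(pivot_sql).fetchall()
print(pivoted)
```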

If you have a more powerful database toolkit - you can get the condensed result as follows

with cte_allresults as (
select orig_filename
     , max(case when substring(Metric,1,4) in ('Dela') then Value end) as percent_area
     , max(case when substring(Metric,1,4) in ('Bin1') then Value end) as Bin1
     , max(case when substring(Metric,1,4) in ('Bin2') then Value end) as Bin2
     , max(case when substring(Metric,1,4) in ('Bin3') then Value end) as Bin3
     , max(case when substring(metric,1,4) in ('Bin4') then Value end) as Bin4
  from imported_results_statistics
group by orig_filename
) 
select orig_filename as scribe_no /* I assume this is meant to reflect the file name? */
     , percent_area
     , Bin1 as LessThan75UM     , Bin2 as From75To300UM
     , Bin3 as From300To1MM     , Bin4 as MoreThan1MM
     , case when Bin1 + Bin2 + Bin3 + Bin4 = 0 then 1 else 0 end as perfect /* no defects at all; apply your own specific ranges and rules here */
/*    ...  */
   from cte_allresults

For perfect/good/fail - you can define that in your python script
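
One possible way to define that in Python, sketched with the limits given in the question (10, 5, 3 and 0 for Bin 1 to Bin 4). The function name `classify_bins` is invented, and returning a single label in which "perfect" is not also counted as "good" is an assumption; adjust it to the actual spec.

```python
BIN_LIMITS = (10, 5, 3, 0)  # limits for Bin 1..Bin 4, as stated in the question

def classify_bins(counts, limits=BIN_LIMITS):
    """Return 'perfect', 'good' or 'fail' for one file's four bin counts."""
    if len(counts) != len(limits):
        raise ValueError("expected one count per bin limit")
    if all(c == 0 for c in counts):
        return "perfect"  # no defects at all
    if all(c <= lim for c, lim in zip(counts, limits)):
        return "good"     # some defects, but every bin is within its limit
    return "fail"         # at least one bin exceeds its limit

print(classify_bins([0, 0, 0, 0]))  # perfect
print(classify_bins([3, 1, 0, 0]))  # good
print(classify_bins([3, 1, 0, 1]))  # fail
```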

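As a SQL and shell script fanatic for longer than I can remember - and a recent learner of python - I can vouch that this approach will serve you faster. By all means feed the single merged file into your python script. If you are going to do this regularly, generate the bash / cmd script as a text variable in your Python script and run it with python subprocess.call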
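
A minimal sketch of that last suggestion, assuming `bash` is available on the PATH: a quoted variant of the merge loop is kept as a text variable and run from Python. It uses `subprocess.run`, the newer interface recommended over `subprocess.call`; the function name `merge_result_files` is invented for this example.

```python
import subprocess

# The bash merge loop, stored as a text variable
MERGE_SCRIPT = r'''
for fname in *_result.txt; do
    while IFS= read -r line; do
        echo "$fname:$line"
    done < "$fname"
done > AllResultsFiles.txt
'''

def merge_result_files(directory="."):
    """Run the merge loop inside `directory` (assumes bash is installed)."""
    # check=True raises CalledProcessError if the script exits with an error,
    # for example when no *_result.txt file exists in the directory
    subprocess.run(["bash", "-c", MERGE_SCRIPT], cwd=directory, check=True)
```

Calling `merge_result_files("path/to/results")` would then leave `AllResultsFiles.txt` in that directory, ready to be read by the rest of the script.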

【Comments】:

【Solution 2】:

Do you think this would solve the problem?

import os
import pandas as pd
import numpy as np
from glob import glob

stock_files = sorted(glob('*result.txt'))

# first, create an empty dataframe with the columns names you want
df = pd.DataFrame(columns = ['percent_area', 'less_than_75um', '75_to_300um', '301_to_1mm', 'more_than_1mm', 'perfect', 'good', 'fail'])
# create auxiliary functions to evaluate the quality of the bins
bin_limits = [10, 5, 3, 0]
check_perfect = lambda x: 1 if sum(x) == 0 else 0
check_good = lambda x: 1 if all(np.array(x) <= bin_limits) else 0
# a file fails as soon as any single bin exceeds its limit
check_fail = lambda x: 1 if any(np.array(x) > bin_limits) else 0

# loop through each file
for file in stock_files:
    # read each line of the file, keep the text after the last ':' and convert it to float
    # the 5 values (area fraction + 4 bin counts) are stored in the 'values' list
    # errors='replace' is an assumption so that an unknown encoding of 'µ' cannot break the read
    with open(file, encoding='utf-8', errors='replace') as fh:
        values = [float(line.rsplit(':', 1)[1]) for line in fh if ':' in line]
    
    # check the quality of the bins and store the 3 results in 'bins_quality' list variable
    bins_quality = [check_perfect(values[1::]), check_good(values[1::]), check_fail(values[1::])]

    # add a line to the DataFrame with the name of the file in the index and the above values that correspond to the defined columns
    df.loc[file] = values + bins_quality

【Comments】: