Combine/average multiple data files: answers

【Question title】: combine/average multiple data files
【Posted】: 2013-12-28 19:34:49
【Question description】:

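I have a set of data files (e.g., "data####.dat", where #### = 0001, ..., 9999) that all share the same structure: the same x values in the first column, followed by several columns with different y values.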

data0001.dat:

#A < comment line with unique identifier 'A'
#B1 < this is a comment line that can/should be dropped
1 11 21
2 12 22
3 13 23

data0002.dat:

#A < comment line with unique identifier 'A'
#B2 < this is a comment line that can/should be dropped
1 13 23
2 12 22
3 11 21

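They essentially stem from different runs of my program with different seeds, and I now want to combine these partial results into one common histogram, such that the comment line starting with "#A" (which is identical in all files) is kept and the other comment lines are dropped. The first column stays the same, and all other columns should be averaged over all data files:

dataComb.dat: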

#A < comment line with unique identifier 'A'
1 12 22 
2 12 22 
3 12 22

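where 12 = (11+13)/2 = (12+12)/2 = (13+11)/2 and 22 = (21+23)/2 = (22+22)/2 = (23+21)/2.

I already have a bash script (probably horrible code, but I am not very experienced...) that does this job when run as ./merge.sh data* > dataComb.dat on the command line. It also checks that all data files have the same number of columns and the same values in the first column.

merge.sh: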

#!/bin/bash

if [ $# -lt 2 ]; then
    echo "at least two files please"
    exit 1;
fi

i=1
for file in "$@"; do
    cols[$i]=$(awk '
BEGIN {cols=0}
$1 !~ /^#/ {
  if (cols==0) {cols=NF}
  else {
    if (cols!=NF) {cols=-1}
  }
}
END {print cols}
' ${file})
    i=$((${i}+1))
done

ncol=${cols[1]}
for i in ${cols[@]}; do
    if [ $i -ne $ncol ]; then
        echo "mismatch in the number of columns"
        exit 1
    fi
done

echo "#combined $# files"
grep "^#A" $1

paste "$@" | awk "
\$1 !~ /^#/ && NF>0 {
  flag=0
  x=\$1
  for (c=1; c<${ncol}; c++) { y[c]=0. }
  i=1
  while (i <= NF) {
    if (\$i==x) {
      for (c=1; c<${ncol}; c++) { y[c] += \$(i+c) }
      i+= ${ncol}
    } else { flag=1; i=NF+1; }
  }
  if (flag==0) {
    printf(\"%e \", x)
    for (c=1; c<${ncol}; c++) { printf(\"%e \", y[c]/$#) }
    printf(\"\n\")
  } else { printf(\"# x -coordinate mismatch\n\") }
}"

exit 0

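My problem is that it quickly becomes slow for a large number of data files, and at some point it throws a "Too many open files" error. I can see that simply pasting all the data files in one go (paste "$@") is the issue here, but doing it in batches and somehow introducing temporary files does not seem like an ideal solution either. I would appreciate any help making this more scalable while keeping the way the script is called, i.e. with all data files passed as command-line arguments.

I decided to also post this in the python section, since I am often told that it is very handy for this kind of problem. I have almost no experience with python, but maybe this is the occasion to finally start learning it ;)

【Question discussion】:

Tags: python bash

【Solution 1】: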

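The code attached below works in Python 3.3 and produces the desired output, with a few minor caveats:

- It takes the initial comment line from the first file it processes, but does not bother to check that the comment lines of all later files still match (i.e., if you have several files that start with #A and one that starts with #C, it will not reject the #C file, even though it probably should). I mainly wanted to illustrate how the merge would work in Python, and figured that adding this kind of miscellaneous validity check is best left as a "homework" problem.
- It also does not check that the numbers of rows and columns match, and it will probably crash if they do not. Consider that another small homework problem.
- It prints all columns to the right of the first one as floating-point values, since in some cases that is what they might be. The initial column is treated as a label or line number, and is therefore printed as an integer value.

You can call the code in almost the same way as before; for example, if you name the script file merge.py, you can run python merge.py data0001.dat data0002.dat and it will print the merged, averaged result to stdout, just as the bash script does. The code also has some added flexibility compared to one of the earlier answers: the way it is written, it should in principle (I have not actually tested this to make sure) be able to merge files with any number of columns, not only files that have exactly three columns. Another nice benefit: it does not keep files open after it is done with them; the with open(name, 'r') as infile: line is a Python idiom that automatically closes the file once the script has finished reading from it, even though close() is never called explicitly.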

#!/usr/bin/env python

import argparse
import re

# Give help description
parser = argparse.ArgumentParser(description='Merge some data files')
# Add to help description
parser.add_argument('fname', metavar='f', nargs='+',
                    help='Names of files to be merged')
# Parse the input arguments!
args = parser.parse_args()
argdct = vars(args)

topcomment=None
output = {}
# Loop over file names
for name in argdct['fname']:
    with open(name, "r") as infile:
        # Loop over lines in each file
        for line in infile:
            # Skip comment lines, except to take note of the first one that
            # matches "#A"
            if re.search('^#', line):
                if re.search('^#A', line) is not None and topcomment is None:
                    topcomment = line
                continue
            items = line.split()
            # Skip blank lines, which would otherwise raise an IndexError below
            if not items:
                continue
            # If a line matching this one has been encountered in a previous
            # file, add the column values
            currkey = float(items[0])
            if currkey in output.keys():
                for ii in range(len(output[currkey])):
                    output[currkey][ii] += float(items[ii+1])
            # Otherwise, add a new key to the output and create the columns
            else:
                output[currkey] = list(map(float, items[1:]))

# Print the comment line (only if some file actually contained one)
if topcomment is not None:
    print(topcomment, end='')
# Get total number of files for calculating average
nfile = len(argdct['fname'])              
# Sort the output keys
skey = sorted(output.keys())
# Loop through sorted keys and print each averaged column to stdout
for key in skey:
    outline = str(int(key))
    for item in output[key]:
        outline += ' ' + str(item/nfile)
    outline += '\n'
    print(outline, end='')
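
The two checks left as homework in the caveats above (matching "#A" lines, and matching row/column counts and x values) could be sketched roughly as follows. This is only an illustration, not part of the original answer: read_data_file and check_consistent are hypothetical helper names, and the sketch assumes each file is small enough to hold in memory.

```python
def read_data_file(path):
    """Return (header, rows): the first '#A' comment line (or None) and the numeric rows."""
    header = None
    rows = []
    with open(path, "r") as infile:
        for line in infile:
            if line.startswith("#"):
                # Keep only the first '#A' line; drop every other comment line
                if line.startswith("#A") and header is None:
                    header = line.rstrip("\n")
                continue
            items = line.split()
            if items:  # ignore blank lines
                rows.append([float(v) for v in items])
    return header, rows


def check_consistent(reference, candidate, name):
    """Raise ValueError if candidate differs from reference in header, shape, or x values."""
    ref_header, ref_rows = reference
    header, rows = candidate
    if header != ref_header:
        raise ValueError(f"{name}: '#A' comment line differs")
    if len(rows) != len(ref_rows):
        raise ValueError(f"{name}: mismatch in the number of rows")
    for ref_row, row in zip(ref_rows, rows):
        if len(row) != len(ref_row):
            raise ValueError(f"{name}: mismatch in the number of columns")
        if row[0] != ref_row[0]:
            raise ValueError(f"{name}: x-coordinate mismatch")
```

One way to use it would be to read the first file once as the reference and call check_consistent for every later file before adding its values to the running sums.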

【Discussion】:

【Solution 2】:

As a quick check on the number of file handles you have available and in use, try this (Linux):

cat /proc/sys/fs/file-nr

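This gives you (the number of allocated file handles) - (the number of allocated but unused file handles) - (the maximum number of file handles); see here.

The limit can be changed in sysctl.conf (on Linux; see the link above), but that is probably not a good idea from a resource-management standpoint, so it is not really scalable. And yes, as more and more handles are used to open each file (since they are not closed until the shell execution stops/finishes), things start to get slower, and eventually it fails when no more handles are available.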
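
Note that /proc/sys/fs/file-nr reports system-wide numbers, whereas a "Too many open files" error is usually caused by the per-process limit (the value ulimit -n shows in a shell). As a small sketch, that per-process limit can also be read from Python with the standard resource module, which is available on Unix-like systems only:

```python
import resource

# Per-process limit on open file descriptors: (soft limit, hard limit).
# The soft limit is the one a process actually hits first.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
```

If the number of data files passed to paste approaches the soft limit, the error in the question is to be expected.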

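One potential solution could combine Python/SciPy/Pandas with a simple database. There is great documentation and a large supporting community. An example closely related to your post is here. A small post on linking Pandas and a database is here.

I have not tried this myself, but I would give it a shot:

For the database, you could use something like the pandas io.sql module to create a useful representation of each dat file (perhaps using the #A header as an identifier for each table). The data could then be manipulated by any number of methods, for example by gluing the tables together. This does not keep the ./merge.sh data* > dataComb.dat functionality you asked for, but a simple Python command-line script could potentially handle all the steps needed to load and process the data the way you want.

I think this would involve quite a learning curve, but it could pay off in future scalability/flexibility.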
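
To make the database idea a bit more concrete, here is a minimal sketch. It deliberately swaps in the standard-library sqlite3 module instead of the pandas io.sql route mentioned above, so it is not what the answer literally proposes. The function name average_with_sqlite, the table name data, and the long-format layout (x, col, y) are all assumptions made for illustration.

```python
import sqlite3


def average_with_sqlite(paths):
    """Load every file into an in-memory table, then average per (x, column)."""
    conn = sqlite3.connect(":memory:")
    # Long format: one database row per (x value, column index, y value)
    conn.execute("CREATE TABLE data (x REAL, col INTEGER, y REAL)")
    for path in paths:
        with open(path, "r") as infile:  # only one data file is open at a time
            for line in infile:
                items = line.split()
                if not items or line.startswith("#"):
                    continue  # skip blank lines and comment lines
                x = float(items[0])
                conn.executemany(
                    "INSERT INTO data VALUES (?, ?, ?)",
                    [(x, c, float(v)) for c, v in enumerate(items[1:], start=1)],
                )
    query = "SELECT x, col, AVG(y) FROM data GROUP BY x, col ORDER BY x, col"
    averaged = {}
    for x, col, mean in conn.execute(query):
        averaged.setdefault(x, []).append(mean)
    conn.close()
    return averaged
```

For the two example files in the question this returns a dictionary mapping each x value to its list of averaged y values. Replacing ":memory:" with a file name would keep the table on disk, which may matter for very large inputs.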

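【Discussion】:

【Solution 3】:

You seem to be stumped by the fact that too many files are open at once. You already seem to know how to handle the rest of the processing (i.e. sorting the files by their unique ID and accessing the values contained in a single .dat file), so I will focus on this issue only.

When dealing with multiple sources, a common trick is to remember that you do not need all the values at once to compute an average. All you need is the sum and the number of values you added.

I am not familiar with awk syntax, so I will write this in pseudo-code.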

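1. Create a table sum matching your data structure. Say sum[x][y] holds the value at column x and row y. The table is initially filled with zeroes.
2. Set a counter n = 0.
3. Open the first file. I will skip the processing part, which you already seem to have handled, so let's say data contains the values you extracted. It is accessed in the same way as described for sum.
4. Add the values to your sum table: sum[x][y] += data[x][y] for each x and y value.
5. Close the file.
6. Increment the counter: n += 1.
7. Repeat steps 3 to 6 until all files have been processed.
8. Compute the average: sum[x][y] = sum[x][y] / n for each x and y value.
9. There you go! sum now contains the averages you are looking for.

This algorithm handles any number of files, and only one file is open at any given time.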
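
The steps above could be sketched in Python roughly like this. It is a minimal illustration rather than a finished tool: the function name average_files is an assumption, and the sketch stores the table row by row (one [x, y1, y2, ...] list per data line) and assumes the whole table fits in memory.

```python
def average_files(paths):
    """Average all y columns over the given files, with one file open at a time."""
    sums = []  # running-sum table: one [x, y1, y2, ...] row per data line
    n = 0      # number of files processed so far
    for path in paths:
        with open(path, "r") as infile:  # the file is closed when this block ends
            data = [[float(v) for v in line.split()]
                    for line in infile
                    if line.strip() and not line.startswith("#")]
        if n == 0:
            sums = data  # the first file initializes the table
        else:
            if len(data) != len(sums):
                raise ValueError(f"{path}: mismatch in the number of rows")
            for sum_row, row in zip(sums, data):
                if len(row) != len(sum_row):
                    raise ValueError(f"{path}: mismatch in the number of columns")
                if row[0] != sum_row[0]:
                    raise ValueError(f"{path}: x-coordinate mismatch")
                for c in range(1, len(row)):
                    sum_row[c] += row[c]  # add the y values; x stays untouched
        n += 1
    # Divide every y sum by the number of files; keep x as it is
    return [[row[0]] + [v / n for v in row[1:]] for row in sums]
```

Called from a script as average_files(sys.argv[1:]), this would keep the calling convention from the question (all data files as command-line arguments) without ever holding more than one file open.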

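【Discussion】:

【Solution 4】:

You can try this code. The main idea is to read the files one after another and update a dictionary object with the count and the sum of the second and third values for each x value. Good luck!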

# First, get the paths of all the dat files:

import os
dat_dir = r'c:\dat_dir'
our_files = [os.path.join(dat_dir, f) for f in os.listdir(dat_dir)]

# Then iterate over them and update a dictionary with the results for each file:

dict_x_values = {}
for f in our_files:
    with open(f, 'r') as fopen:
        for line in fopen:
            # Comment lines are not numeric, so skip them before converting
            if line.startswith('#'):
                continue
            split = [float(v) for v in line.split()]
            if len(split) == 3:
                key = split[0]
                if key in dict_x_values:
                    second_count, second_sum = dict_x_values[key][0]  # the second value in the row
                    second_count += 1  # increment the count
                    second_sum += split[1]  # increment the sum
                    third_count, third_sum = dict_x_values[key][1]  # the third value in the row
                    third_count += 1
                    third_sum += split[2]
                    dict_x_values[key] = [[second_count, second_sum], [third_count, third_sum]]
                else:
                    # If the dictionary doesn't have this x value yet, initialize it
                    dict_x_values[key] = [[1, split[1]], [1, split[2]]]

# Then write the combined output file:

with open('comb_dat.txt', 'w') as comb_open:
    for key in sorted(dict_x_values):
        second_count, second_sum = dict_x_values[key][0]  # the second value in the row
        third_count, third_sum = dict_x_values[key][1]  # the third value in the row
        second_avg = second_sum / second_count
        third_avg = third_sum / third_count
        comb_open.write('%s\t%s\t%s\n' % (key, second_avg, third_avg))

【Discussion】: