AWK: simple math operations on multi-column data with subsequent data conversion

【Question title】: AWK: simple math operations on multi-column data with subsequent data conversion
【Posted】: 2021-07-09 09:25:04
【Question】:

As part of my Bash routine, I am working with a directory that contains many folders named according to the following pattern:

1000_cne_lig1, 1000_cne_lig2, 1000_cne_lig3, 1000_cne_lig4, 1000_cne_lig5  ... 1000_cne_ligN
2000_cne_lig1, 2000_cne_lig2, 2000_cne_lig3, 2000_cne_lig4, 2000_cne_lig5  ... 2000_cne_ligN
3000_cne_lig1, 3000_cne_lig2, 3000_cne_lig3, 3000_cne_lig4, 3000_cne_lig5  ... 3000_cne_ligN
7000_cne_lig1, 7000_cne_lig2, 7000_cne_lig3, 7000_cne_lig4, 7000_cne_lig5  ... 7000_cne_ligN
...
xxxx_cne_lig1, xxxx_cne_lig2, xxxx_cne_lig3, xxxx_cne_lig4, xxxx_cne_lig5  ... xxxx_cne_ligN

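Note that all folders can be grouped into X classes (1000, 2000 ... xxxx) by system name, which is the part of each folder's name that comes before the first underscore "_". Each folder contains a CSV file with data arranged in multi-line format: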

ID, POP, dG
1, 40, -5.7600
2, 2, -5.4000
3, 8, -5.3300

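I need to loop over the folders belonging to each system (e.g. the 5 folders of system 1000, then the 5 folders of system 2000, etc.) and detect the CSV file in each one. Then, for each CSV log of a given system, I need to do some simple math: calculate the mean of the negative numbers in the third column of the CSV and save it in a new file named after the system (e.g. 1000.csv), with one line per folder containing the folder name and the mean value. For example, for system 1000, 1000.csv should be: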

# system 1000; dG(mean)
lig1: -5.555
lig2: -6.003
lig3: -3.031
lig4: -3.222
lig5: -10.300
ligN: -NN.NNN

Note that I removed 1000_cne_ from each line (the original folder name) and put the system name into the header of the CSV instead.

In the end, for X systems the script should produce X new CSV files (1000.csv, 2000.csv, XXXX.csv, etc.), each containing N lines, one per folder.
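The per-file calculation described above can be sketched on its own before it is wired into the folder loop. This is a minimal sketch, assuming the three sample rows shown earlier and the default whitespace field splitting, under which the third field is the dG value; the label lig1 is only illustrative.

```shell
#!/bin/sh
# Sample rows copied from the question; NR > 1 skips the "ID, POP, dG" header.
mean=$(printf '%s\n' 'ID, POP, dG' '1, 40, -5.7600' '2, 2, -5.4000' '3, 8, -5.3300' |
    awk 'NR > 1 { s += $3; ++n }                  # sum the third column, count rows
         END    { if (n) printf "%.2f\n", s / n } # guard against an empty file')
echo "lig1: $mean"
```

The `if (n)` guard avoids a division by zero when a CSV contains only its header line.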

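Here is my current implementation of the bash routine. It already sorts the folders into systems; the rest should be done by AWK, which does all the math and writes the computed means to the new CSV: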

#!/bin/bash
home=$PWD
# folder with the folders to analyse
storage="${home}"/results
# folder with the outputs
rescore="${home}"/rescore 
# pattern to recognize csv file for analysis
csv_pattern='*_filt3b.csv'



# this will iteratively do something on the group of the folders belonged to one syst
for folder in "${storage}"/*; do
# this is the name of each folder
folder_name=$(basename "$folder")
# detect the name of the system (X) determined by 4 characters near the first _ >> this is the name of output.csv
syst_name=$(basename "$folder" | cut -d'_' -f 1)
# detect the name of the sample (N) the last entry after the last _ >> the name of the lines in new CSV
sample_name=$(basename "$folder" | cut -d'_' -f 3)
pushd ${folder}
# apply AWK on each CSV to calculate MEAN and store it in new output.csv :
    awk 'FNR==1 {
   if (n)
      mean[suffix] = s/n
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/^.*_/, "", suffix)
   s=n=0
}
FNR > 1 {
   s+=$3
   ++n
}
END {
   mean[suffix] = s/n
   print "# system", prefix, "; dG(mean)"
   for (i in mean)
      print i ":", mean[i]
}' "${folder}"/*filt3b.csv >> ${rescore}/${syst_name}.csv  
    popd
done

This gives me the following output.csv (for ten processed CSV files of system 1000):

# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -6.44
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -4.59
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -4.96
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -5.17
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -4.73
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -5.04
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -6.625
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -2
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -5.34
# system /Users/gleb/Desktop/scripts/analys ; dG(mean)
filt3b.csv: -8.14

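This shows a mismatch between the BASH and AWK parts: in the output, filt3b.csv should be replaced by the relevant part of the name of the folder that contains this filt3b.csv (lig1, etc.). In addition, the path /Users/gleb/Desktop/scripts/analys should not be there; it should be replaced by the system name (e.g. 1000, which always appears in the bottom line of the final output). Finally, the mean values should be given in -X.XX format, e.g. -4.59 (not -X or -X.XXX).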
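One likely cause of this output is that awk receives an absolute path in FILENAME, so the two `sub()` calls operate on the whole path rather than on the folder name. The sketch below illustrates this with a hypothetical path (`/home/user/my_project/...` and `run_filt3b.csv` are assumptions, not the asker's real names), and then shows the same substitutions applied to the parent folder name only, with `printf "%.2f"` for the requested number format.

```shell
#!/bin/sh
# Hypothetical absolute path standing in for what awk sees in FILENAME.
path='/home/user/my_project/results/1000_cne_lig1/run_filt3b.csv'

out=$(awk -v f="$path" 'BEGIN {
    # The two substitutions from the script, applied to the full path
    prefix = suffix = f
    sub(/_.*/, "", prefix)     # cuts at the first "_" anywhere in the path
    sub(/^.*_/, "", suffix)    # greedy match keeps only the text after the last "_"
    print "prefix=" prefix
    print "suffix=" suffix

    # The same substitutions applied to the parent folder name only
    n = split(f, parts, "/")
    syst = sample = parts[n - 1]
    sub(/_.*/, "", syst)
    sub(/^.*_/, "", sample)
    printf "%s %s: %.2f\n", syst, sample, -6.3825
}')
printf '%s\n' "$out"
```

With this path the first pair yields a truncated directory and `filt3b.csv`, which matches the shape of the output above, while the second pair yields `1000` and `lig1`.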

Update: I found a possible way to fix the header of output.csv, by writing it with ECHO before the AWK processing in the same script:

home=$PWD
# folder with the results
storage="${home}"/results
tmp="${home}"/tmp
rescore="${home}"/rescore

# csv for rescoring
csv_pattern='*_filt3b.csv'
    
if [ -d "${rescore}" ]; then
  rm -rf "${rescore}"
  mkdir "${rescore}"
  else
  mkdir "${rescore}"
fi


for folder in "${storage}"/*; do
# this is the name of each folder
folder_name=$(basename "$folder")
# detect the name of the system (X) determined by 4 characters near the first _ >> this is the name of output.csv
syst_name=$(basename "$folder" | cut -d'_' -f 1)
# detect the name of the sample (N) the last entry after the last _ >> the name of the lines in new CSV
sample_name=$(basename "$folder" | cut -d'_' -f 3)
pushd $folder
# a simple example of output format w/o calculations
if [ ! -f ${rescore}/${syst_name}.csv ]; then
# add the header to the CSV contained name of the system
echo "${syst_name}: dG(rescored)" > ${rescore}/${syst_name}.csv
fi
# apply AWK on each CSV to calculate mean for the numbers in third column 
awk 'FNR==1 {
   if (n)
      mean[suffix] = s/n
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/^.*_/, "", suffix)
   s=n=0
}
FNR > 1 {
   s+=$3
   ++n
}
END {
   mean[suffix] = s/n
   #print "# system", prefix, "; dG(mean)"
   for (i in mean)
      print i ":", mean[i]
}' ${folder}/${csv_pattern} >> ${rescore}/${syst_name}.csv
popd
done

For 10 folders (named like 1000_cne_ligN) it produces this 1000.csv:

1000: dG(rescored)
filt3b.csv: -6.3825
filt3b.csv: -4.455
filt3b.csv: -5.28
filt3b.csv: -5.76
filt3b.csv: -5.52
filt3b.csv: -3.92
filt3b.csv: -7.505
filt3b.csv: -1.8
filt3b.csv: -5.79
filt3b.csv: -5.61

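【Comments】:

So you are stuck writing the awk program? I don't see an actual question in your "question".
So the whole question is about awk, since the bash part is irrelevant. In that case I suggest you tag your question awk only, define what kind of calculation you want to do, and then post the awk program you have so far.
Silly question: if you find awk tricky to use (it isn't a language everyone likes), why not use another language you are more comfortable with?
Please don't squeeze the whole awk program into one line; it is hard to read. Use multiple lines and format it properly. For an awk program this complex, I would put it into a separate file anyway.
This is actually the AWK solution given in the answer to this thread; I have just modified my post. In my case it is important to keep it inside the bash script :-)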

Tags: bash awk

【Solution 1】:

You can use this single awk script, which is also POSIX compliant:

awk 'FNR==1 {if (n) mean[suffix] = s/n; prefix=suffix=FILENAME; sub(/_.*/, "", prefix); sub(/^.*_/, "", suffix); s=n=0} FNR > 1 {s+=$3; ++n} END {mean[suffix] = s/n; print "# system", prefix, "; dG(mean)"; for (i in mean) print i ":", mean[i]}' 1000_*

# system 1000 ; dG(mean)
lig1: -5.49667
lig2: -6.76333

Based on your edited question, here is the complete shell script:

#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore 
# folder with the folders to analyse
cd "${home}"/results
# pattern to recognize csv file for analysis
csv_pattern='*_filt3b.csv'

while read -r d; do
awk '
FNR==1 {
   if (n)
      mean[suffix] = s/n
   prefix=suffix=FILENAME
   sub(/_.*/, "", prefix)
   sub(/\/[^\/]+$/, "", suffix)
   sub(/^.*_/, "", suffix)
   s=n=0
}
FNR > 1 {
   s += $3
   ++n
}
END {
   mean[suffix] = s/n
   print "# system", prefix, "; dG(mean)"
   for (i in mean)
      printf "%s: %.2f\n", i, mean[i]
}' "${d}_"*/$csv_pattern | sort > "$rescore/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
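To try this approach without touching real data, the following self-contained sketch builds a throwaway tree and runs one awk call per system. It is a POSIX sh variant written under stated assumptions, not the answer's exact script: the file name `run_filt3b.csv` and the values for lig2 and system 2000 are hypothetical, the system list comes from a glob instead of `find`, and `LC_ALL=C` is added to `sort` so that the `#` header line sorts first regardless of locale.

```shell
#!/bin/sh
# Throwaway tree: folder names follow the question; the file names and the
# lig2 / 2000 values are hypothetical sample data.
work=$(mktemp -d)
mkdir -p "$work/rescore" "$work/results/1000_cne_lig1" \
         "$work/results/1000_cne_lig2" "$work/results/2000_cne_lig1"
printf '%s\n' 'ID, POP, dG' '1, 40, -5.7600' '2, 2, -5.4000' '3, 8, -5.3300' \
    > "$work/results/1000_cne_lig1/run_filt3b.csv"
printf '%s\n' 'ID, POP, dG' '1, 10, -7.0000' '2, 5, -6.5000' \
    > "$work/results/1000_cne_lig2/run_filt3b.csv"
printf '%s\n' 'ID, POP, dG' '1, 3, -4.2500' \
    > "$work/results/2000_cne_lig1/run_filt3b.csv"

cd "$work/results" || exit 1

# List each system name once, then run one awk call per system.
for dir in *_*_*/; do
    printf '%s\n' "${dir%%_*}"
done | sort -u | while read -r syst; do
    awk '
    FNR == 1 {
        if (n) mean[sample] = s / n      # store the mean of the previous file
        prefix = sample = FILENAME
        sub(/_.*/, "", prefix)           # system name, e.g. "1000"
        sub(/\/[^\/]+$/, "", sample)     # drop the "/run_filt3b.csv" part
        sub(/^.*_/, "", sample)          # sample name, e.g. "lig1"
        s = n = 0
    }
    FNR > 1 { s += $3; ++n }
    END {
        if (n) mean[sample] = s / n
        print "# system", prefix, "; dG(mean)"
        for (i in mean) printf "%s: %.2f\n", i, mean[i]
    }' "${syst}_"*/*_filt3b.csv | LC_ALL=C sort > "$work/rescore/$syst.csv"
done

out1000=$(cat "$work/rescore/1000.csv")
out2000=$(cat "$work/rescore/2000.csv")
printf '%s\n' "$out1000" "$out2000"
cd / && rm -rf "$work"
```

Because `for (i in mean)` visits array elements in an unspecified order, the pipe through `sort` is what keeps the lig lines in a stable order.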

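【Comments】:

I have edited my first post with an example (at the end)...
Sorry, I just edited the script in the first post. It is actually csv_pattern='*_filt3b.csv'
OK, check my updated answer. Keep in mind that it runs from the "${home}"/results directory, because I used cd "${home}"/results
May I ask you to post a new question, since this is already a fairly complex bash+awk script? Please clarify the requirements in that question with sample data so that we can help you quickly.
anubhara, I created a thread on Code Review, since it is mostly about improving the written algorithm >> codereview.stackexchange.com/questions/260466/… Thanks a lot for your help! Cheers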