在具有不同输出文件linux的多个文件上运行数组答案

【问题标题】：Running an array on multiple files with different output files linux在具有不同输出文件linux的多个文件上运行数组
【发布时间】：2020-08-18 09:20:54
【问题描述】：

我想将每个大约有 2e9 行的 8 个文件（每个代表一个染色体）分解成大约 5 个 4e8 行的块。这些是 VCF 文件 (https://en.wikipedia.org/wiki/Variant_Call_Format)，它们有一个标题，然后是遗传变异，所以我需要保留每个标题并将它们重新附加到特定于染色体的标题。我在 Linux 上的 HPC 上执行此操作。

在使用之前我已经用一个文件完成了这个：

#grab the header
head -n 10000 my.vcf | grep "^#" >header
#grab the non header lines
grep -v "^#" my.vcf >variants
#split into chunks with 40000000 lines
split -l 40000000 variants
#reattach the header to each and clean up
for i in x*;do cat header $i >$i.vcf && rm -f $i;done
rm -f header variants

我可以使用所有 8 条染色体手动执行此操作，但我在具有数组功能的 HPC 中工作，并且觉得使用 for 循环可以更好地完成此操作，但是语法对我来说有点混乱。

我试过了：

#filelist is a list of the 8 chromosome files i.e. chr001.vcf, chr002.vcf...chr0008.vcf 
for f in 'cat filelist.txt'; head -n 10000 my.vcf | grep "^#" >header; done

这会将所有内容放在同一个标题中。我如何将输出放入每个染色体的唯一标题中？同样，这将如何拆分变体并将标头重新附加到每个染色体的每个块？

期望的输出是：

chr001_chunk1.vcf
chr001_chunk2.vcf
chr001_chunk3.vcf
chr001_chunk4.vcf
chr001_chunk5.vcf
...
chr008_chunk5.vcf

每个 vcf 块都具有来自其各自染色体“父级”的标头。

非常感谢

【问题讨论】：

原始循环每次都会覆盖header，需要一个驱动至少部分输出文件名的变量。请确认或修改：对于每个.vfc文件：私下提取头信息，从同一个.vcf文件中提取非头数据，拆分成40m行的几个部分文件；然后将私有头和每个部分文件组合起来，命名为chunk#。还有你的壳是什么？回复后我可以发布一个脚本。
@Milag 非常感谢您花时间回复。您的工作流程是正确的，是的。外壳是 CentOS 7
好的。 CentOS 是一个 Linux 发行版；你在运行bash 来解释脚本吗？
正确的是。使用 bash 通过 HPC 运行脚本。
发布了初始脚本。让我知道你的结果。

标签： arrays bash shell loops vcf-variant-call-format

【解决方案1】：

#!/bin/bash

#
# scan the current directory for chr[0-9]*.vcf
# extract header lines (^#)
# extract variants (non-header lines) and split to 40m partial files
# combine header with each partial file
#

# for tuning
lines=40000000

vcf_list=(chr[0-9]*.vcf)
if [ ${#vcf_list} -eq 0 ]; then
    echo no .vcf files
    exit 1
fi

tmpv=variants
hdr=header

for chrfile in "${vcf_list[@]}"; do
    # isolate without . extn
    base=${chrfile%%.*}
    echo $chrfile

    # extract header lines
    head -1000 $chrfile | grep "^#" > $hdr

    # extract variants
    grep -v "^#" $chrfile > $tmpv

    #
    # split variants into files with max $lines;
    # output files are created with a filter to combine header data and
    # partial variant data in 1 pass, avoiding additional file I/O;
    # output files are named with a leading 'p' to support multiple
    # runs without filename collision
    #
    split -d -l $lines $tmpv p${base}_chunk --additional-suffix=.vcf \
        --filter="cat $hdr - > \$FILE; echo \"  \$FILE\""
done

rm -f $tmpv $hdr

exit 0

【讨论】：