Merging output files (chromosomal chunks) in Nextflow: answers

【Question title】: output files (chromosomal chunks) merging in nextflow
【Posted】: 2021-08-07 23:41:16
【Question description】:

I have a Nextflow process that emits multiple chunks per chromosome into a channel, for example from imputation. The output looks like this:

chr1.imputed.chunk1.gen.gz chr1.imputed.chunk2.gen.gz chr1.imputed.chunk3.gen.gz 
chr1.imputed.chunk1.stats chr1.imputed.chunk2.stats chr1.imputed.chunk3.stats
chr1.imputed.chunk1.bgen chr1.imputed.chunk2.bgen chr1.imputed.chunk3.bgen
.....

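There are many chunks per chromosome (22 chromosomes). How can I efficiently merge them so that each chromosome ends up with one merged file per file type, like this: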

chr1.imputed.merged.gen.gz
chr1.imputed.merged.stats
chr1.imputed.merged.bgen

Once I have the merged outputs, I would like to delete all the chunks. Any help?

The actual code that generates these chunks is:

process imputation {
    publishDir params.out, mode:'copy'

    input:
    tuple val(chrom),val(chunk_array),val(chunk_start),val(chunk_end),path(in_haps),path(refs),path(maps) from imp_ch

    output:
    tuple val("${chrom}"),path("${chrom}.*") into imputed

    script:
    // unpack the staged input files by position
    def (haps,sample)=in_haps
    def (haplotype, legend, samples)=refs
    """
    impute4.1.2_r300.3 -g "${haps}" -h "${haplotype}" -l "${legend}" -m "${maps}" -o "${chrom}.step10.imputed.chunk${chunk_array}" -no_maf_align -o_gz -int "${chunk_start}" "${chunk_end}" -Ne 20000 -buffer 1000 -seed 54321

    # run qctool only when the imputed chunk is not empty
    if [[ \$(gunzip -c "${chrom}.step10.imputed.chunk${chunk_array}.gen.gz" | head -c1 | wc -c) == "0" ]]
    then
        echo "${chrom}.step10.imputed.chunk${chunk_array}.gen.gz" is empty
    else
        qctool_v2.0.8_rhel -g "${chrom}.step10.imputed.chunk${chunk_array}.gen.gz" -snp-stats -osnp "${chrom}.step10.imputed.chunk${chunk_array}.snp.stats"
        qctool_v2.0.8_rhel -g "${chrom}.step10.imputed.chunk${chunk_array}.gen.gz" -og "${chrom}.step10.imputed.chunk${chunk_array}.bgen" -os "${chrom}.step10.imputed.chunk${chunk_array}.sample"
    fi
    """
}

【Question discussion】:

Tags: nextflow

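【Solution 1】:

Could you post the actual code that generates the snippet you showed?

Without seeing your code, I suggest you try this: http://nextflow-io.github.io/patterns/index.html#_process_per_file_range

【Discussion】:

Hi, thanks. The link you shared does not help in this particular case, because the files are the output of a process. However, I have added the actual code that generates these chunks. I hope it clarifies the question. Thanks again.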

【Solution 2】:

You have this:

output:
tuple val("${chrom}"),path("${chrom}.*") into imputed

With that output channel declaration, you will probably have to do something like this in the downstream process:

input:
tuple val(name), path(chr_files) from imputed

script:  
gen_files = chr_files.findAll { it.toString().endsWith('.gen.gz') }.sort()
stat_files = chr_files.findAll { it.toString().endsWith('.stats') }.sort()
"""
# try with echo first to see if you get what you want
echo ${gen_files.join(' ')} > ${name}_gen_fileList.txt
echo ${stat_files.join(' ')} > ${name}_stat_fileList.txt
"""

Once you have confirmed that the echo above prints what you expect, you can do the rest of the work in that process block.
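One thing worth checking at the echo stage is chunk order: a plain lexicographic sort puts chunk10 before chunk2, which matters if the merged file should follow the chunk sequence. The Python sketch below is only a model of the filtering and ordering logic, not Nextflow code. The file names follow the pattern in the question's imputation process, and `SUFFIXES`, `chunk_number`, and `group_by_type` are hypothetical helper names introduced for illustration.

```python
import re
from collections import defaultdict

# Suffixes taken from the imputation process in the question (assumption:
# these are the only file types you want to merge).
SUFFIXES = (".gen.gz", ".snp.stats", ".bgen", ".sample")


def chunk_number(name):
    """Return the integer that follows 'chunk' in a file name, or -1 if absent."""
    match = re.search(r"chunk(\d+)", name)
    return int(match.group(1)) if match else -1


def group_by_type(file_names):
    """Map each known suffix to its files, ordered by numeric chunk index."""
    groups = defaultdict(list)
    for name in file_names:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                groups[suffix].append(name)
                break
    return {suffix: sorted(names, key=chunk_number)
            for suffix, names in groups.items()}


files = [
    "chr1.step10.imputed.chunk10.gen.gz",
    "chr1.step10.imputed.chunk2.gen.gz",
    "chr1.step10.imputed.chunk1.gen.gz",
    "chr1.step10.imputed.chunk2.snp.stats",
    "chr1.step10.imputed.chunk1.snp.stats",
]

grouped = group_by_type(files)
for suffix, names in grouped.items():
    print(suffix, names)
```

Sorting on the parsed chunk index rather than on the raw name keeps chunk2 ahead of chunk10. In the Groovy snippet above, the equivalent would be a `sort` with a closure that extracts the same number.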

【Discussion】:

Thanks @user10101904. I only get the last chunk in the output *txt files. I tried echo ${gen_files.join(' ')} >> ${name}_gen_fileList.txt, but the output is the same. I also get an error: WARN: failed to publish file. The other processes work fine and give no such warning.
I slightly modified the input declaration to tuple val("${chrom}"),path("${chrom}.*") into imputed.groupTuple().collect{chrom, files -> [ chrom, files.collect{it.string()}.join(' ')]}, which gives a new error: input tuple does not match input set cardinality declared by process 'merging'. Any help?

【Solution 3】:

Apparently the following lines of code solved the problem.

imputed.into{impute_bgen;impute_gen;impute_sample;impute_stat}
bgens=impute_bgen.groupTuple().transpose().map{chrom,bfiles -> tuple(chrom,bfiles[0])}.groupTuple()
gens=impute_gen.groupTuple().transpose().map{chrom,bfiles -> tuple(chrom,bfiles[1])}.groupTuple()
samples=impute_sample.groupTuple().transpose().map{chrom,bfiles -> tuple(chrom,bfiles[2])}.groupTuple()
stats=impute_stat.groupTuple().transpose().map{chrom,bfiles -> tuple(chrom,bfiles[3])}.groupTuple()
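To make the data flow in those lines easier to follow, here is a rough Python model of the operator chain. It is a sketch under assumptions, not Nextflow's implementation: `group_tuple` and `transpose` only imitate the behaviour of `groupTuple()` and `transpose()` as I read it, `chunk_files` and `select_by_index` are hypothetical helpers, the file names are shortened, and the per-chunk file order (bgen, gen.gz, sample, snp.stats) is inferred from the indices 0 to 3 used in the answer.

```python
from collections import defaultdict


def group_tuple(items):
    """Rough analogue of groupTuple(): collect the values that share a key."""
    grouped = defaultdict(list)
    for key, value in items:
        grouped[key].append(value)
    return list(grouped.items())


def transpose(items):
    """Rough analogue of transpose(): emit one (key, value) per grouped value."""
    return [(key, value) for key, values in items for value in values]


def chunk_files(chrom, chunk):
    """Hypothetical per-chunk output list, in the order assumed by the answer."""
    return [f"{chrom}.chunk{chunk}.{ext}"
            for ext in ("bgen", "gen.gz", "sample", "snp.stats")]


def select_by_index(items, index):
    """Pick one file per chunk by position, then regroup by chromosome."""
    per_chunk = transpose(group_tuple(items))
    return group_tuple((chrom, files[index]) for chrom, files in per_chunk)


# One emission per chunk, as produced by the imputation process.
imputed = [
    ("chr1", chunk_files("chr1", 1)),
    ("chr1", chunk_files("chr1", 2)),
    ("chr2", chunk_files("chr2", 1)),
]

bgens = select_by_index(imputed, 0)
stats = select_by_index(imputed, 3)
print(bgens)
print(stats)
```

In this model the first `group_tuple` followed by `transpose` returns the original per-chunk items, so the effective work is done by the `map` and the final grouping. Picking files by position also assumes every chunk emits the same four files in the same order. A chunk that takes the "is empty" branch of the question's script skips both qctool calls and would produce a shorter list, so matching on the file suffix may be more robust.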

【Discussion】: