Subprocess: Can't convert '_io.BufferedReader' object to str implicitly

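【Question title】: Subprocess: Can't convert '_io.BufferedReader' object to str implicitly
【Posted】: 2017-04-10 16:01:14
【Question】:

I am writing a script that combines Snakemake and Python code to automate the processing of a large number of files that come in pairs. More precisely, I am aligning paired-end reads with BWA MEM (http://bio-bwa.sourceforge.net/bwa.shtml). In the first part of the script, I iterate over a list of file names (they are bzip2-compressed fastq files) and then sort them in a list. Here is a quick look at some of the files: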

['NG-8653_1A_lib95899_4332_7_1', 'NG-8653_1A_lib95899_4332_7_2', 'NG-8653_1B_lib95900_4332_7_1', 'NG-8653_1B_lib95900_4332_7_2', 'NG-8653_1N_lib95898_4332_7_1', 'NG-8653_1N_lib95898_4332_7_2']

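As you can see, the reads are sorted in pairs (1A_..._1 and 1A_..._2, and so on). Now, using subprocess, I want to decompress them with bunzip2 and pass them via stdin to bwa mem to align them. The bwa mem command converts the fastq files to .sam files, which I then have to convert to .bam format with samtools. The script so far is as follows: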

import re, os, subprocess, bz2

WDIR = "/home/alaa/Documents/snakemake"
workdir: WDIR
SAMPLESDIR = "/home/alaa/Documents/snakemake/fastq/"
REF = "/home/alaa/Documents/inputs/reference/hg19_ref_genome.fa"

FILE_FASTQ = glob_wildcards("fastq/{samples}.fastq.bz2")
LIST_FILE_SAMPLES = []

for x in FILE_FASTQ[0]:
    LIST_FILE_SAMPLES.append(x)

LIST_FILE_SAMPLES = sorted(LIST_FILE_SAMPLES)
print(LIST_FILE_SAMPLES)

rule fastq_to_bam:
    run:
        for x in range(0, len(LIST_FILE_SAMPLES), 2):
            # get the name of the sample (1A, 1B ...)
            samp = ""
            samp += LIST_FILE_SAMPLES[x].split("_")[1]

            # get the corresponding read (1 or 2)
            r1 = SAMPLESDIR + LIST_FILE_SAMPLES[x] + ".fastq.bz2"
            r2 = SAMPLESDIR + LIST_FILE_SAMPLES[x+1] + ".fastq.bz2"

            # decompress the files with bunzip2 and pipe the output
            p1 = subprocess.Popen(['bunzip2', '-kc', r1], stdout=subprocess.PIPE)
            p2 = subprocess.Popen(['bunzip2', '-kc', r2], stdout=subprocess.PIPE)           


            # now write the output file to .bam format after aligning them
            with open("sam/" + samp + ".bam", "w") as stdout:
                fastq2sam = subprocess.Popen(["bwa", "mem", "-T 1", REF, p1.stdout, p2.stdout], stdout=subprocess.PIPE)
                fastq2samOutput = subprocess.Popen(["samtools", "view", "-Sb", "-"], shell = True, stdin=fastq2sam.stdout, stdout=stdout)

I tried to debug the script by running it line by line. Writing the bunzip2 output to a file works fine. But as soon as I try to pipe it, I get an error:

Error in job fastq_to_bam while creating output file .
RuleException:
TypeError in line 39 of /home/alaa/Documents/snakemake/Snakefile:
Can't convert '_io.BufferedReader' object to str implicitly
  File "/home/alaa/Documents/snakemake/Snakefile", line 39, in __rule_fastq_to_bam
  File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
  File "/usr/lib/python3.5/subprocess.py", line 1490, in _execute_child
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
 Exiting because a job execution failed. Look above for error message
 Will exit after finishing currently running jobs.
 Exiting because a job execution failed. Look above for error message

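Can you tell me what is wrong with the script? I have been looking for the problem since this morning and can't seem to figure it out. Any help is greatly appreciated. Thanks in advance.

EDIT 1:

After reading more feedback from @bli and @Johannes, I have come up with this: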

import re, os, subprocess, bz2, multiprocessing
from os.path import join
from contextlib import closing

WDIR = "/home/alaa/Documents/snakemake"
workdir: WDIR
SAMPLESDIR = "fastq/"
REF = "/home/alaa/Documents/inputs/reference/hg19_ref_genome.fa"


FILE_FASTQ = glob_wildcards("fastq/{samples, NG-8653_\d+[a-zA-Z]+_.+}")
LIST_FILE_SAMPLES = []

for x in FILE_FASTQ[0]:
    LIST_FILE_SAMPLES.append("_".join(x.split("_")[0:5]))

LIST_FILE_SAMPLES = sorted(LIST_FILE_SAMPLES)
print(LIST_FILE_SAMPLES)


rule final:
    input:
        expand('bam/' + '{sample}.bam', sample = LIST_FILE_SAMPLES)

rule bunzip_fastq:
    input:
        r1 = SAMPLESDIR + '{sample}_1.fastq.bz2',
        r2 = SAMPLESDIR + '{sample}_2.fastq.bz2'
    output:
        o1 = SAMPLESDIR + '{sample}_r1.fastq.gz',
        o2 = SAMPLESDIR + '{sample}_r2.fastq.gz'
    shell:
        """
        bunzip2 -kc < {input.r1} | gzip -c > {output.o1}
        bunzip2 -kc < {input.r2} | gzip -c > {output.o2}
        """

rule fastq_to_bam:
    input:
        r1 = SAMPLESDIR + '{sample}_r1.fastq.gz',
        r2 = SAMPLESDIR + '{sample}_r2.fastq.gz',
        ref = REF
    output:
        'bam/' + '{sample}.bam'
    shell:
        """
        bwa mem {input.ref} {input.r1} {input.r2} | samtools view -b - > {output}
        """

Thank you very much for your help! I think I can take it from here.

Best regards, Alaa

【Comments】:

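I agree with the remark Johannes Köster makes in his answer. You might consider a separate rule for the bunzipping, where you can use a "shell" section instead of running things manually with subprocess. Then use the output of that rule as the input of the mapping rule (and remove the loop and use wildcards instead).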
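Replacing the loop with wildcards requires one name per sample rather than one name per read file. A minimal sketch of how such a list could be derived in plain Python, assuming every read file name ends in `_1` or `_2` as in the listing from the question (the helper name `sample_prefixes` is hypothetical):

```python
def sample_prefixes(read_names):
    """Return sorted unique names with the trailing _1/_2 read index removed."""
    prefixes = set()
    for name in read_names:
        prefix, sep, read_index = name.rpartition("_")
        # Only keep names that really end in a read index of 1 or 2.
        if sep and read_index in ("1", "2"):
            prefixes.add(prefix)
    return sorted(prefixes)


# File names taken from the question (without the .fastq.bz2 extension).
reads = [
    "NG-8653_1A_lib95899_4332_7_1",
    "NG-8653_1A_lib95899_4332_7_2",
    "NG-8653_1B_lib95900_4332_7_1",
    "NG-8653_1B_lib95900_4332_7_2",
    "NG-8653_1N_lib95898_4332_7_1",
    "NG-8653_1N_lib95898_4332_7_2",
]
samples = sample_prefixes(reads)
print(samples)
```

A list like `samples` could then be handed to `expand(...)` in a target rule, so that each sample appears only once.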

Tags: python subprocess bioinformatics snakemake

【Answer 1】:

Your problem is here:

["bwa", "mem", "-T 1", REF, p1.stdout, p2.stdout]

p1.stdout and p2.stdout are of type BufferedReader, but subprocess.Popen expects a list of strings. What you probably want to use is, e.g., p1.stdout.read().
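The difference between a command-line argument and a stream can be sketched as follows. This is only an illustration under stated assumptions: the Python interpreter stands in for the real tools (`bunzip2`, `samtools`) so that the example is self-contained, and the function names are hypothetical. Since bwa mem takes file names for the reads, this shows the general pattern (connecting one process's stdout to the next process's stdin) rather than a drop-in fix for the two-file case.

```python
import subprocess
import sys

# Stand-in for a producer such as `bunzip2 -kc file`: writes to stdout.
producer_cmd = [sys.executable, "-c", "print('ACGT')"]
# Stand-in for a consumer such as `samtools view`: reads stdin, writes stdout.
consumer_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().lower())",
]


def run_with_pipe_in_args():
    """Reproduce the error: a pipe object is placed in the argument list."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    try:
        subprocess.Popen(consumer_cmd + [producer.stdout])
    except TypeError as exc:
        return type(exc).__name__
    finally:
        producer.communicate()  # drain the pipe and reap the child
    return None


def run_with_pipe_as_stdin():
    """Connect the producer's stdout to the consumer's stdin instead."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    producer.stdout.close()  # the consumer now holds the read end of the pipe
    output, _ = consumer.communicate()
    producer.wait()
    return output.decode().strip()


print(run_with_pipe_in_args())
print(run_with_pipe_as_stdin())
```

The first function fails for the same reason as the script in the question: every element of the argument list has to be a string (or bytes or a path), not a file object.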

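Note, however, that your approach is not the idiomatic way to use Snakemake; in fact, nothing in the script currently takes advantage of Snakemake's features.

With Snakemake, you want a rule that processes a single sample with bwa mem, taking the fastq files as input and storing the bam as output. See this example in the official Snakemake tutorial. It does exactly what you are trying to accomplish here, but with far less boilerplate. Just let Snakemake do the job and don't try to reimplement it yourself.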
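For orientation, the command that such a per-sample rule would run can be written out as a single shell pipeline. The sketch below only builds the command string in Python, so it can be inspected without bwa or samtools installed. It uses bash process substitution (`<(...)`) to decompress the reads on the fly; that technique is an assumption and does not come from the answer (EDIT 1 in the question uses a separate decompression rule instead). The helper name `bwa_pair_command` and the example paths are hypothetical.

```python
import shlex


def bwa_pair_command(ref, r1_bz2, r2_bz2, out_bam):
    """Build a bash command aligning one bzip2-compressed read pair to BAM."""
    q = shlex.quote  # protect paths containing spaces or shell metacharacters
    return (
        f"bwa mem {q(ref)} "
        f"<(bunzip2 -kc {q(r1_bz2)}) <(bunzip2 -kc {q(r2_bz2)}) "
        f"| samtools view -b - > {q(out_bam)}"
    )


cmd = bwa_pair_command(
    "ref.fa", "fastq/s_1.fastq.bz2", "fastq/s_2.fastq.bz2", "bam/s.bam"
)
print(cmd)
```

Running the string requires bash because of `<(...)`, for example via `subprocess.run(["bash", "-c", cmd], check=True)`.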

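【Comments】:

Thanks for the reply, Johannes. Your comments were very helpful. What I posted is only the beginning of the script. I fully understand how Snakemake works, since I have further rules downstream! Thanks again for your help. I am still working on the script and will post my solution once it is done.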