How to aggregate values over a bigger-than-RAM gzip'ed CSV file? Answers

【Question title】: How to aggregate values over a bigger than RAM gzip'ed csv file?
【Posted】: 2016-11-10 13:15:27
【Question description】:

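For starters, I am new to bioinformatics and especially to programming, but I have built a script that goes through a so-called VCF file (only the individuals are included, one column = one individual) and uses a search string to find out, for every variant (line), whether the individual is homozygous or heterozygous.

This script works, at least on small subsets, but I know it stores everything in memory. I would like to do this on very large zipped files (even whole genomes), but I do not know how to transform this script into one that does everything line by line (because I want to count through whole columns, I just do not see how to solve that).

So the output is 5 things per individual (total variants, number of homozygotes, number of heterozygotes, and the proportions of homozygotes and heterozygotes). Please see the code below: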

#!/usr/bin/env python
import re
import gzip

subset_cols = 'subset_cols_chr18.vcf.gz'
#nuc_div = 'nuc_div_chr18.txt'

gz_infile = gzip.GzipFile(subset_cols, "r")  
#gz_outfile = gzip.GzipFile(nuc_div, "w") 

# make a dictionary of the header line for easy retrieval of elements later on

headers = gz_infile.readline().rstrip().split('\t')             
print headers                                                   

column_dict = {}                                        
for header in headers:
        column_dict[header] = []                        
for line in gz_infile:                                     
        columns = line.rstrip().split('\t')             
        for i in range(len(columns)):                   
                c_header=headers[i]                     
                column_dict[c_header].append(columns[i])
#print column_dict

for key in column_dict:                         
        number_homozygotes = 0          
        number_heterozygotes = 0        

        for values in column_dict[key]: 
                SearchStr = '(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'     
#this search string contains the regexp (this regexp was tested)
                Result = re.search(SearchStr,values)                    
                if Result:
#here, it will skip the missing genotypes ./.
                        variant_one = int(Result.group(1))              
                        variant_two = int(Result.group(2))              

                        if variant_one == 0 and variant_two == 0:
                                continue
                        elif variant_one == variant_two:                  
#count +1 in case variant one and two are equal (so 0/0, 1/1, etc.)
                                number_homozygotes += 1
                        elif variant_one != variant_two:
#count +1 in case variant one is not equal to variant two (so 1/0, 0/1, etc.)
                                number_heterozygotes += 1

        print "%s homozygotes %s" % (number_homozygotes, key) 
        print "%s heterozygotes %s" % (number_heterozygotes,key)

        variants = number_homozygotes + number_heterozygotes
        print "%s variants" % variants

        prop_homozygotes = (1.0*number_homozygotes/variants)*100
        prop_heterozygotes = (1.0*number_heterozygotes/variants)*100

        print "%s %% homozygous %s" % (prop_homozygotes, key)
        print "%s %% heterozygous %s" % (prop_heterozygotes, key)

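Any help will be much appreciated so I can go on with investigating large datasets, thank you :)

By the way, the VCF file looks something like this:

Individual_1    Individual_2    Individual_3
0/0:9,0:9:24:0,24,221   1/0:5,4:9:25:25,0,26   1/1:0,13:13:33:347,33,0

So there is the header line with the individual ID names (I have 33 individuals in total, with more complicated ID tags; I simplified them here), followed by a lot of these information lines with the same specific pattern. I am only interested in the first part with the slash, hence the regular expression.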

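【Question discussion】:

It would help if you could edit the question to include an example of the top of the VCF file along with the expected results.
What version of Python are you using?
Basically, what you want is to incrementally decompress a csv file and yield rows.
An example of the VCF file would be helpful. TIA.
I added simplified data from the VCF file, but looking at the code below I see you already figured it out yourself :) Normally each line also has general information (also in columns, before the individual columns), but I will filter those out before applying this script.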

Tags: python csv gzip bioinformatics vcf-variant-call-format

【Solution 1】:

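Disclosure: I work full-time on the Hail project.

Hi there! Welcome to programming and bioinformatics!

amirouche correctly identifies that you need some sort of "streaming" or "line-by-line" algorithm to handle data that is too big to fit in the RAM of your machine. Unfortunately, if you are limited to Python without libraries, you have to chunk the file manually and handle the parsing of the VCF yourself.

The Hail project is a free, open-source tool for scientists whose genetic data ranges from too big to fit in RAM all the way up to too big to fit on one machine (i.e. tens of terabytes of compressed VCF data). Hail can take advantage of all the cores on one machine or all the cores on a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail exposes a statistical genetics domain-specific language that makes your question much shorter to express.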
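To give a sense of what parsing by hand involves, here is a minimal Python 3 sketch that classifies a single per-sample field such as `1/0:5,4:9:25:25,0,26`. The function name `classify_genotype` and the category labels are made up for illustration, and the sketch assumes unphased, diploid `a/b` genotypes like those in the question's example; it is not a full VCF parser.

```python
def classify_genotype(field):
    """Classify one per-sample VCF field by its genotype (the part before the first ':').

    Returns 'hom_ref' for 0/0, 'hom_var' for two equal non-reference alleles
    (1/1, 2/2, ...), 'het' for two different alleles (0/1, 1/0, 1/2, ...),
    and None for missing or unparseable genotypes such as './.'.
    """
    genotype = field.split(':', 1)[0]
    try:
        # Fails with ValueError on '.', on non-numeric alleles,
        # and when there are not exactly two alleles.
        first, second = (int(allele) for allele in genotype.split('/'))
    except ValueError:
        return None
    if first != second:
        return 'het'
    return 'hom_ref' if first == 0 else 'hom_var'
```

Keeping `hom_ref` and `hom_var` apart leaves it to the caller whether `0/0` counts as homozygous: the question's script skips it, while the Hail expressions below include it.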

The Simplest Answer

The most faithful translation of your Python code to Hail is this:

/path/to/hail importvcf -f YOUR_FILE.vcf.gz \
  annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
  annotatesamples expr -c \
    'sa.pHom =  sa.nHom / sa.nCalled,
     sa.pHet =  sa.nHet / sa.nCalled' \
  exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv

I ran the above command on my dual-core laptop on a 2.0GB file:

# ls -alh profile225.vcf.bgz
-rw-r--r--  1 dking  1594166068   2.0G Aug 25 15:43 profile225.vcf.bgz
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
  annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
  annotatesamples expr -c \
    'sa.pHom =  sa.nHom / sa.nCalled,
     sa.pHet =  sa.nHet / sa.nCalled' \
  exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:=======================================================> (63 + 2) / 65]hail: info: Coerced sorted dataset
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 1:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom =  sa.nHom / sa.nCalled,
     sa.pHet =  sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: while importing:
    file:/Users/dking/projects/hail-data/profile225.vcf.bgz  import clean
hail: info: timing:
  importvcf: 34.211s
  annotatesamples expr: 6m52.4s
  annotatesamples expr: 21.399ms
  exportsamples: 121.786ms
  total: 7m26.8s
# head sampleInfo.tsv 
sample  pHomRef pHet    nHom    nHet    nCalled
HG00096 9.49219e-01 5.07815e-02 212325  11359   223684
HG00097 9.28419e-01 7.15807e-02 214035  16502   230537
HG00099 9.27182e-01 7.28184e-02 211619  16620   228239
HG00100 9.19605e-01 8.03948e-02 214554  18757   233311
HG00101 9.28714e-01 7.12865e-02 214283  16448   230731
HG00102 9.24274e-01 7.57260e-02 212095  17377   229472
HG00103 9.36543e-01 6.34566e-02 209944  14225   224169
HG00105 9.29944e-01 7.00564e-02 214153  16133   230286
HG00106 9.25831e-01 7.41687e-02 213805  17128   230933

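Wow! Seven minutes for 2GB, that's slow! Unfortunately, this is because VCF is not a great format for data analysis!

Optimizing the Storage Format

Let's convert to Hail's optimized storage format, a VDS, and re-run the command: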

# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:========================================================>(64 + 1) / 65]hail: info: Coerced sorted dataset
hail: info: running: write -o profile225.vds
[Stage 1:>                                                         (0 + 4) / 65]
[Stage 1:========================================================>(64 + 1) / 65]
# ../hail/build/install/hail/bin/hail read -i profile225.vds \
       annotatesamples expr -c \
         'sa.nCalled = gs.filter(g => g.isCalled).count(),
          sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
          sa.nHet = gs.filter(g => g.isHet).count()' \
       annotatesamples expr -c \
         'sa.pHom =  sa.nHom / sa.nCalled,
          sa.pHet =  sa.nHet / sa.nCalled' \
       exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 1:>                                                          (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>              (3 + 1) / 4]hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
         sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
         sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom =  sa.nHom / sa.nCalled,
         sa.pHet =  sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: timing:
  read: 2.969s
  annotatesamples expr: 1m20.4s
  annotatesamples expr: 21.868ms
  exportsamples: 151.829ms
  total: 1m23.5s

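About five times faster! With regard to larger scale, running the same command on the Google cloud on a VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, about 315GB compressed) took 3m42s using 328 worker cores.

Using a Hail Built-in

Hail also has a sampleqc command which computes most of what you want (and more!):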

../hail/build/install/hail/bin/hail  read -i profile225.vds \
      sampleqc \
      annotatesamples expr -c \
        'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
         sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled' \
      exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 0:>                                                          (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>              (3 + 1) / 4]hail: info: running: sampleqc
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
         sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: timing:
  read: 2.928s
  sampleqc: 1m27.0s
  annotatesamples expr: 229.653ms
  exportsamples: 353.942ms
  total: 1m30.5s

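Installing Hail

Installing Hail is pretty easy, and we have docs to help you. Need more help? You can get real-time support in the Hail users chat room or, if you prefer forums, the Hail discourse (both are linked from the home page; unfortunately I don't have enough reputation to create real links).

The Near Future

In the very near future (less than one month from today), the Hail team will complete a Python API which will allow you to express the first snippet as: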

result = (
    importvcf("YOUR_FILE.vcf.gz")
    .annotatesamples('''sa.nCalled = gs.filter(g => g.isCalled).count(),
                        sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
                        sa.nHet = gs.filter(g => g.isHet).count()''')
    .annotatesamples('''sa.pHom =  sa.nHom / sa.nCalled,
                        sa.pHet =  sa.nHet / sa.nCalled''')
)

for x in result.sampleannotations:
    print("Sample " + str(x) +
          " nCalled " + str(x.nCalled) +
          " nHom " + str(x.nHom) +
          " nHet " + str(x.nHet) +
          " percent Hom " + str(x.pHom * 100) +
          " percent Het " + str(x.pHet * 100))

result.sampleannotations.write("sampleInfo.tsv")

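Edit: added the output of head on the tsv file.

EDIT2: the latest Hail doesn't need biallelic data for sampleqc

EDIT3: note about scaling to the cloud with hundreds of cores

【Discussion】:

Wow, thanks a lot for the detailed help! The language is still a bit overwhelming for me (I am very new to all this and had never heard of Hail). I will see whether I can use this, thank you!
Sorry, I accidentally hit enter. If you need any help installing Hail or understanding the language, head over to the chat room I linked above. I think you'll find the questions you want to ask are easier to express than in Python.

【Solution 2】:

To be able to process a bigger-than-RAM dataset, you need to rework your algorithm to process the data line by line; right now you are processing it column by column.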
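The key observation is that the totals for each column can be updated as each line arrives, so only a few integers per individual need to stay in memory. A minimal sketch under that assumption follows; `count_genotypes` and its return layout are illustrative names, not part of any library, and the genotype pattern is simplified to the leading `a/b` part of each field. Like the question's script, it leaves `0/0` out of the homozygote count.

```python
import re

# Leading "a/b" genotype of a field such as "1/0:5,4:9:25:25,0,26".
GENOTYPE = re.compile(r'(\d+)/(\d+)')


def count_genotypes(rows):
    """Consume an iterable of rows (lists of strings), header row first.

    Returns {sample_name: [n_homozygous, n_heterozygous]} while holding
    only one row in memory at a time.
    """
    rows = iter(rows)
    headers = next(rows)
    counts = {name: [0, 0] for name in headers}
    for row in rows:
        for name, field in zip(headers, row):
            match = GENOTYPE.match(field)
            if match is None:
                continue  # missing genotype such as "./."
            first, second = int(match.group(1)), int(match.group(2))
            if first == 0 and second == 0:
                continue  # homozygous reference, skipped as in the question
            counts[name][0 if first == second else 1] += 1
    return counts
```

Any iterable of rows works as input, for example `(line.rstrip('\n').split('\t') for line in gzip.open(path, 'rt'))`, and the proportions follow from the two counts once the loop has finished.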

But before that, you need a way to stream the rows from the gzip'ed file.

The following Python 3 code does that:

#!/usr/bin/env python3
"""https://stackoverflow.com/a/40548567/140837"""
import zlib
from mmap import PAGESIZE


CHUNKSIZE = PAGESIZE


# This is a generator that yields *decompressed* chunks from
# a gzip file. This is also called a stream or lazy list.
# It's done like so to avoid loading the whole file into memory.
# Read more about Python generators to understand how it works.
# cf. `yield` keyword.
def gzip_to_chunks(filename):
    decompressor = zlib.decompressobj(zlib.MAX_WBITS + 16)
    with open(filename, 'rb') as f:
        chunk = f.read(CHUNKSIZE)

        while chunk:
            out = decompressor.decompress(chunk)
            yield out
            chunk = f.read(CHUNKSIZE)

        out = decompressor.flush()

        yield out


# Again the following is a generator (see the `yield` keyword).
# What it does is iterate over an *iterable* of byte strings and yield
# rows from the file

# (hint: `gzip_to_chunks(filename)` returns a generator of byte strings)
# (hint: a generator is also an iterable)

# You can verify that by calling `chunks_to_rows` with a list of
# byte strings, where every string is a chunk of the VCF file.
# (hint: a list is also an iterable)

# inline doc follows
def chunks_to_rows(chunks):
    row = b''  # we will add the chars making a single row to this variable
    for chunk in chunks:  # iterate over the strings/chunks yielded by gzip_to_chunks
        for char in chunk:  # iterate over all chars from the string
            if char == b'\n'[0]:  # hey! this is the end of the row!
                yield row.decode('utf8').split('\t')  # the row is complete, yield!
                row = b''  # start a new row
            else:
                row += int.to_bytes(char, 1, byteorder='big')  # Otherwise we are in the middle of the row
        # at this point the program has read all the chunk
    # at this point the program has read all the file without loading it fully in memory at once
    # That said, there's maybe still something in row
    if row:
        yield row.decode('utf-8').split('\t')  # yield the very last row if any


for e in chunks_to_rows(gzip_to_chunks('conceptnet-assertions-5.6.0.csv.gz')):
    uid, relation, start, end, metadata = e
    print(start, relation, end)
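As the comments in the code suggest, the row-splitting step can be checked without any gzip file by feeding it a plain list of byte chunks. The sketch below uses a buffer-based variant with the hypothetical name `split_rows` (it is not the `chunks_to_rows` function above). It splits each chunk on newlines instead of walking byte by byte, and it should yield the same rows.

```python
def split_rows(chunks):
    """Yield rows (lists of tab-separated fields) from an iterable of bytes chunks."""
    pending = b''  # bytes of a row that is not complete yet
    for chunk in chunks:
        # Everything before the last newline is complete; the tail is carried over.
        *complete, pending = (pending + chunk).split(b'\n')
        for line in complete:
            yield line.decode('utf-8').split('\t')
    if pending:
        # The file did not end with a newline: emit the last row.
        yield pending.decode('utf-8').split('\t')


# A row may be cut anywhere by a chunk boundary; the rows must still come out whole.
chunks = [b'a\tb\n1/', b'0\t0/1\n', b'1/1\t./.']
rows = list(split_rows(chunks))
```

Here the second row is split across the first two chunks and the last row has no trailing newline, which are the two cases most likely to go wrong.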

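Edit: reworked the answer and made it work on ConceptNet's gzip'ed tsv file

【Discussion】:

This really looks like something I can wrap my head around (it does not look too complicated, and I am still a beginner), so thanks a lot!! Why do you use that specific chunk size? Does that mean I get output per chunk size?
CHUNKSIZE is an arbitrary number; actually it is better to match mmap.PAGESIZE, and I will update my answer with that. It is the number of chars read from the hard disk at once. The OS uses mmap.PAGESIZE to read files... e.g. for some_file.read(1) the operating system will read (and cache) mmap.PAGESIZE chars from the hard disk and return only 1 char. Since the OS has cached part of the file, the following call to some_file.read(1) will not hit the hard disk. Anyway, this is a detail.
@visse226 thanks for reminding me that you are a beginner, I will update the code with more comments. Let me know what you think.
I will add some test cases for each function. Mind that this is non-trivial code for a beginner, because you have never encountered generators before. Generators are very much needed to handle this kind of problem (bigger than RAM) properly.
Don't forget to upvote and/or mark the answer that best solves your problem ;)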