[Question title]: GenomicRanges add coverage
[Posted]: 2017-02-22 19:10:46
[Question]:

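I am working with RNA-seq data and trying to plot average coverage profiles by genotype, similar to what is done here:

RNA-seq coverage per genotype (source: Pickrell et al., Nature, 2010)

To draw this plot, I have files from 100 individuals that contain coverage information from RNA-seq data (in a specific region), and I read them into R as GenomicRanges objects.

This gives me GRanges objects like the ones obtained in the following toy example: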

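library(GenomicRanges)

# Individual 1: ranges on sequence "1", each with a coverage value
gr1 <- GRanges(seqnames="1", ranges=IRanges(start=c(1,5,10,15,30,55), end=c(4,9,14,29,39,60)))
gr1$cov <- c(3,1,8,6,2,10)

# Individual 2
gr2 <- GRanges(seqnames="1", ranges=IRanges(start=c(3,20,24), end=c(7,23,26)))
gr2$cov <- c(3,5,3)

# Sorted unique start positions across both objects
start <- unique(sort(c(start(gr1), start(gr2))))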

gr1

GRanges object with 6 ranges and 1 metadata column:
seqnames    ranges strand |       cov
   <Rle> <IRanges>  <Rle> | <numeric>
       1  [ 1,  4]      * |         3
       1  [ 5,  9]      * |         1
       1  [10, 14]      * |         8
       1  [15, 29]      * |         6
       1  [30, 39]      * |         2
       1  [55, 60]      * |        10 
        -------
 seqinfo: 1 sequence from an unspecified genome; no seqlengths

gr2

GRanges object with 3 ranges and 1 metadata column:
seqnames    ranges strand |       cov
   <Rle> <IRanges>  <Rle> | <numeric>
       1  [ 3,  7]      * |         3
       1  [20, 23]      * |         5
       1  [24, 26]      * |         3
       -------
 seqinfo: 1 sequence from an unspecified genome; no seqlengths

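The problem is that I have one of these objects per individual (gr1 and gr2 would be 2 different individuals), and I would like to combine them into a single GenomicRanges object that gives the total coverage at each position across individuals 1 and 2, as shown below: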

gr3

GRanges object with 11 ranges and 1 metadata column:
seqnames    ranges strand |       cov
   <Rle> <IRanges>  <Rle> | <numeric>
       1  [ 1,  2]      * |         3
       1  [ 3,  4]      * |         6 (=3+3)
       1  [ 5,  7]      * |         4 (=1+3)
       1  [ 8,  9]      * |         1
       1  [10, 14]      * |         8
       1  [15, 19]      * |         6
       1  [20, 23]      * |         11 (=6+5)
       1  [24, 26]      * |         9 (=6+3)
       1  [27, 29]      * |         6
       1  [30, 39]      * |         2
       1  [55, 60]      * |        10 

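Does anyone know a simple way to do this? Or am I doomed?

Thanks for your answers.

PS: My data are not stranded, but if you have a solution that also handles stranded data, even better.

PPS: Ideally, I would also like to be able to multiply the coverages, or apply any function of two arguments x and y, rather than simply adding them.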

[Comments]:

    Tags: r


    [Solution 1]:

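    It has been almost a year, but here is my answer for future reference.

    Whenever I cannot find a function that performs a task like this directly, I simply expand the GRanges objects to single-bp resolution. This lets me perform any operation I want on the metadata columns, treating them as plain data.frame columns, because the IRanges now match between the two GRanges objects.

    For the specific case in this question, the following works.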

    library(GenomicRanges)

    ### Sort seqlevels
    # (not necessary here, but in real world examples,
    # with multiple sequences, you will want to do this)
    gr1 <- sort(GenomeInfoDb::sortSeqlevels(gr1))
    gr2 <- sort(GenomeInfoDb::sortSeqlevels(gr2))
    
    ### Add seqlengths
    # (this corresponds to the actual sequence lengths;
    # here we use the highest position between the two objects: 60)
    # Set it on both objects so both coverage vectors span all 60 positions
    seqlengths(gr1) <- 60
    seqlengths(gr2) <- 60
    
    ### Make 1-bp tiles covering the genome
    # (using either one of gr1 and gr2 as a reference)
    bins <- GenomicRanges::tileGenome(GenomeInfoDb::seqlengths(gr1),
                                      tilewidth=1,
                                      cut.last.tile.in.chrom=TRUE)
    
    ### Get coverage signal as Rle object
    gr1_cov <- coverage(gr1, weight="cov")
    gr2_cov <- coverage(gr2, weight="cov")
    
    ### Get average coverage in each bin
    # (since the bins are 1-bp wide, this just keeps the original coverage value)
    gr1_bins <- GenomicRanges::binnedAverage(bins, gr1_cov, "binned_cov")
    gr2_bins <- GenomicRanges::binnedAverage(bins, gr2_cov, "binned_cov")
    
    ### Make final object:
    # We can now sum the values in the metadata columns
    # Addressing the PPS, you could do any other operation or apply a function
    gr3 <- gr1_bins
    gr3$binned_cov <- gr1_bins$binned_cov + gr2_bins$binned_cov
    

    This produces the final GRanges object at single-bp resolution.

    > gr3
    
    GRanges object with 60 ranges and 1 metadata column:
         seqnames    ranges strand | binned_cov
            <Rle> <IRanges>  <Rle> |  <numeric>
     [1]        1    [1, 1]      * |          3
     [2]        1    [2, 2]      * |          3
     [3]        1    [3, 3]      * |          6
     [4]        1    [4, 4]      * |          6
     [5]        1    [5, 5]      * |          4
     ...      ...       ...    ... .        ...
    [56]        1  [56, 56]      * |         10
    [57]        1  [57, 57]      * |         10
    [58]        1  [58, 58]      * |         10
    [59]        1  [59, 59]      * |         10
    [60]        1  [60, 60]      * |         10
    -------
    seqinfo: 1 sequence from an unspecified genome
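
Regarding the PPS, here is a minimal sketch of swapping the sum for another operation. It assumes the toy gr1 and gr2 from the question and a sequence length of 60; `combine_cov` and `gr_prod` are illustrative names, not part of GenomicRanges. Any vectorised function of two numeric vectors should fit in the same place.

```r
library(GenomicRanges)

# Toy data from the question (coverage stored in the "cov" metadata column)
gr1 <- GRanges("1", IRanges(start = c(1, 5, 10, 15, 30, 55),
                            end   = c(4, 9, 14, 29, 39, 60)),
               cov = c(3, 1, 8, 6, 2, 10))
gr2 <- GRanges("1", IRanges(start = c(3, 20, 24), end = c(7, 23, 26)),
               cov = c(3, 5, 3))

# Assumption: both individuals share the same sequence length (60 here)
seqlengths(gr1) <- 60
seqlengths(gr2) <- 60

# 1-bp tiles and per-tile coverage for each individual
bins <- tileGenome(seqlengths(gr1), tilewidth = 1,
                   cut.last.tile.in.chrom = TRUE)
gr1_bins <- binnedAverage(bins, coverage(gr1, weight = "cov"), "binned_cov")
gr2_bins <- binnedAverage(bins, coverage(gr2, weight = "cov"), "binned_cov")

# Hypothetical two-argument function: multiply instead of add
combine_cov <- function(x, y) x * y

gr_prod <- bins
gr_prod$binned_cov <- combine_cov(gr1_bins$binned_cov, gr2_bins$binned_cov)
```

Note that with multiplication, positions covered in only one individual end up at 0, because the uncovered individual contributes a coverage of 0 there.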
    

    To compress this single-bp gr3 and obtain the exact gr3 shown in the question, we can do the following.

    ### Compress back to variable-width IRanges (by cov)
    gr3_Rle <- coverage(gr3, weight='binned_cov')
    gr3 <- as(gr3_Rle, "GRanges")
    
    ### Drop 0-score rows
    gr3 <- gr3[gr3$score > 0]
    
    ### Rename metadata column
    names(mcols(gr3)) <- 'cov'
    
    > gr3
    
    GRanges object with 11 ranges and 1 metadata column:
           seqnames    ranges strand |       cov
              <Rle> <IRanges>  <Rle> | <numeric>
       [1]        1  [ 1,  2]      * |         3
       [2]        1  [ 3,  4]      * |         6
       [3]        1  [ 5,  7]      * |         4
       [4]        1  [ 8,  9]      * |         1
       [5]        1  [10, 14]      * |         8
       [6]        1  [15, 19]      * |         6
       [7]        1  [20, 23]      * |        11
       [8]        1  [24, 26]      * |         9
       [9]        1  [27, 29]      * |         6
      [10]        1  [30, 39]      * |         2
      [11]        1  [55, 60]      * |        10
      -------
      seqinfo: 1 sequence from an unspecified genome
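
Since the question mentions 100 individuals, a possible shortcut is to skip the 1-bp tiles and add the coverage vectors directly. This is a different technique from the tiling above: it relies on `coverage()` returning an RleList, which supports arithmetic. The sketch below assumes every individual has the same seqlevels and that seqlengths are set identically, so that all coverage vectors have equal length; `grs`, `covs`, `total` and `gr_total` are illustrative names.

```r
library(GenomicRanges)

# Toy data from the question
gr1 <- GRanges("1", IRanges(start = c(1, 5, 10, 15, 30, 55),
                            end   = c(4, 9, 14, 29, 39, 60)),
               cov = c(3, 1, 8, 6, 2, 10))
gr2 <- GRanges("1", IRanges(start = c(3, 20, 24), end = c(7, 23, 26)),
               cov = c(3, 5, 3))

# One GRanges per individual; a real list would hold all 100
grs <- list(gr1, gr2)

# Assumption: a common sequence length (60 here) for every individual
grs <- lapply(grs, function(gr) {
  seqlengths(gr) <- c("1" = 60)
  gr
})

# Weighted coverage per individual, as a list of RleList objects
covs <- lapply(grs, coverage, weight = "cov")

# Position-wise sum across all individuals
total <- Reduce(`+`, covs)

# Back to variable-width ranges, dropping uncovered stretches
gr_total <- as(total, "GRanges")
gr_total <- gr_total[gr_total$score > 0]
names(mcols(gr_total)) <- "cov"
```

On the toy data this should give the same 11 ranges as above. Replacing `` `+` `` in `Reduce()` with another binary operator or function is one way to address the PPS, and dividing `total` by `length(covs)` before the conversion would give a mean rather than a sum.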
    

    [Comments]:
