如何按r中其他列中的条件对行进行排序？答案

【问题标题】：How to order rows by conditions in other columns in r?如何按r中其他列中的条件对行进行排序？
【发布时间】：2020-10-22 06:47:28
【问题描述】：

我有一个基因数据集，我希望在其中订购样本/基因，按基因组中彼此相距一定距离的样本/基因进行分组。

例如，我的数据集如下所示：

#dt1
Gene    chromosome position      CP 
Gene1      1       70000200   1:70000200
Gene2      5       10000476   5:10000476
Gene3      1       70000201   1:70000201
Gene4      5       10000475   5:10000475

我还有一个原点位置数据集：

#dt2
chromosome   position  CP
    1        70005000  1:70005000
    5        10005000  5:10005000

如果基因在我的第二个 dt2 数据集中的任何位置的 +/- 500000 距离内并且位于同一染色体上，我正在尝试对我的第一个数据集中的基因进行分组。我的实际数据中存在一个问题，对于一个针对多个起源 dt2 位置的基因来说，这可能是正确的，所以我也在尝试排序到它最接近的那个。

输出旨在给有序组：

Gene   chromosome  position   Group 
Gene1      1       70000200    1
Gene3      1       70000201    1
Gene4      5       10000475    2
Gene2      5       10000476    2

Gene1 和 Gene3 在 dt2 原点的 500000 范围内，并且都在同一条染色体上，因此被归为一组，对于基因 4 和 2 来说是相同的

目前我正在尝试这样做：

dt2[, c("low", "high") := .(position - 500000, position  + 500000)]

#find matches on chromosome, with position between low&high
dt1[ dt2, match := i.CP,
     on = .(chromosome, position > low, position < high ) ]

#outputs:
    Gene chromosome position    CP        match
1   Gene1   1   70000200    1:70000200  1:70005000
2   Gene2   5   10000476    5:10000476  5:10005000
3   Gene3   1   70000201    1:70000201  1:70005000
4   Gene4   5   10000475    5:10000475  5:10005000

我在 2 个级别上遇到了问题，似乎没有在我的实际数据上获得匹配列的输出，所以我想知道是否有其他方法可以尝试对此进行编码。我还在努力将匹配列转换为分组匹配并在预期的输出中识别我想要的组，我有生物学背景，所以我不确定如何改变这一点 - 任何帮助将不胜感激。

输入数据：

#dt1:
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"), 
    chromosome = c(1L, 5L, 1L, 5L), position = c(70000200L, 10000476L, 
    70000201L, 10000475L), CP = c("1:70000200", "5:10000476", 
    "1:70000201", "5:10000475"), match = c("1:70005000", "5:10005000", 
    "1:70005000", "5:10005000")), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

#dt2:
structure(list(chromosome = c(1L, 5L), position = c(70005000L, 
10005000L), CP = c("1:70005000", "5:10005000"), low = c(69505000, 
9505000), high = c(70505000, 10505000)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"))

【问题讨论】：

标签： r data.table conditional-statements bioinformatics

【解决方案1】：

如果我理解正确，OP 想要

如果dt1 中的基因与dt2 中任何位置的距离在+/- 500000 范围内并且位于同一条染色体上，则对它们进行分组。
如果一个基因与多个来源dt2 位置匹配，则选择最接近的一个。

因此，滚动连接到最近的以及后续过滤可能是答案。

library(data.table)
dt2[, Group := .I][
  dt1, on = .(chromosome, position), roll = "nearest",
  .(Gene, chromosome, position, CP = i.CP, 
    Group = fifelse(abs(x.position - i.position) <= 500000L, Group, NA_integer_))][
      order(Group, CP)]

     Gene chromosome position         CP Group
1:  Gene1          1 70000200 1:70000200     1
2:  Gene3          1 70000201 1:70000201     1
3: Gene12          1 70005199 1:70005199     1
4: Gene13          1 70005900 1:70005900     2
5:  Gene4          5 10000475 5:10000475     3
6:  Gene2          5 10000476 5:10000476     3
7: Gene11          1 80000200 1:80000200    NA

请注意，此处使用扩展数据集是为了检查该方法是否适用于不同的用例。

扩展数据集中有一个基因Gene11，由于在dt2中没有匹配位置，所以没有分配到任何组，即Group == NA在距离阈值内 +/- 500000。

该方法类似于StupidWolf's GenomicRanges answer，但仅使用data.table 功能。

数据

两个数据集都已扩展以涵盖不同的用例：

library(data.table)
dt1 <- fread("Gene    chromosome position      CP 
Gene1      1       70000200   1:70000200
Gene2      5       10000476   5:10000476
Gene3      1       70000201   1:70000201
Gene4      5       10000475   5:10000475
Gene11     1       80000200   1:80000200
Gene12     1       70005199   1:70005199
Gene13     1       70005900   1:70005900")
dt2 <- fread("chromosome   position  CP
    1        70005000  1:70005000
    1        70006000  1:70006000
    5        10005000  5:10005000")

【讨论】：

【解决方案2】：

我认为您正在寻找执行非相等连接然后通过引用更新的惯用方式，即

dt2[, c("rn", "low", "high") := .(.I, position - 500000L, position  + 500000L)]
#note that you perform the non-equi join first, and then 
#extract the result column before `:=`, which updates by reference
dt1[, Group := dt2[.SD, on=.(chromosome, low<position, high>position), rn]]
dt1

编辑多个匹配项。在这种情况下，您将需要左连接：

dt2[, group := .I][dt1, on=.(chromosome, low<position, high>position)]

【讨论】：

谢谢你。我已经尝试过了，它给出了一个错误 Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 26020579 rows; more than 169740 = nrow(x)+nrow(i). Check for duplicate key values 我尝试添加 allow.cartesian=TRUE 但这也给出了一个错误 - 有没有办法让我设置代码以删除重复项？
有很多重叠。也就是说，dt1 的每一行都连接到 dt2 的多行。在这种情况下，你会选择哪一行？
我认为我需要将 dt1 中的行附加到它在 dt2 中连接的多行中的每一行 - 这是通过应用rep()吗？
@DN1，在这种情况下，您将需要左连接。我已将其添加到答案中
谢谢你，新代码在运行时添加了allow.cartesian=TRUE，但是我有没有办法为此生成一个组列？

【解决方案3】：

在 Bioconductor 中有专门为此而设计的软件包。因此，您可以使用的一件事是 distanceToNearest() 来自 GenomicRanges 。首先我们将它们转换为 Granges 对象：

library(GenomicRanges)
gr1=makeGRangesFromDataFrame(dt1,seqnames.field="chromosome",start.field="position",end.field="position")
values(gr1) = dt1[,c("Gene","CP")]

给 dt2 分组：

dt2$Group = 1:nrow(dt2)
gr2=makeGRangesFromDataFrame(dt2,seqnames.field="chromosome",start.field="position",end.field="position")

这一步会将 gr1 (dt1 GRanges) 中的每一行匹配到 gr2 (dt2) 中最近的 Range：

matches = distanceToNearest(gr1,gr2)
Hits object with 4 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |      4799
  [2]         2           2 |      4523
  [3]         3           1 |      4798
  [4]         4           2 |      4524
  -------
  queryLength: 4 / subjectLength: 2

我们将这个结果分配回去：

dt1$group = NA
dt1$group[queryHits(matches)] = dt2$Group[subjectHits(matches)]
dt1$distance = NA
dt1$distance[queryHits(matches)] = values(matches)$distance[subjectHits(matches)]
dt1

    Gene chromosome position         CP      match group distance
1: Gene1          1 70000200 1:70000200 1:70005000     1     4799
2: Gene2          5 10000476 5:10000476 5:10005000     2     4523
3: Gene3          1 70000201 1:70000201 1:70005000     1     4799
4: Gene4          5 10000475 5:10000475 5:10005000     2     4523

现在您可以过滤掉大于 500000 的那些

【讨论】：