Compare two columns and check if the values of other columns have increased or decreased

【Question title】: Compare two columns and check if the values of other columns have increased or decreased
【Posted】: 2016-10-11 11:48:12
【Question description】:

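I have a data.frame with several columns. One column (sequence) holds unique sequences. I want to compare it with the next release of this data.frame, see how many peptides each sequence has, and check whether that number has increased or decreased.

I get this data.frame from a database, but the problem is that in every release the database inserts the new sequences at random positions (see the 2nd release).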

1st release
    ID | sequence | ... | Peptides | nºproject
    1  | atggggg  | ... | 65       | project
    2  | tgatgat  | ... | 3        | project
    3  | actgat   | ... | 32       | project
    4  | atgtagtt | ... | 25       | project
    5  | ttttaaat | ... | 32       | project

2nd release
    ID | sequence | ... | Peptides | nºproject
    1  | atggggg  | ... | 66       | project
    2  | tgatgat  | ... | 5        | project
    3  | actgat   | ... | 36       | project
    4  | ATTTGGGG | ... | 26       | project *** New one ***
    5  | ATTGATGA | ... | 32       | project *** New one ***
    6  | atgtagtt | ... | 47       | project
    7  | ttttaaat | ... | 38       | project

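If every release appended the new sequences at the end of the column, I would have no problem using the duplicated function, but unfortunately they are inserted at random positions.

Here is an example:

1st release: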

df <- structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L), 
.Label = c("1", "2", "3", "4" ,"5") ), 
sequence = structure(c(1L,2L, 3L, 4L, 5L), 
.Label = c(" actgat   "," atagattg ", " atatagag ", " atggggg  ", " atgtagtt "), class = "factor"), 
peptides = structure(c(1L, 2L, 3L, 4L, 5L), 
.Label = c(" 54  ", " 84  ",  " 32  ", " 36  ", "12"),
class = "factor"), n_project = structure(c(1L, 1L, 1L, 1L, 1L), 
.Label = " project ", class = "factor")), .Names = c("ID", "sequence", "peptides", "n_project"), class = "data.frame", row.names = c(NA,  -5L))

2nd release:

df2 <- structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L), 
.Label = c("1", "2", "3", "4" ,"5" ,"6", "7" ) ), 
sequence = structure(c(1L,2L, 7L, 8L, 3L, 4L, 5L), 
.Label = c(" actgat   "," atagattg ", " atatagag ", " atggggg  ", " atgtagtt ", " gggatgac ", " TATATCC ", " TTTTAAAT "), class = "factor"), 
peptides = structure(c(1L, 2L,7L,8L, 3L, 4L, 5L), 
.Label = c(" 56  ", " 85  ",  " 31  ", " 36  ", "15", "10", "76", "98", "34", "76"),
class = "factor"), n_project = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), 
.Label = " project ", class = "factor")), .Names = c("ID", "sequence", "peptides", "n_project"), class = "data.frame", row.names = c(NA,  -7L))

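【Question comments】:

Tags: r

【Solution 1】:

First convert your peptide counts to numeric (they are factors whose labels are numeric strings, which is a bit messy):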

> df$peptides=as.numeric(as.character(df$peptides))
> df2$peptides=as.numeric(as.character(df2$peptides))
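
The intermediate as.character() step matters: calling as.numeric() directly on a factor returns the internal integer codes, not the numbers written in the labels. A minimal sketch with a made-up factor f (its values are not taken from the question's data):

```r
# Hypothetical factor whose labels are padded numeric strings, as in the question.
# Levels are given explicitly so the codes do not depend on the locale's sort order.
labels <- c(" 54  ", " 84  ", "12")
f <- factor(labels, levels = labels)

# Wrong: returns the level codes 1, 2, 3 instead of the peptide counts
codes <- as.numeric(f)

# Right: convert to the labels first; as.numeric() ignores surrounding whitespace
counts <- as.numeric(as.character(f))
```

Here codes is 1 2 3, while counts holds the intended values 54 84 12.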

A left join will match the new data to the old data:

> require(dplyr)
> left_join(df, df2, c("sequence"="sequence"))
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    1  actgat            54    project     1         56    project 
2    2  atagattg          84    project     2         85    project 
3    3  atatagag          32    project     5         31    project 
4    4  atggggg           36    project     6         36    project 
5    5  atgtagtt          12    project     7         15    project 
Warning message:
In left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
  joining factors with different levels, coercing to character vector

Ignore the warning. A left join followed by a filter will find the rows whose peptide count increased:

> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y>peptides.x)
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    1  actgat            54    project     1         56    project 
2    2  atagattg          84    project     2         85    project 
3    5  atgtagtt          12    project     7         15    project

Save that as a new data frame, or whatever you need.

As a check, the rows that decreased or stayed the same:

> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y<=peptides.x)
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    3  atatagag          32    project     5         31    project 
2    4  atggggg           36    project     6         36    project

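【Comments】:

Is it possible to join the data on additional conditions? I mean, for example, using sequence = sequence and modifications = modifications (modifications would be another column). I see that some sequences are repeated but with different modifications, so when I use your approach the join is not correct.
OK, I got it: left_join(df, df2, by = c("sequence","modifications")). But now my problem is that left_join fills the columns with NA, so the filter function cannot count whether the peptide number increased.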
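
On the NA problem raised in the last comment: a left join keeps every row of the old table, so rows without a match get NA in the new columns, and any comparison against NA is itself NA. filter() drops rows whose condition is NA, so unmatched rows silently vanish from both the "increased" and the "not increased" results. One possible way around this is to label every row explicitly. This is only a sketch under assumptions: the tables old and new, their values, and the status column are invented for illustration; only the column name modifications comes from the comment.

```r
library(dplyr)

# Invented example data: the same sequence can appear with different modifications
old <- data.frame(
  sequence      = c("actgat", "actgat", "atggggg"),
  modifications = c("none", "phospho", "none"),
  peptides      = c(54, 10, 36),
  stringsAsFactors = FALSE
)
new <- data.frame(
  sequence      = c("actgat", "atggggg", "TATATCC"),
  modifications = c("none", "none", "none"),
  peptides      = c(56, 36, 76),
  stringsAsFactors = FALSE
)

# Join on both key columns; unmatched old rows get NA in peptides.y
joined <- left_join(old, new, by = c("sequence", "modifications"))

# Label each row so unmatched rows are reported rather than dropped
joined <- joined %>%
  mutate(status = case_when(
    is.na(peptides.y)       ~ "missing in new release",
    peptides.y > peptides.x ~ "increased",
    peptides.y < peptides.x ~ "decreased",
    TRUE                    ~ "unchanged"
  ))

# Number of rows per status
status_counts <- count(joined, status)
```

If the unmatched rows are of no interest at all, replacing left_join() with inner_join() avoids the NA values in the first place.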

【Solution 2】:

@Spacedman's solution, but using data.table:

library("data.table")
setDT(df, key = 'sequence')
setDT(df2, key = 'sequence')
df2[df]

Or as a one-liner (this probably requires a recent version of data.table):

library("data.table")
setDT(df2)[df, on="sequence"]

【Comments】:

This does the join, but for completeness you should probably show how to get the rows where the peptide count increased.
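
Following up on that comment, the increased rows could be pulled out of the data.table join roughly as below. This is a sketch on invented tables old and new with numeric peptides columns, not the question's factor data. In a join of the form X[Y, on = ...], the columns of Y whose names clash with those of X are prefixed with i., so i.peptides is the old count here.

```r
library(data.table)

# Invented stand-ins for the two releases, with numeric peptide counts
old <- data.table(
  sequence = c("actgat", "atagattg", "atatagag", "atggggg", "atgtagtt"),
  peptides = c(54, 84, 32, 36, 12)
)
new <- data.table(
  sequence = c("actgat", "atagattg", "TATATCC", "TTTTAAAT",
               "atatagag", "atggggg", "atgtagtt"),
  peptides = c(56, 85, 76, 98, 31, 36, 15)
)

# For each row of old, look up the matching row of new
joined <- new[old, on = "sequence"]

# peptides comes from new, i.peptides from old
increased <- joined[peptides > i.peptides]
```

With this data, increased holds the rows for actgat, atagattg and atgtagtt.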

【Solution 3】:

Since you have a common key, you can use a join.

In the tidyverse it looks like this:

library(tidyverse)

df %>% 
  full_join(df2, by = "sequence", suffix = c(".1", ".2")) %>%
  # Fix data to convert to character and numeric
  mutate_each(funs(as.numeric(as.character(.))), starts_with("pept")) %>%
  # See difference
  mutate(change = peptides.2 - peptides.1)

#> Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
#> factors with different levels, coercing to character vector
#>   ID.1   sequence peptides.1 n_project.1 ID.2 peptides.2 n_project.2  change
#> 1    1  actgat            54    project     1         56    project       2
#> 2    2  atagattg          84    project     2         85    project       1
#> 3    3  atatagag          32    project     5         31    project      -1
#> 4    4  atggggg           36    project     6         36    project       0
#> 5    5  atgtagtt          12    project     7         15    project       3
#> 6   NA   TATATCC          NA        <NA>    3         76    project      NA
#> 7   NA  TTTTAAAT          NA        <NA>    4         98    project      NA

With full_join we can see:

the matches between df and df2.
the new rows in df2 (NA values for the old peptide counts).
the change in peptides over time.
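
mutate_each() and funs() have since been deprecated in dplyr; with dplyr 1.0.0 or later the same conversion can be written with across(). The sketch below does that on a cut-down, invented copy of the example data and also turns the change into a label. The tables old and new and the direction column are names introduced here for illustration only.

```r
library(dplyr)

# Cut-down stand-ins for df and df2; peptides is a factor, as in the question
old <- data.frame(sequence = c("actgat", "atatagag", "atggggg"),
                  peptides = factor(c("54", "32", "36")),
                  stringsAsFactors = FALSE)
new <- data.frame(sequence = c("actgat", "TATATCC", "atatagag", "atggggg"),
                  peptides = factor(c("56", "76", "31", "36")),
                  stringsAsFactors = FALSE)

result <- old %>%
  full_join(new, by = "sequence", suffix = c(".1", ".2")) %>%
  # Convert both peptide columns from factor to numeric via their labels
  mutate(across(starts_with("pept"), ~ as.numeric(as.character(.x)))) %>%
  mutate(
    change = peptides.2 - peptides.1,
    # NA on one side means the sequence exists in only one release
    direction = case_when(
      is.na(peptides.1) ~ "new",
      is.na(peptides.2) ~ "removed",
      change > 0        ~ "increased",
      change < 0        ~ "decreased",
      TRUE              ~ "unchanged"
    )
  )
```

count(result, direction) then gives the number of sequences in each group.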

Here I am assuming that your sequence data is case-sensitive.
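
If case carries no meaning in your data, or if the keys have stray whitespace (the factor labels in the question are padded with spaces), the key could be normalised before joining. A sketch in base R; normalise_key is a hypothetical helper written for this example, not a function from any package, and the key vectors are invented:

```r
# Hypothetical helper: strip surrounding whitespace and ignore case
normalise_key <- function(x) tolower(trimws(as.character(x)))

old_keys <- c(" actgat   ", " ttttaaat ")
new_keys <- c(" actgat   ", " TTTTAAAT ", " TATATCC ")

# Without normalisation, " ttttaaat " and " TTTTAAAT " count as different sequences
raw_new <- setdiff(new_keys, old_keys)

# After normalisation only the genuinely new sequence remains
norm_new <- setdiff(normalise_key(new_keys), normalise_key(old_keys))
```

To use this with the joins above, apply the helper to the sequence column of both data frames first.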

Base R

You can also do this in base R with merge, but I prefer the tidyverse syntax above.

merge(df, df2, by = "sequence", all = TRUE)
#>     sequence ID.x peptides.x n_project.x ID.y peptides.y n_project.y
#> 1  actgat       1       54      project     1       56      project 
#> 2  atagattg     2       84      project     2       85      project 
#> 3  atatagag     3       32      project     5       31      project 
#> 4  atggggg      4       36      project     6       36      project 
#> 5  atgtagtt     5       12      project     7       15      project 
#> 6   TATATCC    NA     <NA>         <NA>     3       76      project 
#> 7  TTTTAAAT    NA     <NA>         <NA>     4       98      project
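
The merge output above does not yet contain the difference itself. One way to add it in base R and to count how many sequences went up or down is sketched below, on invented tables old and new with numeric peptides columns; the suffixes .old and .new and the counter names are choices made for this example.

```r
# Invented stand-ins for the two releases, with numeric peptide counts
old <- data.frame(sequence = c("actgat", "atatagag", "atggggg"),
                  peptides = c(54, 32, 36),
                  stringsAsFactors = FALSE)
new <- data.frame(sequence = c("actgat", "TATATCC", "atatagag", "atggggg"),
                  peptides = c(56, 76, 31, 36),
                  stringsAsFactors = FALSE)

# Full outer join on the key
merged <- merge(old, new, by = "sequence", all = TRUE,
                suffixes = c(".old", ".new"))

# Positive = increased, negative = decreased, NA = present in only one release
merged$change <- merged$peptides.new - merged$peptides.old

# Counts per outcome; na.rm = TRUE skips the one-sided rows
n_up       <- sum(merged$change > 0, na.rm = TRUE)
n_down     <- sum(merged$change < 0, na.rm = TRUE)
n_only_one <- sum(is.na(merged$change))
```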

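【Comments】:

How can I count the number of peptides afterwards?
The total number of peptides? The ones that are new? The old ones? What do you mean?
As I said in the introduction, I want to join these data because I want to compare an old release with a new one, check how many peptides I have, and see whether the number of peptides has increased or decreased.
Isn't that the change column? (0 means no change.) If you want something else, it isn't clear to me.
Sorry, I didn't see the change column. Thank you very much, that is exactly what I wanted. Thanks again.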