【Question title】: How to replace values of several columns based on another column in R within each row?
【Posted】: 2021-06-21 22:02:51
【Question】:

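I am working with a dataset (30000 x 500) and need to replace some values in several columns based on the data in other columns. The difficulty is that the reference values change from row to row. Here is a small sample of the dataset: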

#Create a data frame
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"), 
                   A_allele = c("C","G","C","G","C","C","A","T","G","C"),
                   B_allele = c("G","A","T","A","A","G","T","A","C","A"),
                   alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
                   line_1 = sample(c("A","B"),10, replace = TRUE),
                   line_2 = sample(c("A","B"),10, replace = TRUE),
                   line_3 = sample(c("A","B"),10, replace = TRUE),
                   line_4 = sample(c("A","B"),10, replace = TRUE),
                   line_5 = sample(c("A","B"),10, replace = TRUE),
                   line_6 = sample(c("A","B"),10, replace = TRUE),
                   line_7 = sample(c("A","B"),10, replace = TRUE),
                   line_8 = sample(c("A","B"),10, replace = TRUE),
                   line_9 = sample(c("A","B"),10, replace = TRUE),
                   line_10 = sample(c("A","B"),10, replace = TRUE)
                   )

df
     SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1   SNP1        C        G     C/G      B      A      B      A      B      B      B      B      B       A
2   SNP2        G        A     G/A      A      B      A      A      A      B      B      A      B       A
3   SNP3        C        T     C/T      B      B      A      B      B      B      A      A      A       A
4   SNP4        G        A     G/A      A      B      B      A      B      A      B      B      B       A
5   SNP5        C        A     C/A      B      A      B      B      B      A      B      A      B       B
6   SNP6        C        G     C/G      B      A      B      A      B      A      B      B      B       B
7   SNP7        A        T     A/T      B      A      A      B      A      A      B      A      B       A
8   SNP8        T        A     T/A      A      B      A      B      A      A      B      B      A       B
9   SNP9        G        C     G/C      B      A      B      B      B      B      A      B      A       B
10 SNP10        C        A     C/A      B      B      B      B      B      A      A      A      A       A

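For each row, the A_allele and B_allele columns are the reference values used to replace the A or B values in the 10 line columns. Where the value is "A", use the value from column A_allele; where the value is "B", use the value from column B_allele.

In the example, this works out as follows:

  • Row 1: change A to C / change B to G
  • Row 2: change A to G / change B to A
  • Row 3: change A to C / change B to T
  • Row 10: same idea.

The output should look like this: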

     SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1   SNP1        C        G     C/G      G      C      G      C      G      G      G      G      G       C
2   SNP2        G        A     G/A      G      A      G      G      G      A      A      G      A       G
3   SNP3        C        T     C/T      T      T      C      T      T      T      C      C      C       C
4   SNP4        G        A     G/A      G      A      A      G      A      G      A      A      A       G
5   SNP5        C        A     C/A      A      C      A      A      A      C      A      C      A       A
6   SNP6        C        G     C/G      G      C      G      C      G      C      G      G      G       G
7   SNP7        A        T     A/T      T      A      A      T      A      A      T      A      T       A
8   SNP8        T        A     T/A      T      A      T      A      T      T      A      A      T       A
9   SNP9        G        C     G/C      C      G      C      C      C      C      G      C      G       C
10 SNP10        C        A     C/A      A      A      A      A      A      C      C      C      C       C

Since there are about 30000 rows, I would like the code to be efficient if possible.

Any suggestions?

【Comments】:

    Tags: r dataframe if-statement conditional-statements


    【Solution 1】:

    You can do this:

    library(tidyverse)
    
    df %>% mutate(across(starts_with("line"), ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
    
    #output with df generated with set.seed(2021)
         SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
    1   SNP1        C        G     C/G      C      C      G      C      C      C      G      G      C       G
    2   SNP2        G        A     G/A      A      A      A      A      G      G      G      G      G       G
    3   SNP3        C        T     C/T      T      T      C      C      T      T      T      T      T       C
    4   SNP4        G        A     G/A      A      G      A      A      A      G      G      A      G       A
    5   SNP5        C        A     C/A      C      C      C      A      C      A      A      C      C       A
    6   SNP6        C        G     C/G      G      C      C      C      C      C      G      C      G       G
    7   SNP7        A        T     A/T      T      A      T      T      T      T      T      A      T       A
    8   SNP8        T        A     T/A      A      T      A      T      A      A      A      T      A       T
    9   SNP9        G        C     G/C      C      C      C      C      C      G      G      G      C       C
    10 SNP10        C        A     C/A      A      C      A      C      A      C      C      C      C       A
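
The `str_sub()` positions 1 and 3 above rely on every allele being a single character separated by `/`. If that assumption might not hold, a base R sketch along the following lines splits on the separator instead. The tiny data frame and the `line_1` column are hypothetical, used only for illustration.

```r
# Hypothetical example; "line_1" stands in for any genotype column.
df <- data.frame(
  alleles = c("C/G", "G/A", "C/T"),
  line_1  = c("B", "A", "B"),
  stringsAsFactors = FALSE
)

# Split each "X/Y" string once, then take the first and second halves.
parts  <- strsplit(df$alleles, "/", fixed = TRUE)
first  <- vapply(parts, function(p) p[1], character(1))
second <- vapply(parts, function(p) p[2], character(1))

# "A" takes the first allele of its row, anything else takes the second.
df$line_1 <- ifelse(df$line_1 == "A", first, second)
df
```

Computing `first` and `second` once outside the per-column step also avoids repeating the string split for every column.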
    

    As mentioned in the comments, if the column names do not follow a pattern, Option 1 is to store them in a vector, say vars, and use it inside across:

    set.seed(2021)
    df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"), 
                     A_allele = c("C","G","C","G","C","C","A","T","G","C"),
                     B_allele = c("G","A","T","A","A","G","T","A","C","A"),
                     alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
                     line_1 = sample(c("A","B"),10, replace = TRUE),
                     line_2 = sample(c("A","B"),10, replace = TRUE),
                     line_3 = sample(c("A","B"),10, replace = TRUE),
                     line_4 = sample(c("A","B"),10, replace = TRUE),
                     line_5 = sample(c("A","B"),10, replace = TRUE),
                     line_6 = sample(c("A","B"),10, replace = TRUE),
                     line_7 = sample(c("A","B"),10, replace = TRUE),
                     cat = sample(c("A","B"),10, replace = TRUE),
                     dog = sample(c("A","B"),10, replace = TRUE),
                     rabbit = sample(c("A","B"),10, replace = TRUE)
    )
    
    vars <- c("line_1", "line_2", "line_3", "line_4", "line_5", "line_6", "line_7", "cat", "dog", "rabbit")
    
    df %>% mutate(across(all_of(vars), ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
    
         SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
    1   SNP1        C        G     C/G      C      C      G      C      C      C      G   G   C      G
    2   SNP2        G        A     G/A      A      A      A      A      G      G      G   G   G      G
    3   SNP3        C        T     C/T      T      T      C      C      T      T      T   T   T      C
    4   SNP4        G        A     G/A      A      G      A      A      A      G      G   A   G      A
    5   SNP5        C        A     C/A      C      C      C      A      C      A      A   C   C      A
    6   SNP6        C        G     C/G      G      C      C      C      C      C      G   C   G      G
    7   SNP7        A        T     A/T      T      A      T      T      T      T      T   A   T      A
    8   SNP8        T        A     T/A      A      T      A      T      A      A      A   T   A      T
    9   SNP9        G        C     G/C      C      C      C      C      C      G      G   G   C      C
    10 SNP10        C        A     C/A      A      C      A      C      A      C      C   C   C      A
    

    Option 2: you can also select the columns directly by position:

    df %>% mutate(across(5:14, ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
    
         SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
    1   SNP1        C        G     C/G      C      C      G      C      C      C      G   G   C      G
    2   SNP2        G        A     G/A      A      A      A      A      G      G      G   G   G      G
    3   SNP3        C        T     C/T      T      T      C      C      T      T      T   T   T      C
    4   SNP4        G        A     G/A      A      G      A      A      A      G      G   A   G      A
    5   SNP5        C        A     C/A      C      C      C      A      C      A      A   C   C      A
    6   SNP6        C        G     C/G      G      C      C      C      C      C      G   C   G      G
    7   SNP7        A        T     A/T      T      A      T      T      T      T      T   A   T      A
    8   SNP8        T        A     T/A      A      T      A      T      A      A      A   T   A      T
    9   SNP9        G        C     G/C      C      C      C      C      C      G      G   G   C      C
    10 SNP10        C        A     C/A      A      C      A      C      A      C      C   C   C      A
    

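    【Comments】:

    • Thank you very much! Any idea how to select columns that do not share a common pattern (i.e. "line")? My lines have different names (dog, cat, bird, monkey, etc.); I only used these names as a reference. Thanks
    • Thanks for the clarification. I used vars.
    【Solution 2】:

    You can use across with ifelse in dplyr: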

    library(dplyr)
    df %>% mutate(across(starts_with('line'), ~ifelse(. == 'A', A_allele, B_allele)))
    

    Or with lapply in base R:

    cols <- grep('line', names(df))
    df[cols] <- lapply(df[cols], function(x) ifelse(x == 'A', df$A_allele, df$B_allele))
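
For a table of roughly 30000 x 500, one further option is to do the replacement in a single pass over a character matrix. This is not taken from the answers here; it is a minimal sketch on a hypothetical three-row data frame, and whether it is actually faster than the `lapply` version is worth benchmarking on the real data.

```r
# Hypothetical small example; column names are illustrative only.
df <- data.frame(
  SNP      = c("SNP1", "SNP2", "SNP3"),
  A_allele = c("C", "G", "C"),
  B_allele = c("G", "A", "T"),
  line_1   = c("B", "A", "B"),
  line_2   = c("A", "B", "B"),
  stringsAsFactors = FALSE
)

cols <- grep("^line", names(df))

# Work on a character matrix so the comparison runs once over all cells.
m <- as.matrix(df[cols])

# row(m) holds each cell's row index, so every cell picks the allele
# that belongs to its own row.
m[] <- ifelse(m == "A", df$A_allele[row(m)], df$B_allele[row(m)])

# Write the result back as character columns.
df[cols] <- as.data.frame(m, stringsAsFactors = FALSE)
df
```

The sketch assumes the selected columns contain only "A" and "B"; any other value would be treated like "B".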
    

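    【Comments】:

    • Thanks for the reply, but I have two comments: 1. I get numbers instead of C/G/A/T. For example, SNP1 has 3 3 3 2 2 3 2 3 3 instead of G G C C. The numbers seem to follow the letters in order, 1 = A; 2 = C; 3 = G; 4 = T, but I am not sure. 2. What if the lines do not share a similar pattern (i.e. "line") and are very different, for example cat, dog, monkey, bird, etc.? I used "line" to simplify the explanation, but the real names are different and have no common pattern.
    • @Javier_HV I think you need to add stringsAsFactors = FALSE to your data.frame call.
    • Yes, you are right, the strings were being stored as factors. About the second point, any idea how to select columns that do not share a common pattern (i.e. "line")? My lines have different names; I only used these names as a reference. Thanks
    • How do you decide which columns to change? You can use numeric indices, cols <- 5:14, or if they have nothing in common you may need to add them to a vector by hand, cols <- c(1, 3, 4, 5, 8) and so on. Use this cols in the lapply solution, and in across replace starts_with('line') with cols.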
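
The two points raised in this thread can be put together in a small base R sketch. The data frame and the `cat` and `dog` column names are hypothetical. `stringsAsFactors = TRUE` is set on purpose to reproduce the factor behaviour that was the default for data.frame() before R 4.0.0.

```r
# Hypothetical data with factor columns, as older R versions (< 4.0.0)
# created by default for strings in data.frame().
df <- data.frame(
  A_allele = c("C", "G", "C"),
  B_allele = c("G", "A", "T"),
  cat      = c("B", "A", "B"),
  dog      = c("A", "B", "B"),
  stringsAsFactors = TRUE
)

# With factor inputs, ifelse() returns the underlying integer codes,
# which is why numbers appear instead of letters.
codes <- ifelse(df$cat == "A", df$A_allele, df$B_allele)

# Convert every factor column to character before replacing.
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

# Columns without a common name pattern: list them explicitly by name.
cols <- c("cat", "dog")
df[cols] <- lapply(df[cols], function(x) ifelse(x == "A", df$A_allele, df$B_allele))
df
```

Selecting by name rather than by position keeps the code valid if columns are later reordered.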