【Question title】: Convert genotypes format '12' to ACTG format
【Posted】: 2021-04-13 15:03:27
【Question description】:

Hello,

Please help me convert my '12' genotype format to ACGT format, based on the snp_info file.

Example of df (row 1):

sample1 1/2 1/2 2/1 2/1

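The first SNP of sample1 is heterozygous (1/2): 1 is the ref allele (A) and 2 is the alt allele (G) (see snp_info).

I would like to automate the process so that I can convert all genotypes in my real data (900 samples / 30 genotypes).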

# df
df = t(data.frame( 
sample1 =c('1/2','1/2','2/1','2/1'), 
sample2 =c('2/1','1/1','1/2','2/1'), 
sample3 =c('2/1','2/1','1/1','1/2'), 
sample4 =c('1/1','2/2','2/2','2/2')))

# snp_info

snp_info = data.frame(
snp =c('11_524568','12_542656','12_558659','13_8457658'), 
position =c('524568','542656','558659','8457658'), 
ref =c('A','T','T','G'), 
alt=c('G','C','C','A'))

Desired output

desired_output = t(data.frame(
sample1 =c('A/G','T/C','C/T','A/G'), 
sample2 =c('G/A','T/T','T/C','A/G'), 
sample3 =c('G/A','C/T','T/T','G/A'), 
sample4 =c('A/A','C/C','C/C','A/A')
))

I tried a for loop for the first SNP, but it did not give the desired output:

desired_output = t(data.frame(
sample1 =c('','','',''), 
sample2 =c('','','',''), 
sample3 =c('','','',''), 
sample4 =c('','','','')))
tp = as.data.frame(tp)

geno = list()
    for (i in 1:nrow(df)) {
    geno[i] = paste(snp_info[i,3],'/',file_info[i,4])
    desired_output[i,1] = geno[i]
    }

Thank you for your help.

【Question discussion】:

    Tags: r dplyr


    【Solution 1】:

    Here is an inelegant solution:

    # EXAMPLE
    df <- t(data.frame( 
      sample1 =c('1/2','1/2','2/1','2/1'), 
      sample2 =c('2/1','1/1','1/2','2/1'), 
      sample3 =c('2/1','2/1','1/1','1/2'), 
      sample4 =c('1/1','2/2','2/2','2/2')))
    
    snp_info = data.frame(
      snp =c('11_524568','12_542656','12_558659','13_8457658'), 
      position =c('524568','542656','558659','8457658'), 
      ref =c('A','T','T','G'), 
      alt=c('G','C','C','A'))
    
    desired_output = t(data.frame(
      sample1 =c('','','',''), 
      sample2 =c('','','',''), 
      sample3 =c('','','',''), 
      sample4 =c('','','','')))
    
    for (i in 1L:ncol(df)) {
      desired_output[,i] <- gsub("1", snp_info[i,3L], df[,i])
      desired_output[,i] <- gsub("2", snp_info[i,4L], desired_output[,i])
    }
    

    However, speed does not seem to be a major concern for the amount of data you have.
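
If the loop ever did become a bottleneck, one possible variation does both substitutions in a single pass per SNP with `chartr()`. This is only a sketch: it assumes every allele is a single character (so it would not handle indels) and that the genotypes contain only the codes 1 and 2. The name `decoded` is just for illustration.

```r
df <- t(data.frame(
  sample1 = c('1/2', '1/2', '2/1', '2/1'),
  sample2 = c('2/1', '1/1', '1/2', '2/1'),
  sample3 = c('2/1', '2/1', '1/1', '1/2'),
  sample4 = c('1/1', '2/2', '2/2', '2/2')))

snp_info <- data.frame(
  snp = c('11_524568', '12_542656', '12_558659', '13_8457658'),
  ref = c('A', 'T', 'T', 'G'),
  alt = c('G', 'C', 'C', 'A'),
  stringsAsFactors = FALSE)

# chartr() maps characters one-to-one: for SNP j, "1" -> ref and "2" -> alt.
# Assumption: ref and alt are single characters.
decoded <- vapply(seq_len(ncol(df)), function(j) {
  chartr("12", paste0(snp_info$ref[j], snp_info$alt[j]), df[, j])
}, character(nrow(df)))

# Carry the sample names over explicitly.
rownames(decoded) <- rownames(df)
decoded
```

For 900 samples and 30 SNPs the difference from the `gsub()` loop is unlikely to be noticeable; the main gain is that each column is processed once instead of twice.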

    【Discussion】:

      【Solution 2】:

      @tacoman's answer is perfectly acceptable. I would improve it a little:

      df <- t(data.frame(
        sample1 = c('1/2', '1/2', '2/1', '2/1'),
        sample2 = c('2/1', '1/1', '1/2', '2/1'),
        sample3 = c('2/1', '2/1', '1/1', '1/2'),
        sample4 = c('1/1', '2/2', '2/2', '2/2')
      ))
      
      snp_info = data.frame(
        snp = c('11_524568', '12_542656', '12_558659', '13_8457658'),
        position = c('524568', '542656', '558659', '8457658'),
        ref = c('A', 'T', 'T', 'G'),
        alt = c('G', 'C', 'C', 'A')
      )
      
      
      stopifnot(ncol(df) == nrow(snp_info))  # one snp_info row per SNP column of df
      output <- matrix(rep("", ncol(df)*nrow(df)), ncol = ncol(df))
      rownames(output) <- paste0("sample", 1:nrow(df))
      
      for (i in 1L:ncol(df)) {
        output[, i] <- gsub("1", snp_info$ref[i], df[, i])
        output[, i] <- gsub("2", snp_info$alt[i], output[, i])
      }
      output
      #>         [,1]  [,2]  [,3]  [,4] 
      #> sample1 "A/G" "T/C" "C/T" "A/G"
      #> sample2 "G/A" "T/T" "T/C" "A/G"
      #> sample3 "G/A" "C/T" "T/T" "G/A"
      #> sample4 "A/A" "C/C" "C/C" "A/A"
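
For the real data (900 samples, 30 SNPs) the same loop could be wrapped in a function that validates its inputs first. The following is a sketch under stated assumptions, not a definitive implementation: `decode_genotypes` is a hypothetical name, and it assumes every genotype is coded exactly as `1/1`, `1/2`, `2/1` or `2/2` (anything else, including `NA`, is rejected).

```r
# Hypothetical helper: decode a samples x SNPs matrix of "1/2"-style codes.
decode_genotypes <- function(geno, ref, alt) {
  geno <- as.matrix(geno)
  ref <- as.character(ref)
  alt <- as.character(alt)
  # One ref/alt pair is needed per SNP column.
  stopifnot(ncol(geno) == length(ref), length(ref) == length(alt))
  # Assumption: only the codes "1" and "2" occur, separated by "/".
  bad <- !grepl("^[12]/[12]$", geno)
  if (any(bad)) {
    stop("unexpected genotype code(s): ",
         paste(unique(geno[bad]), collapse = ", "))
  }
  out <- geno  # keeps the dimensions and dimnames of the input
  for (j in seq_len(ncol(geno))) {
    out[, j] <- gsub("1", ref[j], geno[, j])
    out[, j] <- gsub("2", alt[j], out[, j])
  }
  out
}

df <- t(data.frame(
  sample1 = c('1/2', '1/2', '2/1', '2/1'),
  sample2 = c('2/1', '1/1', '1/2', '2/1'),
  sample3 = c('2/1', '2/1', '1/1', '1/2'),
  sample4 = c('1/1', '2/2', '2/2', '2/2')))

snp_info <- data.frame(
  ref = c('A', 'T', 'T', 'G'),
  alt = c('G', 'C', 'C', 'A'),
  stringsAsFactors = FALSE)

output <- decode_genotypes(df, snp_info$ref, snp_info$alt)
output
```

Rejecting unknown codes up front is a design choice: with plain `gsub()`, a missing-data code such as `0/0` would pass through silently and end up mixed in with the decoded alleles.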
      

      【Discussion】:
