Convert genotype format '12' to ACTG format: answers

【Question title】: Convert genotypes format '12' to ACTG format
【Posted】: 2021-04-13 15:03:27
【Question】:

Hello,

Please help me convert my '12' genotype format to ACGT format using the snp_info file.

Example from df (row 1):

sample1 1/2 1/2 2/1 2/1

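The first SNP of sample1 is heterozygous (1/2): 1 is the ref allele (A) and 2 is the alt allele (G) (see snp_info).

I want to automate this process to convert all genotypes in my real data (900 samples, 30 genotypes each).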
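
The encoding described above (1 = ref, 2 = alt) can be sketched for a single genotype as follows. The helper name `decode_genotype` is hypothetical and not part of the original post; it assumes each genotype is two allele codes separated by "/".

```r
# Hypothetical helper: decode one "1/2"-style genotype given its ref and alt alleles.
decode_genotype <- function(genotype, ref, alt) {
  codes <- strsplit(genotype, "/", fixed = TRUE)[[1]]  # "1/2" -> c("1", "2")
  alleles <- c("1" = ref, "2" = alt)                   # lookup table: allele code -> base
  paste(alleles[codes], collapse = "/")
}

decode_genotype("1/2", ref = "A", alt = "G")  # first SNP of sample1
decode_genotype("2/1", ref = "T", alt = "C")  # third SNP of sample1
```

The two calls should give "A/G" and "C/T", matching the first and third entries of sample1 in the desired output.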

# df
df = t(data.frame( 
sample1 =c('1/2','1/2','2/1','2/1'), 
sample2 =c('2/1','1/1','1/2','2/1'), 
sample3 =c('2/1','2/1','1/1','1/2'), 
sample4 =c('1/1','2/2','2/2','2/2')))

# snp_info

snp_info = data.frame(
snp =c('11_524568','12_542656','12_558659','13_8457658'), 
position =c('524568','542656','558659','8457658'), 
ref =c('A','T','T','G'), 
alt=c('G','C','C','A'))

Desired output

desired_output = t(data.frame(
sample1 =c('A/G','T/C','C/T','A/G'), 
sample2 =c('G/A','T/T','T/C','A/G'), 
sample3 =c('G/A','C/T','T/T','G/A'), 
sample4 =c('A/A','C/C','C/C','A/A')
))

I tried a for loop for the first SNP, but it did not give the desired output:

desired_output = t(data.frame(
sample1 =c('','','',''), 
sample2 =c('','','',''), 
sample3 =c('','','',''), 
sample4 =c('','','','')))
tp = as.data.frame(tp)

geno = list()
    for (i in 1:nrow(df)) {
    geno[i] = paste(snp_info[i,3],'/',file_info[i,4])
    desired_output[i,1] = geno[i]
    }

Thank you for your help.

【Question discussion】:

Tags: r dplyr

【Solution 1】:

Here is an inelegant solution:

# EXAMPLE
df <- t(data.frame( 
  sample1 =c('1/2','1/2','2/1','2/1'), 
  sample2 =c('2/1','1/1','1/2','2/1'), 
  sample3 =c('2/1','2/1','1/1','1/2'), 
  sample4 =c('1/1','2/2','2/2','2/2')))

snp_info = data.frame(
  snp =c('11_524568','12_542656','12_558659','13_8457658'), 
  position =c('524568','542656','558659','8457658'), 
  ref =c('A','T','T','G'), 
  alt=c('G','C','C','A'))

desired_output = t(data.frame(
  sample1 =c('','','',''), 
  sample2 =c('','','',''), 
  sample3 =c('','','',''), 
  sample4 =c('','','','')))

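# Columns of df are SNPs, so row i of snp_info describes column i of df.
for (i in seq_len(ncol(df))) {
  # Replace allele code 1 with the ref allele, then code 2 with the alt allele.
  desired_output[, i] <- gsub("1", snp_info$ref[i], df[, i])
  desired_output[, i] <- gsub("2", snp_info$alt[i], desired_output[, i])
}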

That said, speed should not be a major concern for the amount of data you have.
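
If a loop-free version is ever wanted, one possible sketch is to expand the ref and alt alleles into matrices shaped like df and pick from them with `ifelse`. This is an alternative to the `gsub` loop, not part of the original answer, and it assumes every genotype is exactly three characters ("1/2" style, codes 1 or 2 only); any code other than "1" would be treated as alt.

```r
df <- t(data.frame(
  sample1 = c('1/2', '1/2', '2/1', '2/1'),
  sample2 = c('2/1', '1/1', '1/2', '2/1'),
  sample3 = c('2/1', '2/1', '1/1', '1/2'),
  sample4 = c('1/1', '2/2', '2/2', '2/2'),
  stringsAsFactors = FALSE
))
snp_info <- data.frame(
  snp = c('11_524568', '12_542656', '12_558659', '13_8457658'),
  ref = c('A', 'T', 'T', 'G'),
  alt = c('G', 'C', 'C', 'A'),
  stringsAsFactors = FALSE
)

# Repeat the per-SNP alleles down the rows so they line up with df (samples x SNPs).
ref_mat <- matrix(snp_info$ref, nrow = nrow(df), ncol = ncol(df), byrow = TRUE)
alt_mat <- matrix(snp_info$alt, nrow = nrow(df), ncol = ncol(df), byrow = TRUE)

# Allele codes sit at character positions 1 and 3 of "1/2".
first  <- substr(df, 1, 1)
second <- substr(df, 3, 3)

decoded <- paste(ifelse(first == "1", ref_mat, alt_mat),
                 ifelse(second == "1", ref_mat, alt_mat),
                 sep = "/")
# paste() drops the matrix shape, so restore it along with the sample names.
output <- matrix(decoded, nrow = nrow(df), dimnames = dimnames(df))
output
```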

【Discussion】:

【Solution 2】:

@tacoman's answer is perfectly acceptable. I would improve it a little:

df <- t(data.frame(
  sample1 = c('1/2', '1/2', '2/1', '2/1'),
  sample2 = c('2/1', '1/1', '1/2', '2/1'),
  sample3 = c('2/1', '2/1', '1/1', '1/2'),
  sample4 = c('1/1', '2/2', '2/2', '2/2')
))

snp_info = data.frame(
  snp = c('11_524568', '12_542656', '12_558659', '13_8457658'),
  position = c('524568', '542656', '558659', '8457658'),
  ref = c('A', 'T', 'T', 'G'),
  alt = c('G', 'C', 'C', 'A')
)


# snp_info must have one row per SNP column of df.
stopifnot(ncol(df) == nrow(snp_info))
output <- matrix("", nrow = nrow(df), ncol = ncol(df))
rownames(output) <- rownames(df)

for (i in seq_len(ncol(df))) {
  output[, i] <- gsub("1", snp_info$ref[i], df[, i])
  output[, i] <- gsub("2", snp_info$alt[i], output[, i])
}
output
#>         [,1]  [,2]  [,3]  [,4] 
#> sample1 "A/G" "T/C" "C/T" "A/G"
#> sample2 "G/A" "T/T" "T/C" "A/G"
#> sample3 "G/A" "C/T" "T/T" "G/A"
#> sample4 "A/A" "C/C" "C/C" "A/A"
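
For reuse on the real data, the loop above could be wrapped in a function. The sketch below is one possible variant, not the answerer's code: the name `convert_genotypes` is hypothetical, it swaps the two `gsub` calls for a single `chartr` call per SNP, and it assumes single-character ref and alt alleles. It also labels the columns with the SNP IDs from snp_info.

```r
# Hypothetical wrapper: geno is a samples x SNPs character matrix of "1/2"-style codes.
convert_genotypes <- function(geno, snp_info) {
  stopifnot(ncol(geno) == nrow(snp_info))  # one snp_info row per SNP column
  out <- geno
  for (i in seq_len(ncol(geno))) {
    # chartr maps "1" -> ref and "2" -> alt in one pass.
    out[, i] <- chartr("12", paste0(snp_info$ref[i], snp_info$alt[i]), geno[, i])
  }
  colnames(out) <- snp_info$snp
  out
}

df <- t(data.frame(
  sample1 = c('1/2', '1/2', '2/1', '2/1'),
  sample2 = c('2/1', '1/1', '1/2', '2/1'),
  sample3 = c('2/1', '2/1', '1/1', '1/2'),
  sample4 = c('1/1', '2/2', '2/2', '2/2'),
  stringsAsFactors = FALSE
))
snp_info <- data.frame(
  snp = c('11_524568', '12_542656', '12_558659', '13_8457658'),
  ref = c('A', 'T', 'T', 'G'),
  alt = c('G', 'C', 'C', 'A'),
  stringsAsFactors = FALSE
)

result <- convert_genotypes(df, snp_info)
result["sample1", "12_558659"]  # genotype 2/1 at a T/C SNP
```

Because `chartr` translates both codes at once, the result does not depend on the order of substitution, which would matter if an allele label ever contained the character "1" or "2".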

【Discussion】: