【Question title】: Splitting columns by values in a lookup table in R
【Posted】: 2021-07-28 21:45:18
【Question】:

I have a table with one row per hpo_term, so a single patient ID can span many rows.

ID hpo_term
123 kidney failure
123 hand tremor
123 kidney transplant
432 hypertension
432 exotropia
432 scissor gait

I also have two other tables, one of kidney terms and one of non-kidney terms. The kidney one looks like this:

kidney failure
kidney transplant
hypertension

The non-kidney one looks like this:

hand tremor
exotropia
scissor gait

The result I want is a table like this:

ID  kidney_hpo_term                  non_kidney_hpo_term
123 kidney failure;kidney transplant hand tremor
432 hypertension                     exotropia;scissor gait

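In reality there are hundreds of patients and hundreds of HPO terms.

I have access to base R and dplyr, but I really don't know how to approach this.

Many thanks for your help.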

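Edit:

The real table1 has several additional, unrelated columns whose values are always the same within each ID, and I would like to carry those over as well. For example: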

ID  hpo_term          year_of_birth affected_relative genome
123 kidney failure    2000          Y                 38
123 hand tremor       2000          Y                 38
123 kidney transplant 2000          Y                 38
432 hypertension      1980          N                 37
432 exotropia         1980          N                 37
432 scissor gait      1980          N                 37

【Comments】:

  • Could you post dput(data) to make this easier to test?

Tags: r merge data.table lookup


【Solution 1】:

Here is a different approach, using tidyr::pivot_wider with values_fn to do the summarising inside the pivot rather than as a separate step:

library(dplyr); library(tidyr)
pt.data %>% 
   # Label each term up front, so the new column names do not depend on
   # whether a kidney or a non-kidney term happens to appear first
   mutate(kidney = ifelse(hpo_term %in% kidney.hpo, "Kidney", "Non.kidney")) %>%
   pivot_wider(names_from = kidney, values_from = hpo_term,
               values_fn = function(x) paste(x, collapse = ";"), values_fill = NA)
## A tibble: 2 x 3
#     ID Kidney                           Non.kidney            
#  <int> <chr>                            <chr>                 
#1   123 kidney failure;kidney transplant hand tremor           
#2   432 hypertension                     exotropia;scissor gait

Data:

pt.data <- structure(list(ID = c(123L, 123L, 123L, 432L, 432L, 432L), hpo_term = c("kidney failure", "hand tremor", "kidney transplant", "hypertension", "exotropia", "scissor gait")), class = "data.frame", row.names = c(NA, -6L))
kidney.hpo <- c("kidney failure", "kidney transplant", "hypertension")
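The question's edit adds columns (year_of_birth, affected_relative, genome) that are constant within each ID. A minimal sketch of how the same pivot might handle them, assuming the labels are created with ifelse() instead of being renamed afterwards: pivot_wider treats every column not named in names_from or values_from as an identifier, so the extra columns should carry through unchanged. The object names pt.full and pt.wide are made up for this illustration.

```r
library(dplyr); library(tidyr)

# Hypothetical data mirroring the table in the question's edit
pt.full <- data.frame(
  ID = c(123L, 123L, 123L, 432L, 432L, 432L),
  hpo_term = c("kidney failure", "hand tremor", "kidney transplant",
               "hypertension", "exotropia", "scissor gait"),
  year_of_birth = c(2000L, 2000L, 2000L, 1980L, 1980L, 1980L),
  affected_relative = c("Y", "Y", "Y", "N", "N", "N"),
  genome = c(38L, 38L, 38L, 37L, 37L, 37L),
  stringsAsFactors = FALSE
)
kidney.hpo <- c("kidney failure", "kidney transplant", "hypertension")

pt.wide <- pt.full %>%
  # Name the groups directly; these labels become the new column names
  mutate(kidney = ifelse(hpo_term %in% kidney.hpo, "Kidney", "Non.kidney")) %>%
  # ID and the three extra columns are left over, so they act as id columns
  pivot_wider(names_from = kidney, values_from = hpo_term,
              values_fn = function(x) paste(x, collapse = ";"))
pt.wide
```

This only gives one row per patient if the extra columns really are identical within each ID; a patient with two different genome values, say, would be split across two rows.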

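【Comments】:

  • Thanks for taking the time to reply. This works well and I have accepted it, because it removes the need for the second table, which is an approach I hadn't considered and saves me time!
  • Dear @Ian, I have edited the question because the full table is much bigger. Does that change the solution? Many thanks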
【Solution 2】:
library(dplyr); library(tidyr)
patients %>%
  left_join(terms) %>%
  group_by(ID, type) %>%
  summarize(ID.hpo_term = paste(ID.hpo_term, collapse = ", "), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = type, values_from = ID.hpo_term)

Result

Joining, by = "ID.hpo_term"
# A tibble: 2 x 3
     ID kidney_hpo_term                   non_kidney_hpo_term    
  <dbl> <chr>                             <chr>                  
1   123 kidney failure, kidney transplant hand tremor            
2   432 hypertension                      exotropia, scissor gait

Input data

patients <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(123, 123, 123, 432, 432, 432),
  ID.hpo_term = c("kidney failure",
                  "hand tremor","kidney transplant","hypertension",
                  "exotropia","scissor gait")
)


terms <- data.frame(
  stringsAsFactors = FALSE,
  type = rep(c("kidney_hpo_term", "non_kidney_hpo_term"), each = 3),
  ID.hpo_term = c("kidney failure", "kidney transplant",
                  "hypertension",
                  "hand tremor","exotropia","scissor gait")
)
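
The question supplies the kidney and non-kidney terms as two separate lists, whereas this answer joins against a single lookup table with a type column. A minimal sketch of building such a lookup from two character vectors follows; kidney_terms, non_kidney_terms and terms_lookup are hypothetical names standing in for the asker's own objects.

```r
# Hypothetical stand-ins for the asker's two term tables
kidney_terms <- c("kidney failure", "kidney transplant", "hypertension")
non_kidney_terms <- c("hand tremor", "exotropia", "scissor gait")

# Stack the two lists into one lookup, tagging each term with its type;
# `times` lets the two lists have different lengths
terms_lookup <- data.frame(
  type = rep(c("kidney_hpo_term", "non_kidney_hpo_term"),
             times = c(length(kidney_terms), length(non_kidney_terms))),
  ID.hpo_term = c(kidney_terms, non_kidney_terms),
  stringsAsFactors = FALSE
)

# A term listed in both vectors would duplicate patient rows in the join
stopifnot(!anyDuplicated(terms_lookup$ID.hpo_term))
```

The final check is optional, but a term that appears in both lists would otherwise silently show up in both output columns after the left_join.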

【Comments】:

  • Thanks for taking the time to reply!
【Solution 3】:

Here is a dplyr solution:

library(dplyr)
library(tidyr)  # spread() comes from tidyr, not dplyr

table1 = data.frame(ID = c(123,123,123,432,432,432),
                    hpo_term = c("kidney failure","hand tremor","kidney transplant","hypertension","exotropia","scissor gait"))

kid_terms = c("kidney failure","kidney transplant","hypertension")
nonkid_terms = c("hand tremor","exotropia","scissor gait")

table1$term_type = NA
table1$term_type[table1$hpo_term %in% kid_terms] = "kidney_hpo_term"
table1$term_type[table1$hpo_term %in% nonkid_terms] = "non_kidney_hpo_term"

table2 = table1 %>% group_by(ID,term_type) %>%
  summarize(term_list=paste(hpo_term,collapse=";")) %>%
  spread(term_type,term_list)

> table2
    ID kidney_hpo_term                  non_kidney_hpo_term   
1   123 kidney failure;kidney transplant hand tremor           
2   432 hypertension                     exotropia;scissor gait

Here is a data.table solution:

library(data.table)

table1 = data.table(ID = c(123,123,123,432,432,432),
                    hpo_term = c("kidney failure","hand tremor","kidney transplant","hypertension","exotropia","scissor gait"))

kid_terms = c("kidney failure","kidney transplant","hypertension")
nonkid_terms = c("hand tremor","exotropia","scissor gait")

table1$term_type = NA
table1$term_type[table1$hpo_term %in% kid_terms] = "kidney_hpo_term"
table1$term_type[table1$hpo_term %in% nonkid_terms] = "non_kidney_hpo_term"

table2 = table1[,.(term_list=paste(hpo_term,collapse=";")),by=.(ID,term_type)]

table3 = dcast(table2, ID~term_type, value.var = "term_list")

> table3
    ID                  kidney_hpo_term    non_kidney_hpo_term
1: 123 kidney failure;kidney transplant            hand tremor
2: 432                     hypertension exotropia;scissor gait
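
The question mentions that base R is available. As a rough package-free sketch of the same group-and-paste idea (not part of the answer above), aggregate() can collapse the terms per ID for each list and merge() can join the two results. The names collapse_by_id, kidney_wide, non_kidney_wide and result are invented for this example, and it assumes each list matches at least one row of the table.

```r
table1 <- data.frame(
  ID = c(123, 123, 123, 432, 432, 432),
  hpo_term = c("kidney failure", "hand tremor", "kidney transplant",
               "hypertension", "exotropia", "scissor gait"),
  stringsAsFactors = FALSE
)
kid_terms <- c("kidney failure", "kidney transplant", "hypertension")
nonkid_terms <- c("hand tremor", "exotropia", "scissor gait")

# Hypothetical helper: keep only rows whose term is in `terms`, paste the
# terms together per ID, and rename the pasted column to `col_name`.
# aggregate() stops with an error if no row matches `terms`.
collapse_by_id <- function(df, terms, col_name) {
  out <- aggregate(hpo_term ~ ID, data = df[df$hpo_term %in% terms, ],
                   FUN = paste, collapse = ";")
  names(out)[names(out) == "hpo_term"] <- col_name
  out
}

kidney_wide <- collapse_by_id(table1, kid_terms, "kidney_hpo_term")
non_kidney_wide <- collapse_by_id(table1, nonkid_terms, "non_kidney_hpo_term")

# all = TRUE keeps patients who only have terms from one of the two lists
result <- merge(kidney_wide, non_kidney_wide, by = "ID", all = TRUE)
result
```

Patients with terms in only one list get NA in the other column, which matches what dcast produces above for missing combinations.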

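【Comments】:

  • Works well, thanks for taking the time to reply. I accepted the other answer because I hadn't thought of summarising instead of keeping two HPO lists.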