【Question title】: Filling in multiple columns based on a conditional?
【Posted】: 2021-09-05 06:20:10
【Question】:

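I have two columns containing about 4000 parent and child IDs for different cancers, and I created two empty columns next to them titled "Child_Name" and "Parent_Name". df1 looks like this: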

Parent ID   Child ID   Parent Name   Child Name
D015179     D003110
D018307     D002294

In another data frame (df2), the parent and child IDs are all in one column, with their names in the adjacent column:

Cancer_ID   Cancer
D015179     Colorectal Neoplasms
D018307     Neoplasms, Squamous Cell
D002294     Carcinoma, Squamous Cell
D003110     Colonic Neoplasms

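Essentially, I want to fill in Parent Name and Child Name in df1 by finding the corresponding Cancer_ID in df2 and putting the cancer name into the child or parent name column as appropriate, so that it looks like this: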

Parent ID   Child ID   Parent Name                Child Name
D015179     D003110    Colorectal Neoplasms       Colonic Neoplasms
D018307     D002294    Neoplasms, Squamous Cell   Carcinoma, Squamous Cell

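I believe there is probably a dplyr solution for this, but I have not been able to come up with anything reliable. As always, any help would be much appreciated!

Here is the dput() output for the first 20 rows of df1 and df2 respectively, which can be used directly (I hope I am presenting these in the right format):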

df1:

structure(list(Parent_ID = c("D015179", "D015179", "D001932", 
"D002528", "D018307", "D018307", "D003110", "D012004", "D015179", 
"D015179", "D009442", "D009455", "D018358", "D018358", "D018295", 
"D018295", "D001984", "D001984", "D010235", "D010235"), Child_ID = c("D003110", 
"D003110", "D002528", "D001932", "D002294", "D002294", "D012004", 
"D003110", "D012004", "D012004", "D009455", "D009442", "D018278", 
"D018278", "D002280", "D002280", "D002283", "D002283", "D010236", 
"D010236"), Child_Name = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Parent_Name = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA)), row.names = c(1L, 2L, 3L, 4L, 7L, 8L, 11L, 12L, 
15L, 16L, 27L, 28L, 31L, 32L, 37L, 38L, 39L, 40L, 53L, 54L), class = "data.frame")

df2:

structure(list(Cancer_ID = c("D009369", "D003560", "D001845", 
"D017824", "D007570", "D009631", "D009807", "D003803", "D018333", 
"D010509", "D011842", "D001935", "D047688", "D001994", "D017043", 
"D015529", "D004814", "D004934", "D005497", "D045888"), Cancer = c("Neoplasms", 
"Cysts", "Bone Cysts", "Bone Cysts, Aneurysmal", "Jaw Cysts", 
"Nonodontogenic Cysts", "Odontogenic Cysts", "Dentigerous Cyst", 
"Odontogenic Cyst, Calcifying", "Periodontal Cyst", "Radicular Cyst", 
"Branchioma", "Breast Cyst", "Bronchogenic Cyst", "Chalazion", 
"Choledochal Cyst", "Epidermal Cyst", "Esophageal Cyst", "Follicular Cyst", 
"Ganglion Cysts")), row.names = c(NA, 20L), class = "data.frame")

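【Comments on the question】:

  • I am a bit confused by your naming convention. Are Parent ID and Cancer ID the same kind of identifier? Could you dput() the data.frames so that we can use them directly?
  • You could try setDT from library(data.table). Please attach your data as data frames, e.g. df1.

Tags: r dataframe dplyr data-wrangling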


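【Solution 1】:

A slightly different approach from my friend's answer: use match() inside across(), and gsub() in its .names = argument to build the new column names.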

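# Example data: one row per parent/child pair
df <- read.table(header = TRUE, text = "Parent_ID  Child_ID
D015179 D003110
D018307 D002294")

# Lookup table mapping each ID to its cancer name
lookup <- read.table(header = TRUE, text = "Cancer_ID  Cancer
D015179 'Colorectal Neoplasms'
D018307 'Neoplasms, Squamous Cell'
D002294 'Carcinoma, Squamous Cell'
D003110 'Colonic Neoplasms'")

library(dplyr, warn.conflicts = FALSE)

# For every ID column, look up the matching name in `lookup`;
# .names turns "Parent_ID" into "Parent_name" and "Child_ID" into "Child_name"
df %>% mutate(across(everything(), ~ lookup$Cancer[match(.x, lookup$Cancer_ID)],
                     .names = '{gsub("_ID", "_name", .col)}'))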

#>   Parent_ID Child_ID              Parent_name               Child_name
#> 1   D015179  D003110     Colorectal Neoplasms        Colonic Neoplasms
#> 2   D018307  D002294 Neoplasms, Squamous Cell Carcinoma, Squamous Cell

Created on 2021-06-21 by the reprex package (v2.0.0)
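
To make the lookup step easier to follow, here is a minimal base R sketch of what match() does for each ID column. The ID "D999999" is made up for illustration and is not part of the original data; it shows that an ID missing from the lookup table yields NA rather than an error. This is worth keeping in mind with the dput() samples in the question, where the first 20 rows of df2 happen to contain none of the IDs that appear in df1.

```r
# Small example data; "D999999" is a hypothetical ID with no entry in `lookup`
df <- data.frame(
  Parent_ID = c("D015179", "D018307", "D999999"),
  Child_ID  = c("D003110", "D002294", "D003110"),
  stringsAsFactors = FALSE
)

lookup <- data.frame(
  Cancer_ID = c("D015179", "D018307", "D002294", "D003110"),
  Cancer    = c("Colorectal Neoplasms", "Neoplasms, Squamous Cell",
                "Carcinoma, Squamous Cell", "Colonic Neoplasms"),
  stringsAsFactors = FALSE
)

# match() returns, for each ID, its row position in lookup$Cancer_ID (NA if absent)
idx <- match(df$Parent_ID, lookup$Cancer_ID)

# Indexing the name column by those positions pulls out the names in df's row order
df$Parent_Name <- lookup$Cancer[idx]
df$Child_Name  <- lookup$Cancer[match(df$Child_ID, lookup$Cancer_ID)]

print(df)
```

Because match() only returns the first matching position, this assumes each Cancer_ID appears once in the lookup table.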

【Comments】:

  • Thanks AnilGoyal, I modified your code slightly, but this works!
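
The comment does not say what was changed, so the following is only a guess at one adjustment that may be needed. The df1 in the question already has empty Parent_Name and Child_Name columns, so across(everything(), ...) would also process those columns and add separate lowercase Parent_name and Child_name columns. A sketch that restricts the lookup to the ID columns and writes into the existing columns, using a two-row stand-in for df1, could look like this:

```r
library(dplyr, warn.conflicts = FALSE)

# Two-row stand-in for the question's df1, including its pre-existing empty name columns
df1 <- data.frame(
  Parent_ID   = c("D015179", "D018307"),
  Child_ID    = c("D003110", "D002294"),
  Child_Name  = NA,
  Parent_Name = NA,
  stringsAsFactors = FALSE
)

lookup <- data.frame(
  Cancer_ID = c("D015179", "D018307", "D002294", "D003110"),
  Cancer    = c("Colorectal Neoplasms", "Neoplasms, Squamous Cell",
                "Carcinoma, Squamous Cell", "Colonic Neoplasms"),
  stringsAsFactors = FALSE
)

# Only the *_ID columns are looked up; the generated names "Parent_Name" and
# "Child_Name" match the existing columns, so those are overwritten in place
filled <- df1 %>%
  mutate(across(ends_with("_ID"),
                ~ lookup$Cancer[match(.x, lookup$Cancer_ID)],
                .names = '{gsub("_ID", "_Name", .col)}'))

print(filled)
```

The column order of df1 is preserved, and no extra columns are created.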