【Posted】: 2021-09-05 06:20:10
【Question】:
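I have two columns holding about 4,000 parent and child IDs for different cancers, and next to them I created two columns titled "Child_Name" and "Parent_Name". df1 looks like this: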
| Parent ID | Child ID | Parent Name | Child Name |
|---|---|---|---|
| D015179 | D003110 | | |
| D018307 | D002294 | | |
In another data frame (df2), the parent and child IDs all sit in a single column, with their names in the adjacent column:
| Cancer_ID | Cancer |
|---|---|
| D015179 | Colorectal Neoplasms |
| D018307 | Neoplasms, Squamous Cell |
| D002294 | Carcinoma, Squamous Cell |
| D003110 | Colonic Neoplasms |
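Essentially, I want to fill in Parent Name and Child Name in df1 by looking up the corresponding Cancer_ID in df2 and placing the cancer name in the parent or child name column respectively, so that it looks like this: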
| Parent ID | Child ID | Parent Name | Child Name |
|---|---|---|---|
| D015179 | D003110 | Colorectal Neoplasms | Colonic Neoplasms |
| D018307 | D002294 | Neoplasms, Squamous Cell | Carcinoma, Squamous Cell |
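I believe there may be a dplyr solution for this, but I haven't been able to come up with anything reliable. As always, any help would be greatly appreciated!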
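A minimal dplyr sketch of the lookup described above, using only the four example rows from the tables. It assumes the column names from the dput output (`Parent_ID`, `Child_ID`, `Cancer_ID`, `Cancer`), that `Cancer_ID` is unique in df2, and that the empty placeholder name columns have been dropped first; `result` is a name chosen here for illustration.

```r
library(dplyr)

# Toy data mirroring the example tables (placeholder name columns omitted)
df1 <- data.frame(
  Parent_ID = c("D015179", "D018307"),
  Child_ID  = c("D003110", "D002294"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Cancer_ID = c("D015179", "D018307", "D002294", "D003110"),
  Cancer    = c("Colorectal Neoplasms", "Neoplasms, Squamous Cell",
                "Carcinoma, Squamous Cell", "Colonic Neoplasms"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  # Look up the parent name: Parent_ID in df1 matches Cancer_ID in df2
  left_join(df2, by = c("Parent_ID" = "Cancer_ID")) %>%
  rename(Parent_Name = Cancer) %>%
  # Look up the child name the same way
  left_join(df2, by = c("Child_ID" = "Cancer_ID")) %>%
  rename(Child_Name = Cancer)

print(result)
```

On the real df1, the existing all-NA `Parent_Name` and `Child_Name` columns would clash with the `rename()` calls, so something like `select(df1, Parent_ID, Child_ID)` would probably be needed before the joins. If `Cancer_ID` is repeated in df2, each join would also duplicate rows.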
Here is the dput() output for the first 20 rows of df1 and df2 respectively, which can be used directly (I hope I've presented these in the right format):
df1:
structure(list(Parent_ID = c("D015179", "D015179", "D001932",
"D002528", "D018307", "D018307", "D003110", "D012004", "D015179",
"D015179", "D009442", "D009455", "D018358", "D018358", "D018295",
"D018295", "D001984", "D001984", "D010235", "D010235"), Child_ID = c("D003110",
"D003110", "D002528", "D001932", "D002294", "D002294", "D012004",
"D003110", "D012004", "D012004", "D009455", "D009442", "D018278",
"D018278", "D002280", "D002280", "D002283", "D002283", "D010236",
"D010236"), Child_Name = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Parent_Name = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA)), row.names = c(1L, 2L, 3L, 4L, 7L, 8L, 11L, 12L,
15L, 16L, 27L, 28L, 31L, 32L, 37L, 38L, 39L, 40L, 53L, 54L), class = "data.frame")
df2:
structure(list(Cancer_ID = c("D009369", "D003560", "D001845",
"D017824", "D007570", "D009631", "D009807", "D003803", "D018333",
"D010509", "D011842", "D001935", "D047688", "D001994", "D017043",
"D015529", "D004814", "D004934", "D005497", "D045888"), Cancer = c("Neoplasms",
"Cysts", "Bone Cysts", "Bone Cysts, Aneurysmal", "Jaw Cysts",
"Nonodontogenic Cysts", "Odontogenic Cysts", "Dentigerous Cyst",
"Odontogenic Cyst, Calcifying", "Periodontal Cyst", "Radicular Cyst",
"Branchioma", "Breast Cyst", "Bronchogenic Cyst", "Chalazion",
"Choledochal Cyst", "Epidermal Cyst", "Esophageal Cyst", "Follicular Cyst",
"Ganglion Cysts")), row.names = c(NA, 20L), class = "data.frame")
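Note that none of the IDs in the 20-row df1 excerpt appear to occur in the 20-row df2 excerpt, so a lookup on those excerpts alone would return `NA` everywhere. The base R sketch below therefore uses a small hand-built excerpt in the same shape as the dput output; the third row is deliberately left without a match in df2, and `missing_ids` is a name introduced here for illustration.

```r
# Small excerpt shaped like the dput output (placeholder name columns start as NA)
df1 <- data.frame(
  Parent_ID   = c("D015179", "D018307", "D001932"),
  Child_ID    = c("D003110", "D002294", "D002528"),
  Child_Name  = NA,
  Parent_Name = NA,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Cancer_ID = c("D015179", "D018307", "D002294", "D003110"),
  Cancer    = c("Colorectal Neoplasms", "Neoplasms, Squamous Cell",
                "Carcinoma, Squamous Cell", "Colonic Neoplasms"),
  stringsAsFactors = FALSE
)

# match() gives the row position of each ID in df2$Cancer_ID (NA when absent),
# which is then used to index the Cancer column
df1$Parent_Name <- df2$Cancer[match(df1$Parent_ID, df2$Cancer_ID)]
df1$Child_Name  <- df2$Cancer[match(df1$Child_ID,  df2$Cancer_ID)]

# IDs that have no entry in df2 and therefore stay NA
missing_ids <- setdiff(c(df1$Parent_ID, df1$Child_ID), df2$Cancer_ID)

print(df1)
print(missing_ids)
```

Because `match()` only returns the first hit, this keeps df1 at its original number of rows even if an ID were repeated in df2.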
【Comments】:
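- I'm a bit confused by your naming convention. Are Parent ID and Cancer ID the same identifier? Could you dput the data.frames so we can use them directly?
- You could try setDT from library(data.table). Please attach your data as data frames, e.g. df1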
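The `setDT` suggestion could be read as pointing toward a data.table update join. The following is only a sketch of that idea on the example rows, not necessarily what the commenter had in mind; it assumes the name columns do not exist yet (or are already of character type) so that `:=` can create or fill them.

```r
library(data.table)

# Toy data mirroring the example tables (name columns are created by the joins)
df1 <- data.frame(
  Parent_ID = c("D015179", "D018307"),
  Child_ID  = c("D003110", "D002294"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Cancer_ID = c("D015179", "D018307", "D002294", "D003110"),
  Cancer    = c("Colorectal Neoplasms", "Neoplasms, Squamous Cell",
                "Carcinoma, Squamous Cell", "Colonic Neoplasms"),
  stringsAsFactors = FALSE
)

# Convert both data frames to data.tables by reference
setDT(df1)
setDT(df2)

# Update joins: for rows of df1 whose ID matches df2$Cancer_ID,
# assign df2's Cancer value (i.Cancer) into a column of df1 in place
df1[df2, on = .(Parent_ID = Cancer_ID), Parent_Name := i.Cancer]
df1[df2, on = .(Child_ID  = Cancer_ID), Child_Name  := i.Cancer]

print(df1)
```

The update join modifies df1 in place and leaves its row order and row count unchanged; IDs without a match in df2 simply remain `NA`.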
Tags: r dataframe dplyr data-wrangling