[Posted]: 2021-01-12 15:58:54
[Question]:
I have a list of articles obtained from an API, and my data frame looks like this:
PMID Year Title Journal Author
33326729 2020 Avelumab Maintenance PLoS biology T., Powles
33326729 2020 Avelumab Maintenance PLoS biology B., Huang
33326729 2020 Avelumab Maintenance PLoS biology A., Di Pietro
I need to merge it into this:
PMID Year Title Journal Author-1 Author-2 Author-3
33326729 2020 Avelumab Maintenance PLoS biology T., Powles B., Huang A., Di Pietro
So basically, I need to combine an article's authors into a single row. I thought of grouping by id, like this:
test <- setDT(PubMed_df)[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]
Outputs:
33326729 2020,2020,2020 Avelumab Maintenance,Avelumab Maintenance,Avelumab Maintenance PLoS biology,PLoS biology,PLoS biology T., Powles,B., Huang,A., Di Pietro
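However, this produces comma-separated values within each column rather than separate columns. Does anyone know a different function, or how to adjust the setDT call, to get my expected result? Thanks in advance.
Edit:
As requested, the output of dput(head(PubMed_df)):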
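For reference, the grouped toString() call above collapses every column into one string per article, which is why the result is comma-separated. A minimal sketch of one way to get separate author columns instead is to number the authors within each article and reshape to wide format. The toy data and the helper column author_col below are assumptions made for illustration; they are not part of the original data.

```r
library(data.table)

# Toy data mirroring the example table in the question (illustrative values)
PubMed_df <- data.table(
  pmid    = c("33326729", "33326729", "33326729"),
  year    = "2020",
  title   = "Avelumab Maintenance",
  journal = "PLoS biology",
  author  = c("T., Powles", "B., Huang", "A., Di Pietro")
)

# author_col is a hypothetical helper column: rowid(pmid) numbers the
# authors 1, 2, 3 within each article, in their original row order
PubMed_df[, author_col := paste0("Author-", rowid(pmid))]

# Spread the authors into one column per position, one row per article
wide <- data.table::dcast(
  PubMed_df,
  pmid + year + title + journal ~ author_col,
  value.var = "author"
)
print(wide)
```

The new columns are ordered alphabetically by name, so with ten or more authors "Author-10" would sort before "Author-2"; zero-padding the counter is one way around that.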
structure(list(pmid = c("33326729", "33326729", "33326729", "33320856",
"33320856", "33320856"), year = c("2020", "2020", "2020", "2021",
"2021", "2021"), month = c("12", "12", "12", "01", "01", "01"
), day = c("21", "21", "21", "07", "07", "07"), lastname = c("Powles",
"Huang", "di Pietro", "Reijns", "Thompson", "Acosta"), firstname = c("Thomas",
"Bo", "Alessandra", "Martin A M", "Louise", "Juan Carlos"), address = c("St. Bartholomew's Hospital, London, United Kingdom thomas.powles1@nhs",
"Pfizer, Groton, CT", "Pfizer, Milan, Italy", "MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom",
"The South East of Scotland Clinical Genetic Service, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom",
"Cancer Research UK Edinburgh Centre, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom"
), journal = c("The New England journal of medicine", "The New England journal of medicine",
"The New England journal of medicine", "PLoS biology", "PLoS biology",
"PLoS biology"), title = c("Avelumab Maintenance for Urothelial Carcinoma. Reply.",
"Avelumab Maintenance for Urothelial Carcinoma. Reply.", "Avelumab Maintenance for Urothelial Carcinoma. Reply.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection."
), abstract = c(NA, NA, NA, "", "", ""), doi = c("10.1056/NEJMc2032018",
"10.1056/NEJMc2032018", "10.1056/NEJMc2032018", "10.1371/journal.pbio.3001030",
"10.1371/journal.pbio.3001030", "10.1371/journal.pbio.3001030"
), keywords = c("Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2"
)), row.names = c(NA, 6L
), class = c("data.table", "data.frame"))
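Edit 2: a very detailed and specific requirement:
I need to convert the data whose head is shown above into a form where each row has: PMID | publication date | author 1 | affiliation | address | city | state (if US) | country | author 2 | author 2's affiliation | address | city | state (if US) | country | and so on for each co-author | journal | title | abstract* | MH terms
I will have to break up the address, but that is something to focus on later. For now, my goal is to get all of each author's information attached to the correct article, instead of three rows for the same article.
Edit 2 - on getting the answer from @r2evans to work in my case: the provided answer works if you call dcast as data.table::dcast!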
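The answer from @r2evans is not reproduced here, so the following is only a sketch of how a namespaced data.table::dcast call could carry several per-author fields at once, as Edit 2 asks. It assumes the column names from the dput output (pmid, lastname, firstname, address), shortened address strings, and a hypothetical counter column n. One plausible reason the explicit namespace matters is that another attached package, such as reshape2, also exports a dcast that accepts only a single value.var.

```r
library(data.table)

# Shortened toy version of the dput data: two articles, three author rows
authors <- data.table(
  pmid      = c("33326729", "33326729", "33320856"),
  lastname  = c("Powles", "Huang", "Reijns"),
  firstname = c("Thomas", "Bo", "Martin A M"),
  address   = c("St. Bartholomew's Hospital, London",
                "Pfizer, Groton, CT",
                "MRC Human Genetics Unit, Edinburgh")
)

# n is a hypothetical helper: the author's position within each article
authors[, n := rowid(pmid)]

# Passing several value.var columns yields columns named
# <value.var>_<n>, e.g. lastname_1, firstname_2, address_1
wide <- data.table::dcast(
  authors,
  pmid ~ n,
  value.var = c("lastname", "firstname", "address")
)
print(names(wide))
```

Articles with fewer authors than the maximum get NA in the unused author columns. Article-level fields such as journal and title can be added to the left-hand side of the formula so that they stay as single columns.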
[Comments]:
-
@RuiBarradas I think that is, in a sense, what I am trying to do. I also tried the reshape2 functions, as explained here, but I could not get them to work.
-
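It would be much easier for us to test with your data if you provided the output of dput(head(PubMed_df)); it is unambiguous, and it saves us from dealing with the embedded spaces in the data, which break a simple copy/read.table assignment in R.
-
@r2evans Thanks for your help! I am looking at your proposed answer now. Although I have not tried it yet (I will once I finish this comment), it looks promising. I have also added the data you asked for to my question. I do see one complication: you specified Author, whereas I need this for several variables (I only picked author because it is the most important and keeps things more straightforward). But hopefully I can figure that out myself. Thanks!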
-
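@r2evans No worries! Thank you very much for your help already! I will try to figure out what that means and how to solve it. You have already gotten me further than I would have gotten on my own! Many thanks!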