【Question title】: R function to merge rows by id and create separate columns
【Posted】: 2021-01-12 15:58:54
【Question description】:

I have a list of articles obtained from an API, and my data frame looks like this:

PMID        Year     Title                  Journal         Author 
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles
33326729    2020     Avelumab Maintenance   PLoS biology    B., Huang
33326729    2020     Avelumab Maintenance   PLoS biology    A., Di Pietro

I need to merge it into this:

PMID        Year     Title                  Journal         Author-1         Author-2     Author-3
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles       B., Huang    A., Di Pietro

So basically, I need to merge the authors of each article into a single row. I thought of grouping by id, as follows:

test <- setDT(PubMed_df)[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]

Outputs:
33326729    2020,2020,2020     Avelumab Maintenance,Avelumab Maintenance,Avelumab Maintenance   PLoS biology,PLoS biology,PLoS biology    T., Powles,B., Huang,A., Di Pietro

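However, this produces comma-separated values within each cell instead of separate columns. Does anyone know a different function, or how to adjust the setDT call, to get my expected result? Thanks in advance.

Edit: output of dput(head(PubMed_df)), as requested: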

structure(list(pmid = c("33326729", "33326729", "33326729", "33320856", 
"33320856", "33320856"), year = c("2020", "2020", "2020", "2021", 
"2021", "2021"), month = c("12", "12", "12", "01", "01", "01"
), day = c("21", "21", "21", "07", "07", "07"), lastname = c("Powles", 
"Huang", "di Pietro", "Reijns", "Thompson", "Acosta"), firstname = c("Thomas", 
"Bo", "Alessandra", "Martin A M", "Louise", "Juan Carlos"), address = c("St. Bartholomew's Hospital, London, United Kingdom thomas.powles1@nhs", 
"Pfizer, Groton, CT", "Pfizer, Milan, Italy", "MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom", 
"The South East of Scotland Clinical Genetic Service, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom", 
"Cancer Research UK Edinburgh Centre, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom"
), journal = c("The New England journal of medicine", "The New England journal of medicine", 
"The New England journal of medicine", "PLoS biology", "PLoS biology", 
"PLoS biology"), title = c("Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"Avelumab Maintenance for Urothelial Carcinoma. Reply.", "Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection."
), abstract = c(NA, NA, NA, "", "", ""), doi = c("10.1056/NEJMc2032018", 
"10.1056/NEJMc2032018", "10.1056/NEJMc2032018", "10.1371/journal.pbio.3001030", 
"10.1371/journal.pbio.3001030", "10.1371/journal.pbio.3001030"
), keywords = c("Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2"
)), row.names = c(NA, 6L
), class = c("data.table", "data.frame"))

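Edit 2: very detailed and specific requirements:

I need to convert the data whose head is shown above into a form where each row has: PMID | Publication date | Author 1 | Affiliation | Address | City | State (if US) | Country | Author 2 | Author 2's affiliation | Address | City | State (if US) | Country | and so on for each co-author | Journal | Title | Abstract* | MH terms

I will have to break up the address, but that is something to focus on later. For now, my goal is to get all the information for each author attached to the right article, instead of having 3 rows for the same article.

Edit 2 - to get the answer from @r2evans to work in my case: the provided answer works if you call dcast as data.table::dcast!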

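【Comments】:

  • @RuiBarradas I guess in a sense that is what I am trying to do. I also tried the reshape2 function, as explained here, but I could not get it to work.
  • If you provide the output from dput(head(PubMed_df)), it will be much easier for us to test with your data; it is unambiguous, and it makes it easier for us to work around the embedded spaces in your data, which break simple copy/read.table ingestion in R.
  • @r2evans Thanks for your help! I am looking at your proposed answer right now. Although I have not tried it yet (I will after writing this comment), it looks promising. I have also added the data you requested to my question. I do see one complication: you specified Author, whereas I need this for multiple variables (I only picked author because it is the most important one and would make the question more straightforward). But hopefully I can figure that out myself. Thanks!
  • @r2evans No worries! Thank you very much for the help already! I will try to figure out what it means and how to fix it. You have already gotten me further than I would have gotten on my own! Thanks a lot!

Tags: r lapply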


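【Solution 1】:

This is mostly a dupe of Rui's comment, but it helps to add a helper column to get there (I'll use row here). Since you started with data.table, I'll stick with that.

Edited to use the updated data. (I'm assuming pmid uniquely defines the groups.)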

library(data.table)
setDT(PubMed_df)
PubMed_df[, row := seq_len(.N), by = .(pmid)]
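
As a side note, the per-group counter can also be produced with data.table's rowid() helper. The sketch below uses a small made-up table (DT and its pmid values are placeholders, not the question's data) to illustrate that both spellings give the same numbering:

```r
library(data.table)

# Hypothetical toy table: three rows for one article, two for another.
DT <- data.table(pmid = c("a", "a", "a", "b", "b"))

# Counter as in the answer: 1..N within each pmid group.
DT[, row := seq_len(.N), by = .(pmid)]

# Same counter via rowid(), which numbers repeated values of pmid.
DT[, row2 := rowid(pmid)]

print(DT)
```

Either form gives dcast a column that says "this is the k-th author of this article", which is what becomes the column suffix in the wide result.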

And in Über-wide format:

dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
#        pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1                               address_2                               address_3
#      <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>                                  <char>                                  <char>
# 1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns   Thompson     Acosta  Martin A M      Louise Juan Carlos MRC Human Genetics Unit, MRC Institu... The South East of Scotland Clinical ... Cancer Research UK Edinburgh Centre,...
# 2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ...                      Pfizer, Groton, CT                    Pfizer, Milan, Italy

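Note that when a paper has fewer authors than the maximum number of authors in the dataset, its extra columns will be empty/NA. For example, if I remove rows 5-6 and do the same thing,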

PubMed_df <- PubMed_df[1:4,]
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
#        pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1          address_2            address_3
#      <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>             <char>               <char>
# 1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns       <NA>       <NA>  Martin A M        <NA>        <NA> MRC Human Genetics Unit, MRC Institu...               <NA>                 <NA>
# 2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy
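
If the goal is the Author-1, Author-2, ... layout from the top of the question, one possible variation is to paste the name parts together first and cast that single column. This is only a sketch on a made-up table: the authors object, the combined author column, and the shortened titles are illustrative assumptions, not part of the original data. The call is written as data.table::dcast, following the question's note that the namespaced form was what worked.

```r
library(data.table)

# Hypothetical toy data with the same shape as the question's table.
authors <- data.table(
  pmid      = c("33326729", "33326729", "33326729", "33320856"),
  title     = c("Avelumab Maintenance", "Avelumab Maintenance",
                "Avelumab Maintenance", "A sensitive assay"),
  firstname = c("Thomas", "Bo", "Alessandra", "Martin"),
  lastname  = c("Powles", "Huang", "di Pietro", "Reijns")
)

# One combined name per row, plus the per-article counter.
authors[, author := paste(firstname, lastname)]
authors[, row := rowid(pmid)]

# Namespaced call, so a dcast() from another attached package cannot mask it.
wide <- data.table::dcast(authors, pmid + title ~ row, value.var = "author")

# With a single value.var the new columns are named "1", "2", ...; rename them.
author_cols <- setdiff(names(wide), c("pmid", "title"))
setnames(wide, author_cols, paste0("Author-", author_cols))

print(wide)
```

As in the answer's second example, the article with only one author gets NA in the remaining Author-N columns.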
