Extracting Affiliation data from a single PubMed record

【Question title】: Extracting Affiliation data from a single PubMed record
【Posted】: 2020-08-13 16:12:53
【Question】:

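Using easyPubMed and a lot of searching, I have managed to extract affiliation data from a single PubMed record (I am still very new to R). The problem is that the result contains only part of the affiliation information, which I assume is because the non-standardised strings hold several kinds of information.

My code is as follows: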

#PubMed query via easyPubMed using the URL of the XML
library(easyPubMed)

my_query <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20301425&retmode=xml"
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, format = "abstract")
print(my_abstracts_txt[1:16])


my_abstracts_xml <- fetch_pubmed_data(my_entrez_id)
class(my_abstracts_xml)


# my_titles is never defined in the code as posted, so this call fails
print(my_titles)


#EasyPubMed Extracting Affiliation data from a single PubMed Record

#Convert XML PubMed records to strings using the articles_to_list function
#Each record in the list is a string that still includes XML tags
my_PM_list <- articles_to_list(my_abstracts_xml)
class(my_PM_list[[4]])
cat(substr(my_PM_list[[4]], 1, 984))

#Affiliation can be extracted from a specific record using the custom_grep() function
#The fields extracted from the record will be returned as elements of a list or a character vector

curr_PM_record <- my_PM_list[[(length(my_PM_list) - 3)]]
Affiliation_Info.data <- custom_grep(curr_PM_record, tag = "AffiliationInfo")

View(Affiliation_Info.data)



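Ideally, I would like to end up with a data frame such as: PMID : Author : Affiliation

(but for now I am only focusing on extracting all of the affiliation information from the URL posted above)

I am really struggling to get this done and would appreciate any help.

Thanks in advance!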

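【Question discussion】:

Per the r tag (hover or click to view): use dput() for data and specify all non-base packages with library() calls. For reproducibility, please show us a sample of the XML data, or an extract of what these package (?) function calls return. We cannot see any of your class, print, cat or View results.

Tags: r xml string url pubmed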

【Solution 1】:

Here is an xml2 approach...

library( xml2 )
library( magrittr )

#read the xml-data
doc <- xml2::read_xml( "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20301425&retmode=xml" )

pmid    <- xml2::xml_find_first( doc, ".//PMID") %>% xml2::xml_text()
authors <- paste( 
  xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/LastName") %>% xml2::xml_text(),
  xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/ForeName") %>% xml2::xml_text(),
  sep = ", " )
affiliate <- xml2::xml_find_all( doc, ".//AuthorList[@Type = 'authors']/Author/AffiliationInfo/Affiliation") %>% xml2::xml_text()

df <- data.frame( pmid = pmid, authors = authors, affiliate = affiliate )

This produces a data frame with the columns pmid, authors and affiliate.
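
The XPath queries above return authors and affiliations as separate vectors, so the columns may stop lining up if an author has no affiliation or more than one. A minimal sketch of one way around this is to loop over the Author nodes and query each one on its own. The XML string below is a hypothetical miniature record made up for illustration (it is not real PubMed data), and the names `xml_txt`, `author_nodes`, `rows` and `df2` are likewise illustrative.

```r
library(xml2)

# Hypothetical miniature record, for illustration only (not real PubMed data)
xml_txt <- '<PubmedBookArticle><BookDocument><PMID>1</PMID>
<AuthorList Type="authors">
<Author><LastName>Doe</LastName><ForeName>Jane</ForeName>
<AffiliationInfo><Affiliation>Dept A, City, State</Affiliation></AffiliationInfo>
<AffiliationInfo><Affiliation>Dept B, City, State</Affiliation></AffiliationInfo>
</Author>
<Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
</AuthorList></BookDocument></PubmedBookArticle>'

doc2 <- read_xml(xml_txt)
pmid2 <- xml_text(xml_find_first(doc2, ".//PMID"))
author_nodes <- xml_find_all(doc2, ".//AuthorList[@Type = 'authors']/Author")

# Build one row per author; affiliations are looked up relative to that author
rows <- lapply(seq_along(author_nodes), function(i) {
  a <- author_nodes[[i]]
  affs <- xml_text(xml_find_all(a, "./AffiliationInfo/Affiliation"))
  data.frame(
    pmid = pmid2,
    author = paste(xml_text(xml_find_first(a, "./LastName")),
                   xml_text(xml_find_first(a, "./ForeName")), sep = ", "),
    # several affiliations are joined with "; ", none becomes NA
    affiliate = if (length(affs) > 0) paste(affs, collapse = "; ") else NA_character_,
    stringsAsFactors = FALSE
  )
})
df2 <- do.call(rbind, rows)
```

Joining multiple affiliations with "; " is only one possible choice; emitting one row per author and affiliation pair would work just as well if a long format is preferred.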

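【Discussion】:

Thank you! That makes a lot of sense, and working through these steps made it much clearer to me.
Sorry, me again. Do you know whether it is possible to extract the country/state from the affiliation into a fourth column?
Then you will have to split the affiliation string after the second comma... something like this: gsub( ".*, ([a-zA-Z]+, [a-zA-Z]+$)", "\\1", "Moffitt Cancer Center, Tampa, Florida") But there are many (possibly smarter) ways to achieve the same result.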
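
As a rough sketch of the idea in the last comment, the same kind of pattern can be applied to a whole column to add the fourth one. The pattern in the comment only matches single-word city and state names, so the variant below uses `[^,]+` to allow spaces and strips a trailing period first. The second affiliation string and the names `affiliate_demo`, `cleaned`, `location`, `state` and `df_demo` are hypothetical, chosen only for illustration.

```r
# First string is from the comment above; the second is a made-up example
affiliate_demo <- c("Moffitt Cancer Center, Tampa, Florida",
                    "Some Institute, Some City, New York.")

# Drop a trailing period (and any whitespace after it), if present
cleaned <- sub("\\.\\s*$", "", affiliate_demo)

# Keep the last two comma-separated fields, e.g. "Tampa, Florida"
location <- sub("^.*,\\s*([^,]+),\\s*([^,]+)$", "\\1, \\2", cleaned)

# Keep only the last comma-separated field, e.g. "Florida"
state <- sub("^.*,\\s*", "", cleaned)

df_demo <- data.frame(affiliate = affiliate_demo,
                      location = location,
                      state = state,
                      stringsAsFactors = FALSE)
```

Strings with fewer than three comma-separated fields do not match the `location` pattern and are returned unchanged by `sub()`, so real affiliation data would likely need extra checks.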