Extract attributes from similar nodes in OpenIR

【Question title】: Extract attributes from similar nodes in OpenIR
【Posted】: 2017-11-12 10:35:35
【Question description】:

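The goal of this task is to extract the "href" of each paper title from a search-results page of an IR (institutional repository) and collect them in a data frame. The results page is not well structured: the paper title, issue information, authors, and download button all sit in the same field, separated only by "span" (between "title", "issue", and "authors") and "sup" (within "authors").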

results<-"http://ir.las.ac.cn/handle/12502/8473/browse?type=dateissued"
library(rvest)
resultsource <- read_html(results)
itemLine <- html_node(resultsource, xpath ='//tr[@class="itemLine"]')
# gather labels and values of item metadata in miscTable2
titleLine <- html_nodes(itemLine, xpath ='//span/a[@href][@target]')
titlehref <- xml_attrs(titleLine, "href")
resultstxt <- html_text(titleLine, trim = TRUE)

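The program above runs without errors, but "titleLine" contains a lot of redundant nodes, while "titlehref" has only one match for the class "itemLine" and contains no URL at all. My questions are: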

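How do I locate the href of the paper titles precisely? I used a second-level "html_nodes" to hold all the target hrefs. However, the "href" under the "sup" tags is still in "titleLine", and so is "target". Can we use the "target" attribute to locate the right "href" without having it appear in "titleLine"?
How do I locate an attribute with a complex "value"? In the program above I only use "href". I previously tried the "xpath style", but it did not help. I would like to use a namespace to identify the papers' URLs, but it seems that ns can only be extracted from the "xmlns" attribute and cannot be assigned manually (as in titlehref <- xml_attrs(titleLine, "href", ns="http://ir.las.ac.cn/handle")).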

How can I fit the structure of this IR to get the right result? Thank you very much.

【Question discussion】:

Tags: r rvest xml2

【Solution 1】:

You can index the <span> you want, as well as the <td>:

library(rvest)

pg <- read_html("http://ir.las.ac.cn/handle/12502/8473/browse?type=dateissued")

html_nodes(pg, xpath=".//tr[@class='itemLine']/td[2]/span[1]/a") %>% 
  html_text()
##  [1] "Data-driven Discovery: A New Era of Exploiting the Literature and Data"                                                                                       
##  [2] "Contents Index to Volume 1"                                                                                                                                   
##  [3] "Topic Detection Based on Weak Tie Analysis: A Case Study of LIS Research"                                                                                     
##  [4] "Open Peer Review in Scientific Publishing: A Web Mining Study of <i>PeerJ</i> Authors and Reviewers"                                                          
##  [5] "Mapping Diversity of Publication Patterns in the Social Sciences and Humanities: An Approach Making Use of Fuzzy Cluster Analysis"                            
##  [6] "Under-reporting of Adverse Events in the Biomedical Literature"                                                                                               
##  [7] "Predictive Characteristics of Co-authorship Networks: Comparing the Unweighted, Weighted, and Bipartite Cases"                                                
##  [8] "International Conference on Scientometrics & Informetrics October16-20, 2017, Wuhan · China"                                                                  
##  [9] "Identification and Analysis of Multi-tasking Product Information Search Sessions with Query Logs"                                                             
## [10] "The 1<sup>st</sup> International Conference on Datadriven Knowledge Discovery: When Data Science Meets Information Science. June 19-22, 2016, Beijing · China"
## [11] "The Power-weakness Ratios (PWR) as a Journal Indicator: Testing the “Tournaments” Metaphor in Citation Impact Studies"                                        
## [12] "Document Type Profiles in <i>Nature, Science</i>, and <i>PNAS</i>: Journal and Country Level"                                                                 
## [13] "Can Automatic Classification Help to Increase Accuracy in Data Collection?"                                                                                   
## [14] "Knowledge Representation in Patient Safety Reporting: An Ontological Approach"                                                                                
## [15] "Information Science Roles in the Emerging Field of Data Science"                                                                                              
## [16] "Data Science Altmetrics"                                                                                                                                      
## [17] "Comparative Study of Trace Metrics between Bibliometrics and Patentometrics"                                                                                  
## [18] "Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation"                                       
## [19] "Mining Related Articles for Automatic Journal Cataloging"                                                                                                     
## [20] "Critical Factors for Personal Cloud Storage Adoption in ChinaCritical Factors for Personal Cloud Storage Adoption in China"

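The HTML markup in the output above (like "`...") is a mistake on the site's end (it also shows up in the rendered browser view). I think someone went overboard with XSS prevention.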
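A minimal sketch of why the question's "titleLine" picked up extra nodes while the indexed XPath above does not. It runs on a small, made-up HTML fragment that only imitates the row layout described in the question (the real page's markup, the "/handle/0000/…" and "/author/…" paths, and the names `first_row`, `all_links`, `row_links`, and `title_links` are assumptions for illustration). In XPath, a leading `//` restarts the search from the document root even when a node is passed in, whereas `.//` stays inside that node.

```r
library(rvest)

# Hypothetical fragment imitating the result rows (not the real page markup).
html <- '
<html><body><table>
  <tr class="itemLine">
    <td>1</td>
    <td>
      <span><a href="/handle/0000/1" target="_blank"><strong>Paper A</strong></a></span>
      <span><i>Issue 1</i></span>
      <span><a href="/author/x" target="_blank">Author X</a><sup>1</sup></span>
    </td>
  </tr>
  <tr class="itemLine">
    <td>2</td>
    <td>
      <span><a href="/handle/0000/2" target="_blank"><strong>Paper B</strong></a></span>
      <span><i>Issue 2</i></span>
      <span><a href="/author/y" target="_blank">Author Y</a></span>
    </td>
  </tr>
</table></body></html>'
pg <- read_html(html)

# html_node() (singular) returns only the first matching row.
first_row <- html_node(pg, xpath = "//tr[@class='itemLine']")

# A leading "//" ignores first_row and searches the whole document.
all_links <- html_nodes(first_row, xpath = "//span/a[@href][@target]")

# A leading ".//" restricts the search to first_row.
row_links <- html_nodes(first_row, xpath = ".//span/a[@href][@target]")

# Positional indexing: second cell, first span, in every row.
title_links <- html_nodes(pg, xpath = ".//tr[@class='itemLine']/td[2]/span[1]/a")

length(all_links)    # title and author links from every row
length(row_links)    # title and author links from the first row only
html_text(title_links, trim = TRUE)
```

Indexing by position is brittle if the site changes its layout, but on a page where the title and author links carry identical attributes it is one of the few ways to tell them apart.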

【Discussion】:

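Strangely, your answer was posted first but shows up last. From it I learned how to pick out one of several similar elements within a node. Thank you very much. But when I modified it to extract the papers' URLs, the result included another, unrelated attribute, "target", even though I had specified the attribute as "href": itemu% html_attrs() If I move "[@href]" into the function "html_attrs()", an error occurs. Does this mean the pipe cannot pass along just one of the attributes? Thanks.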
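The behaviour described in this comment is probably down to the difference between `html_attrs()` (plural) and `html_attr()` (singular) rather than to the pipe. The XPath predicate `[@href]` only filters for nodes that have an href; it does not select the attribute itself. `html_attrs()` then returns every attribute of each node, while `html_attr(x, "href")` returns just the named one. Likewise, the second argument of `xml_attrs()` in the question is `ns`, not an attribute name. A minimal sketch on a made-up fragment (the markup and the "/handle/0000/…" paths are assumptions, not the real page):

```r
library(rvest)

# Hypothetical fragment imitating two result rows (not the real page markup).
html <- '
<html><body><table>
  <tr class="itemLine"><td>1</td>
    <td><span><a href="/handle/0000/1" target="_blank">Paper A</a></span></td></tr>
  <tr class="itemLine"><td>2</td>
    <td><span><a href="/handle/0000/2" target="_blank">Paper B</a></span></td></tr>
</table></body></html>'
pg <- read_html(html)

title_links <- html_nodes(pg, xpath = ".//tr[@class='itemLine']/td[2]/span[1]/a")

# Plural: a list with one named vector per node, holding ALL attributes.
all_attrs <- html_attrs(title_links)
names(all_attrs[[1]])   # includes "target" as well as "href"

# Singular: a character vector with only the requested attribute.
hrefs <- html_attr(title_links, "href")

# Titles and hrefs side by side, as the question asked for.
papers <- data.frame(
  title = html_text(title_links, trim = TRUE),
  href = hrefs,
  stringsAsFactors = FALSE
)
papers
```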

【Solution 2】:

Try this.

library(rvest)
url<-"http://ir.las.ac.cn/handle/12502/8473/browse?type=dateissued"
page<-html_session(url)

# DATA EXTRACTION
title<-html_nodes(page,css="strong") %>% html_text()
title<-title[5:length(title)]
download_link<-html_nodes(page, css= "span:nth-child(7) a+ a") %>% html_attr("href")
issue_information<-html_nodes(page, css= "i") %>% html_text()
authors<-html_nodes(page,css=".itemLine span:nth-child(5)") %>% html_text()

# CONVERT TO DATA FRAME
k<-data.frame(title,download_link,issue_information,authors)

Run the code on each results page to build the complete data frame.

To locate the different elements I used the "SELECTOR GADGET" Chrome add-in and then used the selectors it produced in the code.
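For readers comparing the two answers: rvest accepts either a CSS selector or an XPath expression, and it translates CSS to XPath internally (via the selectr package). The sketch below shows a CSS selector that should pick the same title links as the positional XPath from Solution 1. It runs on a made-up fragment, so the markup, the "/handle/0000/…" paths, and the selector itself are assumptions to be checked against the real page.

```r
library(rvest)

# Hypothetical fragment imitating two result rows (not the real page markup).
html <- '
<html><body><table>
  <tr class="itemLine"><td>1</td>
    <td><span><a href="/handle/0000/1">Paper A</a></span>
        <span><a href="/author/x">Author X</a></span></td></tr>
  <tr class="itemLine"><td>2</td>
    <td><span><a href="/handle/0000/2">Paper B</a></span>
        <span><a href="/author/y">Author Y</a></span></td></tr>
</table></body></html>'
pg <- read_html(html)

# XPath: second <td>, first <span>, then its <a>.
xpath_links <- html_nodes(pg, xpath = ".//tr[@class='itemLine']/td[2]/span[1]/a")

# CSS: :nth-of-type(2) and :first-of-type play the role of [2] and [1].
css_links <- html_nodes(
  pg,
  css = "tr.itemLine > td:nth-of-type(2) > span:first-of-type > a"
)

html_attr(xpath_links, "href")
html_attr(css_links, "href")
```

Selectors generated by a point-and-click tool are often tied to the exact page they were recorded on, so a hand-written selector anchored on a class such as "itemLine" may hold up better across pages.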

【Discussion】:

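May I ask where I can find further documentation on the "CSS selectors" used in the "xml2" package? Is this technique common to all web-analysis tools and browsers? Thank you very much.
People most often use SELECTOR GADGET for web scraping. It is available here: chrome.google.com/webstore/detail/selectorgadget/…
Your program works, but it does not give the result I want. I want to get the URLs of those papers' metadata pages, which are recorded in an attribute of the "paper title" node. These "href" values seem hard to scrape. Thanks for your help.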
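One way to get from the title nodes to full metadata-page URLs, sketched under the assumption that the title links carry site-relative hrefs (the fragment and the "/handle/0000/…" paths below are made up; only the base URL comes from the question). `xml2::url_absolute()` resolves each href against the URL of the page it was scraped from.

```r
library(rvest)

base_url <- "http://ir.las.ac.cn/handle/12502/8473/browse?type=dateissued"

# Hypothetical fragment imitating two result rows (not the real page markup).
html <- '
<html><body><table>
  <tr class="itemLine"><td>1</td>
    <td><span><a href="/handle/0000/1" target="_blank">Paper A</a></span></td></tr>
  <tr class="itemLine"><td>2</td>
    <td><span><a href="/handle/0000/2" target="_blank">Paper B</a></span></td></tr>
</table></body></html>'
pg <- read_html(html)

title_links <- html_nodes(pg, xpath = ".//tr[@class='itemLine']/td[2]/span[1]/a")

# Resolve the relative hrefs against the results-page URL.
meta_urls <- xml2::url_absolute(html_attr(title_links, "href"), base_url)
meta_urls
```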