【Question title】: How to retrieve a specific element of a website by scraping with R rvest?
【Posted】: 2018-10-09 15:21:36
【Question】:

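I am looking for help with the following website, for a molecule I am working with: chebi_molecule

I want to extract exactly this text (from the "Roles Classification" table, roughly in the middle of the page):

Biological Role: serotonergic agonist. An agent that has an affinity for serotonin receptors and is able to mimic the effects of serotonin by stimulating the physiologic activity at the cell receptors. Serotonin agonists are used as antidepressants, anxiolytics, and in the treatment of migraine disorders.

Application: serotonergic agonist. An agent that has an affinity for serotonin receptors and is able to mimic the effects of serotonin by stimulating the physiologic activity at the cell receptors. Serotonin agonists are used as antidepressants, anxiolytics, and in the treatment of migraine disorders.

I tried to get the XPath with Firebug v2.0.19 for Firefox, but after pasting it into rvest's html_nodes I could not retrieve anything.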

xpath_bio <- ".//*[@id='content']/table[2]/tbody/tr/td/table[3]/tbody/tr[2]/td[2]/div"

xpath_appl <- ".//*[@id='content']/table[2]/tbody/tr/td/table[3]/tbody/tr[4]/td[2]/div[2]"

When I try:

bio <- rvest::read_html(site) %>% html_nodes(xpath = xpath_bio)

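I get an empty result.

Can you help me with this? How can I get these texts? I looked around at other questions, but I could not find a solution that works. Thanks.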

【Comments】:

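Try html_nodes(".chebiTableContent:nth-child(9)").
Are you sure you are allowed to scrape this site? Just because you can does not mean you should. Unethical scraping could get you into serious trouble. This is meant purely as friendly advice!
My interest is purely academic; I do not intend to scrape this site or any other site in bulk.
Thank you very much! MrFlick, could you explain how your solution works? Why 9?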
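
On the "why 9?" question: in CSS, `:nth-child(n)` matches an element only if it is the n-th element child of its parent, so `.chebiTableContent:nth-child(9)` would pick an element with that class that happens to be the ninth child of its parent. The sketch below shows this on a made-up fragment (the markup, class names, and positions are hypothetical and are not taken from the ChEBI page):

```r
library(rvest)

# Hypothetical fragment: three <p> children, only two of them carry class "item".
html <- '<html><body><div id="parent">
  <p class="other">first</p>
  <p class="item">second</p>
  <p class="item">third</p>
</div></body></html>'

page <- read_html(html)

# Matches: the third child of its parent has class "item".
third <- page %>% html_nodes(".item:nth-child(3)") %>% html_text()

# No match: the first child exists, but it does not have class "item".
first <- page %>% html_nodes(".item:nth-child(1)") %>% html_text()

print(third)
print(length(first))
```

The index counts all element children of the parent, not just the ones that carry the class, which is why the number can look arbitrary until the surrounding markup is inspected.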

Tags: html r web-scraping rvest

【Solution 1】:

I have not used rvest before, but have you tried the Inspect feature (Ctrl+Shift+I in Chrome)? I inspected the "Biological Role" element on the site and got this HTML:

<a href="chebiOntology.do;jsessionid=8D8CE11C3CA44298C0BC62921779562B?chebiId=CHEBI:24432" target="_blank">Biological Role</a>

So just use a regular expression to find the position of the string "target="_blank">Biological Role", and then look for the corresponding "class="roleDefinition"" string that follows it.

<div class="roleDefinition">An agent that has an affinity for serotonin receptors and is able to mimic the effects of serotonin by stimulating the physiologic activity at the cell receptors. Serotonin agonists are used as antidepressants, anxiolytics, and in the treatment of migraine disorders.</div>
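
As an alternative to a regular expression, the same element can probably be reached in rvest by selecting on the class name. The sketch below runs on a simplified stand-in fragment (hypothetical markup, not the real page). It also illustrates one possible reason a browser-copied XPath returns nothing: browsers add `<tbody>` elements to the DOM they display, while the parser behind `read_html` keeps the source as written. Whether that is what happens on the ChEBI page is an assumption.

```r
library(rvest)

# Simplified stand-in for the page (hypothetical markup, no <tbody> in the source).
html <- '<html><body><table>
  <tr>
    <td><a href="#" target="_blank">Biological Role</a></td>
    <td><div class="roleDefinition">An agent that has an affinity for serotonin receptors.</div></td>
  </tr>
</table></body></html>'

page <- read_html(html)

# Select by class rather than by position in the table.
role_text <- page %>%
  html_nodes("div.roleDefinition") %>%
  html_text(trim = TRUE)

# A browser-style XPath that assumes <tbody> finds nothing in the parsed source,
# while the same path without <tbody> does.
with_tbody <- page %>% html_nodes(xpath = "//table/tbody/tr/td/div")
without_tbody <- page %>% html_nodes(xpath = "//table/tr/td/div")

print(role_text)
print(length(with_tbody))
print(length(without_tbody))
```

If the real page has several `roleDefinition` blocks, `html_nodes` returns all of them, and the wanted one still has to be picked by index or by its neighbouring label.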

【Comments】: