Reuters data scraping in R with rvest: finding the CSS selector

【Question title】: Reuters data scraping in R with rvest, find CSS selector
【Posted】: 2023-04-06 21:40:01
【Question】:

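Yes, I know there are similar questions; I have read the answers and tried the ones I was able to implement. So apologies in advance if this question is a silly one :)

I am collecting the ages of board members from Reuters for a list of companies. Here is the link: http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT

I am using the rvest library and SelectorGadget to find the right CSS selector. The code is below: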

library(rvest)
d = read_html("http://www.reuters.com/finance/stocks/companyOfficers?symbol=GAZP.RTS")

d %>% html_nodes("#companyNews:nth-child(1) td:nth-child(2)") %>% html_text()

The result is:

character(0)

I think my CSS selector is wrong. Could you tell me how to select the table?

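【Comments】:

"Without our prior written consent, you may not remove, alter, forward, scrape, copy, sell, distribute, retransmit, create derivative works from, or otherwise make the content available to third parties." Unless you can provide documentation to the contrary, it seems unlikely that you are doing this purely for your own edification.
I am doing this for my thesis (board experience ~ company performance), so I will not be making the content available to third parties. Thanks for pointing this out, though. I will ask them whether I can use aggregated data for this purpose. I also believe I can use the data myself, since my organization subscribes to their service.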

Tags: r web-scraping css-selectors html-parsing rvest

【Solution 1】:

You need to use html_session to get the data to load properly:

library(rvest)

url <- 'http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT.O'
site <- html_session(url) %>% read_html()

site %>% html_node('#companyNews:first-child table') %>% html_table()

##                     Name Age Since                                  Current Position
## 1          John Thompson  66  2014                 Independent Chairman of the Board
## 2         Bradford Smith  57  2015                    President, Chief Legal Officer
## 3          Satya Nadella  48  2014                 Chief Executive Officer, Director
## 4          William Gates  60  2014          Founder and Technology Advisor, Director
## 5               Amy Hood  43  2013 Chief Financial Officer, Executive Vice President
## 6  Christopher Capossela  45  2014 Executive Vice President, Chief Marketing Officer
## 7         Kathleen Hogan  49  2014        Executive Vice President - Human Resources
## 8       Margaret Johnson  54  2014   Executive Vice President - Business Development
## 9           Ifeanyi Amah  NA  2016                          Chief Technology Officer
## 10         Keith Lorizio  NA  2016              Vice President - North America Sales
## 11       Teri List-Stoll  53  2014                              Independent Director
## 12       G. Mason Morfit  40  2014                              Independent Director
## 13         Charles Noski  63  2003                              Independent Director
## 14          Helmut Panke  69  2003                              Independent Director
## 15        Charles Scharf  50  2014                              Independent Director
## 16          John Stanton  60  2014                              Independent Director
## 17             Chris Suh  NA    NA              General Manager - Investor Relations
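Since the question is about the ages specifically, here is a minimal sketch of pulling the Age column out of a table shaped like the one above. The `officers` data frame is built by hand from three rows of the printed output so the snippet runs without network access. The variable names `ages`, `known_ages`, and `mean_age` are illustrative and do not come from the original answer.

```r
# Three rows copied from the output above, standing in for the scraped table
officers <- data.frame(
  Name  = c("John Thompson", "Satya Nadella", "Ifeanyi Amah"),
  Age   = c(66L, 48L, NA),
  Since = c(2014L, 2014L, 2016L),
  stringsAsFactors = FALSE
)

# The Age column is an ordinary vector; officers without a listed age are NA
ages <- officers$Age

# Drop the missing ages before summarising
known_ages <- ages[!is.na(ages)]
mean_age <- mean(known_ages)
```

With a real scrape, the same lines would apply to whatever `html_table()` returns, assuming the column is still named Age.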

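【Comments】:

Thank you very much! How did you work out that the CSS selector should be #companyNews:first-child table?
SelectorGadget makes decent guesses but rarely returns the best selector, so I looked at the HTML and tried a few options. The tables have no unique IDs, so the selector has to be relative, and there are two tables in div#companyNews, so I used :first-child to subset. The rvest documentation links to a short, fun tutorial that teaches you almost everything you need to know.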
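The reasoning in the comment above can be tried offline on a small hand-written page. This is only a sketch: the HTML below is made up for illustration and is not Reuters' actual markup. It also assumes rvest 1.0.0 or later, where `minimal_html()` is available and `html_element()` / `html_elements()` are the current names for `html_node()` / `html_nodes()`. Here `:first-child` is attached to `table` to say "the table that is the first child of its parent"; the answer attaches it to `#companyNews`, which is what worked on the live page.

```r
library(rvest)

# Hypothetical markup: one div with an ID, holding two tables that have no IDs
page <- minimal_html('
  <div id="companyNews">
    <table>
      <tr><th>Name</th><th>Age</th></tr>
      <tr><td>A. Example</td><td>50</td></tr>
      <tr><td>B. Example</td><td>61</td></tr>
    </table>
    <table>
      <tr><th>Name</th><th>Description</th></tr>
      <tr><td>A. Example</td><td>Chairman</td></tr>
    </table>
  </div>
')

# A relative selector matches both tables inside the div
n_tables <- length(page %>% html_elements("#companyNews table"))

# html_element() (singular) keeps only the first match, then html_table() parses it
officers <- page %>% html_element("#companyNews table") %>% html_table()

# :first-child restricts the match to the first table; td:nth-child(2) is the Age cell
ages <- page %>%
  html_elements("#companyNews table:first-child td:nth-child(2)") %>%
  html_text()
```

In this sketch `html_text()` returns the ages as character strings, while `html_table()` converts the Age column to numbers, so the table route is usually the more convenient one for analysis.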