【Posted】: 2015-08-02 18:26:55
【Question】:
I am trying to scrape the Washington Post's database on police shootings. Since the content is not plain HTML, I can't use rvest, so I am using RSelenium with phantomjs.
library(RSelenium)
checkForServer()
startServer()
eCap <- list(phantomjs.binary.path = "C:/Program Files/Chrome Driver/phantomjs.exe")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
remDr$navigate("http://www.washingtonpost.com/graphics/national/police-shootings/")
After inspecting the page source, it is clear that the items I am interested in have an id and class like the following:
<div id="js-list-690" class="listWrapper cf">
I can access the text of a single item:
remDr$findElement("css", "#js-list-691")$getElementText()
which returns:
[[1]]
[1] "An unidentified person, a 47-year-old Hispanic man, was shocked with a stun gun and shot on July 30, 2015, in Whittier, Calif. Los Angeles County deputies were investigating a domestic disturbance when he threatened the officers and struck one of them with a metal rod.\nMALEDEADLY WEAPONHISPANIC45 TO 54\nCBS Los AngelesWhittier Daily News"
But if I try to get a list of all of these items:
remDr$findElements("class name", "listWrapper cf")
the call results in an error.
How can I
- get a list of all elements that share the listWrapper cf class?
- return a list of the text associated with each element?
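One possible approach to both points, offered as a sketch rather than a tested solution: Selenium's "class name" locator takes a single class name, so a value containing a space such as listWrapper cf is a likely cause of the error. A CSS selector that chains both classes, .listWrapper.cf, should match the same elements. The helper names find_items and element_texts below are hypothetical, and remDr is assumed to be the open session created in the code above.

```r
# Hypothetical helper: find every element carrying both classes.
# `remDr` is assumed to be an open RSelenium remoteDriver session.
find_items <- function(remDr, selector = ".listWrapper.cf") {
  remDr$findElements(using = "css selector", value = selector)
}

# Hypothetical helper: collect the text of each element in a character vector.
# getElementText() returns a list of length one, so take its first entry.
element_texts <- function(elems) {
  vapply(elems, function(e) e$getElementText()[[1]], character(1))
}

# Usage (needs the live session from the question):
# items <- find_items(remDr)
# texts <- element_texts(items)
# length(texts)
```

Using vapply with character(1) rather than sapply makes the result a plain character vector and fails loudly if any element returns something other than a single string.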
【Comments】:
- Why not just use the data from js.washingtonpost.com/graphics/policeshootings/… directly?
Tags: r selenium selenium-webdriver web-scraping rselenium