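【Posted】: 2021-03-05 00:52:09
【Question】:
My supervisor and I are writing a research paper on the effects of the temporary decriminalization of indoor sex work in Rhode Island. As part of our data collection, we are trying to pull data from theeroticreview.com on sex worker characteristics, average prices, and a few other variables. There are too many profiles to enter by hand, so I am trying to write an R script to automate the process.
Currently, my code looks like the snippet below. As you can see, I have to type in the name for each profile separately, otherwise I get a "No links have text" error. There are 2,000 observations. XPath has not worked well with the page's formatting.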
library(rvest)

## Set main page
TER <- html_session("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0")

## Locate and follow the link to a profile
reviews <- TER %>% follow_link('Ashley')
## Extract the required information
reviews %>% html_nodes('h1') %>% html_text()
## Go back to the main page
rhea <- reviews %>% back()

## Repeat by hand for every other name on the page
revieww <- TER %>% follow_link("Lily")
revieww %>% html_nodes('h1') %>% html_text()
rhea <- revieww %>% back()
reviewa <- TER %>% follow_link("Coco")
reviewa %>% html_nodes('h1') %>% html_text()
rhea <- reviewa %>% back()

## Move to the next page, keeping the navigated session
TER2 <- TER %>% jump_to('https://www.theeroticreview.com/reviews/newreviewsList.asp?Valid=1&mp=0&SortBy=3&searchreview=1&gCity=region1-us-rhode-island&gDistance=0&gCityName=Rhode%20Island%20(State)&page=2')
reviewd <- TER2 %>% follow_link('Danielle')
reviewd %>% html_nodes('h1') %>% html_text()
In the site's HTML, each link is a td-name. Is there a way to write an algorithm or a function that automates this process?
【Discussion】:
Tags: r web-scraping rvest
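To make the kind of automation I am after more concrete, here is a rough sketch of one possible direction, not a tested solution. It assumes that "td-name" is a CSS class on the table cell wrapping each profile link, so that the selector `.td-name a` matches those links; I have not verified this against the live page. The helper names `extract_profile_links` and `scrape_profile_headings`, and the HTML fragment used for the offline check, are made up for illustration. It also assumes profile pages can be read without a logged-in session.

```r
library(rvest)
library(xml2)

# Collect the absolute URLs of all profile links on one listing page.
# ASSUMPTION: profile links live inside elements with class "td-name";
# adjust `selector` after inspecting the real page.
extract_profile_links <- function(page, base_url, selector = ".td-name a") {
  hrefs <- page %>% html_nodes(selector) %>% html_attr("href")
  hrefs <- hrefs[!is.na(hrefs)]          # drop anchors without an href
  unique(url_absolute(hrefs, base_url))  # resolve relative links
}

# Visit each profile URL and pull out the same <h1> text as the manual code.
# Defined here but not called, because it needs network access.
scrape_profile_headings <- function(profile_urls, delay = 1) {
  lapply(profile_urls, function(u) {
    Sys.sleep(delay)  # pause between requests to be polite to the server
    read_html(u) %>% html_nodes("h1") %>% html_text()
  })
}

# Offline check on a made-up listing fragment (hypothetical hrefs).
demo <- read_html(paste0(
  '<table>',
  '<tr><td class="td-name"><a href="/reviews/show.asp?id=1">Ashley</a></td></tr>',
  '<tr><td class="td-name"><a href="/reviews/show.asp?id=2">Lily</a></td></tr>',
  '</table>'
))
links <- extract_profile_links(demo, "https://www.theeroticreview.com/")
print(links)
```

If something like this works, the listing pages could presumably be generated by appending `&page=1`, `&page=2`, and so on to the URL used in my code above, with `extract_profile_links()` applied to each page and `scrape_profile_headings()` applied to the combined result. Following links by URL rather than by visible text would also avoid the "No links have text" error.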