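【Posted】:2017-12-15 10:05:21
【Question】:
I am working with the WFP country pages (http://www1.wfp.org/countries) and want to scrape them in order to build a dataset of the regularly published news, without clicking through page by page. I would also add some columns, including keyword counts. Leaving aside the part of the script that handles countries and URLs, I will focus on the scraping itself. I am loading quite a few packages: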
library(rvest)
library(stringr)
library(tidyr)
library(data.table)
library(plyr)
library(xml2)
library(selectr)
library(tibble)
library(purrr)
library(datapasta)
library(jsonlite)
library(countrycode)
library(httr)
library(stringi)
library(tidyverse)
library(dplyr)
library(XML)
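I have already built a dataset for another website, and that seems to work well. Someone here suggested a very elegant solution for it, which I integrated with my earlier work on the country part, and everything was fine. However, that solution does not seem to fit my current needs. This is what I have: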
## 11. Creating a function in order to scrape data from a website (in this case, WFP's)
wfp_get_news <- function(iso3) {
  # Request the news feed for one country, identified by its ISO3 code
  GET(
    url = "http://www1.wfp.org/countries/common/allnews/en/",
    query = list(iso3 = iso3)
  ) -> res
  warn_for_status(res)
  # Skip countries for which the request failed
  if (status_code(res) > 399) return(NULL)
  out <- content(res, as = "text", encoding = "UTF-8")
  out <- jsonlite::fromJSON(out)
  out$iso3 <- iso3
  tbl_df(out)
}
## 12. Setting all the country URLs so that they are scraped automatically
pb <- progress_estimated(length(countrycode_data$iso3c[])) # THIS TAKES A LONG TIME TO PROCESS
map_df(countrycode_data$iso3c[], ~{
  pb$tick()$print()
  Sys.sleep(5)
  wfp_get_news(.x)
}) -> xdf
## 13. Setting keywords (of course, this process is arbitrary: one can choose any keyword s/he prefers)
keywords <- c("drought", "food security")
keyword_regex <- sprintf("(%s)", paste0(keywords, collapse="|"))
## 14. Setting the keywords search
bind_cols(
  xdf,
  stri_match_all_regex(tolower(xdf$bodytext), keyword_regex) %>%
    map(~.x[,2]) %>%
    map_df(~{
      res <- table(.x, useNA="always")
      nm <- names(res)
      nm <- ifelse(is.na(nm), "NONE", stri_replace_all_regex(nm, "[ -]", "_"))
      as.list(set_names(as.numeric(res), nm))
    })
) %>%
  select(-NONE) -> xdf_with_keyword_counts
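For reference, the counting idea in step 14 can be tried on a toy data frame that does not depend on the scraper. This is only a minimal sketch: `toy` and `toy_counts` are made-up names, and it swaps the match-and-tabulate approach for `stri_count_fixed()`, so there is no `NONE` column to drop. Unlike the expected output further down, articles without a match get 0 rather than NA.

```r
library(stringi)

# Toy stand-in for xdf (hypothetical data, not scraped from anywhere).
toy <- data.frame(
  bodytext = c("Drought worsens food security; drought persists.",
               "Nothing relevant here.",
               NA),
  stringsAsFactors = FALSE
)

keywords <- c("drought", "food security")

# One count vector per keyword; a missing body text stays NA.
counts <- lapply(keywords, function(k) {
  stri_count_fixed(tolower(toy$bodytext), k)
})

# Column names follow the same "[ -]" -> "_" convention as step 14.
names(counts) <- stri_replace_all_regex(keywords, "[ -]", "_")

toy_counts <- cbind(toy, as.data.frame(counts))
```

If this works on the toy data but step 14 still fails on `xdf`, the problem is more likely in what the scraper returned than in the counting itself.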
In particular, when I run step 14 of the script, I get the following error message:
Error in overscope_eval_next(overscope, expr) :
object 'NONE' not found
In addition: Warning message:
Unknown or uninitialised column: 'bodytext'.
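The warning indicates that `xdf` has no `bodytext` column, which would in turn leave no keyword table and no `NONE` column for `select()` to drop. A minimal sketch of how this could be checked before step 14 follows; `missing_columns` and `xdf_toy` are hypothetical names used only for illustration, and the toy data frame merely imitates a result that lacks the article fields.

```r
# Hypothetical helper: report which expected columns are absent.
missing_columns <- function(df, expected) {
  setdiff(expected, names(df))
}

# Toy stand-in for a scrape result without article fields (assumption).
xdf_toy <- data.frame(iso3 = c("AFG", "ALB"), stringsAsFactors = FALSE)

missing <- missing_columns(xdf_toy, c("bodytext", "iso3"))
if (length(missing) > 0) {
  message("Missing columns: ", paste(missing, collapse = ", "))
}

# Dropping a column only when it exists avoids the "object not found" error.
xdf_clean <- xdf_toy[, setdiff(names(xdf_toy), "NONE"), drop = FALSE]
```

Running the same check on the real `xdf` (for example with `names(xdf)`) would show which fields the endpoint actually returned.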
The expected result should look more or less like this:
> glimpse(xdf_with_keyword_counts)
Observations: 12,375
Variables: 12
$ uid <chr> "1071595", "1069933", "1069560", "1045264", "1044139", "1038339", "405003", "1052711", NA, "1062329", "1045248", "...
$ table <chr> "news", "news", "news", "news", "news", "news", "news", "news", NA, "news", "news", "news", "news", "news", NA, "n...
$ title <chr> "Conflicts and drought spur hunger despite strong global food supply", "FAO Calls for Stronger Collaboration on Tr...
$ date <chr> "1512640800", "1511823600", "1511737200", "1508191200", "1508104800", "1505980800", "1459461600", "1293836400", NA...
$ bodytext <chr> " 7 December 2017, Rome- Strong cereal harvests are keeping global food supplies buoyant, but localised drought, f...
$ date_format <chr> "07/12/2017", "28/11/2017", "27/11/2017", "17/10/2017", "16/10/2017", "21/09/2017", "01/04/2016", "01/01/2011", NA...
$ image <chr> "http://www.wfp.org...", "http://www.wfp.org...
$ pid <chr> "2330", "50840", "16275", "70992", "16275", "2330", "40990", "40990", NA, "53724", "53724", "2330", "53724", "5084...
$ detail_pid <chr> "/news/story/en/item/1071595/icode/", "/neareast/news/view/en/c/1069933/", "/asiapacific/news/detail-events/en/c/1...
$ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "ALA", "ALB", "ALB", "ALB", "ALB", "DZA", "ASM", "AND", "A...
$ drought <dbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ food_security <dbl> NA, NA, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
I hope I have made myself clear. Any clues?
【Comments】: