【问题标题】:R Webscraping from WFP website来自 WFP 网站的 R Webscraping
【发布时间】:2017-12-15 10:05:21
【问题描述】:

我正在使用粮食计划署国家网站 (http://www1.wfp.org/countries) 瞄准网络抓取它,以便建立一个包含定期发布的新闻的数据集,而无需逐页点击。 此外,我会添加一些列,包括关键字计数。 撇开包含国家和网址的脚本部分不谈,我确实会专注于抓取本身。 然而,我正在使用一堆包。

library(rvest) library(stringr) library(tidyr) library(data.table) library(plyr) library(xml2) library(selectr) library(tibble) library(purrr) library(datapasta) library(jsonlite) library(countrycode) library(httr) library(stringi) library(tidyverse) library(dplyr) library(XML)

我已经为另一个网站准备了数据集,它似乎运行良好。 这里的一个助手为这件事提出了一个非常优雅的解决方案,我已经将它与我以前在国家部分的工作整合在一起,并且一切都很好。然而,该解决方案似乎不符合我目前的需要。 然而,我有这个:

## 11. Creating a function in order to scrape data from a website (in this case, WFP's)
wfp_get_news <- function(iso3) {                                                          GET(
url = "http://www1.wfp.org/countries/common/allnews/en/",
query = list(iso3=iso3)
) -> res

warn_for_status(res)

if (status_code(res) > 399) return(NULL)

out <- content(res, as="text", encoding="UTF-8")
out <- jsonlite::fromJSON(out)
out$iso3 <- iso3

tbl_df(out)
}




## 12. Setting all the Country urls in order for them to be automatically scraped 
pb <- progress_estimated(length(countrycode_data$iso3c[]))                                   # THIS TAKES LONG TO BE PROCESSED                         
map_df(countrycode_data$iso3c[], ~{
pb$tick()$print()
Sys.sleep(5) 
wfp_get_news(.x)
}) -> xdf



## 13. Setting keywords (of course, this process is arbitrary: one can    decide any keywor s/he prefers)
keywords <- c("drought", "food security")                                        


keyword_regex <- sprintf("(%s)", paste0(keywords, collapse="|"))




## 14. Setting the keywords search
bind_cols(                                                                                  
xdf,
stri_match_all_regex(tolower(xdf$bodytext), keyword_regex) %>% 
map(~.x[,2]) %>% 
map_df(~{ 
  res <- table(.x, useNA="always")
  nm <- names(res)
  nm <- ifelse(is.na(nm), "NONE", stri_replace_all_regex(nm, "[ -]", "_"))
  as.list(set_names(as.numeric(res), nm))
 })
 ) %>% 
 select(-NONE) -> xdf_with_keyword_counts

特别是,当我运行第 14 点时,如果脚本,我会收到以下错误消息:

Error in overscope_eval_next(overscope, expr) : 
object "NONE" not found
Furthermore: Warning message:
Unknown or uninitialised column: 'bodytext'.

预期的结果应该或多或少是:

> glimpse(xdf_with_keyword_counts)
  Observations: 12,375
  Variables: 12
  $ uid           <chr> "1071595", "1069933", "1069560", "1045264", "1044139", "1038339", "405003", "1052711", NA, "1062329", "1045248", "...
  $ table         <chr> "news", "news", "news", "news", "news", "news", "news", "news", NA, "news", "news", "news", "news", "news", NA, "n...
  $ title         <chr> "Conflicts and drought spur hunger despite strong global food supply", "FAO Calls for Stronger Collaboration on Tr...
  $ date          <chr> "1512640800", "1511823600", "1511737200", "1508191200", "1508104800", "1505980800", "1459461600", "1293836400", NA...
  $ bodytext      <chr> " 7 December 2017, Rome- Strong cereal harvests are keeping global food supplies buoyant, but localised drought, f...
  $ date_format   <chr> "07/12/2017", "28/11/2017", "27/11/2017", "17/10/2017", "16/10/2017", "21/09/2017", "01/04/2016", "01/01/2011", NA...
  $ image         <chr> "http://www.wfp.org...", "http://www.wfp.org...
  $ pid           <chr> "2330", "50840", "16275", "70992", "16275", "2330", "40990", "40990", NA, "53724", "53724", "2330", "53724", "5084...
  $ detail_pid    <chr> "/news/story/en/item/1071595/icode/", "/neareast/news/view/en/c/1069933/", "/asiapacific/news/detail-events/en/c/1...
  $ iso3          <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "ALA", "ALB", "ALB", "ALB", "ALB", "DZA", "ASM", "AND", "A...
  $ drought       <dbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
  $ food_security <dbl> NA, NA, NA, 2, 1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

我希望我说得很清楚。 有什么线索吗?

【问题讨论】:

    标签: r database keyword


    【解决方案1】:

    我认为您在网络抓取中遇到了“陷阱”之一:他们删除了网站上的此功能/路径。

    尝试转到http://www1.wfp.org/countries/common/allnews/en/iso=SLV(萨尔瓦多的新闻页面来自您几天前使用的 cpl 的 URL 方案)。它不存在。

    但是,如果您访问 http://www1.wfp.org/countries/el-salvador,则该页面上有一个指向 http://www.wfp.org/news/el-salvador-177 的链接,即萨尔瓦多新闻项目。

    我认为是相同的内容,只是呈现方式不同,所以只是攻击方式不同:

    library(rvest)
    library(httr)
    library(stringi)
    library(tidyverse)
    

    这是一个助手,因此我们可以获取他们的国家/地区 ID 和名称映射:

    get_countries <- function() {
    
      pg <- read_html("http://www.wfp.org/news/news-releases?tid=All&tid_2=All")
    
      # find the country popup
      country_sel <- html_nodes(pg, "select[name='tid'] option")
    
      # extract ids and name for each country, ignoring "All"
      data_frame(
        cid = html_attr(country_sel, "value"),
        cname = html_text(country_sel)
      ) %>%
        filter(stri_detect_regex(cid, "[[:digit:]]"))
    
    }
    

    这是获取页面上新闻内容的助手

    get_news <- function(cid, tid) {
    
      GET("http://www.wfp.org/news/news-releases",
          query=list(tid=cid, tid_2=tid)) -> res
    
      warn_for_status(res)
    
      if (status_code(res) > 200) return(NULL)
    
      res <- content(res, as="parsed")
    
      # check for no stories by testing for the presence of the
      # div that has the "no stories are found" text
      if (length(html_node(res, "div.view-empty")) != 0) return(NULL)
    
      # find the news item boxes on this page
      items <- html_nodes(res, "div.list-page-item")
    
      # extract the contents
      data_frame(
        cid = cid,
        tid = tid,
        # significant inconsistency in how they assign CSS classes to date boxes
        date = html_text(html_nodes(items, xpath=".//div[contains(@class, 'box-date')]"), trim=TRUE),
        title = html_text(html_nodes(items, "h3"), trim=TRUE),
        # how & where they put summary text in the div is also inconsistent so we
        # need to (unfortunately) include the date and title to ensure we capture it
        # we cld get just the text, but it's more complex code.
        summary = html_text(items, trim=TRUE),
        link = html_attr(html_nodes(items, "h3 a"), "href")
      )
    
    }
    

    现在,我们遍历国家并获取所有故事:

    country_df <- get_countries()
    
    pb <- progress_estimated(length(country_df$cid))
    map_df(country_df$cid, ~{
      pb$tick()$print()
      get_news(.x, "All")
    }) -> news_df
    
    # add in country names
    mutate(news_df, cid = as.character(cid)) %>%
      left_join(country_df) -> news_df
    
    glimpse(news_df)
    ## Observations: 857
    ## Variables: 7
    ## $ cid     <chr> "120", "120", "120", "120", "120", "120", "120", "120", "120", "120"...
    ## $ tid     <chr> "All", "All", "All", "All", "All", "All", "All", "All", "All", "All"...
    ## $ date    <chr> "26 October 2017", "16 October 2017", "2 October 2017", "10 July 201...
    ## $ title   <chr> "US Contribution To Boost WFP Food Assistance And Local Economy In A...
    ## $ summary <chr> "26 October 2017\t\t\r\n\t\t\r\n\tUS Contribution To Boost WFP Food ...
    ## $ link    <chr> "/news/news-release/us-contribution-boost-wfp-food-assistance-and-lo...
    ## $ cname   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghani...
    

    您仍然需要尝试通过调整您拥有的其他代码来对其进行分类,并且您可以使用数据框中的link 来挖掘更多文本用于所述分类。

    注意:这只会获取每个国家/地区的最新新闻页面,但这几乎就是您想要做的事情(检查网络新页面并对其进行分类)。


    现在,我们可以尝试通过遍历国家和弹出主题列表来自动分类故事,因为这些主题似乎是您关心的(其中一些)。您需要相信他们很好地标记了事物。

    注意:这将需要 很长 时间,尤其是在“友善”延迟的情况下,因此我只对它进行了脚手架,并且除了进行轻量级测试以确保它之外没有运行它工作:

    # get topic ids
    get_topics <- function() {
    
      pg <- read_html("http://www.wfp.org/news/news-releases?tid=All&tid_2=All")
    
      # find the topic popup
      country_sel <- html_nodes(pg, "select[name='tid_2'] option")
    
      # extract ids and name for each topic, ignoring "All" and sub-topics
      # i.e. ignore ones that begin with "-"
      data_frame(
        tid = html_attr(country_sel, "value"),
        tname = html_text(country_sel)
      ) %>%
        filter(stri_detect_regex(tid, "[[:digit:]]")) %>%
        filter(tid != "All") # exclude "All" since we're trying to auto-tag
    
    }
    
    topics_df <- get_topics()
    
    pb <- progress_estimated(length(country_df$cid))
    map_df(country_df$cid, ~{
      pb$tick()$print()
      cid <- .x
      Sys.sleep(5) ## NOTE THIS SHOULD REALLY GO IN get_news() but I didn't want to mess with that function for this extra part of the example
      map_df(topics_df$tid, ~get_news(cid, .x))
    }) -> news_with_tagged_topics_df
    
    mutate(news_with_tagged_topics_df, tid = as.character(tid), cid = as.character(cid)) %>% 
      left_join(topics_df) %>% 
      left_join(country_df) %>% 
      glimpse()
    

    我对 3 个国家/地区的随机样本进行了测试:

    ## Observations: 11
    ## Variables: 8
    ## $ cid     <chr> "4790", "4790", "4790", "4790", "4790", "4790", "4790", "152", "152"...
    ## $ tid     <chr> "4488", "3929", "3929", "995", "999", "1005", "1005", "997", "995", ...
    ## $ date    <chr> "16 December 2014", "2 September 2016", "1 October 2014", "1 October...
    ## $ title   <chr> "Russia & WFP Seal Partnership To End Hunger; Kamaz Trucks Rolled Ou...
    ## $ summary <chr> "16 December 2014\t\t\r\n\t\t\r\n\tRussia & WFP Seal Partnership To ...
    ## $ link    <chr> "/news/news-release/russia-wfp-seal-partnership-end-hunger-kamaz-tru...
    ## $ tname   <chr> "Executive Director", "Centre of Excellence against Hunger", "Centre...
    ## $ cname   <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil"...
    

    它确实选择了多种标签:

    unique(news_with_tagged_topics_df$tname)
    ## [1] "Executive Director"                  "Centre of Excellence against Hunger"
    ## [3] "Nutrition"                           "Procurement"                        
    ## [5] "School Meals"                        "Logistics"
    

    【讨论】:

    • 谢谢!这非常有用。我正在经历它,一旦我运行它,我会告诉你。
    • 这非常有用,谢谢。有关如何包含关键字计数的任何建议(请参阅上文了解详细信息)?
    猜你喜欢
    • 1970-01-01
    • 2020-04-13
    • 1970-01-01
    • 1970-01-01
    • 2020-03-16
    • 2016-10-21
    • 2016-02-15
    • 2011-02-09
    • 2019-01-02
    相关资源
    最近更新 更多