[Question title]: Webscraping from list of URLs in dataframe in R
[Posted]: 2020-01-07 05:25:11
[Question]:

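I have a somewhat complex task: I need to look up a series of URLs held in a data frame, scrape some data from each URL, and then add that data back to the original data frame. I seem to have solved the hardest part (the scraping), but I am stuck on how to automate the task (which I suspect is probably quite simple).

Here's the situation: I have a data.frame with 12 variables and 44,000 rows. One of these variables, Programme_Synopsis_url, contains the URL of a programme on BBC iPlayer.

I need to go to that URL, extract one piece of data (the channel), and add it to a new column called Channel.

Here is some sample data (apologies for the size/complexity of this sample, but I think it's necessary to share it in order to get the right solution):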

df <- structure(list(Title = structure(c(3L, 7L, 5L, 2L, 6L, 6L, 1L, 
4L, 9L, 8L), .Label = c("Asian Provocateur", "Cuckoo", "Dragons' Den", 
"In The Flesh", "Keeping Faith", "Lost Boys? What's Going Wrong For Asian Men", 
"One Hot Summer", "Travels in Trumpland with Ed Balls", "Two Pints of Lager and a Packet of Crisps"
), class = "factor"), Series = structure(c(1L, 1L, 1L, 3L, 1L, 
1L, 2L, 2L, 1L, 1L), .Label = c("", "Series 1-2", "Series 4"), class = "factor"), 
    Programme_Synopsis = structure(c(2L, 5L, 4L, 6L, 1L, 1L, 
    8L, 7L, 9L, 3L), .Label = c("", "1. The Dragons are back - with big money on the table.", 
    "1/3 Proud. Meeting rednecks", "1/8 Faith questions everything when her husband goes missing", 
    "4/6 What Happens in Ibiza... Is Megan really a party animal?", 
    "Box Set. Dale plans to propose – but what does Ken think?", 
    "Box Set. For the undead... life begins again", "Box Set. Romesh... and mum", 
    "Series 1-9. Box Set"), class = "factor"), Programme_Synopsis_url = structure(c(6L, 
    9L, 4L, 8L, 1L, 1L, 3L, 7L, 2L, 5L), .Label = c("", "https://www.bbc.co.uk/iplayer/episode/b00747zt/two-pints-of-lager-and-a-packet-of-crisps-series-1-1-fags-shags-and-kebabs", 
    "https://www.bbc.co.uk/iplayer/episode/b06fq3x4/asian-provocateur-series-1-1-uncle-thiru", 
    "https://www.bbc.co.uk/iplayer/episode/b09rjsq5/keeping-faith-series-1-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/b0bdpvhf/travels-in-trumpland-with-ed-balls-series-1-1-proud", 
    "https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/p00szzcp/in-the-flesh-series-1-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/p06f52g1/cuckoo-series-4-1-lawyer-of-the-year", 
    "https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza"
    ), class = "factor"), Programme_Duration = structure(c(6L, 
    4L, 6L, 1L, 6L, 6L, 2L, 5L, 3L, 6L), .Label = c("25 mins", 
    "28 mins", "29 mins", "40 mins", "56 mins", "59 mins"), class = "factor"), 
    Programme_Availability = structure(c(4L, 2L, 1L, 6L, 4L, 
    4L, 5L, 6L, 5L, 3L), .Label = c("Available for 1 month", 
    "Available for 11 months", "Available for 17 days", "Available for 28 days", 
    "Available for 3 months", "Available for 5 months"), class = "factor"), 
    Programme_Category = structure(c(2L, 2L, 2L, 2L, 2L, 3L, 
    1L, 1L, 1L, 1L), .Label = c("Box Sets", "Featured", "Most Popular"
    ), class = "factor"), Programme_Genre = structure(c(4L, 2L, 
    3L, 5L, 2L, 2L, 1L, 3L, 1L, 2L), .Label = c("Comedy", "Documentary", 
    "Drama", "Entertainment", "New SeriesComedy"), class = "factor"), 
    date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
    ), .Label = "13/08/2018", class = "factor"), rank = c(1L, 
    2L, 3L, 4L, 5L, 12L, 1L, 2L, 3L, 4L), row = c(1L, 1L, 1L, 
    1L, 1L, 3L, 4L, 4L, 4L, 4L), Box_Set = structure(c(1L, 1L, 
    1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))

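To make things more complicated (!), there are two different types of URL. Some point to a programme's episode page and some point to the main programme page (nothing in the URL syntax distinguishes the two). This matters because the data I want to scrape (the channel name) sits in a different place depending on whether the page is an episode page or a programme's main page. I have written a script that gets this data for each type of page: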

library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

### Get Channel for programme page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1'
### Then, locate details of Channel via xpath ###
channel <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>% html_text()

### Confirm Channel details ###
print(channel)


### Get Channel for episode page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza'
### Then, locate details of Channel via xpath ###
channel <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>% html_text()

### Confirm Channel details ###
print(channel)

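The question is: how do I automate this, looping through each URL (roughly 44,000), extracting this data, and adding it to a new column called Channel?

A few final concerns/caveats/questions:

  1. Will looking up and scraping data from 44,000 URLs cause any technical problems? I don't want to kill the BBC's servers or get my IP blocked for doing this! I have checked their website's terms and conditions and found no mention of scraping.
  2. It may help to point out that although I need to check about 44,000 rows (URLs), many of them are duplicates. So I wonder whether it would be better to first create a new data frame with the duplicates removed (e.g. based on the Programme_Synopsis_url or Title column). That way I would need to scrape far fewer URLs and could then merge the data back into the original data frame, i.e. if Title matches, copy the value from the Channel column of the slimmed-down data frame into the Channel column of the original data frame.
  3. I assume I will need some kind of loop with an if/else statement, i.e. if the page matches one xpath, copy that data into the Channel column for that row; otherwise copy the data from the other xpath into the Channel column for that row. If the page matches neither xpath (which is possible), do nothing.

Hope that's all clear. Happy to elaborate if necessary.

EDIT: Updated one of the incorrect URLs in the code above.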

[Question comments]:

    Tags: r loops dataframe if-statement web-scraping


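    [Solution 1]:

    You can achieve this easily with the following approach:

    1. Create a function for the scraping part.
    2. In this function, try the first XPath; if the result is empty, try the second XPath.
    3. Repeat this for all URLs using any kind of loop. (I used purrr::map, but any loop will do.)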
    library(rvest)
    
    get_channel <- function(url) {
       ## some elements do not contain any url
       if (is.na(url) || !nzchar(url)) return(NA_character_)
       page <- url %>%
        read_html()
       ## try to read channel
       channel <- page %>% 
         html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>% 
         html_text()
       ## if it's empty we are most likely on an episode page -> try the other xpath 
       if (!length(channel)) {
        channel <- page %>% 
           html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>% 
           html_text()
       }
       if (length(channel)) channel[[1]] else NA_character_
    }
    
    ## loop through all urls in the df
    
    purrr::map_chr(as.character(df$Programme_Synopsis_url), get_channel)
    # [1] "BBC Two"   "BBC Three" "BBC Three" "BBC Three" NA          NA          "BBC Three" "BBC Three" "BBC Three" "BBC Two" 
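
The "try one XPath, then fall back to the next" idea can be checked offline before hitting the real site. The following is only a sketch: the helper name `first_match`, the toy HTML snippets and the simplified XPaths are made up for illustration and are not the real iPlayer markup.

```r
library(rvest)

# Hypothetical helper: return the text of the first XPath that matches
# anything on the page, or NA if none of them match.
first_match <- function(page, xpaths) {
  for (xp in xpaths) {
    hit <- html_text(html_nodes(page, xpath = xp))
    if (length(hit)) return(hit[[1]])
  }
  NA_character_
}

# Toy pages standing in for the two page types (NOT the real BBC markup)
programme_html <- '<html><body><div id="masthead"><a>BBC Two</a></div></body></html>'
episode_html   <- '<html><body><div id="main"><span>BBC Three</span></div></body></html>'
neither_html   <- '<html><body><p>nothing here</p></body></html>'

# Simplified XPaths matching the toy pages above, tried in this order
xpaths <- c('//*[@id="masthead"]/a/text()', '//*[@id="main"]/span')

results <- vapply(
  c(programme_html, episode_html, neither_html),
  function(html) first_match(read_html(html), xpaths),
  character(1),
  USE.NAMES = FALSE
)
print(results)
```

Keeping the XPaths in a vector means a third page layout, should one turn up, only needs one more entry rather than another nested `if`.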
    

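    Regarding your other questions:

    1. The BBC may well try to stop you from scraping their pages. There are some tricks to work around this, such as adding a delay between consecutive requests. Some sites also check the user agent, which you would then need to change every n requests so that the site does not block you. Websites protect themselves against scraping in a variety of ways, and which countermeasures matter depends on what you need to do. That said, I don't believe 44k requests would come anywhere close to killing their service, but I'm no expert here.
    2. Avoiding requests for duplicate URLs definitely makes sense, and it can easily be done like this [untested]: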

      new_df <- df[!duplicated(df$Programme_Synopsis_url), ]
      new_df$channel <- purrr::map_chr(as.character(new_df$Programme_Synopsis_url), 
                                       get_channel)
      dplyr::left_join(df, 
                       new_df[, c("Programme_Synopsis_url", "channel")], 
                       by = "Programme_Synopsis_url")
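
Both points above (pause between requests, scrape each distinct URL only once) can be combined in base R. This is a rough sketch rather than a tested recipe for iPlayer: `scrape_unique`, its `delay` argument and the stand-in `fake_get_channel` are hypothetical names, and the one-second default is an arbitrary assumption, not a limit published by the BBC.

```r
# Hypothetical helper: call `f` once per distinct URL, wait `delay` seconds
# between calls, then map the results back onto the original row order.
scrape_unique <- function(urls, f, delay = 1) {
  urls <- as.character(urls)          # factor columns become plain strings
  unique_urls <- unique(urls)
  results <- character(length(unique_urls))
  for (i in seq_along(unique_urls)) {
    results[i] <- f(unique_urls[i])
    if (i < length(unique_urls)) Sys.sleep(delay)  # no pause after the last call
  }
  results[match(urls, unique_urls)]
}

# Offline demonstration with a stand-in for get_channel()
calls <- 0
fake_get_channel <- function(url) {
  calls <<- calls + 1                 # count how often the "scraper" runs
  if (!nzchar(url)) NA_character_ else paste("channel for", url)
}

toy <- data.frame(Programme_Synopsis_url = c("a", "b", "a", "", "b"),
                  stringsAsFactors = TRUE)
toy$Channel <- scrape_unique(toy$Programme_Synopsis_url, fake_get_channel,
                             delay = 0)
print(toy)
print(calls)
```

With the real data the call would presumably look like `df$Channel <- scrape_unique(df$Programme_Synopsis_url, get_channel)`, which fills the new column directly and avoids the separate join.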
      

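    [Comments]:

    • Thanks for the suggestion - it works almost perfectly. Once I remove the duplicates there are only 1,500 unique URLs to scrape, so that should make the task much easier. However, when I run the function I get the following error: Error in open.connection(x, "rb") : HTTP error 404. Any ideas?
    • Looks like tryCatch might be the solution? stackoverflow.com/questions/38114066/…
    • Error 404 means that the URL could not be found, cf. Wiki. You can wrap read_html in a tryCatch to guard against this error.
    • Thanks, that's what I suspected, but I can't quite work out where in the function it should go. I'm very new to R, in case that's not obvious!...
    • The easiest way is to use purrr::possibly. So try purrr::map_chr(as.character(df$Programme_Synopsis_url), purrr::possibly(get_channel, NA_character_))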
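
For anyone wondering where a tryCatch would go inside the function itself, one possible placement is sketched below. The name `get_channel_safe` and the `reader` argument are assumptions added here so the error path can be exercised without a network connection; the two XPaths are the ones from the thread.

```r
# Sketch: the thread's get_channel() with tryCatch wrapped around the page
# download. `reader` defaults to rvest::read_html and is only a parameter so
# that a stub can be swapped in for testing.
get_channel_safe <- function(url, reader = rvest::read_html) {
  if (is.na(url) || !nzchar(url)) return(NA_character_)

  # A 404 (or any other download/parse error) yields NULL instead of stopping
  page <- tryCatch(reader(url), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)

  channel <- rvest::html_text(rvest::html_nodes(
    page, xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()'))
  if (!length(channel)) {
    channel <- rvest::html_text(rvest::html_nodes(
      page,
      xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span'))
  }
  if (length(channel)) channel[[1]] else NA_character_
}

# Offline check of the error path: a stub reader that always fails like a 404
always_404 <- function(url) stop("HTTP error 404.")
missing_page <- get_channel_safe("https://example.invalid/gone", reader = always_404)
blank_url <- get_channel_safe("", reader = always_404)
```

Compared with `purrr::possibly`, this only swallows errors from fetching the page, so a mistake elsewhere in the function would still surface as an error rather than silently becoming NA.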