[Posted]: 2020-01-07 05:25:11
[Question]:
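I have a somewhat complex task: look up a series of URLs contained in a data frame, scrape some data from each URL, and then add that data back to the original data frame. I seem to have solved the hardest part (the scraping), but I am stuck on how to automate the task (which I suspect may be quite simple).
Here is the situation: I have a data.frame with 12 variables and 44,000 rows. One of these variables, Programme_Synopsis_url, contains the URL of a programme on BBC iPlayer.
I need to go to that URL, extract one piece of data (the channel details), and add it to a new column called Channel.
Here is some sample data (apologies for the size/complexity of this sample, but I think it is necessary to share it to get a proper solution):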
df <- structure(list(Title = structure(c(3L, 7L, 5L, 2L, 6L, 6L, 1L,
4L, 9L, 8L), .Label = c("Asian Provocateur", "Cuckoo", "Dragons' Den",
"In The Flesh", "Keeping Faith", "Lost Boys? What's Going Wrong For Asian Men",
"One Hot Summer", "Travels in Trumpland with Ed Balls", "Two Pints of Lager and a Packet of Crisps"
), class = "factor"), Series = structure(c(1L, 1L, 1L, 3L, 1L,
1L, 2L, 2L, 1L, 1L), .Label = c("", "Series 1-2", "Series 4"), class = "factor"),
Programme_Synopsis = structure(c(2L, 5L, 4L, 6L, 1L, 1L,
8L, 7L, 9L, 3L), .Label = c("", "1. The Dragons are back - with big money on the table.",
"1/3 Proud. Meeting rednecks", "1/8 Faith questions everything when her husband goes missing",
"4/6 What Happens in Ibiza... Is Megan really a party animal?",
"Box Set. Dale plans to propose – but what does Ken think?",
"Box Set. For the undead... life begins again", "Box Set. Romesh... and mum",
"Series 1-9. Box Set"), class = "factor"), Programme_Synopsis_url = structure(c(6L,
9L, 4L, 8L, 1L, 1L, 3L, 7L, 2L, 5L), .Label = c("", "https://www.bbc.co.uk/iplayer/episode/b00747zt/two-pints-of-lager-and-a-packet-of-crisps-series-1-1-fags-shags-and-kebabs",
"https://www.bbc.co.uk/iplayer/episode/b06fq3x4/asian-provocateur-series-1-1-uncle-thiru",
"https://www.bbc.co.uk/iplayer/episode/b09rjsq5/keeping-faith-series-1-episode-1",
"https://www.bbc.co.uk/iplayer/episode/b0bdpvhf/travels-in-trumpland-with-ed-balls-series-1-1-proud",
"https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1",
"https://www.bbc.co.uk/iplayer/episode/p00szzcp/in-the-flesh-series-1-episode-1",
"https://www.bbc.co.uk/iplayer/episode/p06f52g1/cuckoo-series-4-1-lawyer-of-the-year",
"https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza"
), class = "factor"), Programme_Duration = structure(c(6L,
4L, 6L, 1L, 6L, 6L, 2L, 5L, 3L, 6L), .Label = c("25 mins",
"28 mins", "29 mins", "40 mins", "56 mins", "59 mins"), class = "factor"),
Programme_Availability = structure(c(4L, 2L, 1L, 6L, 4L,
4L, 5L, 6L, 5L, 3L), .Label = c("Available for 1 month",
"Available for 11 months", "Available for 17 days", "Available for 28 days",
"Available for 3 months", "Available for 5 months"), class = "factor"),
Programme_Category = structure(c(2L, 2L, 2L, 2L, 2L, 3L,
1L, 1L, 1L, 1L), .Label = c("Box Sets", "Featured", "Most Popular"
), class = "factor"), Programme_Genre = structure(c(4L, 2L,
3L, 5L, 2L, 2L, 1L, 3L, 1L, 2L), .Label = c("Comedy", "Documentary",
"Drama", "Entertainment", "New SeriesComedy"), class = "factor"),
date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = "13/08/2018", class = "factor"), rank = c(1L,
2L, 3L, 4L, 5L, 12L, 1L, 2L, 3L, 4L), row = c(1L, 1L, 1L,
1L, 1L, 3L, 4L, 4L, 4L, 4L), Box_Set = structure(c(1L, 1L,
1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
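To make things more complicated (!), there are two different types of URL. Some point to a programme's episode page and some point to the main programme page (nothing in the URL syntax distinguishes the two). This matters because the data I want to scrape (the channel name) sits in a different place depending on whether the page is an episode page or the programme's main page. I have written a script that gets this data for each type of page: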
### Load rvest for read_html(), html_nodes(), html_text() and the %>% pipe ###
library(rvest)
### Get Channel for programme page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1'
### Then, locate details of Channel via xpath ###
channel <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
  html_text()
### Confirm Channel details ###
print(channel)
### Get Channel for episode page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza'
### Then, locate details of Channel via xpath ###
channel <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
  html_text()
### Confirm Channel details ###
print(channel)
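The two lookups above could be folded into one function that tries each XPath in turn and falls back to NA when neither matches. The following is only a minimal sketch: `extract_channel()` and `get_channel()` are hypothetical names that do not appear in the original script, and trimming whitespace from the result is an added assumption.

```r
library(rvest)

# XPaths copied unchanged from the two snippets above.
xpath_programme <- '//*[@id="br-masthead"]/div/div[1]/a/text()'
xpath_episode <- '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span'

# Hypothetical helper: takes an already parsed page and returns the first
# non-empty match from either XPath, or NA if neither XPath matches.
extract_channel <- function(page) {
  for (xp in c(xpath_programme, xpath_episode)) {
    found <- trimws(html_text(html_nodes(page, xpath = xp)))
    found <- found[nzchar(found)]
    if (length(found) > 0) return(found[1])
  }
  NA_character_
}

# Hypothetical wrapper: downloads one URL and returns NA instead of
# stopping when the URL is empty or the request fails.
get_channel <- function(url) {
  if (is.na(url) || !nzchar(url)) return(NA_character_)
  tryCatch(extract_channel(read_html(url)), error = function(e) NA_character_)
}
```

Keeping the parsing step separate from the download means the XPath logic can be checked against a small HTML string without touching the BBC site; `get_channel(url)` would then be called once per URL.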
The question is: how do I automate this and loop over every URL (about 44,000), extract this data, and add it to a new column called Channel?
A few final questions/caveats:
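- Will looking up and scraping data from 44,000 URLs cause any technical problems? I don't want to kill the BBC's servers or get my IP blocked by doing this! I checked the terms and conditions of their website and found no mention of scraping.
- It may help to point out that although I need to check about 44,000 rows (URLs), many of them are duplicates. So I wonder whether it would be better to first create a new data frame with any duplicates removed (e.g. based on the Programme_Synopsis_url or Title column). That would mean scraping fewer URLs, and the data could then be merged back into the original data frame, i.e. if the Title matches, add the value from the Channel column of the reduced data frame to a column called Channel in the original data frame.
- I think I will have to use some kind of loop with an if/else statement to do this, i.e. if the page at the URL contains a certain xpath, copy that data into the Channel column for that row, otherwise copy the data from the other xpath into the Channel column for that row. If the page contains neither xpath (which is possible), do nothing.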
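The de-duplication idea in the list above might look roughly like the sketch below: scrape each distinct URL once, pause between requests, and map the results back to every row with `match()`. The function name `add_channel()`, its `fetch` and `delay` arguments, and the one-second default are all assumptions made for illustration; `fetch` stands for any function that turns one URL into one channel name (or NA), such as a wrapper around the two `read_html()` snippets.

```r
# Hypothetical helper: adds a Channel column by scraping each unique URL once.
#   fetch - function(url) returning a single channel name or NA (assumed)
#   delay - pause in seconds between requests, to avoid hammering the server (assumed)
add_channel <- function(df, fetch, delay = 1) {
  # The URL column is a factor in the sample data, so convert it first.
  urls <- as.character(df$Programme_Synopsis_url)
  unique_urls <- unique(urls[!is.na(urls) & nzchar(urls)])

  channels <- rep(NA_character_, length(unique_urls))
  for (i in seq_along(unique_urls)) {
    channels[i] <- fetch(unique_urls[i])
    Sys.sleep(delay)
  }

  # match() maps every row to its unique URL; rows with an empty URL get NA.
  df$Channel <- channels[match(urls, unique_urls)]
  df
}

# Offline usage example with a stub in place of a real scraper.
demo <- data.frame(Programme_Synopsis_url = c("u1", "", "u2", "u1"),
                   stringsAsFactors = TRUE)
calls <- 0
stub_fetch <- function(u) { calls <<- calls + 1; toupper(u) }
out <- add_channel(demo, stub_fetch, delay = 0)
```

Matching on the URL rather than on Title avoids assuming that every Title maps to exactly one page. The stub shows that duplicated URLs are fetched only once; with the real data the stub would be swapped for an actual scraping function.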
I hope that is all clear. Happy to elaborate if necessary.
Edit: updated one of the incorrect URLs in the code above.
[Discussion]:
Tags: r loops dataframe if-statement web-scraping