Creating a loop/function from scraped data in rvest

【Question title】: Creating a loop/function from scraped data in rvest
【Posted】: 2021-02-25 22:57:57
【Question description】:

So I created this script, which does what I want on each individual page:

library(rvest)
library(dplyr)
library(stringr)

targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2022-Football/Targets/")
page <- read_html(targets_url)

jsons <- page %>%   html_nodes(xpath = '//*[@type ="application/ld+json"]') 
allplayers <- jsonlite::fromJSON( html_text(jsons[2]))

list <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
category <- list[headers] %>% html_node("b.name") %>% html_text()
nrepeats<-as.integer(str_extract(category, "[0-9]+"))

answer2 <- cbind(rep(category, nrepeats), allplayers$athlete)
answer2$target_category <- answer2$`rep(category, nrepeats)`

target_df <- answer2 %>% select(target_category, name, jobTitle)

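But as you can see, there is a hardcoded URL segment in there: ohio-state.

What if I wanted to automate this over multiple iterations? Suppose the first lines of the script were: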

teams <- c("ohio-state","penn-state","michigan","michigan-state")

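so that my end result is an aggregated data frame with the results from these four URLs? Also, I would like to add a fourth column to target_df based on the teams vector, so that it looks like this: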

target_df <- answer2 %>% select(target_category, name, jobTitle) %>% mutate(team = teams[1])

Obviously it would not keep teams[1] in the larger script; this is just to show the fourth column I want.

【Question discussion】:

Tags: r dplyr tidyverse rvest stringr

【Solution 1】:

I would lean toward a for loop for this. Someone else will chime in with other solutions.

teams <- c("ohio-state","penn-state","michigan","michigan-state")
for (team in teams) {
  targets_url <- paste0("https://247sports.com/college/", team, "/Season/2022-Football/Targets/")
  page <- read_html(targets_url)

  jsons <- page %>% html_nodes(xpath = '//*[@type ="application/ld+json"]')
  allplayers <- jsonlite::fromJSON(html_text(jsons[2]))

  list <- page %>% html_nodes("li.ri-page__list-item")
  headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
  category <- list[headers] %>% html_node("b.name") %>% html_text()
  nrepeats <- as.integer(str_extract(category, "[0-9]+"))

  answer2 <- cbind(rep(category, nrepeats), allplayers$athlete, team)
  answer2$target_category <- answer2$`rep(category, nrepeats)`

  target_df <- answer2 %>% select(target_category, name, jobTitle)
}

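I think…

【Discussion】:

Thanks, this works, but one question: what if I want to add the teams column at the end? I tried target_df <- answer2 %>% select(target_category, name, jobTitle) %>% mutate(team = teams[1:T]), but it only filled in the first item in the list (ohio-state).
Oh, and also... now that I check, it looks like it only returned one data set. The final count should be 930.
Yes, it looks like it only returns the values for the last item in the list.
Ah. That's the problem with coding on my phone and not testing it.
No worries, good effort. If you want to try to fix it, I would appreciate it.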
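
One possible way to address the two issues raised in the comments above (only the last team's rows survive, and there is no team column) is to store each iteration's result in a list and combine everything once after the loop. This is a minimal sketch, not the answerer's code: `scrape_team()` is a hypothetical stand-in for the scraping steps, and it returns made-up placeholder rows so that the pattern can be run offline.

```r
# Hypothetical stand-in for the scraping code inside the loop.
# It returns placeholder rows so the pattern runs without network access.
scrape_team <- function(team) {
  data.frame(
    target_category = c("Example category A", "Example category B"),
    name            = paste(team, c("player 1", "player 2")),
    jobTitle        = "Example position",
    stringsAsFactors = FALSE
  )
}

teams <- c("ohio-state", "penn-state", "michigan", "michigan-state")

# One list slot per team, filled inside the loop.
results <- vector("list", length(teams))
for (i in seq_along(teams)) {
  team_df <- scrape_team(teams[i])
  team_df$team <- teams[i]   # the single value is recycled to every row
  results[[i]] <- team_df
}

# Stack the per-team data frames into one.
target_df <- do.call(rbind, results)
```

Collecting into a list and binding once also avoids growing a data frame inside the loop, which tends to get slow as the number of pages increases.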

【Solution 2】:

Put the code that extracts the data from each page into a function.

library(tidyverse)
library(rvest)

get_data <- function(url) {
  page <- read_html(url)
  
  jsons <- page %>%   html_nodes(xpath = '//*[@type ="application/ld+json"]') 
  allplayers <- jsonlite::fromJSON( html_text(jsons[2]))
  
  list <- page %>% html_nodes("li.ri-page__list-item")
  headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
  category <- list[headers] %>% html_node("b.name") %>% html_text()
  nrepeats<-as.integer(str_extract(category, "[0-9]+"))
  
  answer2 <- cbind(rep(category, nrepeats), allplayers$athlete)
  answer2$target_category <- answer2$`rep(category, nrepeats)`
  
  answer2 %>% select(target_category, name, jobTitle)
}

Create the links to extract data from.

teams <- c("ohio-state","penn-state","michigan","michigan-state") 
urls <- sprintf('https://247sports.com/college/%s/Season/2022-Football/Targets/', teams)

Use map_df to get the data from each link and combine it into a single data frame with a team column identifying which team each row came from.

map_df(urls, get_data, .id = 'team') %>%  
    mutate(team = teams[as.integer(team)]) -> result
result
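
As a variation on the step above, naming the input vector lets `.id` carry the team name directly, so the index lookup in `mutate()` is not needed. A minimal sketch, assuming purrr and dplyr are installed (`map_dfr()` relies on dplyr for the row binding); `get_data_stub()` is a hypothetical offline stand-in for `get_data()` and returns placeholder rows.

```r
library(purrr)

# Hypothetical offline stand-in for get_data(); returns placeholder rows.
get_data_stub <- function(url) {
  data.frame(
    target_category = "Example category",
    name            = c("player 1", "player 2"),
    jobTitle        = "Example position",
    stringsAsFactors = FALSE
  )
}

teams <- c("ohio-state", "penn-state", "michigan", "michigan-state")
urls <- sprintf("https://247sports.com/college/%s/Season/2022-Football/Targets/", teams)

# Name each URL after its team; with a named input, .id stores the name
# instead of the position.
named_urls <- set_names(urls, teams)
result <- map_dfr(named_urls, get_data_stub, .id = "team")
```

With the real `get_data()` swapped in for the stub, `result$team` would hold the team slug for every row.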

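【Discussion】:

This seems to just multiply the ohio-state object in the list by four. The final count should be 930.
Sorry, there was a typo in the function. It should be page <- read_html(url). I have updated the answer. Can you check now?
That worked. Thank you, king.
One of the solutions (function or loop) will be the better one. The two deciding factors are speed and stability. Stability would be my bigger concern: web scraping always carries the risk of a 404 or a network outage. I don't know which is best, and it may vary by use case.
@LeTigris Glad to have been of help! Feel free to accept the answer by clicking the check mark on the left. Only one answer can be accepted per post. Reference - stackoverflow.com/help/someone-answers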
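
On the stability concern raised in the comments above, one common approach is to wrap each request in `tryCatch()` so that a single failing page is skipped rather than aborting the whole run. The sketch below is an assumption-laden illustration, not part of either answer: `get_data_stub()` is hypothetical and simulates a 404 for one team so the example runs offline.

```r
# Hypothetical stand-in for get_data(); fails for one team to mimic a 404.
get_data_stub <- function(team) {
  if (team == "penn-state") stop("HTTP error 404 (simulated)")
  data.frame(name = paste(team, "player"), team = team, stringsAsFactors = FALSE)
}

# Wrap the call so an error yields NULL plus a message instead of stopping.
safe_get <- function(team) {
  tryCatch(
    get_data_stub(team),
    error = function(e) {
      message("Skipping ", team, ": ", conditionMessage(e))
      NULL
    }
  )
}

teams <- c("ohio-state", "penn-state", "michigan", "michigan-state")
results <- lapply(teams, safe_get)

# Drop the failed teams, then combine what succeeded.
ok <- !vapply(results, is.null, logical(1))
combined <- do.call(rbind, results[ok])
failed <- teams[!ok]
```

Keeping `failed` around makes it easy to retry only the pages that did not load.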

【Solution 3】:

teams <- c("ohio-state","penn-state","michigan","michigan-state")
for (team in teams) {
  targets_url <- paste0("https://247sports.com/college/", team, "/Season/2022-Football/Targets/")
  page <- read_html(targets_url)

  jsons <- page %>% html_nodes(xpath = '//*[@type ="application/ld+json"]')
  allplayers <- jsonlite::fromJSON(html_text(jsons[2]))

  list <- page %>% html_nodes("li.ri-page__list-item")
  headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
  category <- list[headers] %>% html_node("b.name") %>% html_text()
  nrepeats <- as.integer(str_extract(category, "[0-9]+"))

  answer2 <- cbind(rep(category, nrepeats), allplayers$athlete, team)
  answer2$target_category <- answer2$`rep(category, nrepeats)`

  if (exists("target_df")) {
    target <- answer2 %>% select(target_category, name, jobTitle)

    target_df <- rbind(target_df, target)

  } else {

    target_df <- answer2 %>% select(target_category, name, jobTitle)
  }
}

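Does this work? It is not in answer2 now.

【Discussion】:

If this is your revised code, please delete your original answer or edit your other answer. Having two very similar answers from the same person is very confusing.
Nope. Error in exists(target_df) : invalid first argument
Dave2e - I would agree if answering simply meant supplying a complete, ready-made answer. But when a conversation is actually under way, I think this helps the poster understand the changes and learn to code, rather than just copy code.
Sorry, target_df needs to be in quotes. That always gets me.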
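
To illustrate the `exists()` point from the comments above: the function expects the object's name as a string. A related caveat is that `exists("target_df")` is also true when a `target_df` is left over from an earlier run, in which case new rows would be appended to stale data. A minimal sketch of one way around both issues, using made-up placeholder data frames in place of the scraped results:

```r
# exists() takes the object's name as a character string.
pi_is_defined <- exists("pi")   # pi is defined in base R

# Placeholder per-team results standing in for the scraped data.
pieces <- list(
  data.frame(name = "player 1", team = "ohio-state", stringsAsFactors = FALSE),
  data.frame(name = "player 2", team = "penn-state", stringsAsFactors = FALSE)
)

# Starting from NULL removes the need for an exists() check and cannot
# pick up a stale target_df from an earlier run.
target_df <- NULL
for (piece in pieces) {
  target_df <- rbind(target_df, piece)
}
```

Because `rbind()` ignores the initial `NULL`, the first iteration simply yields the first data frame and later iterations append to it.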