【Posted】: 2021-02-25 22:57:57
【Question】:
I wrote this script, which does what I want for a single page:
library(rvest)
library(dplyr)
library(stringr)
# The team slug ("ohio-state") is hard-coded in the URL
targets_url <- "https://247sports.com/college/ohio-state/Season/2022-Football/Targets/"
page <- read_html(targets_url)
# The script reads the athlete data from the second JSON-LD block on the page
jsons <- page %>% html_nodes(xpath = '//*[@type ="application/ld+json"]')
allplayers <- jsonlite::fromJSON(html_text(jsons[2]))
# List items on the page; the header items carry the category names
items <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(items, "class") == "ri-page__list-item list-header")
category <- items[headers] %>% html_node("b.name") %>% html_text()
# The number inside each header text is used as that category's player count
nrepeats <- as.integer(str_extract(category, "[0-9]+"))
# Repeat each category once per player and bind it to the athlete table
answer2 <- cbind(target_category = rep(category, nrepeats), allplayers$athlete)
target_df <- answer2 %>% select(target_category, name, jobTitle)
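But as you can see, the team slug ohio-state is hard-coded in the URL.
What if I want to automate this over several iterations? Say the first few lines of the script were: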
teams <- c("ohio-state","penn-state","michigan","michigan-state")
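so that my end result is a single aggregated data frame containing the results from all four URLs? I would also like to add a fourth column to target_df based on the teams vector, so that it looks something like this: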
target_df <- answer2 %>% select(target_category, name, jobTitle) %>% mutate(team = teams[1])
Obviously the larger script would not keep teams[1]; it is only there to show the fourth column I want.
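One way the iteration described above might be approached, as a minimal sketch only: wrap the per-page steps in a function that takes a team slug, then apply it to each element of teams and stack the results. The helper names targets_url_for and scrape_targets are hypothetical (not from the original script), and the sketch assumes every team page has the same layout as the ohio-state page, with allplayers$athlete being a data frame that has name and jobTitle columns.

```r
library(rvest)
library(dplyr)
library(stringr)

teams <- c("ohio-state", "penn-state", "michigan", "michigan-state")

# Hypothetical helper: build the targets URL for one team slug.
targets_url_for <- function(team_slug) {
  paste0("https://247sports.com/college/", team_slug,
         "/Season/2022-Football/Targets/")
}

# Hypothetical helper: the original per-page steps, parameterised by team.
# Assumes each team page is structured like the ohio-state page.
scrape_targets <- function(team_slug) {
  page <- read_html(targets_url_for(team_slug))
  jsons <- page %>% html_nodes(xpath = '//*[@type ="application/ld+json"]')
  allplayers <- jsonlite::fromJSON(html_text(jsons[2]))
  items <- page %>% html_nodes("li.ri-page__list-item")
  headers <- which(html_attr(items, "class") == "ri-page__list-item list-header")
  category <- items[headers] %>% html_node("b.name") %>% html_text()
  nrepeats <- as.integer(str_extract(category, "[0-9]+"))
  # Add the category and the team slug as columns, then keep four columns
  allplayers$athlete %>%
    mutate(target_category = rep(category, nrepeats), team = team_slug) %>%
    select(target_category, name, jobTitle, team)
}
```

Calling all_targets <- bind_rows(lapply(teams, scrape_targets)) would then make one live request per team and return a single data frame with a team column. If any page deviates from the assumed layout, for example if the counts in the headers do not add up to the number of athletes, the rep() step would fail for that team, so some error handling may be worth adding.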
【Comments】:
Tags: r dplyr tidyverse rvest stringr