[Question title]: R script coming back with a few errors when scraping basic page
[Posted]: 2021-07-22 11:05:44
[Question description]:

Here is my script:

library(rvest)
library(dplyr)


link = "http://www.mmadecisions.com/decisions-by-judge/"
page = read_html(link)

name = page %>% html_nodes("#page1 a") %>% html_text()
name_links = page %>% html_nodes("#page1 a") %>%
  html_attr("href") %>% paste("http://www.mmadecisions.com/", ., sep="")

get_decisions = function(name_link) {
  judge_page = read_html(name_link)
  date = judge_page %>% html_nodes(".list:nth-child(1)") %>% html_text()
  event = judge_page %>% html_nodes(".list:nth-child(2) a") %>% html_text()
  fight = judge_page %>% html_nodes(".list~ .list+ .list a") %>% html_text()
  decisions = judge_page %>% html_nodes(".list:nth-child(4)") %>% html_text()
  return(judge_page)
}

decision = sapply(name_links, FUN = get_decisions)

judges = data.frame(name, date, event, fight, decision, stringsAsFactors = FALSE)

The errors I keep getting are as follows:

> library(rvest)
> library(dplyr)
> 
> 
> link = "http://www.mmadecisions.com/decisions-by-judge/"
> page = read_html(link)
> 
> name = page %>% html_nodes("#page1 a") %>% html_text()
> name_links = page %>% html_nodes("#page1 a") %>%
+   html_attr("href") %>% paste("http://www.mmadecisions.com/", ., sep="")
> 
> get_decisions = function(name_link) {
+   judge_page = read_html(name_link)
+   date = judge_page %>% html_nodes(".list:nth-child(1)") %>% html_text()
+   event = judge_page %>% html_nodes(".list:nth-child(2) a") %>% html_text()
+   fight = judge_page %>% html_nodes(".list~ .list+ .list a") %>% html_text()
+   decisions = judge_page %>% html_nodes(".list:nth-child(4)") %>% html_text()
+   return(judge_page)
+ }
> 
> decisions = sapply(name_links, FUN =  get_decisions)
Error in open.connection(x, "rb") : HTTP error 400.
Called from: open.connection(x, "rb")
Browse[1]> 
> judges = data.frame(name, date, event, fight, score, stringsAsFactors = FALSE)
Error in data.frame(name, date, event, fight, score, stringsAsFactors = FALSE) : 
  object 'score' not found

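My goal is to navigate from the parent page to multiple child pages, scrape the four columns of judges' decision data, and then print them as columns. I appreciate any insight anyone can offer on this.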

[Discussion]:

    Tags: html r web-scraping rvest


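    [Solution 1]:

    Not entirely sure why you have two ending variables. If you want a final df with those 4 columns, you can use purrr::map_dfr over the pages, and make sure you return a tibble from the function. For the connection issue, you need to trim the whitespace from your URLs.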

    library(rvest)
    library(dplyr)
    library(purrr)
    
    link = "http://www.mmadecisions.com/decisions-by-judge/"
    page = read_html(link)
    
    name = page %>% html_nodes("#page1 a") %>% html_text()
    name_links = page %>% html_nodes("#page1 a") %>%
      html_attr("href") %>% paste("http://www.mmadecisions.com/", ., sep="") %>% trimws()
    
    get_decisions = function(name_link) {
      judge_page = read_html(name_link)
      tibble(
        date = judge_page %>% html_nodes(".list:nth-child(1)") %>% html_text(),
        event = judge_page %>% html_nodes(".list:nth-child(2) a") %>% html_text(),
        fight = judge_page %>% html_nodes(".list ~ .list + .list a") %>% html_text(),
        decisions = judge_page %>% html_nodes(".list:nth-child(4)") %>% html_text()
      ) -> t
      return(t)
    }
    
    df <- map_dfr(name_links, get_decisions)
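
The whitespace fix can be checked without touching the network. This is a minimal sketch, assuming the scraped `href` values carry stray leading or trailing whitespace; the two `hrefs` below are hypothetical stand-ins, not values taken from the site.

```r
# Hypothetical href values with stray whitespace (an assumption for illustration)
hrefs <- c("judge/1/Judge-A ", " judge/2/Judge-B\n")

base_url <- "http://www.mmadecisions.com/"

# Without trimming, the whitespace ends up inside the URL
raw_urls <- paste0(base_url, hrefs)

# Trim each href before pasting so no whitespace is left anywhere in the URL
urls <- paste0(base_url, trimws(hrefs))

print(raw_urls)
print(urls)
```

Trimming the `href` before pasting also removes leading whitespace, which `trimws()` applied to the finished URL (as in the answer) cannot reach because it would sit in the middle of the string.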
    

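    [Comments]:

    • Thanks so much. I'm actually just learning R and following a tutorial, trying to substitute my values for the ones in the tutorial. Your code shows no errors, but I'm not sure how to display the table when using purrr::map_dfr. In my version I would just type view(name) and it would show it to me. Is there a tutorial you would recommend for learning this function? Thanks again!
    • View(df) will show you the table. map_dfr basically applies a function that returns tibbles and binds all of those tibbles into a single data frame. I read about it here: purrr.tidyverse.org/reference/map.html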
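
To see what map_dfr does without hitting the site, here is a small offline sketch. `fake_get_decisions` and the `inputs` vector are hypothetical stand-ins for `get_decisions` and `name_links`. Naming the input vector and passing `.id` is one way to keep the judge's name next to each row; that part is an addition, not something the answer above does.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for get_decisions(): returns one tibble per link
fake_get_decisions <- function(link) {
  tibble(
    link  = link,
    fight = c("A vs. B", "C vs. D")
  )
}

# Named vector: the names play the role of `name`, the values of `name_links`
inputs <- c("Judge X" = "url-1", "Judge Y" = "url-2")

# Each call returns a tibble; map_dfr row-binds them into one data frame,
# and .id stores the name of each input element in a "judge" column
df <- map_dfr(inputs, fake_get_decisions, .id = "judge")

print(df)
```

With the real scraper, the equivalent call would presumably be `map_dfr(set_names(name_links, name), get_decisions, .id = "judge")`. Note that purrr 1.0.0 marks `map_dfr()` as superseded in favour of `map()` followed by `list_rbind()`, though it still works.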