【Question title】: Error with R Script - Error in open.connection(x, "rb") : HTTP error 404. Called from: open.connection(x, "rb")
【Posted】: 2021-07-24 10:35:46
【Question description】:

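I'm working on this code, but for some reason I keep getting a 404 error even though the website is up and running. I'm not sure where I went wrong and would appreciate any suggestions from the community. I believe the mistake is somewhere in the site link, but I'm not sure what to put there. I have tried the bare minimum "http://www.ufcstats.com/" as well as '/fighter-details/'.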

library(rvest)
library(dplyr)
library(purrr)

link = "http://www.ufcstats.com/statistics/fighters?char=a&page=all"
page = read_html(link)

name = page %>% html_nodes(".b-link_style_black") %>% html_text()
name_links = page %>% html_nodes(".b-link_style_black") %>%
  html_attr("href") %>% paste("http://www.ufcstats.com/fighter-details/", ., sep="") %>% trimws()

get_Info = function(name_link) {
  fighter_page = read_html(name_link)
  tibble(
    name = fighter_page %>% html_nodes(".b-content__title-highlight") %>% html_text(),
    record = fighter_page %>% html_nodes(".b-content__title-record") %>% html_text(),
    height = fighter_page %>% html_nodes(".b-list__info-box_style_small-width .b-list__box-list-item_type_block:nth-child(1)") %>% html_text(),
    weight = fighter_page %>% html_nodes(".b-list__info-box_style_small-width .b-list__box-list-item_type_block:nth-child(2)") %>% html_text(),
    reach = fighter_page %>% html_nodes(".b-list__info-box_style_small-width .b-list__box-list-item_type_block:nth-child(3)") %>% html_text(),
    stance = fighter_page %>% html_nodes(".b-list__info-box_style_small-width .b-list__box-list-item_type_block:nth-child(4)") %>% html_text(),
    dob = fighter_page %>% html_nodes(".b-list__info-box_style_small-width .b-list__box-list-item_type_block:nth-child(5)") %>% html_text(),
    sig_strikes_per_min= fighter_page %>% html_nodes(".b-list__info-box-left .b-list__info-box-left .b-list__box-list-item_type_block:nth-child(1)") %>% html_text(),
    sig_striking_accuracy = fighter_page %>% html_nodes(".b-list__info-box-left .b-list__info-box-left .b-list__box-list-item_type_block:nth-child(2)") %>% html_text(),
    sig_strikes_abs_per_min = fighter_page %>% html_nodes(".b-list__info-box-left .b-list__info-box-left .b-list__box-list-item_type_block:nth-child(3)") %>% html_text(),
    sig_strike_def = fighter_page %>% html_nodes(".b-list__info-box-left .b-list__info-box-left .b-list__box-list-item_type_block:nth-child(4)") %>% html_text(),
    avg_takedown = fighter_page %>% html_nodes(".b-list__info-box_style-margin-right .b-list__box-list-item_type_block:nth-child(2)") %>% html_text(),
    takedown_accuracy = fighter_page %>% html_nodes(".b-list__info-box_style-margin-right .b-list__box-list-item_type_block:nth-child(3)") %>% html_text(),
    takedown_defense = fighter_page %>% html_nodes(".b-list__info-box_style-margin-right .b-list__box-list-item_type_block:nth-child(4)") %>% html_text(),
    sub_avg = fighter_page %>% html_nodes(".b-list__box-list_margin-top .b-list__box-list-item_type_block:nth-child(5)") %>% html_text(),
    last_fight = fighter_page %>% html_nodes(".b-statistics__table-row+ .js-fight-details-click .b-fight-details__table-col~ .b-fight-details__table-col+ .l-page_align_left .b-fight-details__table-text+ .b-fight-details__table-text") %>% html_text()
  ) -> t
  return(t)
}

df <- map_dfr(name_links, get_Info)

Here is the error output I get:

Browse[1]> Q
> library(rvest)
Warning message:
In for (i in seq_along(a)) if (all(nam[i] != std.attr)) { :
  closing unused connection 6 (http://www.ufcstats.com/fighter-details/http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9)

...

> df <- map_dfr(name_links, get_Info)
Error in open.connection(x, "rb") : HTTP error 404.
Called from: open.connection(x, "rb")

【Comments】:

    Tags: r web-scraping


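    【Solution 1】:

    The URL returned by html_attr("href") is already a complete URL, so there is no need to paste a prefix onto it. The warning in your output shows the result of the extra paste: the prefix is doubled (http://www.ufcstats.com/fighter-details/http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9), and requesting that address returns the 404. Try the following: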

    library(rvest)
    
    name_links <- page %>% html_nodes(".b-link_style_black") %>% html_attr("href")
    
    df <- purrr::map_dfr(name_links, get_Info)
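
If you would rather not depend on whether a page emits absolute or relative links, a small guard can prepend the site root only when it is missing. The following is a minimal base-R sketch, not part of the original answer: the helper name to_absolute_url and the relative href in the example are hypothetical and used only for illustration.

```r
# Hypothetical helper: prepend `base` only to hrefs that are not already absolute.
to_absolute_url <- function(href, base = "http://www.ufcstats.com") {
  href <- trimws(href)
  # An href is treated as absolute if it starts with http:// or https://
  is_absolute <- grepl("^https?://", href)
  # Join base and path with exactly one slash between them
  relative <- paste0(sub("/+$", "", base), "/", sub("^/+", "", href))
  ifelse(is_absolute, href, relative)
}

hrefs <- c(
  "http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9",  # absolute, as in the question
  "/fighter-details/93fe7332d16c6ad9"                          # relative form, illustration only
)
urls <- to_absolute_url(hrefs)
print(urls)
```

Both inputs resolve to the same address, so the prefix can no longer be doubled. In the question's pipeline this would replace the paste(...) step, for example html_attr("href") %>% to_absolute_url().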
    

    【Comments】:
