【Question Title】: Is there a way to fix HTTP error 403 when web scraping ESPN's NBA data?
【Posted】: 2021-12-15 04:33:00
【Question】:

I am trying to scrape data from this website (ESPN's team schedule pages), but for every NBA team.

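However, when I run the code below, I keep getting an HTTP 403 error, specifically:

Error in open.connection(x, "rb") : HTTP error 403.

I don't know how to fix this, because I have seen other projects scrape this exact website with the exact same code without any problems.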

library(rvest)
library(lubridate)
library(tidyverse)
library(stringr)
library(zoo)
library(h2o)

teams<-c("tor", "mil", "den", "gs", "ind", "phi", "okc", "por", "bos", "hou", "lac", "sa",
         "lal", "utah", "mia", "sac", "min", "bkn", "dal", "no", "cha", "mem", "det", "orl",
         "wsh", "atl", "phx", "ny", "chi", "cle")

teams_fullname<-c("Toronto", "Milwaukee", "Denver", "Golden State", "Indiana", "Philadelphia", "Oklahoma City","Portland",
                  "Boston", "Houston", "LA", "San Antonio", "Los Angeles", "Utah", "Miami", "Sacramento", "Minnesota", "Brooklyn",
                  "Dallas", "New Orleans", "Charlotte", "Memphis", "Detroit", "Orlando", "Washington", "Atlanta", "Phoenix",
                  "New York", "Chicago", "Cleveland")

by_team <- NULL
for (i in seq_along(teams)) {
  url <- paste0("http://www.espn.com/nba/team/schedule/_/name/", teams[i])
  #print(url)
  webpage <- read_html(url)  # the HTTP error 403 is raised here
  team_table <- html_nodes(webpage, 'table')
  team_c <- html_table(team_table, fill=TRUE, header = TRUE)[[1]]
  # keep only the rows above the first row whose RESULT is "TIME"
  team_c <- team_c[seq_len(which(team_c$RESULT=="TIME")[1] - 1),]
  team_c$URLTeam <- toupper(teams[i])
  team_c$FullURLTeam <- teams_fullname[i]
  by_team <- rbind(by_team, team_c)
}

# remove the postponed games
by_team<-by_team%>%filter(RESULT!='Postponed')

I would just like to know why this is happening and/or how to fix this error. Any help is appreciated.

【Comments】:

    Tags: r web-scraping http-status-code-403


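    【Solution 1】:

    Fewer and fewer websites allow a direct rvest::read_html(url).
    Fetch the page first with httr::GET(url) or httr::RETRY('GET', url), then pass the response to rvest::read_html(). (The example uses the native pipe |>, which requires R >= 4.1.)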

    webpage <- url |>
      httr::GET() |>
      rvest::read_html()
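
As a rough sketch of how this fix could be folded into the loop from the question: the helper names below (schedule_url, trim_upcoming, fetch_schedule_page, scrape_schedules) are made up for illustration and are not part of the original answer. It assumes the httr and rvest packages are installed, and that ESPN still serves the same table layout, which may not hold.

```r
# Illustrative sketch only; all helper names are hypothetical.

# Build the schedule URL for one team abbreviation, e.g. "tor".
schedule_url <- function(team) {
  paste0("http://www.espn.com/nba/team/schedule/_/name/", team)
}

# Keep only the rows above the first row whose RESULT column is "TIME"
# (the header row of the upcoming-games part of the table).
trim_upcoming <- function(tbl) {
  cut <- which(tbl$RESULT == "TIME")
  if (length(cut) == 0) return(tbl)
  tbl[seq_len(cut[1] - 1), , drop = FALSE]
}

# Request the page with httr, fail loudly on a 4xx/5xx status,
# then parse the response with rvest.
fetch_schedule_page <- function(team) {
  resp <- httr::GET(schedule_url(team))
  httr::stop_for_status(resp)
  rvest::read_html(resp)
}

# Scrape every team and stack the results into one table.
scrape_schedules <- function(teams, pause = 1) {
  per_team <- lapply(teams, function(team) {
    page <- fetch_schedule_page(team)
    tbl <- rvest::html_table(rvest::html_nodes(page, "table"), header = TRUE)[[1]]
    tbl <- trim_upcoming(tbl)
    tbl$URLTeam <- toupper(team)
    Sys.sleep(pause)  # short pause between requests (an assumption, to be polite)
    tbl
  })
  do.call(rbind, per_team)
}
```

Calling scrape_schedules(teams) with the teams vector from the question should then give roughly the same by_team table as the original loop. httr::stop_for_status() is there so that a remaining 403 shows up as a clear R error rather than as a confusing parse failure later on.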
    

    【Comments】:
