readHTMLTable in R only bringing back first two tables from basketball-reference page

【Question title】: readHTMLTable in R only bringing back first two tables from basketball-reference page
【Posted】: 2017-01-03 00:42:47
【Question description】:

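I am trying to scrape the team stats page on basketball-reference.com, but when I use readHTMLTable it only brings back the first two tables.

My R code looks like this: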

library(XML)

url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
teamPageTables <- readHTMLTable(url)

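This returns a list of only 2 elements: the first two tables on the page. I expected a list containing every table on the page.

I have also tried rvest with the XPath of the table I want (the Miscellaneous Stats table), but had no luck there either.

Did BBR change something to prevent scraping? I even looked at another post about scraping the team pages, which said the table its author wanted was at index 16... I copied that code and still got nothing.

Any help would be much appreciated. Thanks,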

【Question comments】:

Tags: r web-scraping

【Solution 1】:

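Because the other tables are inside HTML comments, readHTMLTable() does not capture them. Instead, consider reading the URL text with readLines, removing the comment tags, and then parsing the document from there. It turns out there are 85 tables on the page! The code below extracts the 10 tables that are immediately viewable on screen: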

library(XML)

# READ URL TEXT
url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
urltxt <- readLines(url)
# REMOVE COMMENT TAGS
urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))

# PARSE UNCOMMENTED TEXT
doc <- htmlParse(urltxt)

# RETRIEVE ALL <table> TAGS
tables <- xpathApply(doc, "//table")

# LIST OF DATAFRAMES
teamPageTables <- lapply(tables[c(1:2,19:26)], function(i) readHTMLTable(i))
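
Positional indices such as tables[c(1:2,19:26)] depend on the page layout at the time the answer was written and may shift if the site changes. As a minimal sketch, assuming the wanted table keeps the id misc_stats (the id quoted in Solution 2), it could be selected by id instead. The inline HTML below is a made-up stand-in for the downloaded page so the example runs without network access; the team names and values are invented.

```r
library(XML)

# Hypothetical stand-in for the downloaded page: one visible table and one
# table wrapped in an HTML comment (team names and values are made up).
urltxt <- c(
  '<html><body>',
  '<table id="visible"><thead><tr><th>Team</th><th>W</th></tr></thead>',
  '<tbody><tr><td>AAA</td><td>10</td></tr></tbody></table>',
  '<!--',
  '<table id="misc_stats"><thead><tr><th>Team</th><th>Age</th></tr></thead>',
  '<tbody><tr><td>AAA</td><td>26.1</td></tr>',
  '<tr><td>BBB</td><td>27.4</td></tr></tbody></table>',
  '-->',
  '</body></html>'
)

# Strip the comment markers so the hidden table becomes ordinary markup
urltxt <- gsub("<!--|-->", "", urltxt)
doc <- htmlParse(paste(urltxt, collapse = "\n"), asText = TRUE)

# Select the table by its id rather than by its position in the list
node <- getNodeSet(doc, "//table[@id='misc_stats']")[[1]]
misc_stats <- readHTMLTable(node, stringsAsFactors = FALSE)
```

With the real page, urltxt would come from readLines(url) as in the answer above, and the rest would stay the same as long as the id is still present.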

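【Comments】:

【Solution 2】:

This web page contains only two valid HTML tables. The other tables sit inside the page as HTML comments, probably to be parsed later by some JavaScript. You could try parsing those comments yourself.

The code shown below finds the two valid tables and writes the raw HTML to a file. Open bb.html in a text editor and note that it contains many tables.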

library(rvest)
url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
page <- read_html(url)

# there are two valid tables - get them with css id's
team_stats_per_game <- html_node(page, "#team-stats-per_game")
divs_standings_E <- html_nodes(page, "#divs_standings_E")

# look at the actual page text - open bb.html in a text editor
text <- readLines(url)
writeLines(text, "bb.html")

What a commented-out table looks like:

<div class="placeholder"></div>
<!--  
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_misc_stats">
  <table class="sortable stats_table" id="misc_stats" data-cols-to-freeze=2><caption>Miscellaneous Stats Table</caption>
etc.
-->
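
One way to parse such comments with rvest, offered as a sketch rather than a definitive method: select the comment nodes with the XPath //comment(), re-parse their text as HTML, and then look up the table by id. The inline HTML is a made-up stand-in modelled on the excerpt above (the values are invented), so the example runs offline; with the real page, page would come from read_html(url).

```r
library(rvest)

# Hypothetical stand-in for the downloaded page (values are made up)
html <- paste0(
  '<html><body><div class="placeholder"></div>',
  '<!-- <div class="table_outer_container">',
  '<table id="misc_stats"><thead><tr><th>Team</th><th>Age</th></tr></thead>',
  '<tbody><tr><td>AAA</td><td>26.1</td></tr>',
  '<tr><td>BBB</td><td>27.4</td></tr></tbody></table>',
  '</div> -->',
  '</body></html>'
)
page <- read_html(html)

# A normal selector finds nothing because the table lives inside a comment
visible <- html_nodes(page, "#misc_stats")

# Pull the text of every comment node and re-parse it as HTML
comment_text <- html_text(html_nodes(page, xpath = "//comment()"))
hidden <- read_html(paste(comment_text, collapse = ""))

# The table is now an ordinary node and can be converted to a data frame
misc_stats <- html_table(html_node(hidden, "#misc_stats"))
```

Unlike stripping the comment markers with gsub, this keeps the original document untouched and only re-parses what was inside the comments.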

【Comments】: