Extract a specific table from Wikipedia in R: answers

【Question title】: Extract a specific table from Wikipedia in R
【Posted】: 2021-03-14 16:25:10
【Question】:

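I want to extract the 20th table from the Wikipedia page https://en.wikipedia.org/wiki/..

I am currently using this code, but it only extracts the first table (the header table).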

library(rvest)

the_url <- "https://en.wikipedia.org/wiki/..."
# html_node() returns only the first match, so this yields only the first table
tb <- the_url %>% read_html() %>%
  html_node("table") %>%
  html_table(fill = TRUE)

What should I do to get a specific one? Thanks!!

【Question comments】:

Try html_node(xpath = "//table[20]") instead of the plain html_node("table"). As of now, it is the 20th table in that HTML. Note that its position may change in the future.
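One caveat about the XPath in this comment, illustrated with a minimal sketch on a made-up HTML snippet (the ids `a`, `b`, `c` are hypothetical): `//table[20]` matches a table that is the 20th `<table>` child of its own parent, whereas `(//table)[20]` picks the 20th table in document order. The two agree only when the tables in question share a parent. The example uses index 2 to keep the snippet short.

```r
library(rvest)

# Hypothetical page: one table in the first <div>, two in the second
html <- read_html('
<html><body>
  <div><table id="a"><tr><td>1</td></tr></table></div>
  <div><table id="b"><tr><td>2</td></tr></table>
       <table id="c"><tr><td>3</td></tr></table></div>
</body></html>')

# //table[2]: tables that are the 2nd <table> child of their parent
by_parent <- html_nodes(html, xpath = "//table[2]")

# (//table)[2]: the 2nd <table> in document order
by_document <- html_nodes(html, xpath = "(//table)[2]")

html_attr(by_parent, "id")    # "c"
html_attr(by_document, "id")  # "b"
```

If the goal is "the Nth table on the page regardless of nesting", the parenthesised form is probably the safer choice.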

Tags: r

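【Solution 1】:

Rather than indexing by table position, which can shift, you can anchor on the table's relationship to the element with id Prize_money. Return just one node for efficiency. Avoid long XPaths, as they can be fragile.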

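library(rvest)

# Anchor on the "Prize money" heading and take the first table that follows it.
# The heading markup reflects the page as of this answer; check the live HTML.
prize_table <- read_html('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money') %>%
  html_node(xpath = "//*[@id='Prize_money']/parent::h4/following-sibling::table[1]") %>%
  html_table(fill = TRUE)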
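To see how the anchoring works without hitting the network, here is a minimal sketch on a made-up snippet. The HTML below is an assumption that mimics the structure the XPath expects (an `<h4>` wrapping an element with id `Prize_money`, followed by sibling tables); the cell values are placeholders, not real data.

```r
library(rvest)

# Hypothetical page fragment with a table before and after the target
page <- read_html('
<html><body>
  <table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
  <h4><span id="Prize_money">Prize money</span></h4>
  <p>Some text.</p>
  <table><tr><th>Position</th><th>Amount</th></tr>
         <tr><td>P1</td><td>10</td></tr></table>
  <table><tr><th>Later</th></tr><tr><td>y</td></tr></table>
</body></html>')

# Same XPath as in the answer: heading -> first following sibling table
node <- html_node(
  page,
  xpath = "//*[@id='Prize_money']/parent::h4/following-sibling::table[1]"
)

# html_node() returns an "xml_missing" object when nothing matches,
# which is a cheap way to detect that the page markup has changed
if (inherits(node, "xml_missing")) {
  stop("Anchor not found; the page markup may have changed")
}

prize <- html_table(node)
names(prize)
```

The table before the heading and the second table after it are both skipped, because `following-sibling::table[1]` takes only the nearest following table.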

【Comments】:

【Solution 2】:

Since you have one specific table to scrape, you can identify it in the html_nodes() call using the XPath of the page element:

library(dplyr)
library(rvest)

the_url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup"

# html_nodes() returns a node set, so the result is a list of data frames
the_url %>%
  read_html() %>%
  html_nodes(xpath = '/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>%
  html_table(fill = TRUE)
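An absolute path like the one above depends on every index along the way. As a hedged alternative sketch, a table can sometimes be selected by its content instead, for example by its caption. The snippet below is made up for illustration (the captions, classes, and values are assumptions, and the real page may not have a suitable caption), but it shows both expressions landing on the same node.

```r
library(rvest)

# Hypothetical page fragment with two captioned tables
page <- read_html('
<html><body>
  <div><div>
    <table class="wikitable"><caption>Group A</caption>
      <tr><th>Team</th><th>Pts</th></tr><tr><td>T1</td><td>3</td></tr></table>
    <table class="wikitable"><caption>Prize money</caption>
      <tr><th>Position</th><th>Amount</th></tr><tr><td>P1</td><td>10</td></tr></table>
  </div></div>
</body></html>')

# Absolute path: breaks if any wrapper <div> is added or reordered
abs_hit <- html_nodes(page, xpath = "/html/body/div[1]/div[1]/table[2]")

# Content-based: any table whose <caption> contains the given text
cap_hit <- html_nodes(page, xpath = "//table[caption[contains(., 'Prize money')]]")

html_table(cap_hit)[[1]]
```

The content-based expression keeps working if wrappers are added around the tables, at the cost of depending on the caption text staying the same.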

【Comments】:

【Solution 3】:

Try this code.

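library(rvest)

webpage <- read_html("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup")

# Collect every <table> node on the page
tbls <- html_nodes(webpage, "table")

# Keep the tables you need by position (here the 3rd and 4th) and parse them
tbls_ls <- tbls[3:4] %>%
  html_table(fill = TRUE)

# Inspect the resulting list of data frames
str(tbls_ls)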
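The same idea can be narrowed to a single position, such as the 20th table from the question. Below is a minimal sketch with a hypothetical helper, `nth_table()` (not part of rvest), that checks the index before parsing. It is demonstrated on a generated three-table page so it runs offline.

```r
library(rvest)

# Hypothetical helper: parse only the n-th <table> in document order
nth_table <- function(page, n) {
  tbls <- html_nodes(page, "table")
  if (n < 1 || n > length(tbls)) {
    stop("Page has ", length(tbls), " table(s); cannot select table ", n)
  }
  html_table(tbls[[n]])
}

# Toy page with three one-cell tables whose values are 1, 2, 3
rows <- sprintf("<table><tr><th>id</th></tr><tr><td>%d</td></tr></table>", 1:3)
page <- read_html(paste0("<html><body>", paste(rows, collapse = ""), "</body></html>"))

nth_table(page, 2)$id  # 2
```

For the original question this would be called as `nth_table(read_html(the_url), 20)`, with the same caveat as before: the position may change when the page is edited.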

【Comments】: