Importing Wikipedia tables in R: answers

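【Question title】: Importing Wikipedia tables in R
【Posted】: 2011-11-16 11:44:16
【Question】:

I regularly extract tables from Wikipedia. Excel's web import does not work for Wikipedia because it treats the whole page as a single table. In Google Spreadsheets, I can enter: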

=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)

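and this function downloads the third table from that page, which lists all the counties of the Upper Peninsula (UP) of Michigan.

Is there something similar in R, or can it be created with a user-defined function?

【Comments】:

Possible duplicate of stackoverflow.com/questions/1395528/…
@DWin Simple, yes; but duplicate/reproducible? No. Wouldn't a single script do the job well?
@Ramnath I had not seen that thread, but the solution offered there does work: readHTMLTable(theurl) followed by tables[3]. Thanks for sharing. I will have to figure out how to convert the result into a proper data frame.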

Tags: r dataframe

【Solution 1】:

A simple approach is to use the RGoogleDocs interface and let Google Docs do the conversion for you:

http://www.omegahat.org/RGoogleDocs/run.html

You can then use the =ImportHtml Google Docs function with all its pre-built magic.

【Comments】:

【Solution 2】:

The function readHTMLTable in the XML package is ideal for this.

Try the following:

library(XML)
doc <- readHTMLTable(
         doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                 V3                              V4
1       County Population  Land Area (sq mi)  Population Density (per sq mi)
2        Alger      9,862                918                            10.7
3       Baraga      8,735                904                             9.7
4     Chippewa     38,413               1561                            24.7
5        Delta     38,520               1170                            32.9
6    Dickinson     27,427                766                            35.8
7      Gogebic     17,370               1102                            15.8
8     Houghton     36,016               1012                            35.6
9         Iron     13,138               1166                            11.3
10    Keweenaw      2,301                541                             4.3
11        Luce      7,024                903                             7.8
12    Mackinac     11,943               1022                            11.7
13   Marquette     64,634               1821                            35.5
14   Menominee     25,109               1043                            24.3
15   Ontonagon      7,818               1312                             6.0
16 Schoolcraft      8,903               1178                             7.6
17       TOTAL    317,258             16,420                            19.3

readHTMLTable returns a list of data frames, one for each table element of the HTML page. You can use names to get information about each element:

> names(doc)
 [1] "NULL"                                                                               
 [2] "toc"                                                                                
 [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
 [4] "NULL"                                                                               
 [5] "Cities and Villages of the Upper Peninsula"                                         
 [6] "Upper Peninsula Land Area and Population Density by County"                         
 [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
 [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
 [9] "NULL"                                                                               
[10] "NULL"                                                                               
[11] "NULL"                                                                               
[12] "NULL"                                                                               
[13] "NULL"                                                                               
[14] "NULL"                                                                               
[15] "NULL"                                                                               
[16] "NULL"
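
The table printed above still carries its header as row 1 and its numbers as text with thousands separators. The following is a minimal sketch of turning such a result into a conventional data frame. It uses a hand-built stand-in for the first rows of doc[[6]] (the name raw and the choice of three columns are assumptions made for illustration), so it runs without downloading anything.

```r
# Hand-built stand-in for the first rows of doc[[6]] (assumption: same layout
# as the printed output, header in row 1, all values stored as text)
raw <- data.frame(
  V1 = c("County", "Alger", "Baraga", "Chippewa"),
  V2 = c("Population", "9,862", "8,735", "38,413"),
  V3 = c("Land Area (sq mi)", "918", "904", "1561"),
  stringsAsFactors = FALSE
)

# Promote the first row to syntactically valid column names, then drop it
header <- vapply(raw[1, ], as.character, character(1))
tidy <- raw[-1, ]
names(tidy) <- make.names(header)
rownames(tidy) <- NULL

# Strip thousands separators and convert every column except County to numeric
num_cols <- setdiff(names(tidy), "County")
tidy[num_cols] <- lapply(tidy[num_cols], function(x) as.numeric(gsub(",", "", x)))

str(tidy)
```

On the real page, a summary row such as TOTAL would still need to be dropped separately if it is not wanted.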

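【Comments】:

I tried readHTMLTable(doc = "https://en.wikipedia.org/wiki/Gross_domestic_product") and got XML content does not seem to be XML: I guess https is the problem. How can this be fixed?
This solution no longer works now that Wikipedia has moved to secure connections. Any clue how to make it work?
See schnee's answer to this question, which deals with https.

【Solution 3】:

Here is a solution that works with secure (https) links: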

install.packages("htmltab")
library(htmltab)
htmltab("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan",3)

【Comments】:

【Solution 4】:

Building on Andrie's answer and addressing SSL, if you can accept one additional library dependency:

library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"

r <- GET(url)

doc <- readHTMLTable(
  doc=content(r, "text"))

doc[6]
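
The approach above works because readHTMLTable can parse HTML that has already been downloaded. The sketch below illustrates this offline: an inline HTML string (made up from two rows of the county table) stands in for content(r, "text"), and it is parsed explicitly with htmlParse(..., asText = TRUE). It also shows the difference between single and double brackets when picking a table from the result. The variable names are illustrative only.

```r
library(XML)

# Inline HTML standing in for content(r, "text"); the rows are a made-up
# excerpt so the example runs without a network connection.
page_text <- paste0(
  "<html><body><table>",
  "<tr><th>County</th><th>Population</th></tr>",
  "<tr><td>Alger</td><td>9,862</td></tr>",
  "<tr><td>Baraga</td><td>8,735</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML content rather than as a file name or URL
parsed <- htmlParse(page_text, asText = TRUE)
tables <- readHTMLTable(parsed)

one_element_list <- tables[1]    # still a list, holding one data frame
county_table     <- tables[[1]]  # the data frame itself

class(one_element_list)
class(county_table)
```

Double brackets are usually what is wanted when the table is passed on to functions that expect a data frame.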

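【Comments】:

【Solution 5】:

A tidyverse solution using rvest. It is very useful if you need to find a table based on some keywords (for example, in the table headers). Here is an example where we want the table of vital statistics for Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse the tables available on the page.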

library(magrittr)
library(rvest)
library(stringr)  # provides str_which(), which the pipeline below needs

# define the page to load
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>%
    # list all tables on the page
    html_nodes(css = "table") %>%
    # select the one containing the needed keywords
    extract2(., str_which(string = . , pattern = "Live births")) %>%
    # convert to a data frame
    html_table(fill = TRUE) %>%
    # open the result in the data viewer
    View()
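
The keyword lookup can also be sketched without stringr, using html_text and base grepl. The example below is a hedged illustration rather than the answer's own method: the two-table page is made up so that it runs offline, and taking only the first match is an assumption about what should happen when several tables contain the keyword.

```r
library(rvest)

# Made-up page with two tables so the example runs offline
page <- read_html(paste0(
  "<html><body>",
  "<table><tr><th>Year</th><th>Population</th></tr>",
  "<tr><td>2000</td><td>100</td></tr></table>",
  "<table><tr><th>Year</th><th>Live births</th></tr>",
  "<tr><td>2000</td><td>10</td></tr></table>",
  "</body></html>"
))

tables <- html_nodes(page, css = "table")

# Text content of each table, one string per table
table_text <- html_text(tables)

# Index of the first table whose text mentions the keyword
hit <- which(grepl("Live births", table_text, fixed = TRUE))[1]

result <- html_table(tables[[hit]])
result
```

If no table contains the keyword, hit is NA and the last step fails, so a real script would probably check for that first.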

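【Comments】:

【Solution 6】:

The table is the only table nested inside a td that is the second child of its parent, so you can specify that pattern with CSS. Instead of using a table type selector to grab the nested table, you can select it by its class, which is faster: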

library(rvest)

t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>% 
  html_node('td:nth-child(2) .wikitable') %>% 
  html_table()

print(t)
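
To see what the selector matches, here is a small offline sketch. The nested-table layout is invented for illustration and is only assumed to resemble the structure of the Wikipedia page: an outer table row with two td cells, each holding a table of class wikitable. Only the table inside the second cell should be selected.

```r
library(rvest)

# Invented layout: an outer row with two cells, each containing a .wikitable
page <- read_html(paste0(
  "<html><body><table><tr>",
  "<td><table class='wikitable'><tr><th>Note</th></tr>",
  "<tr><td>left cell</td></tr></table></td>",
  "<td><table class='wikitable'><tr><th>County</th><th>Population</th></tr>",
  "<tr><td>Alger</td><td>9,862</td></tr></table></td>",
  "</tr></table></body></html>"
))

# Select the .wikitable that sits inside a td which is its parent's second child
counties <- page %>%
  html_node("td:nth-child(2) .wikitable") %>%
  html_table()

counties
```

The table in the first cell is skipped because its enclosing td is the first child of its row, not the second.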

【Comments】: