【发布时间】:2020-06-22 11:03:12
【问题描述】:
【问题讨论】:
-
如果你分享你的代码而不是图片会更好。
标签: python-3.x web-scraping html-table html-tableextract
【问题讨论】:
标签: python-3.x web-scraping html-table html-tableextract
您想使用 pandas.read_html() 将 HTML 表格收集到 pandas 数据帧中。
用法:
>>> import pandas as pd
>>> pd.read_html(io)
来自文档:
io : str 或类似文件
URL、类文件对象或包含 HTML 的原始字符串。
例如,请参阅此 Wikipedia URL:
In [0]: url = "https://en.wikipedia.org/wiki/List_of_Olympic_records_in_swimming"
In [1]: pd.read_html(url)[0]
Out[1]:
Event Time Name Nation Games Date Ref
0 50 m freestyle 21.30 César Cielo Brazil (BRA) 2008 Beijing 16 August 2008 [4]
1 100 m freestyle 47.05 Eamon Sullivan Australia (AUS) 2008 Beijing 13 August 2008 [5][A]
2 200 m freestyle 1:42.96 Michael Phelps United States (USA) 2008 Beijing 12 August 2008 [6]
3 400 m freestyle 3:40.14 Sun Yang China (CHN) 2012 London 28 July 2012 [7]
4 1500 m freestyle ♦14:31.02 Sun Yang China (CHN) 2012 London 4 August 2012 [8]
5 100 m backstroke ♦51.85 Ryan Murphy United States (USA) 2016 Rio de Janeiro 13 August 2016 [9][B]
6 200 m backstroke 1:53.41 Tyler Clary United States (USA) 2012 London 2 August 2012 [10]
7 100 m breaststroke 57.13 Adam Peaty Great Britain (GBR) 2016 Rio de Janeiro 7 August 2016 [11]
8 200 m breaststroke 2:07.22 Ippei Watanabe Japan (JPN) 2016 Rio de Janeiro 9 August 2016 [12][C]
9 100 m butterfly 50.39 Joseph Schooling Singapore (SGP) 2016 Rio de Janeiro 12 August 2016 [13]
10 200 m butterfly 1:52.03 Michael Phelps United States (USA) 2008 Beijing 13 August 2008 [14]
11 200 m individual medley 1:54.23 Michael Phelps United States (USA) 2008 Beijing 15 August 2008 [15]
12 400 m individual medley ♦4:03.84 Michael Phelps United States (USA) 2008 Beijing 10 August 2008 [16]
13 4×100 m freestyle relay ♦3:08.24 Michael Phelps (47.51)Garrett Weber-Gale (47.0... United States (USA) 2008 Beijing 11 August 2008 [17]
14 4×200 m freestyle relay 6:58.56 Michael Phelps (1:43.31)Ryan Lochte (1:44.28)R... United States (USA) 2008 Beijing 13 August 2008 [18]
15 4×100 m medley relay 3:27.95 Ryan Murphy (51.85)Cody Miller (59.03) Michael... United States (USA) 2016 Rio de Jainero 13 August 2016 [19]
【讨论】:
requests 库(向其传递必要的凭据)预先收集其原始 HTML,然后使用 read_html 从原始 HTML 中解析表格. (而不是来自 URL)。