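【Posted】: 2021-05-18 18:57:28
【Question】:
I want to apply some statistics to tables of data obtained directly from specific web pages. This tutorial https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059 helped me build a DataFrame from the table on http://pokemondb.net/pokedex/all. However, I would like to do the same with geographic data, such as the population and GDP of several countries.
I found some tables on Wikipedia, but it does not work well and I do not understand why. Here is my code, which follows the tutorial above: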
import requests
import lxml.html as lh
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Check the length of the first 12 rows
print('Length of first 12 rows')
print ([len(T) for T in tr_elements[:12]])
#Create empty list
col=[]
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    #If row is not of size 10, the //tr data is not from our table
    if len(T)!=10:
        break
    #i is the index of our column
    i=0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Check if row is empty
        if i>0:
            #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
print('Data gathering: done!')
print('Column lentgh:')
print([len(C) for (title,C) in col])
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
print(df.head())
The output is as follows:
Length of first 12 rows
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
1:"Ranks
"
2:"Countries(or dependent territory)
"
3:"Officialfigure(whereavailable)
"
4:"Date oflast figure
"
5:"Source
"
Data gathering: done!
Column lentgh:
[0, 0, 0, 0, 0]
Empty DataFrame
Columns: [Ranks
, Countries(or dependent territory)
, Officialfigure(whereavailable)
, Date oflast figure
, Source
]
Index: []
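The column lengths should not be zero, and the format differs from the one in the tutorial. Any idea how to get this right? Or is there perhaps another data source that does not return this odd output format?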
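A possible explanation, judging from the printed output above: every row has 5 cells, but the loop stops at the first row whose length is not 10 (the column count of the Pokémon table in the tutorial), so no data is ever appended. The sketch below derives the expected width from the header row instead of hard-coding it, and strips the trailing newlines visible in the header names. It runs on a small made-up table, not on the live Wikipedia page; `SAMPLE_HTML` and the helper name `rows_to_frame` are assumptions for illustration only.

```python
import lxml.html as lh
import pandas as pd

# Hypothetical stand-in for the scraped page: made-up rows, not real data
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Rank
</th><th>Country
</th><th>Population
</th></tr>
  <tr><td>1</td><td>Aland</td><td>1,000</td></tr>
  <tr><td>2</td><td>Bland</td><td>2,500</td></tr>
</table>
</body></html>
"""

def rows_to_frame(html: str) -> pd.DataFrame:
    """Collect <tr> rows into a DataFrame, using the header row's width."""
    doc = lh.fromstring(html)
    rows = doc.xpath('//tr')
    # Header names, with surrounding whitespace and newlines removed
    header = [cell.text_content().strip() for cell in rows[0]]
    width = len(header)  # derived from the page instead of a fixed 10
    records = []
    for row in rows[1:]:
        # A row of a different width is assumed to belong to another table
        if len(row) != width:
            break
        records.append([cell.text_content().strip() for cell in row])
    return pd.DataFrame(records, columns=header)

df = rows_to_frame(SAMPLE_HTML)
print(df.head())
```

In this sketch the cell values stay as strings; a value such as "1,000" would still need its thousands separator removed before `int()` can convert it.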
【Discussion】:
- TL;DR: pd.read_html(url) gives you a list of the tables on the page, which you can then index into.
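As a rough illustration of the comment above, `pd.read_html` returns a list of DataFrames, one per `<table>` it finds, and the wanted table is picked by position. The example below parses a small made-up HTML string rather than the live page, so that it runs offline; `SAMPLE_HTML` and the helper `pick_table` are hypothetical names, and the rows are invented. `pd.read_html` also needs an HTML parser such as lxml to be installed.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Wikipedia page: made-up rows, not real data
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>Aland</td><td>1,000</td></tr>
  <tr><td>2</td><td>Bland</td><td>2,500</td></tr>
</table>
"""

def pick_table(html: str, index: int = 0) -> pd.DataFrame:
    """Parse every <table> in the HTML and return the one at position `index`."""
    # read_html returns a list of DataFrames; thousands="," turns "1,000" into 1000
    tables = pd.read_html(io.StringIO(html), thousands=",")
    return tables[index]

df = pick_table(SAMPLE_HTML)
print(df.head())
```

For the real page, the same call should accept the address directly, as in `pd.read_html(url)`, after which the right table index has to be found by inspecting the returned list.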
Tags: python html python-3.x pandas dataframe