How to convert Wikipedia tables into pandas DataFrames? [duplicate]

【Question title】: How to convert wikipedia tables into pandas dataframes? [duplicate]
【Posted】: 2021-05-18 18:57:28
【Question】:

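I want to apply some statistics to data tables obtained directly from specific web pages. This tutorial, https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059, helped me create a DataFrame from the table on the page http://pokemondb.net/pokedex/all. However, I want to do the same with geographic data, such as the population and GDP of several countries.

I found some tables on Wikipedia, but it does not work very well and I do not understand why. Here is my code, following the tutorial above: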

import requests
import lxml.html as lh
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'


#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Check the length of the first 12 rows
print('Length of first 12 rows')
print ([len(T) for T in tr_elements[:12]])

#Create empty list
col=[]
i=0 #For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))
    
    
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 10, the //tr data is not from our table 
    if len(T)!=10:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Check if row is empty
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

print('Data gathering: done!')
print('Column lentgh:')
print([len(C) for (title,C) in col])

Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

print(df.head())

The output is as follows:

Length of first 12 rows
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
1:"Ranks
"
2:"Countries(or dependent territory)
"
3:"Officialfigure(whereavailable)
"
4:"Date oflast figure
"
5:"Source
"
Data gathering: done!
Column lentgh:
[0, 0, 0, 0, 0]
Empty DataFrame
Columns: [Ranks
, Countries(or dependent territory)
, Officialfigure(whereavailable)
, Date oflast figure
, Source
]
Index: []

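The column lengths should not be zero. The format is different from the one in the tutorial. Any idea how to get this right? Or is there perhaps another data source that does not return this strange output format?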

【Comments】:

TL;DR: pd.read_html(url) gives you a list of the tables on the page, which you can then index into.

Tags: python html python-3.x pandas dataframe

【Solution 1】:

Instead of using requests, read the URL data with pandas.

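tables = pd.read_html(url)  #returns a list of DataFrames, one per table found on the page
df = tables[0]  #pick the table you need by its index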
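Since pd.read_html returns a list rather than a single DataFrame, it may help to see how one table is picked out of it. The following is a minimal sketch, not the answerer's code: the HTML string and its figures are made up for illustration, and it assumes a parser backend such as lxml is installed. Recent pandas versions expect literal HTML to be wrapped in StringIO; a URL can be passed directly.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for a downloaded page: two tables, only one of interest.
HTML = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>navigation box</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>Country A</td><td>1,000,000</td></tr>
  <tr><td>2</td><td>Country B</td><td>2,500,000</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it can parse.
tables = pd.read_html(StringIO(HTML))

# match= keeps only the tables whose text matches the given string or regex;
# thousands="," tells the parser to read "1,000,000" as a number.
population = pd.read_html(StringIO(HTML), match="Population", thousands=",")[0]

print(len(tables))
print(population)
```

Using match avoids relying on the position of the table in the page, which can change when the page is edited.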

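【Discussion】:

【Solution 2】:

The length of each row is not 10, as the print statement on line 16 of your code shows (it produces the list at the top of the output). It is 5. Your code therefore breaks out of the loop on the first iteration instead of filling col.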

Changing this statement:

if len(T)!=10:
    break

to

if len(T)!=5:
    break

should solve the problem.

【Discussion】:

【Solution 3】:
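Hard-coding the row width means the check has to be edited for every new table. As a hedged alternative, the width can be taken from the header row, and rows of a different size can be skipped with continue rather than ending the loop with break. The sketch below works on plain lists of cell strings so that it runs without any download; the function name rows_to_columns and the sample rows are hypothetical.

```python
def rows_to_columns(rows):
    """Convert rows (lists of cell strings, header first) into {header: [values]}."""
    header, *body = rows
    width = len(header)  # expected cell count comes from the header, not a constant
    columns = {name.strip(): [] for name in header}
    for row in body:
        if len(row) != width:
            continue  # skip an odd row but keep scanning, unlike break
        for name, cell in zip(columns, row):
            columns[name].append(cell.strip())
    return columns


# Made-up rows; with lxml they could be built as
# [[cell.text_content() for cell in tr] for tr in tr_elements]
rows = [
    ["Rank\n", "Country\n", "Population\n"],
    ["1\n", "Country A\n", "1,000,000\n"],
    ["A note spanning the whole table\n"],
    ["2\n", "Country B\n", "2,500,000\n"],
]
columns = rows_to_columns(rows)
print(columns)
```

The resulting dictionary can be passed to pd.DataFrame as in the question. Note that this sketch assumes the header names are unique; duplicate names would share one list.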

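On line 51 (col[i][1].append(data)), you are modifying the contents of a tuple. A tuple itself cannot be changed in Python, although appending to a list stored inside it is allowed, so this line is not what leaves the columns empty.

If you also want to be able to replace an entry, use a list instead.

Change line 25 to col.append([name,[]])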
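It may be worth checking what Python actually permits here. In this small sketch (the variable names are illustrative only), appending to a list held inside a tuple succeeds, while rebinding a slot of the tuple raises TypeError:

```python
entry = ("Rank", [])      # same shape as the (name, []) pairs stored in col

entry[1].append(1)        # allowed: the list inside the tuple is mutable

try:
    entry[1] = [1, 2]     # not allowed: a tuple slot cannot be rebound
    reassigned = True
except TypeError:
    reassigned = False

print(entry, reassigned)
```

So the (name, []) pairs in the question can be filled either way; switching to lists only matters if whole entries need to be replaced later.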

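Also, break exits the for loop entirely, which is why the lists end up with no data.

When doing this kind of thing, you also have to look at the HTML. The table is not formatted as nicely as one would hope. For example, it contains a lot of newlines as well as images of the countries' flags. You can look at this example of North America to see how the format differs each time.

It looks like you want an easy way to do this. I would look into BeautifulSoup4. I have added a way to do this using bs4. You will have to make some edits to make the result look nicer.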
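The stray newlines also explain why int(data) in the question rarely succeeds on its own. A rough cleaning helper might look like the sketch below. It assumes that figures use commas as thousands separators and that footnote markers appear in square brackets; both are assumptions about the page, and clean_cell is a hypothetical name.

```python
import re


def clean_cell(text):
    """Strip scraping debris from a cell and convert plain integers."""
    text = re.sub(r"\[[^\]]*\]", "", text)    # drop bracketed footnote markers
    text = text.replace("\xa0", " ").strip()  # non-breaking spaces and newlines
    digits = text.replace(",", "")            # assumed thousands separator
    if re.fullmatch(r"-?\d+", digits):
        return int(digits)
    return text


print(clean_cell("1,000,000[a]\n"))
print(clean_cell("\xa0Country A\n"))
print(clean_cell("18 May 2021\n"))
```

Cells that are not plain integers, such as dates or names, are returned as stripped text.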

import requests
import bs4 as bs
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'

column_names = []
data = []
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the html in the soup object
soup = bs.BeautifulSoup(page.content, 'html.parser')
#Gets the table html
table = soup.find_all('table')[0]
#gets the table header
thead = table.find_all('th')
#Puts the header into the column names list. We will use this for the dict keys later
for th in thead:
    column_names.append(th.get_text())

#gets all the rows of the table
rows  = table.find_all('tr')
#Skip the first row, as it is the header
for row in rows[1:]:
    #Lists every child of the row. The cells are read from the odd indices below,
    #since the newlines between them are parsed as separate text nodes
    values = [r for r in row]
    #Gets each values that we care about
    rank = values[1].get_text()
    country = values[3].get_text()
    pop = values[5].get_text()
    date = values[7].get_text()
    source = values[9].get_text()
    temp_list = [rank,country,pop,date,source]
    #Creates a dictionary with keys being the column names and the values being temp_list. Appends this to list data
    data.append(dict(zip(column_names, temp_list)))
print(column_names)

df = pd.DataFrame(data)
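The edits needed to make the result look nicer could go along the following lines. This is only a sketch on made-up rows shaped like the output of the loop above, where both keys and values still carry trailing newlines; the column names and figures are assumptions, not data from the page.

```python
import pandas as pd

# Made-up rows shaped like the output of the bs4 loop: keys and values carry newlines.
data = [
    {"Rank\n": "1\n", "Country\n": "Country A\n", "Population\n": "1,000,000\n"},
    {"Rank\n": "2\n", "Country\n": "Country B\n", "Population\n": "2,500,000\n"},
]
df = pd.DataFrame(data)

# Remove the whitespace around the column names and around every cell.
df.columns = df.columns.str.strip()
df = df.apply(lambda col: col.str.strip())

# Convert the numeric columns once the thousands separators are removed.
df["Population"] = df["Population"].str.replace(",", "", regex=False).astype(int)
df["Rank"] = df["Rank"].astype(int)

print(df)
```

Real pages may need further handling, for example footnote markers or missing values, before astype(int) succeeds.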

【Discussion】: