【Question Title】: How to convert Wikipedia tables into pandas DataFrames? [duplicate]
【Posted】: 2021-05-18 18:57:28
【Question】:

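I want to apply some statistics to data tables taken directly from specific web pages. This tutorial, https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059, helped me create a DataFrame from the table on the page http://pokemondb.net/pokedex/all. However, I would like to do the same with geographic data, such as the population and GDP of several countries.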

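I found some tables on Wikipedia, but my approach does not work well with them and I do not understand why. Here is my code, following the tutorial above: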

import requests
import lxml.html as lh
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'


#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Check the length of the first 12 rows
print('Length of first 12 rows')
print ([len(T) for T in tr_elements[:12]])

#Create empty list
col=[]
i=0 #For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))
    
    
#Since out first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 10, the //tr data is not from our table 
    if len(T)!=10:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Check if row is empty
        if i>0:
        #Convert any numerical value to integers
            try:
                data=int(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

print('Data gathering: done!')
print('Column lentgh:')
print([len(C) for (title,C) in col])

Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

print(df.head())

The output is as follows:

Length of first 12 rows
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
1:"Ranks
"
2:"Countries(or dependent territory)
"
3:"Officialfigure(whereavailable)
"
4:"Date oflast figure
"
5:"Source
"
Data gathering: done!
Column lentgh:
[0, 0, 0, 0, 0]
Empty DataFrame
Columns: [Ranks
, Countries(or dependent territory)
, Officialfigure(whereavailable)
, Date oflast figure
, Source
]
Index: []

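The column lengths should not be zero. The output format also differs from the one in the tutorial. Any idea how to make this work? Or is there another data source that does not return this odd output format?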

【Comments】:

  • TL;DR: pd.read_html(url) gives you a list of the tables on the page, which you can then index into.

Tags: python html python-3.x pandas dataframe


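【Solution 1】:

Instead of using requests, let pandas read the data from the URL. pd.read_html returns a list of DataFrames, one for each table found on the page, so pick the one you need by index.

dfs = pd.read_html(url)
df = dfs[0]  # e.g. the first table; adjust the index as needed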
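
A minimal sketch of what this call returns, assuming pandas and an HTML parser it can use (such as lxml) are installed. The inline HTML is a made-up stand-in for a downloaded page, with illustrative names and numbers rather than real figures, so the example runs without network access. With a real page you would pass the URL instead of the `StringIO` object.

```python
import io

import pandas as pd

# Made-up stand-in for a page containing two tables (illustrative values only).
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>Examplia</td><td>1000</td></tr>
  <tr><td>2</td><td>Samplestan</td><td>500</td></tr>
</table>
<table>
  <tr><th>Note</th></tr>
  <tr><td>unrelated table</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> it finds.
tables = pd.read_html(io.StringIO(html))
print(len(tables))

# Index into the list to pick the table you want.
df = tables[0]
print(df)

# Optionally keep only the tables whose text matches a string or regex.
matched = pd.read_html(io.StringIO(html), match="Population")
print(len(matched))
```

On a page with many tables, the `match` argument is often easier than guessing the right index.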


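    【Solution 2】:

    The length of each row, as shown by the print statement on line 16 of your code (which produces the first line of the output), is not 10. It is 5. As a result, your code breaks out of the loop on the first iteration instead of filling col.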

    Changing this statement:

    if len(T)!=10:
        break

    to:

    if len(T)!=5:
        break

    should solve the problem.
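
A hard-coded width has to be adjusted for every new page. One possible variation is sketched below, with plain lists standing in for the parsed `<tr>` elements (the sample rows are made up). It takes the expected width from the header row and uses `continue` to skip rows that do not fit, whereas `break` stops at the first mismatch.

```python
# Plain lists stand in for the parsed <tr> elements; the values are made up.
rows = [
    ["Rank", "Country", "Population", "Date", "Source"],  # header row
    ["1", "Examplia", "1000", "2021", "Census"],
    ["note spanning the table"],                           # malformed row
    ["2", "Samplestan", "500", "2021", "Estimate"],
]

header = rows[0]
expected_width = len(header)  # derive the width instead of hard-coding it

columns = {name: [] for name in header}
for row in rows[1:]:
    if len(row) != expected_width:
        continue  # skip this row only; break would end the whole loop
    for name, value in zip(header, row):
        columns[name].append(value)

print(columns["Country"])
```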


      【Solution 3】:

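      Line 52 (the col[i][1].append(data) step) works on a tuple. A tuple itself cannot be edited in Python, although a list stored inside one can still be appended to.

      If you want each entry to be fully editable, use a list instead.

      Change line 25 to col.append([name,[]])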
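
What a tuple does and does not allow can be checked with a short sketch (the names and values here are illustrative, not taken from the question's data):

```python
# A (name, values) pair stored as a tuple, as in col.append((name, [])).
pair = ("Country", [])

# Appending to the list inside the tuple is allowed: the tuple still
# holds the same list object, only the list's contents change.
pair[1].append("Examplia")

# Replacing an element of the tuple itself is not allowed.
try:
    pair[0] = "Nation"
    reassigned = True
except TypeError:
    reassigned = False

# A list in place of the tuple accepts both operations.
pair_as_list = ["Country", []]
pair_as_list[1].append("Examplia")
pair_as_list[0] = "Nation"

print(pair, reassigned, pair_as_list)
```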

      Also, break exits the for loop entirely, which is why the lists end up with no data in them.

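      When doing this kind of thing, you also have to look at the HTML. The table is not formatted as nicely as one would hope. For example, it contains a lot of newlines, as well as images of country flags. You can look at this example of North America to see how the format differs from page to page.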
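
As a rough illustration of the cleanup this implies, here is a small helper that collapses stray newlines and converts digit strings to integers. The function name `clean_cell` is hypothetical, and the sample strings are modeled on the header output shown in the question. Real cells may need further handling, for instance of footnote markers.

```python
def clean_cell(text):
    """Collapse whitespace and turn plain or comma-grouped digits into an int."""
    cleaned = " ".join(text.split())  # drops newlines and repeated spaces
    digits = cleaned.replace(",", "")
    if digits.isdecimal():
        return int(digits)
    return cleaned


print(clean_cell("Ranks\n"))
print(clean_cell("Date of\nlast figure\n"))
print(clean_cell("1,234,567\n"))
```

Applying a helper like this to every cell before building the DataFrame keeps the column names free of trailing newlines.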

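      It sounds like you want a simple way to do this. I would look into BeautifulSoup4. I have added a way to do this with bs4 below. You will have to make some edits to make the result look nicer.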

      import requests
      import bs4 as bs
      import pandas as pd
      
      url = 'https://en.wikipedia.org/wiki/List_of_African_countries_by_population'
      
      column_names = []
      data = []
      #Create a handle, page, to handle the contents of the website
      page = requests.get(url)
      #Store the html in the soup object
      soup = bs.BeautifulSoup(page.content, 'html.parser')
      #Gets the table html
      table = soup.find_all('table')[0]
      #gets the table header
      thead = table.find_all('th')
      #Puts the header into the column names list. We will use this for the dict keys later
      for th in thead:
          column_names.append(th.get_text())
      
      #gets all the rows of the table
      rows  = table.find_all('tr')
      #Skip the first row, as it is the header
      for row in rows[1:]:
          #Iterating over a row yields its cells with the newline strings between them, which is why the odd indices are used below
          values = [r for r in row]
          #Gets each values that we care about
          rank = values[1].get_text()
          country = values[3].get_text()
          pop = values[5].get_text()
          date = values[7].get_text()
          source = values[9].get_text()
          temp_list = [rank,country,pop,date,source]
          #Creates a dictionary with keys being the column names and the values being temp_list. Appends this to list data
          data.append(dict(zip(column_names, temp_list)))
      print(column_names)
      
      df = pd.DataFrame(data)
      
