【Question title】:Scrape entire table from wikipedia using beautifulsoup and then load into pandas
【Posted】:2020-04-10 12:48:15
【Question】:

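I am currently scraping the following wiki page: https://en.wikipedia.org/wiki/Cargo_aircraft, which has only one table, under the Comparisons section. I am trying to scrape the entire table and load it into pandas. I know how to add the first column, Aircraft, but I am having trouble scraping the columns from Volume onward.

How do I add all the rows of the table to a dataframe or to columns? I am not sure which approach is better.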



from bs4 import BeautifulSoup
import requests
import pandas as pd

#this will use request library to call wikipedia

page = requests.get('https://en.wikipedia.org/wiki/Cargo_aircraft')

#create beautifulsoup object

soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', attrs={'class':'wikitable sortable'})
tabledata = table.findAll('tbody')
links = table.findAll('a')




aircraft = []
for link in links:
    aircraft.append(link.get('title'))
print(aircraft)


#pull table from Wikipedia

df = pd.DataFrame()
df['Aircraft'] = aircraft
df['Test'] = 'test'

【Comments】:

    Tags: python pandas dataframe html-table beautifulsoup


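    【Solution 1】:

    Use pandas.read_html

    • Bypass beautifulsoup and read the table directly into pandas.
    • pandas.read_html reads HTML tables into a list of DataFrame objects.
      • In this case, the table is at index [1].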
    import pandas as pd
    
    df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
    
    # df view
    
                       Aircraft    Volume                  Payload             Cruise                  Range       Usage
    0              Airbus A400M    270 m³    37,000 kg (82,000 lb)  780 km/h (420 kn)   6,390 km (3,450 nmi)    Military
    1          Airbus A300-600F  391.4 m³   48,000 kg (106,000 lb)                  –   7,400 km (4,000 nmi)  Commercial
    2          Airbus A330-200F    475 m³   70,000 kg (154,000 lb)  871 km/h (470 kn)   7,400 km (4,000 nmi)  Commercial
    3             Airbus Beluga   1210 m³   47,000 kg (104,000 lb)                  –   4,632 km (2,500 nmi)  Commercial
    4          Airbus Beluga XL   2615 m³   53,000 kg (117,000 lb)                  –   4,074 km (2,200 nmi)  Commercial
    5            Antonov An-124   1028 m³  150,000 kg (331,000 lb)  800 km/h (430 kn)   5,400 km (2,900 nmi)        Both
    6            Antonov An-225   1300 m³  250,000 kg (551,000 lb)  800 km/h (430 kn)  15,400 km (8,316 nmi)  Commercial
    7               Boeing C-17         –   77,519 kg (170,900 lb)  830 km/h (450 kn)   4,482 km (2,420 nmi)    Military
    8           Boeing 737-700C  107.6 m³    18,200 kg (40,000 lb)  931 km/h (503 kn)   5,330 km (2,880 nmi)  Commercial
    9           Boeing 757-200F    239 m³    39,780 kg (87,700 lb)  955 km/h (516 kn)   5,834 km (3,150 nmi)  Commercial
    10            Boeing 747-8F  854.5 m³  134,200 kg (295,900 lb)  908 km/h (490 kn)   8,288 km (4,475 nmi)  Commercial
    11           Boeing 747 LCF   1840 m³   83,325 kg (183,700 lb)  878 km/h (474 kn)   7,800 km (4,200 nmi)  Commercial
    12          Boeing 767-300F  438.2 m³   52,700 kg (116,200 lb)  850 km/h (461 kn)   6,025 km (3,225 nmi)  Commercial
    13              Boeing 777F    653 m³  103,000 kg (227,000 lb)  896 km/h (484 kn)   9,070 km (4,900 nmi)  Commercial
    14    Bombardier Dash 8-100     39 m³     4,700 kg (10,400 lb)  491 km/h (265 kn)   2,039 km (1,100 nmi)  Commercial
    15             Lockheed C-5         –  122,470 kg (270,000 lb)           919 km/h   4,440 km (2,400 nmi)    Military
    16           Lockheed C-130         –    20,400 kg (45,000 lb)  540 km/h (292 kn)   3,800 km (2,050 nmi)    Military
    17         Douglas DC-10-30         –   77,000 kg (170,000 lb)  908 km/h (490 kn)   5,790 km (3,127 nmi)  Commercial
    18  McDonnell Douglas MD-11    440 m³   91,670 kg (202,100 lb)  945 km/h (520 kn)   7,320 km (3,950 nmi)  Commercial
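
Relying on a fixed position such as [1] can break if tables are added to or removed from the page. As a minimal sketch, read_html also accepts a match argument that keeps only the tables whose text matches a given string or regex. The HTML below is a made-up two-table stand-in for the real page (only the cell values in the second table are taken from the output above), wrapped in io.StringIO so the example runs offline. It assumes an HTML parser backend such as lxml is installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded page: one unrelated table followed
# by a shortened copy of the comparison table.
HTML = """
<table>
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>placeholder</td><td>unrelated</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Aircraft</th><th>Payload</th><th>Usage</th></tr>
  <tr><td>Airbus A400M</td><td>37,000 kg (82,000 lb)</td><td>Military</td></tr>
  <tr><td>Airbus A300-600F</td><td>48,000 kg (106,000 lb)</td><td>Commercial</td></tr>
</table>
"""

# match keeps only the tables whose text contains "Payload".
tables = pd.read_html(io.StringIO(HTML), match="Payload")
df = tables[0]
print(df)
```

With the live page, the same argument can be passed alongside the URL, for example pd.read_html(url, match='Payload'), so the lookup no longer depends on the table's position.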
    

    【Comments】:

      【Solution 2】:

      You can try:

      import pandas as pd

      df = pd.read_html('https://en.wikipedia.org/wiki/Cargo_aircraft')[1]
      # Keep only the first token of each cell (the metric value); a lone '–' marks a missing value.
      df['Volume'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Volume'].str.split()]).astype(float)
      df['Payload'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Payload'].str.split()]).astype(int)
      df['Cruise'] = pd.Series([x[0] if x[0] != '–' else None for x in df['Cruise'].str.split()]).astype(float)
      df['Range'] = pd.Series([x[0].replace(',', '') if x[0] != '–' else None for x in df['Range'].str.split()]).astype(int)
      

      Result:

      df.info()

      <class 'pandas.core.frame.DataFrame'>
      RangeIndex: 19 entries, 0 to 18
      Data columns (total 6 columns):
      Aircraft    19 non-null object
      Volume      15 non-null float64
      Payload     19 non-null int64
      Cruise      16 non-null float64
      Range       19 non-null int64
      Usage       19 non-null object
      dtypes: float64(2), int64(2), object(2)
      memory usage: 1.0+ KB
      

      print(df)

                         Aircraft  Volume  Payload  Cruise  Range       Usage
      0              Airbus A400M   270.0    37000   780.0   6390    Military
      1          Airbus A300-600F   391.4    48000     NaN   7400  Commercial
      2          Airbus A330-200F   475.0    70000   871.0   7400  Commercial
      3             Airbus Beluga  1210.0    47000     NaN   4632  Commercial
      4          Airbus Beluga XL  2615.0    53000     NaN   4074  Commercial
      5            Antonov An-124  1028.0   150000   800.0   5400        Both
      6            Antonov An-225  1300.0   250000   800.0  15400  Commercial
      7               Boeing C-17     NaN    77519   830.0   4482    Military
      8           Boeing 737-700C   107.6    18200   931.0   5330  Commercial
      9           Boeing 757-200F   239.0    39780   955.0   5834  Commercial
      10            Boeing 747-8F   854.5   134200   908.0   8288  Commercial
      11           Boeing 747 LCF  1840.0    83325   878.0   7800  Commercial
      12          Boeing 767-300F   438.2    52700   850.0   6025  Commercial
      13              Boeing 777F   653.0   103000   896.0   9070  Commercial
      14    Bombardier Dash 8-100    39.0     4700   491.0   2039  Commercial
      15             Lockheed C-5     NaN   122470   919.0   4440    Military
      16           Lockheed C-130     NaN    20400   540.0   3800    Military
      17         Douglas DC-10-30     NaN    77000   908.0   5790  Commercial
      18  McDonnell Douglas MD-11   440.0    91670   945.0   7320  Commercial
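
The conversion above works because Payload and Range happen to have no missing cells; astype(int) would raise if one of those columns contained a '–', since None cannot be cast to an integer. A possible alternative, sketched here on three rows copied from the table, is to pull the leading number out with a regular expression and let pd.to_numeric turn everything else into NaN. The helper name leading_number and the regex are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Three rows taken from the scraped table (A400M, A300-600F, C-17).
raw = pd.DataFrame({
    "Volume": ["270 m³", "391.4 m³", "–"],
    "Payload": ["37,000 kg (82,000 lb)", "48,000 kg (106,000 lb)", "77,519 kg (170,900 lb)"],
    "Cruise": ["780 km/h (420 kn)", "–", "830 km/h (450 kn)"],
})


def leading_number(column: pd.Series) -> pd.Series:
    """Return the first number in each cell as a float; cells without one become NaN."""
    without_commas = column.str.replace(",", "", regex=False)
    # Capture digits (with an optional decimal part) at the start of the cell.
    digits = without_commas.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(digits, errors="coerce")


clean = raw.apply(leading_number)
print(clean)
```

Every column ends up as float64 here, which trades the integer dtype of Solution 2 for tolerance of missing values in any column.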
      

      【Comments】:
