【Question title】: Scrape table from static web site
【Posted】: 2021-05-06 15:15:30
【Question description】:
I need to scrape the table of top-level domains from iana.org.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.iana.org/domains/root/db'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='tld-table')
How can I turn it into a pandas DataFrame with the same structure as on the website (Domain, Type, TLD Manager)?
【Discussion】:
Tags:
python
python-3.x
web-scraping
beautifulsoup
python-requests
【Solution 1】:
pandas already ships with a function that reads tables from HTML, so BeautifulSoup is not needed:
import pandas as pd
url = "https://www.iana.org/domains/root/db"
# This returns a list of DataFrames with all tables in the page.
df = pd.read_html(url)[0]
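If the page ever contains more than one table, relying on index 0 can pick the wrong one. A minimal sketch of a more targeted call, assuming the table keeps the id 'tld-table' used in the question: the attrs parameter of pd.read_html restricts the match to tables with those HTML attributes. The SAMPLE_HTML string below is a made-up stand-in for the real page so the example runs offline, and pd.read_html needs an HTML parser library installed (lxml, or bs4 together with html5lib).

```python
import io

import pandas as pd

# Made-up stand-in for the IANA page: two tables, only one has the id we want.
SAMPLE_HTML = """
<table id="other">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
<table id="tld-table">
  <thead>
    <tr><th>Domain</th><th>Type</th><th>TLD Manager</th></tr>
  </thead>
  <tbody>
    <tr><td>.aaa</td><td>generic</td><td>American Automobile Association, Inc.</td></tr>
    <tr><td>.aarp</td><td>generic</td><td>AARP</td></tr>
  </tbody>
</table>
"""

# attrs filters the tables by HTML attribute, so only id="tld-table" is parsed.
tables = pd.read_html(io.StringIO(SAMPLE_HTML), attrs={"id": "tld-table"})
df = tables[0]
print(df)
```

Against the live site, the same attrs argument can be passed together with the URL in place of the StringIO object.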
【Solution 2】:
You can use pandas' pd.read_html:
import pandas as pd
URL = "https://www.iana.org/domains/root/db"
df = pd.read_html(URL)[0]
print(df.head())
    Domain     Type                            TLD Manager
0     .aaa  generic  American Automobile Association, Inc.
1    .aarp  generic                                   AARP
2  .abarth  generic         Fiat Chrysler Automobiles N.V.
3     .abb  generic                                ABB Ltd
4  .abbott  generic              Abbott Laboratories, Inc.
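For anyone who prefers to continue from the BeautifulSoup code in the question, the element returned by soup.find(id='tld-table') can also be converted by hand. The following is only a sketch under stated assumptions: table_to_dataframe is a hypothetical helper name, SAMPLE_HTML is a made-up fragment standing in for the downloaded page, and the table is assumed to have a single header row of th cells followed by rows of td cells.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment standing in for page.content from the question.
SAMPLE_HTML = """
<table id="tld-table">
  <thead>
    <tr><th>Domain</th><th>Type</th><th>TLD Manager</th></tr>
  </thead>
  <tbody>
    <tr><td><a href="/domains/root/db/aaa.html">.aaa</a></td><td>generic</td><td>American Automobile Association, Inc.</td></tr>
    <tr><td><a href="/domains/root/db/aarp.html">.aarp</a></td><td>generic</td><td>AARP</td></tr>
  </tbody>
</table>
"""


def table_to_dataframe(html, table_id):
    """Hypothetical helper: build a DataFrame from the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id=table_id)
    if table is None:
        raise ValueError(f"no element with id {table_id!r} found")

    # Column names come from the <th> cells of the header row.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]

    # Each <tr> that contains <td> cells becomes one data row.
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)

    return pd.DataFrame(rows, columns=headers)


df = table_to_dataframe(SAMPLE_HTML, "tld-table")
print(df)
```

With the question's code, page.content would be passed in place of SAMPLE_HTML. This route is longer than pd.read_html, but it gives direct access to each cell, for example to also collect the link targets.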