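【Posted】: 2018-12-03 11:13:29
【Problem description】:
I have this code working for the first page. It needs the User-Agent header; without it the request does not work.
The problem is that the search only returns the first page. The second page has "page=2" in its URL, and so on for the following pages, so I need to scrape all of the search results, or as many as possible:
"https://www.vesselfinder.com/vessels?page=2&minDW=20000&maxDW=300000&type=4"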
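Since only the `page` query parameter differs between result pages, the URL for any page can be derived from the first-page search URL. A minimal sketch using only the standard library; the helper name `page_url` is hypothetical, and it assumes the site keeps accepting a `page` parameter as in the URL above:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page_number):
    """Return base_url with its `page` query parameter set to page_number.

    `page_url` is a hypothetical helper, not part of any library.
    """
    parts = urlsplit(base_url)
    # Keep every existing parameter except a previous `page` value.
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    params.append(("page", str(page_number)))
    return urlunsplit(parts._replace(query=urlencode(params)))


site = "https://www.vesselfinder.com/vessels?type=4&minDW=20000&maxDW=300000"
print(page_url(site, 2))
```

The order of the query parameters differs from the URL quoted above, which should not matter to the server.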
import re
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup

site = "https://www.vesselfinder.com/vessels?type=4&minDW=20000&maxDW=300000"
# The request does not work without a User-Agent header.
hdr = {'User-Agent': 'Chrome/70.0.3538.110'}
req = Request(site, headers=hdr)
page = urlopen(req)

soup = BeautifulSoup(page, 'lxml')
rows = soup.find_all('tr')
print(rows[:10])

# Collect the <td> cells of each table row; after the loop, row_td holds the last row.
for row in rows:
    row_td = row.find_all('td')
print(row_td)

# Strip the tags from the last row's cells with BeautifulSoup.
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

# Strip the tags from every row with a regex and collect the resulting strings.
clean = re.compile('<.*?>')
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
print(clean2)

df = pd.DataFrame(list_rows)
df.head(10)
# Split each comma-separated row string into separate columns.
df1 = df[0].str.split(',', expand=True)
df1.head(10)
The output is a Pandas DataFrame.
I need to scrape all of the pages and output one large DataFrame.
【Discussion】:
Tags: python pandas web-scraping beautifulsoup urllib
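One way to get there is to repeat the single-page steps for each page number and concatenate the per-page frames. The following is only a sketch under stated assumptions: the helper names `fetch`, `parse_rows` and `scrape_pages` are made up for illustration, the loop stops at the first page that yields no table cells (how the site actually responds past the last page is not verified here), and `max_pages` and the one-second delay are arbitrary choices. It uses Python's built-in `html.parser` instead of `lxml` to avoid an extra dependency, and reads cell text with `get_text` rather than a regex.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup

BASE = "https://www.vesselfinder.com/vessels"
HDR = {"User-Agent": "Chrome/70.0.3538.110"}


def fetch(url):
    """Download one page, sending the User-Agent header the site needs."""
    with urlopen(Request(url, headers=HDR)) as response:
        return response.read()


def parse_rows(html):
    """Return one list of cell texts per table row that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if cells:  # rows without <td> cells (e.g. header rows) are skipped
            rows.append(cells)
    return rows


def scrape_pages(params, fetch=fetch, max_pages=50, delay=1.0):
    """Collect rows page by page until a page yields no rows (assumed end)."""
    frames = []
    for page_number in range(1, max_pages + 1):
        url = BASE + "?" + urlencode({**params, "page": page_number})
        rows = parse_rows(fetch(url))
        if not rows:
            break
        frames.append(pd.DataFrame(rows))
        time.sleep(delay)  # pause between requests
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With the search parameters from the question, the call would look like `df = scrape_pages({"type": 4, "minDW": 20000, "maxDW": 300000})`. Passing `fetch` as an argument keeps the paging logic testable without network access.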