【Posted】: 2021-10-17 20:00:49
【Question】:
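import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://books.toscrape.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

# Fetch the landing page and collect the link to each product page.
r = requests.get(baseurl, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
productlinks = []
Title = []
Brand = []  # holds the prices, despite the name
tra = soup.find_all('article', class_='product_pod')
for links in tra:
    # Skip the first anchor (the cover image) and keep the title link.
    for link in links.find_all('a', href=True)[1:]:
        comp = baseurl + link['href']
        productlinks.append(comp)

# Visit each product page and pull out the title and price.
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        # Product pages carry the title in <h1>; <h3> is only used on listing pages.
        title = soup.find('h1').text
    except AttributeError:  # find() returned None
        title = ' '
    Title.append(title)
    price = soup.find('p', class_="price_color").text.replace('£', '').replace(',', '').strip()
    Brand.append(price)

# Pass the Brand list, not the scalar `price`, which would repeat the last price in every row.
df = pd.DataFrame(
    {"Title": Title, "Price": Brand}
)
print(df)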
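The script above works as expected, but I also want to scrape the details of each product, such as the UPC and product type.
For example, I want to get this information from individual product pages like this one:
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
I want to scrape the UPC, product type, and so on. All the other details are in the Product Information section of the page.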
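One possible way to read such a table is to treat each row as a header/value pair and collect the pairs into a dict. The sketch below is a minimal illustration, not a tested scraper for the live site: the function name `parse_product_info`, the `SAMPLE_HTML` markup, and its values are made up for the example, and it assumes the Product Information section is a plain table whose rows each hold one `<th>` and one `<td>`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a product page's
# "Product Information" table; the values are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>UPC</th><td>abc123</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Availability</th><td>In stock</td></tr>
</table>
"""


def parse_product_info(html):
    """Return the <th>/<td> pairs found in the HTML as a dict."""
    soup = BeautifulSoup(html, "html.parser")
    info = {}
    for row in soup.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        # Skip rows that lack either cell.
        if header is not None and value is not None:
            info[header.get_text(strip=True)] = value.get_text(strip=True)
    return info


info = parse_product_info(SAMPLE_HTML)
print(info["UPC"], info["Product Type"])
```

In the loop over `productlinks`, the same function could be called on `r.content` for each page, with the resulting dicts appended to a list and passed to `pd.DataFrame` so that every table header becomes a column. This assumes each product page uses the same row layout.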
【Comments】:
-
I don't see any code that attempts to work across multiple pages. What have you tried so far in that respect?
-
Please fix your code.
Tags: python web-scraping beautifulsoup html-table html-tableextract