【Posted】: 2021-03-02 11:06:53
【Question】:
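I'm just getting back into Python after a 4-year break. I wanted to practice some web scraping with the Beautiful Soup library. It has been a bit painful because I've had to learn basic HTML/CSS at the same time, but I eventually remembered the thrill (and frustration) of coding.
Anyway, my soup.findAll() call seems to retrieve only 48 of the 50 listings I expect. I'm not sure whether this is due to some limit on list length, an incorrect method overload, or a bug in the lxml parser.
I deliberately used a results page sorted by price so that I could check the generated .csv file for missing items. It seems that the last two listings are the ones being omitted, which leads me to believe this is a list-length issue.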
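A quick offline check can help separate a length limit from a selector mismatch. The following is only a minimal sketch, not the real Etsy markup: the HTML is synthetic, the classes `listing` and `other-listing` are made up for illustration, and it uses the built-in `html.parser` instead of lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Synthetic page (hypothetical markup): 48 cards carry the class the
# scraper searches for, 2 cards carry only a different class.
cards = ['<div class="listing js-merch-stash-check-listing"></div>'] * 48
cards += ['<div class="listing other-listing"></div>'] * 2
html = "<html><body>" + "".join(cards) + "</body></html>"

soup = BeautifulSoup(html, "html.parser")

# find_all (findAll is the older alias) returns every match unless a
# limit argument is passed, so a shortfall points at the selector.
matched = soup.find_all("div", {"class": "js-merch-stash-check-listing"})
all_cards = soup.find_all("div", {"class": "listing"})

print("Matched by searched class:", len(matched))
print("All listing cards:", len(all_cards))
```

If the first count is lower than the second, some cards simply lack the searched class. Whether that is what happens on the real page is an assumption that would need checking against the downloaded HTML.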
Any advice is appreciated, and any other tips beyond my question are welcome! Thanks!
# Import dependencies
from bs4 import BeautifulSoup
import requests

url = 'https://www.etsy.com/search?q=knitted+Toe&explicit=1&order=price_desc'
response = requests.get(url)  # Request the HTTP data at the URL
soup = BeautifulSoup(response.content, 'lxml')  # Parse the data with the lxml parser

# Find all divs with the class js-merch-stash-check-listing (the listing containers) and store them in a list
containers = soup.findAll("div", {"class": "js-merch-stash-check-listing"})
print("-----------------------------------------------------------------------------------------------------------")
print("Search Term:\n"+'"'+soup.h1.text+'"\n') #print <h1> tag contents text
print("Items: ",len(containers)) #print length of container
print("-----------------------------------------------------------------------------------------------------------")
#print(containers[0].a)
container=containers[0]
#CSV data input methods
filename = "EtsyProducts.csv"
f = open(filename,"w")
headers = "Brand, Product Name, Cost, Product Page\n"
f.write(headers)
for container in containers:
    brand_container = container.findAll("div", {"class": "v2-listing-card__shop"})
    brand = brand_container[0].p.text  # Text of the first <p> inside the shop div
    cost_container = container.findAll("span", {"class": "currency-value"})
    cost = cost_container[0].text
    product_name = container.a.h3.text.strip()
    urlContainer = container.find('a', href=True)
    productPage = urlContainer['href']
    print('===========================================================================================================')
    print("Brand: " + brand)
    print("Name: " + product_name)
    print("Price: " + cost + "\n")
    print("URL: " + productPage.strip())
    #sleep(randint(3,10))
    # Commas in the product name are replaced with "|" so they do not split the CSV column
    f.write(brand + "," + product_name.replace(",", "|") + "," + cost + "," + productPage + "\n")
f.close()  # Close the CSV file
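As a side note on the CSV step: replacing commas by hand only protects the product name, and a price such as "1,200.00" would still split into two columns. A minimal sketch of the same write using the standard `csv` module, which quotes such fields automatically, is shown here. The row values and the `example.com` URL are made up for illustration, and an in-memory buffer stands in for EtsyProducts.csv.

```python
import csv
import io

# Hypothetical scraped rows; both the name and the cost contain commas.
rows = [
    ("ShopA", "Knitted toe socks, wool", "1,200.00", "https://example.com/listing/1"),
    ("ShopB", "Toe warmers", "35.00", "https://example.com/listing/2"),
]

# In-memory stand-in for the output file. With a real file, open it with
# newline="" as the csv documentation recommends.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Brand", "Product Name", "Cost", "Product Page"])
writer.writerows(rows)

# Read the data back to confirm every row still has four columns.
buffer.seek(0)
parsed = list(csv.reader(buffer))
for row in parsed:
    print(len(row), row)
```

With this approach the original text of each field survives the round trip, so there is no need to substitute "|" for commas.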
【Discussion】:
Tags: python html css web-scraping beautifulsoup