Python Beautiful Soup Etsy scraper is not gathering all items

【Question title】: Python Beautiful Soup Etsy scraper is not gathering all items
【Posted】: 2021-03-02 11:06:53
【Question】:

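I'm getting back into Python after a four-year break. I wanted to practice some web scraping with the Beautiful Soup library. Having to learn basic HTML/CSS at the same time made it a bit painful, but I eventually remembered the thrill (and frustration) of coding.

Anyway, my soup.findAll() call (find_all() in current Beautiful Soup naming) only seems to pick up 48 of the 50 listings I expect. I'm not sure whether this is due to some limit on list length, an incorrect method overload, or a bug in the lxml parser.

I deliberately used a results page sorted by price so that I could check the generated .csv file for missing items. The last two listings appear to be the ones left out, which makes me think this is a list-length problem.

Any advice is appreciated, and tips beyond my question are welcome too. Thanks!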

# Import dependencies
import csv

import requests
from bs4 import BeautifulSoup

url = 'https://www.etsy.com/search?q=knitted+Toe&explicit=1&order=price_desc'
response = requests.get(url)  # request the HTML at the URL
soup = BeautifulSoup(response.content, 'lxml')  # parse the response with the lxml parser
# Find every div with the class js-merch-stash-check-listing (one per listing) and store them in a list
containers = soup.find_all("div", {"class": "js-merch-stash-check-listing"})

print("-" * 107)
print("Search Term:\n" + '"' + soup.h1.text + '"\n')  # print the text of the <h1> tag
print("Items: ", len(containers))  # print how many listing containers were found
print("-" * 107)

# Write one CSV row per listing
filename = "EtsyProducts.csv"
with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # quotes fields that contain commas, so no manual escaping is needed
    writer.writerow(["Brand", "Product Name", "Cost", "Product Page"])

    for container in containers:
        # Shop name, price, and title live in nested elements of each listing card
        brand = container.find("div", {"class": "v2-listing-card__shop"}).p.text
        cost = container.find("span", {"class": "currency-value"}).text
        product_name = container.a.h3.text.strip()

        # The first link in the card points at the product page
        product_page = container.find("a", href=True)["href"].strip()

        print("=" * 107)
        print("Brand: " + brand)
        print("Name: " + product_name)
        print("Price: " + cost + "\n")
        print("URL: " + product_page)
        # sleep(randint(3, 10))

        writer.writerow([brand, product_name, cost, product_page])

【Question comments】:

Tags: python html css web-scraping beautifulsoup

【Solution 1】:

This is not a bug in HTML parsing; it is simply a side effect of Etsy optimizing its pages for JavaScript-enabled browsers.

from bs4 import BeautifulSoup
import requests

url = 'https://www.etsy.com/search?q=knitted+Toe&explicit=1&order=price_desc'
response = requests.get(url)  # request the HTML at the URL
# Count how often the listing class name appears in the raw response text
print(response.text.count("js-merch-stash-check-listing"))

# 48

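The initial HTML response from Etsy really does contain only 48 items. You can verify this by saving response.text to a file and opening that HTML file in a browser. You will see a grid of 12 rows and 4 columns.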
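A plain substring count can also match the class name inside scripts or other attributes, so it may be worth counting only the actual div elements in the saved HTML. The following is a minimal sketch using only the standard library; `ListingCounter`, `count_listings`, and `sample_html` are hypothetical names, and the sample markup is made up to stand in for the saved response text.

```python
from html.parser import HTMLParser


class ListingCounter(HTMLParser):
    """Count <div> start tags whose class list includes the target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; a value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.count += 1


def count_listings(html, target_class="js-merch-stash-check-listing"):
    """Return the number of div elements carrying target_class in html."""
    parser = ListingCounter(target_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up stand-in for the text saved from response.text
sample_html = """
<div class="js-merch-stash-check-listing v2-listing-card">A</div>
<div class="js-merch-stash-check-listing">B</div>
<script>var cls = "js-merch-stash-check-listing";</script>
<div class="other">C</div>
"""

print(sample_html.count("js-merch-stash-check-listing"))  # 3: substring hits, including the script
print(count_listings(sample_html))  # 2: div elements only
```

On the real response, both numbers should be compared against the grid you see when opening the saved file in a browser.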

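The page contains JS instructions that make the browser load more information via AJAX (possibly based on display size), which is how the extra entries get displayed.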

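That said, your code is all correct. If you want more results from the scrape, you may need to reverse engineer the Etsy API, since your browser uses it to render all 50 results.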
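A lighter alternative to reverse engineering the API may be to request the following result pages as well and merge what each one returns. The sketch below only covers the URL building and the merging step. It assumes the search accepts a `page` query parameter, which is not confirmed anywhere in this thread, and that each scraped listing is stored as a dict with a `url` key; `page_url` and `merge_listings` are hypothetical helper names and the listing data is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page):
    """Return base_url with its `page` query parameter set to the given number.

    The `page` parameter name is an assumption about the site's pagination.
    """
    parts = urlsplit(base_url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def merge_listings(pages):
    """Flatten per-page lists of listing dicts, dropping repeats by product URL."""
    seen = set()
    merged = []
    for listings in pages:
        for listing in listings:
            if listing["url"] not in seen:
                seen.add(listing["url"])
                merged.append(listing)
    return merged


url = 'https://www.etsy.com/search?q=knitted+Toe&explicit=1&order=price_desc'
print(page_url(url, 2))

# Made-up listings standing in for what the scraper would collect per page
page_one = [{"url": "/listing/1"}, {"url": "/listing/2"}]
page_two = [{"url": "/listing/2"}, {"url": "/listing/3"}]
print(len(merge_listings([page_one, page_two])))  # 3
```

Deduplicating by product URL keeps the merged list stable even if a listing shows up on two adjacent pages between requests.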

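【Comments】:

Ah, thank you so much. This comment was very helpful. I believe you're right about the user-agent mismatch. I was able to write the HTML to a file and see what you meant about there being only 48 items per page. I ran my code with the URL from the second page of results and wrote them to the file... and BOOM! The first two entries were the 49th and 50th items I thought had been omitted. Thanks a lot!!!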