I got ConnectionResetError: [Errno 54] Connection reset by peer while trying to scrape

【Question title】: I got ConnectionResetError: [Errno 54] Connection reset by peer while trying to scrape
【Posted】: 2020-11-10 04:27:26
【Question description】:

Can someone help me? I get this error when I try to scrape a page with BeautifulSoup:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

myUrl = "https://www.tokopedia.com/discovery/produk-terlaris?source=homepage.top_carousel.0.38454"
#open the connection
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")

product = page_soup.findAll("div", {"class": "css-6bc98m e1uv83qc1"})
print(len(product))

This is the error:

Traceback (most recent call last):
 ....
 ....
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 54] Connection reset by peer

【Question discussion】:

Tags: python-3.x beautifulsoup

【Solution 1】:

First, you need a User-Agent header; without one, the server (correctly) assumes you are a bot and resets the connection.
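
If you would rather stay with urllib, as in the question, the header can be attached through a Request object. This is a minimal sketch, assuming a browser-like User-Agent is enough to satisfy this server, which is not guaranteed. build_request and fetch_html are illustrative helper names, not part of any library.

```python
from urllib.request import Request, urlopen

# Assumption: any realistic browser User-Agent string will do.
# This one is the same string used in this answer's requests example.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36"
)


def build_request(url: str) -> Request:
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Open the URL with the custom header and return the raw response body."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Usage (performs a real network request, so it is left commented out):
# page_html = fetch_html("https://www.tokopedia.com/discovery/produk-terlaris")
```

The timeout is an addition to the question's code; without it, a stalled connection can block indefinitely.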

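Second, you will not get anything useful from that site, because almost all of its content is rendered by JavaScript (JS). BeautifulSoup only parses the HTML the server returns and does not execute scripts, so it never sees that content.

I have fixed your code so it no longer raises the error, but, as I said, the HTML you get back contains none of the divs you want.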

import requests
from bs4 import BeautifulSoup

my_url = "https://www.tokopedia.com/discovery/produk-terlaris?source=homepage.top_carousel.0.38454"
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
}
page_soup = BeautifulSoup(requests.get(my_url, headers=headers).text, "html.parser")

product = page_soup.findAll("div", {"class": "css-6bc98m e1uv83qc1"})
print(len(product))

This prints 0.
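
Why the count is 0 can be reproduced offline. The sketch below uses the standard-library HTMLParser, which is the parser behind BeautifulSoup's "html.parser" backend, so it needs no network access and no extra packages. The two HTML snippets and the "product" class name are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count <div> start tags whose class attribute equals class_name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == self.class_name:
            self.count += 1


def count_divs(html: str, class_name: str) -> int:
    parser = DivClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# What the server sends: an empty container plus a script that would
# create the product divs once a browser runs it.
STATIC_HTML = """
<html><body>
<div id="root"></div>
<script>
  for (const name of ["A", "B"]) {
    const div = document.createElement("div");
    div.className = "product";
    div.textContent = name;
    document.getElementById("root").appendChild(div);
  }
</script>
</body></html>
"""

# What the DOM looks like after a browser has executed the script.
RENDERED_HTML = """
<html><body>
<div id="root"><div class="product">A</div><div class="product">B</div></div>
</body></html>
"""

print(count_divs(STATIC_HTML, "product"))    # 0: the script is never executed
print(count_divs(RENDERED_HTML, "product"))  # 2: the divs exist only after rendering
```

A parser reads the script as plain text, so elements the script would create are never part of the parsed tree.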

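What you can do instead is look into selenium, which drives a real browser and therefore runs the JavaScript, or inspect the page's network traffic to see whether it exposes an API endpoint you can call directly.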
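
If the network tab does reveal a JSON endpoint, the scraping step reduces to parsing JSON rather than HTML. The following is only a sketch of that idea. The URL, the payload shape, and the helper names fetch_json and extract_product_names are all hypothetical placeholders; the real endpoint and its fields have to be read off the actual traffic.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only. The real URL must be
# discovered in the browser's developer tools (network tab).
API_URL = "https://example.com/api/products"


def fetch_json(url: str, user_agent: str, timeout: float = 10.0) -> dict:
    """Request the endpoint with a browser-like User-Agent and decode the JSON body."""
    request = Request(
        url,
        headers={"User-Agent": user_agent, "Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_product_names(payload: dict) -> list:
    """Pull product names out of an assumed shape:
    {"data": {"products": [{"name": ...}, ...]}}
    """
    products = payload.get("data", {}).get("products", [])
    return [item["name"] for item in products if "name" in item]


# Offline example with a made-up payload in the assumed shape.
SAMPLE_PAYLOAD = json.loads(
    '{"data": {"products": [{"name": "Item A"}, {"name": "Item B"}, {"id": 3}]}}'
)
print(extract_product_names(SAMPLE_PAYLOAD))  # ['Item A', 'Item B']
```

When such an endpoint exists, calling it is usually lighter than driving a browser with selenium, but the endpoint may require extra headers or tokens that this sketch does not cover.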

【Discussion】: