Python: urllib urlopen stuck, timeout error

【Question title】: Python: urllib urlopen stuck, timeout error
【Posted】: 2020-03-01 16:56:34
【Question】:

As the title says, urlopen hangs while opening the URL.

Code:

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"

uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")

uClient.close()

print(page_soup)

Problem: execution hangs at the uReq call. However, if you replace page_url with the following link, everything works fine.

page_url= "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

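Error: timeout error

How can I open the given URL for web scraping purposes?

Edit

【Question comments】:

Tags: python web-scraping soap request urllib

【Answer 1】:

Some websites only respond successfully when the request carries a User-Agent header. Import urllib.request.Request and modify your code as follows: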

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq, Request  # Web client

page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"

uClient = uReq(Request(page_url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'
}))

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")

uClient.close()

print(page_soup)

With that change the request should go through.
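
If the call still hangs, it may help to pass a timeout so urlopen fails fast instead of blocking. Here is a rough sketch along those lines. The helper names build_request and fetch, the DEFAULT_UA constant, and the 10-second default are my own assumptions for illustration, not part of the answer above; only standard-library calls are used.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumed default; any browser-like User-Agent string can be substituted.
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) "
    "Gecko/20100101 Firefox/73.0"
)


def build_request(url, user_agent=DEFAULT_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        # timeout is in seconds and applies to blocking socket operations.
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"Server returned HTTP {err.code} for {url}")
    except URLError as err:
        print(f"Could not open {url}: {err.reason}")
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        print(f"Timed out after {timeout} seconds: {url}")
    return None
```

You would then call html = fetch(page_url) and only hand the result to soup(html, "html.parser") when it is not None.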

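【Comments】:

Check the edit. I tried your version, but I get an error.
@Roman You need to decode it, because the page contains Unicode characters: uClient.read().decode('utf-8') (note: urlopen().read() returns bytes)
In page_soup? page_soup = soup(uClient.read().decode('utf-8'), "html.parser") - I have tried it in a lot of places and it doesn't seem to work
@Roman You might want to take a look at this and see whether it solves the problem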
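
For the decoding step discussed in the comments, here is a minimal sketch that uses the charset declared in the response headers and falls back to UTF-8 when none is given. The helper names decode_body and read_text are hypothetical, and errors="replace" is my own choice so that a stray byte does not raise an exception. I can't say whether this fixes the error mentioned above, since the error text isn't shown.

```python
def decode_body(raw, charset=None):
    """Decode response bytes; fall back to UTF-8 if no charset is declared."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")


def read_text(response):
    """Read an open urlopen() response and return its body as str."""
    # get_content_charset() returns None when the Content-Type header
    # does not name a charset.
    charset = response.headers.get_content_charset()
    return decode_body(response.read(), charset)
```

With the code from the answer this would be page_soup = soup(read_text(uClient), "html.parser"). Note that BeautifulSoup also accepts bytes directly and tries to detect the encoding itself, so explicit decoding is not always required.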