【Question title】: Python: urllib urlopen stuck, timeout error
【Posted】: 2020-03-01 16:56:34
【Question】:

As the title says, urlopen hangs when opening the URL.

Code:

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"

uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")

uClient.close()

print(page_soup)

Problem: the script hangs at the uReq call. However, if you replace page_url with the following link, everything works.

page_url= "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

Error: timeout error

How can I open the given URL for web scraping purposes?

Edit

【Discussion】:

    Tags: python web-scraping soap request urllib


    【Solution 1】:

    Some websites require a User-Agent header before they will answer a request successfully. Import urllib.request.Request and modify your code as follows:

    from bs4 import BeautifulSoup as soup  # HTML data structure
    from urllib.request import urlopen as uReq, Request  # Web client
    
    page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"
    
    uClient = uReq(Request(page_url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'
    }))
    
    # parses html into a soup data structure to traverse html
    # as if it were a json data type.
    page_soup = soup(uClient.read(), "html.parser")
    
    uClient.close()
    
    print(page_soup)
    

    With that change the request should go through.
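Because the original symptom was a hang followed by a timeout, it can also help to pass an explicit timeout and catch the errors urlopen may raise. The following is a minimal sketch, not part of the original answer: the helper names build_request and fetch_bytes, the 10-second default, and the printed messages are illustrative assumptions, and the User-Agent string is simply the one used above.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumption: the browser-like User-Agent from the answer above.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) "
        "Gecko/20100101 Firefox/73.0"
    )
}


def build_request(url, headers=None):
    """Return a Request that carries a User-Agent header (hypothetical helper)."""
    return Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_bytes(url, timeout=10.0):
    """Fetch url and return the raw body bytes, or None if the request fails."""
    try:
        # timeout is in seconds; without it urlopen may block for a long time.
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"Server returned HTTP {err.code} for {url}")
    except URLError as err:
        # Connection-level failures, including timeouts while connecting.
        print(f"Could not reach {url}: {err.reason}")
    except socket.timeout:
        # Timeouts while reading the body are raised directly.
        print(f"Timed out after {timeout} seconds for {url}")
    return None
```

Calling fetch_bytes(page_url) returns the page bytes on success, which can then be passed to soup(..., "html.parser") as before. Returning None on failure is only one possible design; re-raising the exception may suit a larger scraper better.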

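    【Comments】:

    • Check the edit; I tried your version but got an error.
    • @Roman You need to decode the response because it contains Unicode characters: uClient.read().decode('utf-8') (note: urlopen().read() returns bytes).
    • In page_soup? page_soup = soup(uClient.read().decode('utf-8'), "html.parser") - I have tried it in several places and it does not seem to work.
    • @Roman You might want to take a look at this and see whether it solves the problem.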
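The decoding step discussed in the comments can be sketched as follows. This is an illustration under stated assumptions rather than a confirmed fix for the error in the thread: read_text is a hypothetical helper, it assumes the server may declare a charset in its Content-Type header, and it falls back to UTF-8 with undecodable bytes replaced instead of raising.

```python
def read_text(response, fallback="utf-8"):
    """Read a urlopen() response and decode its bytes to str.

    Uses the charset declared in the Content-Type header when present;
    otherwise (or if the declared name is unknown) uses the fallback.
    """
    raw = response.read()  # urlopen().read() returns bytes
    charset = response.headers.get_content_charset() or fallback
    try:
        # errors="replace" avoids UnicodeDecodeError on stray bytes.
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server declared a codec name Python does not recognise.
        return raw.decode(fallback, errors="replace")
```

With the answer's code this would be used as page_soup = soup(read_text(uClient), "html.parser"). BeautifulSoup also accepts raw bytes and attempts its own encoding detection, so explicit decoding may not be required in every case.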