【Question title】: Making a web scraper when the information you are trying to scrape is sometimes missing
【Posted】: 2021-05-15 16:25:27
【Description】:

When I try to scrape the brand of every graphics card on the page, it works for the first 15, but then I get TypeError: 'NoneType' object is not subscriptable

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

# the url we want to scrape and saves it to a variable
url = 'https://www.newegg.com/p/pl?d=graphics+card&RandomID=551877219014822520210210001440&PageSize=36'

# opens the url and returns a file object
uClient = uReq(url)

# reads the object and returns the html contents as a string
page_html = uClient.read()

# closes the file
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

# grabs each element with the class of item container and stores in a variable

containers = page_soup.findAll("div", {"class": "item-container"})

# scraping the brands of each graphics card from the website

for container in containers:
    brand = container.div.div.a.img["title"]
    print(brand)

【Comments】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    In this situation it may make sense to catch the Exception (a TypeError in your case):

    try:
        something_which_may_raise()
    except TypeError:  # specific Exception
        my_code_to_handle_exception()  # maybe do nothing
    except Exception as ex:  # generic Exception
        # NOTE you can collect the Exception object here to interact with it!
        print("caught an unexpected Exception: {}".format(repr(ex)))
        raise ex  # re-raise that Exception to the calling function
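
As a runnable illustration of the pattern above, here is a minimal sketch. The helper name `read_title` is hypothetical, and a plain dict stands in for a parsed tag; it is only meant to show which branch handles which failure.

```python
def read_title(img):
    """Return img["title"], or None when img itself is missing (None).

    read_title is a hypothetical helper used only for illustration.
    """
    try:
        return img["title"]
    except TypeError:
        # Specific case: None["title"] raises
        # TypeError: 'NoneType' object is not subscriptable
        return None
    except Exception as ex:
        # Anything else is unexpected here: report it, then re-raise
        print("caught an unexpected Exception: {}".format(repr(ex)))
        raise


print(read_title({"title": "MSI"}))  # MSI
print(read_title(None))              # None

try:
    read_title({})  # no "title" key, so a KeyError reaches the generic branch
except KeyError as ex:
    print("re-raised to the caller:", repr(ex))
```

The specific branch swallows only the failure that is expected; everything else still surfaces in the caller, so unrelated bugs are not hidden.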
    

    For your case, perhaps this is what you are looking for:

    ...
    for index, container in enumerate(containers):
        try:
            print(container.div.div.a.img["title"])  # brand
        except Exception as ex:
            print("couldn't read brand from container {}".format(index))
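
The loop above can be tried without hitting the network. The sketch below assumes invented markup (it is not Newegg's actual HTML) and a hypothetical helper `get_brand`. It narrows the `except` to the errors a missing element can produce: `AttributeError` when an earlier link in the chain is `None`, `TypeError` when the `img` itself is `None`, and `KeyError` when the `img` has no `title` attribute. Catching `KeyError` is an addition that goes beyond the answer.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration: items 2 and 3 lack the brand image
html = """
<div class="item-container"><div><div><a><img title="MSI"></a></div></div></div>
<div class="item-container"><div><div></div></div></div>
<div class="item-container"><div><div><a></a></div></div></div>
<div class="item-container"><div><div><a><img title="ASUS"></a></div></div></div>
"""


def get_brand(container):
    """Return the brand title, or None if any part of the chain is missing."""
    try:
        return container.div.div.a.img["title"]
    except (AttributeError, TypeError, KeyError):
        return None


page_soup = BeautifulSoup(html, "html.parser")
containers = page_soup.find_all("div", {"class": "item-container"})

brands = [get_brand(container) for container in containers]
print(brands)  # ['MSI', None, None, 'ASUS']
```

Returning `None` instead of printing inside the helper leaves it to the caller whether to skip, log, or substitute a default for the missing brands.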
    

    【Discussion】:

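    • Wow, that works, thank you very much. Question: why does using enumerate() make it work? Doesn't enumerate just produce a counter and a value? For example, choices = ['a', 'b', 'c', 'd']; for index, value in enumerate(choices): print(index, value) would produce 0 a 1 b 2 c 3 d
    • Ah, it isn't enumerate that makes it work, but rather putting the try/except inside the loop, or not re-raising the exception. And yes, that is exactly what enumerate does: docs.python.org/3/library/functions.html#enumerate
    • OK, I tried it without enumerate and it works, so I think what makes it work is your except statement, "except Exception as ex:". Could you explain why that is, and what the "as ex" part actually does?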
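    • It was almost certainly either the re-raise (my first example has raise ex, and a bare raise re-raises the most recent exception) or the try/except being wrapped around the whole loop rather than its body (by mistake?). The as ex part assigns the exception to the name ex (which can be any name) within the except block (more here, but it may be unpleasantly theory-heavy). This lets you interact with the exception object (see the Python data model).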
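
The two points from this discussion can be seen side by side in a small sketch. The list `items` is a hypothetical stand-in for the scraped elements, with `None` modelling a missing image; no scraping is involved.

```python
# Hypothetical stand-ins for scraped elements; None models a missing <img>
items = [{"title": "MSI"}, None, {"title": "ASUS"}]

# 1) try/except around the WHOLE loop: the first failure ends the loop
outside = []
message = None
try:
    for item in items:
        outside.append(item["title"])
except TypeError as ex:
    # "as ex" binds the exception object to the name ex within this block,
    # so it can be inspected, logged, or re-raised
    message = str(ex)

# 2) try/except INSIDE the loop body: a failure only skips that one item
inside = []
for index, item in enumerate(items):
    try:
        inside.append(item["title"])
    except TypeError:
        print("couldn't read brand from item {}".format(index))

print(outside)  # ['MSI']
print(inside)   # ['MSI', 'ASUS']
print(message)
```

In the first variant the item after the failure is never reached, which is why the placement of the try block matters more than whether enumerate is used.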