Improve Web Scraping for Elements in a Container Using Selenium

【Question title】: Improve Web Scraping for Elements in a Container Using Selenium
【Posted】: 2019-04-16 02:52:21
【Question】:

I am using Firefox, and my code works fine except that it is very slow. I block image loading just to speed things up a little:

from selenium import webdriver

firefox_profile = webdriver.FirefoxProfile()
# 2 = block all images from loading
firefox_profile.set_preference('permissions.default.image', 2)
# Boolean preference, so pass False rather than the string 'false'
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', False)
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(firefox_profile=firefox_profile)

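But performance is still slow. I tried headless mode, but unfortunately it did not work because I get NoSuchElement errors. So is there any way to speed up Selenium web scraping? I cannot use scrapy because this is dynamic web scraping: I need to click the next button repeatedly until no clickable button exists, and I also need to click a pop-up button.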

Here is a snippet of the code:

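import time
from selenium.common.exceptions import (
    ElementClickInterceptedException, NoSuchElementException)

# One list per scraped field
a = []
b = []
c = []
d = []
e = []
f = []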
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)

    try:
        time.sleep(2)
        next_button = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next_button.click()
        time.sleep(2)
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        # No "as e" here: binding the exception to e would clobber the list e
        break

Here is an edited version, but the speed has not improved.

========================================================================
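from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

while True: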
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)

    try:
        #time.sleep(2)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"nav next taLnk ui_button primary")]')))
        next_button = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next_button.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        # No "as e" here: binding the exception to e would clobber the list e
        break

【Comments】:

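You have 17 seconds of sleep in each iteration of the while loop. Do you think that might have something to do with it?
Consider using Waits instead of multiple sleeps to reduce execution time. Also note that for web scraping you should use Selenium only as a last resort. You can try to get the data you need with direct API calls, e.g. with the requests lib.
@Guy, I suspected the same thing. I am looking for a more optimized way to scrape text in a container that has a next button and also an annoying pop-up.
A few things, though I am not sure how much difference they will make. First, if the presence of one element guarantees the others, you probably do not need all those waits inside the for loop. For example, a click gives you a new row together with all the elements in that row. Also, the wait's until call returns the element you are looking for, so there is no need to make another call to fetch it. In addition, I think that on each call you are collecting all the elements matching the given XPath again, so your lists may end up in a pattern like 1, 1, 2, 1, 2, 3.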
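The last comment can be sketched roughly as follows: wait once for the review containers, use the value that the wait returns, and read each field relative to its own container so that every review yields exactly one record. This is only a sketch under assumptions. The field names (rating, title, date, text) and the helper names are hypothetical, the XPath strings are copied from the question, and the string "xpath" is used because it is the value of Selenium's By.XPATH.

```python
XPATH = "xpath"  # same string value as selenium's By.XPATH

# Hypothetical field names mapped to XPath queries taken from the question.
FIELDS = {
    "rating": './/*[contains(@class,"ui_bubble_rating bubble_")]',
    "title": './/*[contains(@class,"noQuotes")]',
    "date": './/*[contains(@class,"ratingDate")]',
    "text": './/*[contains(@class,"partial_entry")]',
}


def wait_for_containers(driver, timeout=10):
    """Wait once for the review containers and return them.

    Selenium is imported lazily so that extract_reviews below can be
    tried without a browser installed.
    """
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    locator = (XPATH, './/*[contains(@class,"review-container")]')
    # until() returns what the condition returns: here, a list of elements.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(locator)
    )


def extract_reviews(containers):
    """Return one dict per container, querying each field exactly once."""
    reviews = []
    for item in containers:
        record = {}
        for name, query in FIELDS.items():
            # The leading "." keeps the search inside this container.
            matches = item.find_elements(XPATH, query)
            record[name] = matches[0].text if matches else None
        reviews.append(record)
    return reviews


# Usage with a live driver (not executed here):
#     containers = wait_for_containers(driver)
#     rows = extract_reviews(containers)
```

Keeping one record per container also avoids five parallel lists that can drift out of step when a review lacks one of the fields.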

Tags: python selenium firefox web-scraping scrapy

【Solution 1】:

For dynamic web pages (pages rendered or enhanced with JavaScript), I suggest you use scrapy-splash.

Not that you cannot use selenium, but for scraping purposes scrapy-splash is a better fit.

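Also, if you have to use selenium for scraping, a good idea is to use the headless option. You can also use Chrome. I have some benchmarks in which Chrome headless was sometimes faster than Firefox headless.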

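Also, rather than sleep, it is better to use WebDriverWait with expected conditions, because it waits only as long as necessary, whereas a thread sleep makes you wait the full stated time.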
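To make the difference concrete, here is a small pure-Python stand-in for an explicit wait. It is not Selenium's implementation, only a sketch of the same idea: poll a condition and return as soon as it holds. The names wait_until and element_is_ready are hypothetical, and the "element" is simulated with a timer.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.05):
    """Poll condition() until it returns a truthy value, then return it.

    Raises TimeoutError if nothing truthy shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulate an element that "appears" 0.2 s from now.
ready_at = time.monotonic() + 0.2


def element_is_ready():
    return "element" if time.monotonic() >= ready_at else None


start = time.monotonic()
found = wait_until(element_is_ready, timeout=5.0)
elapsed = time.monotonic() - start
print(f"found {found!r} after {elapsed:.2f} s; time.sleep(5) would block for 5 s")
```

The timeout is only an upper bound, so a generous value costs nothing when the page is fast, while a fixed sleep always costs its full duration.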

Edit: added as an edit while trying to answer @QHarr, because the answer is long.

This is a suggestion to evaluate scrapy-splash.

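I lean toward scrapy because its whole ecosystem is built around scraping: middleware, proxies, deployment, scheduling, scaling. So basically, if you plan to do some serious scraping, scrapy may be the better starting point. Hence the suggestion comes with that caveat.

As for speed, I cannot give an objective answer, because I have never compared and benchmarked the two from a timing perspective on a project of any size.

But I would more or less assume that if you do the same things, you will get comparable times in a serial run, since in most cases the time you spend is time waiting for responses.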

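If you scrape any considerable number of items, the speed-up you get usually comes from parallelizing requests, and also from falling back to basic HTTP requests and responses where rendering the page in a user agent is not necessary.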
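As a rough illustration of why parallel requests help when most of the time is spent waiting, the sketch below compares a serial loop with a thread pool. The function fetch_page is a hypothetical stand-in that sleeps instead of making a real HTTP request, so the numbers only show the shape of the effect.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page_number):
    """Hypothetical stand-in for one HTTP request; sleeps to mimic latency."""
    time.sleep(0.05)
    return f"page {page_number}"


def fetch_serial(pages):
    return [fetch_page(p) for p in pages]


def fetch_parallel(pages, workers=8):
    # Threads suit I/O-bound work: while one request waits, others proceed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_page, pages))  # map keeps input order


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.monotonic()
    result = func(*args)
    return result, time.monotonic() - start


pages = range(1, 9)
serial_result, serial_time = timed(fetch_serial, pages)
parallel_result, parallel_time = timed(fetch_parallel, pages)
print(f"serial {serial_time:.2f} s, parallel {parallel_time:.2f} s")
```

This pattern fits plain HTTP requests. With a browser, each worker would most likely need its own driver session rather than a shared one.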

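Also, anecdotally, some in-page actions can be performed using the underlying HTTP requests/responses. So if time is a priority, you should try to get as much done as possible through HTTP requests/responses.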
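For instance, clicking "next" until the button disappears often corresponds to requesting successive pages from some endpoint. The sketch below assumes a purely hypothetical JSON endpoint that accepts offset and limit query parameters and returns an empty list when no pages remain. The real URL and parameters, if they exist for a given site, would have to be found by inspecting the requests the page itself makes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL and decode the JSON body (plain HTTP, no browser)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def iter_reviews(base_url, fetch=fetch_json, page_size=10):
    """Yield items page by page until the hypothetical endpoint returns []."""
    offset = 0
    while True:
        query = urlencode({"offset": offset, "limit": page_size})
        page = fetch(f"{base_url}?{query}")
        if not page:  # empty page: nothing left, like a missing "next" button
            return
        yield from page
        offset += page_size
```

The fetch parameter is injectable so that the paging logic can be exercised with a fake function before it is pointed at a real server.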

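【Comments】:

Why is scrapy-splash a better fit? Is it faster?
Thanks for the suggestion, but headless mode gives me NoSuchElement errors. I tried to fix it by copying solutions that people posted on Stack Overflow, but to no avail, so I went back to non-headless mode.
Interesting. If you can provide a URL or a workable solution, I can take a look, or you could post a question so that other community members can take a look? I have seen NoSuchElementException problems in headless mode for several different reasons.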