Selenium get captcha image in browser: answers

【Question title】: Selenium get captcha image in browser
【Posted】: 2020-04-01 23:37:54
【Question description】:

I am completely new to Selenium and web scraping, and I have run into a captcha problem.

I am trying to follow the procedure discussed in this link:

Selenium downloading different captcha image than the one in browser

But it is not going well.

First problem

My first problem is with the XPath selector. First, I tried this code:

from selenium import webdriver
import urllib.request


driver = webdriver.Chrome()
driver.get("http://sistemas.cvm.gov.br/?fundosreg")

# Change frame.
driver.switch_to.frame("Main")


# Download image/captcha.
img = driver.find_element_by_xpath(".//*img[2]")
src = img.get_attribute('src')
urllib.request.urlretrieve(src, "captcha.jpeg")

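Basically, I only changed the URL. But I do not know whether the XPath is written correctly, or how to write it. Putting [2] inside the quotes looked right, and that is how it is done in the link I mentioned, but when I tried to reproduce it with response.xpath in a scrapy shell session it did not work: response.xpath(".//img[2]"). It had to be written like this: response.xpath(".//img")[2]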
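The difference between the two forms can be reproduced in isolation. The sketch below is only an illustration: it uses Python's standard-library xml.etree.ElementTree instead of Selenium or scrapy, and the markup is a made-up stand-in, not the real page. It assumes each image sits alone in its own table cell, which may or may not match the actual site.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: three images, each alone in its own table cell.
doc = ET.fromstring(
    "<table><tr>"
    "<td><img src='logo.gif'/></td>"
    "<td><img src='spacer.gif'/></td>"
    "<td><img src='captcha.asp'/></td>"
    "</tr></table>"
)

# Inside the path, [2] means "the second img child of its own parent".
# Every img here is the only img in its td, so nothing matches.
inside = doc.findall(".//img[2]")

# Indexing the Python list picks from all matches in document order;
# index 2 is the third image because Python counts from 0.
outside = doc.findall(".//img")[2]

print(len(inside))
print(outside.get("src"))
```

In a full XPath 1.0 engine, such as the one a browser exposes to Selenium, the parenthesized form (.//img)[2] applies the index to the whole result set and counts from 1, so it would pick the second image on the page. Note also that ".//*img[2]" is not valid XPath, because * and img are both name tests and cannot be joined in one step.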

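The captcha on my page is hard to target because the corresponding img tag has no id, class, or any other attribute to select on. Also, its source is an .asp URL, and I do not know what I can do with that.

Second problem

Then I tried this code, which also shows up in other similar searches: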

from PIL import Image
from selenium import webdriver

def get_captcha(driver, element, path):
    # now that we have the preliminary stuff out of the way time to get that image :D
    location = element.location
    size = element.size
    # saves screenshot of entire page
    driver.save_screenshot(path)

    # uses PIL library to open image in memory
    image = Image.open(path)

    left = location['x']
    top = location['y'] + 140
    right = location['x'] + size['width']
    bottom = location['y'] + size['height'] + 140

    image = image.crop((left, top, right, bottom))  # defines crop points
    image.save(path, 'png')  # saves new cropped image


driver = webdriver.Chrome()
driver.get("http://preco.anp.gov.br/include/Resumo_Por_Estado_Index.asp")

# change frame
driver.switch_to.frame("Main")

# download image/captcha
#img = driver.find_element_by_xpath(".//*[@id='trRandom3']/td[2]/img")
img = driver.find_element_by_xpath(".//*img[2]")
get_captcha(driver, img, "captcha.png")

Again I ran into the XPath problem, but there is also another one:

Traceback (most recent call last):
  File "seletest2.py", line 27, in <module>
    driver.switch_to.frame("Main")
  File "/home/seiji/crawlers_env/lib/python3.6/site-packages/selenium/webdriver/remote/switch_to.py", line 87, in frame
    raise NoSuchFrameException(frame_reference)
selenium.common.exceptions.NoSuchFrameException: Message: Main

The problem is in this line: driver.switch_to.frame("Main"). What does it mean?

Thanks!


Tags: python selenium captcha scrape

【Solution 1】:

Use WebDriverWait to wait for the element, and use the frame_to_be_available_and_switch_to_it expected condition to switch into the iframe.

Try the code below:

driver.get("http://sistemas.cvm.gov.br/?fundosreg")
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'Main')))
img = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#Table1 img')))
src = img.get_attribute('src')
urllib.request.urlretrieve(src, "captcha.jpeg")

You need the following imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

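However, on your other URL, http://preco.anp.gov.br/include/Resumo_Por_Estado_Index.asp, the captcha element is not inside an iframe, so there is no "Main" frame to switch to and switch_to.frame("Main") raises NoSuchFrameException. This is the selector for that page: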

By.CSS_SELECTOR : table img

Use it with the code above.
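For intuition about what an explicit wait does, here is a minimal sketch of a polling loop in plain Python. It is not Selenium's implementation: wait_until and frame_ready are hypothetical names made up for this illustration, and the "frame" is simulated by a counter. Selenium's WebDriverWait follows the same general idea, retrying the condition at a fixed interval until it returns something truthy or the timeout expires.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for "the frame is not there yet": succeeds on the third call.
attempts = {"count": 0}


def frame_ready():
    attempts["count"] += 1
    return "Main" if attempts["count"] >= 3 else None


found = wait_until(frame_ready, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

Calling driver.switch_to.frame("Main") directly is a single attempt, so it fails if the frame has not loaded yet; wrapping the lookup in a wait gives the page time to finish loading before giving up.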
