How to execute "javascript:__doPostBack" in a webpage to download pdf files using selenium? (Answers)

【Question title】: How to execute "javascript:__doPostBack" in a webpage to download pdf files using selenium?
【Posted】: 2021-09-10 01:17:40
【Question description】:

I have tried every solution from this very similar post, but unfortunately, although I do not get any useful error, no PDF files appear in my folder either.

To change the configuration so that selenium runs headless and downloads to the directory I want, I followed this post and this one.

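However, I see nothing happen. In addition, the behavior differs between interactive execution and running the script. When I execute interactively I see no errors, but nothing happens either. When I run the script I get a not very helpful error: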

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f"a[href*={css_selector}']"))).click()
  File "C----\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

The website in question is here.

The code I am trying to get working is:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.headless = True

uri = "http://affidavitarchive.nic.in/CANDIDATEAFFIDAVIT.aspx?YEARID=March-2017+(+GEN+)&AC_No=1&st_code=S24&constType=AC"

driver = webdriver.Firefox(options=options, executable_path=r'C:\\Users\\xxx\\geckodriver.exe')

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', r'C:\\Users\\xxx\\Downloads')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf')

# Function that reads the table in the webpage and extracts the links for the pdfs
def get_links_from_table(uri):
    html = requests.get(uri)
    soup = BeautifulSoup(html.content, 'lxml')
    table = soup.find_all('table')[-1]
    candidate_affidavit_links = []
    for link in table.find_all('a'):
        candidate_affidavit_links.append(link.get('href'))
    return candidate_affidavit_links

candidate_affidavit_links_list = get_links_from_table(uri)

driver.get(uri)

# iterate over the javascript links and try to download the pdf files
for js_link in candidate_affidavit_links_list:
    css_selector = js_link.split("'")[1]
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f"a[href*={css_selector}']"))).click()
    driver.execute_script(js_link)

【Question comments】:

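I am barely familiar with BeautifulSoup, but maybe you need some kind of wait/delay in the get_links_from_table method to let the data load, similar to what we do in Selenium? A sleep after html = requests.get(uri) and before soup = BeautifulSoup(html.content, 'lxml')? Or perhaps one line after that?
@Prophet I am not so sure. If you inspect the webpage, it is very lightweight, and the pdf links are always javascript. You can try printing candidate_affidavit_links_list and you will see that the links are fetched successfully. So I do not think that is the problem. But honestly, I really do not know.
Again, I do not know how it works with BeautifulSoup, but with Selenium any page change or load takes more time than code execution, so we have to use some kind of wait at every step where the page changes.
I call driver.get(uri) once, and in the last line you can see I have WebDriverWait(driver, 20)...... Is that a 20 second wait? Should I add one and try?
No, no need. In the for js_link in candidate_affidavit_links_list: loop you wait for some element to be clickable, but I am afraid the list of elements is empty because the page had not loaded yet when you read them. Or something like that.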

Tags: javascript python python-3.x selenium selenium-webdriver

【Solution 1】:

If all of this can be done with Selenium alone, I would try this:

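import time

driver.get(uri)
# Wait until the last link in the table is visible, i.e. the table has rendered
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "(//table//a)[last()]")))
time.sleep(1)
# Collect every link in the table; with [last()] this would return a single element
candidate_affidavit_links = driver.find_elements(By.XPATH, "//table//a")
for link in candidate_affidavit_links:
    link.click()
    time.sleep(1)  # give each download time to finish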

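Open the page, wait at least until the last link in the table is visible, add a short extra wait so that the whole table has really loaded, collect all the a (link) elements into a list, then iterate over that list, clicking each element and pausing after each click so the download can complete.
You may need a longer delay after each click so that one file finishes downloading before the next download starts.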
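As an alternative to a fixed time.sleep after each click, here is a minimal sketch that polls the download folder until a new, completed file shows up. The function name wait_for_new_download and its parameters are hypothetical, and it assumes unfinished downloads carry a .part (Firefox) or .crdownload (Chrome) suffix.

```python
import time
from pathlib import Path

# Suffixes browsers commonly give to unfinished downloads
# (Firefox: .part, Chrome: .crdownload).
PARTIAL_SUFFIXES = (".part", ".crdownload")


def wait_for_new_download(download_dir, known_files, timeout=30.0, poll=0.5):
    """Return the set of newly completed file names, or an empty set on timeout.

    `known_files` is the set of file names present before the click.
    """
    directory = Path(download_dir)
    deadline = time.monotonic() + timeout
    while True:
        names = {p.name for p in directory.iterdir() if p.is_file()}
        # Any partial file means some download is still being written.
        in_progress = any(n.endswith(PARTIAL_SUFFIXES) for n in names)
        finished = {
            n for n in names - set(known_files)
            if not n.endswith(PARTIAL_SUFFIXES)
        }
        if finished and not in_progress:
            return finished
        if time.monotonic() >= deadline:
            return set()
        time.sleep(poll)
```

In the loop above you would take a snapshot such as before = set(os.listdir(download_dir)) ahead of each link.click(), then call wait_for_new_download(download_dir, before) instead of sleeping for a fixed second.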
UPD
To suppress the pop-up that asks whether to save the file, try the following. Instead of just

profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf')

write this:

# Python booleans are True/False. Setting the same preference twice keeps only the last value,
# so the MIME types are merged into one list (application/pdf is kept from the original setting).
mime_types = ("application/pdf,application/csv,application/excel,application/vnd.ms-excel,"
              "application/vnd.msexcel,text/anytext,text/comma-separated-values,text/csv,text/plain,"
              "text/x-csv,application/x-csv,text/x-comma-separated-values,text/tab-separated-values,"
              "application/xml,text/xml,image/jpeg,application/octet-stream,data:text/csv")
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', mime_types)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.helperApps.neverAsk.openFile', mime_types)
profile.set_preference('browser.helperApps.alwaysAsk.force', False)
profile.set_preference('browser.download.useDownloadDir', True)
profile.set_preference('dom.file.createInChild', True)

I am not sure you need all of these, but I have all of them set and it works for me.
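One thing worth checking: the script in the question creates the FirefoxProfile after the driver has already been started and never passes it to the driver, so none of those preferences can take effect. Below is a rough sketch of setting the preferences on Options before the driver is created. The helper names firefox_download_prefs and make_firefox_options are hypothetical, and pdfjs.disabled is an extra preference not mentioned above, added on the assumption that Firefox's built-in PDF viewer is intercepting the files.

```python
def firefox_download_prefs(download_dir, mime_types=("application/pdf",)):
    """Build the Firefox preferences for silent downloads as a plain dict."""
    return {
        "browser.download.folderList": 2,  # 2 = use the custom directory below
        "browser.download.dir": download_dir,
        "browser.download.manager.showWhenStarting": False,
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
        # Assumption: the built-in PDF viewer is what intercepts the file,
        # so it is turned off here; this preference is not in the answer above.
        "pdfjs.disabled": True,
    }


def make_firefox_options(download_dir):
    """Apply the preferences to Options *before* the driver is created."""
    from selenium.webdriver.firefox.options import Options  # requires selenium

    options = Options()
    options.add_argument("-headless")
    for name, value in firefox_download_prefs(download_dir).items():
        options.set_preference(name, value)
    return options
```

The result would then be passed along the lines of webdriver.Firefox(options=make_firefox_options(r"C:\Users\xxx\Downloads")), so the browser starts with the download settings already in place.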

【Comments】:

I get this error: AttributeError: module 'selenium.webdriver.support.expected_conditions' has no attribute 'element_to_be_visible'. After replacing it with element_to_be_clickable I see no errors, but I do not see any files either. I tried printing link, but I get only 1 output instead of the 13 for that url.
You are right about element_to_be_visible. As for printing link, each link should hold a single element, while candidate_affidavit_links should hold 13 elements.
Actually the list has only one element. Here is the output for the list and for the link in it - [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="0709c1a4-faf1-4f17-aade-1c8467e88a9b", element="c856a48d-607d-434c-ac0b-25e5b6aa4e50")>] <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="0709c1a4-faf1-4f17-aade-1c8467e88a9b", element="c856a48d-607d-434c-ac0b-25e5b6aa4e50")>
Ah, I removed [last()] from candidate_affidavit_links and now the list has 13 elements. But why do I not see any PDFs in my Downloads folder?
I cannot see that from here... but if it works, it works. Can you visually see the files being downloaded during a test run?

【Solution 2】:

This is much simpler in Chrome:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# downloadPath should be an absolute path
driver.execute_cdp_cmd("Page.setDownloadBehavior", {"behavior": "allow", "downloadPath": "/path/to/folder"})

driver.get("http://affidavitarchive.nic.in/CANDIDATEAFFIDAVIT.aspx?YEARID=March-2017+(+GEN+)&AC_No=1&st_code=S24&constType=AC")

# Click every link whose href calls __doPostBack
for a in driver.find_elements(By.CSS_SELECTOR, 'a[href*=doPostBack]'):
    a.click()
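If you would rather call __doPostBack directly (as the title asks) than click, the target and argument can be pulled out of each href first. This is only a sketch, assuming the hrefs have the usual ASP.NET shape javascript:__doPostBack('target','argument'); the helper names and the sample target in the usage note are made up for illustration. It also builds a properly quoted attribute selector, since the one in the question, f"a[href*={css_selector}']", ends with an unmatched quote and is not a valid CSS selector.

```python
import json
import re

# Matches __doPostBack('target','argument') with single or double quotes.
_POSTBACK_RE = re.compile(
    r"""__doPostBack\(\s*(['"])(?P<target>.*?)\1\s*,\s*(['"])(?P<argument>.*?)\3\s*\)"""
)


def parse_postback(href):
    """Return (target, argument) from a __doPostBack href, or None if no match."""
    match = _POSTBACK_RE.search(href or "")
    if match is None:
        return None
    return match.group("target"), match.group("argument")


def postback_script(href):
    """Build a script string that could be passed to driver.execute_script."""
    parsed = parse_postback(href)
    if parsed is None:
        return None
    target, argument = parsed
    # json.dumps yields safely quoted JavaScript string literals.
    return "__doPostBack({}, {});".format(json.dumps(target), json.dumps(argument))


def link_selector(target):
    """Build a CSS selector for the anchor whose href contains `target`."""
    escaped = target.replace("\\", "\\\\").replace('"', '\\"')
    return 'a[href*="{}"]'.format(escaped)
```

For a hypothetical href such as javascript:__doPostBack('ctl00$Main$lnk',''), postback_script gives a string you could hand to driver.execute_script, and link_selector('ctl00$Main$lnk') gives a selector usable with By.CSS_SELECTOR.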

【Comments】:

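I tried your code and yes, it does start downloading automatically, but when it starts downloading a PDF I get "Failed - Download error".
Change the download path to the folder you want the files in (as an absolute path).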
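A small sketch of that last suggestion, assuming the "Failed - Download error" comes from a relative or missing download folder: resolve the folder to an absolute path before handing it to Page.setDownloadBehavior. The helper name download_behavior_params is hypothetical.

```python
from pathlib import Path


def download_behavior_params(folder):
    """Return Page.setDownloadBehavior parameters with an absolute downloadPath."""
    path = Path(folder).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    return {"behavior": "allow", "downloadPath": str(path)}
```

It would be used as driver.execute_cdp_cmd("Page.setDownloadBehavior", download_behavior_params("downloads")).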