Anchor tag has half the link, but when I click on the link, it opens a new page with the complete link

【Question title】: Anchor tag has half the link, but when I click on the link, it opens a new page with the complete link
【Posted】: 2019-12-26 07:04:44
【Question】:

To clarify what I mean, this is what the HTML looks like:

I am trying to get the href link from the highlighted part using this code.

from bs4 import BeautifulSoup as soup
from selenium import webdriver

driver = webdriver.Chrome("chromedriver.exe")
driver.get(r"http://wayback.archive.org/web/20101004060831/http://www.arcsoft.com:80/")

html = driver.page_source
page_soup = soup(html, "html.parser")

for i in page_soup.findAll("p", {"class": "impatient"}):
    print(i.a['href'])

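The code returns en-us/index.asp, and the program runs without errors. But when I click this href link in the page source, it redirects me to the website with the complete link.

This is the final URL of the website: http://web.archive.org/web/20100227101719/http://www.arcsoft.com/en-us/index.asp

Can anyone help me figure out how to get this complete URL?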

【Question comments】:

Why not prepend the base URL to a["href"]? http://web.archive.org/web/20100227101719/http://www.arcsoft.com/ + a["href"]
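A minimal sketch of that suggestion, assuming the href is relative to the archived site rather than to web.archive.org. The helper name resolve_wayback_href and the regular expression are hypothetical, not part of BeautifulSoup or Selenium. The relative link is resolved against the embedded original URL only, and the Wayback prefix is re-attached afterwards, since calling urljoin on the whole snapshot URL can collapse the "//" inside the embedded address.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: splits a Wayback Machine snapshot URL into
# the archive prefix (".../web/<timestamp>/") and the original URL.
WAYBACK_RE = re.compile(r"^(https?://[^/]+/web/\d+[a-z_]*/)(https?://.+)$")


def resolve_wayback_href(snapshot_url, href):
    """Resolve a relative href found on an archived page (illustrative helper)."""
    match = WAYBACK_RE.match(snapshot_url)
    if match is None:
        # Not a snapshot URL of the assumed shape: fall back to a plain join.
        return urljoin(snapshot_url, href)
    prefix, original_url = match.groups()
    # Join against the original site URL, then put the archive prefix back.
    return prefix + urljoin(original_url, href)


base = "http://web.archive.org/web/20100227101719/http://www.arcsoft.com/"
print(resolve_wayback_href(base, "en-us/index.asp"))
# http://web.archive.org/web/20100227101719/http://www.arcsoft.com/en-us/index.asp
```

In the question's loop this would replace print(i.a['href']) with print(resolve_wayback_href(driver.current_url, i.a['href'])).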

Tags: html python-3.x beautifulsoup href

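【Solution 1】:

You can use an explicit wait: first wait for the error message page, then wait for the final page to load. The error page has a div with the id error. The final page will always have a div with the id siteWrapper. You can also catch TimeoutException to handle the case where no error page appears.

From the documentation:

"If nothing is found in that time, a TimeoutException is thrown. By default, WebDriverWait calls the ExpectedCondition every 500 milliseconds until it returns successfully. A successful return value for the ExpectedCondition function type is a Boolean value of true, or a non-null object."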

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Get the first page
driver = webdriver.Chrome("/path/to/chromedriver")
driver.get(r"http://wayback.archive.org/web/20101004060831/http://www.arcsoft.com:80/")
try:
    # Wait for Error Page
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="error"]')))
except TimeoutException:
    # Pass if there is no error message
    pass
# Wait for new page
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="siteWrapper"]')))
print(driver.current_url)

Output

http://web.archive.org/web/20100227101719/http://www.arcsoft.com/en-us/index.asp

Now driver.page_source will return the page source of the final page.

There is no need to compute the new URL by hand and then navigate to that page.
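To illustrate the polling behaviour described in the documentation quote, here is a simplified stand-in written in plain Python. It is not Selenium's implementation: poll_until is a hypothetical name, and it raises the built-in TimeoutError rather than Selenium's TimeoutException. It only shows the idea of calling a condition at a fixed interval until it returns something truthy or the timeout expires.

```python
import time


def poll_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Illustrative sketch only; the defaults mirror the 10 second wait used
    in the answer and the 500 millisecond interval quoted from the docs.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Boolean True or any non-empty object counts as success.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)
```

With Selenium, the condition would be something like EC.presence_of_element_located(...) applied to the driver, which is what WebDriverWait(driver, 10).until(...) does for you.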

【Comments】: