【Question title】: Find href on web page
【Posted】: 2022-01-14 21:19:02
【Question description】:

I don't understand why the following doesn't work. I'm trying to find and click this specific link:

<a href="#/documents/2077">

on the page at: https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover

Starting from that URL, I've tried a few things, including the following:

Attempt #1

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT,"COSEWIC-Assessment-and-status-report")))

and

appraisal_html = driver.find_element_by_partial_link_text("COSEWIC-Assessment-and-status-report")

Attempt #2

soup = bs(req.text,'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

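among others. Keep in mind that this is a generalized search: the species name changes every time I run it, while everything else should stay about the same.

The second attempt comes straight from the Beautiful Soup documentation. It finds a whole bunch of links, such as those under the menu tabs, but not the href I'm looking for.

For some reason the first attempt times out without finding the partial text I entered. Maybe that's because it is text on the page rather than the href itself?

One approach I've thought of but haven't worked out is to first find the bounding box that contains the link and then look for links within that smaller search area, but I still don't understand why I can't find the right link when searching the whole page.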

【Question discussion】:

Tags: html selenium beautifulsoup href webdriverwait

【Solution 1】:

Try this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time


chrome_options = Options()
#chrome_options.add_argument("--headless")
#chrome_options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")


driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)

driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
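time.sleep(2)  # crude fixed wait for the search results to load

# Note: find_element_by_xpath was removed in Selenium 4.3;
# the equivalent there is driver.find_element("xpath", "//a[@class='card-header']")
driver.find_element_by_xpath("//a[@class='card-header']").click()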

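【Discussion】:

Ah, the find_element_by_xpath line worked! Did the xpath (the part in parentheses) come from doing "Copy XPath" on the source HTML? Thanks for your help.
You're welcome! Yes, that's one way to find an xpath (though not the recommended one). I designed this xpath manually by analyzing the HTML tags.
Can you explain why my approach didn't work? I ask mainly because I'm on the next page trying to find a similar element, and I'm not as practiced at designing XPaths by hand as you are.
I'm not saying your approach doesn't work. I'm only saying it isn't the best approach, because such XPaths sometimes fail. Take a look at this answer; it will show you what I mean. stackoverflow.com/questions/43090530/…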
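
On why the requests + Beautiful Soup attempt in the question misses the link: one likely reason is that everything after `#` in the URL is a fragment, which the browser keeps to itself and does not send to the server. A plain HTTP request therefore fetches only `index-en.html`, and the document list is presumably filled in afterwards by the page's JavaScript (this is an inference from the shape of the URL, not something confirmed in the thread). A minimal standard-library sketch of how the URL splits:

```python
from urllib.parse import urlsplit

url = (
    "https://species-registry.canada.ca/index-en.html"
    "#/documents?documentTypeId=18&sortBy=documentTypeSort"
    "&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover"
)

parts = urlsplit(url)

# Path and query are what an HTTP client actually requests from the server.
print(repr(parts.path))
print(repr(parts.query))  # empty, because the "?" sits inside the fragment
# The fragment never leaves the client; the page's own script reads it.
print(repr(parts.fragment))
```

This would explain why the parsed response only contains the static menu links, and why the Selenium-based answers, which read the page after the browser has rendered it, do find the `card-header` anchor.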

【Solution 2】:

A couple of things here:

COSEWIC-Assessment-and-status-report is not the exact text; the actual text is COSEWIC Assessment and Status Report on the Victoria’s Owl-clover

The text is not inside the A tag but within a SPAN:

<span data-v-7ee3c58f="" class="name-primary">COSEWIC Assessment and Status Report on the Victoria’s Owl-clover <em>Castilleja victoriae</em> in Canada</span>

So, to identify the clickable element, you need to induce WebDriverWait for element_to_be_clickable(), and you can use a locator strategy such as the following:

Using XPATH:

driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,"//span[contains(., 'COSEWIC Assessment and Status Report on the Victoria’s Owl-clover')]"))).click()

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
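
Since the species name changes from one search to the next, the page URL can be built instead of pasted. A small sketch, assuming the other query parameters stay exactly as in the URL above; `build_search_url` is a hypothetical helper name and not part of Selenium:

```python
from urllib.parse import quote, urlencode

BASE = "https://species-registry.canada.ca/index-en.html#/documents"


def build_search_url(species_name: str) -> str:
    """Return the registry search URL for one species name (hypothetical helper)."""
    params = {
        "documentTypeId": 18,
        "sortBy": "documentTypeSort",
        "sortDirection": "asc",
        "pageSize": 10,
        "keywords": species_name,
    }
    # quote_via=quote encodes spaces as %20 rather than "+", matching the URL above.
    return f"{BASE}?{urlencode(params, quote_via=quote)}"


print(build_search_url("Victoria's Owl-clover"))
```

The returned string can be passed straight to driver.get(). The XPath text would still need to be adapted per species, since the page title uses a typographic apostrophe (’) where the keyword uses a plain one.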

【Discussion】:

【Solution 3】:

import requests
from pprint import pp
headers = {
    "api-key": "3A1E8E87503C069448999238ABD05EE9"
}

params = {
    'api-version': '2017-11-11'
}


def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        req.params = params
        data = {
            "count": 'true',
            "filter": "((documentTypeId eq 18))",
            "orderby": "documentTypeSort asc,sortDate desc,documentCreateDate asc,documentTitleSort asc",
            "queryType": "full",
            "search": "/.*Victoria's.*/ /.*Owl-clover.*/",
            "searchMode": "all",
            "select": "id,consultationEndDate,consultationStartDate,consultationActivationStatusId,documentCreateDate,documentDescription,documentTitle,documentTypeId,species,attachments,contacts,links,finalOrDelayed",
            "skip": 0,
            "top": 10
        }
        r = req.post(url, json=data)
        ndata = {
            'filter': f"id eq '{r.json()['value'][0]['id']}'"
        }
        r = req.post(url, json=ndata)
        pp(r.json())


main('https://ecprccsarsrch.search.windows.net/indexes/docblobidxen/docs/search')

Output:

{'@odata.context': "https://ecprccsarsrch.search.windows.net/indexes('docblobidxen')/$metadata#docs(*)",
 'value': [{'@search.score': 1.0,
            'id': '2077',
            'documentTitle': 'COSEWIC Assessment and Status Report on the '
                             'Victoria’s Owl-clover <em>Castilleja '
                             'victoriae</em> in Canada',
            'documentCreateDate': '2010-09-01T13:54:36.8Z',
            'documentDescription': 'Victoria’s Owl-clover (<em>Castilleja '
                                   'victoriae</em>) is a newly described '
                                   'species, previously misidentified as  '
                                   '(<em>C. ambigua</em> ssp. '
                                   '<em>ambigua</em>). It is a small herb of '
                                   'the broomrape family with alternate,  '
                                   'hairy, lobed stem leaves and no basal '
                                   'rosette. The wider and more deeply lobed  '
                                   'upper leaves grade into the floral bracts. '
                                   'The sepals are fused into a  five-lobed '
                                   'calyx, and the petals are fused into a '
                                   '2-lipped flower 10-18 mm  long. The lower '
                                   'lip is lemon-yellow with minute white tips '
                                   'on each of the three  lobes. The upper lip '
                                   'is slightly longer than the lower lip and '
                                   'creamy white. The  fruits are brown, '
                                   '2-celled capsules that split at the tip '
                                   'when the seeds are  ripe. Each capsule '
                                   'bears 30-70 brown seeds with a sculptured '
                                   'seed coat.',
            'documentTypeId': 18,
            'consultationStartDate': None,
            'consultationEndDate': None,
            'consultationActivationStatusId': 0,
            'finalOrDelayed': 6,
            'attachments': ['{"attachmentId":"8142","attachmentTitle":"COSEWIC '
                            'Assessment and Status Report on the Victoria’s '
                            'Owl-clover <em>Castilleja victoriae</em> in '
                            'Canada","attachmentPublicationDate":"2010-09-03T00:00:00","file":"/cosewic/sr_Victoria\'s '
                            'Owl-clover_0810_e.pdf","html":"https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/victoria-owl-clover-2010.html"}'],
            'contacts': ['{"salutation":"None","title":"","id":33,"firstName":"","lastName":"","organization":"COSEWIC '
                         'Secretariat","address":"c/o Canadian Wildlife '
                         'Service\\r\\n Environment '
                         'Canada","postalCode":"K1A0H3","city":"Ottawa","province":"ON","phone":"8199384125","email":"cosewic-cosepac@ec.gc.ca","fax":"8199383984"}'],
            'links': [],
            'species': ['1084-749']}]}
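
The `search` value in the request body follows a visible pattern: each word of the species name is wrapped as `/.*word.*/`. For a generalized search, that string could be generated from the name. A rough sketch, assuming the same pattern holds for other species; `build_search` is a hypothetical helper, and it does not escape any regex special characters a name might contain:

```python
def build_search(species_name: str) -> str:
    """Wrap each whitespace-separated word as /.*word.*/ (hypothetical helper)."""
    return " ".join(f"/.*{word}.*/" for word in species_name.split())


print(build_search("Victoria's Owl-clover"))
```

The result can then be assigned to data["search"] in place of the hard-coded string.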

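【Discussion】:

【Solution 4】:

I use selenium with bs4. The URLs you want to scrape are relative, so I also convert them to absolute URLs. You can get the absolute URLs by uncommenting the commented-out lines.

PS: you just need to install the manager with pip install webdriver-manager and run the script.

Script: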

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time


url = 'https://species-registry.canada.ca/index-en.html#/documents?sortBy=documentTypeSort&sortDirection=asc&currentPage=1&pageSize=10'

cm = ChromeDriverManager().install()
driver = webdriver.Chrome(service=Service(cm))  # Selenium 4 style; uses the imported Service

driver.maximize_window()
time.sleep(8)
driver.get(url)
time.sleep(5)

base_url = 'https://species-registry.canada.ca/index-en.html'
soup = BeautifulSoup(driver.page_source, 'html.parser')
hrefs=soup.find_all('a',class_='card-header')

for href in hrefs:
    relative_url= href['href']
    print(relative_url)
    #abs_url= base_url + href['href']
    #print(abs_url)

Output as relative URLs:

#/documents/2968
#/documents/3002
#/documents/1590
#/documents/3332
#/documents/3354
#/documents/3357
#/documents/1451
#/documents/3325
#/documents/3333
#/documents/205

Output as absolute URLs:

https://species-registry.canada.ca/index-en.html#/documents/2968
https://species-registry.canada.ca/index-en.html#/documents/3002
https://species-registry.canada.ca/index-en.html#/documents/1590
https://species-registry.canada.ca/index-en.html#/documents/3332
https://species-registry.canada.ca/index-en.html#/documents/3354
https://species-registry.canada.ca/index-en.html#/documents/3357
https://species-registry.canada.ca/index-en.html#/documents/1451
https://species-registry.canada.ca/index-en.html#/documents/3325
https://species-registry.canada.ca/index-en.html#/documents/3333
https://species-registry.canada.ca/index-en.html#/documents/205
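
As an alternative to concatenating strings, the standard library's `urllib.parse.urljoin` resolves a fragment-only href against the page URL. A short sketch using two of the hrefs printed above:

```python
from urllib.parse import urljoin

base_url = "https://species-registry.canada.ca/index-en.html"
relative_urls = ["#/documents/2968", "#/documents/205"]

# urljoin keeps the base path and attaches the new fragment.
absolute_urls = [urljoin(base_url, rel) for rel in relative_urls]
for abs_url in absolute_urls:
    print(abs_url)
```

For these hrefs the result is the same as `base_url + href`; urljoin is mainly safer if the page ever mixes in hrefs that are already absolute or path-relative.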

【Discussion】: