【Question title】: Find href on web page
【Posted】: 2022-01-14 21:19:02
【Question】:

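I don't understand why the following doesn't work. I am trying to find and click this specific link:

<a href="#/documents/2077">

on the page at this URL: https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover

Starting from that URL, I have tried several approaches, including the following:

Attempt #1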

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT,"COSEWIC-Assessment-and-status-report")))

appraisal_html = driver.find_element_by_partial_link_text("COSEWIC-Assessment-and-status-report")

Attempt #2

soup = bs(req.text,'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

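Among others. Keep in mind that this is a generalized search: the species name changes every time I run it, while everything else should stay much the same.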
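
Since only the species name changes between searches, the search URL can be generated rather than pasted. The sketch below is a minimal illustration using the standard library; build_search_url and SEARCH_PREFIX are hypothetical names, and the fixed part of the URL is simply copied from the one quoted in the question.

```python
from urllib.parse import quote

# Everything before the species name, copied from the URL in the question.
SEARCH_PREFIX = (
    "https://species-registry.canada.ca/index-en.html#/documents"
    "?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc"
    "&pageSize=10&keywords="
)


def build_search_url(species_name: str) -> str:
    """Return the search URL for one species (hypothetical helper)."""
    # safe="" percent-encodes every reserved character, so an apostrophe
    # becomes %27 and a space becomes %20, as in the original URL.
    return SEARCH_PREFIX + quote(species_name, safe="")


if __name__ == "__main__":
    print(build_search_url("Victoria's Owl-clover"))
```

The result can be passed straight to driver.get() in any of the Selenium snippets in this thread.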

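The second attempt comes straight from the Beautiful Soup documentation. It finds a whole bunch of links, such as the ones under the menu tabs, but not the href I am looking for.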
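
A likely explanation for attempt #2, offered as an assumption rather than something confirmed in the thread: everything after the # in the URL is a fragment, which the client keeps to itself and never sends to the server. A plain HTTP request therefore fetches only index-en.html, and the search results are presumably filled in afterwards by JavaScript (the API-based and Selenium-based answers further down both point that way). The standard-library sketch below only shows how the URL splits; it makes no network request.

```python
from urllib.parse import urldefrag

SEARCH_URL = (
    "https://species-registry.canada.ca/index-en.html#/documents"
    "?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc"
    "&pageSize=10&keywords=Victoria%27s%20Owl-clover"
)

# urldefrag separates the part that goes over the network from the fragment.
requested, fragment = urldefrag(SEARCH_URL)

if __name__ == "__main__":
    print("sent to the server:", requested)
    print("kept in the browser:", fragment)
```

If that assumption holds, the static HTML only contains the page shell (menus and the like), which matches the links that attempt #2 printed.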

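For some reason, the first attempt times out without finding the partial text I entered. Maybe that is because it is text shown on the page rather than the href itself?

One workaround I have thought of is to first find the bounding box that contains the link and then look for the link inside that smaller search area, but I still don't understand why I can't find the right link in the whole page.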

【Question comments】:

    Tags: html selenium beautifulsoup href webdriverwait


    【Solution 1】:

    Try this:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time
    
    
    chrome_options = Options()
    #chrome_options.add_argument("--headless")
    #chrome_options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
    
    
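    # executable_path is deprecated in Selenium 4 in favor of a Service object.
    driver = webdriver.Chrome(executable_path="./chromedriver", options=chrome_options)

    driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
    # Give the page a moment to render the result card before looking for it.
    time.sleep(2)

    # Click the result card's link. On Selenium 4.3+ the equivalent call is
    # driver.find_element(By.XPATH, ...), with By imported from selenium.webdriver.common.by.
    driver.find_element_by_xpath("//a[@class='card-header']").click()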
    

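    【Comments】:

    • Ah, the find_element_by_xpath line did the trick! Did you get the XPath (the part in parentheses) by using "Copy XPath" on the source HTML? Thanks for the help.
    • You're welcome! Yes, that is one way to find an XPath (though not the recommended one). I write my XPaths by hand after analyzing the HTML tags.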
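    • Could you explain why my approach doesn't work? I ask mainly because I am now on the next page trying to find a similar element, and I don't seem to be as skilled at hand-writing XPaths as you are.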
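    • I'm not saying your method can't work. I'm only saying it isn't the best approach, because copied XPaths sometimes fail. Please take a look at this answer; it will make clear what I mean. stackoverflow.com/questions/43090530/…
    【Solution 2】:

    A couple of things here:

    • COSEWIC-Assessment-and-status-report is not the exact text; the actual text is COSEWIC Assessment and Status Report on the Victoria’s Owl-clover

    • The text is not directly inside the A tag but inside a SPAN: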

      <span data-v-7ee3c58f="" class="name-primary">COSEWIC Assessment and Status Report on the Victoria’s Owl-clover <em>Castilleja victoriae</em> in Canada</span>    
      

    So, to locate the clickable element, use WebDriverWait with element_to_be_clickable(), together with the following locator strategy:

    • Using XPATH:

      driver.get("https://species-registry.canada.ca/index-en.html#/documents?documentTypeId=18&sortBy=documentTypeSort&sortDirection=asc&pageSize=10&keywords=Victoria%27s%20Owl-clover")
      WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH,"//span[contains(., 'COSEWIC Assessment and Status Report on the Victoria’s Owl-clover')]"))).click()
      
    • Note: you have to add the following imports:

      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
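
Because the species name changes from search to search, one option is to match only the fixed start of the title instead of the full text. The sketch below builds such an XPath as a plain string; title_xpath and xpath_literal are hypothetical helpers, the name-primary class is taken from the SPAN shown above, and it is an assumption that other reports use the same "COSEWIC Assessment and Status Report" wording. It also shows why the hyphenated text from attempt #1 could never match.

```python
# Visible title copied from the SPAN quoted in this answer.
VISIBLE_TITLE = (
    "COSEWIC Assessment and Status Report on the Victoria’s Owl-clover "
    "Castilleja victoriae in Canada"
)


def xpath_literal(text: str) -> str:
    """Quote text as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote characters are present: stitch the pieces with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def title_xpath(title_fragment: str) -> str:
    """Build an XPath matching a title SPAN that contains title_fragment."""
    return (
        "//span[contains(@class, 'name-primary') and "
        f"contains(., {xpath_literal(title_fragment)})]"
    )


if __name__ == "__main__":
    # The hyphenated string from attempt #1 is not a substring of the title.
    print("COSEWIC-Assessment-and-status-report" in VISIBLE_TITLE)
    print(title_xpath("COSEWIC Assessment and Status Report"))
```

The returned string can be used as the second element of the (By.XPATH, ...) tuple in the WebDriverWait call above. Matching on the fixed prefix also sidesteps the difference between the typographic apostrophe in the page title and the plain one in the search keywords.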
      

    【Comments】:

      【Solution 3】:
      import requests
      from pprint import pp
      headers = {
          "api-key": "3A1E8E87503C069448999238ABD05EE9"
      }
      
      params = {
          'api-version': '2017-11-11'
      }
      
      
      def main(url):
          with requests.Session() as req:
              req.headers.update(headers)
              req.params = params
              data = {
                  "count": 'true',
                  "filter": "((documentTypeId eq 18))",
                  "orderby": "documentTypeSort asc,sortDate desc,documentCreateDate asc,documentTitleSort asc",
                  "queryType": "full",
                  "search": "/.*Victoria's.*/ /.*Owl-clover.*/",
                  "searchMode": "all",
                  "select": "id,consultationEndDate,consultationStartDate,consultationActivationStatusId,documentCreateDate,documentDescription,documentTitle,documentTypeId,species,attachments,contacts,links,finalOrDelayed",
                  "skip": 0,
                  "top": 10
              }
              r = req.post(url, json=data)
              ndata = {
                  'filter': f"id eq '{r.json()['value'][0]['id']}'"
              }
              r = req.post(url, json=ndata)
              pp(r.json())
      
      
      main('https://ecprccsarsrch.search.windows.net/indexes/docblobidxen/docs/search')
      

      Output:

      {'@odata.context': "https://ecprccsarsrch.search.windows.net/indexes('docblobidxen')/$metadata#docs(*)",
       'value': [{'@search.score': 1.0,
                  'id': '2077',
                  'documentTitle': 'COSEWIC Assessment and Status Report on the '
                                   'Victoria’s Owl-clover <em>Castilleja '
                                   'victoriae</em> in Canada',
                  'documentCreateDate': '2010-09-01T13:54:36.8Z',
                  'documentDescription': 'Victoria’s Owl-clover (<em>Castilleja '
                                         'victoriae</em>) is a newly described '
                                         'species, previously misidentified as  '
                                         '(<em>C. ambigua</em> ssp. '
                                         '<em>ambigua</em>). It is a small herb of '
                                         'the broomrape family with alternate,  '
                                         'hairy, lobed stem leaves and no basal '
                                         'rosette. The wider and more deeply lobed  '
                                         'upper leaves grade into the floral bracts. '
                                         'The sepals are fused into a  five-lobed '
                                         'calyx, and the petals are fused into a '
                                         '2-lipped flower 10-18 mm  long. The lower '
                                         'lip is lemon-yellow with minute white tips '
                                         'on each of the three  lobes. The upper lip '
                                         'is slightly longer than the lower lip and '
                                         'creamy white. The  fruits are brown, '
                                         '2-celled capsules that split at the tip '
                                         'when the seeds are  ripe. Each capsule '
                                         'bears 30-70 brown seeds with a sculptured '
                                         'seed coat.',
                  'documentTypeId': 18,
                  'consultationStartDate': None,
                  'consultationEndDate': None,
                  'consultationActivationStatusId': 0,
                  'finalOrDelayed': 6,
                  'attachments': ['{"attachmentId":"8142","attachmentTitle":"COSEWIC '
                                  'Assessment and Status Report on the Victoria’s '
                                  'Owl-clover <em>Castilleja victoriae</em> in '
                                  'Canada","attachmentPublicationDate":"2010-09-03T00:00:00","file":"/cosewic/sr_Victoria\'s '
                                  'Owl-clover_0810_e.pdf","html":"https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/victoria-owl-clover-2010.html"}'],
                  'contacts': ['{"salutation":"None","title":"","id":33,"firstName":"","lastName":"","organization":"COSEWIC '
                               'Secretariat","address":"c/o Canadian Wildlife '
                               'Service\\r\\n Environment '
                               'Canada","postalCode":"K1A0H3","city":"Ottawa","province":"ON","phone":"8199384125","email":"cosewic-cosepac@ec.gc.ca","fax":"8199383984"}'],
                  'links': [],
                  'species': ['1084-749']}]}
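
The script above hard-codes the species name in the search field and prints the raw record. As a rough sketch of how it might be generalized, the helpers below build the search expression from a species name and pull the useful links out of one record. build_search_expression and document_links are hypothetical names; the page URL pattern is inferred from the #/documents/2077 href in the question, and the sample record is a trimmed copy of the output above.

```python
import json

PAGE_BASE = "https://species-registry.canada.ca/index-en.html#/documents/"


def build_search_expression(species_name: str) -> str:
    """Wrap each word as /.*word.*/, the pattern used in the script above."""
    # Assumption: the words contain no regex metacharacters needing escapes.
    return " ".join(f"/.*{word}.*/" for word in species_name.split())


def document_links(record: dict) -> dict:
    """Collect the registry page URL and HTML report links for one record."""
    # Each entry in 'attachments' is itself a JSON-encoded string.
    attachments = [json.loads(raw) for raw in record.get("attachments", [])]
    return {
        "page": PAGE_BASE + record["id"],
        "html": [a["html"] for a in attachments if a.get("html")],
    }


# Trimmed copy of the record printed above.
SAMPLE_RECORD = {
    "id": "2077",
    "attachments": [
        '{"attachmentId":"8142","html":"https://www.canada.ca/en/'
        "environment-climate-change/services/species-risk-public-registry/"
        'cosewic-assessments-status-reports/victoria-owl-clover-2010.html"}'
    ],
}

if __name__ == "__main__":
    print(build_search_expression("Victoria's Owl-clover"))
    print(document_links(SAMPLE_RECORD))
```

In the script above, the first helper would replace the literal "search" value and the second would be applied to r.json()['value'][0].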
      

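      【Comments】:

        【Solution 4】:

        I use selenium together with bs4. The URLs you want to scrape are relative, so I also convert them to absolute URLs. You can get the absolute URLs by uncommenting the commented-out lines.

        P.S.: you just need to install the manager with pip install webdriver-manager and run the script.

        Script: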

        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service
        from webdriver_manager.chrome import ChromeDriverManager
        from bs4 import BeautifulSoup
        import time
        
        
        url = 'https://species-registry.canada.ca/index-en.html#/documents?sortBy=documentTypeSort&sortDirection=asc&currentPage=1&pageSize=10'
        
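        # webdriver-manager downloads a matching chromedriver and returns its path.
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

        driver.maximize_window()
        time.sleep(8)
        driver.get(url)
        # Wait for the result cards to be rendered before reading page_source.
        time.sleep(5)

        base_url = 'https://species-registry.canada.ca/index-en.html'
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        links = soup.find_all('a', class_='card-header')

        for link in links:
            relative_url = link['href']
            print(relative_url)
            # Uncomment the next two lines to print absolute URLs instead.
            #abs_url = base_url + link['href']
            #print(abs_url)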
        

        Output as relative URLs:

        #/documents/2968
        #/documents/3002
        #/documents/1590
        #/documents/3332
        #/documents/3354
        #/documents/3357
        #/documents/1451
        #/documents/3325
        #/documents/3333
        #/documents/205
        

        Output as absolute URLs:

        https://species-registry.canada.ca/index-en.html#/documents/2968
        https://species-registry.canada.ca/index-en.html#/documents/3002
        https://species-registry.canada.ca/index-en.html#/documents/1590
        https://species-registry.canada.ca/index-en.html#/documents/3332
        https://species-registry.canada.ca/index-en.html#/documents/3354
        https://species-registry.canada.ca/index-en.html#/documents/3357
        https://species-registry.canada.ca/index-en.html#/documents/1451
        https://species-registry.canada.ca/index-en.html#/documents/3325
        https://species-registry.canada.ca/index-en.html#/documents/3333
        https://species-registry.canada.ca/index-en.html#/documents/205
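
The script builds absolute URLs by string concatenation with base_url. An alternative, shown here as a small standard-library sketch, is urllib.parse.urljoin, which also works when the base is the full search URL: the old fragment is dropped and replaced by the new one. to_absolute is a hypothetical helper name.

```python
from urllib.parse import urljoin

# The page the links were scraped from, as used in the script above.
PAGE_URL = (
    "https://species-registry.canada.ca/index-en.html#/documents"
    "?sortBy=documentTypeSort&sortDirection=asc&currentPage=1&pageSize=10"
)


def to_absolute(page_url: str, href: str) -> str:
    """Resolve a fragment-only href such as '#/documents/205' against page_url."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    for href in ["#/documents/2968", "#/documents/205"]:
        print(to_absolute(PAGE_URL, href))
```

With this approach driver.current_url could serve as the base, so base_url would not need to be hard-coded.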
        

        【Comments】:
