【Question title】:How to extract the headers of the individual search items using Selenium and Python
【Posted】:2020-06-02 22:59:58
【Question】:

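I'm learning Python and trying to print the search results from python.org. I'm using Selenium.

The steps I want to perform:

  • Open python.org
  • Search for the term "arrays" (the results are displayed)
  • Print the list of search items (print("searchResults"))

My code: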

from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

#waiting to find the element before throwing error no element found
driver.implicitly_wait(10)
#driver.maximize_window()

#getting the website
driver.get("https://www.python.org/")
driver.implicitly_wait(5)
#finding element by id
driver.find_element_by_id("id-search-field").send_keys("arrays")
driver.find_element_by_id("submit").click()
print("Test Successful")

SearchResults = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div/section/form/ul")
print(SearchResults.text)

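-> This prints all of the results.

Now I want the individual result items and their headers. When I inspect a search result on the page, I get this: <a href="/dev/peps/pep-0209/">PEP 209 -- Multi-dimensional Arrays</a>

The link has no id, class, or name attribute to select on.

How can I use this to get all the headers?

【Comments】:

    Tags: python selenium xpath css-selectors webdriverwait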


    【Solution 1】:

    【Comments】:

      【Solution 2】:

      Can you try this? Use CSS selectors and break each result down into its elements, rather than using XPath:

      from selenium import webdriver
      import json
      import time
      
      driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
      
      # Getting the website
      driver.get("https://www.python.org/")
      # Finding element by id
      driver.find_element_by_id("id-search-field").send_keys("arrays")
      driver.find_element_by_id("submit").click()
      print("Test Successful")
      for elem in driver.find_elements_by_css_selector("section.main-content ul li"):
          elem_data = {
              'title': elem.find_element_by_css_selector("h3").text,
              'content': elem.find_element_by_css_selector("p").text,
              'link': elem.find_element_by_css_selector("h3 a").get_attribute('href'),
          }
          print(json.dumps(elem_data, indent=4))
          break
      # {
      #     "title": "PEP 209 -- Multi-dimensional Arrays",
      #     "content": "...arrays comprised of simple types, like numeric. How are masked-arrays implemented? Masked-arrays in Numeric 1 are implemented as a separate array class. With the ability to add new array types to Numeric 2, it is possible that masked-arrays in Numeric 2 could be implemented as a new array type instead of an array class. How are numerical errors handled (IEEE floating-point errors in particular)? It is not clear to the proposers (Paul Barrett and Travis Oliphant) what is the best or preferre...",
      #     "link": "https://www.python.org/dev/peps/pep-0209/"
      # }
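
The `find_element_by_*` helpers and the `executable_path` argument used above were removed in later Selenium 4 releases. Here is a minimal sketch of the same loop for Selenium 4, assuming version 4.6 or newer (where Selenium Manager can locate chromedriver on its own) and that the page still matches the selectors from this answer. `result_to_dict` and `search_python_org` are hypothetical names, and the sketch assumes every `li` contains an `h3 a` and a `p`:

```python
CSS = "css selector"  # the string value that selenium's By.CSS_SELECTOR holds


def result_to_dict(item):
    """Turn one search-result <li> element into a plain dict."""
    link = item.find_element(CSS, "h3 a")
    return {
        "title": link.text,
        "content": item.find_element(CSS, "p").text,
        "link": link.get_attribute("href"),
    }


def search_python_org(term):
    """Search python.org for `term` and return every result as a dict."""
    # Imported here so result_to_dict() can be reused without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumption: Selenium 4.6+ finds the driver
    try:
        driver.get("https://www.python.org/")
        driver.find_element(By.ID, "id-search-field").send_keys(term)
        driver.find_element(By.ID, "submit").click()
        items = driver.find_elements(
            By.CSS_SELECTOR, "section.main-content ul li"
        )
        return [result_to_dict(item) for item in items]
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `search_python_org("arrays")` should then return one dict per result instead of stopping after the first one. If some `li` lacks a `p`, `find_element` raises `NoSuchElementException`, so that case would need its own handling.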
      

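      【Comments】:

        【Solution 3】:

        If you want, you can use Selenium's selector methods.

        Personally, I like to write JavaScript, inject it into the page, and return the results. For this example, I would do the following:

        Have a JavaScript file with the following contents: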

        return (() => {
           const parsed_results = [];
           // Each child of the first .list-recent-events element is one search result
           const search_results = document.getElementsByClassName('list-recent-events')[0].children;
           for (let i = 0; i < search_results.length; i++) {
              const result = search_results[i];
              const link = result.getElementsByTagName('a')[0];
              const text = result.innerText;
              const title = link.innerText;
              const href = 'https://www.python.org' + link.getAttribute('href');
              parsed_results.push([title, text, href]);
           }
           return parsed_results;
        })();
        

        Once the page has finished loading, you can use it like this:

        search_results = driver.execute_script(open('path/to/file.js').read())
        

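        Then you can iterate over them as you normally would in Python. Each row is a list in the order the script pushed it, [title, text, href]:

        for r in search_results:
            title = r[0]
            text = r[1]
            href = r[2]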
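
Because each row is a positional list, it is easy to mix up the indices. A small sketch that reads the script file, runs it, and gives the fields names, assuming the script returns `[title, text, href]` rows as above. `SearchResult`, `rows_to_results`, and `run_search_script` are hypothetical names, and the sample row is written by hand from the result quoted in the question:

```python
from pathlib import Path
from typing import NamedTuple


class SearchResult(NamedTuple):
    title: str
    text: str
    href: str


def rows_to_results(rows):
    """Convert [title, text, href] lists into SearchResult tuples."""
    return [SearchResult(*row) for row in rows]


def run_search_script(driver, script_path):
    """Read the JavaScript file, run it in the page, and name the fields."""
    source = Path(script_path).read_text(encoding="utf-8")
    return rows_to_results(driver.execute_script(source))


# Hand-written sample row in the same shape the script returns.
sample = [[
    "PEP 209 -- Multi-dimensional Arrays",
    "PEP 209 -- Multi-dimensional Arrays\n...",
    "https://www.python.org/dev/peps/pep-0209/",
]]
for result in rows_to_results(sample):
    print(result.title, "->", result.href)
```

With a live driver this would be called as `run_search_script(driver, 'path/to/file.js')` after the results page has loaded. `Path.read_text` also closes the file, which the bare `open(...).read()` call leaves to the garbage collector.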
        

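        【Comments】:

          【Solution 4】:

          To print the headers of all the individual search results using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

          • Using CSS_SELECTOR: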

            print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.list-recent-events.menu li>h3>a")))])
            
          • Using XPATH:

            print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='list-recent-events menu']//li/h3/a")))])
            
          • Console output:

            ['PEP 209 -- Multi-dimensional Arrays', 'PEP 207 -- Rich Comparisons', 'PEP 335 -- Overloadable Boolean Operators', 'PEP 535 -- Rich comparison chaining', 'Python Success Stories', 'PEP 574 -- Pickle protocol 5 with out-of-band data', 'Parade of the PEPs', 'PEP 3118 -- Revising the buffer protocol', 'PEP 465 -- A dedicated infix operator for matrix multiplication', 'PEP 358 -- The "bytes" Object', 'PEP 225 -- Elementwise/Objectwise Operators', 'Highlights: Python 2.4', 'PEP 211 -- Adding A New Outer Product Operator', 'EDU-SIG: Python in Education', 'PEP 204 -- Range Literals', 'PEP 455 -- Adding a key-transforming dictionary to collections', 'PEP 252 -- Making Types Look More Like Classes', 'PEP 586 -- Literal Types', 'PEP 579 -- Refactoring C functions and methods', 'PEP 3116 -- New I/O']
            
          • Note: You have to add the following imports:

            from selenium.webdriver.support.ui import WebDriverWait
            from selenium.webdriver.common.by import By
            from selenium.webdriver.support import expected_conditions as EC
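
`get_attribute("innerHTML")` returns markup rather than rendered text, so a title containing `&` would come back as `&amp;`. Here is a sketch that wraps the CSS_SELECTOR variant in a function and decodes entities with the standard-library `html.unescape`. `clean_title` and `wait_for_titles` are hypothetical names, and the sample string is made up for illustration:

```python
import html


def clean_title(inner_html):
    """Decode HTML entities in an innerHTML string and trim whitespace."""
    return html.unescape(inner_html).strip()


def wait_for_titles(driver, timeout=5):
    """Wait until the result links are visible, then return their titles."""
    # Selenium is imported here so clean_title() works without it installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    links = WebDriverWait(driver, timeout).until(
        EC.visibility_of_all_elements_located(
            (By.CSS_SELECTOR, "ul.list-recent-events.menu li>h3>a")
        )
    )
    return [clean_title(link.get_attribute("innerHTML")) for link in links]


# Made-up input showing the entity decoding on its own.
print(clean_title("Tips &amp; Tricks\n"))  # Tips & Tricks
```

If the links do not become visible within `timeout` seconds, `WebDriverWait.until` raises `TimeoutException`. Reading each link's `.text` instead of `innerHTML` is another option when only the visible text is needed.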
            

          【Comments】:
