Get innerHTML with XPath in Selenium with Python: answers

【Question title】: Get innerHTML with xpath in selenium with python
【Posted】: 2020-09-04 17:52:23
【Question description】:

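I'm trying to learn web scraping, but although I've gone through the examples in the documentation and several questions on Stack Overflow, I can't get my code to work.

The site I want to scrape has a list of job postings, but its structure has no pattern or fixed classes; almost every element has its own id and its own class. When I use the inspector to look up the XPath of the anchor tag whose innerHTML I want, this is what I get:

Using Firefox: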

/html/body/div[1]/div/main/div[3]/div/div/section/ul/li[1]/article/header/div/div[1]/h2/a

Using the Brave browser:

//*[@id="16542952"]/section/div/header/h2/a

Same URL, same element: the first job in the results.

URL

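I want to loop through the pages and get the text of certain elements in each job listing, such as the job title, the description, and so on.

I'm using Selenium with Python and Firefox/geckodriver.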

【Comments】:

I have checked it in Firefox and it gives the same XPath. In Brave, choose "Copy full XPath" when copying the XPath.

Tags: python selenium xpath css-selectors webdriverwait

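【Solution 1】:

To loop through the page and get the text of the job listings using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

Using CSS_SELECTOR and get_attribute():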

driver.get('https://www.catho.com.br/vagas/data-scientist/?q=data%20scientist&page=1')
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "header>h2>a")))])

Using XPATH and the text attribute:

driver.get('https://www.catho.com.br/vagas/data-scientist/?q=data%20scientist&page=1')
print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//header/h2/a")))])

Console output:

['Analista Data Science', 'Consultor de Data Science', 'Analista Big Data / Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados', 'Cientista de Dados']

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
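The snippets above read a single results page, while the question asks about looping over pages. A minimal sketch of one way to do that follows, assuming the `page` query parameter seen in the URL above selects the results page and that a page with no visible titles means the listing has ended. The names `page_url` and `scrape_titles` are hypothetical helpers, not part of Selenium.

```python
from urllib.parse import quote, urlencode

# Base listing URL taken from the answer above.
BASE_URL = "https://www.catho.com.br/vagas/data-scientist/"


def page_url(page, query="data scientist"):
    """Build the listing URL for one results page (assumes a `page` query parameter)."""
    params = urlencode({"q": query, "page": page}, quote_via=quote)
    return f"{BASE_URL}?{params}"


def scrape_titles(driver, last_page, timeout=5):
    """Collect job titles from pages 1..last_page using an existing WebDriver."""
    # Selenium is imported here so page_url() stays usable without it installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    titles = []
    for page in range(1, last_page + 1):
        driver.get(page_url(page))
        try:
            links = WebDriverWait(driver, timeout).until(
                EC.visibility_of_all_elements_located((By.XPATH, "//header/h2/a"))
            )
        except TimeoutException:
            # Assumption: no visible titles means we ran past the last page.
            break
        titles.extend(link.text for link in links)
    return titles
```

With a running driver, for example `driver = webdriver.Firefox()`, calling `scrape_titles(driver, last_page=3)` would visit the first three pages and return the collected titles as one list.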

【Discussion】:

【Solution 2】:

Once you have an element el, you can get its innerHTML like this:

el = driver.find_element('xpath', 'FULL XPATH (which FireFox gave you)')
el.get_property("innerHTML")
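A full XPath pins down exactly one element and tends to break when the page layout shifts, whereas a short relative XPath can match every job title regardless of the per-element ids. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` as a stand-in for the browser DOM. The `SAMPLE` markup is a hypothetical simplification of the page, not its real HTML, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each job has its own id, but the
# header/h2/a structure repeats for every listing.
SAMPLE = """
<ul>
  <li><article id="16542952"><header><h2><a href="#">Analista Data Science</a></h2></header></article></li>
  <li><article id="16542953"><header><h2><a href="#">Consultor de Data Science</a></h2></header></article></li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# The relative path ignores the ids and matches every title link.
titles = [a.text for a in root.findall(".//header/h2/a")]

# An id-based path matches only the single listing that carries that id.
only_first = [a.text for a in root.findall(".//article[@id='16542952']/header/h2/a")]

print(titles)      # ['Analista Data Science', 'Consultor de Data Science']
print(only_first)  # ['Analista Data Science']
```

In Selenium the same idea applies when passing `//header/h2/a` to `find_elements` instead of a copied full XPath.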

As for looping, I think you can select the parent element that holds the job elements, like this:

parent = driver.find_element('xpath', '/html/body/div[1]/article/section/ul')  # the 'ul' that holds the job 'li' tags
jobs = driver.execute_script("return arguments[0].children", parent)  # `parent` is passed to the script as arguments[0]

for job in jobs:
    # do whatever you need with each element, for example:
    print(job.get_attribute("innerHTML"))
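As an alternative to `execute_script`, the children can also be located with XPaths relative to the parent element. The following is a minimal sketch, assuming each job is an `li` directly under the parent and its title link sits somewhere below it as `h2/a`. The helper name `extract_jobs` is hypothetical, and the string `"xpath"` is the same locator value the snippet above passes to `find_element`.

```python
def extract_jobs(parent):
    """Return title text and innerHTML for each job under `parent`.

    `parent` is expected to be a Selenium WebElement (the 'ul' holding the jobs).
    """
    jobs = []
    # "./li" selects only the direct 'li' children of the parent element.
    for item in parent.find_elements("xpath", "./li"):
        # ".//h2/a" searches below this 'li' only, not the whole document.
        links = item.find_elements("xpath", ".//h2/a")
        if not links:
            continue  # skip list items without a title link (e.g. ads)
        link = links[0]
        jobs.append({
            "title": link.text.strip(),
            "html": link.get_attribute("innerHTML"),
        })
    return jobs
```

Usage would look like `extract_jobs(driver.find_element('xpath', '/html/body/div[1]/article/section/ul'))`. Starting each inner XPath with a dot matters: without it, `//h2/a` would search the entire page from every list item.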

【Discussion】: