Python Selenium: pull href info out of find_elements_by_partial_link_text (answers)

【Question title】: Python Selenium pull href info out of find_elements_by_partial_link_text
【Posted】: 2014-08-01 16:22:31
【Question】:

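I am pulling some data from a website. I can successfully navigate to the page that lists all of the data updated on the previous day, but now I need to iterate through all of the links and save the source of each page to a file.

Once the source is in a file, I want to use BeautifulSoup to arrange the data better so that I can parse through it.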

#learn.py
from BeautifulSoup import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url1 = 'https://odyssey.tarrantcounty.com/default.aspx'
date = '07/31/2014'
option_by_date = "6"
driver = webdriver.Firefox()
driver.get(url1)
continue_link = driver.find_element_by_partial_link_text('Case')

#follow link
continue_link.click()

driver.find_element_by_xpath("//select[@name='SearchBy']/option[text()='Date Filed']").click()
#fill in dates in form
from_date = driver.find_element_by_id("DateFiledOnAfter")
from_date.send_keys(date)
to_date = driver.find_element_by_id("DateFiledOnBefore")
to_date.send_keys(date)

submit_button = driver.find_element_by_id('SearchSubmit')
submit_button.click()

link_list = driver.find_elements_by_partial_link_text('2014')

link_list should be a list of the applicable links, but I'm not sure where to go from there.

【Comments】:

Tags: python selenium selenium-webdriver web-scraping

【Answer 1】:

To get all of the links whose href attribute starts with CaseDetail.aspx?CaseID=, find_elements_by_xpath() would help:

# get the list of links
links = [link.get_attribute('href') 
         for link in driver.find_elements_by_xpath('//td/a[starts-with(@href, "CaseDetail.aspx?CaseID=")]')]
for link in links:
    # follow the link
    driver.get(link)

    # parse the data
    print(driver.find_element_by_class_name('ssCaseDetailCaseNbr').text)

Prints:

Case No. 2014-PR01986-2
Case No. 2014-PR01988-1
Case No. 2014-PR01989-1
...
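
One detail of the loop above may be worth spelling out: the XPath tests the raw href attribute, which is relative (CaseDetail.aspx?CaseID=...), whereas get_attribute('href') hands back the URL already resolved against the current page, which is why it can be passed straight to driver.get(). The sketch below reproduces that resolution with the standard library only. The results-page address and the CaseID value are made up for illustration; neither appears in the thread.

```python
from urllib.parse import urljoin

# Hypothetical address of the search-results page (not given in the thread).
results_page = "https://odyssey.tarrantcounty.com/Search.aspx?ID=100"

# Raw attribute value as the XPath sees it; the CaseID here is invented.
raw_href = "CaseDetail.aspx?CaseID=12345"

# Resolving the relative href against the page URL mirrors what the browser does.
absolute = urljoin(results_page, raw_href)
print(absolute)
```

If the raw attribute is ever read by some other means (for example from saved page source), resolving it this way before calling driver.get() avoids navigating to a relative path.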

Note that you don't need to save the pages and parse them with BeautifulSoup. Selenium itself is quite capable of navigating and extracting the data from web pages.
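
For readers on current Selenium releases: the find_element_by_* and find_elements_by_* helpers used in this thread were deprecated in Selenium 4.0 and removed in 4.3, in favour of find_element(by, value) and find_elements(by, value). Below is a minimal sketch of the same loop in that style. The function names collect_case_urls and case_numbers are hypothetical, and the locator strings "xpath" and "class name" are written out literally so the snippet has no imports; they are the values of By.XPATH and By.CLASS_NAME from selenium.webdriver.common.by, which is the more usual way to spell them.

```python
CASE_LINK_XPATH = '//td/a[starts-with(@href, "CaseDetail.aspx?CaseID=")]'


def collect_case_urls(driver):
    """Return the href of every case-detail link on the current page."""
    # "xpath" is the value of By.XPATH in selenium.webdriver.common.by.
    anchors = driver.find_elements("xpath", CASE_LINK_XPATH)
    # Read the hrefs up front: the elements go stale once the driver navigates away.
    return [anchor.get_attribute("href") for anchor in anchors]


def case_numbers(driver):
    """Visit each case page in turn and yield the text of its case-number element."""
    for url in collect_case_urls(driver):
        driver.get(url)
        # "class name" is the value of By.CLASS_NAME.
        yield driver.find_element("class name", "ssCaseDetailCaseNbr").text
```

With a real driver that has already submitted the search form, iterating over case_numbers(driver) and printing each item should give output like the listing above.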

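【Comments】:

Thanks, it looks like we are heading in the right direction. But this would only tell me the case number, right? I think what I'm trying to do is create a list of links, then open each link and go through the text (I'll parse it with Selenium).
@Mike82 Yes, the case number is just an example; you should elaborate the "parse the data" step yourself. The idea is correct, yes. Hope that helps.
I see now that this code opens each link, thank you. I missed the driver.get(link) on first read. I'll figure out later how to open each one in a new window. Now, when on the final page, I'd like to pull out all of the names and addresses. Is that simple to do with Selenium?
@Mike82 Yes, use its powerful API, and also take a look at the Locating Elements section of the documentation. If you need help or get stuck, feel free to create a separate SO question. Thanks.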

【Answer 2】:

You can fetch web elements by tag name. If you want all of the links on the page, I would use find_elements_by_tag_name().

# collect every href first; the elements go stale once the driver navigates away
links = driver.find_elements_by_tag_name('a')
link_urls = [link.get_attribute('href') for link in links]
source_dict = dict()
for url in link_urls:
    if not url:
        continue  # skip anchors that have no href attribute
    driver.get(url)
    source = driver.page_source  # HTML source of the page that was just loaded
    source_dict[url] = source

# source_dict maps each url to the page source fetched from it
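
Since the question asks for each page's source to end up in a file, here is one possible way to write source_dict to disk. This is only a sketch: save_sources, the "pages" directory and the hash-based file names are assumptions made for illustration, not something from the answer. Hashing the URL sidesteps characters such as ? and / that are awkward in file names.

```python
import hashlib
from pathlib import Path


def save_sources(source_dict, out_dir="pages"):
    """Write each page source to its own .html file and return {url: path}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = {}
    for url, source in source_dict.items():
        # Derive a short, file-system-safe name from the URL.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
        path = out / name
        path.write_text(source, encoding="utf-8")
        saved[url] = path
    return saved
```

Calling save_sources(source_dict) after the loop would leave one file per URL, and the returned mapping records which file belongs to which URL so the pages can be parsed later.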

【Comments】: