Python - Selenium - Webscrape Table

【Question title】: Python - Selenium - Webscrape Table
【Posted】: 2014-02-09 18:50:18
【Question description】:

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from datetime import datetime, timedelta
from tkinter import StringVar, messagebox, Entry, Tk

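chromeOps = webdriver.ChromeOptions()
# Use the public ChromeOptions API rather than the private _binary_location / _arguments attributes
chromeOps.binary_location = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
chromeOps.add_argument("--enable-internal-flash")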

browser = webdriver.Chrome("C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe", port=4445, chrome_options=chromeOps)
time.sleep(3)

browser.get('website')
elem=browser.find_element_by_id('MainForm')
eli=elem.find_element_by_xpath('//*[@id="ReportHolder"]')

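Beyond this point, the next element in the DOM is:

table xmlns:msxsl="urn:schemas-microsoft-com:xslt" width="100%"

Now, I have noticed that this prevents me from accessing the table contents directly via XPath.

So my question is: how can I interact with this table, or extract its contents?

Edit: trying to access the table, or its contents, by XPath throws a NoSuchElementException. The line that does this is: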

eli=elem.find_element_by_xpath('//*[@id="ReportHolder"]/table')

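(Note: I cannot give access to the exact HTML, because it sits in a company password-protected location.)

Has anyone run into a similar problem? Or can anyone spot something wrong with the XPath (even though it was copied directly from the inspector)?

Edit 2: simplified example XHTML, extracted from http://s1362.photobucket.com/user/superempl/media/roady2_zps3e1430d2.png.html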

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta charset="utf-8" />
        <title>XPath</title>    
    </head>
    <body>
        <form name="MainForm" method="post" action="hidden" id="MainForm">
            <div id="ReportHolder">
                <table xmlns:msxml="urn:schemas-microsoft-com:xslt" width="100%">
                    <tr><td></td></tr>
                </table>
            </div>
        </form>
    </body>
</html>

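【Question discussion】:

Please add the relevant part of the HTML to your question.
How does it prevent you from using XPath? I don't see why it would.
If you cannot provide the full document, please provide a sample on which your code fails. As it stands, the question cannot be answered.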
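
The doubt raised in the comments above can be checked offline. What follows is a minimal sketch, not part of the original thread: it parses the simplified XHTML from Edit 2 with the standard-library xml.etree.ElementTree instead of Selenium (an assumption, made so that it runs without a browser). The extra prefix declaration on the table does not change the element's name, and a path through ReportHolder still finds it. ElementTree treats the sample as XML, so the default XHTML namespace has to be mapped to a prefix (`x` here, an arbitrary choice); the XPath expressions in this thread are evaluated by the browser and use no prefix.

```python
import xml.etree.ElementTree as ET

# The simplified sample from Edit 2 of the question, unchanged.
XHTML = """<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta charset="utf-8" />
        <title>XPath</title>
    </head>
    <body>
        <form name="MainForm" method="post" action="hidden" id="MainForm">
            <div id="ReportHolder">
                <table xmlns:msxml="urn:schemas-microsoft-com:xslt" width="100%">
                    <tr><td></td></tr>
                </table>
            </div>
        </form>
    </body>
</html>"""

# Map the document's default namespace to a prefix for ElementTree lookups.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Same shape as the failing locator: the table directly under #ReportHolder.
table = root.find(".//x:div[@id='ReportHolder']/x:table", NS)

print(table.tag)     # the table is still in the XHTML namespace
print(table.attrib)  # the xmlns:msxml declaration is not an ordinary attribute
```

If the table is found here but not in the live page, the namespace declaration is an unlikely culprit, and it is worth checking when the table actually appears in the DOM.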

Tags: python xslt selenium xpath python-3.x

【Solution 1】:

This is simple: it is a timing problem.

Solution: put a time.sleep(5) before the XPath request.

browser.get('http://www.mmgt.co.uk/HTMLReport.aspx?ReportName=Fleet%20Day%20Summary%20Report&ReportType=7&CategoryID=4923&Startdate='+strDate+'&email=false')
time.sleep(5)
ex=browser.find_element_by_xpath('//*[@id="ReportHolder"]/table/tbody/tr/td')

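The XPath is requesting a reference to dynamic content.

The table is dynamic content, and it takes longer to load than the Python program takes to reach the line: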

ex=browser.find_element_by_xpath('//*[@id="ReportHolder"]/table/tbody/tr')

from the line before it:

browser.get('http://www.mmgt.co.uk/HTMLReport.aspx?ReportName=Fleet%20Day%20Summary%20Report&ReportType=7&CategoryID=4923&Startdate='+strDate+'&email=false')

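【Discussion】:

I would downvote every answer that suggests any form of thread sleep.
What happens if your element takes 5.01 seconds to appear? And if it takes 4 seconds, you are just wasting time.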
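
The objection above can be made concrete with a small polling loop. This is a sketch of the general idea only: `wait_until` is a hypothetical helper, not a Selenium API, and the default timeout and poll interval are assumptions chosen for illustration. It returns as soon as the condition holds, so a table that loads in 4 seconds costs about 4 seconds, and one that needs 5.01 seconds is still found as long as the timeout allows it.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float = 10.0,
               poll: float = 0.5) -> T:
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop waiting the moment the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)  # short pause between attempts, not one long sleep
```

With Selenium, the condition would typically be a function that looks for the element and returns None while it is missing; Selenium's own WebDriverWait works along the same lines.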

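【Solution 2】:

Instead of a thread sleep, you should use the wait classes. I have not written WebDriver code in Python, but it should look something like this: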

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0

# `browser` is the webdriver.Chrome instance created in the question
element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="ReportHolder"]/table/tbody/tr')))

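WebDriverWait with presence_of_element_located returns the element as soon as it is present in the DOM, and you can then interact with it. If the element does not appear within the 10-second timeout, a TimeoutException is raised.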
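
Once the wait has returned, the original question (how to extract the table contents) still needs a step of its own. The following is a rough sketch under stated assumptions: `table_to_rows` is a hypothetical helper, and the row and cell paths are guesses based on the simplified XHTML in the question, not on the real report page. It relies only on `find_elements(by, value)` and `.text`, which Selenium WebElements provide, and passes the locator strategy as the plain string "xpath" (the value of Selenium's `By.XPATH`) so that the helper can be imported without a browser.

```python
from typing import List

# Locator strategy name. Selenium's By.XPATH is this same plain string; it is
# spelled out here so the helper has no import-time dependency on selenium.
XPATH = "xpath"


def table_to_rows(table) -> List[List[str]]:
    """Return the visible text of every cell, row by row.

    `table` is assumed to be a Selenium WebElement for the <table>; any
    object with a compatible `find_elements(by, value)` method also works.
    """
    rows = []
    for tr in table.find_elements(XPATH, ".//tr"):
        # Header and data cells that are direct children of this row.
        cells = tr.find_elements(XPATH, "./th | ./td")
        rows.append([cell.text.strip() for cell in cells])
    return rows


# Possible usage with the wait from this answer (requires selenium and a
# running browser; `browser` is the driver from the question):
#
#   table = WebDriverWait(browser, 10).until(
#       EC.presence_of_element_located(
#           (By.XPATH, '//*[@id="ReportHolder"]/table')))
#   for row in table_to_rows(table):
#       print(row)
```

Waiting for the table itself only guarantees that the table element exists; if rows are filled in later, waiting for a row (as in the answer above) before reading the cells is the safer order.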

【Discussion】: