Unable to get dynamically generated content from a webpage: answers

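【Question title】: Unable to get dynamically generated content from a webpage
【Posted】: 2018-12-16 09:59:25
【Question description】:

I wrote a Python script using selenium to fetch the business summary (inside a p tag) shown under the Company profile heading at the bottom right of a webpage. The page is dynamic, so I decided to use a browser simulator. I created a css selector that parses the summary when I copy the html elements directly from the page and try it locally. For some reason, the same selector does not work in the script below: it throws a timeout exception. How can I get the summary?

Here is my attempt: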

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

link = "https://in.finance.yahoo.com/quote/AAPL?p=AAPL"

def get_information(driver, url):
    driver.get(url)
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    driver.execute_script("arguments[0].scrollIntoView();", item)
    print(item.text)

if __name__ == "__main__":
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 20)
    try:
        get_information(driver,link)
    finally:
        driver.quit()

【Question discussion】:

Tags: python python-3.x selenium selenium-webdriver web-scraping

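【Solution 1】:

It seems the business summary block is not present initially and is only generated after you scroll the page down. Try the solution below: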

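from selenium.webdriver.common.keys import Keys

def get_information(driver, url):
    driver.get(url)
    # Press END on the page body so the lazily generated summary block is rendered.
    # find_element(By.TAG_NAME, ...) is used because find_element_by_tag_name
    # was removed in recent Selenium 4 releases.
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
    # By, EC and wait come from the question's script (imports and module-level WebDriverWait).
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    print(item.text)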
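
As a possible alternative to sending Keys.END, the page can also be scrolled with JavaScript through execute_script, the same WebDriver method the question already uses for scrollIntoView. The sketch below is only an illustration: the helper name scroll_to_bottom is hypothetical, and whether a JavaScript scroll triggers the lazy loading on this particular page is an assumption that has not been verified here.

```python
def scroll_to_bottom(driver):
    """Scroll the current page to the bottom and return the new vertical offset.

    `driver` is assumed to be a Selenium WebDriver (or any object that
    exposes execute_script). The helper name is hypothetical.
    """
    # Jump to the bottom of the document.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Report how far the page is scrolled now; handy for checking that
    # the scroll actually moved the viewport.
    return driver.execute_script("return window.pageYOffset;")

# Usage inside get_information (driver created as in the question):
#     driver.get(url)
#     scroll_to_bottom(driver)
#     item = wait.until(...)
```

If the returned offset stays at 0, the scroll had no effect and the wait is likely to time out again.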

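【Discussion】:

It still gives me a timeout exception, sir.
I don't understand: on my second attempt it worked fine. Thanks a lot for your unbeatable approach, as always.
Hmm.. works fine for me. It returns "Apple Inc. designs, manufactures, and markets mobile communication and media devices...". Let me know if it behaves inconsistently.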

【Solution 2】:

You have to scroll the page down twice before the element appears:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # needed for Keys.END below
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time

link = "https://in.finance.yahoo.com/quote/AAPL?p=AAPL"

def get_information(driver, url):
    driver.get(url)
    # find_element(By.TAG_NAME, ...) replaces find_element_by_tag_name,
    # which was removed in recent Selenium 4 releases.
    body = driver.find_element(By.TAG_NAME, "body")
    body.send_keys(Keys.END) # scroll page
    time.sleep(1) # small pause between
    body.send_keys(Keys.END) # one more time
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    driver.execute_script("arguments[0].scrollIntoView();", item)
    print(item.text)

if __name__ == "__main__":
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 20)
    try:
        get_information(driver,link)
    finally:
        driver.quit()

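If you scroll only once, it does not work properly for some reason (at least for me). I think it depends on the window dimensions: in a smaller window you have to scroll more than in a larger one.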
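
Since the number of scrolls needed seems to vary with the window size, a fixed count of two may not hold everywhere. One way to avoid hard-coding it is to keep scrolling until the element shows up or a limit is reached. The sketch below is a generic helper written for illustration; scroll_until_found, max_scrolls and pause are hypothetical names, not part of Selenium.

```python
import time

def scroll_until_found(scroll, find, max_scrolls=5, pause=1.0):
    """Scroll until find() returns something truthy, or give up.

    scroll: callable that scrolls the page once (hypothetical hook).
    find:   callable that returns the element(s), or a falsy value if absent.
    Returns the first truthy result of find(), or None after max_scrolls scrolls.
    """
    for _ in range(max_scrolls):
        result = find()
        if result:
            return result
        scroll()
        time.sleep(pause)  # give the page time to render the new content
    # One last check after the final scroll.
    return find() or None

# Usage with Selenium (driver, By and Keys as in the answer above):
#     body = driver.find_element(By.TAG_NAME, "body")
#     items = scroll_until_found(
#         scroll=lambda: body.send_keys(Keys.END),
#         find=lambda: driver.find_elements(
#             By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']"),
#     )
#     if items:
#         print(items[0].text)
```

find_elements returns an empty list when nothing matches, so it works as the find callable without raising an exception.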

【Discussion】:

You mean this? for _ in range(2): driver.find_element_by_tag_name("body").send_keys(Keys.END); time.sleep(2)
Yes, that is what I mean.

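【Solution 3】:

Here is a simpler approach that uses requests and works with the JSON data already present in the page. I would also recommend using requests whenever possible. It may take some extra work, but the end result is more reliable and cleaner. You could also take my example further and parse the JSON to use it directly (you would need to clean the text into valid JSON first). In my example I only use split, which runs faster but may cause problems when doing more complex operations.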

import requests
from lxml import html

url = 'https://in.finance.yahoo.com/quote/AAPL?p=AAPL'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(url, headers=headers)

tree = html.fromstring(r.text)

# Pick the inline script that assigns the page data to root.App.main
data = [e.text_content() for e in tree.iter('script') if 'root.App.main = ' in e.text_content()][0]
# Keep only the text between the longBusinessSummary key and the next field (city)
data = data.split('longBusinessSummary":"')[1]
data = data.split('","city')[0]

print(data)
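
The split approach leaves JSON escape sequences (such as \" or \u0026) in the result and relies on the next field being city. A slightly more robust variant, sketched below, matches the quoted value with a regular expression and decodes it with json.loads. The key longBusinessSummary comes from the answer above; the helper name extract_summary and the sample string are made up for illustration, and the real page layout may differ.

```python
import json
import re

# Match the JSON string value that follows the longBusinessSummary key.
# (?:[^"\\]|\\.)* consumes ordinary characters and backslash escapes,
# so an escaped quote inside the summary does not end the match early.
SUMMARY_RE = re.compile(r'"longBusinessSummary":"((?:[^"\\]|\\.)*)"')

def extract_summary(script_text):
    """Return the decoded summary from the script text, or None if absent."""
    match = SUMMARY_RE.search(script_text)
    if match is None:
        return None
    # Re-wrap the captured text in quotes and let json decode the escapes.
    return json.loads('"' + match.group(1) + '"')

# Made-up sample standing in for the script content found in the page.
sample = 'root.App.main = {"longBusinessSummary":"Makes \\"devices\\" \\u0026 services.","city":"Cupertino"};'
print(extract_summary(sample))
```

With the answer's code, this would be called as extract_summary(data) on the script text before any split is applied.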

【Discussion】: