【Question title】: Extracting text from a website using selenium
【Posted】: 2020-10-31 10:46:36
【Question】:

I'm trying to find a way to extract the book's summary from a Goodreads page. I've tried Beautiful Soup and Selenium, unfortunately to no avail.

Link: https://www.goodreads.com/book/show/67896.Tao_Te_Ching?from_search=true&from_srp=true&qid=D19iQu7KWI&rank=1

Code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests
link='https://www.goodreads.com/book/show/67896.Tao_Te_Ching?from_search=true&from_srp=true&qid=D19iQu7KWI&rank=1'
driver.get(link)
Description=driver.find_element_by_xpath("//div[contains(text(),'TextContainer')]")
# The first TextContainer contains the summary of the book
book_page = requests.get(link)
soup = BeautifulSoup(book_page.text, "html.parser")
print(soup)
Container = soup.find('class', class_='leftContainer')
print(Container)

Errors:

Container is empty, plus:

NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":"//div[contains(text(),'TextContainer')]"} (Session info: chrome=83.0.4103.116)

【Comments】:

  • Try adding an explicit wait for the element.

Tags: python python-3.x selenium selenium-webdriver beautifulsoup


【Solution 1】:

You can get the description like this:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
...
driver.get("https://www.goodreads.com/book/show/67896.Tao_Te_Ching?from_search=true&from_srp=true&qid=D19iQu7KWI&rank=1")
description = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div#description span[style="display:none"]'))
)
print(description.get_attribute('textContent'))

I used a CSS selector to target the specific hidden span that contains the full description. I also used an explicit wait to give the element time to load.
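The same approach can be wrapped in a reusable function. The sketch below is only an illustration: `fetch_description` and `normalize_whitespace` are hypothetical helper names, the selector is the one from the snippet above and assumes the page markup has not changed, and `driver` is assumed to be a WebDriver you have already created. It also shows why `textContent` is read instead of `.text`: Selenium's `.text` returns only rendered text, so for a span hidden with `display:none` it typically comes back empty.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def fetch_description(driver, url, timeout=10):
    """Return the full description text from a book page.

    `driver` is an already-created Selenium WebDriver (assumption),
    for example webdriver.Chrome().
    """
    # Imported lazily so normalize_whitespace() can be used without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Wait until the hidden span exists in the DOM; it never becomes visible,
    # so a presence condition is used rather than a visibility condition.
    span = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'div#description span[style="display:none"]')
        )
    )
    # .text would likely be empty for a hidden element, so read textContent.
    return normalize_whitespace(span.get_attribute("textContent"))


# Possible usage (requires a browser and a matching driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   try:
#       print(fetch_description(driver, link))
#   finally:
#       driver.quit()
```

Calling `driver.quit()` in a `finally` block closes the browser even if the wait times out and raises `TimeoutException`.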

