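【Question title】: Web scraping a website that has a button to click
【Posted】: 2019-10-04 03:58:28
【Question description】:

I am trying to scrape a website that has multiple JavaScript-rendered pages (https://openlibrary.ecampusontario.ca/catalogue/). I can get the content from the first page, but I am not sure how to make my script click the buttons that lead to the subsequent pages so that I can get their content too. Here is my script.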

import time
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import json

# The path to where you have your chrome webdriver stored:
webdriver_path = '/Users/rawlins/Downloads/chromedriver'

# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920,1080')

# Fire up the headless browser (Selenium 4 form; Selenium 3 used
# executable_path= and chrome_options= instead)
browser = webdriver.Chrome(service=Service(webdriver_path),
                           options=chrome_options)

# Load webpage
url = "https://openlibrary.ecampusontario.ca/catalogue/"
browser.get(url)

# to ensure that the page has loaded completely.
time.sleep(3)

data = [] 

# Parse the rendered HTML (the browser is closed at the end of the script)
page_soup = soup(browser.page_source, 'lxml')
containers = page_soup.find_all("div", {"class":"result-item tooltip"})

for container in containers:
    item = {}
    item['type'] = "Textbook"
    item['title'] = container.find('h4', {'class' : 'textbook-title'}).text.strip()
    item['author'] = container.find('p', {'class' : 'textbook-authors'}).text.strip()
    item['link'] = "https://openlibrary.ecampusontario.ca/catalogue/" + container.find('h4', {'class' : 'textbook-title'}).a["href"]
    item['source'] = "eCampus Ontario"
    item['base_url'] = "https://openlibrary.ecampusontario.ca/catalogue/"
    data.append(item) # add the item to the list

with open("js-webscrape-2.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

browser.quit()

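【Question discussion】:

    Tags: javascript python-3.x selenium web-scraping beautifulsoup


    【Solution 1】:

    You don't actually have to click any button. For example, to search for items matching the keyword "electricity", you can navigate to the URL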

    https://openlibrary-repo.ecampusontario.ca/rest/filtered-items?query_field%5B%5D=*&query_op%5B%5D=matches&query_val%5B%5D=(%3Fi)electricity&filters=is_not_withdrawn&offset=0&limit=10000
    

    This returns a JSON string containing the items, the first of which is:

    {"items":[{"uuid":"6af61402-b0ec-40b1-ace2-1aa674c2de9f","name":"Introduction to Electricity, Magnetism, and Circuits","handle":"123456789/579","type":"item","expand":["metadata","parentCollection","parentCollectionList","parentCommunityList","bitstreams","all"],"lastModified":"2019-05-09 15:51:06.91","parentCollection":null,"parentCollectionList":null,"parentCommunityList":null,"bitstreams":null,"withdrawn":"false","archived":"true","link":"/rest/items/6af61402-b0ec-40b1-ace2-1aa674c2de9f","metadata":null}, ...
    

    Now, to get that item, take its uuid and navigate to:

    https://openlibrary.ecampusontario.ca/catalogue/item/?id=6af61402-b0ec-40b1-ace2-1aa674c2de9f
    

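    You can carry on with any other interaction with this website in the same way (this does not work for every website, but it does work for yours).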
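
The steps above can be sketched in Python roughly as follows. This is a minimal sketch and not the answerer's code: the helper names (`build_search_url`, `items_to_records`, `search`) are made up for illustration, it uses only the standard library, and it assumes the endpoint still responds with the JSON shape quoted above. The `(?i)` prefix is copied from the answer's URL unchanged.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Both base URLs are taken from the answer above.
REST_ENDPOINT = "https://openlibrary-repo.ecampusontario.ca/rest/filtered-items"
ITEM_PAGE = "https://openlibrary.ecampusontario.ca/catalogue/item/?id="


def build_search_url(keyword, offset=0, limit=10000):
    """Rebuild the answer's query URL for an arbitrary keyword (hypothetical helper)."""
    params = [
        ("query_field[]", "*"),
        ("query_op[]", "matches"),
        ("query_val[]", "(?i)" + keyword),
        ("filters", "is_not_withdrawn"),
        ("offset", offset),
        ("limit", limit),
    ]
    # Leave '*', '(' and ')' unescaped so the result matches the URL in the answer.
    return REST_ENDPOINT + "?" + urlencode(params, safe="*()")


def items_to_records(payload):
    """Turn the decoded JSON response into title/link records (hypothetical helper)."""
    return [
        {"title": item["name"], "link": ITEM_PAGE + item["uuid"]}
        for item in payload.get("items", [])
    ]


def search(keyword):
    """Fetch and decode one search. This performs a real network request."""
    with urlopen(build_search_url(keyword)) as response:
        return items_to_records(json.load(response))
```

Calling `search("electricity")` would then return one record per item, each with the item's name and the catalogue link built from its uuid, provided the site still behaves as described in the answer.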

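    To find out which URL is requested when you click a given button or enter text (which is what I did for the URL above), you can use Fiddler.

    【Discussion】:

      【Solution 2】:

      I wrote a small script (Selenium) that may help you.

      What the script does is: "while the last page of the catalogue is not selected (a selected page link has 'selected' in its class), I scrape, then click Next".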

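      from selenium.webdriver.common.by import By

      while "selected" not in driver.find_elements(By.CSS_SELECTOR, "[id='results-pagecounter-pages'] a")[-1].get_attribute("class"):
          # your scraping here
          driver.find_element(By.CSS_SELECTOR, "[id='next-btn']").click()
      # the loop exits on the last page without scraping it, so scrape once more here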
      

      One problem you may run into with this approach is that it does not wait for the results to load, but you can work out what to do from here.
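
One way to handle the waiting is sketched below, under stated assumptions. It polls until the pager's selected label changes after each click, which is a hand-rolled stand-in for an explicit wait so that the sketch needs no imports beyond the standard library; with Selenium installed, `WebDriverWait` would be the more usual choice. The two selectors come from the answer, while the function names and the `scrape_page` callback are hypothetical. It also assumes the pager updates only once the new results are in place, which has not been verified against the site.

```python
import time

CSS = "css selector"  # the string value behind Selenium's By.CSS_SELECTOR
PAGER_LINKS = "[id='results-pagecounter-pages'] a"  # selector from the answer
NEXT_BUTTON = "[id='next-btn']"                      # selector from the answer


def selected_page(driver):
    """Return the text of the pager link whose class contains 'selected'."""
    for link in driver.find_elements(CSS, PAGER_LINKS):
        if "selected" in (link.get_attribute("class") or ""):
            return link.text
    return None


def on_last_page(driver):
    """True when the last pager link is selected (or there is no pager)."""
    links = driver.find_elements(CSS, PAGER_LINKS)
    return not links or "selected" in (links[-1].get_attribute("class") or "")


def wait_for_page_change(driver, previous, timeout=10.0, poll=0.25):
    """Poll until the selected pager label differs from `previous`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if selected_page(driver) != previous:
            return True
        time.sleep(poll)
    return False


def scrape_all_pages(driver, scrape_page, timeout=10.0):
    """Call scrape_page(driver) on every page, including the last one."""
    results = []
    while True:
        results.extend(scrape_page(driver))
        if on_last_page(driver):
            break
        previous = selected_page(driver)
        driver.find_element(CSS, NEXT_BUTTON).click()
        if not wait_for_page_change(driver, previous, timeout):
            break  # pager never updated; stop instead of looping forever
    return results
```

Here `scrape_page` would be a function that parses `driver.page_source` (for example with the BeautifulSoup loop from the question) and returns a list of items. Scraping before the last-page check means the final page is included as well.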

      Hope this helps.

      【Discussion】:
