Scrape data from a webpage after clicking a button

【Question title】: Scrape data from a webpage after clicking a button
【Posted】: 2016-11-16 20:00:13
【Question】:

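I want to scrape data from this webpage: https://www.youtube.com/playlist?list=PLMC9KNkIncKtPzgY-5rmhvj7fax8fdxoj

At the end of the page there is a "Load more" button that loads more videos.

The page initially shows only 100 videos, and I want to parse the data that appears after the "Load more" button is clicked.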

<button class="yt-uix-button yt-uix-button-size-default yt-uix-button-default load-more-button yt-uix-load-more browse-items-load-more-button" type="button" onclick=";return false;" aria-label="Load more
" data-uix-load-more-target-id="pl-load-more-destination" data-uix-load-more-href="/browse_ajax?action_continuation=1&amp;continuation=4qmFsgIuEiRWTFBMTUM5S05rSW5jS3RQemdZLTVybWh2ajdmYXg4ZmR4b2oaBkNHVSUzRA%253D%253D"><span class="yt-uix-button-content">  <span class="load-more-loading hid">
      <span class="yt-spinner">
      <span class="yt-spinner-img  yt-sprite" title="Loading icon"></span>

Loading...
  </span>

  </span>
  <span class="load-more-text">
    Load more

  </span>
</span></button>

Can I do this? I am using BeautifulSoup.
Edit: I found two solutions. One uses BeautifulSoup, the other uses Selenium.

【Comments】:

Tags: python-2.7 web-scraping beautifulsoup

【Answer 1】:

You can retrieve an element of the page from a BeautifulSoup object by calling its select() method and passing a CSS selector string for the element you are looking for.

    soup.select('span .load-more-text')

I believe this should work for what you are trying to do.
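As a minimal sketch of what select() does here, the snippet below runs the selector against a shortened copy of the button markup from the question. The `TOKEN` value is a placeholder rather than the real continuation string, and `html.parser` is used only so that no extra parser has to be installed. Note that select() only locates elements in HTML that has already been downloaded; it does not click anything.

```python
from bs4 import BeautifulSoup

# Shortened copy of the button markup from the question.
# "TOKEN" is a placeholder, not the real continuation value.
html = """
<button class="yt-uix-button load-more-button" type="button"
        data-uix-load-more-href="/browse_ajax?action_continuation=1&amp;continuation=TOKEN">
  <span class="yt-uix-button-content">
    <span class="load-more-text">
      Load more
    </span>
  </span>
</button>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element that matches the CSS selector.
matches = soup.select("span .load-more-text")
label = matches[0].get_text(strip=True)
print(label)

# Attributes of a selected element can be read like dictionary entries.
button = soup.select("button.load-more-button")[0]
href = button["data-uix-load-more-href"]
print(href)
```

The second lookup shows that the button carries the URL it would request in its `data-uix-load-more-href` attribute, so the information is available without a click.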

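【Discussion】:

You did not understand the question. I want to scrape the content of this page after the "Load more" button has been clicked.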

【Answer 2】:

I used the code below to get the video titles; you can edit it to scrape other content.

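from bs4 import BeautifulSoup
import json
import requests

# Fetch the playlist page; its HTML holds only the first batch of videos.
url = "https://www.youtube.com/playlist?list=PLMC9KNkIncKtPzgY-5rmhvj7fax8fdxoj"
html = requests.get(url).text

soup = BeautifulSoup(html, "lxml")

# Each video title sits inside an element with the class "pl-video-title".
links = soup.find_all(class_='pl-video-title')

for vid in links:
    print(vid.contents[1].string)

# Request the browse_ajax URL of the "Load more" button; the server answers with JSON.
url1 = "https://www.youtube.com/browse_ajax?action_continuation=1&continuation=4qmFsgIuEiRWTFBMTUM5S05rSW5jS3RQemdZLTVybWh2ajdmYXg4ZmR4b2oaBkNHVSUzRA%3D%3D"
html1 = requests.get(url1).text
data = json.loads(html1)

# The "content_html" field holds the HTML for the next batch of videos.
soup = BeautifulSoup(data[u'content_html'], "lxml")

links = soup.find_all(class_='pl-video-title')

for vid in links:
    print(vid.contents[1].string)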
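The code above fetches one extra batch. If a playlist needs several "Load more" steps, the same request could be repeated in a loop. The sketch below shows only the shape of such a loop: `collect_fragments`, `fetch_text`, `find_next_url`, and the `"next"` key in the fake responses are all hypothetical names made up for illustration, and how the next URL is actually found in a real response is not shown in this thread.

```python
import json

def collect_fragments(first_url, fetch_text, find_next_url, max_pages=50):
    """Follow continuation URLs and return each response's content_html."""
    fragments = []
    url = first_url
    for _ in range(max_pages):          # hard cap so a bad page cannot loop forever
        if url is None:
            break
        data = json.loads(fetch_text(url))
        fragments.append(data.get("content_html", ""))
        url = find_next_url(data)       # None means there is nothing left to load
    return fragments

# Offline demonstration with made-up responses instead of real HTTP requests.
# The "next" key is invented for this demo only.
fake_pages = {
    "/page1": json.dumps({"content_html": "<p>batch 1</p>", "next": "/page2"}),
    "/page2": json.dumps({"content_html": "<p>batch 2</p>", "next": None}),
}

fragments = collect_fragments(
    "/page1",
    fetch_text=lambda url: fake_pages[url],
    find_next_url=lambda data: data.get("next"),
)
print(fragments)
```

With requests, `fetch_text` could be something like `lambda u: requests.get(u).text`; `find_next_url` would have to be written after inspecting what the real response contains.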

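【Discussion】:

Thanks. I have a question: why do you use json.loads(html1)? Why is the data returned in JSON format? Also, when I type that URL into the browser, a file named browser_ajax is downloaded. Shouldn't it open a webpage instead?
When you request that URL, the server returns a JSON response, which is why json.loads() is needed.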
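To illustrate the point about JSON, here is a small offline sketch. The response body is a made-up, heavily shortened stand-in; only the `content_html` key comes from the answer above. It shows that json.loads() turns the JSON text into a Python dictionary, and that the HTML to parse is just a string stored under one of its keys.

```python
import json

# Made-up stand-in for the text returned by the browse_ajax URL.
# Only the "content_html" key is taken from the answer above.
sample = {"content_html": '<td class="pl-video-title"><a>Example video</a></td>'}
response_text = json.dumps(sample)

data = json.loads(response_text)   # JSON text -> Python dict
fragment = data["content_html"]    # an HTML fragment stored as a plain string

print(type(data).__name__)
print(fragment)
```

The fragment can then be passed to BeautifulSoup in the same way as the HTML of the original page.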

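【Answer 3】:

The best way to read a playlist is to use the YouTube API.

However, if for some reason you cannot use it, what you need is a scraper that can also interact with the page. Selenium is a good example: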

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://www.youtube.com/playlist?list=PLMC9KNkIncKtPzgY-5rmhvj7fax8fdxoj")  # Get the playlist page

# Click the button
load_more_button = driver.find_element(By.CLASS_NAME, "load-more-text")
load_more_button.click()

# Wait *up to* 10 seconds to make sure the page has finished loading (check that the button no longer exists)
WebDriverWait(driver,10).until(EC.invisibility_of_element_located(
    (By.CLASS_NAME, "load-more-text")))
# Get the html
html = driver.page_source

From this point on, you can parse the HTML just as you would parse the HTML returned by requests.
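As a rough sketch of that last step, the helper below takes any HTML string, whether it came from `driver.page_source` or from requests, and returns the video titles. The `video_titles` name and the sample markup are made up for illustration; the `pl-video-title` class is the one used in Answer 2, and `html.parser` is chosen only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

def video_titles(html):
    """Return the text of every element with the class "pl-video-title"."""
    soup = BeautifulSoup(html, "html.parser")
    return [cell.get_text(strip=True)
            for cell in soup.find_all(class_="pl-video-title")]

# Made-up stand-in for driver.page_source or requests.get(url).text.
sample_html = """
<table>
  <tr><td class="pl-video-title"><a href="/watch?v=a">First video</a></td></tr>
  <tr><td class="pl-video-title"><a href="/watch?v=b">Second video</a></td></tr>
</table>
"""

titles = video_titles(sample_html)
print(titles)
```

In the Selenium script above this would be called as `video_titles(html)` once the wait has finished.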

【Discussion】: