How do I scrape divs with multiple classes using Scrapy: answers

【Question title】: How do I scrape divs with multiple classes using Scrapy
【Posted】: 2021-08-21 22:03:30
【Question description】:

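I've looked for an answer on this site, but nothing I found has worked for me. I'm trying to scrape the Top Picks page on IMDb, and I want to get the href of the "a" tag. This is what the HTML of the page I'm trying to scrape looks like. I hovered over the element I want to scrape. I'm testing in the Scrapy shell, but I just get a list of size 0. I've tried: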

movies = response.css('a.ipc-poster-card__title').get()

movies = response.css('div[role="group"]').getall() # to get the div first so I can work my way down to the <a> tag

movies = response.css('a.ipc-poster-card__title.ipc-poster-card__title--clamp-2.ipc-poster-card__title--clickable').get()

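And several other lines as well. I tried the last one because I read online that Scrapy treats a space as a hierarchy (descendant) separator and that I should use "." to join multiple classes instead. Still, all I get is a list of size 0 when I type len(movies), or movies comes back as a None object. How do I get the href from that "a" tag?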

【Comments】:

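Hey, I think you are parsing the wrong page. In the Scrapy shell, run from scrapy.shell import open_in_browser and then open_in_browser(response); you will probably find that you are parsing an incomplete page.
@Joaquin Yes, I think you're right. This is what I typed in the shell: from scrapy.shell import open_in_browser; fetch("https://imdb.com"); top_picks = response.css('div.top-picks a::attr(href)'); top_picks_url = response.urljoin(top_picks); fetch(top_picks_url); open_in_browser(response). All I get is the Top Picks page template with the movies still loading. Any idea how to fix this?
That's because the page is rendered with JavaScript, so you need a tool that renders it (Splash, Selenium, etc.). Another option is to find the XHR request and parse its response directly.
To get help with this new problem, you can either update your question or close this one and create another.
I'll look for the XHR request, and if that doesn't work I'll update the question. Thanks a lot for your help!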
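The "find the XHR request and parse it directly" suggestion above could look roughly like the sketch below. The JSON payload, its field names ("items", "path") and the helper extract_links are all hypothetical; the real endpoint and response structure would have to be read from the browser's network tab and will almost certainly differ.

```python
import json
from urllib.parse import urljoin

# Hypothetical XHR response body. The real field names are unknown here
# and must be taken from the request seen in the browser's network tab.
raw_body = '{"items": [{"title": "Example Movie", "path": "/title/tt0000001/"}]}'


def extract_links(body, base_url):
    """Return absolute URLs for every item in the assumed JSON payload."""
    data = json.loads(body)
    # Tolerate a missing "items" key by falling back to an empty list.
    return [urljoin(base_url, item["path"]) for item in data.get("items", [])]


links = extract_links(raw_body, "https://www.imdb.com/")
print(links)
```

Inside a Scrapy callback, response.text would take the place of raw_body, so no JavaScript rendering is needed as long as the endpoint returns the data directly.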

Tags: web-scraping scrapy

【Solution 1】:

As @Joaquin pointed out in the comments, I will use Selenium and Splash for now.
