【Posted】:2020-11-28 20:22:36
【Question】:
I am scraping Yahoo Finance news with the code below.
import scrapy


class YfinNewsSpider(scrapy.Spider):
    name = 'yfin_news_spider'
    custom_settings = {'DOWNLOAD_DELAY': '0.5', 'COOKIES_ENABLED': True, 'COOKIES_DEBUG': True}

    def __init__(self, month, year, **kwargs):
        self.start_urls = ['https://finance.yahoo.com/sitemap/2020_03_all']
        self.allowed_domains = ['finance.yahoo.com']
        super().__init__(**kwargs)

    def parse(self, response):
        # Each sitemap entry is a list item that links to one news article.
        all_news_urls = response.xpath('//ul/li[@class="List(n) Py(3px) Lh(1.2)"]')
        for news in all_news_urls:
            news_url = news.xpath('.//a[@class="Td(n) Td(u):h C($c-fuji-grey-k)"]/@href').extract_first()
            yield scrapy.Request(news_url, callback=self.parse_news, dont_filter=True)

    def parse_news(self, response):
        # Extract the title, body paragraphs and timestamp from the article page.
        news_url = str(response.url)
        title = response.xpath('//title/text()').extract_first()
        paragraphs = response.xpath('//div[@class="caas-body"]/p/text()').extract()
        date_time = response.xpath('//div[@class="caas-attr-time-style"]/time/@datetime').extract_first()
        yield {'title': title, 'url': news_url, 'body_text': paragraphs, 'timestamp': date_time}
However, when I run my spider, it produces the following output.
2020-11-28 20:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e> (referer: https://finance.yahoo.com/sitemap/2020_03_all)
2020-11-28 20:42:40 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://finance.yahoo.com/news/onegold-becomes-first-company-offer-110000241.html>
Cookie: B=cnmvgrdfs5a0r&b=3&s=o1; GUCS=ASXMbR9p
2020-11-28 20:42:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e>
{'title': 'Yahoo er nu en del af Verizon Media', 'url': 'https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e', 'body_text': [], 'timestamp': None}
2020-11-28 20:42:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_d6731ce6-78bc-4222-914f-24cf98f874b8> (referer: https://finance.yahoo.com/sitemap/2020_03_all)
This seems to show that when my spider requests https://finance.yahoo.com/news/onegold-becomes-first-company-offer-110000241.html, which it found in https://finance.yahoo.com/sitemap/2020_03_all, it sends cookies to that article URL but is redirected to the consent wall at https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e.
I opened this consent wall, https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e, in my browser and found a data consent acceptance screen. When I clicked accept, it took me to the correct page that I want to scrape. The scraped result is also exactly the content of this consent screen.
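To make the symptom concrete, here is a minimal sketch of how a consent-wall response could be recognised from its final URL. The helper name `is_consent_wall` and the constant `CONSENT_HOST` are hypothetical and not part of the spider above. The host is taken from the log output, and treating it as the only consent host is an assumption.

```python
from urllib.parse import urlsplit

# Host copied from the log output; assuming it is the only consent host.
CONSENT_HOST = "consent.yahoo.com"


def is_consent_wall(url: str) -> bool:
    """Return True if the URL's host is the consent host, not an article host."""
    host = urlsplit(url).hostname or ""
    return host.lower() == CONSENT_HOST


urls = [
    "https://finance.yahoo.com/news/onegold-becomes-first-company-offer-110000241.html",
    "https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_05cc09ea-0bc0-439d-8b4c-2d6f20f52d6e",
]
for url in urls:
    # Report which URLs would be treated as the consent wall.
    print(is_consent_wall(url), url)
```

Checking `is_consent_wall(response.url)` at the top of `parse_news` and returning early would probably keep the consent page from being emitted as an item. It only detects the redirect and does not get past the consent screen.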
I tried setting COOKIES_ENABLED to True, but it did not help. Is there a way to get past this consent screen in Scrapy?
Thanks.
【Comments】:
- Did you find a solution? I am facing exactly the same problem.
Tags: python web-scraping scrapy yahoo-finance