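[Posted]: 2021-01-13 16:25:04
[Question]:
I have just started learning the Scrapy framework for web scraping. I am trying to scrape drug reviews from a medical website with the code below. However, when I run "scrapy runspider spiders/medreview.py -o med.csv", I get a message like "INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" and med.csv contains no data.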
# Importing Scrapy Library
import scrapy

# Creating a new class to implement Spider
class MedSpider(scrapy.Spider):
    # Spider name
    name = 'reviews'
    # Domain names to scrape
    allowed_domains = ['1mg.com']
    # Base URL of the product page whose reviews are scraped
    myBaseUrl = "https://www.1mg.com/otc/becosules-z-capsule-otc63496/amp"

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('.OtcPage__reviews-container___hrKgt')
        ##data = response.css('.ReviewCards__review-card___3Z733')
        # Collecting user reviews
        comments = data.css('.ReviewCards__review-description___WoLdZ')
        count = 0
        # Combining the results
        for review in comments:
            yield {'comment': ''.join(review.xpath('.//text()').extract())}
            count = count + 1
Following @stranac's comment, I added "start_urls = myBaseUrl". Now I get the following error in the console.
2020-09-28 16:04:34 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "E:\anaconda\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "E:\anaconda\lib\site-packages\scrapy\spiders\__init__.py", line 77, in start_requests
    yield Request(url, dont_filter=True)
  File "E:\anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "E:\anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 69, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
[Discussion]:
- You have defined neither start_urls nor start_requests(), so your spider has nothing to parse.
- It should be start_urls=[myBaseUrl], not start_urls=myBaseUrl. You got that wrong, @Sumithra.
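The "Missing scheme in request url: h" error is consistent with the string being iterated one character at a time. The sketch below illustrates this using only the standard library. The helper iter_start_urls is hypothetical, a simplified stand-in for a loop over start_urls and not Scrapy's actual code, and the scheme check via urlparse is an assumption made for illustration.

```python
from urllib.parse import urlparse

# URL taken from the question.
MY_BASE_URL = "https://www.1mg.com/otc/becosules-z-capsule-otc63496/amp"


def iter_start_urls(start_urls):
    """Hypothetical stand-in for a loop over start_urls (not Scrapy's code)."""
    for url in start_urls:
        # Reject anything that has no URL scheme such as "https".
        if not urlparse(url).scheme:
            raise ValueError('Missing scheme in request url: %s' % url)
        yield url


# A list yields the whole URL as a single item.
urls_from_list = list(iter_start_urls([MY_BASE_URL]))

# A bare string is iterated character by character, so the first "URL" is "h".
try:
    list(iter_start_urls(MY_BASE_URL))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(urls_from_list)
print(error_message)
```

In the spider itself, the corresponding change would be to define start_urls = [myBaseUrl] as a class attribute, as the second comment suggests.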
Tags: python python-3.x python-2.7 web-scraping scrapy