【Posted】: 2020-06-24 16:45:35
【Question】:
I am building a small crawler with Scrapy for internal use, to scrape a subsite on our intranet. On the pages I am scraping, this snippet is the last element on the page:
<span hidden>
xxxxx | 2017-03-15 10:36:57 +0100 (Wed, 15 Mar 2017) | 11
yyyyyy | 2017-06-07 14:54:24 +0200 (Wed, 07 Jun 2017) | 42
zzzzzzz | 2017-10-07 11:51:24 +0200 (Sat, 07 Oct 2017) | 168
aaaaa_bbbb | 2019-02-04 14:27:46 +0100 (Mon, 04 Feb 2019) | 0
</span>
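For reference, each line inside the span looks pipe-separated. The sketch below shows one way such text could be split into records once it has been retrieved. The function name `parse_span_text` and the field names `name`, `when`, and `count` are assumptions made for illustration; the page itself does not label the columns.

```python
from datetime import datetime

# Sample text copied from the hidden <span> shown above.
SPAN_TEXT = """
xxxxx | 2017-03-15 10:36:57 +0100 (Wed, 15 Mar 2017) | 11
yyyyyy | 2017-06-07 14:54:24 +0200 (Wed, 07 Jun 2017) | 42
zzzzzzz | 2017-10-07 11:51:24 +0200 (Sat, 07 Oct 2017) | 168
aaaaa_bbbb | 2019-02-04 14:27:46 +0100 (Mon, 04 Feb 2019) | 0
"""


def parse_span_text(text):
    """Split each well-formed line into a dict with three fields.

    The field names are guesses; the source page does not name them.
    """
    rows = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3 or not all(parts):
            continue  # skip blank or malformed lines
        name, stamp, count = parts
        # Keep "YYYY-MM-DD HH:MM:SS +ZZZZ" and drop the parenthesised repeat.
        stamp = stamp.split(" (")[0]
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
        rows.append({"name": name, "when": when, "count": int(count)})
    return rows


rows = parse_span_text(SPAN_TEXT)
```

Lines that do not split into exactly three non-empty fields are skipped, so a truncated span produces an empty list rather than an exception.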
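When I fetch the page with scrapy shell or scrapy fetch, I get the complete page (including the text), but when I use scrapy.Request in my spider, the span only ever contains "\n|\n". The same thing happens when I try to intercept the response with a DownloaderMiddleware class.
My spider looks like this: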
import scrapy, csv
# from scrapy.crawler import CrawlerProcess
from scrapy import signals

class ProjectSpider(scrapy.Spider):
    name = "ProjectSpider"
    start_urls = [
        'http://cpusrv5.beumer.com/s2000_projects_overview/',
    ]
    projectList = []
    currentProjectIndex = 0
    projectListLength = 0
    currentProjectItem = []

    @classmethod
    def from_crawler(cls, crawler):
        spider = super().from_crawler(crawler)
        crawler.signals.connect(spider.spider_closed, signals.spider_closed)
        return spider

    def __init__(self):
        self.outfile1 = open("projects.csv","w",newline="")
        self.state = 1 # 1= projects, 2 = details

    def parse(self, response):
        if self.projectListLength == 0: # Overview page
            self.projectList.append(item)
            # Starting up the process of collecting detail-pages
            # print("DONE WITHT THE PROJECT OVERVIEW")
            self.projectListLength = len(self.projectList)
            self.currentProjectItem = self.projectList[0]
            pageToCrawl = response.urljoin(self.currentProjectItem["url"])
            # print("PageToCrawl: " + pageToCrawl)
            yield scrapy.Request(pageToCrawl, callback=self.parse)
        else:
            self.parse_detail_page(response)

    def parse_detail_page(self, response):
        print("Current project: " + self.currentProjectItem["unique_id"] + ' ' +
              self.currentProjectItem["pname"] + ' ' + self.currentProjectItem["pnum"])
        people = response.xpath('/html/body/span/text()').get()
        print("Full response: \n" + people)
        # Requesting the next page in list
        self.currentProjectIndex += 1
        if self.currentProjectIndex < self.projectListLength:
            self.currentProjectItem = self.projectList[self.currentProjectIndex]
            pageToCrawl = response.urljoin(self.currentProjectItem["url"]+'&index=' +
                                           str(self.currentProjectIndex))
            yield scrapy.Request(pageToCrawl, callback=self.parse)
        else:
            print("DONE !")

    def spider_closed(self):
        with open("projects.csv","w",newline="") as f:
            writer = csv.DictWriter(f,['unique_id','is_guess','pname','pnum','url','maintainer','path','os_guess','first_change','latest_change','latest_change_by','comment','repository','commit log','release log','project portal','servicenow','hotdoc','other documentation'])
            writer.writeheader()
            for data in self.projectList:
                writer.writerow(data)
What am I missing here?
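One detail in the spider as posted may be worth checking: parse_detail_page contains a yield, so it is a generator function, and parse calls it without iterating over the result. The sketch below reproduces that pattern in plain Python with hypothetical stand-in names (`detail_callback`, `parse_discarding`, `parse_delegating`). This is general Python behaviour rather than anything specific to Scrapy, and since the posted code is abbreviated it may not be the cause of the symptom described.

```python
def detail_callback(log):
    """Stand-in for a callback that contains `yield` (hypothetical name)."""
    log.append("body ran")
    yield "next request"


def parse_discarding(log, overview=False):
    """Mirrors the shape of the posted parse(): yields in one branch only."""
    if overview:
        yield "first detail request"
    else:
        # A generator object is created here and immediately dropped,
        # so the body of detail_callback never executes.
        detail_callback(log)


def parse_delegating(log, overview=False):
    """Same shape, but the else-branch forwards the inner generator."""
    if overview:
        yield "first detail request"
    else:
        # `yield from` iterates the inner generator and re-yields its items.
        yield from detail_callback(log)


discarded_log, delegated_log = [], []
discarded = list(parse_discarding(discarded_log))
delegated = list(parse_delegating(delegated_log))
```

In the first variant nothing is logged and nothing is yielded; in the second, the inner body runs and its yielded value reaches the caller.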
【Comments】:
- What happens if you write response.text to a file from your spider callback? Does the content match what you get from scrapy fetch?
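A minimal sketch of the check suggested in the comment above, assuming a small helper that is not part of the original spider. The name `dump_body` and the default `debug_dumps` directory are hypothetical; the helper uses only the standard library so it can be tried outside a crawl.

```python
import hashlib
from pathlib import Path


def dump_body(url, text, directory="debug_dumps"):
    """Write a response body to a file named after a hash of its URL.

    Returns the path of the file that was written.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    # A short hash keeps file names safe even for URLs with query strings.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12] + ".html"
    path = out_dir / name
    path.write_text(text, encoding="utf-8")
    return path
```

Inside a spider callback this could be called as `dump_body(response.url, response.text)`. The resulting file can then be compared with the output of scrapy fetch for the same URL to see whether the two bodies actually differ, or whether the difference only appears at the XPath step.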