scrapy ERROR: Spider error processing (question and answers)

【Question title】: scrapy ERROR: Spider error processing
【Posted】: 2018-11-14 11:05:00
【Question description】:

I am very new to scrapy, and I get this error when I run my code.

My code:

import urlparse

from scrapy.http import Request
from scrapy.spiders import BaseSpider
class legco(BaseSpider):
name = "sec_gov"

allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

#extract home page search results
def parse(self, response):
for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]/a/@href').extract():
    req = Request(url = link, callback = self.parse_page)
    print link
    yield req

#extract second link search results
def parse_second(self, response):
for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]//*[@id="documentsbutton"]/a/@href').extract():
    req = Request(url = link, callback = self.parse_page)
    print link
    yield req

As soon as I try to run this code with scrapy crawl sec_gov, I get this error:

2018-11-14 15:37:26 [scrapy.core.engine] INFO: Spider opened
2018-11-14 15:37:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-14 15:37:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-14 15:37:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany> (referer: None)
2018-11-14 15:37:27 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany> (referer: None)
Traceback (most recent call last):
  File "/home/surukam/.local/lib/python2.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/surukam/.local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 90, in parse
    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: legco.parse callback is not defined
2018-11-14 15:37:27 [scrapy.core.engine] INFO: Closing spider (finished)

Can anyone help me solve this problem? Thanks in advance.

【Question discussion】:

Is this Python 2 code?
Thanks for the reply, dejan. Yes, it is Python 2 code.

Tags: python web-scraping scrapy web-crawler scrapy-spider

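【Solution 1】:

Your code should not run at all. Several things need fixing before your script will work. Where did you find this self.parse_page, and what is it doing in your script? Your script is also badly indented. I have fixed the script, and it can now follow each URL from the landing page to the relevant document links on its inner pages. Try this to get the content.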

import scrapy

class legco(scrapy.Spider):
    name = "sec_gov"

    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    def parse(self, response):
        for link in response.xpath('//table[@summary="Results"]//td[@scope="row"]/a/@href').extract():
            absoluteLink = response.urljoin(link)
            yield scrapy.Request(url = absoluteLink, callback = self.parse_page)

    def parse_page(self, response):
        for links in response.xpath('//table[@summary="Results"]//a[@id="documentsbutton"]/@href').extract():
            targetLink = response.urljoin(links)
            yield {"links":targetLink}
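To see why the indentation matters, here is a minimal sketch that reproduces the error from the traceback without Scrapy installed. `StandInSpider` is a hypothetical stand-in, not the real `scrapy.Spider`; it only copies the one behaviour visible in the traceback, namely a default `parse` that raises `NotImplementedError`. The `try_parse` helper and the string used as a response are also assumptions made for illustration.

```python
class StandInSpider:
    """Hypothetical stand-in for the base spider: the default parse() raises."""

    def parse(self, response):
        raise NotImplementedError(
            "{}.parse callback is not defined".format(self.__class__.__name__)
        )


class BadlyIndented(StandInSpider):
    name = "sec_gov"


# Dedented by mistake: this is a module-level function, not a method,
# so BadlyIndented still inherits the raising parse() from the base class.
def parse(self, response):
    return ["link found in " + response]


class Fixed(StandInSpider):
    name = "sec_gov"

    # Indented inside the class body, so it overrides the base parse().
    def parse(self, response):
        return ["link found in " + response]


def try_parse(spider, response):
    """Call spider.parse and return either its result or the error text."""
    try:
        return spider.parse(response)
    except NotImplementedError as exc:
        return "NotImplementedError: {}".format(exc)
```

Calling `try_parse(BadlyIndented(), "page")` gives the same kind of message as the log in the question, while `try_parse(Fixed(), "page")` returns the list from the overriding method. The same rule applies to `parse_page`: it has to be defined as a method of the spider before it can be passed as `callback = self.parse_page`.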

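【Discussion】:

Check out the script, @Vinod kumar, to get the links to the various documents.
Oh great, it works fine. My goal is this: there are many more files, but I only want to download specific types of files (e.g. EX-10.1, EX-10.2, ... Ex-10.99). Path: BaseUrl -> CIK link -> Documents -> download files (the EX-10 files). The files are available in .htm and .txt.
Don't ask multi-part questions all at once. If this serves the purpose of what you originally asked, please post another question describing what you have tried further. Thanks.
Sorry, I am new to scrapy. Could you help me with this?
I have to download specific types of contracts, so what changes do I need to make? Can you help me with this?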
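The follow-up in the comments (keeping only EX-10.x exhibits in .htm or .txt form) was never answered in this thread. The snippet below is only a rough sketch of the filtering step, written for Python 3 with the standard library. The function name `select_ex10_links`, the `(doc_type, href)` row shape, and the example paths are all assumptions; the thread does not show the layout of the documents page, so extracting those rows with XPath is left out.

```python
import re
from urllib.parse import urljoin

# Matches exhibit types such as "EX-10.1" or "Ex-10.99" (case-insensitive).
EX10_TYPE = re.compile(r"^EX-10\.\d{1,2}$", re.IGNORECASE)
WANTED_EXTENSIONS = (".htm", ".txt")


def select_ex10_links(page_url, rows):
    """Return absolute URLs for rows typed EX-10.x whose file is .htm or .txt.

    `rows` is an iterable of (doc_type, href) pairs; this shape is an
    assumption about what a spider callback would extract from the page.
    """
    selected = []
    for doc_type, href in rows:
        if not EX10_TYPE.match(doc_type.strip()):
            continue  # not an EX-10.x exhibit
        if not href.lower().endswith(WANTED_EXTENSIONS):
            continue  # some other file format
        # Resolve relative hrefs against the page they were found on.
        selected.append(urljoin(page_url, href))
    return selected


# Hypothetical example data, not taken from a real filing.
example_page = "https://www.sec.gov/Archives/example/index.htm"
example_rows = [
    ("10-K", "/Archives/example/form10k.htm"),
    ("EX-10.1", "/Archives/example/ex10-1.htm"),
    ("Ex-10.99", "ex10-99.txt"),
    ("EX-10.2", "/Archives/example/ex10-2.pdf"),
]
example_links = select_ex10_links(example_page, example_rows)
```

Inside a spider, a third callback could build the rows from the response, call a helper like this with `response.url`, and yield one `scrapy.Request` per returned URL, in the same way `parse` hands off to `parse_page` in the answer.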