【Question title】: Functions don't call on scrapy/webcrawler
【Posted】: 2020-05-26 20:25:42
【Question】:

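Can anyone tell me why ParseLinks and ParseContent are not being called? The rest of the script runs and prints/appends/does what it should, but I get nothing at all from the two parse functions. Any further error messages or suggestions are welcome too.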

import scrapy
import scrapy.shell
from scrapy.crawler import CrawlerProcess


Websites = ("https://www.flylevel.com/", "https://www.latam.com/en_us/")
links = []
D = {}
#D = {main website: links: content}
def dictlayout():
    for W in Websites:
        D[W] = []

dictlayout()

class spider(scrapy.Spider):
    name = "spider"
    start_urls = Websites
    print("request level 1")
    def start_requests(self):
        print("request level 2")
        for U in self.start_urls:
            print("request level 3")
            yield scrapy.Request(U, callback = self.ParseLinks)
            print("links: ")
            print(links)


    def ParseLinks(self, response):
        Link = response.xpath("/html//@href")
        Links = link.extract()
        print("parser print")
        print(link)
        for L in Links:
            link.append(L)
            D[W]=L
            yield response.follow(url=L, callback=self.ParseContent)

    def ParseContent(self, response):
        content = ParseLinks.extract_first().strip()
        D[W][L] = content
        print("content")
        print(content)

print(D)
print(links)


process = CrawlerProcess()
process.crawl(spider)
process.start()

【Comments】:

    Tags: function web-scraping callback scrapy


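    【Answer 1】:

    I think ParseLinks actually is being called. The problem is that you are trying to extract href from the html tag. The line Link = response.xpath("/html//@href") is probably what breaks your code. Use Link = response.xpath("//a/@href") instead.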
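Besides the XPath, there is a likely reason why a callback that is called can still look silent. ParseLinks contains yield, so it is a generator, and the posted code assigns Link but then reads link. The sketch below reproduces that pattern in plain Python without Scrapy; parse_links and the string standing in for a response are made up for illustration. In a real crawl Scrapy logs an exception raised in a callback and carries on, so the log is the place to look for a NameError.

```python
def parse_links(response):
    Link = response          # assigned with a capital L ...
    links = link.upper()     # ... but read with a lowercase l
    print("parser print")    # never reached, so nothing is printed
    yield links


# Calling a generator function only creates the generator object;
# none of the body has run at this point.
gen = parse_links("page")

# The body runs on the first iteration, and that is where it fails.
try:
    next(gen)
    error = None
except NameError as exc:
    error = exc

print(type(error).__name__, error)
```

The same applies to D[W] and D[W][L] in the posted callbacks: W and L are not defined inside those methods, so they would fail in the same way once the first error is fixed.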

    【Comments】:

    • Thanks, I changed that and also fixed a few other mistakes. My code now has the same format as def ParseLinks(self, response): Link = response.xpath("/html//@href") Links = link.extract() print("parser print") print(link) for L in Links: link.append(L) D[W]=L yield response.follow(url=L, callback=self.ParseContent), but now I get an indentation error in the link-parsing part.