【Question title】: Python scrapy yield to .json file not working
【Posted】: 2022-09-30 21:09:58
【Question】:

I want to use Scrapy to extract the titles of the different books at a URL and output/store them in a JSON file as an array of dictionaries.

Here is my code:

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    star_urls = [
        "http://books.toscrape.com"
    ]

def parse(self, response):
    titles = response.css("article.product_pod h3 a::attr(title)").getall()
    for title in titles:
        yield {"title": title}

This is what I typed in the terminal:

scrapy crawl books -o books.json

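The books.json file is created, but it is empty.

I checked that I am in the correct directory and venv, but it still does not work.

However, earlier I ran this spider to scrape the whole HTML and write it to a books.html file, and everything worked fine.

Here is my code: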

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    star_urls = [
        "http://books.toscrape.com"
    ]
    def parse(self, response):
        with open("books.html", "wb") as file:
            file.write(response.body)

This is what I typed in the terminal:

scrapy crawl books

Any ideas about what I am doing wrong? Thanks.

Edit:

Entering response.css('article.product_pod h3 a::attr(title)').getall()

into the scrapy shell outputs:

['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]
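
For reference, the target shape described at the top of the question (an array of dictionaries in a JSON file) can be sketched with the standard library alone. This is only an illustration of that shape, using the first three titles from the shell output; it is not how Scrapy's feed export is implemented, and the variable names are made up for the example.

```python
import json

# First three titles from the scrapy shell output; the full list works the same way.
titles = ["A Light in the Attic", "Tipping the Velvet", "Soumission"]

# One dictionary per title, matching what parse() is meant to yield.
items = [{"title": title} for title in titles]

# Serialize the list as a JSON array of objects.
books_json = json.dumps(items, indent=2)
print(books_json)
```

A non-empty books.json from the spider would be expected to hold one such object per book on the page.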

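  • Have you verified, with a debugger or a print() call, that your .getall() actually returns something?
  • I first tried it in the scrapy shell and got a list of titles, so it does return something.

Tags: python json python-3.x macos scrapy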


【Solution 1】:

Run this code now. It should work.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):

        titles = response.css('.product_pod')
        for title in titles:
            yield {
                "title": title.css('h3 a::attr(title)').get()
                #"title": title.css('h3 a::text').get()
            }
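
Compared with the first listing in the question, this version spells start_urls correctly and defines parse inside the class body. Whether either difference explains the empty file is not confirmed in this thread. The sketch below only illustrates why both details can matter in general, using a hypothetical stand-in base class: BaseSpider, run, and the two subclasses are invented for the example and are not Scrapy's API.

```python
# Stand-in for a framework base class. This is NOT scrapy.Spider, only a
# hypothetical model of "read start_urls, call self.parse for each one".
class BaseSpider:
    start_urls = []

    def parse(self, response):
        raise NotImplementedError("parse callback is not defined")

    def run(self):
        # Placeholder: each URL string plays the role of its own response.
        items = []
        for url in self.start_urls:
            items.extend(self.parse(url))
        return items


class QuestionLayout(BaseSpider):
    # Misspelled attribute, as in the question: the base class never reads it.
    star_urls = ["http://books.toscrape.com"]


def parse(self, response):
    # Module-level function: it is not attached to QuestionLayout.
    yield {"title": response}


class AnswerLayout(BaseSpider):
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # Defined inside the class body, so it overrides BaseSpider.parse.
        yield {"title": response}


question_items = QuestionLayout().run()
answer_items = AnswerLayout().run()
print(question_items)  # nothing is visited, so nothing is produced
print(answer_items)    # one item per entry in start_urls
```

In this toy model the first layout produces no items at all, because the base class finds an empty start_urls and the module-level parse is never called.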

【Discussion】:

  • Thanks for the suggestion, but the JSON file is still empty. Do you know what it might be?
  • Terminal command to run: scrapy crawl quotes -o data.json
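
To check quickly whether a run actually wrote any items, a small standard-library helper can count the entries in the output file. This is a rough sketch: count_feed_items is a made-up name, and it assumes the file is either missing, empty, or a single JSON array such as the data.json from the command above. The demo uses throwaway files so it runs without Scrapy.

```python
import json
import tempfile
from pathlib import Path


def count_feed_items(path):
    """Return how many items a JSON feed file holds; 0 if missing or empty."""
    feed = Path(path)
    if not feed.exists():
        return 0
    text = feed.read_text(encoding="utf-8").strip()
    if not text:
        # A zero-byte or whitespace-only file has no items to parse.
        return 0
    return len(json.loads(text))


# Demo against temporary files instead of a real crawl output.
with tempfile.TemporaryDirectory() as tmp:
    empty_feed = Path(tmp) / "empty.json"
    empty_feed.write_text("", encoding="utf-8")

    filled_feed = Path(tmp) / "filled.json"
    filled_feed.write_text(
        json.dumps([{"title": "Olio"}, {"title": "Set Me Free"}]),
        encoding="utf-8",
    )

    empty_count = count_feed_items(empty_feed)
    filled_count = count_feed_items(filled_feed)

print(empty_count, filled_count)
```

After a crawl, calling count_feed_items("data.json") in the project directory would show whether the spider exported anything; a result of 0 suggests looking at the crawl log for errors or for zero scraped items.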