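【Question title】: Saving the output to JSON format
【Posted】: 2020-05-21 13:24:05
【Question】:

I am trying to write my output (i.e. og = OpenGraph(i, ["og:title", "og:description", "og:image", "og:url"])) to a JSON file. But when I validate the output, it says that it is not in correct standard JSON format. Can anyone help me figure out what I am doing wrong?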

# -*- coding: utf-8 -*-
import scrapy
from ..items import news18Item
import re
from webpreview import web_preview
from webpreview import OpenGraph
import json

class News18SSpider(scrapy.Spider):
    name = 'news18_story'
    page_number = 2
    start_urls = ['https://www.news18.com/movies/page-1/']

    def parse(self, response):
        items = news18Item()
        page_id = response.xpath('/html/body/div[2]/div[5]/div[2]/div[1]/div[*]/div[*]/p/a/@href').getall()
        items['page_id'] = page_id

        story_url = page_id

        for i in story_url :
            og = OpenGraph(i, ["og:title", "og:description", "og:image", "og:url"])

            dictionary =[{ "page_title": og.title }, { "description": og.description }, { "image_url": og.image }, { "post_url": og.url}] 

            with open("news18_new.json", "a") as outfile: 
                json.dump(dictionary, outfile)
                outfile.write("\n")
                # json.dump("\n",outfile) 



        next_page = 'https://www.news18.com/movies/page-' + str(News18SSpider.page_number) + '/'
        if News18SSpider.page_number <= 20:
           News18SSpider.page_number += 1  
           yield response.follow(next_page, callback = self.parse)

        pass

【Comments】:

  • Can you provide a sample of the output you wrote to news18_new.json?
  • og:title o/p Mammootty, Kamal Haasan And More Celebs Wish Mohanlal On His Birthday og:description o/p On Malayalam superstar Mohanlal’s birthday, several members from the world of entertainment including Mammootty, Kamal Haasan, Nivin Pauly extended their best wishes to him. og:image o/p https://images.news18.com/ibnlive/uploads/2020/05/1590065340_1590065211213_copy_875x583.jpg og:url o/p https://www.news18.com/news/movies/mammootty-kamal-haasan-and-more-celebs-wish-mohanlal-on-his-birthday-2630693.html This is the sample output @喜满洲
  • {"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"} Output of news18_new.json
  • Put errors, data, and other information in the question rather than in comments; it will be more readable.
  • In the current version you create a multi-JSON file, i.e. a file containing many JSON objects. In a normal JSON file you first have to build a list with all the data and then save that list as one object.

Tags: python json python-3.x web-scraping scrapy


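【Solution 1】:

This is minimal working code.

You can put all the code in one file, script.py, and run it with python script.py without creating a project.

I yield every item as a single dictionary

    yield {
        "page_title": og.title,
        "description": og.description,
        "image_url": og.image,
        "post_url": og.url
    }

and Scrapy saves it as a correct JSON file that holds one list containing many dictionaries.

You created many separate lists, one after another in the same file. That is not correct JSON format.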
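To see why the original output fails validation, here is a minimal sketch using only the standard `json` module. The sample records and the helper name `is_valid_json` are made up for illustration. Two lists written one after another do not parse as a single JSON document, although each line parses on its own.

```python
import json

# Hypothetical sample of what the question's loop writes: one list per line.
appended = '[{"page_title": "A"}]\n[{"page_title": "B"}]\n'

def is_valid_json(text):
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# The file as a whole is rejected because of the data after the first list.
whole_file_ok = is_valid_json(appended)

# Each line is valid on its own, so the file can still be read line by line.
per_line = [json.loads(line) for line in appended.splitlines() if line.strip()]

# One list holding every record is what a validator expects.
merged = json.dumps([item for chunk in per_line for item in chunk])
```

Reading line by line like this is one way to recover data from a file that was already written by the original spider.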

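A JSON file is not a format that you can append new data to. You have to read all the data into memory, append the new item to the data in memory, and then save all the data to the file again.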
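The read, append, and rewrite cycle described above could look roughly like this if you wrote the file yourself instead of letting Scrapy do it. The helper name `append_record`, the temporary file path, and the sample records are assumptions for illustration only.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Load the whole JSON list, add one record, and rewrite the file."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    else:
        data = []  # first write: start with an empty list
    data.append(record)
    # Mode "w" replaces the old content, so the file always holds one list.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data

# Usage with a temporary file and made-up records.
demo_path = os.path.join(tempfile.mkdtemp(), "news18_demo.json")
append_record(demo_path, {"page_title": "First"})
result = append_record(demo_path, {"page_title": "Second"})
```

The whole file is reread and rewritten for every record, which gets slow for large crawls. That is one reason to yield items and let Scrapy write the file once.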

You can append to a CSV file without reading all the data into memory.
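For comparison, a minimal sketch of appending rows to a CSV file with the standard `csv` module. The field names mirror the keys used in the spider; the helper name `append_row`, the temporary path, and the sample values are hypothetical.

```python
import csv
import os
import tempfile

FIELDS = ["page_title", "description", "image_url", "post_url"]

def append_row(path, row):
    """Append one row; write the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

# Usage with a temporary file and made-up rows.
demo_csv = os.path.join(tempfile.mkdtemp(), "news18_demo.csv")
append_row(demo_csv, {"page_title": "First", "description": "d1",
                      "image_url": "i1", "post_url": "u1"})
append_row(demo_csv, {"page_title": "Second", "description": "d2",
                      "image_url": "i2", "post_url": "u2"})

# Read the file back to confirm that both rows were stored.
with open(demo_csv, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

Each call only writes at the end of the file, so earlier rows are never loaded or rewritten.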


import scrapy
from webpreview import OpenGraph

class News18SSpider(scrapy.Spider):

    name = 'news18_story'
    page_number = 1
    start_urls = ['https://www.news18.com/movies/page-1/']

    def parse(self, response):
        #all_hrefs = response.xpath('/html/body/div[2]/div[5]/div[2]/div[1]/div[*]/div[*]/p/a/@href').getall()
        all_hrefs = response.xpath('//div[@class="blog-list-blog"]/p/a/@href').getall()

        for href in all_hrefs:
            og = OpenGraph(href, ["og:title", "og:description", "og:image", "og:url"])

            yield {
                "page_title": og.title,
                "description": og.description,
                "image_url": og.image,
                "post_url": og.url
            } 

        if self.page_number <= 20:
            self.page_number += 1  
            next_url = 'https://www.news18.com/movies/page-{}/'.format(self.page_number)
            #yield response.follow(next_url) # , callback=self.parse)
            yield scrapy.Request(next_url)

# --- run without project and save in `output.json` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',

    # save in file CSV, JSON or XML
    # (Scrapy 2.1+ deprecates FEED_FORMAT/FEED_URI in favor of the FEEDS setting)
    'FEED_FORMAT': 'json',     # csv, json, xml
    'FEED_URI': 'output.json', # output file
})

c.crawl(News18SSpider)
c.start() 

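【Comments】:

  • Are you sure it is correct? It prints one entry twice, please check. It also stops after the first page.
  • I forgot /p/ in the xpath, but now the code works correctly.
  • It does not go to the next page. Look at what I get: raise URLUnreachable("The URL does not exist.") webpreview.excepts.URLUnreachable: The URL does not exist.
  • Did you run the same code? The server may recognize your IP and block your requests, or there may be a problem with the internet connection or the server. When I try 'https://www.news18.com/movies/page-1/' in a web browser, I sometimes get a page with the message Error 404.
  • No, I use DOWNLOADER_MIDDLEWARES = { 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400, } which rotates the user agent.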