Scrapy crawler: Unable to store multiple URLs into Postgres

【Question title】: Scrapy crawler: Unable to store multiple URLs into Postgres
【Posted】: 2022-01-18 18:43:46
【Question description】:

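I built a crawler with Scrapy in Python. I want to store the multiple URLs the crawler fetches in a Postgres table. When I start the crawler, the URLs are fetched and the table is created in Postgres, but the data is not stored.

Technologies used: Scrapy, Python

Expected output: the URLs should be stored in the Postgres table.

Error: I am unable to store all the URLs. The crawler does not work for all websites.

Please help!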

import scrapy
import os
import psycopg2

conn = psycopg2.connect(
   database="postgres", user='postgres', password='password', host='127.0.0.1', port= '5432'
)
print("connected")
conn.autocommit = True
cur=conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS tmp_crawler
(
WEBSITE VARCHAR(500) NOT NULL
)

""")


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains=['google.com']
    start_urls = ['https://www.google.com//'] 
    

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            abs_url = response.urljoin(url)
            var1  = "INSERT INTO tmp_crawler(website) VALUES('" + url + "')"
        cur.execute(var1)
        conn.commit()
        yield {'title': abs_url}

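【Question discussion】:

Add some print() or breakpoint() statements to your code to see where it goes wrong.
I'm no Python guru myself, but are you sure this code is not open to SQL injection? And what about error handling? What error does the database return?
I am not getting any error message. I see that only 1 record is inserted. The others are not inserted.
It looks like cur.execute is outside the for loop, so it is only executed for the last item.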
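
The two points raised in the comments above (the execute call sitting outside the loop, and SQL built by string concatenation) can be illustrated with a minimal sketch. It is not the asker's code: `store_urls` is a hypothetical helper, the URLs are made up, and the standard-library `sqlite3` module stands in for psycopg2 so the example runs without a Postgres server. With psycopg2 the placeholder would be `%s` rather than `?`.

```python
import sqlite3  # stand-in for psycopg2 so the sketch runs without a server


def store_urls(cur, urls, placeholder="%s"):
    """Insert every URL with one parameterized execute() call per row."""
    sql = f"INSERT INTO tmp_crawler (website) VALUES ({placeholder})"
    for url in urls:
        # execute() is inside the loop, so each URL gets its own INSERT;
        # the value is passed as a parameter instead of being concatenated
        cur.execute(sql, (url,))


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS tmp_crawler (website VARCHAR(500) NOT NULL)")

# hypothetical URLs; the quote in the last one would break concatenated SQL
urls = [
    "https://www.google.com/a",
    "https://www.google.com/b",
    "https://example.com/?q=it's",
]
store_urls(cur, urls, placeholder="?")  # sqlite3 uses "?", psycopg2 uses "%s"
conn.commit()

stored = [row[0] for row in cur.execute("SELECT website FROM tmp_crawler ORDER BY rowid")]
print(len(stored), "rows stored")
```

Passing the value as a parameter lets the driver handle quoting, which addresses the SQL injection concern, and keeping the call inside the loop means every URL is written rather than only the last one.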

Tags: python postgresql scrapy

【Solution 1】:

You can use Scrapy's ITEM_PIPELINES setting for this. See the sample implementation below.

import scrapy
import psycopg2

class DBPipeline(object):
    def open_spider(self, spider):
        # connect to the database; re-raise on failure so the error is not
        # silently swallowed and self.cur is never used while undefined
        try:
            self.conn = psycopg2.connect(database="postgres", user="postgres", password="password", host="127.0.0.1", port="5432")
            self.conn.autocommit = True
            self.cur = self.conn.cursor()
        except psycopg2.Error:
            spider.logger.error("Unable to connect to database")
            raise

        # create the table
        try:
            self.cur.execute("CREATE TABLE IF NOT EXISTS tmp_crawler (website VARCHAR(500) NOT NULL);")
        except psycopg2.Error:
            spider.logger.error("Error creating table `tmp_crawler`")
            raise

    def process_item(self, item, spider):
        try:
            self.cur.execute('INSERT INTO tmp_crawler (website) VALUES (%s)', (item.get('title'),))
            spider.logger.info("Item inserted to database")
        except Exception as e:
            spider.logger.error(f"Error `{e}` while inserting item <{item.get('title')}>")
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()


class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    allowed_domains=['google.com']
    start_urls = ['https://www.google.com/'] 
    custom_settings = {
        'ITEM_PIPELINES': {
            DBPipeline: 500
        }
    }

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            yield {'title': response.urljoin(url)}
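
Note that the answer stores `response.urljoin(url)`, whereas the INSERT in the question used the raw `url` taken from the `href` attribute. As a rough illustration of the difference, the sketch below uses the standard library's `urllib.parse.urljoin`, which resolves relative links in much the same way. The base URL stands in for `response.url` and the `href` values are hypothetical.

```python
from urllib.parse import urljoin

base = "https://www.google.com/"  # stands in for response.url

# hypothetical href values of the kind //a/@href can return
hrefs = ["/imghp?hl=en", "preferences", "https://mail.google.com/mail/"]

# relative links are resolved against the base; absolute links pass through
absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)
```

Storing the resolved form means every row in `tmp_crawler` holds a full URL rather than a fragment such as `/imghp?hl=en`.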

【Discussion】: