I need to import data from a CSV file into a PostgreSQL table with Python, efficiently and in batches: answers

【Question title】: I need to import data from a csv file to a postgresql table with python efficiently and in batches
【Posted】: 2020-11-01 16:26:00
【Question description】:

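I am looking for an efficient way to import data from a CSV file into a PostgreSQL table in batches using Python, because my files are fairly large and the server I am importing into is far away. I need an efficient solution because everything I have tried was either slow or simply did not work. I am using SQLAlchemy. I would like to use raw SQL, but parameterizing it is very hard, and I need multiple loops to execute a query for multiple rows.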

【Question discussion】:

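Use COPY: stackoverflow.com/questions/13125236/…
You first complain that everything you tried is slow, but then state that you need "multiple loops to execute the queries". Almost by definition, that is the slowest solution. Post a complete description of the problem, the table definitions, and sample data, as text, no images. The community can most likely come up with a better solution, but we need that information. I suppose this is a matter of what you are most used to, but SQL is the easiest thing to parameterize: name the columns and make sure the values are in the right order. But that is only my opinion and experience.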

Tags: python python-3.x postgresql sqlalchemy

【Solution 1】:

My task was to process some data and migrate it from CSV files into a remote Postgres instance.

I decided to use the Python script below:

import csv
import time
import uuid

import psycopg2
import psycopg2.extras

# Record the time at the start of the script
start = time.time()
# Let psycopg2 adapt Python uuid.UUID objects to PostgreSQL uuid values
psycopg2.extras.register_uuid()

# List of CSV files to process and migrate
file_list = ["Address.csv"]

# The statement never changes, so build it once, outside the loops
postgres_insert_query = """ INSERT INTO "public"."address"("address_id","address_line_1","locality_area_street","address_name","app_version","channel_type","city","country","created_at","first_name","is_default","landmark","last_name","mobile","pincode","territory","updated_at","user_id") VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""

conn = psycopg2.connect("host=localhost dbname=address user=postgres password=docker")
cur = conn.cursor()

i = 1
for file_name in file_list:
    # newline="" is what the csv module expects; `with` closes the file even on errors
    with open(file_name, newline="") as f:
        csv_f = csv.reader(f)
        next(csv_f)  # skip the header row
        for row in csv_f:
            # Some simple manipulations on each row
            row[0] = uuid.uuid4()  # replace the first column with a fresh uuid4
            row[10] = False        # force column 10 ("is_default") to False
            row.pop(13)            # drop the CSV column at index 13

            # Track the number of rows inserted
            print(i)
            i += 1

            # One INSERT (and one round trip to the server) per row
            cur.execute(postgres_insert_query, row)

conn.commit()
conn.close()
print(time.time() - start)

The script ran well and quickly when tested locally. Connecting to a remote database server, however, adds a lot of latency.
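Each cur.execute call in the script above costs one network round trip, which is likely where the remote latency comes from. One way to reduce that is to send many rows per statement. The following is only a minimal sketch, assuming psycopg2's execute_values helper is acceptable; the names chunked and insert_in_batches, the batch size, and the shortened three-column list are hypothetical and would need to be replaced with the real columns.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows with one multi-row INSERT per batch; return the row count."""
    # Imported here so that chunked() can be used without psycopg2 installed.
    from psycopg2.extras import execute_values

    # Hypothetical shortened column list; use the full list from the script.
    query = (
        'INSERT INTO "public"."address" '
        '("address_id", "city", "pincode") VALUES %s'
    )
    total = 0
    with conn.cursor() as cur:
        for batch in chunked(rows, batch_size):
            # execute_values expands the single %s into many value tuples,
            # so a batch is sent as one statement instead of one per row.
            execute_values(cur, query, batch, page_size=batch_size)
            conn.commit()  # assumption: committing after every batch is fine
            total += len(batch)
    return total
```

Because chunked accepts any iterable, the rows can come straight from a csv.reader without loading the whole file into memory. Committing per batch means a failure part-way through leaves earlier batches in the table; move the commit after the loop if the load should be all-or-nothing.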

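As a workaround, I migrated the processed data into my local Postgres instance. I then generated a .sql file of the migrated data and imported that .sql file manually on the remote server.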

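Alternatively, you can use Python's multithreading to open several concurrent connections to the remote server, give each connection its own isolated batch, and flush the data. This should speed up your migration considerably.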

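I have not personally tried the multithreading approach, because I did not need it in my case. But it looks like it would be very effective.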
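The multithreaded idea described above could look roughly like the sketch below. It is untested against a real server and rests on several assumptions: the function names load_batch and load_concurrently, the worker count, and the shortened three-column list are all hypothetical. Each worker opens its own connection, as the answer suggests, since a psycopg2 cursor should not be shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor


def load_batch(dsn, batch):
    """Hypothetical worker: open a dedicated connection and insert one batch."""
    # Imported lazily so load_concurrently() can be tried with a stub worker.
    import psycopg2
    from psycopg2.extras import execute_values

    # Hypothetical shortened column list; use the real one for your table.
    query = (
        'INSERT INTO "public"."address" '
        '("address_id", "city", "pincode") VALUES %s'
    )
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            execute_values(cur, query, batch, page_size=1000)
        conn.commit()
    finally:
        conn.close()
    return len(batch)


def load_concurrently(dsn, batches, worker=load_batch, max_workers=4):
    """Run one worker call per batch on a thread pool; return total rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map re-raises the first worker exception when results are read.
        return sum(pool.map(lambda batch: worker(dsn, batch), batches))
```

The worker parameter exists so the pool logic can be exercised with a stand-in function before pointing it at a database. Whether this actually beats a single connection depends on the server and the network, so it is worth measuring with a small max_workers first.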

Hope this helps! :)

Resource: CSV Manipulation using Python for Beginners.

【Discussion】:

【Solution 2】:

Use psycopg2's copy_from cursor method to copy all rows into the table in a single operation.

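# cur and conn are an open psycopg2 cursor and connection, as in Solution 1
with open('file.csv', 'r') as f:
    next(f)  # skip the header row
    # copy_from expects tab-separated input by default, so pass sep=',' for a CSV file
    cur.copy_from(f, 'table_name', sep=',', columns=('id', 'name', 'email'))
conn.commit()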
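copy_from reads a plain delimited text format and does not understand CSV quoting, so a field that itself contains a comma would break the load. A possible alternative, sketched here under assumptions, is psycopg2's copy_expert with PostgreSQL's CSV format, fed from an in-memory buffer so that rows can be modified in Python first. The helper names rows_to_csv_buffer and copy_rows are hypothetical; table_name and its columns are taken from the snippet above.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialise rows into an in-memory, properly quoted CSV file object."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    buffer.seek(0)  # rewind so the consumer reads from the start
    return buffer


def copy_rows(conn, rows):
    """Stream rows to the server with a single COPY ... FROM STDIN."""
    # table_name and the column list are placeholders from the answer above.
    copy_sql = "COPY table_name (id, name, email) FROM STDIN WITH (FORMAT csv)"
    with conn.cursor() as cur:
        # conn is assumed to be an open psycopg2 connection.
        cur.copy_expert(copy_sql, rows_to_csv_buffer(rows))
    conn.commit()
```

For very large files, calling copy_rows once per chunk of rows keeps the buffer small and gives the batching the question asks for, at the cost of one COPY per chunk.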

【Discussion】: