
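【Question title】: Is there a faster way to move millions of rows from Excel to a SQL database using Python?
【Posted】: 2023-04-08 18:08:01
【Question description】:

I am a financial analyst with about two months of Python experience, and I am working on a project that uses Python and SQL to automate the compilation of a report. The process involves accessing a varying number of Excel files saved on a shared drive, pulling two tabs from each file (Summary and Quote), and combining the data into two large "Quote" and "Summary" tables. The next step is to pull various columns from each table, then merge, calculate, and so on.

The problem is that the dataset ends up at 3.4 million rows and about 30 columns. The program I wrote below works, but the first part (building the lists of dataframes) takes 40 minutes, and creating the database and exporting the data takes another 4.5 hours, not to mention the large amount of memory it uses.

I know there must be a better way to do this, but I don't have a CS background. Any help would be greatly appreciated.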

import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound

reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'
os.chdir(month_folder)

starttime = datetime.now()
print('Started', starttime)

c = 0

tables = list()
quote_combined = list()
summary_combined = list()

# Step through files in synced Sharepoint directory, select the files with the specific
# name format. For each file, parse the file name and add to 'tables' list, then load
# two specific tabs as pandas dataframes.  Add two columns, format column headers, then 
# add each dataframe to the list of dataframes. 

for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        table_name = str(xl[11:-5])
        tables.append(table_name)
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        
        quote_sheet.insert(0,'reportmonth', reportmonth)
        summary_sheet.insert(0,'reportmonth', reportmonth)
        quote_sheet.insert(0,'source_file', table_name)
        summary_sheet.insert(0,'source_file', table_name)
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        
        quote_combined.append(quote_sheet)
        summary_combined.append(summary_sheet)
        
        c = c + 1
        
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)

# Concatenate the list of dataframes to append one to another.  
# Totals about 3.4mm rows for August

totalQuotes = pd.concat(quote_combined)
totalSummary = pd.concat(summary_combined)     

# Change directory, create Sqlite database, and send the combined dataframes to database

os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
cur = conn.cursor()
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()

sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'

totalQuotes.to_sql(sqlite_table, sqlite_connection, if_exists = 'replace')    
totalSummary.to_sql(sqlite_table2, sqlite_connection, if_exists = 'replace')  
     
print('Finished. It took: ', datetime.now() - starttime)

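【Comments】:

Consider avoiding pandas entirely: save each Excel spreadsheet as CSV (you should already be doing this!), then import the CSV files into SQLite via Python or the sqlite3 CLI.
I know nothing about Python and avoid MS Excel whenever I can. However, when importing into SQLite you can save a great deal of time by wrapping the SQL statements in a transaction: 1) at the very start of the SQL statements: BEGIN TRANSACTION; 2) at the very end of the SQL statements: COMMIT;
HTH
@Parfait Can you tell me why I should already be saving to CSV? Also, what is the advantage of organizing the imported data from CSV rather than in pandas?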
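
The two suggestions in these comments (import from CSV, and wrap the inserts in a single transaction) can be sketched with the standard library alone. This is a minimal illustration, not the commenters' code: the helper `load_csv`, the table name `quotes`, and the sample columns are hypothetical, and an in-memory database and an in-memory CSV stand in for real files.

```python
import csv
import io
import sqlite3


def load_csv(conn, csv_file, table="quotes"):
    """Insert every row of an open CSV file into `table` inside one transaction."""
    reader = csv.reader(csv_file)
    header = next(reader)  # the first line holds the column names
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    # `with conn:` commits once at the end (or rolls back on error), so the
    # inserts do not each pay for their own commit.
    with conn:
        cur = conn.executemany(
            f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})', reader
        )
    return cur.rowcount


# Hypothetical sample data standing in for a CSV exported from Excel.
sample = io.StringIO("source_file,price\nfileA,1.5\nfileA,2.5\nfileB,4.0\n")
conn = sqlite3.connect(":memory:")
inserted = load_csv(conn, sample)
total = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
print(inserted, total)
```

For a real run, `sample` would be replaced by `open(path, newline="")` and `":memory:"` by the path of the database file.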

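Tags: python excel database pandas sqlite

【Solution 1】:

Try this. Most of the time here is spent loading the data from Excel into DataFrames. The script below writes each file's sheets to the database as soon as they are loaded instead of holding every file in memory. I can't promise it will cut the run time down to seconds, but it reduces the RAM load, which should speed up the process. It may save at least 5-10 minutes. Since I don't have access to the data I can't be sure, but it is worth trying.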

import os
import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine

# Define reportmonth first: the database file name below depends on it.
reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'

# Open the SQLite database once, before looping over the Excel files.
os.chdir(r'H:\AaronS\Databases')
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()

sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'

# Switch to the folder that holds the Excel files.
os.chdir(month_folder)

starttime = datetime.now()
print('Started', starttime)


c = 0
tables = list()

for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        
        table_name = str(xl[11:-5])
        tables.append(table_name)
        
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        
        quote_sheet.insert(0,'reportmonth', reportmonth)
        summary_sheet.insert(0,'reportmonth', reportmonth)
        
        quote_sheet.insert(0,'source_file', table_name)
        summary_sheet.insert(0,'source_file', table_name)
        
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        
        quote_sheet.to_sql(sqlite_table, sqlite_connection, if_exists = 'append')    
        summary_sheet.to_sql(sqlite_table2, sqlite_connection, if_exists = 'append')  
        
        c = c + 1
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)

【Comments】:

Thanks Kuldip, this approach seems to work faster. I just needed to change if_exists = 'replace' to if_exists = 'append'.

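【Solution 2】:

I see a few things you could do. First, since your first step is just to transfer the data into a SQL DB, you don't necessarily need to append all the files to one another. You can tackle the problem one file at a time (which means you can use multiprocessing!), and whatever calculations need doing can happen afterwards. This also reduces your RAM usage, because with 10 files in the folder you are not loading all 10 at once.
I would recommend the following:

1. Construct an array of the file names you need to access.
2. Write a wrapper function that takes a file name, opens and parses the file, and writes the contents to the MySQL database.
3. Use Python's multiprocessing.Pool class to process the files simultaneously. With 4 processes, for example, the task can run up to about 4 times faster if the work parallelizes well. If you need to derive calculations from this data and therefore need to aggregate it, do so after the data is in the MySQL database. That will be faster.
4. If you need to define some calculations based on the aggregated data, do it now, in the MySQL DB. SQL is a very powerful language, and there is a command for almost everything!

I've added a short code snippet to show what I mean :)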

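from multiprocessing import Pool

PROCESSES = 4

# Fill with the Excel file names to process.
FILES = []

def _process_file(filename):
    # Placeholder: open and parse the file, then write its rows to the database.
    print("Processing: " + filename)

# The __main__ guard is required on Windows, where worker processes re-import this module.
if __name__ == '__main__':
    with Pool(PROCESSES) as pool:
        pool.map(_process_file, FILES)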
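
One way to flesh out the skeleton above, sketched under stated assumptions rather than taken from the answer: the workers only parse files, and the parent process does all the writing, because SQLite (the database used in the question) allows only one writer at a time. The helpers `parse_file` and `load_all`, the `quotes` table, and its columns are hypothetical, and `parse_file` reads CSV as a stand-in for the Excel parsing step.

```python
import csv
import sqlite3
from multiprocessing import Pool


def parse_file(filename):
    """Worker: read one file and return (source name, list of row tuples)."""
    with open(filename, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return filename, [tuple(row) for row in reader]


def load_all(conn, filenames, map_fn=map):
    """Parse files via map_fn and append every row to one table from this process."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quotes (source_file TEXT, item TEXT, price REAL)"
    )
    total = 0
    for source, rows in map_fn(parse_file, filenames):
        with conn:  # one transaction per file
            conn.executemany(
                "INSERT INTO quotes VALUES (?, ?, ?)",
                [(source, *row) for row in rows],
            )
        total += len(rows)
    return total


if __name__ == "__main__":
    files = []  # hypothetical: fill with the paths to load
    conn = sqlite3.connect(":memory:")  # use a file path for a real run
    with Pool(4) as pool:
        # imap_unordered hands back each parsed file as soon as a worker finishes it.
        print(load_all(conn, files, map_fn=pool.imap_unordered))
    conn.close()
```

Passing the mapping function in as `map_fn` keeps `load_all` usable without a pool: the default built-in `map` runs the same logic in a single process, which is handy for checking the parsing and insert code before parallelizing.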

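A note on SQL: you don't need to create a separate table for every file you move into SQL! You can create one table from a given schema and then add the data from all the files to that table, row by row. That is essentially what the function you use to go from DataFrame to table does, but it creates 10 different tables. You can look up some examples of inserting rows into a table.

However, in your particular use case, setting the if_exists parameter to 'append' should work, as you mentioned in your comment. I only added the earlier reference because you mentioned you are fairly new to Python, and many of my friends in the finance industry have found a more nuanced understanding of SQL very useful.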
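
As a rough illustration of the single-table idea described above, and of doing the aggregation inside the database afterwards, the sketch below uses Python's built-in sqlite3 module. The table name `total_quotes`, its columns, and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used in practice

# One table with a fixed schema; source_file records where each row came from.
conn.execute(
    "CREATE TABLE total_quotes (source_file TEXT, reportmonth TEXT, amount REAL)"
)

# Hypothetical rows, as if parsed from two different workbooks.
rows_by_file = {
    "fileA": [("2020-08", 10.0), ("2020-08", 5.0)],
    "fileB": [("2020-08", 7.5)],
}
for source_file, rows in rows_by_file.items():
    with conn:  # commit once per file
        conn.executemany(
            "INSERT INTO total_quotes VALUES (?, ?, ?)",
            [(source_file, month, amount) for month, amount in rows],
        )

# Aggregate inside the database instead of in Python.
totals = conn.execute(
    "SELECT source_file, SUM(amount) FROM total_quotes "
    "GROUP BY source_file ORDER BY source_file"
).fetchall()
print(totals)
```

Because every row carries its `source_file`, all files can share one table and still be told apart in later queries.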

【Comments】: