openpyxl consolidate spreadsheets: answers

【Question title】: openpyxl consolidate spreadsheets
【Posted】: 2016-12-21 16:32:14
【Question description】:

I have separate spreadsheets containing data for each month of the year, 12 spreadsheets in total. Each workbook contains 200k-500k rows.

For example:

January

| name  | course  | grade |
|-------|---------|-------|
| dave  | math    | 90    |
| chris | math    | 80    |
| dave  | english | 75    |

February

| name  | course  | grade |
|-------|---------|-------|
| dave  | science | 72    |
| chris | art     | 58    |
| dave  | music   | 62    |

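I am using openpyxl to open each monthly workbook, iterate over every row and every cell, and write the relevant data to a per-person workbook. That is, all rows belonging to Chris go into "Chris.xlsx" and all rows belonging to Dave go into "Dave.xlsx".

The problem I am having is that openpyxl is very slow. I am sure this is because my code is very procedural and does not optimise the iteration and writing.

Any ideas would be greatly appreciated.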

import os

from openpyxl import load_workbook
from openpyxl.utils import column_index_from_string

def appendToWorkbooks():
    print("Appending workbooks")
    je_dump_path = "C:/test/"

    # define list of files in path
    je_dump_files = os.listdir( je_dump_path )

    # define path for resultant files
    results_path = "C:/test/output/"

    max_row = 0
    input_row = 1

    for file in je_dump_files:
        current_row = 1

        # load each workbook in the directory
        load_file = je_dump_path + file
        print("Loading workbook: " + file)
        wb = load_workbook(filename=load_file, read_only=True)
        print("Loaded workbook: " + file)

        # select the worksheet with the name Sheet in each workbook
        ws = wb['Sheet']
        print("Loaded worksheet")

        # iterate through the rows in the currently open workbook
        for row in ws.iter_rows():

            # determine the person this row of data relates to
            person = ws.cell(row=current_row, column=1).value

            # set output workbook to that person
            output_wb_file = results_path + person + ".xlsx"
            output_wb = load_workbook(output_wb_file)
            output_ws = output_wb["Sheet"]

            # increment the current row
            current_row = current_row + 1

            print("Currently on row: " + str(current_row))

            # determine the last row in the current output workbook
            max_row = output_ws.max_row

            # set the output row to the row after the last row in the current output workbook
            output_row = max_row + 1

            for cell in row:
                output_ws.cell(row=output_row, column=column_index_from_string(cell.column)).value = cell.value
            output_wb.save(output_wb_file)

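【Question comments】:

@mike-müller I saw that you posted on a similar question at stackoverflow.com/questions/35823835/…

Tags: python excel openpyxl

【Solution 1】:

Having this line inside the loop is very expensive: max_row = output_ws.max_row

But you really need to provide more details about your files and the performance you are seeing. How big are the individual files? How long do they take to load individually? And so on.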
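
One way to avoid querying max_row for every row is to let the worksheet keep track of its own position with ws.append(), and to open and save each output workbook once instead of once per row. A minimal sketch, assuming the output worksheet is named "Sheet" as in the question; the function name append_rows and its rows argument are hypothetical and only used for illustration:

```python
import os

from openpyxl import Workbook, load_workbook


def append_rows(output_wb_file, rows):
    """Append rows (iterables of cell values) to the 'Sheet' worksheet of one workbook."""
    if os.path.exists(output_wb_file):
        output_wb = load_workbook(output_wb_file)
    else:
        output_wb = Workbook()  # a new workbook starts with a worksheet titled "Sheet"
    output_ws = output_wb["Sheet"]

    for values in rows:
        # append() writes to the row after the last used one,
        # so max_row never has to be looked up inside the loop
        output_ws.append(list(values))

    output_wb.save(output_wb_file)  # save once, not once per row
```

Called as append_rows("C:/test/output/dave.xlsx", [("dave", "math", 90)]), this adds the row below whatever the file already contains.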

【Comments】:

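Thanks for your response. The individual files range from 41MB to 150MB (c. 200k - 900k rows). It takes about 30 seconds to get through 82 rows. Assuming 200k rows across 12 workbooks, this will take about 10 days to complete.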
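
As a quick check of that estimate, using only the figures quoted in this comment and assuming the rate of 82 rows per 30 seconds stays constant:

```python
rows_per_workbook = 200_000
workbooks = 12
seconds_per_82_rows = 30

total_rows = rows_per_workbook * workbooks             # 2,400,000 rows
total_seconds = total_rows / 82 * seconds_per_82_rows  # time at the observed rate
total_days = total_seconds / 86_400                    # 86,400 seconds in a day

print(f"{total_days:.1f} days")  # prints "10.2 days"
```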
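Don't call ws.cell inside the ws.iter_rows loop, especially in read-only mode. It will cause openpyxl to start parsing the file again. You only need to loop over the cells in each row. Don't increment the row yourself; use enumerate to get a counter. Please read the documentation.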
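
A minimal sketch of the iteration this comment describes, assuming the person's name is in the first column as in the question; the function name iter_person_rows is hypothetical:

```python
from openpyxl import load_workbook


def iter_person_rows(path, sheet_name="Sheet"):
    """Yield (row_number, person, values) for every row of one worksheet."""
    wb = load_workbook(filename=path, read_only=True)
    try:
        ws = wb[sheet_name]
        # enumerate() supplies the row counter, so nothing is incremented by hand
        for row_number, row in enumerate(ws.iter_rows(), start=1):
            # take the values from the cells of the row that was just yielded
            # instead of calling ws.cell() again
            values = [cell.value for cell in row]
            person = values[0] if values else None
            yield row_number, person, values
    finally:
        wb.close()  # read-only workbooks keep the file open until closed
```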
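Thanks Charlie. I have tried to read the documentation. Wouldn't it be faster to use write_only mode to write entire rows to the new spreadsheet? I'm not sure how to implement this, though. Your help is much appreciated.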
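
For reference, openpyxl enables write-only mode with Workbook(write_only=True), and whole rows are then added with ws.append(). A rough sketch of how that could look here, assuming one write-only workbook per person can stay open while the monthly files are read; the function name consolidate and its parameters are hypothetical, and this is not a tested solution for files of the size mentioned above:

```python
import os

from openpyxl import Workbook, load_workbook


def consolidate(input_files, results_path):
    """Copy every row of each input workbook into '<person>.xlsx'."""
    outputs = {}  # person -> (write-only Workbook, its worksheet)

    for path in input_files:
        wb = load_workbook(filename=path, read_only=True)
        for row in wb["Sheet"].iter_rows():
            values = [cell.value for cell in row]
            if not values or values[0] is None:
                continue  # skip empty rows
            person = values[0]
            if person not in outputs:
                # write-only workbooks have no default sheet, so create one
                out_wb = Workbook(write_only=True)
                outputs[person] = (out_wb, out_wb.create_sheet("Sheet"))
            outputs[person][1].append(values)  # write the whole row at once
        wb.close()

    # a write-only workbook can be saved only once, so save at the very end
    for person, (out_wb, _) in outputs.items():
        out_wb.save(os.path.join(results_path, f"{person}.xlsx"))
```

Note that a header row, if present, would be treated like any other row and end up in its own file (for example "name.xlsx"), just as in the original code.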
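To be honest, this kind of question would be better on the mailing list. There are a lot of problems with your code, so SO isn't the best platform for this kind of discussion.
Thanks Charlie - will put it on the mailing list.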