Efficiently extract sheet names and column names from a large .xlsx with Python 3

【Question title】: Efficiently extract sheet names, and column names from large .xlsx with Python3
【Posted】: 2019-01-17 20:53:47
【Question】:

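What Python 3 options are there to efficiently (in both performance and memory) extract the sheet names and, for a given sheet, the column names from a very large .xlsx file?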

I have tried using pandas:

For sheet names, using pd.ExcelFile:

    xl = pd.ExcelFile(filename)
    sheet_names = xl.sheet_names  # list of sheet names in workbook order

For column names, using pd.ExcelFile:

    xl = pd.ExcelFile(filename)
    df = xl.parse(sheetname, nrows=2, **kwargs)
    df.columns

For column names, using pd.read_excel with nrows (pandas 0.23 and later):

    df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
    df.columns

However, both pd.ExcelFile and pd.read_excel seem to read the entire .xlsx into memory, and are therefore slow.

Thanks a lot!

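【Comments】:

I don't have anything handy to test with, but how does dfs = pd.read_excel(filename, sheet_name=None, nrows=0) perform? You should get a dictionary with the sheet names as keys and empty DataFrames as values...

Tags: excel python-3.x performance pandas memory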

【Solution 1】:

I think this will help:

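    from openpyxl import load_workbook

    # read_only=True reads rows lazily instead of loading the whole workbook
    workbook = load_workbook(filename, read_only=True)

    data = {}  # maps each sheet title to a tuple of its column headings

    for sheet in workbook.worksheets:
        # only the first row is requested; values_only=True yields plain cell values
        for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
            data[sheet.title] = value

    workbook.close()  # a read-only workbook keeps the file open until closed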

【Discussion】:

【Solution 2】:

This program lists all the sheets in an Excel workbook. It uses pandas.

    import pandas as pd

    with pd.ExcelFile('yourfile.xlsx') as xlsx:
        sh = xlsx.sheet_names
    print("This workbook has the following sheets:", sh)

【Discussion】:

【Solution 3】:

This is the simplest way I can share with you:

    # open the workbook and print its sheet names
    import pandas as pd

    my_sheets = pd.ExcelFile('sheet_filename.xlsx')
    print(my_sheets.sheet_names)

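【Discussion】:

【Solution 4】:

According to this SO question, reading Excel files in chunks is not supported (see this issue on github), and using nrows always reads the whole file into memory first.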

Possible solutions:

Convert the sheet to CSV, then read it in chunks.
Use something other than pandas. For a list of alternative libraries, see this page.
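
As a rough illustration of both options, here is a minimal sketch that uses only the standard library. It relies on the fact that an .xlsx file is a zip archive whose xl/workbook.xml part lists the sheets, so the sheet names can be read without touching the cell data. The function names sheet_names_from_xlsx and csv_header are made up for this example, and the namespace constant assumes a workbook saved in the usual (transitional) format; a "Strict Open XML" file uses a different namespace and would return an empty list here.

    import csv
    import zipfile
    import xml.etree.ElementTree as ET

    # Namespace of the main SpreadsheetML parts (assumption: transitional format).
    MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


    def sheet_names_from_xlsx(path):
        """Return sheet names by parsing only xl/workbook.xml from the archive."""
        with zipfile.ZipFile(path) as archive:
            with archive.open("xl/workbook.xml") as part:
                root = ET.parse(part).getroot()
        # Each <sheet> element carries the display name in its "name" attribute.
        return [sheet.get("name") for sheet in root.iter(MAIN_NS + "sheet")]


    def csv_header(path, encoding="utf-8"):
        """Return the first row of a CSV export without reading the rest."""
        with open(path, newline="", encoding=encoding) as handle:
            # next() pulls a single row; the default covers an empty file.
            return next(csv.reader(handle), [])

Calling sheet_names_from_xlsx('yourfile.xlsx') only decompresses one small XML part, so it should stay cheap even for a very large workbook. csv_header assumes the sheet has already been exported to CSV by some other tool.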

【Discussion】: