Openpyxl, Pandas or both

【Question title】: Openpyxl, Pandas or both
【Posted】: 2021-11-20 15:30:47
【Question】:

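I am trying to process an Excel file so that I can later use each row and each column for specific operations.

My problem is the following:

Using Openpyxl made it easier for me to load the file and iterate over the rows: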

    from openpyxl import load_workbook

    # Read the Excel file
    path = r'Datasets/Chapter 1/Table B1.1.xlsx'
    wb = load_workbook(path)  # load the workbook
    ws = wb.active  # grab the active worksheet

    # First row: the headers
    for h in ws.iter_rows(max_row=1, values_only=True):
        header = list(h)

    # Second row: the sub-headers
    for sh in ws.iter_rows(min_row=2, max_row=2, values_only=True):
        sub_header = list(sh)

    # Remove the None values
    header = list(filter(None, header))
    sub_header = list(filter(None, sub_header))

    # Build a list of all rows, starting from the third row
    # (the first two rows are the headers)
    row_list = []
    for row in ws.iter_rows(min_row=3):
        row = [cell.value for cell in row]  # list of cell values for this row
        row = list(filter(None, row))  # remove the None values from the row
        row_list.append(row)

    # Build a list of all columns, again skipping the two header rows
    colm = []
    for col in ws.iter_cols(min_row=3, min_col=1):
        col = [cell.value for cell in col]  # list of cell values for this column
        col = list(filter(None, col))  # remove the None values from the column
        colm.append(col)

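At the same time, as far as I can tell from the documentation, I cannot visualize the data or perform operations directly on the rows or columns.

Pandas is more effective for operating directly on rows and columns, but I have read that iterating over a DataFrame to get the rows as lists is not recommended, and even doing it with df.iloc[2:] would not give me the same result (each row saved in its own list), since the header is always there. Unlike with Openpyxl, though, operating directly on columns by name, which I need to do, is much easier with something like df[col1]-df[col2]. (Simply putting all the column values in a list will not do it for me.)

So my question is: is there a solution that does what I want with only one of the two, or is using both not that bad, keeping in mind that I would have to load the Excel file twice?

Thanks in advance!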

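【Comments】:

It is not clear what your code is trying to do. Are you using it to collect the "non-falsy" (or not None) values of the spreadsheet into row_list and colm? If you are worried that your two for loops would be slower in pandas, you can set engine='openpyxl' in the pd.read_excel(... function.
It would be best to provide some sample input that we can copy and paste into Excel, the expected output, and any performance/timing metrics you may have.
Please ask questions about specific problems. Unfortunately, speculation about libraries is a waste of time for everyone involved.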
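
The engine='openpyxl' suggestion from the first comment could look roughly like the sketch below. The workbook is a made-up, in-memory stand-in for the asker's file, with a single header row and hypothetical column names col1 and col2; the real file has two header rows and would need extra handling.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Build a tiny stand-in workbook in memory (hypothetical data, not the asker's file).
wb = Workbook()
ws = wb.active
ws.append(["col1", "col2"])
ws.append([10, 3])
ws.append([7, 2])
buffer = BytesIO()
wb.save(buffer)
buffer.seek(0)

# A single call loads the sheet; pandas uses openpyxl to parse it.
df = pd.read_excel(buffer, engine="openpyxl")

# Column arithmetic by name, and the rows as plain lists without the header.
diff = df["col1"] - df["col2"]
rows = df.values.tolist()
print(diff.tolist())
print(rows)
```

With a file on disk, the path would be passed to pd.read_excel in place of the buffer.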

Tags: python pandas dataframe performance openpyxl

【Solution 1】:

There is nothing wrong with reading the Excel file once with openpyxl and then loading the rows into pandas:

    pandas.DataFrame(row_list, columns=header)
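
As a minimal sketch of that idea, assuming a sheet with a single header row: the in-memory workbook below is a hypothetical stand-in for load_workbook(path), and the column names are invented. Empty cells are kept as None so that every value stays under its own column.

```python
import pandas as pd
from openpyxl import Workbook

# Stand-in for load_workbook(path): a small in-memory sheet (hypothetical data).
wb = Workbook()
ws = wb.active
ws.append(["name", "col1", "col2"])
ws.append(["x", 10, 3])
ws.append(["y", None, 2])  # an empty cell comes back as None

# Read every row exactly once, as plain tuples of cell values.
rows = list(ws.iter_rows(values_only=True))
header = list(rows[0])
row_list = [list(r) for r in rows[1:]]

# The same rows now serve both purposes: plain lists and a DataFrame.
df = pd.DataFrame(row_list, columns=header)
df["diff"] = df["col1"] - df["col2"]
print(row_list)
print(df)
```

This way the file is parsed only once: row_list remains available for list-based work, and df supports operations by column name.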

You are right that iterating over a DataFrame by index is very slow, but you have other options: apply(), iterrows(), itertuples()

Link: Different ways to iterate over rows in pandas DataFrame
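
For illustration, here is a short sketch of those options on a made-up two-column DataFrame (the column names and values are assumptions). Where the operation can be written as column arithmetic, that is usually the simplest route, and no row loop is needed.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 7], "col2": [3, 2]})  # hypothetical data

# Vectorized column arithmetic: no explicit row loop.
df["diff"] = df["col1"] - df["col2"]

# itertuples(): each row as a lightweight namedtuple.
as_tuples = [tuple(row) for row in df.itertuples(index=False)]

# apply(axis=1): call a function once per row.
row_sums = df.apply(lambda row: row["col1"] + row["col2"], axis=1).tolist()

# iterrows(): (index, Series) pairs; generally slower than itertuples().
as_lists = [row.tolist() for _, row in df.iterrows()]

print(as_tuples)
print(row_sums)
print(as_lists)
```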

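I would also like to point out that your code may not be doing what you intend.

list(filter(None, header)) filters out not only None but every falsy value, such as 0 or "".
Filtering like this also shifts the columns. For example, say you have a row [1, None, 3] and columns ['a', 'b', 'c']. After filtering out None you get [1, 3], which will line up with columns 'a' and 'b'.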
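
Both pitfalls can be reproduced in plain Python. The row with 0 and "" is an extra, made-up example; the other values come from the example above.

```python
header = ["a", "b", "c"]
row = [1, None, 3]
row_with_zero = [0, "", 5]  # hypothetical row containing falsy values

# filter(None, ...) drops every falsy value, not only None.
shifted = list(filter(None, row))         # [1, 3]
lost = list(filter(None, row_with_zero))  # [5]

# Pairing the filtered row with the header puts 3 under "b" instead of "c".
misaligned = dict(zip(header, shifted))

# Dropping only None keeps 0 and ""; positions can still shift when a None is removed.
only_none_removed = [v for v in row_with_zero if v is not None]

# Keeping None as a placeholder preserves the column alignment.
aligned = dict(zip(header, row))

print(misaligned)
print(aligned)
```

If the rows are going into a DataFrame anyway, leaving the None values in place is probably the safer choice, since pandas treats them as missing values.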

【Comments】: