【Question Title】: Python: How to process multiple different types of files in a folder?
【Posted】: 2021-09-22 18:38:45
【Question】:

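I have a zip file, A001-C-002.zip, and an .xlsx file, HUBMAP B004 codex antibodies metadata.xlsx, in the same folder. First, I want to read the xlsx file and convert it to a DataFrame. Next, I want to process all of the files inside the zip file.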

from pathlib import Path
import pandas as pd
import zipfile
import os
import sys

path = "./../../"
os.chdir(path)

for filename in os.listdir(os.getcwd()):
    with open(os.path.join(os.getcwd(), filename), 'r') as f:
        with open("HUBMAP B004 codex antibodies metadata.xlsx", 'r') as ab:
            ab_df = pd.read_excel(ab)
            print(f"Antibody metadata column names:\n {ab_df.columns.values}")
        
        # Patient A001
        with zipfile.ZipFile(path / "A001-C-002.zip") as z:
            for filename in z.namelist():
                if not os.path.isdir(filename):
                    for line in z.open(filename):
                        print(line)
                    z.close()  

Traceback

> ---------------------------------------------------------------------------
> UnicodeDecodeError                        Traceback (most recent call last)
> /tmp/ipykernel_3212/4008185006.py in <module>
>       2     with open(os.path.join(os.getcwd(), filename), 'r') as f:
>       3         with open("HUBMAP B004 codex antibodies metadata.xlsx", 'r') as ab:
> ----> 4             ab_df = pd.read_excel(ab)
>       5             print(f"Antibody metadata column names:\n {ab_df.columns.values}")
>       6 
> 
> ~/.local/lib/python3.8/site-packages/pandas/util/_decorators.py in
> wrapper(*args, **kwargs)
>     309                     stacklevel=stacklevel,
>     310                 )
> --> 311             return func(*args, **kwargs)
>     312 
>     313         return wrapper
> 
> ~/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py in
> read_excel(io, sheet_name, header, names, index_col, usecols, squeeze,
> dtype, engine, converters, true_values, false_values, skiprows, nrows,
> na_values, keep_default_na, na_filter, verbose, parse_dates,
> date_parser, thousands, comment, skipfooter, convert_float,
> mangle_dupe_cols, storage_options)
>     362     if not isinstance(io, ExcelFile):
>     363         should_close = True
> --> 364         io = ExcelFile(io, storage_options=storage_options, engine=engine)
>     365     elif engine and engine != io.engine:
>     366         raise ValueError(
> 
> ~/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py in
> __init__(self, path_or_buffer, engine, storage_options)
>    1189                 ext = "xls"
>    1190             else:
> -> 1191                 ext = inspect_excel_format(
>    1192                     content_or_path=path_or_buffer, storage_options=storage_options
>    1193                 )
> 
> ~/.local/lib/python3.8/site-packages/pandas/io/excel/_base.py in
> inspect_excel_format(content_or_path, storage_options)
>    1073         stream = handle.handle
>    1074         stream.seek(0)
> -> 1075         buf = stream.read(PEEK_SIZE)
>    1076         if buf is None:
>    1077             raise ValueError("stream is empty")
> 
> /usr/lib/python3.8/codecs.py in decode(self, input, final)
>     320         # decode input (taking the buffer into account)
>     321         data = self.buffer + input
> --> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
>     323         # keep undecoded input until the next call
>     324         self.buffer = data[consumed:]
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position
> 15: invalid start byte

【Comments】:

    Tags: python pandas


    【Solution 1】:

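    If you ultimately want the Excel file as a DataFrame, it is best to let pandas read it directly. I found a solution to your problem; this is the post you need for reading xlsx files:

    Problem in reading Excel Files

    An .xlsx file is binary, so opening it in text mode ('r') and handing that file object to pd.read_excel makes Python try to decode it as UTF-8, which raises the UnicodeDecodeError in your traceback. In that post, the answer basically says to pass the file name and use this instead: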

    df = pd.read_excel("HUBMAP B004 codex antibodies metadata.xlsx")
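
The line above covers only the Excel half of the question. For the second half, processing every file inside the zip archive, the following is a minimal sketch using only the standard library. The helper name `iter_zip_lines` is hypothetical (it is not from the question or the linked post), and it assumes you want the raw lines of each archive member as bytes.

```python
import zipfile


def iter_zip_lines(zip_path):
    """Yield (member_name, raw_line) for every file inside the archive.

    `iter_zip_lines` is a hypothetical helper name used for illustration.
    """
    # The context manager closes the archive, so no explicit z.close() is needed.
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            # Directory entries exist only inside the archive, so
            # os.path.isdir() cannot detect them; ZipInfo.is_dir() can.
            if info.is_dir():
                continue
            # Archive members are always opened in binary mode.
            with z.open(info) as member:
                for line in member:
                    yield info.filename, line
```

With the folder from the question, it might be used as `for name, line in iter_zip_lines(Path("./../../") / "A001-C-002.zip"): print(name, line)`. Note that the `/` operator needs a `pathlib.Path` on the left; in the question, `path` is a plain string, so `path / "A001-C-002.zip"` would fail with a TypeError once the Excel error is fixed.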
    

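    【Comments】:

    • By the way, if you state your Python version when asking about something that uses the standard library, you will get an answer sooner. Hope this helps.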