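【Question title】: How to loop through a file folder to run script for each item in folder?
【Posted】: 2020-04-01 19:25:51
【Question】:

I have a script that takes a sample from an Excel file and writes that sample out as a CSV. How can I loop through a folder containing several Excel files, so that I don't have to change the file name every time I run the script? I believe I can use glob, but that just seems to merge all the Excel files together.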

import pandas as pd
import glob

root_dir = r"C:\Users\bryanmccormack\Desktop\Test_Folder\*.xlsx"
excel_files = glob.glob(root_dir, recursive=True)

for xls in excel_files:
    df_excel = pd.read_excel(xls)
    df_excel = df_excel.loc[(df_excel['Track Item']=='Y')]

def sample_per(df_excel):
    if len(df_excel) <= 10000:
        return df_excel.sample(frac=0.05)
    elif len(df_excel) >= 15000:
        return df_excel.sample(frac=0.03)
    else:
        return df_excel.sample(frac=0.01)

final = sample_per(xls)

df_excel.loc[df_excel['Retailer Item ID'].isin(final['Retailer Item ID']), 'Track Item'] = 'Audit'

df_excel.to_csv('Testicle.csv',index=False)

【Comments】:

  • What happens when you run it? In what way does it not work?

Tags: python pandas glob


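【Solution 1】:

You are on the right track, but it is pd.concat() that would be responsible for merging your Excel files; glob only lists them. This snippet should help you: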

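import os
import glob
import pandas as pd

# glob pattern (shell-style wildcard, not a regex) for all files with an xlsx extension
root_dir = r"excel/*.xlsx"
# this call of glob only gives xlsx files directly inside the "excel" folder
excel_files = glob.glob(root_dir)

# iterate over the files
for xls in excel_files:
    # read
    df_excel = pd.read_excel(xls)
    # manipulate as you wish here
    df_new = df_excel.sample(frac=0.1)
    # store next to the workbook; splitext swaps only the extension,
    # whereas str.replace would also hit "xlsx" inside a folder name
    df_new.to_csv(os.path.splitext(xls)[0] + ".csv", index=False)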

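Note that you can also pass recursive=True in the glob call (available since Python 3.5). Combined with a `**` component in the pattern, this also gives you the Excel files in subdirectories.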
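A minimal sketch of the recursive variant, using only the standard library. The helper name `find_xlsx` is hypothetical and not part of the answer. The point it illustrates is that `recursive=True` has no effect unless the pattern contains `**`.

```python
import glob
import os


def find_xlsx(root_dir, recursive=False):
    """Return the .xlsx paths under root_dir (hypothetical helper).

    With recursive=True the pattern gets a "**" component, which matches
    zero or more nested folders, so files directly inside root_dir are
    still included.
    """
    if recursive:
        pattern = os.path.join(root_dir, "**", "*.xlsx")
    else:
        pattern = os.path.join(root_dir, "*.xlsx")
    # sorted() only makes the order predictable; glob itself returns
    # paths in arbitrary order
    return sorted(glob.glob(pattern, recursive=recursive))
```

Each returned path can then be passed to pd.read_excel inside the loop, exactly as in the snippet above.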

【Comments】:

    【Solution 2】:

    This returns a list of all files in a directory (including its subdirectories) that you can iterate over:

    from os import walk
    from os.path import join
    
    def retrieve_file_paths(dirName):       #Declare the function to return all file paths of the particular directory
        filepaths = []                      #setup file paths variable
        for root, directories, files in walk(dirName):   #Read all directory, subdirectories and file lists
            for filename in files:
                filepath = join(root, filename)     #Create the full filepath by using os module.
                filepaths.append(filepath)
    
        return filepaths      #return all paths
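Because os.walk yields every file it finds, the list above may also contain non-Excel files. A hedged sketch of a filtered variant follows. The name `retrieve_excel_paths`, the `extensions` parameter, and the decision to skip `~$` lock files (which Excel creates while a workbook is open) are assumptions added for illustration, not part of the answer.

```python
import os


def retrieve_excel_paths(dir_name, extensions=(".xlsx", ".xls")):
    """Walk dir_name and return only paths that look like Excel workbooks."""
    paths = []
    for root, _directories, files in os.walk(dir_name):
        for filename in files:
            # str.endswith accepts a tuple, so several extensions can be
            # checked in one call; lower() makes the check case-insensitive
            is_excel = filename.lower().endswith(extensions)
            # assumption: "~$name.xlsx" files are Excel lock files, not data
            is_lock_file = filename.startswith("~$")
            if is_excel and not is_lock_file:
                paths.append(os.path.join(root, filename))
    return sorted(paths)
```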
    

    In the end, it should look something like this:

    import pandas as pd
    from os import walk
    from os.path import join, splitext

    dirName = "/your/dir"

    def sample_per(df2):
        # the sampling fraction depends on the number of tracked rows
        if len(df2) <= 10000:
            return df2.sample(frac=0.05)
        elif len(df2) >= 15000:
            return df2.sample(frac=0.03)
        else:
            return df2.sample(frac=0.01)


    def retrieve_file_paths(dirName):       #Return the paths of all .xlsx files under the directory
        filepaths = []                      #setup file paths variable
        for root, directories, files in walk(dirName):   #Read all directory, subdirectories and file lists
            for filename in files:
                if filename.lower().endswith(".xlsx"):   #Skip anything that is not an Excel workbook
                    filepaths.append(join(root, filename))     #Create the full filepath

        return filepaths      #return all paths

    def main():
        for filepath in retrieve_file_paths(dirName):
            df = pd.read_excel(filepath)     #filepath is already a str; no r prefix is needed
            df2 = df.loc[df['Track Item'] == 'Y']
            final = sample_per(df2)
            df.loc[df['Retailer Item ID'].isin(final['Retailer Item ID']), 'Track Item'] = 'Audit'
            #One CSV per workbook, so later files do not overwrite earlier ones
            df.to_csv(splitext(filepath)[0] + '.csv', index=False)

    if __name__ == '__main__':
        main()
    

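    【Comments】:

    • This works, but I get an error saying that str has no attribute 'sample', which is strange because df_new is a DataFrame.
    • @Tyrone_Slothrop I'm not very familiar with this library, but have you checked what df.loc[(df['Track Item']=='Y')] returns? Judging by your error, it seems to return a plain string. What happens if you print df2?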
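The error in the first comment is consistent with the question's line `final = sample_per(xls)`: there, `xls` is the file path returned by glob (a str), not the DataFrame. Whether that was the actual cause in the commenter's run is an assumption. A small sketch that reproduces the message without pandas, using a hypothetical path:

```python
def sample_per(df_excel):
    # Same idea as the question's helper: it assumes its argument has a
    # .sample() method, as a pandas DataFrame does.
    return df_excel.sample(frac=0.05)


xls = "Test_Folder/report.xlsx"  # hypothetical path standing in for a glob result

try:
    sample_per(xls)  # passes the path string, as in the question
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

Passing the DataFrame that was read from the file (`df_excel` in the question, `df2` in Solution 2) instead of the path avoids this error.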