How to process multiple files based on the date in their names: answers

【Question title】: How to process multiple files based on the date in their names
【Posted】: 2021-03-18 21:53:39
【Question】:

Suppose I have this structure:

Folder1
       `XX_20201212.txt`
Folder2
       `XX_20201212.txt`
Folder3
       `XX_20201212.txt`

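My current script collects the 3 files, one from each folder, processes them, and produces 1 file. So right now my script does this job for a single date.

Now suppose the structure has changed to: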

Folder1
       `XX_20201201.txt`
       `XX_20201202.txt`
Folder2
       `YY_20201201.txt`
       `YY_20201202.txt`
Folder3
       `ZZ_20201201.txt`
       `ZZ_20201202.txt`
       `ZZ_20201203.txt`

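I want my script to do the same thing now, but for multiple dates. It should check whether the date in a file's name is also in a list called missing_dates, and whether a file for that date is available in every directory. If so, I want to collect those files and process them into 1 file. So suppose 20201201, 20201202 and 20201203 are in missing_dates. The following needs to happen:

The script processes XX_20201201.txt, YY_20201201.txt and ZZ_20201201.txt into 1 file, because that date is in missing_dates and exists in every directory.
The script processes XX_20201202.txt, YY_20201202.txt and ZZ_20201202.txt into 1 file, because that date is in missing_dates and exists in every directory.
The script does not process ZZ_20201203.txt, because that date does not exist in every directory, even though it is in missing_dates.

So, in short: 3 files with the same date (in 3 different directories), and the date is in missing_dates = proceed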
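The rule above can be sketched with plain sets before touching the file system. This is only an illustration: `dates_by_folder` and `dates_to_process` are hypothetical names, the folder keys are placeholders, and the dates are hard-coded from the example rather than read from disk.

```python
# Hypothetical in-memory view of the example: folder -> dates found in its file names.
dates_by_folder = {
    "Folder1": {"20201201", "20201202"},
    "Folder2": {"20201201", "20201202"},
    "Folder3": {"20201201", "20201202", "20201203"},
}
missing_dates = ["20201201", "20201202", "20201203"]


def dates_to_process(dates_by_folder, missing_dates):
    """Return the missing dates that occur in every folder, sorted."""
    # Dates present in all folders (assumes at least one folder is given).
    common = set.intersection(*dates_by_folder.values())
    return sorted(common & set(missing_dates))


print(dates_to_process(dates_by_folder, missing_dates))
# prints ['20201201', '20201202']
```

20201203 drops out because only one folder has it, which matches the third case described above.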

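Note that the code below, which processes the files into 1 file, already works. The underlying problem is that I have to adjust the loop so that it handles more than 1 date every time. I don't know how to do that...

Here is the code that reads the files: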

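import csv
import os
import re

# counter_part, filter_row, location_token and vehicle_loc_dict
# are defined elsewhere in the script.
for root, dirs, files in os.walk(counter_part):
    for file in files:
        # Date part of the name, e.g. '20201201' from 'XX_20201201.txt'
        date_files = re.search(r'_(\d+)\.', file).group(1)
        file_path = os.path.join(root, file)  # full path of the current file
        with open(file_path, 'r') as my_file:
            reader = csv.reader(my_file, delimiter=',')
            next(reader)  # skip the header row
            for row in reader:
                if filter_row(row):
                    vehicle_loc_dict[(row[9], location_token(row))].append(row)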

【Comments】:

Tags: python list loops file if-statement

【Solution 1】:

With the tools in pathlib, this is fairly easy.

Given:

% tree /tmp/test
/tmp/test
├── dir_1
│   ├── XX_20201201.txt
│   └── XX_20201202.txt
├── dir_2
│   ├── YY_20201201.txt
│   └── YY_20201202.txt
└── dir_3
    ├── ZZ_20201201.txt
    ├── ZZ_20201202.txt
    └── ZZ_20201203.txt

3 directories, 7 files

You can do this:

from pathlib import Path

root=Path('/tmp/test')

missing_dates=['20201201']

for fn in (e for e in root.glob('**/*.txt') 
    if e.is_file() and any(d in str(e) for d in missing_dates)):
    print(fn)
    # here do what you mean by 'proceed' with path fn

Prints:

/tmp/test/dir_2/YY_20201201.txt
/tmp/test/dir_3/ZZ_20201201.txt
/tmp/test/dir_1/XX_20201201.txt

Or you can do this:

missing_dates=['20201201', '20201202']

for d in missing_dates:
    print(f"processing {d}")
    for fn in (e for e in root.glob(f"**/*_{d}.txt") if e.is_file()):
        print(fn)
        # here do what you mean by 'proceed'

Prints:

processing 20201201
/tmp/test/dir_2/YY_20201201.txt
/tmp/test/dir_3/ZZ_20201201.txt
/tmp/test/dir_1/XX_20201201.txt
processing 20201202
/tmp/test/dir_2/YY_20201202.txt
/tmp/test/dir_3/ZZ_20201202.txt
/tmp/test/dir_1/XX_20201202.txt

If you are only interested in groups of 3, you can do this:

missing_dates=['20201201', '20201202', '20201203']

for d in missing_dates:
    print(f"processing {d}")
    files = [e for e in root.glob(f"**/*_{d}.txt") if e.is_file()]
    if len(files)==3:
        print(files)

Prints:

processing 20201201
[PosixPath('/tmp/test/dir_2/YY_20201201.txt'), PosixPath('/tmp/test/dir_3/ZZ_20201201.txt'), PosixPath('/tmp/test/dir_1/XX_20201201.txt')]
processing 20201202
[PosixPath('/tmp/test/dir_2/YY_20201202.txt'), PosixPath('/tmp/test/dir_3/ZZ_20201202.txt'), PosixPath('/tmp/test/dir_1/XX_20201202.txt')]
processing 20201203
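Counting to 3 works for this tree, but the question asks for a date that exists in every directory. A minimal sketch of that stricter check follows. It assumes the files sit exactly one level below the root, and `complete_groups` is a hypothetical helper name, not part of pathlib.

```python
from pathlib import Path


def complete_groups(root, missing_dates):
    """Map each date to its files, keeping only dates found in every subdirectory."""
    root = Path(root)
    # Subdirectories that are each expected to hold one file per date.
    subdirs = {p for p in root.iterdir() if p.is_dir()}
    groups = {}
    for d in missing_dates:
        # Non-recursive pattern: files one level below root (an assumption).
        files = sorted(p for p in root.glob(f"*/*_{d}.txt") if p.is_file())
        # Keep the date only if every subdirectory contributed exactly one file.
        if len(files) == len(subdirs) and {p.parent for p in files} == subdirs:
            groups[d] = files
    return groups
```

Called on the tree above with all three dates, this should return entries for 20201201 and 20201202 only. Unlike a fixed count, it would also reject three same-date files that all sit in one directory.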

You can do the same thing with os.walk and glob.glob, but it is just more work...
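For comparison, here is a rough sketch of the os.walk route, closer to the loop in the question. The naming scheme `<prefix>_<YYYYMMDD>.txt` is an assumption taken from the example files, and `files_by_date` and `DATE_RE` are hypothetical names.

```python
import os
import re
from collections import defaultdict

# Assumed naming scheme: <prefix>_<YYYYMMDD>.txt
DATE_RE = re.compile(r"_(\d{8})\.txt$")


def files_by_date(root, missing_dates):
    """Walk root and collect full paths per date, for dates in missing_dates only."""
    wanted = set(missing_dates)
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            match = DATE_RE.search(name)
            if match and match.group(1) in wanted:
                found[match.group(1)].append(os.path.join(dirpath, name))
    return found
```

The result can then be filtered the same way as before, for example by keeping only the dates whose list has 3 entries, and each remaining list passed to the existing processing code.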

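【Discussion】:

Your code works well. I used your last one with len(files)==3. Just one question... how can I be sure it really processes the files of one and the same date into 1 file, so that it does not mix them up?
I'm afraid I don't understand your question. The glob will only select files with that specific date. Any confusion would come from how the files are named and from assumptions about what a recursive glob selects in a particular file tree. For example, if you renamed some file to OLD_DONTUSE_20201202.txt, that file would be selected along with the rest of the files for that date. You can refine the glob, test the files, or make sure the files in the tree are the expected ones. Otherwise I don't know what you mean by "mixing"...
Is a private chat possible, so that I can tell you exactly what I need?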
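One way to "test the files", as the reply about OLD_DONTUSE_20201202.txt suggests, is to validate the whole file name after globbing. The sketch below assumes every valid name is two uppercase letters, an underscore, an 8-digit date and `.txt`. That scheme is inferred from the example files and may not hold for the real data. `strict_matches` and `NAME_RE` are hypothetical names.

```python
import re
from pathlib import Path

# Assumed naming scheme: two uppercase letters, underscore, 8-digit date, ".txt".
NAME_RE = re.compile(r"[A-Z]{2}_(\d{8})\.txt")


def strict_matches(root, date):
    """Return the files for date whose whole name fits the assumed scheme."""
    hits = []
    for p in Path(root).glob(f"**/*_{date}.txt"):
        m = NAME_RE.fullmatch(p.name)
        # fullmatch rejects names with extra prefixes such as OLD_DONTUSE_.
        if p.is_file() and m and m.group(1) == date:
            hits.append(p)
    return sorted(hits)
```

With this check, a stray OLD_DONTUSE_20201202.txt is still found by the glob but is dropped before processing, so each group contains only files that carry the requested date in the expected position.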