How to process files based on their date in Python? Answers

【Question title】: How to process files based on their date in Python?
【Posted】: 2021-09-27 11:10:11
【Question】:

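I have two kinds of files, XML files and txt files. Each file name contains a date. If the date of an XML file matches the date of a txt file, I want to open the txt file, do some processing, and write the output to a list. After that I want to modify the XML file. Several XML files can share the same date, but each txt file is unique, so more than one XML file can be linked to a single txt file.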

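Now I have a problem. My to_csv list contains the data for both 20200907 and 20201025. I don't want it to work like that. I want to_csv to handle one file (and therefore one date) at a time.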

import csv
import os
import re

output_xml = r"c:\desktop\energy\XML_Output"
output_txt = r"c:\desktop\energy\TXT_Output"

xml_name = os.listdir(output_xml)
txt_name = os.listdir(output_txt)
txt_name = [x.replace('-', '') for x in txt_name]  # remove the - in the filenames

# Extract the date from the xml and txt file names.
xml_dates = []
for file in xml_name:
    find = re.search(r"_(.\d+)-", file).group(1)
    xml_dates.append(find)

txt_dates = []
for file in txt_name:
    find = re.search(r"MM(.+?)AB", file).group(1)
    txt_dates.append(find)

# THIS IS SOME REPRODUCIBLE OUTPUT OF WHAT THE SNIPPET ABOVE RETURNS.
xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

to_csv = []

for date_xml in xml_dates:
    for date_txt in txt_dates:
        if date_xml == date_txt:

              match_txt = [s for s in txt_name if date_txt in s]  # matching txt file  
              match_xml = [s for s in xml_name if date_xml in s]  # matching xml file

              match_txt_temp = match_txt[0]
              match_txt_score = [match_txt_temp[:6]+'-'+match_txt_temp[6:8]+'-'+match_txt_temp[8:10]+'-'+match_txt_temp[10:12]+match_txt_temp[12:]]

              with open(output_txt + "/" + match_txt_score[0], "r") as outer:
                reader = csv.reader(outer, delimiter="\t")  

                for row in reader:
                    read = [row for row in reader if row]
                    for row in read:
  
                        energy_level = row[20]

                        if energy_level > 250:
                            to_csv.append(row)
                            
print(to_csv)

Current output:

[['1', '2', '3', '20200907', '4', '5'], 
['1', '2', '3', '20200907', '4', '5'], 
['1', '2', '3', '20200907', '4', '5'], 
['1', '2', '3', '20201025', '4', '5'], 
['1', '2', '3', '20201025', '4', '5']]

Desired output:

[[['1', '2', '3', '20200907', '4', '5'], 
['1', '2', '3', '20200907', '4', '5'], 
['1', '2', '3', '20200907', '4', '5']], 
[['1', '2', '3', '20201025', '4', '5'], 
['1', '2', '3', '20201025', '4', '5']]]

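【Comments】:

@Iguananaut No, they are not the same. In the desired output I have lists nested inside the outer list, one per date. In the current output everything is in a single list.
Yes, I see.
@Iguananaut If you have another idea for solving this, please let me know. I don't like nested lists like this, but I couldn't find any other solution.
Based on the code you've provided, you can simplify this a lot by getting rid of the double for loop and the if statement. If you have the two lists xml_dates and txt_dates, you can handle the matching dates by taking the set intersection of the two, matching_dates = set(xml_dates).intersection(txt_dates), and then looping over matching_dates. I think you also have some other bugs, such as the double loop over reader (inside a for loop over reader you have [row for row in reader if row], which makes no sense).
@Iguananaut Can you show that in an answer? That would also give me the chance to accept your answer.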
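
The set-intersection idea from the comment above could look like the following minimal sketch. It only uses the sample lists from the question; sorting the result is an addition made here, since sets have no defined order.

```python
# Dates taken from the question's reproducible sample.
xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

# Dates that occur in both lists; duplicates in xml_dates collapse automatically.
matching_dates = set(xml_dates).intersection(txt_dates)

# Sets are unordered, so sort them; YYYYMMDD strings sort chronologically.
for date in sorted(matching_dates):
    print(date)
```

Each matching date is visited exactly once, no matter how many XML files share it.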

Tags: python list for-loop

【Solution 1】:

You said there is only one txt file per date and that you only want to process the XML files linked to a txt file. That means a single loop over txt_dates is enough:

...
for date_txt in txt_dates:
    date_xml = date_txt

    match_txt = [s for s in txt_name if date_txt in s]  # the matching txt file  
    match_xml = [s for s in xml_name if date_xml in s]  # possible matching xml files
    if len(match_xml) == 0:   # no matching xml files
        continue

    match_txt_temp = match_txt[0]
    match_txt_score = [match_txt_temp[:6]+'-'+match_txt_temp[6:8]+'-'
                       +match_txt_temp[8:10]+'-'+match_txt_temp[10:12]
                       +match_txt_temp[12:]]

    # prepare a new list for that date
    curr = list()

    with open(output_txt + "/" + match_txt_score[0], "r") as outer:
        reader = csv.reader(outer, delimiter="\t")  

        next(reader, None)  # the original nested loops skipped the first row
        for row in reader:
            if not row:     # ignore empty lines
                continue
            energy_level = float(row[20])  # csv fields are strings
            if energy_level > 250:
                curr.append(row)

    if len(curr) > 0:    # if the current date list is not empty append it
        to_csv.append(curr)
                        
print(to_csv)

Note: since you did not provide a reproducible example, I could not test the code above, so it may contain typos...
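
One detail worth checking when adapting this: csv.reader yields every field as a string, so a numeric test such as the energy_level check needs an explicit conversion. A minimal sketch, using a made-up row in which index 20 holds the energy level as text:

```python
# Hypothetical row: 20 filler fields, then the energy level at index 20.
row = ["x"] * 20 + ["300.5"]

try:
    row[20] > 250  # str vs int cannot be ordered in Python 3
except TypeError as exc:
    print("comparison failed:", exc)

energy_level = float(row[20])  # convert before comparing
is_high = energy_level > 250
print(is_high)
```

If the column can contain non-numeric text (a header, for instance), float() raises ValueError, so that case would need its own handling.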

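【Comments】:

【Solution 2】:

You can append the rows to a dictionary instead of a list, which lets you separate the rows by a key that represents the date. After parsing the files, you can build whatever combination of lists you want from the dictionary.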

xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

to_csv = {'20200907': [], '20201025':[]}

for date_xml in xml_dates:
    for date_txt in txt_dates:
        if date_xml == date_txt:
             with open(output_t2m + "/" + match_t2m_score[0], "r") as outer:
                reader = csv.reader(outer, delimiter="\t")  

                for row in reader:
                    read = [row for row in reader if row]
                    for row in read:
  
                        energy_level = row[20]

                        if energy_level > 250:
                            to_csv[date_txt].append(row)

final_csv = [to_csv['20200907'], to_csv['20201025']]

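【Comments】:

This is very inflexible. The dates of the files keep changing, so I can't hard-code 20200907 and 20201025.
True. I would amend this answer to start from an empty dictionary, or to use a defaultdict, as in to_csv = defaultdict(list). Still, the answer above has the same bugs as your code.
There is also no need to use a dict for this. It can be done more simply, but it's not clear how to provide a correct answer.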
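
The defaultdict(list) suggestion from the comments might be sketched as follows. The (date, row) pairs are invented stand-ins for rows that would come out of the txt files; nothing here is hard-coded to particular dates.

```python
from collections import defaultdict

# Hypothetical (date, row) pairs standing in for rows read from the txt files.
filtered = [
    ("20200907", ["1", "2", "3"]),
    ("20200907", ["4", "5", "6"]),
    ("20201025", ["7", "8", "9"]),
]

to_csv = defaultdict(list)  # a missing key starts out as an empty list
for date, row in filtered:
    to_csv[date].append(row)

# One inner list per date, in date order, without naming any date explicitly.
final_csv = [to_csv[date] for date in sorted(to_csv)]
print(final_csv)
```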

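【Solution 3】:

Based on your update and this comment, I can tell you that the following is equivalent to what you are trying to do, although it does not seem very useful, since you are just duplicating the CSV content once for every XML file with a matching date: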

import csv
import os
import re
from collections import defaultdict

# Count how many XML files exist for each date
xml_file_re = re.compile(r'_(.\d+)-')
xml_dates = defaultdict(int)
for filename in os.listdir(output_xml):
    if m := xml_file_re.search(filename):
        xml_dates[m.group(1)] += 1

txt_file_re = re.compile(r'MM(.+?)AB')
csv_by_date = []

for filename in os.listdir(output_txt):
    if not (m := txt_file_re.search(filename)):
        continue

    date = m.group(1)

    if date not in xml_dates:
        continue

    with open(os.path.join(output_txt, filename)) as fobj:
        reader = csv.reader(fobj, delimiter='\t')
        # Take only rows with energy_level > 250 (csv fields are strings)
        rows = [row for row in reader if row and float(row[20]) > 250]
        # Make a list of copies of the rows, one per matching XML file
        # Here we make duplicates of the rows just to be on the safe side...
        copies = [rows[:] for _ in range(xml_dates[date])]
        csv_by_date.append(copies)

Again, this is equivalent to what you appear to want, but I'm not sure what it accomplishes for you (especially where the XML files come in...)

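【Comments】:

I'm not just copying the contents of the csv file for every xml file with a matching date. I'm checking whether a value in the row (energy_level) is above 250. After that I write the output to a list.
A single row contains multiple values. In your example code, it checks whether a certain column (row[20] in your code, which I guess is "energy_level") is above some value. Right now that appears to be hard-coded, so the result for the same CSV file is always the same. You're just duplicating it.
Only the rows with energy_level > 250 are put into to_csv. From my experience I can tell you that maybe 10% of a file's data ends up in to_csv. To answer your other question, I only re-read the txt file when it matches the same xml date. If I have multiple xml dates that match a txt date then yes, I re-read it. I don't see the problem with that.
By the way, the := operator requires Python 3.8 or later. I have 3.7.
No big deal, you just need to replace it with m = ...; if m: ...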
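
For Python 3.7, the rewrite described in the last comment might look like this. The regular expression is the one from the answer; the file names are invented for illustration.

```python
import re

txt_file_re = re.compile(r'MM(.+?)AB')

# Hypothetical file names; only the first one matches the pattern.
filenames = ["MM20200907AB.txt", "notes.txt"]

dates = []
for filename in filenames:
    m = txt_file_re.search(filename)  # plain assignment instead of :=
    if not m:
        continue
    dates.append(m.group(1))

print(dates)
```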

【Solution 4】:

If you loop over txt_dates first, I think you can get the output you want. You can see that the dates are grouped by txt_date, so you only get one date at a time.

xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

# sample csv data for each xml_date since we don't have the actual file contents
xml_rows = {
    "20200907": ["a,b,c", "1,2,3", "10,11,12"],
    "20201025": ["a,c,b", "7,8,9"]
}

to_csv = []

for date_txt in txt_dates:
    # Filter xml_dates for those that match the current date_txt
    xml_matches = [xd for xd in xml_dates if xd == date_txt]
    print("txt date:", date_txt)
    for date_xml in xml_matches:
        print("    ", date_xml, end=" ")
        # simulate rows in a csv file
        file_rows = [row.split(",") + [date_xml] for row in xml_rows[date_xml]]
        to_csv.append(file_rows)
    print()

print(to_csv)

Result:

txt date: 20200907
     20200907
txt date: 20201025
     20201025      20201025      20201025      20201025
[[['a', 'b', 'c', '20200907'],
  ['1', '2', '3', '20200907'],
  ['10', '11', '12', '20200907']],
 [['a', 'c', 'b', '20201025'],
  ['7', '8', '9', '20201025']],
 [['a', 'c', 'b', '20201025'],
  ['7', '8', '9', '20201025']],
 [['a', 'c', 'b', '20201025'],
  ['7', '8', '9', '20201025']],
 [['a', 'c', 'b', '20201025'],
  ['7', '8', '9', '20201025']]]

Edit: explanation of the file_rows line

file_rows = [row.split(",") + [date_xml] for row in xml_rows[date_xml]]

This is a list comprehension. The idea is to simulate processing a csv file. xml_rows[date_xml] is a list, which could be created with something like

xml_rows = {}
date_xml = "2021-09-27"
with open("data.csv") as fd:
   xml_rows[date_xml] = [line.strip() for line in fd]

where data.csv contains

a,b,c
1,2,3
10,11,12

Note that there are more robust ways to process csv files, for example using the csv library.
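
As a rough illustration of the csv library route, the sketch below parses the same three lines with csv.reader. The data is held in an in-memory io.StringIO object here, an assumption made only to keep the example self-contained.

```python
import csv
import io

# Same content as the data.csv shown above, held in memory for the example.
data = io.StringIO("a,b,c\n1,2,3\n10,11,12\n")

rows = list(csv.reader(data))
print(rows)

# Unlike str.split(","), csv.reader keeps a quoted field containing a comma intact.
quoted = next(csv.reader(['"x,y",z']))
print(quoted)
```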

Given the xml_rows[date_xml] list, the script then splits each row on commas with row.split(","). If we ran

for row in xml_rows[date_xml]:
    print(row, "=>", row.split(","))

the output would be

a,b,c => ['a', 'b', 'c']
1,2,3 => ['1', '2', '3']
10,11,12 => ['10', '11', '12']

To make it clear where the rows come from, I appended the date with row.split(",") + [date_xml], so that each list contains the date being processed. So

for row in xml_rows[date_xml]:
    print(row, "=>", row.split(",") + ["2021-09-27"])

would produce

a,b,c => ['a', 'b', 'c', '2021-09-27']
1,2,3 => ['1', '2', '3', '2021-09-27']
10,11,12 => ['10', '11', '12', '2021-09-27']

Since I have been simulating the input data anyway, this may all be clearer as

xml_dates = ['20200907', '20200908', '20201025', '20201025', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

# sample csv data for each xml_date since we don't have the actual file contents
xml_rows = {
    "20200907": [["a", "b", "c"], ["1", "2", "3"], ["10", "11", "12"]],
    "20201025": [["a", "c", "b"], ["7", "8", "9"]]
}

to_csv = []

for date_txt in txt_dates:
    # Filter xml_dates for those that match the current date_txt
    xml_matches = [xd for xd in xml_dates if xd == date_txt]
    print("txt date:", date_txt)
    for date_xml in xml_matches:
        print("    ", date_xml, end=" ")
        # Append the date currently being processed to the row
        file_rows = [row + [date_xml] for row in xml_rows[date_xml]]
        to_csv.append(file_rows)
    print()

print(to_csv)

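【Comments】:

I don't understand your file_rows
I understand what you are trying to do, but since I don't need to simulate any csv files, I'm confused. What do I need to use in place of file_rows = [row + [date_xml] for row in xml_rows[date_xml]]?
You can replace the file_rows = ... line with whatever processing you need to do on the matching file. It looks like you are opening a file and filtering rows based on some value. If you put that in a function, say process_csv_file(), then you can replace my file_rows = ... with a call to process_csv_file(). Since you are using csv.reader, I inferred that your input is a csv file.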
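
One possible shape for the process_csv_file() function mentioned in the last comment is sketched below. Only the function name comes from the thread: its signature, the temporary files, the file naming (date + ".txt"), and the use of column 1 in place of the question's row[20] are all assumptions made so the example can run on its own.

```python
import csv
import os
import tempfile

def process_csv_file(path, column=20, threshold=250):
    """Return the non-empty rows whose numeric value in `column` exceeds `threshold`."""
    with open(path, newline="") as fobj:
        reader = csv.reader(fobj, delimiter="\t")
        # csv fields are strings, so convert before comparing.
        return [row for row in reader if row and float(row[column]) > threshold]

xml_dates = ['20200907', '20200908', '20201025', '20201025']
txt_dates = ['20200907', '20201025']

to_csv = []
with tempfile.TemporaryDirectory() as folder:
    # Hypothetical txt files, one per date; column 1 holds the energy level.
    contents = {"20200907": [["a", "300"], ["b", "100"]], "20201025": [["c", "251"]]}
    for date, rows in contents.items():
        with open(os.path.join(folder, date + ".txt"), "w", newline="") as fobj:
            csv.writer(fobj, delimiter="\t").writerows(rows)

    for date_txt in txt_dates:
        if date_txt not in xml_dates:  # no XML file linked to this txt file
            continue
        path = os.path.join(folder, date_txt + ".txt")
        to_csv.append(process_csv_file(path, column=1))

print(to_csv)
```

This variant appends one inner list per txt date, which matches the nested output the question asks for. To get one copy per matching XML file, as in the answer's loop, the call would go inside the for date_xml in xml_matches loop instead.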