How to filter a file based on datetime? Answers

【Question title】: How to filter a file based on datetime?
【Posted】: 2020-10-15 06:41:40
【Question description】:

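I have a file, and I want to append some of its rows to an empty list if 2 conditions are met:

I only take the rows whose country_code is also present in my_countrycodes, AND
for each country_code, I take the max datetime, provided that datetime is earlier than my_time1.

Note that the country_code of each row in the file is at index [1], and the datetime of each row is a variable called date_time4.

Here is my code: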

import csv
import datetime

my_time = '2020-09-06 16:00:45'
my_time1 = datetime.datetime.strptime(my_time, '%Y-%m-%d %H:%M:%S')

my_countrycodes = ['555', '256', '1000']

all_row_times = [] #<--- this is the list where we will append the datetime values of the file
new_list = [] #<--- this is the final list where we will append our results
    
with open(root, 'r') as out:
    reader = csv.reader(out, delimiter = '\t')
    for row in reader:  
        # print(row)
        date_time1 = row[-2] + row[-1] #<--- concatenate date + time
        date_time2 = datetime.datetime.strptime(date_time1, '%d-%m-%Y%H:%M:%S') #<--- make a datetime object of the string
        date_time3 = datetime.datetime.strftime(date_time2, '%Y-%m-%d %H:%M:%S') #<--- turn the datetime object  back to a string
        date_time4 = datetime.datetime.strptime(date_time3, '%Y-%m-%d %H:%M:%S') #<--- turn the string object  back to a datetime object
        all_row_times.append(date_time4) #<--- put all the datetime objects into a list.
        
        if any(country_code in row[1] for country_code in my_countrycodes) and date_time4 == max(dt for dt in all_row_times if dt <  my_time1): 
            new_list.append(row) #<-- append the rows with the same country_code in my_countrycodes and the latest time if that time is < my_time1
                
print(new_list)

The file looks like this: [image]

This is the output of new_list:

[['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'], 
['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'], 
['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'], 
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '14:51:45'], 
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']]

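As you can see, the code picks up the rows with country_codes 555, 256 and 1000, and it only picks up rows that are earlier than my_time1. So this part works perfectly. However, the 1000 rows have 2 different datetimes, and I don't understand why the code doesn't take only the MAX datetime.

This is the expected output of new_list: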

[['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'], 
['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'], 
['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],  
['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']]

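【Question comments】:

Tags: python list file datetime

【Solution 1】:

Actually, the code does take only the MAX datetime, but it is the maximum seen so far. In the for loop, the 14:51:45 row comes first. Your code compares it with the values read up to that point, and since no later value has appeared yet, it is treated as the maximum.

In the next iteration another row with the same country code appears, and because its time is greater than all the others, that row is appended as well. I guess this is what you are missing.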
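
To make the explanation above concrete, here is a minimal sketch of the same "compare against the maximum so far" check, reduced to the two times from the 1000 rows. The variable names (`seen`, `accepted`) are made up for this illustration and do not come from the question.

```python
import datetime

FMT = "%H:%M:%S"
# The two times of the '1000' rows, in the order they appear in the file.
times = [datetime.datetime.strptime(t, FMT) for t in ("14:51:45", "15:59:45")]

seen = []      # plays the role of all_row_times
accepted = []  # plays the role of new_list
for t in times:
    seen.append(t)
    # Same shape as the check in the question: is this the maximum so far?
    if t == max(seen):
        accepted.append(t.strftime(FMT))

print(accepted)  # ['14:51:45', '15:59:45']
```

Because each time is the largest value at the moment it is read, both pass the check. The true maximum is only known once every row has been seen.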

You can try something like this.

import csv
import datetime

my_time = datetime.datetime.strptime('2020-09-06 16:00:45', '%Y-%m-%d %H:%M:%S')
my_countrycodes = ['555', '256', '1000']

country_code_max_date_rel = {}  # country code -> latest datetime seen for that code
matched_rows = []  # every row whose country code matches
with open(root, 'r') as out:  # root is the path to the tab-separated file
    reader = csv.reader(out, delimiter='\t')
    for row in reader:
        date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
        if any(country_code in row[1] for country_code in my_countrycodes):
            matched_rows.append(row)
            # keep the latest datetime seen so far for this country code
            current_max = country_code_max_date_rel.get(row[1])
            if current_max is None or current_max < date_time:
                country_code_max_date_rel[row[1]] = date_time

At this point you have the maximum for each country code, as well as the list of matching rows. You can then filter again like this:

new_list = []
for row in matched_rows:
    country_code = row[1]
    date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
    if date_time == country_code_max_date_rel[country_code]:
        if date_time < my_time:
            new_list.append(row)

new_list:

[['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
 ['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'],
 ['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],
 ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45']]

This code is not very pretty, but I think it will help you update yours.
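
One thing to watch with the two-pass approach above: the maximum is computed over all rows of a country code, and the cutoff is applied afterwards. If a country code also had a row at or after the cutoff, that later row would become the maximum and nothing would be kept for that code. The sketch below applies the cutoff first and then takes the maximum. It is an illustration under assumptions, not the answerer's code: the function name `latest_rows_before` is invented here, the 17:00:00 row is hypothetical, and the country code is matched exactly rather than by substring.

```python
import datetime

ROW_FORMAT = "%d-%m-%Y%H:%M:%S"  # date column + time column, concatenated


def latest_rows_before(rows, country_codes, cutoff):
    """Keep rows whose code is wanted and whose datetime is the latest
    one before `cutoff` for that code (ties are all kept)."""
    wanted = set(country_codes)
    candidates = []  # (datetime, row) pairs that pass both filters
    latest = {}      # country code -> latest datetime before the cutoff
    for row in rows:
        if row[1] not in wanted:
            continue
        stamp = datetime.datetime.strptime(row[-2] + row[-1], ROW_FORMAT)
        if stamp >= cutoff:
            continue  # apply the cutoff before looking for the maximum
        candidates.append((stamp, row))
        if row[1] not in latest or latest[row[1]] < stamp:
            latest[row[1]] = stamp
    return [row for stamp, row in candidates if stamp == latest[row[1]]]


rows = [
    ['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
    ['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'],
    ['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '14:51:45'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '17:00:00'],  # hypothetical row after the cutoff
]
cutoff = datetime.datetime.strptime('2020-09-06 16:00:45', '%Y-%m-%d %H:%M:%S')
new_list = latest_rows_before(rows, ['555', '256', '1000'], cutoff)
for row in new_list:
    print(row)
```

With these sample rows, the 1000 code still yields the 15:59:45 row even though a later row exists after the cutoff.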

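【Comments】:

Thanks, Abi. Really great piece of code. Would you mind telling me in a PM what exactly you did?
Thanks bro, I'll check it out.
I don't see my_time in your code? What did you do? How does your code know it should only take the rows earlier than my_time?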

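【Solution 2】:

Sorry, I'm not sure what you are trying to do here. Assuming you want only one instance of each country code in new_list, with the latest time before my_time1, here is an answer:

The logic in your code is incorrect. Right now you loop over all the rows in the csv file and apply the same condition to each one before appending it to new_list.
In the given case, ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45'] is added because condition 1 is true (1000 is in my_countrycodes) and condition 2 is also true ('06-09-2020', '15:59:45' is earlier than my_time1 and is also the "max" time at that point).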

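You can solve this in many different ways, but here are some suggestions:

Change your solution:
  check whether row[1] is in str(my_countrycodes),
  check whether the row's time is earlier than my_time1,
  check whether the row's country code is already in new_list,
  if it is not in new_list, add it,
  if it is in new_list, check whether the new date and time meet your condition, and if so, update the date and time columns of that row.
Filter your file by country code, then retrieve the max value from the filtered results of each country code.

Be careful about what you use as the key, because the same country code repeats with different parameters ('NY', 'BS').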
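
The first suggestion above can be sketched with a dictionary instead of repeatedly searching new_list. This is only an illustration under assumptions: the key `(row[1], row[4])` is a guess at what distinguishes the 'NY' and 'BS' rows, the names `best`, `wanted_codes` and `cutoff` are invented here, and the rows are the sample rows from the question rather than the real file.

```python
import datetime

ROW_FORMAT = "%d-%m-%Y%H:%M:%S"
cutoff = datetime.datetime(2020, 9, 6, 16, 0, 45)  # my_time1 in the question
wanted_codes = {'555', '256', '1000'}

rows = [
    ['USA', '555', 'White', 'True', 'NY', '06-09-2020', '10:11:32'],
    ['USA', '555', 'White', 'True', 'BS', '06-09-2020', '10:11:32'],
    ['EU', '256', 'Blue', 'False', 'BR', '06-09-2020', '11:26:21'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '14:51:45'],
    ['GE', '1000', 'Green', 'True', 'BE', '06-09-2020', '15:59:45'],
]

best = {}  # (country code, column 4) -> (datetime, row)
for row in rows:
    if row[1] not in wanted_codes:
        continue
    stamp = datetime.datetime.strptime(row[-2] + row[-1], ROW_FORMAT)
    if stamp >= cutoff:
        continue
    key = (row[1], row[4])  # assumed key; the country code alone would merge 'NY' and 'BS'
    # Add the row if the key is new, or replace the stored row if this one is later.
    if key not in best or best[key][0] < stamp:
        best[key] = (stamp, row)

new_list = [row for stamp, row in best.values()]
for row in new_list:
    print(row)
```

Using the country code alone as the key would keep only one of the two 555 rows, which is why the choice of key matters here.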

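Suggestions and comments:

For fast access to the data you can use a dictionary. Using the country code as the key gives you easy access to the data and lets you quickly check whether an entry exists and update its parameters.
any(country_code in row[1] for country_code in my_countrycodes)
can be written as:
row[1] in str(my_countrycodes)
Or you can even create my_country_code_str = str(my_countrycodes) before entering the for loop.
I don't know why you convert the datetime back and forth, but since you only need the last one, this is enough:
rows_date_time = datetime.datetime.strptime(row[-2] + row[-1], '%d-%m-%Y%H:%M:%S')
Keep in mind that you can format it however you like with a format string such as '%d-%m-%Y%H:%M:%S'.
Remember to give your variables meaningful names and to stick to one coding standard (for example, if you use underscores, use them consistently).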
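
A caveat on the membership checks in the suggestions above: both the any(...) form and the str(my_countrycodes) form are substring tests, so they are not strictly the same as asking whether the code is one of the listed codes. The short sketch below shows the difference with a made-up code '55'; the variable names are for illustration only.

```python
my_countrycodes = ['555', '256', '1000']
code = '55'  # hypothetical code that is NOT one of the wanted codes

# str(my_countrycodes) is the text "['555', '256', '1000']",
# so this is a substring search and '55' is found inside '555'.
substring_match = code in str(my_countrycodes)

# Membership in the list (or a set) compares whole values.
exact_match = code in set(my_countrycodes)

print(substring_match)  # True
print(exact_match)      # False
```

If the codes in the file are always complete values in their own column, testing membership in a set is a safer default.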

【Comments】: