Python: Selecting longest consecutive series of dates in list

【Question title】: Python: Selecting longest consecutive series of dates in list
【Posted】: 2020-12-16 17:40:33
【Question】:

I have a Series of lists (np.arrays, actually) whose elements are dates.

id
0a0fe3ed-d788-4427-8820-8b7b696a6033    [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a48d1e8-ead2-404a-a5a2-6b05371200b1    [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a9edba1-14e3-466a-8d0c-f8a8170cefc8    [2019-01-29, 2019-01-30, 2019-01-31, 2019-02-0...
Name: startDate, dtype: object

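For each element of the Series (i.e. each list of dates), I would like to keep the longest sublist in which all dates are consecutive. I'm struggling to approach this in a Pythonic (simple/efficient) way. The only approach I can think of uses multiple loops: loop over the Series values (the lists), then loop over each element of the list. I would then store the first date and the number of consecutive days, and overwrite the result from temporary values whenever a longer run of consecutive days is encountered. This seems highly inefficient, though. Is there a better way?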

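【Question comments】:

Convert the dates to ordinals and find the longest run of values that increase by one. I posted an answer you can try.

Tags: python python-3.x date datetime series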

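【Solution 1】:

Since you mention that you are using numpy arrays of dates, it makes sense to stick to numpy types instead of converting to the built-in types. I will assume here that your arrays have dtype 'datetime64[D]'. In that case you could do something like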

import numpy as np

date_list = np.array(['2005-02-01', '2005-02-02', '2005-02-03',
       '2005-02-05', '2005-02-06', '2005-02-07', '2005-02-08', '2005-02-09',
       '2005-02-11', '2005-02-12',
       '2005-02-14', '2005-02-15', '2005-02-16', '2005-02-17',
       '2005-02-19', '2005-02-20',
       '2005-02-22', '2005-02-23', '2005-02-24',
       '2005-02-25', '2005-02-26', '2005-02-27', '2005-02-28'],
      dtype='datetime64[D]')

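# [i0, i1) is the current run; [i0max, i1max) is the longest run found so far
i0max, i1max = 0, 0
i0 = 0
for i1, date in enumerate(date_list):
    # within a run, date_list[i1] is exactly (i1 - i0) days after date_list[i0]
    if date - date_list[i0] != np.timedelta64(i1-i0, 'D'):
        if i1 - i0 > i1max - i0max:
            i0max, i1max = i0, i1
        i0 = i1

# the loop only records a run when it is broken, so check the final run as well
if len(date_list) - i0 > i1max - i0max:
    i0max, i1max = i0, len(date_list)

print(date_list[i0max:i1max])

# output: the seven dates from 2005-02-22 through 2005-02-28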

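Here, i0 and i1 denote the start and stop indices of the current subarray of consecutive dates, and i0max and i1max denote the start and stop indices of the longest subarray found so far. The solution relies on the fact that, in a list of consecutive dates, the difference between the i-th and the zeroth entry is exactly i days.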
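The same idea can also be written without an explicit Python loop. What follows is a minimal sketch, not part of the original answer, assuming a sorted 1-D array that can be cast to dtype 'datetime64[D]'; the function name longest_consecutive_run is made up for illustration. It uses np.diff to find the positions where the gap between neighbours is not exactly one day and treats those positions as run boundaries:

```python
import numpy as np

def longest_consecutive_run(dates):
    """Return the longest slice of `dates` whose entries are one day apart.

    Assumes `dates` is sorted ascending and convertible to 'datetime64[D]'.
    """
    dates = np.asarray(dates, dtype='datetime64[D]')
    if dates.size == 0:
        return dates
    # Indices where the gap to the previous date is not exactly one day;
    # each such index is the start of a new run.
    breaks = np.flatnonzero(np.diff(dates) != np.timedelta64(1, 'D')) + 1
    starts = np.concatenate(([0], breaks))
    stops = np.concatenate((breaks, [dates.size]))
    # argmax returns the first maximum, so the earliest longest run wins on ties
    longest = np.argmax(stops - starts)
    return dates[starts[longest]:stops[longest]]
```

Because dates.size is appended to the stop indices, a run that extends to the end of the array is measured like any other run.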

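【Discussion】:

【Solution 2】:

You can convert the list to ordinals, which increase by one for each consecutive date. This means next_date = previous_date + 1.

Then find the longest consecutive subarray.

This process takes O(n) time (a single loop), which is asymptotically optimal because every date has to be inspected at least once.

Code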

from datetime import datetime
def get_consecutive(date_list):
  # convert to ordinals
  v = [datetime.strptime(d, "%Y-%m-%d").toordinal()  for d in date_list]
  consecutive = []
  dates = []

  # walk the ordinals, collecting the current run of consecutive dates
  for i in range(1, len(v) + 1):
    dates.append(date_list[i-1])
    # the run ends at the last element or when the next ordinal is not +1
    if i == len(v) or v[i-1] + 1 != v[i]:
      if len(consecutive) < len(dates):
        consecutive = dates
      dates = []

  return consecutive

Example and output:

date_list = ['2019-01-29', '2019-01-30', '2019-01-31','2019-02-05']
print(get_consecutive(date_list))
# the ordinals will be v = [737088, 737089, 737090, 737095]
# output: ['2019-01-29', '2019-01-30', '2019-01-31']

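Now use get_consecutive in df.column.apply(get_consecutive), and it will give you the longest run of consecutive dates for each list. If you are using some other data structure, you can call the function on each list instead. Note that get_consecutive, as written, expects date strings in the %Y-%m-%d format.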
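The ordinal trick can also be expressed with itertools.groupby: inside a run of consecutive days, the ordinal minus the position in the list stays constant, so that difference can serve as a grouping key. The following is only a sketch under stated assumptions, not part of the original answer: the input is a sorted sequence of datetime.date objects (date strings would need to be parsed first), and the name longest_run_groupby is hypothetical.

```python
from datetime import date
from itertools import groupby

def longest_run_groupby(days):
    """Longest run of consecutive days in a sorted sequence of datetime.date."""
    # Within a consecutive run, (ordinal - index) is constant,
    # so it identifies the run each day belongs to.
    keyed = groupby(enumerate(days), key=lambda pair: pair[1].toordinal() - pair[0])
    # Materialize each group right away; groupby invalidates it on the next step.
    runs = ([day for _, day in group] for _, group in keyed)
    # max() keeps the first longest run on ties; default covers empty input.
    return max(runs, key=len, default=[])


days = [date(2019, 1, 29), date(2019, 1, 30), date(2019, 1, 31), date(2019, 2, 5)]
print(longest_run_groupby(days))
```

This is still a single pass over the data, so the running time stays O(n).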

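【Discussion】:

【Solution 3】:

I'll reduce this problem to finding the longest run of consecutive days in a single list. As you asked, there are a few tricks that make it more Pythonic. The following script should run as-is. I've documented how it works inline: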

from datetime import timedelta, date

# example input
days = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
    date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8),
]

# store the longest interval and the current consecutive interval
# as we iterate through a list
longest_interval_index = current_interval_index =  0
longest_interval_length = current_interval_length = 1

# using zip here to reduce the number of indexing operations
# this will turn the days list into [(2020-01-01, 2020-01-02), (2020-01-02, 2020-01-04), ...]
# use enumerate to get the index of the current day
for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
    if current_day - previous_day == timedelta(days=+1):
        # we've found a consecutive day! increase the interval length
        current_interval_length += 1
    else:
        # nope, not a consecutive day! start from this day and start
        # counting from 1
        current_interval_index = i
        current_interval_length = 1
    if current_interval_length > longest_interval_length:
        # we broke the record! record it as the longest interval
        longest_interval_index = current_interval_index
        longest_interval_length = current_interval_length

print("Longest interval index:", longest_interval_index)
print("Longest interval: ", days[longest_interval_index:longest_interval_index + longest_interval_length])

It should be easy to turn this into a reusable function.
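One possible way to wrap the script in a function is sketched below. The function name longest_consecutive_days and the choice to return the sublist itself (rather than an index and a length) are assumptions made for illustration; the input is assumed to be a sorted list of datetime.date objects.

```python
from datetime import date, timedelta

def longest_consecutive_days(days):
    """Return the longest run of consecutive days from a sorted list of dates."""
    if not days:
        return []
    best_start = start = 0
    best_length = length = 1
    # pair each day with its predecessor; i is the index of current_day
    for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
        if current_day - previous_day == timedelta(days=1):
            length += 1
        else:
            # gap found: a new run starts at the current day
            start, length = i, 1
        if length > best_length:
            best_start, best_length = start, length
    return days[best_start:best_start + best_length]


days = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
    date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8),
]
print(longest_consecutive_days(days))
```

For the example input this returns the three days from 2020-01-04 to 2020-01-06. With a pandas Series of such lists, the function could be passed to Series.apply in the same way as in Solution 2.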

【Discussion】: