Finding the max number of consecutive elements using vectorization: answers

【Question title】: finding max number of consecutive elements using vectorization
【Posted】: 2014-04-29 20:56:54
【Question】:

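As part of my project, I need to find out whether a vector contains 4 or more consecutive elements, and their indices. Currently I am using the following code: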

import numpy as np

# sample arrays:
# a1 = np.array([0, 1, 2, 3, 5])
# a2 = np.array([0, 1, 3, 4, 5, 6])
# a3 = np.array([0, 1, 3, 4, 5])
a4 = np.array([0, 1, 2, 4, 5, 6])

dd = np.diff(a4)  # array([1, 1, 2, 1, 1])
c = 0
idx = []
for i in range(len(dd)):
    if dd[i]==1 and c<3:
        idx.append(i)
        c+=1
    elif dd[i]!=1 and c>=3:
        break
    else:
        c = 0
        idx = []

I am interested in seeing whether the for loop can be avoided and the task done with numpy functions only.

【Question comments】:

Tags: python numpy vectorization

【Solution 1】:

This will give you an array with the lengths of all the consecutive chunks:

np.diff(np.concatenate(([-1],) + np.nonzero(np.diff(a) != 1) + ([len(a)-1],)))

Some tests:

>>> a = [1, 2, 3, 4, 5, 6, 9, 10, 11, 14, 17, 18, 19, 20, 21]
>>> np.diff(np.concatenate(([-1],) + np.nonzero(np.diff(a) != 1) +
                           ([len(a)-1],)))
array([6, 3, 1, 5], dtype=int64)

>>> a = [0, 1, 2, 4, 5, 6]
>>> np.diff(np.concatenate(([-1],) + np.nonzero(np.diff(a) != 1) +
                           ([len(a)-1],)))
array([3, 3], dtype=int64)

To check whether any chunk has at least 4 items, just wrap the code above in np.any(... >= 4).
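As a minimal sketch of that wrapping: the helper name has_run_of is made up here for illustration and is not part of the answer, and it assumes a non-empty 1-D sequence of integers.

```python
import numpy as np

def has_run_of(a, n):
    """Return True if `a` contains a run of at least `n` consecutive integers.

    Hypothetical helper; assumes `a` is a non-empty 1-D sequence.
    """
    a = np.asarray(a)
    # Positions where the step to the next element is not 1 (run boundaries)
    breaks = np.nonzero(np.diff(a) != 1)[0]
    # Pad with -1 and len(a)-1 so that the differences are the run lengths
    lengths = np.diff(np.concatenate(([-1], breaks, [len(a) - 1])))
    return bool(np.any(lengths >= n))

print(has_run_of([0, 1, 2, 4, 5, 6], 4))  # False
print(has_run_of([0, 1, 3, 4, 5, 6], 4))  # True
```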

To see how it works, let's compute the result for my first example from the inside out:

>>> a = [1, 2, 3, 4, 5, 6, 9, 10, 11, 14, 17, 18, 19, 20, 21]

First, we work out the deltas between consecutive items:

>>> np.diff(a)
array([1, 1, 1, 1, 1, 3, 1, 1, 3, 3, 1, 1, 1, 1])

Then we identify the positions where the delta is not 1, i.e. the positions where a chunk of consecutive items starts or ends:

>>> np.diff(a) != 1
array([False, False, False, False, False,  True, False, False,  True,
        True, False, False, False, False], dtype=bool)

We extract the positions of the Trues:

>>> np.nonzero(np.diff(a) != 1)
(array([5, 8, 9], dtype=int64),)

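The indices above mark the last item of a consecutive run. Python slices are defined as start to last+1, so we can add one to that array, prepend a zero, append the length of the array, and we have all the start and stop indices of the consecutive runs, i.e.: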

>>> np.concatenate(([0], np.nonzero(np.diff(a) != 1)[0] + 1, [len(a)]))
array([ 0,  6,  9, 10, 15], dtype=int64)

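Taking the differences of successive indices gives us the desired length of each consecutive chunk. Because all we care about is the differences, instead of adding one to the indices, in my original answer I chose to prepend -1 and append len(a)-1: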

>>> np.concatenate(([-1],) + np.nonzero(np.diff(a) != 1) + ([len(a)-1],))
array([-1,  5,  8,  9, 14], dtype=int64)
>>> np.diff(np.concatenate(([-1],) + np.nonzero(np.diff(a) != 1) +
                           ([len(a)-1],)))
array([6, 3, 1, 5], dtype=int64)

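Say that from this array you decide you want the indices of the 5-item chunk, the one at position 3 of the lengths array. To recover the start and stop indices of that chunk, you can simply do: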

>>> np.concatenate(([0], np.nonzero(np.diff(a) != 1)[0] + 1, [len(a)]))[3:3+2]
array([10, 15], dtype=int64)
>>> a[10:15]
[17, 18, 19, 20, 21]
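Putting those steps together, here is a sketch that returns the (start, stop) slice bounds of every run with at least n items. The name runs_at_least is invented for this example, and the function assumes a non-empty 1-D sequence.

```python
import numpy as np

def runs_at_least(a, n):
    """Return (start, stop) slice bounds of every run of >= n consecutive items.

    Hypothetical helper; assumes `a` is a non-empty 1-D sequence.
    """
    a = np.asarray(a)
    # Slice boundaries: 0, one past each break, and len(a)
    bounds = np.concatenate(([0], np.nonzero(np.diff(a) != 1)[0] + 1, [len(a)]))
    starts, stops = bounds[:-1], bounds[1:]
    # Keep only the runs that are long enough
    keep = (stops - starts) >= n
    return [(int(s), int(e)) for s, e in zip(starts[keep], stops[keep])]

a = [1, 2, 3, 4, 5, 6, 9, 10, 11, 14, 17, 18, 19, 20, 21]
print(runs_at_least(a, 4))  # [(0, 6), (10, 15)]
```

Each pair can be used directly as a slice, e.g. a[10:15] for the last run.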

【Comments】:

Thanks. Could you explain how it works? I also need the indices of those elements.

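【Solution 2】:

How about the following recursive solution? (It only works for 1-D arrays, of course.)

I find it sickeningly elegant. I'm not saying you should use it, but I had a lot of fun coming up with it.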

import numpy as np

def is_consecutive(arr, n):
    if n <= len(arr) <= 1:
        return True
    if len(arr) < n:
        return False
    diffs1idx = np.where(np.diff(arr) == 1)[0]
    return is_consecutive(diffs1idx, n-1)

print(is_consecutive([1, 2], 3))  # False
print(is_consecutive([1, 2, 3], 3))  # True
print(is_consecutive([5, 1, 2, 3], 3))  # True
print(is_consecutive([4, 9, 1, 5, 7], 3))  # False
print(is_consecutive([4, 9, 1, 2, 3, 7, 9], 3))  # True
print(is_consecutive(np.arange(100), 100))  # True
print(is_consecutive(np.append([666], np.arange(100)), 100))  # True
print(is_consecutive(np.append([666], np.arange(100)), 101))  # False

(Please don't ask me how it works... I don't understand recursion...)
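For what it's worth, the recursion can be unrolled into a loop, which may make the idea easier to follow. This is only a sketch under the same 1-D assumption, and is_consecutive_loop is a made-up name: each pass keeps the positions where the current array steps by exactly 1, so a run of n items in the input shrinks to a run of n-1 items in the next array.

```python
import numpy as np

def is_consecutive_loop(arr, n):
    """Loop form of the recursive check above (hypothetical name).

    Each pass replaces `arr` with the positions where the current array
    steps by exactly 1, and lowers the target run length by one.
    """
    arr = np.asarray(arr)
    while True:
        if n <= len(arr) <= 1:
            return True
        if len(arr) < n:
            return False
        arr = np.where(np.diff(arr) == 1)[0]
        n -= 1

print(is_consecutive_loop([5, 1, 2, 3], 3))  # True
print(is_consecutive_loop([4, 9, 1, 5, 7], 3))  # False
```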

【Comments】:

Interesting to see this solution, but I guess it is not efficient in Python, since Python function calls are expensive. Thanks anyway.

【Solution 3】:

Yes. You started off right:

from numpy import array, diff, where

numbers = array([0, 1, 3, 4, 5, 5])
differences = diff(numbers)

You are interested in consecutive numbers:

consecutives = differences == 1

And you want two of those in a row. You can compare the array with an offset copy of itself:

(consecutives[1:] & consecutives[:-1]).any()
#>>> True

To get the number of occurrences, use .sum() instead of .any().

If you want the indices, just use numpy.where:

[offset_indexes] = where(consecutives[1:] & consecutives[:-1])
offset_indexes
#>>> array([2])

Edit: you seem to have edited the desired length from 3 to 4. That invalidates my code, but you only need to change

consecutives[1:] & consecutives[:-1]

to

consecutives[2:] & consecutives[1:-1] & consecutives[:-2]
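The same pattern extends to any run length by ANDing together one shifted view of the mask per step. A plain sketch of that, assuming a 1-D integer array and a length of at least 2 (run_starts is a name made up for this example):

```python
import numpy as np

def run_starts(numbers, length):
    """Start indices of every window of `length` consecutive values.

    Hypothetical helper; assumes a 1-D array and length >= 2.
    """
    consecutives = np.diff(numbers) == 1
    k = length - 1  # number of unit steps inside a run of `length` items
    if k > consecutives.size:
        return np.array([], dtype=int)
    # AND together k shifted views of the mask, all cut to the same size
    size = consecutives.size - k + 1
    mask = np.ones(size, dtype=bool)
    for shift in range(k):
        mask &= consecutives[shift:shift + size]
    return np.where(mask)[0]

numbers = np.array([0, 1, 3, 4, 5, 5])
print(run_starts(numbers, 3))  # [2]
print(run_starts(numbers, 4))  # []
```

This does one pass over the array per step, so it costs O(n k) for a run length of k.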

Here is a pointlessly generic version:

from numpy import arange, array, diff, where

def doubling_step_shifts(shifts):
    """
    When you apply a mask of some kind of all rotations,
    often the size of the last prints will allow shifts
    larger than 1. This is a helper for that.

    A mask is assumed to exist before invocation, as
    this is typically called repeatedly on the mask or
    a copy.
    """
    # Total shift
    subtotal = 1
    step = 1

    # While the shifts won't overflow
    while subtotal + step < shifts:
        yield step
        subtotal += step
        step *= 2

    # Make up the remainder
    if shifts - subtotal > 0:
        yield shifts - subtotal

def consecutive_indexes_of_length(numbers, length):
    # Constructing "consecutives" creates a
    # minimum mask of 1, whereas this would need
    # a mask of 0, so we special-case these
    if length <= 1:
        return arange(numbers.size)

    # Mask of consecutive numbers
    consecutives = diff(numbers) == 1
    consecutives.resize(numbers.size)

    # Recursively reapply mask to cover lengths too short
    for i in doubling_step_shifts(length-1):
        consecutives[:-i] &= consecutives[i:]

    # Reextend those lengths
    for i in doubling_step_shifts(length):
        consecutives[i:] = consecutives[i:] | consecutives[:-i]

    # Give the indexes
    return where(consecutives)[0]

Edit: much faster now (numpy.roll is slow).

And some tests:

numbers = array([1, 2, 3, 4, 5, 6, 9, 10, 11, 14, 17, 18, 19, 20, 21])

consecutive_indexes_of_length(numbers, 1)
#>>> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
consecutive_indexes_of_length(numbers, 2)
#>>> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 10, 11, 12, 13, 14])
consecutive_indexes_of_length(numbers, 3)
#>>> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 10, 11, 12, 13, 14])
consecutive_indexes_of_length(numbers, 4)
#>>> array([ 0,  1,  2,  3,  4,  5, 10, 11, 12, 13, 14])
consecutive_indexes_of_length(numbers, 5)
#>>> array([ 0,  1,  2,  3,  4,  5, 10, 11, 12, 13, 14])
consecutive_indexes_of_length(numbers, 6)
#>>> array([0, 1, 2, 3, 4, 5])
consecutive_indexes_of_length(numbers, 7)
#>>> array([], dtype=int64)

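Why? No reason. It is O(n log k), where n is the number of elements in the list and k is GROUPSIZE, and it should be reasonably fast for almost all group sizes.

Edit: it is fast now. I bet Cython would be faster, but this is fine.

This implementation has the advantage of being relatively simple, extensible, and built from very primitive operations. It probably won't be faster than a Cython loop except for very small inputs.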

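【Comments】:

It should output the indices in the original matrix, i.e. [2,3,4]
Sorry, but not really! Could you also give a short explanation of the & part?
I'm a bit busy at the moment. But offset_indexes gives you the start index of each run, so offset_indexes+1 gives you the middle and offset_indexes+2 gives you the last one. If you want the union of these: create an empty array X of the original size, let C = consecutives[1:] & consecutives[:-1], and add C to X at shifts of 0, 1 and 2. Then call numpy.where on X. If that works, feel free to update my answer :).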
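A sketch of what that last comment describes, using the example array from the answer. It reads "add C to X" as a boolean OR, which is an assumption on my part; X and C are the names from the comment.

```python
import numpy as np

numbers = np.array([0, 1, 3, 4, 5, 5])
consecutives = np.diff(numbers) == 1

# C[i] is True when numbers[i], numbers[i+1], numbers[i+2] are consecutive
C = consecutives[1:] & consecutives[:-1]

# X has the original size; mark each run start plus the two items after it
X = np.zeros(numbers.size, dtype=bool)
for shift in range(3):
    X[shift:shift + C.size] |= C

print(np.where(X)[0])  # [2 3 4]
```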