Efficiently check if an array is jagged: answers

【Question title】: Efficiently check if an array is jagged
【Posted】: 2020-09-11 08:38:09
【Question】:

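I'm looking for an efficient way to check whether an array is jagged, where "jagged" means that an element of the array has a different shape from one of its neighbors in the same dimension.

For example, [[1, 2], [3, 4, 5]] or [[1, 2], [3, 4], [5, 6], [[7], [8]]].

I'm using list syntax for convenience, but the arguments may be nested lists or nested numpy arrays. I'm also showing integers for convenience; the lowest-level components could be anything (e.g. generic objects). Assume the lowest-level objects are not themselves iterable (e.g. str or dict; definite bonus points for a solution that can handle those, though!).

Attempt:

Recursively flattening an array is pretty easy, though I'm guessing quite inefficient, and then the length of the flattened array can be compared to the numpy.size of the input array. If they match, then it is not jagged.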

import numpy as np


def really1d(arr):
    # Returns False if the given array is not 1D or is a jagged 1D array.
    if np.ndim(arr) != 1:
        return False
    if len(arr) == 0:
        return True
    if np.any(np.vectorize(np.ndim)(arr)):
        return False
    return True


def flatten(arr):
    # Convert the given array to 1D (even if jagged)
    if (not np.iterable(arr)) or really1d(arr):
        return arr
    return np.concatenate([flatten(aa) for aa in arr])


def isjagged(arr):
    if (np.size(arr) == len(flatten(arr))):
        return False
    return True

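I'm pretty sure the concatenations copy all of the data, which is a complete waste. Maybe there is an itertools or numpy.flatiter method of achieving the same goal? Ultimately, the flattened array is only used to find its length.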
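
Since only the length of the flattened data is needed, one option is to count the lowest-level elements lazily instead of concatenating them. The following is a minimal sketch, not a tuned implementation; iter_leaves and count_leaves are hypothetical names, and it assumes that str, bytes and dict should count as single leaves.

```python
def iter_leaves(arr):
    """Yield the lowest-level elements of a nested structure, lazily."""
    # Assumption: str, bytes and dict count as leaves even though they are iterable.
    if isinstance(arr, (str, bytes, dict)):
        yield arr
        return
    try:
        children = iter(arr)
    except TypeError:
        # Not iterable: a plain leaf such as an int or a generic object.
        yield arr
        return
    for child in children:
        yield from iter_leaves(child)


def count_leaves(arr):
    # Count the leaves without materialising a flattened list or array.
    return sum(1 for _ in iter_leaves(arr))
```

The count could then stand in for len(flatten(arr)) in isjagged, so no flattened copy is ever built.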

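【Comments】:

This answer is in Java, but it might help: stackoverflow.com/a/22874074/13314450
Thanks @LD, I did see a few answers for other languages, but I suspect that finding an efficient answer requires using the appropriate numpy or itertools methods to avoid unnecessarily copying data, which I think I'm doing in my attempt here.
You show lists, but the tests all use numpy. A "jagged" numpy array will have object dtype. Often the shape will also be 1d (or at least have fewer dimensions than expected).
@hpaulj The lists are shown for simplicity; the question has been amended. If I knew the "expected" shape in general, the solution would be trivial.
Given how general your question is, I'm not sure efficiency can be measured. You either have a list of lists, or an object dtype array (a numeric dtype cannot be ragged). Iteration over an object array is slower than iteration over a list. The fast compiled numpy methods don't work on object arrays.

Tags: python arrays numpy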

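【Solution 1】:

Here's a different way to approach the problem. It aims for more generality (no numpy assumption) and code simplicity. It ignores the efficiency concerns you raise several times: it does not flatten or copy the data, but it does build a parallel data structure that makes testing for jaggedness easy.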

def simplified(xs):
    # Takes a value and returns it in recursively simplified form.
    # Array-like values (list, tuple, str) become tuples.
    # All other values (and single characters) become None.
    if isinstance(xs, (list, tuple)):
        return tuple(simplified(x) for x in xs)
    elif isinstance(xs, str):
        return tuple(None for x in xs)
    else:
        return None

def is_jagged(xs):
    # Takes a simplified value.
    # Non-jagged structures will have the same form at the top level.
    return len(set(xs)) > 1

Demo:

tests = (
    # Non-jagged.
    (False, []),
    (False, [[], [], []]),
    (False, [1, 2, 3]),
    (False, [[1, 2], [3, 4]]),
    (False, [[1, 2], [3, 4], [5, 6], [7, 8]]),
    (False, ('ab', 'cd')),
    (False, (['ab', 'cd', 'ef'], ('gh', 'ij', 'kl'))),
    # Jagged.
    (True, [1, 2, [3, 4]]),
    (True, [[1, 2], [3, 4, 5]]),
    (True, [[1, 2], [3, 4], [5, 6], [[7], [8]]]),
    (True, ('ab', 'cdefg')),
)
fmt = '\nInput:      {}\nSimplified: {}\nIs jagged:  {} [{}]'
for exp, xs in tests:
    sim = simplified(xs)
    isj = is_jagged(sim)
    msg = fmt.format(xs, sim, isj, 'ok' if isj == exp else 'DOH')
    print(msg)

Output:

Input:      []
Simplified: ()
Is jagged:  False [ok]

Input:      [[], [], []]
Simplified: ((), (), ())
Is jagged:  False [ok]

Input:      [1, 2, 3]
Simplified: (None, None, None)
Is jagged:  False [ok]

Input:      [[1, 2], [3, 4]]
Simplified: ((None, None), (None, None))
Is jagged:  False [ok]

Input:      [[1, 2], [3, 4], [5, 6], [7, 8]]
Simplified: ((None, None), (None, None), (None, None), (None, None))
Is jagged:  False [ok]

Input:      ('ab', 'cd')
Simplified: ((None, None), (None, None))
Is jagged:  False [ok]

Input:      (['ab', 'cd', 'ef'], ('gh', 'ij', 'kl'))
Simplified: (((None, None), (None, None), (None, None)), ((None, None), (None, None), (None, None)))
Is jagged:  False [ok]

Input:      [1, 2, [3, 4]]
Simplified: (None, None, (None, None))
Is jagged:  True [ok]

Input:      [[1, 2], [3, 4, 5]]
Simplified: ((None, None), (None, None, None))
Is jagged:  True [ok]

Input:      [[1, 2], [3, 4], [5, 6], [[7], [8]]]
Simplified: ((None, None), (None, None), (None, None), ((None,), (None,)))
Is jagged:  True [ok]

Input:      ('ab', 'cdefg')
Simplified: ((None, None), (None, None, None, None, None))
Is jagged:  True [ok]
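
One caveat about is_jagged above: it only compares the forms of the top-level elements, so an input such as [[[1, 2], [3, 4, 5]]] has a single top-level form and is reported as non-jagged. A minimal sketch of a recursive variant follows; shape_of and is_jagged_deep are hypothetical names, and unlike simplified, strings are treated here as leaves.

```python
def shape_of(xs):
    """Return the nested shape of xs as a tuple, or None if xs is jagged.

    Anything that is not a list or tuple is treated as a leaf with shape ().
    """
    if not isinstance(xs, (list, tuple)):
        return ()
    child_shapes = {shape_of(x) for x in xs}
    # A jagged child, or children with differing shapes, makes xs jagged.
    if None in child_shapes or len(child_shapes) > 1:
        return None
    # An empty container has no children; its shape is just (0,).
    inner = child_shapes.pop() if child_shapes else ()
    return (len(xs),) + inner


def is_jagged_deep(xs):
    return shape_of(xs) is None
```

For example, is_jagged_deep([[[1, 2], [3, 4, 5]]]) is True, while the regular cases from the demo still come out non-jagged.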

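【Discussion】:

Thanks for your answer. As I provided a working solution in the question, the efficiency part is really important... so an answer that "ignores the efficiency concerns" isn't really an answer.
Maybe, but StackOverflow is many things: imagine future users who end up at this question with needs slightly different from yours. Regarding efficiency, if you want good answers on that front we need more information: which aspects of the problem should be prioritized over others; which kind of efficiency matters most; whether the problem can be parallelized; which data types must be supported; and what benchmarks you have (how fast or efficient your current approach is, and what target you want to hit). Without those details, efficiency talk is often just talk.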

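【Solution 2】:

Sorry if my phrasing of the question was too vague, but I need a solution that is more general than one that works only for the given (list of lists of integers) examples.

I'm still guessing there may be a better solution, but here is a significant improvement that definitely does not copy the input in memory: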

def really1d(arr):
    if np.ndim(arr) != 1:
        return False
    if len(arr) == 0:
        return True
    if np.any(np.vectorize(np.ndim)(arr)):
        return False
    return True


def flatlen(arr):
    # NOTE: If you know your base types are NOT iterable (e.g. not `str`, or `dict`, etc)
    # Then you might be able to get away with:
    # if not np.iterable(arr):

    # This will work for my cases (catching possible `str` and `dict` types)
    if np.isscalar(arr) or isinstance(arr, dict):
        return 1

    if really1d(arr):
        return len(arr)

    return np.sum([flatlen(aa) for aa in arr])


def isjagged(arr):
    if np.isscalar(arr) or (np.size(arr) == flatlen(arr)):
        return False
    return True

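【Discussion】:

If arr is a list, np.ndim and np.vectorize (and np.size) convert it to an array before doing their thing. That produces a copy (though in the object array case it is just a copy of the list references). If I wanted to check a list of lists efficiently, I would either stay away from numpy functions or do the conversion once at the start.

【Solution 3】:

First, what you show are lists, not arrays (more on that later):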

In [305]: alist1 = [[1, 2], [3, 4, 5]]                                                   
In [306]: alist2 = [[1, 2], [3, 4], [5, 6], [[7], [8]]]

A mix of len values at the first level is a simple and obvious test:

In [307]: [len(i) for i in alist1]                                                       
Out[307]: [2, 3]

But that isn't enough for the second example:

In [308]: [len(i) for i in alist2]                                                       
Out[308]: [2, 2, 2, 2]

Making an array from alist1 produces a 1d object dtype array:

In [310]: np.array(alist1)                                                               
Out[310]: array([list([1, 2]), list([3, 4, 5])], dtype=object)

alist2 is 2d, but still object dtype:

In [311]: np.array(alist2)                                                               
Out[311]: 
array([[1, 2],
       [3, 4],
       [5, 6],
       [list([7]), list([8])]], dtype=object)

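np.array isn't the most efficient tool; although it is compiled, it still has to evaluate the nested list at least down to the level where it finds a discrepancy.

If the list is not ragged at any level, the result has a numeric dtype: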

In [321]: alist3 = [[1, 2], [3, 4], [5, 6], [7, 8]]                                      
In [322]: np.array(alist3)                                                               
Out[322]: 
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

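If the list elements are arrays, a further outcome is possible: a broadcasting error. This happens when the first dimension matches but the discrepancy is at a lower level.

In sum, if it is already a numpy array, then object dtype is a good indicator, especially if you are expecting a numeric dtype. It won't help if the lowest-level elements might themselves be objects (other than lists). In both the alist1 and alist2 cases, some or all of the lowest-level elements are objects: lists.

If it is a list, then recursively evaluating len is probably the way to go. But only timing tests will show whether that is better than np.array(alist).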
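
A minimal sketch of the len-based approach for plain nested lists and tuples, applied level by level so that it stops at the first discrepancy. The name is_jagged_by_len is hypothetical, and anything that is not a list or tuple is assumed to be a leaf. No timing claims are made here.

```python
def is_jagged_by_len(xs):
    """Compare len values one nesting level at a time."""
    level = [xs]
    while True:
        is_seq = [isinstance(x, (list, tuple)) for x in level]
        if not any(is_seq):
            # Only leaves (or nothing) remain: every level was regular.
            return False
        if not all(is_seq):
            # Leaves mixed with containers at the same depth.
            return True
        if len({len(x) for x in level}) > 1:
            # Containers at the same depth have different lengths.
            return True
        # Descend one level; this copies references only, not the data.
        level = [child for x in level for child in x]
```

With the lists above, is_jagged_by_len(alist1) fails the len comparison at the first level, while alist2 is only caught one level deeper, where leaves and lists are mixed.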

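【Discussion】:

Can you tell me how this solution differs from mine?

【Solution 4】:

There is a super simple way to do this in one line... just one line. It isn't perfect, but it's very simple.

You can do that with np.array. If all elements of the nested list have the same shape, the resulting array will have more than one dimension. But if even one element has a different number of elements, the returned array will be one-dimensional.

See this example for more understanding: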

>>> import numpy as np
>>> lst = [[1, 2], [3, 4, 5]]
>>> arr = np.array(lst)
>>> arr.shape
(2,)
>>> arr.dtype
object

>>> lst = [[0, 1, 2], [3, 4, 5]]
>>> arr = np.array(lst)
>>> arr.shape
(2, 3)
>>> arr.dtype
int64

So, your function can be written in one line like this:

def isjagged(lst):
    return len(np.array(lst).shape) == 1

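Note: of course, this only works for nested lists.

Edit

As @Ch3steR pointed out, this solution works for simple nested lists, but not for slightly more complex nested lists like [[1, 2], [3, 4], [5, 6], [[7], [8]]].

So, I think this might be a better solution: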

def isjagged(lst):
    return np.array(lst).dtype == 'object'
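
Note that recent NumPy releases (1.24 and later) raise a ValueError when np.array is given a ragged nested list without dtype=object, instead of returning an object array. Below is a hedged sketch that covers both behaviors; isjagged_np is a hypothetical name, and like the dtype test above it still misreports regular object arrays such as np.empty((2, 2), dtype=object) as jagged.

```python
import numpy as np


def isjagged_np(lst):
    try:
        arr = np.array(lst)
    except ValueError:
        # Newer NumPy refuses to build an array from ragged input.
        return True
    # Older NumPy falls back to an object array for ragged input.
    return arr.dtype == object
```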

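【Discussion】:

No, that won't work. Try np.array([[1, 2], [3, 4], [5, 6], [[7], [8]]]): the shape is (4, 2), and it is jagged.
Yes... you're right! I can use dtype instead, so it becomes: return np.array(lst).dtype == 'object'. What do you think?
I apologize if my examples were poorly chosen. The solution I need in particular must be able to handle simple cases like np.empty((2, 2), dtype=object), which I would definitely call not-jagged.
But what's the purpose of checking for jaggedness? Is it just a programming exercise, or does the next operation need this information? Many array and tensor operations will choke on jagged lists.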