Group numbers in an array by step value changes

【Question title】: Group numbers in an array by step value changes
【Posted】: 2017-11-14 16:29:38
【Question】:

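I have an array like [101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318, 380, 377, 379, 382, 466, 469, 471, 472, 557, 559, 562, 566, 569...]

In this array, after a few integers, the values make a step change (for example between [101, 107, 106] and [199, 204, ...]). Put another way, the array is made up of groups of integers, and the values in each group cluster around an unknown mean. But I don't know how many groups there are in total, and the number of integers in each group is not fixed.

How can I group the integers at each step into separate arrays?

Thanks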

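【Comments on the question】:

For groups of arbitrary length with unknown means, I think a hierarchical clustering algorithm could work. You don't know how many clusters to create, but you could try an iterative search (perhaps starting with 2 groups) and refine your clusters until the within-cluster variance is minimized.
Read up on clustering algorithms. For example, you could use the K-means algorithm here, since it is a very popular clustering algorithm.

Tags: python arrays algorithm grouping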

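【Solution 1】:

You could try this: determine the difference between each pair of consecutive numbers, and from those determine the average difference.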

nums = [101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318, 
        380, 377, 379, 382, 466, 469, 471, 472, 557, 559, 562, 566, 569]
pairs = list(zip(nums, nums[1:]))
diffs = [abs(x-y) for x, y in pairs]
avg_diff = sum(diffs) / len(diffs)  # ~ 18.31

Now you can group the numbers according to whether the difference from the previous number is below or above the average:

groups = [[nums[0]]]          # first group already has first number
for (x, y), d in zip(pairs, diffs):
    if d < avg_diff:
        groups[-1].append(y)  # add to last group
    else:
        groups.append([y])    # start new group

Or, if you prefer a "one-liner" that spans three lines, this one might suit you:

import functools
import itertools

groups = [functools.reduce(lambda A, b: A+(b[1],) if A else b, group, None) 
          for key, group in itertools.groupby(zip(nums, nums[1:]), 
                  key=lambda t: abs(t[0]-t[1]) < 18.3) if key]

The result for your example looks like this:

[[101, 107, 106],
 [199, 204, 205, 207],
 [306, 310, 312, 312, 314, 317, 318],
 [380, 377, 379, 382],
 [466, 469, 471, 472],
 [557, 559, 562, 566, 569]]

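Of course, this breaks down if the groups have very different internal spreads, e.g. [1, 4, 2, 5, 1042, 1230, 920, 3, 2, 5]. If that is the case, you could try the relative difference between numbers, e.g. max(x,y)/min(x,y) instead of abs(x-y).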
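
The relative-difference idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `group_by_relative_step` is made up here, the numbers are assumed to be strictly positive (otherwise the ratio is undefined or meaningless), and the average ratio is used as the cut-off in the same way the average difference was used above.

```python
def group_by_relative_step(nums):
    """Start a new group wherever the neighbour ratio is at or above the average ratio.

    Hypothetical helper; assumes every number is strictly positive.
    """
    if not nums:
        return []
    # ratio of the larger to the smaller neighbour, always >= 1
    ratios = [max(x, y) / min(x, y) for x, y in zip(nums, nums[1:])]
    if not ratios:
        return [list(nums)]  # a single number forms a single group
    avg_ratio = sum(ratios) / len(ratios)

    groups = [[nums[0]]]  # first group already has the first number
    for y, r in zip(nums[1:], ratios):
        if r < avg_ratio:
            groups[-1].append(y)  # small relative change: same group
        else:
            groups.append([y])    # large relative change: new group
    return groups


mixed = [1, 4, 2, 5, 1042, 1230, 920, 3, 2, 5]
relative_groups = group_by_relative_step(mixed)
print(relative_groups)
# [[1, 4, 2, 5], [1042, 1230, 920], [3, 2, 5]]
```

With absolute differences, the jump from 1230 to 920 (310) is above the average difference for this list, so 920 would be split off into its own group; the ratio 1230/920 is small, so the relative version keeps it in place.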

【Comments】:

The one-liner and the version above it are not the same!
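@ddofborg The results are actually the same, only one is a list of lists and the other is a list of tuples.
OK. I ran both versions on some data and the results differed. Maybe I made a mistake somewhere.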
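
One possible explanation for the discrepancy, offered as an assumption since the commenter's data is not shown: the one-liner only keeps runs of pairs whose difference is below the threshold, so a number that is far from both of its neighbours never appears in any group, while the loop version gives it a group of its own. The sketch below wraps both versions in hypothetical functions (`group_loop`, `group_one_liner`) with an explicit `threshold` argument and runs them on made-up data.

```python
import functools
import itertools


def group_loop(nums, threshold):
    """Loop version: every number ends up in exactly one group."""
    groups = [[nums[0]]]
    for x, y in zip(nums, nums[1:]):
        if abs(x - y) < threshold:
            groups[-1].append(y)
        else:
            groups.append([y])
    return groups


def group_one_liner(nums, threshold):
    """groupby/reduce version: only runs of 'close' pairs are kept."""
    return [functools.reduce(lambda A, b: A + (b[1],) if A else b, group, None)
            for key, group in itertools.groupby(
                zip(nums, nums[1:]),
                key=lambda t: abs(t[0] - t[1]) < threshold) if key]


sample = [1, 2, 50, 100, 101]  # made-up data: 50 is far from both neighbours
loop_result = group_loop(sample, 18.3)
one_liner_result = group_one_liner(sample, 18.3)
print(loop_result)       # [[1, 2], [50], [100, 101]]
print(one_liner_result)  # [(1, 2), (100, 101)]
```

Note also that the one-liner hard-codes 18.3, the average difference of the example array, so on other data it uses a different cut-off than the loop version unless the threshold is recomputed.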

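【Solution 2】:

I tried doing what I suggested in my comment. I think this gives a good solution to the more general problem, but be warned that I have not considered every edge case or the algorithmic complexity here.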

import numpy as np

# function to initialize clusters
def init_clusters(x, num_elements_per_cluster=3):
    # initialize clusters by splitting into n groups
    x.sort()  # sort the list
    nclusters = len(x) // num_elements_per_cluster  # integer division so range() and dict keys work
    clusters = {i: {'values': []} for i in range(nclusters)}

    # assign to clusters (helps that list is sorted)
    for i in range(len(x)):
        index = min(i // num_elements_per_cluster, nclusters-1)
        clusters[index]['values'].append(x[i])

    # compute variance
    for index in clusters:
        clusters[index]['var'] = np.var(clusters[index]['values'])

    return clusters

def get_avg_var(clusters):
    total_var = 0.0
    denom = 0.0
    for index in clusters:
        total_var += clusters[index]['var'] * len(clusters[index]['values'])
        denom += len(clusters[index]['values'])
    return total_var / denom  # possible div by 0, but shouldn't happen

def assign_value_to_cluster(clusters, value):
    """
    add value to a cluster such that results in the lowest variance
    """
    new_cluster_vars = []
    indices = []
    for index in clusters:
        new_cluster_vars.append(np.var(clusters[index]['values'] + [value]))
        indices.append(index)

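    pos = np.argmin(new_cluster_vars)  # position of the lowest resulting variance
    index_min_new_cluster_var = indices[pos]
    clusters[index_min_new_cluster_var]['values'].append(value)
    # update the variance (index the list by position, not by cluster key,
    # because keys are no longer contiguous once a cluster has been removed)
    clusters[index_min_new_cluster_var]['var'] = new_cluster_vars[pos]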


def purify(clusters):
    curr_var = get_avg_var(clusters)
    prev_var = curr_var*10
    max_iter = 1000
    iter_count = 0
    while(curr_var < prev_var):
        if iter_count > max_iter:
            break

        prev_var = curr_var

        # start with the cluster with the highest variance
        sorted_vars = sorted(
            [{'index': i, 'var': clusters[i]['var']} for i in clusters],
            key=lambda x: x['var'], 
            reverse=True
        )
        highest_var_index = sorted_vars[0]['index']

        vals = clusters[highest_var_index]['values']
        if len(vals) > 2:
            # find the element that when removed will minimize the variance of this cluster
            dropout_variance = [np.var([vals[j] for j in range(len(vals)) if j != i]) for i in range(len(vals))]
            index_to_drop = np.argmin(dropout_variance)
            value_to_reassign = clusters[highest_var_index]['values'].pop(index_to_drop)
            # update the variances
            clusters[highest_var_index]['var'] = dropout_variance[index_to_drop]
            assign_value_to_cluster(clusters, value_to_reassign)
        else:
            # break this cluster and assign values to others
            clusters.pop(highest_var_index)
            for val in vals:
                assign_value_to_cluster(clusters, val)

        curr_var = get_avg_var(clusters)
        print("after iter %04d: %04.2f" % (iter_count, curr_var))
        iter_count += 1

    return clusters

Running the algorithm on the sample data the OP provided:

# vector x of values that we want to cluster
x = [
    101, 107, 106, 199, 204, 205, 207, 306, 310, 312,
    312, 314, 317, 318, 380, 377, 379, 382, 466, 469,
    471, 472, 557, 559, 562, 566, 569
]

clusters = init_clusters(x)
final_clusters = purify(clusters)

# print values of the final clusters
[final_clusters[y]['values'] for y in final_clusters]

Output:

[[101, 106, 107],
 [204, 205, 207, 199],
 [306, 310],
 [312, 312, 314],
 [317, 318],
 [379, 380, 382, 377],
 [466, 469, 471, 472],
 [557, 559],
 [562, 566, 569]]

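Edit: fixed a bug in get_avg_var() and realized I was not updating the cluster variances. This is sensitive to initialization, but it generally gives reasonable clusters. That said, you can define your own optimization objective (instead of using the average cluster variance as I did).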

【Comments】:

【Solution 3】:

From what you posted, it looks like a step occurs when abs(array[i]-array[i+1]) > 6. You can use this:

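final = []
current = []
arr = [101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318, 380, 377, 379, 382, 466, 469, 471, 472, 557, 559, 562, 566, 569]
for i in range(len(arr)-1):
    current.append(arr[i])
    if abs(arr[i] - arr[i+1]) > 6:
        # step detected: close the current group and start a new one
        final.append(current)
        current = []
# the loop stops one short of the end, so add the last number
# and flush the last group (assumes arr is not empty)
current.append(arr[-1])
final.append(current)

Output:

[[101, 107, 106], [199, 204, 205, 207], [306, 310, 312, 312, 314, 317, 318], [380, 377, 379, 382], [466, 469, 471, 472], [557, 559, 562, 566, 569]]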

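【Comments】:

Not sure why someone downvoted this; the solution does work. One problem, though, is that it is very rigid: for other arrays with larger level changes I have to edit the code by hand.
@melonb Can you post a case with larger level changes? This solution can be adapted to it.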
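
One way to reduce the rigidity raised in the comments is to pass the step size in as an argument instead of hard-coding 6. The following is a minimal sketch under that assumption; the function name `split_on_steps` and its `threshold` parameter are hypothetical, and the second input array is made up for illustration.

```python
def split_on_steps(arr, threshold=6):
    """Start a new group whenever neighbours differ by more than threshold."""
    if not arr:
        return []
    groups = [[arr[0]]]
    for prev, cur in zip(arr, arr[1:]):
        if abs(cur - prev) > threshold:
            groups.append([cur])    # step detected: start a new group
        else:
            groups[-1].append(cur)  # still on the same level
    return groups


arr = [101, 107, 106, 199, 204, 205, 207, 306, 310, 312, 312, 314, 317, 318,
       380, 377, 379, 382, 466, 469, 471, 472, 557, 559, 562, 566, 569]
default_groups = split_on_steps(arr)            # same threshold as above
wide_groups = split_on_steps([10, 30, 20, 500, 520, 510], threshold=50)
print(len(default_groups))  # 6
print(wide_groups)          # [[10, 30, 20], [500, 520, 510]]
```

The threshold still has to be chosen per data set; Solution 1's average difference is one way to derive it from the data instead.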