Python: Sort list of lists numerically (answers)

【Question title】: Python: Sort list of lists numerically
【Posted】: 2018-11-30 05:37:06
【Question description】:

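I have a list of x,y coordinates that I need to sort by the x coordinate, then by the y coordinate when x is the same, and I need to eliminate duplicates of the same coordinates. For example, if the list is: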

[[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], 
 [349.9, 486.6], [450.0, 313.3]]

I need to rearrange it to:

[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6],
 [500.0, 400.0]]

(one duplicate of [450.0, 313.3] removed)

【Question discussion】:

Tags: python python-3.x list sorting

【Solution 1】:

That is the normal sort order for a list of lists anyway. Use a dict to de-duplicate it.

>>> L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
>>> sorted({tuple(x): x for x in L}.values())
[[300.0, 400.0],
 [349.9, 486.6],
 [350.0, 313.3],
 [450.0, 313.3],
 [450.0, 486.6],
 [500.0, 400.0]]
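
Why the keys are built with tuple(x), and how a set-based variant would compare, can be sketched roughly as follows. This is only an illustration: the names by_dict, by_set and lists_are_hashable are made up here and are not part of the answer.

```python
L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3],
     [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

# Lists compare element by element, so a plain sort already orders by x, then y.
assert [450.0, 313.3] < [450.0, 486.6] < [500.0, 400.0]

# Lists are unhashable, so they cannot be dict keys or set members directly.
try:
    {[450.0, 313.3]}
    lists_are_hashable = True
except TypeError:
    lists_are_hashable = False

# Dict variant from the answer: tuple keys de-duplicate, values keep the original lists.
by_dict = sorted({tuple(x): x for x in L}.values())

# Set variant (hypothetical alternative): de-duplicate as tuples,
# then convert back to lists for the output.
by_set = [list(t) for t in sorted({tuple(x) for x in L})]
```

Both variants produce the same rows. The dict form avoids the final conversion back to lists because its values are the original list objects.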

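【Discussion】:

Is a dict more efficient than a set?
Possibly. A set can't contain lists.
@OlivierMelançon, that is exactly what I did :)
But you have to convert back to lists to get the output.
@OlivierMelançon Actually, look at my answer. With an input set that is more random than the one jpp created (L=100000*L), the dict approach is slightly faster than jpp's solution.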

【Solution 2】:

Since we are sorting anyway, we can de-duplicate with groupby:

>>> import itertools
>>> [k for k, g in itertools.groupby(sorted(data))]                                                                 
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
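
The sort is what makes this work, since groupby only merges neighbouring equal items. A small sketch of the difference, using the question's data (the names unsorted_keys and sorted_keys are illustrative only):

```python
from itertools import groupby

data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3],
        [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

# groupby only merges *adjacent* equal items. In the unsorted input the two
# [450.0, 313.3] rows are not neighbours, so both survive.
unsorted_keys = [k for k, _ in groupby(data)]

# After sorting, equal rows sit next to each other and collapse to one key.
sorted_keys = [k for k, _ in groupby(sorted(data))]
```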

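Some timings:

>>> from itertools import groupby
>>> from timeit import timeit
>>> from toolz import unique
>>> import numpy as np # just to create a large example
>>> a = np.random.randint(0, 215, (10000, 2)).tolist()
>>> len([k for k, g in groupby(sorted(a))])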
8977 # ~ 10% duplicates
>>> 
>>> timeit("[k for k, g in groupby(sorted(a))]", globals=globals(), number=1000)
6.1627248489967315
>>> timeit("sorted({tuple(x): x for x in a}.values())", globals=globals(), number=1000)
6.654527607999626
>>> timeit("sorted(unique(a, key=tuple))", globals=globals(), number=1000)
7.198703720991034
>>> timeit("np.unique(a, axis=0).tolist()", globals=globals(), number=1000)
8.848866895001265

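【Discussion】:

Can you comment on the memory implications? I suspect, based on this, that all of them except unique make a copy of the data. Of course, I may be wrong :).
unique has to keep track of what it has already seen. That means that near the end, the input list, unique's seen set and sorted's input list (sorted must collect the entire input before it can start sorting) are all in memory at the same time. So unique has no memory advantage.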
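
The claim that sorted collects its whole input first can be observed with a small tracing sketch. It relies on CPython's behaviour of building a list from the iterable before computing any sort keys; the helper names traced and traced_key are hypothetical.

```python
events = []

def traced(rows):
    """Yield rows one at a time, recording when each one is produced."""
    for row in rows:
        events.append(("yielded", row))
        yield row

def traced_key(row):
    """Sort key that records when sorted() starts working on a row."""
    events.append(("keyed", row))
    return row

rows = [(3, 1), (1, 2), (2, 0)]
result = sorted(traced(rows), key=traced_key)

# Position of the last row pulled from the generator and of the first key call.
last_yield = max(i for i, (kind, _) in enumerate(events) if kind == "yielded")
first_key = min(i for i, (kind, _) in enumerate(events) if kind == "keyed")
```

Every "yielded" event is recorded before the first "keyed" event, so a lazy source does not stop sorted from holding all rows at once.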

【Solution 3】:

What you want seems easy to do with numpy's unique function:

import numpy as np
u = np.unique(data, axis=0) # or np.unique(data, axis=0).tolist()

If you are really worried that the array is not sorted by columns, then in addition to the above, run np.lexsort():

u = u[np.lexsort((u[:,1], u[:,0]))]
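
The key order passed to np.lexsort is easy to get backwards, because the last key is the primary one. A short sketch with the question's data (by_x_then_y and by_y_then_x are illustrative names, not part of the answer):

```python
import numpy as np

data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3],
        [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

# Unique rows; axis=0 keeps each [x, y] pair together.
u = np.unique(data, axis=0)

# np.lexsort treats the LAST key as the primary one,
# so the keys (y, x) sort by x first and break ties on y.
by_x_then_y = u[np.lexsort((u[:, 1], u[:, 0]))]

# Swapping the keys sorts by y first instead.
by_y_then_x = u[np.lexsort((u[:, 0], u[:, 1]))]
```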

Timings (non-random sample):

In [1]: import numpy as np

In [2]: from toolz import unique

In [3]: data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
   ...:  [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
   ...:  

In [4]: L = 100000 * data

In [5]: npL = np.array(L)

In [6]: %timeit sorted(unique(L, key=tuple))
125 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit sorted({tuple(x): x for x in L}.values())
139 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit np.unique(L, axis=0)
732 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit np.unique(npL, axis=0)
584 ms ± 8.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @user3483203 solution:

In [57]: %timeit lex(np.asarray(L))
227 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [58]: %timeit lex(npL)
76.2 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timings (more random sample):

When the sample data is more random, the results are different:

In [29]: npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))

In [30]: L = npL.tolist()

In [31]: %timeit sorted(unique(L, key=tuple))
143 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: %timeit sorted({tuple(x): x for x in L}.values())
134 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [33]: %timeit np.unique(L, axis=0)
78.5 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [34]: %timeit np.unique(npL, axis=0)
54 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @Paul Panzer's solution:

In [36]: import itertools

In [37]: %timeit [k for k, g in itertools.groupby(sorted(L))]
123 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @user3483203 solution:

In [54]: %timeit lex(np.asarray(L))
60.1 ms ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [55]: %timeit lex(npL)
38.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

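【Discussion】:

This doesn't seem to preserve the pairs. np.unique flattens by default.
@PaulPanzer Thanks! Locally I had axis=0, but it was missing when I posted... too late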

【Solution 4】:

We can do this quickly using np.lexsort and some masking

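import numpy as np

def lex(arr):
    # Sort the rows so that identical rows become adjacent.
    tmp = arr[np.lexsort(arr.T), :]
    # Keep the first row, plus every row that differs from its predecessor.
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]
    # Re-sort by x (column 0), breaking ties on y (column 1).
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

L = np.array(L)
lex(L)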

# Output:
[[300.  400. ]
 [349.9 486.6]
 [350.  313.3]
 [450.  313.3]
 [450.  486.6]
 [500.  400. ]]
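
The masking step may be easier to follow in plain Python. The following is a rough analogue of the "differs from the previous row" mask, not the answer's NumPy code; the names rows, keep and deduped are illustrative.

```python
L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3],
     [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

# Sorting puts duplicate rows next to each other.
rows = sorted(L)

# keep[i] is True when row i differs from row i - 1; the first row is always kept.
keep = [True] + [cur != prev for prev, cur in zip(rows, rows[1:])]

# Apply the mask to drop the repeated rows.
deduped = [row for row, flag in zip(rows, keep) if flag]
```

Here the only False entry in keep is the second [450.0, 313.3], which is the row the NumPy mask drops as well.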

Performance

`Functions`

import itertools

import numpy as np
from toolz import unique

def chrisz(arr):
    tmp =  arr[np.lexsort(arr.T),:]
    tmp = tmp[np.append([True],np.any(np.diff(tmp,axis=0),1))]
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

def pp(data):
    return [k for k, g in itertools.groupby(sorted(data))]

def gazer(data):
    return np.unique(data, axis=0)

def wim(L):
    return sorted({tuple(x): x for x in L}.values())

def jpp(L):
    return sorted(unique(L, key=tuple))

`Setup`

from timeit import timeit

import matplotlib.pyplot as plt
import pandas as pd

res = pd.DataFrame(
       index=['chrisz', 'pp', 'gazer', 'wim', 'jpp'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        npL = np.random.randint(1,1000,(c,2)) + np.random.choice(np.random.random(1000), (c, 2))
        L = npL.tolist()
        stmt = '{}(npL)'.format(f) if f in {'chrisz', 'gazer'} else '{}(L)'.format(f)
        setp = 'from __main__ import L, npL, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

`Validation`

npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))    
L = npL.tolist()    
chrisz(npL).tolist() == pp(L) == gazer(npL).tolist() == wim(L) == jpp(L)
True

【Discussion】:

@AGNGazer It should be fixed now, and I believe it actually got faster. I verified it with np.all(np.unique(npL, axis=0) == lex(npL))
Just added your timings. Very fast!

【Solution 5】:

Here is one way using sorted and toolz.unique:

from toolz import unique

res = sorted(unique(L, key=tuple))

print(res)

[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
 [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]

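Note that toolz.unique is also available as the unique_everseen recipe in the itertools documentation. The tuple conversion is necessary because the algorithm checks uniqueness by hashing, via a set.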
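
For readers without toolz installed, the idea can be sketched as a small generator. This is a simplified version in the spirit of the unique_everseen recipe, written for illustration; it is not the exact code from toolz or from the itertools documentation.

```python
def unique_everseen(iterable, key=None):
    """Yield items in first-seen order, skipping any whose key was seen before."""
    seen = set()
    for item in iterable:
        # The key must be hashable; key=tuple makes list rows usable here.
        k = item if key is None else key(item)
        if k not in seen:
            seen.add(k)
            yield item

L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3],
     [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

res = sorted(unique_everseen(L, key=tuple))
```

The generator is lazy, but sorted still drains it completely before returning anything.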

Using set here seems to perform slightly better than dict, but as always you should test with your own data.

L = L*100000

%timeit sorted(unique(L, key=tuple))               # 223 ms
%timeit sorted({tuple(x): x for x in L}.values())  # 243 ms

I suspect this is because unique is lazy, so you have less memory overhead since sorted isn't making a copy of the input data.

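【Discussion】:

Sorry, this is just not correct. sorted consumes the entire sequence before making any comparisons, and it needs an intermediate structure in memory all the same. Whether the unique iteration is lazy is irrelevant here.
@wim, is that intermediate structure a list, or a dict? Creating Python structures has a cost, e.g. try sorted(range(10000)) vs sorted(list(range(10000))); I know which one I prefer.
@wim, the memory benefit is that if you supply a list (or another in-memory iterable), sorted will make its own list before sorting, i.e. if you supply a list it copies it. This copy is what I mean by the intermediate structure.
In my case it is a dict_values view. unique also necessarily has to maintain some intermediate structure of its own (I assume it uses a set). If you didn't notice this, it is because your test is flawed: setting L = L*100000 means the data consists of practically 100% duplicates.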
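
How different the two benchmark inputs are can be checked by measuring the fraction of unique rows. The sketch below uses a smaller repeat count and a seeded stdlib random generator as a stand-in for the NumPy data used in the thread, so the numbers are illustrative only; unique_fraction is a hypothetical helper.

```python
import random

data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3],
        [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

def unique_fraction(rows):
    """Return the share of rows that are distinct (1.0 means no duplicates)."""
    return len({tuple(r) for r in rows}) / len(rows)

# Repeating the sample list: only 6 distinct rows, however long the list gets.
repeated = data * 1000

# Stand-in for a "more random" input of the same length (assumption: seeded
# integer coordinates, not the generator used in the thread).
rng = random.Random(0)
randomised = [[float(rng.randint(1, 999)), float(rng.randint(1, 999))]
              for _ in range(len(repeated))]

repeated_fraction = unique_fraction(repeated)
randomised_fraction = unique_fraction(randomised)
```

With the repeated list almost every row is a duplicate, so the seen set stays tiny and sorted receives only six rows, which is not representative of mostly unique data.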