Better multiple array sort, based on the first array

【Question Title】: better multiple array sort, based on first array
【Posted】: 2021-03-15 07:42:15
【Question】:

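I'm working on updating the SVG::Graph gem and have made a number of improvements in my version, but I've found a bottleneck in its multiple-array sort.

There is a built-in "sort_multiple" function that keeps an array of arrays (all of equal size) sorted by the first array in the group.

The problem is that this sort performs well on truly random data but very poorly on sorted or nearly sorted data: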

def sort_multiple( arrys, lo=0, hi=arrys[0].length-1 )
  if lo < hi
    p = partition(arrys,lo,hi)
    sort_multiple(arrys, lo, p-1)
    sort_multiple(arrys, p+1, hi)
  end
  arrys
end

def partition( arrys, lo, hi )
  p = arrys[0][lo]
  l = lo
  z = lo+1
  while z <= hi
    if arrys[0][z] < p
      l += 1
      arrys.each { |arry| arry[z], arry[l] = arry[l], arry[z] }
    end
    z += 1
  end
  arrys.each { |arry| arry[lo], arry[l] = arry[l], arry[lo] }
  l
end

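This routine appears to use a variant of the Lomuto partition scheme described on Wikipedia: https://en.wikipedia.org/wiki/Quicksort#Lomuto_partition_scheme

I have an array of 5000+ numbers that is already sorted, and this function adds about half a second to every graph.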
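
To make the slowdown concrete: the partition above always takes the first element of the range as its pivot, so on already-sorted input each partition splits off only one element and the number of comparisons grows as n(n-1)/2. The following is a minimal sketch that counts comparisons for a single array. count_comparisons is a hypothetical helper written for this illustration (it is not part of SVG::Graph), and n is kept small so that it runs quickly.

```ruby
# Hypothetical helper: counts the comparisons made by a quicksort that uses
# the same first-element-pivot Lomuto partition as `partition` above,
# applied to a single array instead of an array of arrays.
def count_comparisons(arr)
  arr = arr.dup                     # leave the caller's array untouched
  count = 0
  stack = [[0, arr.length - 1]]     # explicit stack instead of recursion
  until stack.empty?
    lo, hi = stack.pop
    next unless lo < hi
    pivot = arr[lo]
    l = lo
    (lo + 1..hi).each do |z|
      count += 1                    # one comparison against the pivot
      if arr[z] < pivot
        l += 1
        arr[z], arr[l] = arr[l], arr[z]
      end
    end
    arr[lo], arr[l] = arr[l], arr[lo]
    stack.push([lo, l - 1], [l + 1, hi])
  end
  count
end

n = 1000
sorted_count = count_comparisons((1..n).to_a)
random_count = count_comparisons((1..n).to_a.shuffle(random: Random.new(42)))

sorted_count == n * (n - 1) / 2  #=> true (quadratic worst case)
random_count < sorted_count      #=> true
```

Picking the pivot from the middle of the range, or at random, is a common way to avoid this worst case on sorted input.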

I have modified the "sort_multiple" routine as follows:

def sort_multiple( arrys, lo=0, hi=arrys[0].length-1 )
  first = arrys.first
  return arrys if first == first.sort

  if lo < hi
...

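This has "fixed" the problem for already-sorted data, but I'd like to know whether there is any way to use Ruby's better built-in sorting to make this sort faster. For example, do you think I could use TSort to speed it up? https://ruby-doc.org/stdlib-2.6.1/libdoc/tsort/rdoc/TSort.html

Looking at my benchmark, a completely random first array seems to be very fast.

Current benchmark: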

def sort_multiple( arrys, lo=0, hi=arrys[0].length-1 )
  if lo < hi
    p = partition(arrys,lo,hi)
    sort_multiple(arrys, lo, p-1)
    sort_multiple(arrys, p+1, hi)
  end
  arrys
end

def partition( arrys, lo, hi )
  p = arrys[0][lo]
  l = lo
  z = lo+1
  while z <= hi
    if arrys[0][z] < p
      l += 1
      arrys.each { |arry| arry[z], arry[l] = arry[l], arry[z] }
    end
    z += 1
  end
  arrys.each { |arry| arry[lo], arry[l] = arry[l], arry[lo] }
  l
end

first = (1..5400).map { rand }
second = (1..5400).map { rand }
unsorted_arrys = [first.dup, second.dup, Array.new(5400), Array.new(5400), Array.new(5400)]
sorted_arrys = [first.sort, second.dup, Array.new(5400), Array.new(5400), Array.new(5400)]
require 'benchmark'
Benchmark.bmbm do |x|
  x.report("unsorted") { sort_multiple( unsorted_arrys.map(&:dup) ) }
  x.report("sorted") { sort_multiple( sorted_arrys.map(&:dup) ) }
end

Results:

Rehearsal --------------------------------------------
unsorted   0.070699   0.000008   0.070707 (  0.070710)
sorted     0.731734   0.000000   0.731734 (  0.731742)
----------------------------------- total: 0.802441sec

               user     system      total        real
unsorted   0.051636   0.000000   0.051636 (  0.051636)
sorted     0.715730   0.000000   0.715730 (  0.715733)

#EDIT#

The solution I finally accepted:

def sort( *arrys )
  new_arrys = arrys.transpose.sort_by(&:first).transpose
  new_arrys.each_index { |k| arrys[k].replace(new_arrys[k]) }
end
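
A short usage sketch of the method above, to show that it updates the arrays that are passed in. The sample data (keys, labels) is hypothetical. The method definition is repeated only so that the snippet runs on its own.

```ruby
# The accepted method, reproduced from above.
def sort( *arrys )
  new_arrys = arrys.transpose.sort_by(&:first).transpose
  new_arrys.each_index { |k| arrys[k].replace(new_arrys[k]) }
end

# Hypothetical sample data.
keys   = [3, 1, 2]
labels = [:c, :a, :b]
keys_id = keys.object_id

sort(keys, labels)

keys    #=> [1, 2, 3]
labels  #=> [:a, :b, :c]
keys.object_id == keys_id  #=> true (same object, contents replaced)
```

Note that Array#transpose raises an IndexError if the arrays are not all the same length, so this relies on the equal-size guarantee stated in the question.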

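【Comments】:

arrays.transpose.sort_by(&:first).transpose might be worth a try.
Add it as an answer and I'll upvote you! 0.004 seconds per operation ... the only (minor) issue is that it doesn't update the arrays in place
So the question is really how to sort my SOA efficiently?
That probably depends on what SOA means: acronyms.thefreedictionary.com/SOA.

Tags: arrays ruby-on-rails ruby sorting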

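【Solution 1】:

"I have an array of 5000+ numbers that is already sorted, and this function adds about half a second to every graph."

Unfortunately, algorithms implemented in Ruby can become quite slow. Delegating the work to built-in methods implemented in C is often much faster, even if that incurs some overhead.

To sort the nested array, you can first transpose it, then sort_by the first element of each row, then transpose it again:

arrays.transpose.sort_by(&:first).transpose

Here is how it works: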

arrays              #=> [[3, 1, 2], [:c, :a, :b]]
  .transpose        #=> [[3, :c], [1, :a], [2, :b]]
  .sort_by(&:first) #=> [[1, :a], [2, :b], [3, :c]]
  .transpose        #=> [[1, 2, 3], [:a, :b, :c]]

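Although it creates several temporary arrays along the way, the result seems to be an order of magnitude faster than the "unsorted" variant: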

unsorted   0.035297   0.000106   0.035403 (  0.035458)
sorted     0.474134   0.003065   0.477199 (  0.480667)
transpose  0.001572   0.000082   0.001654 (  0.001655)

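In the long run, you could try implementing your algorithm as a C extension.

【Comments】:

【Solution 2】:

I admit that I don't fully understand the question and don't have time to study the code at the link, but it seems you have a sorted array that you repeatedly mutate slightly, and with each change you may also change several other arrays, each a little or a lot. After each set of mutations you re-sort the first array, then rearrange each of the other arrays according to how the indices of the elements in the first array changed.

For example, suppose the first array is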

arr = [2,4,6,8,10]

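and the changes to arr are to replace the element at index 1 (4) with 9 and the element at index 3 (8) with 3. arr then becomes [2,9,6,3,10], which after re-sorting becomes [2,3,6,9,10]. We can do that as follows: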

new_arr, indices = [2,9,6,3,10].each_with_index.sort.transpose
  #=> [[2, 3, 6, 9, 10], [0, 3, 2, 1, 4]]

Therefore,

new_arr
  #=> [2, 3, 6, 9, 10]
indices
  #=> [0, 3, 2, 1, 4]

The intermediate calculation is

[2,9,6,3,10].each_with_index.sort
   #=> [[2, 0], [3, 3], [6, 2], [9, 1], [10, 4]]

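Noting that

new_arr == [2,9,6,3,10].values_at(*indices)
   #=> true

we see that each of the other arrays, after being mutated, can be reordered to match the sort order of the first array by using the following method, which is very fast.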

def sort_like_first(a, indices)
  a.values_at(*indices)
end

For example,

a = [5,4,3,1,2]
a.replace(sort_like_first a, indices)
a #=> [5, 1, 3, 4, 2]

a = %w|dog cat cow pig owl|
a.replace(sort_like_first a, indices)
a #=> ["dog", "pig", "cow", "cat", "owl"]

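In fact, there is no need to sort any of the other arrays until it is needed in a calculation.

I would now like to consider a special case, where only a single element of the first array is to be changed.

Suppose (as before)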
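
The two steps above (compute the index permutation once, then apply it to every array) can be combined into one in-place helper. This is only a sketch: sort_all_by_first! is a hypothetical name, it assumes a non-empty array of equal-length arrays, and it derives the indices with sort_by over the index range rather than each_with_index.sort.transpose, a substitution made for this illustration.

```ruby
# Hypothetical helper: reorders every array in `arrays` in place so that
# the first array ends up sorted and the others follow the same permutation.
def sort_all_by_first!(arrays)
  first = arrays.first
  # indices[i] is the original position of the element that ends up at i.
  indices = first.each_index.sort_by { |i| first[i] }
  arrays.each { |a| a.replace(a.values_at(*indices)) }
end

arrays = [[2, 9, 6, 3, 10], [5, 4, 3, 1, 2], %w|dog cat cow pig owl|]
sort_all_by_first!(arrays)

arrays[0]  #=> [2, 3, 6, 9, 10]
arrays[1]  #=> [5, 1, 3, 4, 2]
arrays[2]  #=> ["dog", "pig", "cow", "cat", "owl"]
```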

arr = [2,4,6,8,10]

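and the element at index 3 (8) is to be replaced with 5, producing [2,4,6,5,10]. The re-sort can be done quickly with the following method, which uses a binary search.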

def new_indices(arr, replace_idx, replace_val) 
  new_loc = arr.bsearch_index { |n| n >= replace_val } || arr.size
  indices = (0..arr.size-1).to_a
  index_removed = indices.delete_at(replace_idx)
  new_loc -= 1 if new_loc > replace_idx
  indices.insert(new_loc, index_removed)
end

arr.bsearch_index { |n| n >= replace_val } returns nil if n >= replace_val #=> false for all n. It is for that reason that I added || arr.size.

See Array#bsearch_index, Array#delete_at and Array#insert.

Let's try it. If
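
A small sketch of the bsearch_index behavior that new_indices relies on, using the same example array. The variable names hit, miss and new_loc are chosen here for illustration only.

```ruby
arr = [2, 4, 6, 8, 10]

# Find-minimum mode: index of the first element for which the block is true.
hit  = arr.bsearch_index { |n| n >= 5 }   #=> 2 (the position of 6)

# No element is >= 12, so the search returns nil.
miss = arr.bsearch_index { |n| n >= 12 }  #=> nil

# The fallback used in new_indices: treat "no such element" as "insert at the end".
new_loc = miss || arr.size                #=> 5
```

The binary search only gives a meaningful answer because arr is sorted, which is why new_indices must be called before the element is replaced.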

arr = [2,4,6,8,10]
replace_idx = 3
replace_val = 5

then

indices = new_indices(arr, replace_idx, replace_val)
  #=> [0, 1, 3, 2, 4]

Only now can we replace the element of arr at index replace_idx.

arr[replace_idx] = replace_val
arr
  #=> [2, 4, 6, 5, 10]

We see that the re-sorted array is as follows.

arr.values_at(*indices)
  #=> [2, 4, 5, 6, 10]

The other arrays are sorted as before, using sort_like_first:

a = [5,4,3,1,2]
a.replace(sort_like_first(a, indices))
  #=> [5, 4, 1, 3, 2]

a = %w|dog cat cow pig owl|
a.replace(sort_like_first(a, indices))
  #=> ["dog", "cat", "pig", "cow", "owl"]

Here is a second example.

arr = [2,4,6,8,10]
replace_idx =  3
replace_val = 12
indices = new_indices(arr, replace_idx, replace_val)
  #=> [0, 1, 2, 4, 3]

arr[replace_idx] = replace_val
arr
  #=> [2, 4, 6, 12, 10]

The sorted first array is therefore

arr.values_at(*indices)
  #=> [2, 4, 6, 10, 12]

The other arrays are sorted as follows.

a = [5,4,3,1,2]
a.replace(sort_like_first a, indices)
a #=> [5, 4, 3, 2, 1]

a = %w|dog cat cow pig owl|
a.replace(sort_like_first a, indices)
a #=> ["dog", "cat", "cow", "owl", "pig"]
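
As a sanity check on the single-replacement method, the following sketch compares it against a full sort on randomly generated cases. new_indices is copied from above. The random generator seed, the value ranges and the number of trials are arbitrary choices made for this illustration.

```ruby
# Copied from the answer above.
def new_indices(arr, replace_idx, replace_val)
  new_loc = arr.bsearch_index { |n| n >= replace_val } || arr.size
  indices = (0..arr.size-1).to_a
  index_removed = indices.delete_at(replace_idx)
  new_loc -= 1 if new_loc > replace_idx
  indices.insert(new_loc, index_removed)
end

rng = Random.new(1)   # fixed seed so the run is repeatable
mismatches = 0

200.times do
  # A sorted array of 1..20 integers, possibly containing duplicates.
  arr = Array.new(rng.rand(1..20)) { rng.rand(0..50) }.sort
  replace_idx = rng.rand(arr.size)
  replace_val = rng.rand(-5..55)

  indices = new_indices(arr, replace_idx, replace_val)  # computed before the change
  arr[replace_idx] = replace_val

  is_permutation = indices.sort == (0...arr.size).to_a
  is_sorted      = arr.values_at(*indices) == arr.sort
  mismatches += 1 unless is_permutation && is_sorted
end

mismatches  #=> 0
```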

【Comments】:

Thanks, this is nice, but I'll go with Stefan's suggestion