Which parallel sorting algorithm has the best average case performance? Answers

【Question title】: Which parallel sorting algorithm has the best average case performance?
【Posted】: 2011-04-27 13:20:54
【Question description】:

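In the serial case, sorting takes O(n log n). If we have O(n) processors, we would hope for a linear speedup. O(log n) parallel algorithms exist, but they have very high constants. They also aren't applicable to commodity hardware, which has nowhere near O(n) processors. With p processors, a reasonable algorithm should take O(n/p log n) time.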
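To make the "linear speedup" target concrete, here is a small worked equation. It is only a sketch: it assumes the serial and parallel algorithms share the same hidden constant c and that communication and synchronization cost nothing, neither of which holds exactly in practice.

```latex
S(p) \;=\; \frac{T_1(n)}{T_p(n)}
     \;=\; \frac{c \, n \log n}{c \, (n/p) \log n}
     \;=\; p
```

Under those assumptions, an O(n/p log n) algorithm on p = 32 cores would ideally finish in 1/32 of the serial time.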

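In the serial case, quicksort has the best average-case runtime complexity. A parallel quicksort is easy to implement (see here and here). However, it doesn't perform well, because the very first step is to partition the whole collection on a single core. I have found information on many parallel sorting algorithms, but so far I have not seen anything that points to a clear winner.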
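For illustration, here is a minimal fork/join quicksort sketch in Java. It is not the code behind either of the links above; the class name `ParallelQuicksortSketch` and the `CUTOFF` value are assumptions made for this example. Note that the first call to `partition` scans the entire array on one thread before any work can be forked, which is the bottleneck described above.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical sketch class; not taken from the question or its links.
class ParallelQuicksortSketch {
    // Ranges shorter than this go to the sequential JDK sort (assumed value, untuned).
    static final int CUTOFF = 1 << 13;

    static final class SortTask extends RecursiveAction {
        final int[] a;
        final int lo, hi; // sorts a[lo..hi], both inclusive

        SortTask(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

        @Override
        protected void compute() {
            if (hi - lo < CUTOFF) {
                Arrays.sort(a, lo, hi + 1);
                return;
            }
            // Sequential partition: for the root task this touches all n elements
            // on a single thread before any parallelism is available.
            int p = partition(a, lo, hi);
            invokeAll(new SortTask(a, lo, p), new SortTask(a, p + 1, hi));
        }
    }

    // Hoare partition around the middle element; returns j with lo <= j < hi
    // such that every element of a[lo..j] is <= every element of a[j+1..hi].
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo - 1, j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) return j;
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    static void sort(int[] a) {
        if (a.length > 1) {
            ForkJoinPool.commonPool().invoke(new SortTask(a, 0, a.length - 1));
        }
    }

    public static void main(String[] args) {
        int[] data = new Random(1).ints(1_000_000).toArray();
        sort(data);
        boolean ok = true;
        for (int i = 1; i < data.length; i++) ok &= data[i - 1] <= data[i];
        System.out.println("sorted: " + ok);
    }
}
```

The sketch sorts in place and leans on the fork/join pool's work stealing to balance uneven partitions, but the top levels of the recursion still offer little parallelism.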

I'm looking to sort lists of 1 million to 100 million elements in a JVM language running on 8 to 32 cores.

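【Question discussion】:

I think you have one too many n/p in your "should take".
@Sparr I don't think so. I'm distinguishing between having a few processors and having as many processors as there are elements being sorted.
@CraigP.Motlin Right, but you seem to have "distributed" the /p incorrectly. There should be only one /p.
@Sparr Ah, changed, thanks.
@CraigP.Motlin I think you left the wrong one :)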

Tags: algorithm sorting concurrency

【Solution 1】:

The following article (PDF download) is a comparative study of parallel sorting algorithms on various architectures:

Parallel sorting algorithms on various architectures

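According to the article, sample sort seems to be the best on many parallel architecture types.

Update to address Mark's concern about age:

Here are more recent articles that introduce something more novel (from 2007, and which, by the way, still get compared with sample sort):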

Improvements on sample sort
AA-Sort

The bleeding edge (circa 2010, some only a couple of months old):

Parallel sorting pattern
Many-core GPU based parallel sorting
Hybrid CPU/GPU parallel sort
Randomized Parallel Sorting Algorithm with an Experimental Study
Highly scalable parallel sorting
Sorting N-Elements Using Natural Order: A New Adaptive Sorting Approach

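Update for 2013: Here is the bleeding edge as of about January 2013. (Note: a few of the links point to papers on Citeseer, which requires free registration.)

University lectures: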
Parallel Partitioning for Selection and Sorting
Parallel Sorting Algorithms Lecture
Parallel Sorting Algorithms Lecture 2
Parallel Sorting Algorithms Lecture 3

Other sources and papers:
A novel sorting algorithm for many-core architectures based on adaptive bitonic sort
Highly Scalable Parallel Sorting 2
Parallel Merging
Parallel Merging 2
Parallel Self-Sorting System for Objects
Performance Comparison of Sequential Quick Sort and Parallel Quick Sort Algorithms
Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs
Various parallel algorithms (sorting et al) including implementations

GPU and CPU/GPU hybrid sources and papers:
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
Data Sorting Using Graphics Processing Units
Efficient Algorithms for Sorting on GPUs
Designing efficient sorting algorithms for manycore GPUs
Deterministic Sample Sort For GPUs
Fast in-place sorting with CUDA based on bitonic sort
Fast parallel GPU-sorting using a hybrid algorithm
Fast Parallel Sorting Algorithms on GPUs
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
GPU sample sort
GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures
GPUTeraSort: high performance graphics co-processor sorting for large database management
High performance comparison-based sorting algorithm on many-core GPUs
Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead
Sorting on GPUs for large scale datasets: A thorough comparison

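Update for 2021: I haven't forgotten about this answer, and like all things computer-related, it has not aged well. At some point before the end of this year (2021), I will do my best to update and refresh it to reflect current trends and the state of the art.

【Discussion】:

This is a comparative study of parallel sorting algorithms on various architectures that were current in 1996. Parallel computing has changed a lot since then.
You seem to have missed, IMHO, Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture, from Intel Research, presented at VLDB 2008.
This would have been a good answer, once. Now most of the links are broken.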

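【Solution 2】:

I've worked with both a Parallel Quicksort algorithm and the PSRS algorithm, which essentially combines quicksort in parallel with merging.

With the Parallel Quicksort algorithm, I have demonstrated near-linear speedup on up to 4 cores (dual core with hyper-threading), which is to be expected given the limitations of the algorithm. A pure Parallel Quicksort relies on a shared stack resource, which causes contention among threads and so reduces any performance gain. The advantage of this algorithm is that it sorts "in place", which reduces the amount of memory needed. You may want to consider this when sorting upwards of 100M elements, as you stated.

I see you are looking to sort on a system with 8-32 cores. The PSRS algorithm avoids contention on a shared resource, allowing speedup at higher numbers of processes. As above, I have demonstrated the algorithm on up to 4 cores, but experimental results from others report near-linear speedup with much larger core counts, 32 and beyond. The disadvantage of the PSRS algorithm is that it is not in place and requires considerably more memory.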
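As a rough illustration of the idea, here is a minimal PSRS (Parallel Sorting by Regular Sampling) sketch in Java. It is not taken from the answer author's repository; the class name `PsrsSketch`, the use of parallel streams, and the final `Arrays.sort` call standing in for a proper p-way merge are all assumptions made to keep the example short. Each worker only touches its own chunk or its own output bucket, so there is no shared work stack to contend on, at the cost of a full extra copy of the data.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;

// Hypothetical sketch class; a simplified PSRS for int arrays.
class PsrsSketch {
    static int[] sort(int[] a, int p) {
        int n = a.length;
        if (p < 2 || n < p * p) { // too small to sample: fall back to sequential sort
            int[] c = a.clone();
            Arrays.sort(c);
            return c;
        }
        int[] src = a.clone(); // not in place: works on a copy
        int[] start = new int[p + 1]; // chunk i is src[start[i] .. start[i+1])
        for (int i = 0; i <= p; i++) start[i] = (int) ((long) i * n / p);

        // Phase 1: sort each chunk independently, in parallel.
        IntStream.range(0, p).parallel()
                 .forEach(i -> Arrays.sort(src, start[i], start[i + 1]));

        // Phase 2: take p regularly spaced samples per chunk, then pick p-1 pivots.
        int[] samples = new int[p * p];
        for (int i = 0; i < p; i++) {
            int len = start[i + 1] - start[i];
            for (int k = 0; k < p; k++) {
                samples[i * p + k] = src[start[i] + (int) ((long) k * len / p)];
            }
        }
        Arrays.sort(samples);
        int[] pivots = new int[p - 1];
        for (int k = 1; k < p; k++) pivots[k - 1] = samples[k * p + p / 2 - 1];

        // Phase 3: split every sorted chunk at the pivots.
        // Bucket j of chunk i is src[cut[i][j] .. cut[i][j+1]).
        int[][] cut = new int[p][p + 1];
        for (int i = 0; i < p; i++) {
            cut[i][0] = start[i];
            cut[i][p] = start[i + 1];
            for (int j = 1; j < p; j++) {
                cut[i][j] = upperBound(src, cut[i][j - 1], start[i + 1], pivots[j - 1]);
            }
        }

        // Phase 4: compute where each bucket starts in the output, then let
        // worker j gather bucket j from every chunk and order it.
        int[] offset = new int[p + 1];
        for (int j = 0; j < p; j++) {
            int size = 0;
            for (int i = 0; i < p; i++) size += cut[i][j + 1] - cut[i][j];
            offset[j + 1] = offset[j] + size;
        }
        int[] out = new int[n];
        IntStream.range(0, p).parallel().forEach(j -> {
            int pos = offset[j];
            for (int i = 0; i < p; i++) {
                int len = cut[i][j + 1] - cut[i][j];
                System.arraycopy(src, cut[i][j], out, pos, len);
                pos += len;
            }
            Arrays.sort(out, offset[j], offset[j + 1]); // stand-in for a p-way merge
        });
        return out;
    }

    // First index in a[lo..hi) whose value is greater than key.
    static int upperBound(int[] a, int lo, int hi, int key) {
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] data = new Random(1).ints(1_000_000).toArray();
        int[] sorted = sort(data, Runtime.getRuntime().availableProcessors());
        boolean ok = true;
        for (int i = 1; i < sorted.length; i++) ok &= sorted[i - 1] <= sorted[i];
        System.out.println("sorted: " + ok);
    }
}
```

The regular sampling step is what keeps the buckets roughly balanced; the sketch holds the input copy and the output array at the same time, which reflects the memory cost mentioned above.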

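If you are interested, you can use or peruse my Java code for each of these algorithms. You can find it on github: https://github.com/broadbear/sort. The code is intended as a drop-in replacement for Java Collections.sort(). If you are looking for the ability to perform parallel sorting on the JVM as described above, the code in my repo may help you out. The API is fully genericized for elements that implement Comparable or for which you supply your own Comparator.

May I ask what you are sorting that many elements for? I'd like to know about potential applications for my sorting package.

【Discussion】:

I have an 8-core processor. :) I've now tested sorting upwards of 40M elements. I don't see linear speedup, but I do see a large improvement over the standard Java 8 Collections sort algorithm, which is supposed to be a multithreaded Timsort. My PSRS implementation sorts 40M elements in 4985 ms on average, versus 19759 ms for the default JDK sort algorithm.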

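【Solution 3】:

Take a look at this paper: A Scalable Parallel Sorting Algorithm Using Exact Splitting. It is concerned with more than 32 cores. However, it describes in detail an algorithm that has a running time complexity of O(n/p * log(n) + p * log(n)**2) and works with arbitrary comparators.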
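As a rough check of what that bound means for the sizes in the question, the second term stays below the first as long as the following holds. This is only a back-of-the-envelope estimate that assumes base-2 logarithms and ignores the constants hidden by the O-notation.

```latex
p \log^2 n \;\le\; \frac{n}{p} \log n
\quad\Longleftrightarrow\quad
p^2 \;\le\; \frac{n}{\log n}

n = 10^6:\quad \frac{n}{\log_2 n} \approx 5.0 \times 10^4 \;\Rightarrow\; p \lesssim 220
\qquad
n = 10^8:\quad \frac{n}{\log_2 n} \approx 3.8 \times 10^6 \;\Rightarrow\; p \lesssim 1900
```

Under those assumptions, for 8 to 32 cores and 1 million to 100 million elements the n/p * log(n) term dominates, so the bound behaves like the O(n/p log n) target from the question.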

【Discussion】:

【Solution 4】:

The paper "Comparison of Parallel Sorting Algorithms on Different Architectures" may be a good place for you to start.

【Discussion】: