Read Efficient Data Structures

Introduction

The previous blog post introduced the RUM conjecture. It describes the trade-off between read (RO), update (UO), and memory (MO) overheads that one should take into account when designing data structures and access methods.

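In this post we take a closer look at data structures that are designed for low read overhead and commonly used in practice, namely hash tables, red-black trees, and skip lists. This blog post is the second part of the RUM series.

To compare the different implementations, we use the same example as in the previous post. The task is to implement a set of integers, this time focusing on a small read overhead.

The post is structured as follows. In the first section we recap the solution from the last post that has minimal read overhead. The following sections introduce three practical alternatives in detail, starting with hash tables, then red-black trees, and finally skip lists. The last section wraps up by comparing the three data structures from a theoretical point of view and with a simple runtime experiment.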

Minimizing The Read Overhead

We have already seen the optimal implementation in terms of read overhead: given the possible integers (from 0 to 1000), we can use a boolean array a of size 1001. The array elements are initialized with false. To mark an integer i as a member of the set, we set a[i] := true. Recall the following overheads:

RO = 1, UO = 1, MO → ∞
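As a minimal sketch of this read-optimal set, the boolean array could look like this in Python. The class name `BooleanArraySet` and its method names are made up for illustration and do not come from the previous post.

```python
class BooleanArraySet:
    """Read-optimal set for integers in [0, max_value] (illustrative sketch)."""

    def __init__(self, max_value: int = 1000) -> None:
        # One slot per possible value, all initialized to False.
        self.a = [False] * (max_value + 1)

    def add(self, i: int) -> None:
        self.a[i] = True  # a[i] := true

    def remove(self, i: int) -> None:
        self.a[i] = False

    def contains(self, i: int) -> bool:
        return self.a[i]  # a read is a single array access


s = BooleanArraySet()
s.add(3)
print(s.contains(3), s.contains(4), len(s.a))  # True False 1001
```

Reads and updates touch exactly one slot, but the array needs 1001 slots even though only one value is stored.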

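This is impractical for most real-life scenarios, because the memory overhead grows with the number of possible values. While it might theoretically be possible to use this approach for integers, it becomes infeasible if we try to store, for example, strings in the set. How can we reduce the memory overhead without losing too much read (and write) performance?

In practice this is achieved with hash-based, sorted-tree-based, or list-based approaches. Let's take a look at three data structures that aim for read efficiency without sacrificing too much write and memory performance: hash tables, red-black trees, and skip lists.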

Hash Tables

Concept

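The idea of a hash table is similar to the one used in the optimal solution. Instead of reserving a boolean slot for every possible integer, we limit the space to an integer array a of size m. Then we pick a function h such that for each integer i: h(i) ∈ [0..m-1]. This function can be used to compute the array index, and we can store i in a[h(i)].

The following picture illustrates how the integer 3 is stored in a set implemented with a hash table. We compute h(3) and store the value in the corresponding field of the array.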

How do we pick h? A practical choice for h is to reuse an existing cryptographic hash function (e.g. MD5) and take the resulting value modulo m. The disadvantage is that these hash functions might be slow. This is why Java, e.g., relies on custom hash functions for every data type (e.g. String).

If we know all possible values upfront, we can pick a perfect hash function. But this is not possible most of the time. What do we do if two integers i and j get mapped to the same index h(i) = h(j)? There are different techniques for resolving these so-called collisions.

One collision resolution method is separate chaining. In separate chaining we do not store the actual values in the array but another layer of data structures. A value is read from the hash table by first computing the array index and then querying the data structure stored there. Possible candidates are tree or list based structures as introduced in the following sections. But it is also common to simply use a linked list, if the number of collisions is expected to be low.
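Separate chaining as described above could be sketched like this. The class `ChainedHashSet`, the hash function `i % m`, and the use of Python lists in place of linked lists are all assumptions made for illustration.

```python
class ChainedHashSet:
    """Integer set using separate chaining (illustrative sketch)."""

    def __init__(self, m: int = 8) -> None:
        self.m = m
        # Every array field holds its own small collection (the "chain").
        self.buckets = [[] for _ in range(m)]

    def _h(self, i: int) -> int:
        return i % self.m  # assumed hash function, chosen for simplicity

    def add(self, i: int) -> None:
        bucket = self.buckets[self._h(i)]
        if i not in bucket:
            bucket.append(i)

    def remove(self, i: int) -> None:
        bucket = self.buckets[self._h(i)]
        if i in bucket:
            bucket.remove(i)

    def contains(self, i: int) -> bool:
        # First compute the array index, then query the chain stored there.
        return i in self.buckets[self._h(i)]


chained = ChainedHashSet(m=8)
chained.add(3)
chained.add(11)  # 11 % 8 == 3, so this collides with 3
print(chained.buckets[3])                         # [3, 11]
print(chained.contains(11), chained.contains(19))  # True False
```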

Another commonly used method is called open addressing. If a collision happens, we compute a new index based on some probing strategy, e.g. linear probing with h(i) + 1.
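A rough sketch of open addressing with linear probing follows. The class `LinearProbingSet` and the hash function `i % m` are illustrative assumptions. Deletion is left out on purpose, because it typically needs extra bookkeeping such as tombstones.

```python
class LinearProbingSet:
    """Integer set using open addressing with linear probing (sketch)."""

    def __init__(self, m: int = 8) -> None:
        self.m = m
        self.slots = [None] * m  # None marks an empty slot

    def _probe(self, i: int):
        """Yield h(i), h(i) + 1, ... modulo m, visiting every slot once."""
        start = i % self.m  # assumed hash function
        for step in range(self.m):
            yield (start + step) % self.m

    def add(self, i: int) -> None:
        for idx in self._probe(i):
            if self.slots[idx] is None or self.slots[idx] == i:
                self.slots[idx] = i
                return
        raise RuntimeError("table is full, a resize would be needed")

    def contains(self, i: int) -> bool:
        for idx in self._probe(i):
            if self.slots[idx] is None:
                return False  # an empty slot ends the probe sequence
            if self.slots[idx] == i:
                return True
        return False


probing = LinearProbingSet(m=8)
probing.add(3)
probing.add(11)  # collides with 3 and moves on to index 4
print(probing.slots[3], probing.slots[4])           # 3 11
print(probing.contains(11), probing.contains(19))   # True False
```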

RUM Overheads

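The RUM overheads of a hash table implementation depend heavily on the chosen hash function, the array size m, and the collision resolution strategy. This makes it possible to tune the RUM overheads of your hash table.

If there is no collision, the read overhead is determined only by the overhead of computing the hash value h(i). A hash function that is cheap to evaluate leads to a small overall read overhead. If there is a collision, there is additional overhead that depends on the resolution strategy. Although it is useful to avoid collisions by choosing a nearly perfect hash function, the total overhead might be smaller if we keep the hash function reasonably fast and accept a few extra operations for resolving collisions.

Because an update requires a read operation first, the update overhead equals the read overhead plus an insertion or deletion on the array or, in the case of separate chaining, on the chained data structure.

The memory overhead is inversely proportional to the load factor (n / m). If we use separate chaining to resolve collisions, we also have to take the additional memory for the chains into account.

If we insert more and more data, the memory overhead decreases. However, the read overhead increases because we have to resolve more collisions. In that case it is possible to resize the hash table, which requires copying all existing data into a bigger array and recomputing all hash values.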
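The effect of resizing on the load factor can be sketched as follows. The helpers `load_factor` and `resize`, the hash function `i % m`, and the sample values are assumptions for illustration.

```python
def load_factor(buckets):
    """Number of stored values n divided by the array size m."""
    n = sum(len(bucket) for bucket in buckets)
    return n / len(buckets)


def resize(buckets, new_m):
    """Copy all values into a bigger array, recomputing every hash value."""
    new_buckets = [[] for _ in range(new_m)]
    for bucket in buckets:
        for i in bucket:
            new_buckets[i % new_m].append(i)  # assumed hash function
    return new_buckets


# Four values in a table of size 4 that all collide in bucket 3.
buckets = [[] for _ in range(4)]
for i in (3, 7, 11, 15):
    buckets[i % 4].append(i)

bigger = resize(buckets, 8)
print(load_factor(buckets), max(len(b) for b in buckets))  # 1.0 4
print(load_factor(bigger), max(len(b) for b in bigger))    # 0.5 2
```

After the resize the longest chain is shorter, so reads get cheaper, at the price of a larger array and a full copy.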

Asymptotic Complexity

Read and update operations have constant asymptotic complexity on average, because computing the hash value takes constant time, independent of the amount of data stored in the table. In the worst case (all n input values get the same hash value), we have n - 1 collisions to resolve. Thus the worst case performance is as bad as if we stored the data in an unordered array and performed a full scan. If m is chosen as small as possible and the hash table is resized when required, the amortized memory requirement is linear in the number of values stored in the set.

| Type | Average | Worst case |
| --- | --- | --- |
| Read | O(1) | O(n) |
| Update | O(1) | O(n) |
| Memory | O(n) | O(n) |

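On average, the read performance of hash table based data structures is constant. By design, however, a hash table only supports point queries efficiently. If your access pattern contains range queries, e.g. checking which integers in [0..500] are contained in the set, a hash set is not the right choice. To support range queries efficiently, we can store the data in sorted order. One of the most common types of data structures for this use case is the binary search tree.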
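To make the difference concrete, here is a small sketch that answers the same range query once with a hash set and once with sorted data. The sample values are made up, and a sorted Python list with `bisect` stands in for a sorted data structure.

```python
from bisect import bisect_left, bisect_right

values = [2, 17, 42, 250, 499, 501, 730, 999]  # hypothetical sample data
hash_set = set(values)
sorted_values = sorted(values)


def range_query_hash(lo, hi):
    # A hash set has no order, so every candidate in [lo, hi] is probed.
    return [i for i in range(lo, hi + 1) if i in hash_set]


def range_query_sorted(lo, hi):
    # Two binary searches locate the boundaries of the result slice.
    left = bisect_left(sorted_values, lo)
    right = bisect_right(sorted_values, hi)
    return sorted_values[left:right]


print(range_query_hash(0, 500))    # [2, 17, 42, 250, 499]
print(range_query_sorted(0, 500))  # [2, 17, 42, 250, 499]
```

Both return the same result, but the hash set version performs 501 point queries, while the sorted version needs two binary searches plus a scan over the matches.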

Red-Black Trees

Concept

In a binary search tree, the data is stored in the nodes. Each node has up to two child nodes. The left sub-tree contains only elements that are smaller than the current node. The right sub-tree contains only larger elements. If the tree is balanced, i.e. for all nodes the heights of the left and right sub-trees differ by at most 1, searching for a node takes logarithmic time. The following picture illustrates how to store the set {0..6} in a binary search tree.
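A lookup in such a tree can be sketched as follows for the set {0..6}. The `Node` class and the helpers `build_balanced` and `contains` are illustrative assumptions; the tree is built once from sorted input and is not self-balancing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_balanced(sorted_values):
    """Build a balanced tree by always picking the median as the root."""
    if not sorted_values:
        return None
    mid = len(sorted_values) // 2
    return Node(
        sorted_values[mid],
        build_balanced(sorted_values[:mid]),
        build_balanced(sorted_values[mid + 1:]),
    )


def contains(node, query):
    """Return (found, number of visited nodes)."""
    steps = 0
    while node is not None:
        steps += 1
        if query == node.value:
            return True, steps
        # Smaller values live in the left sub-tree, larger ones in the right.
        node = node.left if query < node.value else node.right
    return False, steps


root = build_balanced(list(range(7)))
print(root.value)         # 3
print(contains(root, 4))  # (True, 3)
print(contains(root, 7))  # (False, 3)
```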

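The question is how to keep the tree balanced when inserting and deleting elements. We need to design our insertion and deletion algorithms accordingly, so that the tree is self-balancing. A widely used variant of such self-balancing binary search trees is the red-black tree [1].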

In a red-black tree, each node stores its color in addition to the actual value. The color information is used when inserting or deleting nodes in order to determine how to rebalance the tree. Rebalancing is done by changing color and rotating sub-trees around their parents recursively until the tree is balanced again.
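A single left rotation with a color change, one of the building blocks mentioned above, can be sketched as follows. This is not the full red-black insertion or deletion algorithm. The `Node` class, the `rotate_left` helper, and the color swap cover just one rebalancing case and are illustrative assumptions.

```python
class Node:
    def __init__(self, value, color, left=None, right=None):
        self.value = value
        self.color = color  # "red" or "black"
        self.left = left
        self.right = right


def rotate_left(x):
    """Rotate the sub-tree rooted at x to the left and return its new root."""
    y = x.right
    x.right = y.left  # y's left sub-tree becomes x's right sub-tree
    y.left = x        # x becomes the left child of y
    # Swap colors so the new sub-tree root keeps the old root's color.
    x.color, y.color = y.color, x.color
    return y


def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)


# A right-leaning chain 0 -> 1 -> 2 with two consecutive red nodes.
root = Node(0, "black", right=Node(1, "red", right=Node(2, "red")))
root = rotate_left(root)
print(root.value, root.color)                # 1 black
print(root.left.color, root.right.color)     # red red
print(in_order(root))                        # [0, 1, 2]
```

The rotation changes the shape and colors of the sub-tree but keeps the sorted order of the values intact.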

Explaining the algorithm in detail is beyond the scope of this post, so please feel free to look it up on your own. Also there is an amazing interactive visualization of red-black trees by David Galles which is worth checking out. Now let's take a look at the same example set {0..6} stored in a red-black tree.

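Note that red-black trees are not necessarily perfectly balanced. Instead, they are balanced in terms of the black height of their sub-trees, i.e. the number of black nodes on a path. Thanks to the invariants of red-black trees, a red-black tree is never much worse than a perfectly balanced tree, i.e. both have the same asymptotic search complexity.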

RUM Overheads

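The RUM overheads in a self-balancing binary search tree depend on the algorithm used to keep the tree balanced. In red-black trees, rebalancing happens recursively and might affect nodes all the way up to the root.

A read operation traverses the tree until the element is found. If the element is stored in a leaf node, it takes at most log(n) + c traversal steps, c being the potential overhead if the tree is not perfectly balanced.

As in the hash table based implementation, an update operation on a red-black tree based set requires a read operation first. In addition to the read overhead, the update overhead depends on the value to be updated, on whether it is inserted or deleted, and on the current structure of the tree. In most cases an update only requires a single operation on the parent node, i.e. modifying the pointer to the child. In the worst case we have to rebalance the tree all the way up to the root.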

Asymptotic Complexity

Read operations have logarithmic complexity as red-black trees are balanced, thus searching a value conceptually corresponds to a binary search. Update operations have the same complexity, as they require logarithmic search, plus worst case rebalancing operations from a leaf to the root, which is again logarithmic. As we require one node per value, the memory requirements are linear.

| Type | Average | Worst case |
| --- | --- | --- |
| Read | O(log n) | O(log n) |
| Update | O(log n) | O(log n) |
| Memory | O(n) | O(n) |

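We have seen that self-balancing binary search trees are useful data structures if range queries are required or the data should be presented to the user in sorted order. However, the algorithms required to make them self-balancing are fairly complex. In addition, if we want to support concurrent access, we have to lock parts of the tree during rebalancing. This might lead to unpredictable slowdowns if a lot of rebalancing is needed.

How can we design a concurrency-friendly data structure that also supports logarithmic search cost?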

Skip Lists

Concept

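By design, linked lists are very concurrency-friendly, because updates are highly localized and cache-friendly [3]. If our data were a sorted sequence, we could use binary search to achieve logarithmic read complexity. The problem with a sorted linked list, however, is that we cannot access random elements of the list. Thus binary search is not possible. Or is it? This is where skip lists come in.

Skip lists are a probabilistic alternative to balanced trees [4, 5, 6]. The core idea of a skip list is to provide express lanes to later parts of the data by using skip pointers.

To perform binary search, we have to compare the query with the median. If the median is not the element we are looking for, we pick the left or right sub-list and recursively repeat the median comparison. This means we do not really need full random access, only access to the median of the current sub-list. The following picture illustrates how this can be achieved using skip pointers.

This skip list has three levels. The lowest level contains the full set of integers {0..6}. The next level contains only {1, 3, 5}, while the upper level contains only {3}. We add two artificial nodes, -∞ and ∞. Each node holds a value and an array of pointers, one to each successor on the corresponding level. If we now want to check whether 4 is a member of the set, we proceed as follows.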

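1. Start at the left-most element (-∞) with the top-most pointer (level 3).
2. Compare the query (4) with the next element in the current level (3).
3. As 3 < 4, we move one element to the right (to 3).
4. We then again compare the query (4) with the next element in the current level (∞).
5. As ∞ >= 4, we move one level down (to level 2).
6. We then again compare the query (4) with the next element in the current level (5).
7. As 5 >= 4, we move one level down (to level 1).
8. We then again compare the query (4) with the next element in the current level (4).
9. As 4 = 4, the query returns successfully.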
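The walk-through above can be replayed with a small sketch of a static skip list. The `Node` class and the helpers `build_static` and `contains` are assumptions for illustration. As a small shortcut, this version returns as soon as the next element equals the query, on any level.

```python
import math


class Node:
    def __init__(self, value, height):
        self.value = value
        # next[k] is the successor on level k + 1 (index 0 is the lowest level).
        self.next = [None] * height


def build_static(levels):
    """levels[k] holds the sorted values on level k + 1, lowest level first."""
    height = len(levels)
    head, tail = Node(-math.inf, height), Node(math.inf, height)
    # A node is as tall as the number of levels its value appears on.
    nodes = {v: Node(v, sum(v in lvl for lvl in levels)) for v in levels[0]}
    for k, values in enumerate(levels):
        chain = [head] + [nodes[v] for v in values] + [tail]
        for left, right in zip(chain, chain[1:]):
            left.next[k] = right
    return head


def contains(head, query):
    """Return (found, list of values the query was compared with)."""
    node, level, compared = head, len(head.next) - 1, []
    while True:
        nxt = node.next[level]
        compared.append(nxt.value)
        if nxt.value == query:
            return True, compared
        if nxt.value < query:
            node = nxt   # move one element to the right
        elif level > 0:
            level -= 1   # move one level down
        else:
            return False, compared


head = build_static([[0, 1, 2, 3, 4, 5, 6], [1, 3, 5], [3]])
print(contains(head, 4))  # (True, [3, inf, 5, 4])
```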

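This algorithm works very well if the list is static and we can build up the skip pointers to support our binary search. In a real-life scenario, however, we want to be able to insert or delete elements. How can we support inserts and deletes efficiently without losing the nice property of well-placed skip pointers? Completely rebuilding the skip list after each modification is not practical. In the end we want highly localized updates in order to support high concurrency.

We introduced skip lists as a probabilistic alternative to balanced trees. And the probabilistic part is exactly the one needed to solve the problem of where and how to place skip pointers.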

For each element we want to insert into the skip list, we first search for its position in the existing elements. Then we insert it into the lowest level. Afterwards we flip a coin. If the coin shows tails, we are done. If it shows heads, we "promote" the element to the next level, inserting it into the higher level list, and repeat the procedure. In order to delete an element, we search for it and then simply remove it from all levels. Feel free to check out this amazing interactive skip list visualization.
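The coin-flip insertion and the deletion described above could be sketched like this. The class `SkipList`, the cap `max_level`, the probability parameter `p`, and the `seed` argument are assumptions for illustration. The sketch flips all coins before linking the new node, which gives the same result as promoting level by level.

```python
import math
import random


class Node:
    def __init__(self, value, height):
        self.value = value
        self.next = [None] * height  # next[k] is the successor on level k + 1


class SkipList:
    def __init__(self, max_level=16, p=0.5, seed=None):
        self.max_level = max_level  # assumed cap on the number of levels
        self.p = p                  # probability that the coin shows heads
        self.rng = random.Random(seed)
        self.head = Node(-math.inf, max_level)
        self.tail = Node(math.inf, max_level)
        for k in range(max_level):
            self.head.next[k] = self.tail

    def _predecessors(self, value):
        """Right-most node with a smaller value, for every level."""
        preds = [None] * self.max_level
        node = self.head
        for k in reversed(range(self.max_level)):
            while node.next[k].value < value:
                node = node.next[k]
            preds[k] = node
        return preds

    def contains(self, value):
        return self._predecessors(value)[0].next[0].value == value

    def add(self, value):
        preds = self._predecessors(value)
        if preds[0].next[0].value == value:
            return  # already a member of the set
        height = 1
        # Heads: promote to the next level. Tails: stop.
        while height < self.max_level and self.rng.random() < self.p:
            height += 1
        node = Node(value, height)
        for k in range(height):
            node.next[k] = preds[k].next[k]
            preds[k].next[k] = node

    def remove(self, value):
        preds = self._predecessors(value)
        target = preds[0].next[0]
        if target.value != value:
            return
        for k in range(len(target.next)):  # unlink from all of its levels
            preds[k].next[k] = target.next[k]

    def to_list(self):
        out, node = [], self.head.next[0]
        while node is not self.tail:
            out.append(node.value)
            node = node.next[0]
        return out


sl = SkipList(seed=42)
for v in [5, 1, 4, 0, 6, 2, 3]:
    sl.add(v)
print(sl.to_list())  # [0, 1, 2, 3, 4, 5, 6]
```

Both `add` and `remove` only touch the predecessors of the affected node, which is what keeps updates local.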

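Due to the non-deterministic nature of the insertion algorithm, real-life skip lists do not look as optimal as the one in the picture above. They will most likely look much messier. Nevertheless, it can be shown that the expected search complexity is still logarithmic [7].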

RUM Overheads

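The RUM overheads in a skip list are non-deterministic. This is also why the asymptotic complexity analysis is more involved than usual, as it requires probability theory. Nevertheless, we are going to take a look at the different overheads on a conceptual level.

Read operations walk through a sequence of horizontal and vertical pointers, comparing the query with different list elements along the way. This means there is potentially a high number of auxiliary reads before the query can return.

As you might have guessed, we need to perform a read operation in order to perform an update. The number of auxiliary updates, i.e. promotions, is non-deterministic. However, they are entirely local and do not depend on the structure of the rest of the skip list. This makes it easy to parallelize updates.

The memory overhead depends on the number of promotions, because we have to store additional pointers for each promotion. By using an unfair coin, i.e. a promotion/no-promotion probability of [p, 1-p] with 0 < p < 1 instead of [0.5, 0.5], we can tune the memory overhead, potentially trading it against additional read and update overhead. If we chose p = 0, we would get a linked list, which has the minimum memory overhead achievable with this data structure. If we choose p too large, both the memory and the read overhead increase, because we potentially have to perform many vertical moves along the different levels.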
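As a back-of-the-envelope illustration of how p drives the memory overhead, consider the expected number of levels an element occupies. This is a simplified derivation that assumes independent coin flips and ignores the artificial nodes and any cap on the number of levels.

```latex
\begin{align*}
\Pr[\text{element reaches level } k+1] &= p^{k}, \quad k = 0, 1, 2, \dots \\
\mathbb{E}[\text{levels per element}] &= \sum_{k=0}^{\infty} p^{k} = \frac{1}{1-p} \\
\mathbb{E}[\text{successor pointers for } n \text{ elements}] &= \frac{n}{1-p}
\end{align*}
```

With a fair coin (p = 0.5) this gives 2n expected pointers, with p = 0.25 about 1.33n, and for p approaching 0 it approaches the n pointers of a plain linked list.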

Asymptotic Complexity

There are different ways to analyze the asymptotic complexity of skip lists. Two commonly used methods are to look at the expected asymptotic complexity or at an asymptotic complexity that holds with high probability. For simplicity, let us take a look at the expected complexity here.

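As mentioned above, when implementing a skip list there is a tiny chance that an element gets promoted indefinitely. While the expected number of levels is O(log(n)), it is theoretically unbounded. To solve this problem, one can choose a maximum number of levels M to which an element can be promoted. If M is sufficiently large, there are no negative implications in practice.

In the average case, the expected read and update complexity is logarithmic. The expected height of a skip list is O(log(n)). However, elements are less likely to be promoted the higher the level, which allows us to derive linear expected memory requirements [8].

For the worst case analysis it is more interesting to look at bounded lists, because the unbounded worst case is a skip list of infinite height. In the worst case of a bounded list, we promote every element to the maximum level. If we choose the maximum level dependent on n, we can derive linear complexity for read and update operations.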

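| Type | Average | Worst case (M bounded) | Worst case (unbounded) |
| --- | --- | --- | --- |
| Read | O(log n) | O(n) | ∞ |
| Update | O(log n) | O(n) | ∞ |
| Memory | O(n) | O(nM) | ∞ |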

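We have now seen three different types of data structures that are widely used in industry, and we looked at them one by one from a theoretical perspective. The next section contains a static comparison that summarizes our findings, as well as some runtime experiments using implementations from the Java standard library.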

Comparison

Theoretical Comparison

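From what we have seen today, it is safe to say that read-efficient data structures aim for sub-linear read overhead. Hash tables are great for in-memory maps or sets. Their drawbacks are the need to resize the underlying array as the data grows and the lack of range query support. Tree-based data structures are a good choice if range queries or sorted output are a concern. Skip lists are sometimes preferred over trees because of their simplicity, especially when it comes to lock-free implementations.

Some of the data structures are configurable in terms of their RUM overheads. By tuning parameters such as the collision resolution strategy or the desired load factor, we can trade memory overhead against read overhead in hash tables. In skip lists we can achieve this by modifying the promotion probability.

The following table summarizes the average asymptotic read, update, and memory requirements as well as the RUM tunability of the three data structures we have seen in this post.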

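| | Hash Table | Red-Black Tree | Skip List |
| --- | --- | --- | --- |
| Avg. Read | O(1) | O(log n) | O(log n) |
| Avg. Update | O(1) | O(log n) | O(log n) |
| Avg. Memory | O(n) | O(n) | O(n) |
| RUM Tuning Parameters | load factor, hash function, collision resolution strategy | - | promotion probability |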

Runtime Experiments

Last but not least, we want to take a look at the actual read performance of three Java standard library data structures: HashSet, TreeSet, and ConcurrentSkipListSet.

HashSet uses separate chaining for collision resolution. If the number of elements in a bucket is small enough, they are stored in a list. If the number exceeds the TREEIFY_THRESHOLD, the bucket is migrated to a red-black tree. TreeSet is implemented using a red-black tree. Both HashSet and TreeSet are not thread safe and do not support concurrent modifications. As the name suggests, ConcurrentSkipListSet supports concurrent access. The base lists use a variant of the Harris-Michael linked ordered set algorithm [9, 10].

As a benchmark, we create a set from n random integers and copy it into a HashSet, a TreeSet, and a ConcurrentSkipListSet, respectively. We also create a read-optimal set from those numbers, i.e. using a huge boolean array. We then create a list of n random point queries and measure the run time for all queries to complete.

We are using ScalaMeter for measuring the runtime performance. Feel free to check out my microbenchmarking blog post which contains more details about the tool.

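The following chart shows the run time of 100 000 point queries on the different sets generated from 100 000 random integers.

As expected, the read-optimal implementation performs significantly better than all the others. Second place goes to the hash set. Both the read-optimal implementation and the hash set have constant asymptotic read overhead. The tree set and the skip list set perform much worse. This is also expected, as they have logarithmic run time complexity.

It would be interesting to also look at the other overheads of the four implementations, and to add concurrency to the mix. However, I am leaving this exercise to the reader :P In the next post we will take a closer look at write-efficient data structures, which are designed for low update overhead.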

References

[1] Guibas, L.J. and Sedgewick, R., 1978, October. A dichromatic framework for balanced trees. In Foundations of Computer Science, 1978., 19th Annual Symposium on (pp. 8-21). IEEE.
[2] Memory consumption of popular Java data types – part 2 by Mikhail Vorontsov
[3] Choose Concurrency-Friendly Data Structures By Herb Sutter
[4] Pugh, W., 1989, August. Skip lists: A probabilistic alternative to balanced trees. In Workshop on Algorithms and Data Structures (pp. 437-449). Springer, Berlin, Heidelberg.
[5] Fraser, K. and Harris, T., 2007. Concurrent programming without locks. ACM Transactions on Computer Systems (TOCS), 25(2), p.5.
[6] Herlihy, M., Lev, Y., Luchangco, V. and Shavit, N., 2006. A provably correct scalable concurrent skip list. In Conference On Principles of Distributed Systems (OPODIS).
[7] Papadakis, T., 1993. Skip lists and probabilistic analysis of algorithms. Ph. D. Dissertation: University of Waterloo.
[8] Skip lists - Data Structures Course of Ben-Gurion University of the Negev
[9] Harris, T.L., 2001, October. A pragmatic implementation of non-blocking linked-lists. In International Symposium on Distributed Computing (pp. 300-314). Springer, Berlin, Heidelberg.
[10] Michael, M.M., 2002, August. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures (pp. 73-82). ACM.
Cover image by Smabs Sputzer - It's a Rum Do... on flickr, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=59888481

from: https://dev.to//frosnerd/read-efficient-data-structures-57i5