Calculating Availability

Let’s focus more on some of the larger architectural principles of cluster management than on any single technology solution.

We'll see some actual implementations later in the book, and you can learn a lot about how this works on Amazon's AWS in my Learn Amazon Web Services in a Month of Lunches book from Manning. But for now, let's first make sure we're comfortable with the basics.

Running server operations using clusters of either physical or virtual computers is all about improving both reliability and performance over and above what you could expect from a single, high-powered server. You add reliability by avoiding hanging your entire infrastructure on a single point of failure (i.e., a single server). And you can increase performance through the ability to very quickly add computing power and capacity by scaling up and out.

This might happen through intelligently spreading your workloads among diverse geographic and demand environments (load balancing), providing backup servers that can be quickly brought into service in the event a working node fails (failover), optimizing the way your data tier is deployed, or allowing for fault tolerance through loosely coupled architectures.

We’ll get to all that. First, though, here are some basic definitions:

Node: A single machine (either physical or virtual) running server operations independently on its own operating system. Since any single node can fail, meeting availability goals requires that multiple nodes operate as part of a cluster.

Cluster: Two or more server nodes running in coordination with each other to complete individual tasks as part of a larger service, where mutual awareness allows one or more nodes to compensate for the loss of another.

Server failure: The inability of a server node to respond adequately to client requests. This could be due to a complete crash, connectivity problems, or because it has been overwhelmed by high demand.

Failover: The way a cluster tries to accommodate the needs of clients orphaned by the failure of a single server node by launching or redirecting other nodes to fill a service gap.

Failback: The restoration of responsibilities to a server node as it recovers from a failure.

Replication: The creation of copies of critical data stores to permit reliable synchronous access from multiple server nodes or clients and to ensure they will survive disasters. Replication is also used to enable reliable load balancing.

Redundancy: The provisioning of multiple identical physical or virtual server nodes of which any one can adopt the orphaned clients of another one that fails.

Split brain: An error state in which network communication between nodes or shared storage has somehow broken down and multiple individual nodes, each believing it’s the only node still active, continue to access and update a common data source. While this doesn’t impact shared-nothing designs, it can lead to client errors and data corruption within shared clusters.

Fencing: To prevent split brain, the stonithd daemon can be configured to automatically shut down a malfunctioning node or to impose a virtual fence between it and the data resources of the rest of a cluster. As long as there is a chance that the node could still be active, but is not properly coordinating with the rest of the cluster, it will remain behind the fence. Stonith stands for “Shoot the other node in the head”. Really.

Quorum: You can configure fencing (or forced shutdown) to be imposed on nodes that have fallen out of contact with each other or with some shared resource. Quorum is often defined as more than half of all the nodes in the total cluster. With quorum defined this way, you avoid ending up with two subclusters of nodes, each believing the other to be malfunctioning and attempting to knock the other one out.

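The majority rule behind quorum is simple arithmetic. Here's a minimal sketch in Python (the function name is invented for illustration; real cluster managers implement this internally):

```python
# Hypothetical quorum check: a partition of the cluster may keep
# running (and fence the others) only if it can still see a strict
# majority of all configured nodes.
def has_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """Return True if this partition holds more than half the cluster."""
    return visible_nodes > total_nodes // 2

# In a 5-node cluster split 3/2, only the 3-node side keeps quorum,
# so two subclusters can never both decide to fence each other.
assert has_quorum(3, 5)
assert not has_quorum(2, 5)
# An even 2/2 split of a 4-node cluster leaves neither side with
# quorum, which is one reason odd node counts are usually preferred.
assert not has_quorum(2, 4)
```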
Disaster Recovery: Your infrastructure can hardly be considered highly available if you’ve got no automated backup system in place along with an integrated and tested disaster recovery plan. Your plan will need to account for the redeployment of each of the servers in your cluster.

Active/Passive Cluster

The idea behind service failover is that the sudden loss of any one node in a service cluster would quickly be made up by another node taking its place. For this to work, the IP address is automatically moved to the standby node in the event of a failover. Alternatively, network routing tools like load balancers can be used to redirect traffic away from failed nodes. The precise way failover happens depends on the way you have configured your nodes.

Only one node will initially be configured to serve clients, and will continue to do so alone until it somehow fails. The responsibility for existing and new clients will then shift (i.e., “failover”) to the passive — or backup — node that until now has been kept passively in reserve. Applying the model to multiple servers or server room components (like power supplies), n+1 redundancy provides just enough resources for the current demand plus one more unit to cover for a failure.

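The failover sequence described above, in which the standby node takes over the service address when the active node dies, can be sketched like this (class and node names are invented for the example; real clusters delegate this to tools such as Pacemaker):

```python
# Hypothetical active/passive failover: one virtual IP, two nodes;
# the standby only takes over the address when the active node fails.
class Cluster:
    def __init__(self):
        self.nodes = {"node-1": "active", "node-2": "passive"}
        self.vip_owner = "node-1"  # the virtual IP currently answers here

    def fail(self, node):
        self.nodes[node] = "failed"
        # Failover: promote the standby and move the virtual IP to it.
        standby = next(n for n, s in self.nodes.items() if s == "passive")
        self.nodes[standby] = "active"
        self.vip_owner = standby

cluster = Cluster()
cluster.fail("node-1")
assert cluster.vip_owner == "node-2"   # clients now reach the standby
```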
Active/Active Cluster

A cluster using an active/active design will have two or more identically configured nodes independently serving clients.

Should one node fail, its clients will automatically connect with the second node and, as far as resources permit, receive full resource access.

Once the first node recovers or is replaced, clients will once again be split between both server nodes.

The primary advantage of running active/active clusters lies in the ability to efficiently balance a workload between nodes and even networks. The load balancer, which directs all requests from clients to available servers, is configured to monitor node and network activity and use some predetermined algorithm to route traffic to those nodes best able to handle it. Routing policies might follow a round-robin pattern, where client requests are simply alternated between available nodes, or a preset weighting, where one node is favored over another by some ratio.

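Both routing policies are easy to sketch (the node names are illustrative; a real load balancer like HAProxy or nginx exposes these as configuration options):

```python
import itertools
import random

nodes = ["node-a", "node-b", "node-c"]

# Round-robin: client requests simply alternate between available nodes.
round_robin = itertools.cycle(nodes)
first_six = [next(round_robin) for _ in range(6)]
# Two full passes over the three nodes, in order.

# Weighted: node-a is favored 2:1 over each of the others, so it
# receives roughly half of all traffic over many requests.
choice = random.choices(nodes, weights=[2, 1, 1], k=1)[0]
assert choice in nodes
```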
Having a passive node acting as a standby replacement for its partner in an active/passive cluster configuration provides significant built-in redundancy. If your operation absolutely requires uninterrupted service and seamless failover transitions, then some variation of an active/passive architecture should be your goal.

Shared-Nothing vs. Shared-Disk Clusters

One of the guiding principles of distributed computing is to avoid having your operation rely on any single point of failure. That is, every resource should be either actively replicated (redundant) or independently replaceable (failover), and there should be no single element whose failure could bring down your whole service.

Now, imagine that you’re running a few dozen nodes that all rely on a single database server for their function. Even though the failure of any number of the nodes will not affect the continued health of those nodes that remain, should the database go down, the entire cluster would become useless. Nodes in a shared-nothing cluster, however, will (usually) maintain their own databases so that — assuming they’re being properly synced and configured for ongoing transaction safety — no external failure will impact them.

This will have a more significant impact on a load balanced cluster, as each load balanced node has a constant and critical need for simultaneous access to the data. The passive node on a simple failover system, however, might be able to survive some time without access.

While such a setup might slow down the way the cluster responds to some requests, partly because fears of split-brain failures might require periodic fencing through stonith, the trade-off can be justified for mission-critical deployments where reliability is the primary consideration.

Availability

When designing your cluster, you'll need to have a pretty good sense of just how tolerant you can be of failure. Or, in other words, given the needs of the people or machines consuming your services, how long can a service disruption last before the mob comes pouring through your front gates with pitchforks and flaming torches? It's important to know this, because the amount of redundancy you build into your design will have an enormous impact on the downtime you will eventually face.

Obviously, the system you build for a service that can go down for a weekend without anyone noticing will be very different from an e-commerce site whose customers expect 24/7 access. At the very least, you should generally aim for an availability average of at least 99%, with some operations requiring significantly higher real-world results. 99% uptime would translate to a total loss of less than four days out of every year.

There is a relatively simple formula you can use to build a useful estimate of availability (A). The idea is to divide the Mean Time Between Failures (MTBF) by the Mean Time Between Failures plus the Mean Time To Repair (MTTR).

A = MTBF / (MTBF + MTTR)

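Plugging in some illustrative numbers (the MTBF and MTTR values here are invented for the example, not vendor measurements):

```python
def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    return mtbf / (mtbf + mttr)

# A node averaging 900 hours between failures, with 9 hours to repair,
# comes out at roughly 99% availability:
a = availability(900, 9)                # ~0.9901
downtime_hours = (1 - a) * 365 * 24     # ~86.7 hours per year
```

That roughly 87 hours of expected yearly downtime matches the earlier rule of thumb that 99% uptime costs you a bit less than four days per year.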
The closer the value of A comes to 1, the more highly available your cluster will be. To obtain a realistic value for MTBF, you’ll probably need to spend time exposing a real system to some serious punishment, and watching it carefully for software, hardware, and networking failures. I suppose you could also consult the published life cycle metrics of hardware vendors or large-scale consumers like Backblaze to get an idea of how long heavily-used hardware can be expected to last.

The MTTR will be a function of the time it takes your cluster to replace the functionality of a server node that's failed (a process that's similar to, though not identical with, disaster recovery, which focuses on quickly replacing failed hardware and connectivity). Ideally, that would be a value as close to zero seconds as possible.

Figure: Server Availability

The problem is that, in the real world, there are usually far too many unknown variables for this formula to be truly accurate, as nodes running different software configurations and built with hardware of varying profiles and ages will have a wide range of life expectancies. Nevertheless, it can be a good tool to help you identify the cluster design that’s best for your project.

With that information, you can easily generate an estimate of how much overall downtime your service will likely experience in the course of an entire year.

A related consideration, if you're deploying your resources on a third-party platform provider like VMware or Amazon Web Services, is the provider's Service Level Agreement (SLA). Amazon's EC2, for instance, guarantees that its compute instances and block storage devices will deliver a Monthly Uptime Percentage of at least 99.95%, which works out to less than five hours' downtime per year. AWS will issue credits for months in which it missed its targets, though not nearly enough to compensate for the total business costs of your downtime. With that information, you can arrange for a level of service redundancy that's suitable for your unique needs.

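The 99.95% figure converts to yearly downtime with the same arithmetic as the availability formula above:

```python
def max_yearly_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year a given uptime percentage still allows."""
    return (100 - uptime_percent) / 100 * 365 * 24

# 0.05% of the 8,760 hours in a year is about 4.4 hours:
allowed = max_yearly_downtime_hours(99.95)
assert round(allowed, 2) == 4.38
```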
Naturally, as a service provider to your own customers, you may need to publish your own SLA based on your MTBF and MTTR estimates.

Session Handling

For any server-client relationship, the data generated by stateful HTTP sessions needs to be saved in a way that makes it available for future interactions. Cluster architectures can introduce serious complexity into these relationships, as the specific server a client or user interacts with might change between one step and the next.

To illustrate, imagine you’re logged onto Amazon.com, browsing through their books on LPIC training, and periodically adding an item to your cart (hopefully, more copies of this book). By the time you’re ready to enter your payment information and check out, however, the server you used to browse may no longer even exist. How will your current server know which books you decided to purchase?

I don’t know exactly how Amazon handles this, but the problem is often addressed through a data replication tool like memcached running on an external node (or nodes). The goal is to provide constant access to a reliable and consistent data source to any node that might need it.

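The pattern is easy to sketch with a plain dict standing in for the external cache (a real deployment would use a memcached or Redis client here; the function names are invented for the example):

```python
# A dict plays the role of the external session store: every node
# reads and writes sessions by key from the shared location, so a
# failover to a different node still sees the same shopping cart.
session_store = {}

def save_cart(session_id, cart):
    session_store[session_id] = cart          # node A writes the session

def load_cart(session_id):
    return session_store.get(session_id, [])  # node B reads it later

save_cart("user-42", ["Linux in Action"])
# After node A fails, node B can still recover the session:
assert load_cart("user-42") == ["Linux in Action"]
```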
This article is adapted from “Teach Yourself Linux Virtualization and High Availability: prepare for the LPIC-3 304 certification exam”. Check out my other books on AWS and Linux administration, including Linux in Action and Linux in Motion — a hybrid course made up of more than two hours of video and around 40% of the text of Linux in Action.

Translated from: https://www.freecodecamp.org/news/high-availability-concepts-and-theory/
