Raft

Abstract

This article will briefly introduce the Raft protocol in distributed systems. It will cover the basic ideas of Raft and the essential functions it supports to explain “consensus”.

References

Please start by watching the animation demonstration for an intuitive understanding. (P.s. In the election section, a follower can only vote once, and some animations may be incorrect.)

Leader election

A new leader needs to be elected when there is no leader or the leader breaks down for some reason.

Each server is regarded as a follower initially, with a period called the election timeout. It will transit to be a candidate for an election. Each candidate sends all left servers messages to request a vote. Each follower votes or rejects as a response to the request. As long as any majority of the servers vote for a candidate, it becomes a leader.


You can recognize that there will be a disaster when too many candidates exist, the “Timing and availability” subsection will mitigate this issue.


Log replication

We need multiple servers for recovery backup and there needs to be a consensus process.

After a server is a leader, log entries received from the clients are appended to the follower, and corresponding responses are obtained. If any majority of responses are received, commit the log and inform the followers to also commit.

Raft ensures the following:

  • The leader never overwrites or deletes its own logs, only appends to them (Leader Append-Only)
  • If two logs have the same index and term, then the logs are the same (Log Matching)
  • If two logs are the same, then all logs that come before them are also the same.

The leader maintains two arrays, nextIndex and matchIndex, which respectively keep track of the log index that each follower is about to append and the log index that has already been matched.
Each time a new leader is elected, initialize nextIndex to the index of the leader’s log plus one, and matchIndex is initialized to zero.

然后发送包来append日志.对于follower,考虑两种情况

  1. 缺少日志,只需要append,然后更新next和match
  2. append多次后重启follower丢失部分日志,需要leader减少next,直到找到匹配的日志,然后append后更新next和match

当log replication后得到的follower的回应超过半数时,commit然后发送包告知可以进行commit(更新相应字段)

img

回过头看raft保证的几点

  1. 显式指定
  2. leader最多在一个任期里在指定的一个日志索引位置创建一条日志条目,同时日志条目在日志中的位置也从来不会改变.
  3. 初始空日志状态满足,每次append保持这个特性,因此每个成功而存在的index及前面日志都满足这个特性

Safety

状态切换和故障情况下的日志冲突的解决办法.

Election restriction

保证了每个leader拥有所有被commit过的日志.

img

如图,

( a )中S1是领导者并且部分复制索引2处的日志条目。

( b ) S1崩溃;S5以S3、S4和自身的票数当选第3任期的领导人,并在对数指标2接受不同的进入。

( c ) S5崩溃;S1重新启动,当选领导者,继续复制。此时,term2的日志条目已经在大部分服务器上被复制,但并没有被提交。

如果S1像( d )一样崩溃,则S5可以当选领导者(投票来自S2、S3和S4),并从第3项开始用自己的条目覆盖该条目。

然而,如果S1在崩溃前在大多数服务器上复制了当前项,如( e )所示,则该项为提交( S5无法赢得选举)。此时,日志中的所有前项也被提交。

这里约束的原则为term和index,raft偏好于任期更高的和日志更多的leader选举.因此一个follower给出vote并非仅仅先来后到,只有

  1. term更大
  2. 相同term时,index更大

才会vote,否则拒绝

Committing entries from previous terms

同样图中(c)我们会发现即使log replication占据majority仍然会被(d)中的s5 overwrite.

为了防止这个情况,我们对于leader不仅仅关心旧任期的log replication.也就是说.对于c,我们不进行term2的log replication,而是对还未replication的所有log进行log replication.因此如果term4进行了commit,能够保证了term2的log也会被commit.如果没有被commit,也会因为election restriction而阻止s5在s1下线后成为leader

Timing and availability

广播时间(broadcastTime) << 选举超时时间(electionTimeout) << 平均故障间隔时间(MTBF)

广播时间必须比选举超时时间小一个量级,这样领导人才能够发送稳定的心跳消息来阻止跟随者开始进入选举状态;通过随机化选举超时时间的方法,这个不等式也使得选票瓜分的情况变得不可能。选举超时时间应该要比平均故障间隔时间小上几个数量级,这样整个系统才能稳定的运行。

广播时间和平均故障间隔时间是由系统决定的,但是选举超时时间是我们自己选择的。Raft 的 RPCs 需要接收方将信息持久化的保存到稳定存储中去,所以广播时间大约是 0.5 毫秒到 20 毫秒,取决于存储的技术。因此,选举超时时间可能需要在 10 毫秒到 500 毫秒之间。大多数的服务器的平均故障间隔时间都在几个月甚至更长,很容易满足时间的需求。

成员变更

讨论成员数量变化下的情况。

Log compaction

讨论日志增长的背景下,如何减少日志的占用空间。

客户端交互

ToDo Lists

  • 状态列表
  • RPC各种
  • ?commited的状态谁知情:leader存储commitIndex,然后发送给follower。
  • 日志复制时:append多次后重启leader丢失部分日志.这时候会出现强行覆盖follower的log,这是raft不考虑的部分
  • 日志复制的参数和过程
Donate
  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyrights © 2020-2024 环烷烃
  • Visitors: | Views:

我很可爱,请我喝一瓶怡宝吧~

支付宝
微信