Distributed systems: consensus and RAFT algorithm

Szymon Kulec

@Scooletz

http://blog.scooletz.com

Outline

warm up: what is a distributed system?
CAP theorem
logs and state machines
Paxos by Leslie Lamport
RAFT: I want to understand

what is a distributed system?

what is it?

A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.

Wikipedia

message passing you say?

Here be dragons!

out of order messages
duplicated messages
lost messages
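A minimal sketch of how a receiver might cope with the three dragons above, assuming the sender stamps every message with a sequence number. The class name InOrderReceiver and its methods are hypothetical, made up for this illustration only.

```python
class InOrderReceiver:
    """Delivers messages in sequence-number order, each at most once."""

    def __init__(self):
        self.next_seq = 0      # next sequence number we expect to deliver
        self.pending = {}      # early arrivals, keyed by sequence number
        self.delivered = []    # payloads handed to the application, in order

    def receive(self, seq, payload):
        if seq < self.next_seq or seq in self.pending:
            return             # duplicate: already delivered or already buffered
        self.pending[seq] = payload
        # Deliver every consecutive message that is now available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def missing(self):
        """Sequence numbers that block delivery; the sender would have to resend them."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


receiver = InOrderReceiver()
# Out of order (1 before 0), duplicated (1 twice), lost (2 never arrives).
for seq, payload in [(1, "b"), (0, "a"), (1, "b"), (3, "d")]:
    receiver.receive(seq, payload)
```

Messages 0 and 1 are delivered in order, the duplicate is dropped, and message 3 waits in the buffer until the lost message 2 is resent.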

a nice distributed system

scales up/out
has homogeneous nodes
supports changing the number of nodes
for databases: doesn't break under the Jepsen tests © Aphyr

a few examples

Databases: Cassandra, Riak, EventStore, FoundationDB
Queues: RabbitMQ (:D), Azure Queues
Data processing: Storm
Configuration: Zookeeper, Consul, etcd

CAP theorem

Eric Brewer asks you to choose two of them. You cannot satisfy them all.

Consistency - linearizability (allows perceiving a history of operations as a sequence)
Availability - every request receives a response ok/error
Partition tolerance - can operate despite losing messages, message copies etc.

CP, AP, AC?

Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.

Eric Brewer

CP, AP, AC? 2

CP: EventStore, Cassandra (lightweight transactions), FoundationDB
AP: Riak (allow_mult), Cassandra, Dynamo (Amazon)
AC: lies, lies, lies...

Consensus, where are you?

It's needed for CP systems.

State machine

F(state, input) -> newState
F(F(F(s0, i0), i1), i2) -> s3
a sequence of s0, i0, i1, i2, i3, ... will always give the same result for the same F
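The nested application F(F(F(s0, i0), i1), i2) is a left fold over the inputs. A small sketch, assuming a made-up transition function f for a counter; the command names "add" and "mul" are illustrative only.

```python
from functools import reduce


def f(state, command):
    """Hypothetical transition function: a counter that understands two commands."""
    op, amount = command
    if op == "add":
        return state + amount
    if op == "mul":
        return state * amount
    raise ValueError(f"unknown command: {op}")


s0 = 1
inputs = [("add", 2), ("mul", 5), ("add", -3)]   # i0, i1, i2

# reduce computes f(f(f(s0, i0), i1), i2)
s3 = reduce(f, inputs, s0)
```

Because f depends only on its arguments, running the fold again over the same inputs yields the same s3.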

State machine replication

If there were a way to replicate a sequence of entries across multiple machines, applying the same function to that sequence would result in the same state on all machines. That would bring consensus to all machines/processes...
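A sketch of the idea, assuming a tiny key-value store as the state machine. The functions apply_entry and replay are hypothetical names. It also shows why the order of entries matters: the same entries in a different order give a different state.

```python
def apply_entry(state, entry):
    """Hypothetical deterministic transition for a tiny key-value store."""
    key, value = entry
    new_state = dict(state)   # copy, so the input state is never mutated
    new_state[key] = value
    return new_state


def replay(log):
    """Apply every entry of the log, in order, starting from an empty state."""
    state = {}
    for entry in log:
        state = apply_entry(state, entry)
    return state


log = [("x", 1), ("y", 2), ("x", 3)]

# Three replicas that received the same entries in the same order.
replicas = [replay(list(log)) for _ in range(3)]

# A replica that saw the same entries in a different order diverges.
reordered = replay([("x", 3), ("y", 2), ("x", 1)])
```

Agreeing on the log, entry by entry, is therefore the part that needs a consensus algorithm.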

Paxos

rooted in Leslie Lamport's state machine approach
first described in 1989 (journal publication followed in 1998)
the algorithm family consists of many algorithms with different trade-offs

Paxos - a dictionary

processor - a node in a given cluster
quorum - a strict majority of processors (for 2N+1 nodes it's N+1)
hard to understand like Paxos - a common phrase for implementors of distributed systems
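The quorum arithmetic from the dictionary above, as a short sketch. The function names quorum and tolerated_failures are made up for the illustration.

```python
def quorum(cluster_size):
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """How many nodes may be down while a quorum can still be formed."""
    return cluster_size - quorum(cluster_size)


# cluster size -> (quorum, tolerated failures)
sizes = {n: (quorum(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

Going from 3 to 4 nodes raises the quorum but not the number of tolerated failures, which is one reason clusters are usually sized 2N+1. Any two quorums share at least one node.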

Paxos - processor roles

Client
Acceptor (voter)
Proposer
Learner
Leader

Shouldn't we switch to RAFT?

RAFT - basic info

authors: Diego Ongaro; John Ousterhout
written to be easy to understand
it is easy to understand
splits algorithm into:
1. leader election
2. replication

RAFT - logical clock

synchronizing clocks is hard (Google Spanner: GPS + atomic clocks)
use logical, natural, incrementing numbers
term - a logical epoch of the system
clock can only go forward - a node never accepts messages sent in earlier terms
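A minimal sketch of the term as a logical clock, assuming every message carries the sender's term. TermClock and accept are hypothetical names, not taken from any RAFT library.

```python
class TermClock:
    """Tracks the highest term a node has seen."""

    def __init__(self):
        self.current_term = 0

    def accept(self, message_term):
        """Return True if a message stamped with message_term may be processed."""
        if message_term < self.current_term:
            return False                      # stale: sent in an earlier term
        self.current_term = message_term      # equal or newer: move forward
        return True


clock = TermClock()
# Messages arrive stamped with terms 1, 3, 2, 3; the one from term 2 is stale.
results = [clock.accept(t) for t in (1, 3, 2, 3)]
```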

RAFT - roles

follower - simply follows the leader and votes when there is no leader; the initial state
candidate - votes for itself, then either becomes the leader or steps down if another leader is elected
leader - a strong leader, replicating its log to other nodes

RAFT - election

when no messages arrive from the leader within a given timeout, a follower increments the term and becomes a candidate
the candidate votes for itself in the given term and asks others for votes
a follower votes for the first candidate in the given term
all votes are persisted; after a crash a node is not allowed to vote a second time in the same term
if no leader emerges, the term is incremented and a re-election occurs
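A sketch of the voter side of these rules. It is a simplification under stated assumptions: a plain dict stands in for durable storage, the class Voter and its method are hypothetical names, and the check that full RAFT performs on the candidate's log is left out.

```python
class Voter:
    """Follower-side vote handling; `storage` stands in for durable storage."""

    def __init__(self, storage):
        self.storage = storage
        self.storage.setdefault("term", 0)
        self.storage.setdefault("voted_for", None)

    def on_vote_request(self, term, candidate):
        if term < self.storage["term"]:
            return False                       # request from an earlier term
        if term > self.storage["term"]:
            self.storage["term"] = term        # new term: the old vote no longer binds
            self.storage["voted_for"] = None
        if self.storage["voted_for"] in (None, candidate):
            self.storage["voted_for"] = candidate   # "persist" before replying
            return True
        return False                           # already voted for someone else


disk = {}
n3 = Voter(disk)
first = n3.on_vote_request(term=1, candidate="N1")    # first candidate gets the vote
second = n3.on_vote_request(term=1, candidate="N2")   # same term, vote already used

# Simulated crash: the object is lost, the persisted state is not.
n3 = Voter(disk)
after_restart = n3.on_vote_request(term=1, candidate="N2")
next_term = n3.on_vote_request(term=2, candidate="N2")
```

Without the persisted vote, the restarted node could vote for N2 in term 1 as well, and both N1 and N2 might collect a majority.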

RAFT - election example

N1, N2, N3 - nodes
N1 and N2 time out and become candidates
each votes for itself and sends a vote request to N3
N3 votes for N1
N1 becomes the leader
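The same example as a vote tally, seen from the candidates' side. The function elect is a hypothetical helper for this illustration, not part of RAFT itself.

```python
def elect(nodes, votes):
    """Return the candidate holding a majority of all nodes' votes, or None."""
    majority = len(nodes) // 2 + 1
    tally = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count >= majority:
            return candidate
    return None


nodes = ["N1", "N2", "N3"]

# voter -> candidate it voted for in this term
votes = {"N1": "N1", "N2": "N2", "N3": "N1"}
leader = elect(nodes, votes)                       # N1 holds 2 of 3 votes

# If N3's vote never arrives, nobody has a majority in this term.
split = elect(nodes, {"N1": "N1", "N2": "N2"})
```

In the split case no leader emerges, so the candidates time out again and retry in a higher term.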

RAFT - election questions

is it possible to get two leaders in one term?
is it possible to get two leaders in different terms?

RAFT - replication

leader sends AppendEntries messages
when a follower has inconsistent entries, the leader steps back in its history to find the matching one and resends from there
leader commits entries via AppendEntries messages
leader commits ONLY if it has successfully replicated at least one entry from its current term
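A sketch of the step-back behaviour, assuming logs are plain lists of (term, command) pairs and calls are made directly instead of over the network. The function names and the log representation are assumptions made for this illustration. One simplification is named in the code: this follower always drops everything after the matching entry, whereas a real implementation removes entries only on an actual conflict.

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower side. prev_index is -1 when the entries start at the beginning of the log."""
    if prev_index >= 0:
        if prev_index >= len(follower_log) or follower_log[prev_index][0] != prev_term:
            return False                       # no matching entry: reject
    # Simplification: drop whatever follows the matching entry, then append.
    del follower_log[prev_index + 1:]
    follower_log.extend(entries)
    return True


def replicate(leader_log, follower_log):
    """Leader side: step back until the follower accepts, then resend from there."""
    next_index = len(leader_log)
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index][0] if prev_index >= 0 else 0
        if append_entries(follower_log, prev_index, prev_term, leader_log[next_index:]):
            return next_index                  # index of the first entry that was resent
        next_index -= 1                        # rejected: try one entry earlier


leader = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
follower = [(1, "a"), (1, "b"), (1, "x")]      # diverges from the leader at index 2

resent_from = replicate(leader, follower)
```

The loop always terminates, because a request that starts at the beginning of the log (prev_index of -1) is never rejected.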

I want to know moreaarar!

Thank you and let's RAFT!
