Distributed systems: consensus and RAFT algorithm

Szymon Kulec

@Scooletz

http://blog.scooletz.com

Outline

what a is

———distributed

————system

what is it?

A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.
Wikipedia

message passing you say?

Here be dragons!

  1. out of order messages
  2. duplicated messages
  3. lost messages

a nice distributed system

  1. scales up/out
  2. has homogenic nodes
  3. supports change number of nodes
  4. for databases: don't brake under Jepsen test © Aphyr

a few examples

  1. Databases: Cassandra, Riak, EventStore, FoundationDB
  2. Queues: RabbitMQ (:D), Azure Queues
  3. Data processing: Storm
  4. Configuration: Zookeeper, Consul, etcd

CAP theorem

Eric Brewer asks you to choose two of them. You cannot satisfy them all.

  1. Consistency - linearizability (allows perceving a history of operations as a sequence)
  2. Availability - every request receives a response ok/error
  3. Partition tolerance - can operate despite loosing messages, message copies etc.

CP, AP, AC?

Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.
Eric Brewer

CP, AP, AC? 2

  1. CP: EventStore, Cassasnda (lightweight transactions), FoundationDB
  2. AP: Riak (allow_mult), Cassandra, Dynamo (Amazon)
  3. AC: lies, lies, lies...

Consensus, where are you?

It's needed for CP systems.

State machine

  1. F(state, input) -> newState
  2. F(F(F(s0, i0), i1), i2) -> s3
  3. a sequence of s0, i1, i2, i3, i4, ... will give the same result for the same F

State machine replication

If there was a way to replicate a sequence of entries across multiple machines, having the same function applied to the sequence, would result in having the same state on all machines. That would bring consensus to all machines/processes...

Paxos

  1. rooted in Leslie Lamport state machine approach
  2. published in 1989
  3. the algorithm family consists of many algorithms with different trade-offs

Paxos - a dictionary

  1. processor - a node in a given cluster
  2. quorum - a strong majority of processors (for 2N+1 nodes its N+1)
  3. hard to understand like Paxos - a common phrase for implementors of distributed systems

Paxos - processor roles

  1. Client
  2. Acceptor (voter)
  3. Proposer
  4. Learner
  5. Leader
  6. Shouldn't we switch to RAFT?

RAFT - basic info

  1. authors: Diego Ongaro; John Ousterhout
  2. written to be easy to understand
  3. it is easy to understand
  4. splits algorithm into:
    1. leader election
    2. replication

RAFT - logical clock

  1. synchronizing clocks is hard (Google Spanner: GPS + atomic clocks)
  2. use logical, natural, incrementing numbers
  3. term - a logical epoch of the system
  4. clock can only go forward - never accepts messages sent in earlier terms

RAFT - roles

  1. follower - simply follow the leader and vote if no leader. The initial state
  2. candidate - votes for itself, steps down if leader elected, or becomes the leaders
  3. leader - a strong leader, replicating its logs to other nodes

RAFT - election

  1. when no msgs from leader, after given timeout, inc the term become a candidate
  2. candidate votes for itself in the given term and asks others for votes
  3. follower, votes for the first candidate in the given term
  4. all votes are persisted, when a crush occur it's disallowed to vote second time in the same term
  5. if no leader emerged, inc the term, reelection occurs

RAFT - election example

  1. N1, N2, N3 - nodes
  2. N1 and N2 times out and becomes candidates
  3. each votes for itself and sends vote request to N3
  4. N3 votes for N1
  5. N1 becomes a leader

RAFT - election questions

  1. is it possible to get two leaders in one term?
  2. is it possible to get two leaders in different terms?

RAFT - replication

  1. leader sends AppendEntries messages
  2. when a follower has a corrupted entries, leaders steps back in its history finiding the matching one and resends
  3. leader commits entries via AppendEntries messages
  4. leader commits ONLY iff it successfully appended at least one entry

I want to know moreaarar!

  1. @Kellabyte
  2. @Peter Bailis
  3. Aphyr

Thank you and let's RAFT!

Szymon Kulec

@Scooletz

http://blog.scooletz.com