Replica Assignment assigns SPUs to a replica set and Replica Election coordinates their roles. The election algorithm manages replica sets in an attempt to designate one active leader at all times. SPUs have a powerful multi-threaded engine that can process a large number of leaders and followers at the same time.
If an SPU becomes incapacitated, the election algorithm identifies all impacted replica sets and triggers a re-election. The following section describes the algorithm utilized by each replica set as it elects a new leader.
The Leader and Followers of a Replica Set have different responsibilities.
Leader responsibilities:
- ingests data from producers
- stores the data in the local store
- sends data to consumers
- forwards incremental data changes to followers
- keeps live replica sets (LRS) updated
Follower responsibilities:
- establishes a connection to the leader (and runs periodic health checks)
- receives data changes from the leader
- stores data in the local store
All followers are in hot standby and ready to take over as leader.
Each data stream has a Live Replica Set (LRS) that describes the SPUs actively replicating data records in their local data store. LRS status can be viewed with the show partitions CLI command.
Replica election covers two core cases:
- an SPU goes offline
- an SPU that was previously part of the cluster comes back online
When an SPU goes offline, the SC identifies all impacted Replica Sets and triggers an election:
- set Replica Set status to Election
- choose leader candidate from the followers in LRS based on smallest lag behind the previous leader:
  - leader candidate found:
    - set Replica Set status to CandidateFound
    - notify all follower SPUs
    - start wait-for-response timer
  - no eligible leader candidate available:
    - set Replica Set status to Offline
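The selection step above can be sketched in Rust as follows. This is a minimal illustration, not the SC's actual code: the `ReplicaStatus` and `Follower` types, the function names, the definition of lag as the previous leader's LEO minus the follower's LEO, and the tie-break on the lowest SPU id are all assumptions made for the example.

```rust
/// Hypothetical status values mirroring the names used in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplicaStatus {
    Online,
    Election,
    CandidateFound,
    Offline,
}

/// A follower in the LRS: its SPU id and its last known LEO (assumed fields).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Follower {
    pub spu_id: u32,
    pub leo: u64,
}

/// Pick the follower with the smallest lag behind the previous leader.
/// Lag is assumed to be `leader_leo - follower.leo`; ties go to the lowest SPU id.
pub fn choose_candidate(leader_leo: u64, lrs_followers: &[Follower]) -> Option<Follower> {
    lrs_followers
        .iter()
        .copied()
        .min_by_key(|f| (leader_leo.saturating_sub(f.leo), f.spu_id))
}

/// Status the replica set moves to once the selection step has run.
pub fn status_after_selection(candidate: Option<Follower>) -> ReplicaStatus {
    match candidate {
        Some(_) => ReplicaStatus::CandidateFound,
        None => ReplicaStatus::Offline,
    }
}
```

With a previous leader at LEO 10 and followers at LEO 7 and LEO 9, the follower at LEO 9 is chosen; with an empty follower list the replica set goes to Offline.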
Follower SPUs receive Leader Candidate Notification
All SPUs in the Replica Set receive the proposed leader candidate and perform the following operations:
- the SPU that matches the leader candidate tries to promote its follower replica to leader replica:
  - follower to leader promotion successful: the SPU notifies the SC
  - follower to leader promotion failed: no notification is sent
- other SPUs ignore the message
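As a rough sketch of the SPU-side decision, the function below maps a candidate notification to one of three outcomes. The `SpuAction` enum, the `on_candidate` name, and the `promote` callback are hypothetical and stand in for whatever the SPU actually does to promote a replica.

```rust
/// What an SPU does after receiving a leader candidate (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum SpuAction {
    /// Promotion succeeded: notify the SC.
    NotifySc,
    /// Promotion failed: send nothing.
    Silent,
    /// This SPU is not the candidate: ignore the message.
    Ignore,
}

/// `promote` stands in for the actual follower-to-leader promotion and
/// returns `true` when it succeeds. It is only called on the candidate SPU.
pub fn on_candidate<F>(local_spu: u32, candidate_spu: u32, promote: F) -> SpuAction
where
    F: FnOnce() -> bool,
{
    if local_spu != candidate_spu {
        return SpuAction::Ignore;
    }
    if promote() {
        SpuAction::NotifySc
    } else {
        SpuAction::Silent
    }
}
```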
SC receives Promotion Successful from Leader Candidate SPU
The SC performs the following operations:
- set Replica Set status to Online
- update LRS
- notify all follower SPUs in the LRS list
SC ‘wait-for-response’ timer fired
The SC chooses the next Leader Candidate from the LRS list and the process repeats.
If no eligible leader candidate is left:
- set Replica Set status to Offline
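The retry behaviour can be pictured as a loop over candidates. The sketch below is illustrative only: `ElectionOutcome`, `run_election`, and the `promoted_in_time` callback are hypothetical names, and the callback collapses "notify followers, start the wait-for-response timer, and check whether Promotion Successful arrived before the timer fired" into a single boolean.

```rust
/// Outcome of an election round (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum ElectionOutcome {
    /// The SPU with this id was promoted; the replica set goes Online.
    Online { leader: u32 },
    /// Every candidate was tried without success; the replica set goes Offline.
    Offline,
}

/// Try candidates in the given order (assumed to be sorted by smallest lag).
/// `promoted_in_time` returns `true` if the candidate reported a successful
/// promotion before the wait-for-response timer fired.
pub fn run_election<F>(candidates: &[u32], mut promoted_in_time: F) -> ElectionOutcome
where
    F: FnMut(u32) -> bool,
{
    for &spu_id in candidates {
        if promoted_in_time(spu_id) {
            return ElectionOutcome::Online { leader: spu_id };
        }
        // Timer fired without a response: move on to the next candidate.
    }
    ElectionOutcome::Offline
}
```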
SPUs receive new LRS
All SPU followers update their LRS:
- reconnect to new leader
- synchronize internal checkpoints with new leader (as described below).
- wait for changes from new leader
When a known SPU comes back online, the SC identifies all impacted Replica Sets and triggers a refresh.
For all Replica Sets with status Offline, the SC performs the following operations:
- set Replica Set status to Election
- choose leader candidate from follower membership list based on smallest lag behind the previous leader:
  - leader candidate found:
    - set Replica Set status to CandidateFound
    - notify all follower SPUs (see above)
    - start wait-for-response timer
  - no eligible leader candidate left:
    - set Replica Set status to Offline
The algorithm repeats the same steps as in the “SPU goes Offline” section.
Each SPU has a Leader Controller that manages leader replicas, and a Follower Controller that manages follower replicas. The SPU uses Rust's async framework to run a large number of leader and follower operations concurrently.
Each Replica Set has a communication channel where the leader and followers exchange replica information. It is the responsibility of the followers to establish a connection to the leader. Once a connection between two SPUs is created, it is shared by all replica sets.
For example, consider three replica sets a, b, and c that are distributed across SPU-1, SPU-2, and SPU-3.
The first follower (b or c) from SPU-1 that tries to communicate with its leader in SPU-2 creates a TCP connection. Then, all subsequent communication from SPU-1 to SPU-2, irrespective of the replica set, reuses the same connection.
Hence, each SPU pair will have at most 2 connections. For example:
SPU-1 <=> SPU-2
- SPU-1 followers => SPU-2 leaders
- SPU-2 followers => SPU-1 leaders
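One way to picture the sharing rule is a pool keyed by direction, so that every replica set going from one SPU's followers to another SPU's leaders lands on the same entry. The `ConnectionPool` type below is a hypothetical model, not the SPU's networking code; connection ids stand in for real TCP sockets.

```rust
use std::collections::HashMap;

/// Hypothetical pool that keys connections by direction:
/// (SPU hosting the follower, SPU hosting the leader).
#[derive(Default)]
pub struct ConnectionPool {
    connections: HashMap<(u32, u32), u64>,
    next_id: u64,
}

impl ConnectionPool {
    /// Return the connection id for this direction, opening a new one only
    /// if no replica set has connected in this direction yet.
    pub fn connect(&mut self, follower_spu: u32, leader_spu: u32) -> u64 {
        if let Some(&id) = self.connections.get(&(follower_spu, leader_spu)) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.connections.insert((follower_spu, leader_spu), id);
        id
    }

    /// Number of open connections between two SPUs, counting both directions.
    pub fn between(&self, a: u32, b: u32) -> usize {
        [(a, b), (b, a)]
            .iter()
            .filter(|key| self.connections.contains_key(*key))
            .count()
    }
}
```

Because the key has only two possible values for a given SPU pair, the count returned by `between` can never exceed two, however many replica sets the two SPUs share.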
Replicas use offsets to indicate the position of a record in a data stream. Offsets start at zero and are incremented by one each time a new record is appended.
Log End Offset (LEO) represents the offset of the last record in the local store of a replica. A record is considered committed only when replicated by all live replicas. The Live Replica Set (LRS) is the set of active replicas in the membership list. High Watermark (HW) is the offset of the last record committed by the LRS.
The synchronization algorithm collects the LEOs, computes the HW, and manages the LRS.
In this example:
- LRS = 3
- HW = 2
- LEO (Leader = 4, Follower-1 = 3, Follower-2 = 2)
If Follower-2 goes offline: LRS = 2 and HW = 3.
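The numbers in this example follow from taking the minimum LEO across the live replicas. A minimal sketch, assuming a hypothetical `high_watermark` helper that receives the LEOs of the current LRS members:

```rust
/// High watermark for a live replica set: the smallest LEO among its members.
/// Returns `None` when the set is empty.
pub fn high_watermark(lrs_leos: &[u64]) -> Option<u64> {
    lrs_leos.iter().copied().min()
}
```

With LEOs 4, 3, and 2 the result is 2; once Follower-2 leaves the LRS, the remaining LEOs 4 and 3 give 3.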
Leader/Follower Synchronization
All replica followers send their replica status, LEO and HW, to their leader. The leader:
- uses LEO to compute the missing records and sends them to the follower
- computes the new HW (from the minimum of the replica LEOs)
- sends HW, LEO, and LRS to all followers
Replica followers receive the data records, LEO, and HW from the leader and perform the following operations:
- append records to local stores
- update local LEO and HW
- send updated status to leader
And the cycle repeats.
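The offset arithmetic behind one round of this cycle can be sketched as below. The function names are hypothetical, and the sketch assumes that LEO is the offset of the last stored record and that both replicas hold at least one record.

```rust
use std::ops::RangeInclusive;

/// Offsets the leader still has to send to a follower: everything after the
/// follower's LEO up to and including the leader's LEO. The range is empty
/// when the follower is already caught up.
pub fn missing_offsets(leader_leo: u64, follower_leo: u64) -> RangeInclusive<u64> {
    (follower_leo + 1)..=leader_leo
}

/// The follower's LEO after appending the received records to its local store.
pub fn leo_after_append(follower_leo: u64, received: &RangeInclusive<u64>) -> u64 {
    if received.is_empty() {
        follower_leo
    } else {
        *received.end()
    }
}
```

For a leader at LEO 4 and a follower at LEO 2, the leader sends offsets 3 and 4, and the follower reports LEO 4 in its next status update.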
Lagging Follower
If the leader detects that any follower's HW trails the LRS HW by more than a maximum number of records, the leader removes that follower from the LRS.
Followers removed from the LRS are ineligible for election but continue to receive records. If a follower catches up with the leader, it is added back to the LRS and once again becomes eligible for election.
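The lag check can be sketched as a simple threshold test. The names `is_lagging` and `refresh_lrs` and the `max_lag` parameter are assumptions for illustration, as is the reading that a follower is removed only when its lag strictly exceeds the maximum.

```rust
/// True when the follower's HW trails the LRS HW by more than `max_lag` records.
pub fn is_lagging(lrs_hw: u64, follower_hw: u64, max_lag: u64) -> bool {
    lrs_hw.saturating_sub(follower_hw) > max_lag
}

/// Given `(spu_id, hw)` pairs for all followers, return the SPU ids that
/// belong in the LRS. Applying this on every status update both removes
/// lagging followers and re-admits followers that have caught up.
pub fn refresh_lrs(lrs_hw: u64, follower_hws: &[(u32, u64)], max_lag: u64) -> Vec<u32> {
    follower_hws
        .iter()
        .filter(|(_, hw)| !is_lagging(lrs_hw, *hw, max_lag))
        .map(|(id, _)| *id)
        .collect()
}
```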
Leader Failure
If a leader goes offline, an election is triggered and one of the followers takes over as leader. The rest of the followers connect to the new leader and synchronize their data store.
When the failed leader rejoins the replica set, it detects the new leader and turns itself into a follower. The replica set continues under the new leadership until a new election is triggered.
Replica leaders receive data records from producers and send them to consumers.
Consumers can choose to receive either COMMITTED or UNCOMMITTED records. The second method is discouraged as it cannot deterministically survive various failure scenarios.
By default, UNCOMMITTED messages are sent to consumers.
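In terms of offsets, the two modes differ only in how far a consumer may read: committed reads stop at the HW, while uncommitted reads continue to the leader's LEO. The `Isolation` enum and `last_readable_offset` function below are hypothetical names used to illustrate that distinction, not the client API.

```rust
/// Which records a consumer asks for (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Isolation {
    Committed,
    Uncommitted,
}

/// Last offset on the leader that a consumer may read under the given mode.
pub fn last_readable_offset(isolation: Isolation, leader_leo: u64, hw: u64) -> u64 {
    match isolation {
        // Only records replicated by all live replicas.
        Isolation::Committed => hw,
        // Everything in the leader's local store, replicated or not.
        Isolation::Uncommitted => leader_leo,
    }
}
```

With a leader at LEO 4 and HW 2, a committed consumer reads up to offset 2, while an uncommitted consumer reads up to offset 4 and may see records that are lost if the leader fails before they are replicated.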