Slide 1: CS 525 Advanced Distributed Systems, Spring 2011
Indranil Gupta (Indy)
Membership Protocols (and Failure Detectors)
March 31, 2011
All Slides © IG
Slide 2: Target Settings
Process 'group'-based systems
Clouds/datacenters
Replicated servers
Distributed databases
Crash-stop/fail-stop process failures
Slide 3: Group Membership Service
[Figure: an application process pi issues application queries (e.g., gossip, overlays, DHTs) against its group membership list; a membership protocol maintains that list, tracking joins, leaves, and failures of members over unreliable communication]
Slide 4: Two Sub-Protocols
[Figure: application process pi with its group membership list, maintained over unreliable communication]
Almost-complete list (focus of this talk): gossip-style, SWIM, virtual synchrony, ...
Or partial-random list (other papers): SCAMP, T-MAN, Cyclon, ...
Slide 5: Large Group: Scalability a Goal
[Figure: a process group of 1000's of processes ("members"); one of them is us (pi)]
Slide 6: Group Membership Protocol
Crash-stop failures only
[Figure: the group membership protocol reacting to a crash of process pj]
Slide 7: I. pj Crashes
Nothing we can do about it!
A frequent occurrence
Common case rather than exception
Frequency goes up at least linearly with the size of the datacenter
Slide 8: II. Distributed Failure Detectors: Desirable Properties
Completeness = each failure is detected
Accuracy = there is no mistaken detection
Speed: time to first detection of a failure
Scale: equal load on each member; network message load
Slide 9: Distributed Failure Detectors: Properties
Completeness
Accuracy
Speed: time to first detection of a failure
Scale: equal load on each member; network message load
Slide 10: What Real Failure Detectors Prefer
Completeness: guaranteed
Accuracy: partial/probabilistic guarantee
Speed: time to first detection of a failure
Scale: equal load on each member; network message load
Slide 11: Failure Detector Properties
Completeness: guaranteed
Accuracy: partial/probabilistic guarantee
Speed: time until some process detects the failure
Scale: equal load on each member; network message load
Slide 12: Failure Detector Properties
Completeness: guaranteed
Accuracy: partial/probabilistic guarantee
Speed: time until some process detects the failure
Scale: equal load on each member; network message load; no bottlenecks/single point of failure
Slide 13: Failure Detector Properties
Completeness, Accuracy, Speed (time to first detection of a failure), Scale (equal load on each member; network message load)
All in spite of arbitrary simultaneous process failures
Slide 14: Centralized Heartbeating
[Figure: every process pi sends "pi, heartbeat seq. l++" to a single monitor pj]
Heartbeats sent periodically
If a heartbeat is not received from pi within the timeout, mark pi as failed
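The timeout rule above can be sketched as follows; the class and all names are illustrative, not from the slides. The monitor keeps, per process, the freshest heartbeat sequence number and the local time it arrived:

```python
class CentralizedDetector:
    """Sketch of a central heartbeat monitor: each process sends (id, seq)
    heartbeats; the monitor marks a process failed if no heartbeat arrives
    within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process id -> (latest seq, local receipt time)

    def on_heartbeat(self, pid, seq, now):
        # Keep only the freshest sequence number per process.
        prev = self.last_seen.get(pid)
        if prev is None or seq > prev[0]:
            self.last_seen[pid] = (seq, now)

    def failed(self, now):
        # Any known process silent for longer than the timeout is declared failed.
        return {pid for pid, (_, t) in self.last_seen.items()
                if now - t > self.timeout}
```

Note the scheme's weakness, which later slides return to: the monitor is both a load hotspot and a single point of failure.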
Slide 15: Ring Heartbeating
[Figure: processes arranged in a ring; each pi sends "pi, heartbeat seq. l++" to its ring neighbor(s) pj]
Slide 16: All-to-All Heartbeating
[Figure: every process pi sends "pi, heartbeat seq. l++" to every other process pj]
Slide 17: Gossip-style Heartbeating
[Figure: each process pi periodically sends an array of heartbeat seq. l for a member subset to a few gossip targets]
Slide 18: Gossip-Style Failure Detection
Protocol:
Nodes periodically gossip their membership list
On receipt, the local membership list is updated
[Figure: four nodes (1-4), each maintaining a table of (address, heartbeat counter, local time); current time at node 2 is 70 (clocks are asynchronous)]
Fig and animation by: Dongyun Jin and Thuy Nguyen
Slide 19: Gossip-Style Failure Detection
If the heartbeat has not increased for more than Tfail seconds, the member is considered failed
And after a further Tcleanup seconds, it will delete the member from the list
Why two different timeouts?
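The two-timeout rule can be sketched in Python; the class and names, and the use of module-level constants for the timeouts, are illustrative. Each entry records a heartbeat counter and the local time it last increased; on receipt of a gossip message, the higher counter wins and is restamped with the local clock:

```python
T_FAIL = 30     # seconds without heartbeat increase before marking failed
T_CLEANUP = 30  # additional seconds before deleting the entry

class GossipMember:
    """Sketch of one node's gossip-style failure-detection state."""

    def __init__(self):
        self.table = {}  # address -> (heartbeat counter, local time of last increase)

    def local_tick(self, my_addr, now):
        # Increment our own heartbeat counter before gossiping.
        hb, _ = self.table.get(my_addr, (0, now))
        self.table[my_addr] = (hb + 1, now)

    def merge(self, remote_table, now):
        # Keep the higher heartbeat; stamp it with our own (asynchronous) clock.
        for addr, (hb, _) in remote_table.items():
            local = self.table.get(addr)
            if local is None or hb > local[0]:
                self.table[addr] = (hb, now)

    def classify(self, now):
        # After T_FAIL: considered failed; after T_FAIL + T_CLEANUP: deleted.
        failed, cleaned = set(), set()
        for addr, (hb, t) in self.table.items():
            if now - t > T_FAIL + T_CLEANUP:
                cleaned.add(addr)
            elif now - t > T_FAIL:
                failed.add(addr)
        for addr in cleaned:
            del self.table[addr]
        return failed, cleaned
```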
Slide 20: Gossip-Style Failure Detection
What if an entry pointing to a failed node is deleted right after Tfail seconds?
A neighbor that has not yet timed the node out could gossip the stale entry back, and it would be reinserted as if new
Fix: remember the deleted entry for another Tfail
[Figure: nodes 1-4; current time at node 2 is 75]
Slide 21: Multi-level Gossiping
Network topology is hierarchical
Random gossip target selection => core routers face O(N) load (Why?)
Fix: select the gossip target in subnet i, which contains ni nodes, with probability 1/ni
Router load = O(1)
Dissemination time = O(log(N)). Why?
What about latency for multi-level topologies?
[Gupta et al, TPDS 06]
[Figure: a core router connecting two subnets of N/2 nodes each]
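The subnet-weighted target selection can be sketched as below; the function and parameter names are assumptions. A node in a subnet of ni members crosses the router with probability 1/ni, so the subnet's expected cross-router traffic per round is ni * (1/ni) = O(1):

```python
import random

def pick_gossip_target(my_addr, my_subnet, other_subnets, rng=random):
    """Hierarchy-aware gossip target selection (sketch of the slide's rule):
    with probability 1/n_i (n_i = size of my subnet), gossip across the
    router to a random node in another subnet; otherwise gossip locally."""
    n_i = len(my_subnet)
    if rng.random() < 1.0 / n_i:
        # Cross-router gossip: pick a random node in a random other subnet.
        subnet = rng.choice(other_subnets)
        return rng.choice(subnet)
    # Local gossip: any subnet peer except ourselves.
    peers = [p for p in my_subnet if p != my_addr]
    return rng.choice(peers)
```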
Slide 22: Analysis/Discussion
What happens if the gossip period Tgossip is decreased?
A single heartbeat takes O(log(N)) time to propagate. So N heartbeats take:
O(log(N)) time to propagate, if the bandwidth allowed per node is O(N)
O(N.log(N)) time to propagate, if the bandwidth allowed per node is only O(1)
What about O(k) bandwidth?
What happens to Pmistake (the false positive rate) as Tfail, Tcleanup are increased?
Tradeoff: false positive rate vs. detection time
Slide 23: Simulations
As the # of members increases, the detection time increases
As the requirement is loosened, the detection time decreases
As the # of failed members increases, the detection time increases significantly
The algorithm is resilient to message loss
Slide 24: Failure Detector Properties ...
Completeness
Accuracy
Speed: time to first detection of a failure
Scale: equal load on each member; network message load
Slide 25: ...Are Application-Defined Requirements
Completeness: guarantee always
Accuracy: probability PM(T) of a mistake within T time units
Speed: time to first detection of a failure
Scale: equal load on each member; network message load
Slide 26: ...Are Application-Defined Requirements
Completeness: guarantee always
Accuracy: probability PM(T) of a mistake within T time units
Speed: time to first detection of a failure
Scale: equal load L on each member; network message load N*L: compare this across protocols
Slide 27: All-to-All Heartbeating
[Figure: every pi sends "pi, heartbeat seq. l++" to all others]
Every T units
L = N/T
Slide 28: Gossip-style Heartbeating
[Figure: pi sends an array of heartbeat seq. l for a member subset]
Every tg units (= gossip period), send an O(N)-size gossip message
T = logN * tg
L = N/tg = N*logN/T
Slide 29: What's the Best/Optimal We Can Do?
Worst case load L* as a function of T, PM(T), N
Independent message loss probability pml
(proof in PODC 01 paper)
Slide 30: Heartbeating
Optimal L is independent of N (!)
All-to-all and gossip-based are sub-optimal:
L = O(N/T)
They try to achieve simultaneous detection at all processes
They fail to distinguish the Failure Detection and Dissemination components
Key:
Separate the two components
Use a non-heartbeat-based Failure Detection component
Slide 31: SWIM Failure Detector Protocol
[Figure: in each protocol period, pi pings a randomly chosen target pj; on timeout, pi asks K randomly selected members to ping pj indirectly and relay back any ack]
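One SWIM protocol period at pi, as described in the SWIM paper, might be sketched like this; the `send_ping` callable is a stand-in for the network (returning True iff an ack arrives within the timeout), and all names are assumptions:

```python
import random

def swim_probe(target, members, send_ping, k=3, rng=random):
    """One SWIM protocol period at process pi (sketch): direct ping first;
    on timeout, ask K other members to ping the target indirectly."""
    if send_ping("pi", target):
        return True  # direct ack received
    helpers = [m for m in members if m != target]
    for h in rng.sample(helpers, min(k, len(helpers))):
        if send_ping(h, target):
            return True  # indirect ack relayed back via helper h
    return False  # no ack at all: mark target as failed (or suspected)
```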
Slide 32: SWIM versus Heartbeating
For a fixed false positive rate and message loss rate:
SWIM: constant process load and constant first-detection time
Heartbeating: O(N) process load for constant detection time, or O(N) detection time for constant load
Slide 34: Accuracy, Load
PM(T) is exponential in -K. Also depends on pml (and pf)
See paper for up to 15% loss rates
Slide 35: Detection Time
Prob. of being pinged in T' = 1 - (1 - 1/N)^(N-1) ≈ 1 - e^(-1)
E[T] = T' * e/(e - 1)
Completeness: any alive member detects failure
Eventually
By using a trick: within worst case O(N) protocol periods
Slide 37: Dissemination Options
Multicast (hardware/IP): unreliable; multiple simultaneous multicasts
Point-to-point (TCP/UDP): expensive
Zero extra messages: piggyback on failure detector messages
Infection-style dissemination
Slide 38: Infection-style Dissemination
[Figure: pi piggybacks membership updates on its ping/ack traffic to pj and to K random processes]
Slide 39: Infection-style Dissemination
Epidemic-style dissemination
After λ·log(N) protocol periods, N^(-(2λ-2)) processes would not have heard about an update
Maintain a buffer of recently joined/evicted processes
Piggyback from this buffer
Prefer recent updates
Buffer elements are garbage collected after a while
After λ·log(N) protocol periods; this defines weak consistency
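The dissemination buffer above (piggyback recent updates, garbage-collect old ones) could be sketched as follows. All names are illustrative, and the eviction policy, a fixed per-entry send count, is a simplifying assumption for "garbage collected after a while":

```python
class PiggybackBuffer:
    """Sketch of an infection-style dissemination buffer: keep recently
    changed members, piggyback the least-gossiped (freshest) updates on each
    outgoing message, and evict entries after max_sends transmissions."""

    def __init__(self, max_sends, per_message):
        self.max_sends = max_sends
        self.per_message = per_message
        self.updates = {}  # member -> [update, times sent]

    def add(self, member, update):
        self.updates[member] = [update, 0]

    def select(self):
        # Prefer updates that have been piggybacked the fewest times.
        batch = sorted(self.updates.items(), key=lambda kv: kv[1][1])
        chosen = batch[: self.per_message]
        for member, entry in chosen:
            entry[1] += 1
        # Garbage-collect entries that have been sent enough times.
        for member in [m for m, e in self.updates.items() if e[1] >= self.max_sends]:
            del self.updates[member]
        return [(m, e[0]) for m, e in chosen]
```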
Slide 40: Suspicion Mechanism
False detections, due to:
Perturbed processes
Packet losses, e.g., from congestion
Indirect pinging may not solve the problem
e.g., correlated message losses near the pinged host
Key: suspect a process before declaring it failed in the group
Slide 41: Suspicion Mechanism
State machine for pi's view element for pj, with states Alive, Suspected, Failed:
Alive -> Suspected: FD: pi's ping to pj fails, or Dissmn delivers (Suspect pj)
Suspected -> Alive: FD: pi's ping to pj succeeds, or Dissmn delivers (Alive pj)
Suspected -> Failed: timeout; Dissmn then spreads (Failed pj)
Each transition also triggers dissemination of the corresponding (Suspect pj), (Alive pj), or (Failed pj) message
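The suspicion state machine can be written out as a transition function; the event names below are illustrative labels for the FD and Dissmn triggers on the slide:

```python
ALIVE, SUSPECTED, FAILED = "alive", "suspected", "failed"

def next_state(state, event):
    """pi's view element for pj (sketch of the slide's state machine).
    Events: 'ping_failed', 'ping_success', 'suspect_msg', 'alive_msg',
    'timeout', 'failed_msg'. FAILED is absorbing."""
    if state == FAILED:
        return FAILED
    if event == "failed_msg":
        return FAILED
    if state == ALIVE and event in ("ping_failed", "suspect_msg"):
        return SUSPECTED
    if state == SUSPECTED:
        if event in ("ping_success", "alive_msg"):
            return ALIVE
        if event == "timeout":
            return FAILED
    return state
```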
Slide 42: Suspicion Mechanism
Distinguish multiple suspicions of a process
Per-process incarnation number
Inc # for pi can be incremented only by pi
e.g., when it receives a (Suspect, pi) message
Somewhat similar to DSDV
Higher inc# notifications override lower inc#'s
Within an inc#: (Suspect, inc #) > (Alive, inc #)
Nothing overrides a (Failed, inc #)
See paper
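The override rules above can be captured in a small comparison function; encoding messages as (type, inc#) tuples is an assumption for the sketch:

```python
def overrides(new_msg, old_msg):
    """Does new_msg override old_msg? Messages are (type, inc#) pairs,
    type in {'alive', 'suspect', 'failed'} (sketch of the slide's rules)."""
    new_t, new_inc = new_msg
    old_t, old_inc = old_msg
    if old_t == "failed":
        return False              # nothing overrides a Failed notification
    if new_t == "failed":
        return True               # Failed overrides everything else
    if new_inc != old_inc:
        return new_inc > old_inc  # higher incarnation number wins
    # Same incarnation: Suspect overrides Alive, not vice versa.
    return new_t == "suspect" and old_t == "alive"
```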
Slide 43: Time-bounded Completeness
Key: select each membership element once as a ping target in a traversal
Round-robin pinging
Random permutation of the list after each traversal
Each failure is detected in worst case 2N-1 (local) protocol periods
Preserves FD properties
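Round-robin pinging with reshuffling might look like this sketch (class and method names are illustrative): every member appears exactly once per traversal, which is what bounds the detection time.

```python
import random

class PingScheduler:
    """Round-robin ping target selection (sketch): traverse a random
    permutation of the membership list, reshuffling after each full
    traversal, so every member is pinged once per traversal."""

    def __init__(self, members, rng=random):
        self.members = list(members)
        self.rng = rng
        self._refill()

    def _refill(self):
        # New traversal: take a fresh random permutation of the list.
        self.queue = self.members[:]
        self.rng.shuffle(self.queue)

    def next_target(self):
        if not self.queue:
            self._refill()
        return self.queue.pop()
```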
Slide 44: Results from an Implementation
Current implementation:
Win2K, uses Winsock 2
Uses only UDP messaging
900 semicolons of code (including testing)
Experimental platform:
Galaxy cluster: diverse collection of commodity PCs
100 Mbps Ethernet
Default protocol settings:
Protocol period = 2 s; K = 1; G.C. and suspicion timeouts = 3*ceil[log(N+1)]
No partial membership lists observed in experiments
Slide 45: Per-process send and receive loads are independent of group size
Slide 46: Time to first detection of a process failure
[Figure: component T1 of the total T1+T2+T3]
Slide 47: Time to first detection of a process failure is apparently uncorrelated to group size
[Figure: component T1 of the total T1+T2+T3]
Slide 48: Membership update dissemination time is low at high group sizes
[Figure: component T2 of the total T1+T2+T3]
Slide 49: Excess time taken by the suspicion mechanism
[Figure: component T3 of the total T1+T2+T3]
Slide 50: Benefit of the suspicion mechanism: per-process 10% synthetic packet loss
Slide 51: More Discussion Points
It turns out that with a partial membership list that is uniformly random, gossiping retains the same properties as with complete membership lists
Why? (Think of the equation)
Partial membership protocols: SCAMP, Cyclon, T-MAN, ...
Gossip-style failure detection underlies:
Astrolabe
Amazon EC2/S3 (rumored!)
SWIM is used in the CoralCDN/Oasis anycast service: http://oasis.coralcdn.org
Mike Freedman used the suspicion mechanism to blackmark frequently-failing nodes
Slide 52: Reminder – Due this Sunday, April 3rd, at 11.59 PM
Project Midterm Report due, 11.59 pm [12pt font, single-sided, 8 + 1 page Business Plan max]
Wiki Term Paper – Second Draft due (individual)
Reviews – you only have to submit reviews for 15 sessions (any 15 sessions) from 2/10 to 4/28. Keep track of your count! Take a breather!