
Slide 1: CS 525 Advanced Distributed Systems, Spring 2011
Indranil Gupta (Indy)
Membership Protocols (and Failure Detectors)
March 31, 2011
All Slides © IG

Slide 2: Target Settings
Process 'group'-based systems
Clouds/datacenters
Replicated servers
Distributed databases

Crash-stop/fail-stop process failures

Slide 3: Group Membership Service
(Figure: at each application process pi, a membership protocol maintains the group membership list, which tracks joins, leaves, and failures of members over unreliable communication; application queries, e.g., gossip, overlays, DHTs, etc., read this list.)

Slide 4: Two Sub-protocols
(Figure: application process pi, its group membership list, unreliable communication.)
Almost-complete list (focus of this talk): gossip-style, SWIM, virtual synchrony, ...
Or partial-random list (other papers): SCAMP, T-MAN, Cyclon, ...

Slide 5: Large Group: Scalability a Goal
(Figure: a process group of 1000's of processes, the "members"; this is us, pi.)

Slide 6: Group Membership Protocol
Crash-stop failures only
(Figure: the group, with member pj.)

Slide 7: I. pj Crashes
Nothing we can do about it!
A frequent occurrence
Common case rather than exception
Frequency goes up at least linearly with the size of the datacenter

Slide 8: II. Distributed Failure Detectors: Desirable Properties
Completeness = each failure is detected
Accuracy = there is no mistaken detection
Speed: time to first detection of a failure
Scale: equal load on each member; network message load

Slide 9: Distributed Failure Detectors: Properties
Completeness
Accuracy
Speed: time to first detection of a failure
Scale: equal load on each member; network message load

Slide 10: What Real Failure Detectors Prefer
Completeness: guaranteed
Accuracy: partial/probabilistic guarantee
Speed: time to first detection of a failure
Scale: equal load on each member; network message load

Slide 11: Failure Detector Properties
Completeness: guaranteed
Accuracy: partial/probabilistic guarantee
Speed: time until some process detects the failure
Scale: equal load on each member; network message load

Slide 12: Failure Detector Properties
Completeness: guaranteed
Accuracy: partial/probabilistic guarantee
Speed: time until some process detects the failure
Scale: equal load on each member; network message load; no bottlenecks/single failure point

Slide 13: Failure Detector Properties
Completeness: in spite of arbitrary simultaneous process failures
Accuracy
Speed: time to first detection of a failure
Scale: equal load on each member; network message load

Slide 14: Centralized Heartbeating
(Figure: each member pi sends "pi, heartbeat seq. l++" to a central member pj.)
Heartbeats sent periodically
If a heartbeat is not received from pi within the timeout, mark pi as failed
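A minimal sketch of the timeout rule on this slide, assuming a single monitor process that receives everyone's heartbeats; class and method names are illustrative, not from the slides:

```python
import time

class CentralMonitor:
    """Sketch of the central node's bookkeeping (illustrative names)."""

    def __init__(self, timeout):
        self.timeout = timeout     # seconds without a heartbeat => failed
        self.last_seen = {}        # member id -> (last seq number, local arrival time)

    def on_heartbeat(self, member, seq):
        # Accept only fresher sequence numbers; stamp with the local clock.
        prev = self.last_seen.get(member)
        if prev is None or seq > prev[0]:
            self.last_seen[member] = (seq, time.monotonic())

    def failed_members(self):
        # Mark pi failed if no heartbeat arrived within the timeout.
        now = time.monotonic()
        return [m for m, (_, t) in self.last_seen.items() if now - t > self.timeout]
```

The obvious drawback, which the next slides work around, is that the central monitor carries all the load and is a single point of failure.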

Slide 15: Ring Heartbeating
(Figure: members arranged in a ring; each pi sends "pi, heartbeat seq. l++" to its ring neighbors, e.g., pj.)

Slide 16: All-to-All Heartbeating
(Figure: every member pi sends "pi, heartbeat seq. l++" to every other member, e.g., pj.)

Slide 17: Gossip-style Heartbeating
(Figure: pi sends an array of heartbeat seq. numbers for a member subset.)

Slide 18: Gossip-Style Failure Detection
Protocol:
Nodes periodically gossip their membership list
On receipt, the local membership list is updated
(Figure: four nodes, 1-4; current time is 70 at node 2; clocks are asynchronous.
Each membership entry holds: Address, Heartbeat Counter, Time (local).)
Fig. and animation by: Dongyun Jin and Thuy Nguyen
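A sketch of the merge rule just described, assuming each membership list maps an address to a (heartbeat counter, local time) pair as in the slide's table:

```python
def merge(local_list, received_list, local_now):
    """On receipt of a gossiped list, keep the higher heartbeat counter for
    each address, and stamp updated entries with the *local* clock (clocks
    are asynchronous, so remote timestamps are never copied)."""
    for addr, (heartbeat, _) in received_list.items():
        local_heartbeat = local_list.get(addr, (-1, None))[0]
        if heartbeat > local_heartbeat:
            local_list[addr] = (heartbeat, local_now)
```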

Slide 19: Gossip-Style Failure Detection
If the heartbeat has not increased for more than Tfail seconds, the member is considered failed
And after Tcleanup seconds, it is deleted from the list
Why two different timeouts?

Slide 20: Gossip-Style Failure Detection
What if an entry pointing to a failed node is deleted right after Tfail seconds?
(A gossip from a peer that has not yet expired its own copy would re-add the failed node as if alive.)
Fix: remember for another Tfail
(Figure: four nodes, 1-4; current time is 75 at node 2.)
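A sketch of the two-timeout sweep, under the slide's fix (so Tcleanup is roughly 2*Tfail after the last heartbeat increase); the table layout follows the merge sketch above:

```python
def sweep(members, now, t_fail, t_cleanup):
    """Mark members failed after t_fail of staleness, but only delete them
    after t_cleanup (>= 2 * t_fail), so a stale copy still circulating at a
    neighbor cannot re-add the failed node. Failed entries are kept but no
    longer gossiped."""
    failed = set()
    for addr, (heartbeat, t_updated) in list(members.items()):
        age = now - t_updated
        if age > t_cleanup:
            del members[addr]    # peers' copies have expired too: safe to forget
        elif age > t_fail:
            failed.add(addr)     # considered failed, but remembered
    return failed
```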

Slide 21: Multi-level Gossiping
Network topology is hierarchical
Random gossip target selection => core routers face O(N) load (Why?)
Fix: select the gossip target in subnet i, which contains ni nodes, with probability 1/ni
Router load = O(1)
Dissemination time = O(log(N)). Why?
What about latency for multi-level topologies?
[Gupta et al, TPDS 06]
(Figure: a core router connecting two subnets of N/2 nodes each.)
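One reading of the slide's fix, sketched below: a node picks a target in each foreign subnet i, of size ni, with probability 1/ni, and otherwise gossips locally. With two subnets of N/2 nodes, a gossip then crosses the core router with probability about 2/N, so the router sees O(1) messages per round in expectation. The subnet-map structure is our own:

```python
import random

def pick_target(my_subnet, subnets):
    """subnets: dict mapping subnet id -> list of node addresses."""
    r = random.random()
    for subnet_id, nodes in subnets.items():
        if subnet_id == my_subnet:
            continue
        r -= 1.0 / len(nodes)                 # cross into subnet i w.p. 1/n_i
        if r <= 0:
            return random.choice(nodes)       # cross-subnet gossip (rare)
    return random.choice(subnets[my_subnet])  # local gossip (common case)
```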

Slide 22: Analysis/Discussion
What happens if the gossip period Tgossip is decreased?
A single heartbeat takes O(log(N)) time to propagate. So N heartbeats take:
O(log(N)) time to propagate, if the bandwidth allowed per node is O(N)
O(N.log(N)) time to propagate, if the bandwidth allowed per node is only O(1)
What about O(k) bandwidth?
What happens to Pmistake (the false positive rate) as Tfail and Tcleanup are increased?
Tradeoff: false positive rate vs. detection time
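One way to see the bandwidth trade-off asked about above (our reasoning, not from the slide): with O(N) bandwidth per node, all N heartbeat entries fit into a single gossip message, so every entry rides the same O(log(N))-period epidemic; with O(1) bandwidth, only a constant number of entries fit per message, so the N entries are effectively serialized and take O(N.log(N)) periods; the same argument suggests O((N/k).log(N)) for O(k) bandwidth.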

Slide 23: Simulations
As the number of members increases, the detection time increases
As the requirement is loosened, the detection time decreases
As the number of failed members increases, the detection time increases significantly
The algorithm is resilient to message loss

Slide 24: Failure Detector Properties ...
Completeness
Accuracy
Speed: time to first detection of a failure
Scale: equal load on each member; network message load

Slide 25: ...Are Application-defined Requirements
Completeness: guarantee always
Accuracy: probability PM(T)
Speed: T time units
Scale: equal load on each member; network message load

Slide 26: ...Are Application-defined Requirements
Completeness: guarantee always
Accuracy: probability PM(T)
Speed: T time units
Scale: equal load L on each member; network message load N*L: compare this across protocols

Slide 27: All-to-All Heartbeating
(Figure: pi sends "pi, heartbeat seq. l++" to all members.)
Every T time units
Load L = N/T per member

Slide 28: Gossip-style Heartbeating
(Figure: pi sends an array of heartbeat seq. numbers for a member subset.)
Every tg units (= gossip period), send an O(N)-size gossip message
Detection time T = log(N) * tg
Load L = N/tg = N*log(N)/T

Слайд 29
Worst case load L*
as a function of T, PM(T),

N
Independent Message Loss probability pml

(proof in PODC 01 paper)


What’s the Best/Optimal we can do?

Worst case load L* as a function of T, PM(T), NIndependent Message Loss probability pml
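For reference, the optimal worst-case load derived in that paper is, as we recall (a reconstruction; the formula itself did not survive in this extraction): L* = [log(PM(T)) / log(pml)] . (1/T), which notably is independent of N.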

Slide 30: Heartbeating
Optimal L is independent of N (!)
All-to-all and gossip-based protocols are sub-optimal:
L = O(N/T)
They try to achieve simultaneous detection at all processes
They fail to distinguish the Failure Detection and Dissemination components

Key:
Separate the two components
Use a non-heartbeat-based Failure Detection component

Slide 31: SWIM Failure Detector Protocol
(Figure: the ping-based exchange between pi and a random target pj; see the sketch below.)
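The protocol figure did not survive extraction. Below is a minimal sketch of one SWIM protocol period as described in the SWIM paper, with hypothetical helpers ping(target) and ping_req(relay, target) that return True iff an ack arrives within the period's sub-timeout:

```python
import random

K = 3  # number of indirect-ping relays (protocol parameter)

def protocol_period(pi, members, ping, ping_req):
    """Returns the member to suspect this period, or None."""
    pj = random.choice([m for m in members if m != pi])
    if ping(pj):
        return None            # direct ack: pj is alive
    # Direct ping lost or pj slow: ask K random members to ping pj on our
    # behalf, which masks message loss on the pi <-> pj path.
    others = [m for m in members if m not in (pi, pj)]
    relays = random.sample(others, min(K, len(others)))
    if any(ping_req(r, pj) for r in relays):
        return None            # some relay reached pj: alive
    return pj                  # no ack at all: suspect pj
```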

Slide 32: SWIM versus Heartbeating
(Figure: process load vs. first detection time, for a fixed false positive rate and message loss rate:
SWIM: constant process load, constant first-detection time
Heartbeating: either O(N) process load or O(N) first-detection time.)

Slide 33: SWIM Failure Detector

Slide 34: Accuracy, Load
PM(T) is exponential in -K. It also depends on pml (and pf)
See the paper for up to 15% loss rates

Slide 35: Detection Time
Prob. of being pinged in T' = (see the reconstruction below)
E[T] = (see the reconstruction below)
Completeness: any alive member detects the failure
Eventually
By using a trick: within worst case O(N) protocol periods
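The right-hand sides above were images on the original slide. For reference, the standard expressions from the SWIM paper, which we believe the slide showed (a hedged reconstruction, not taken from this extraction):

```latex
\Pr[\text{pinged within a period } T'] \;=\; 1 - \Bigl(1 - \tfrac{1}{N}\Bigr)^{N-1}
\;\longrightarrow\; 1 - e^{-1} \quad (N \to \infty),
\qquad
E[T] \;=\; T' \cdot \frac{e}{e-1} \;\approx\; 1.58\,T'.
```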

Slide 36: III. Dissemination
HOW?

Slide 37: Dissemination Options
Multicast (hardware / IP): unreliable; multiple simultaneous multicasts
Point-to-point (TCP / UDP): expensive
Zero extra messages: piggyback on Failure Detector messages
Infection-style dissemination

Slide 38: Infection-style Dissemination
(Figure: pj and K random processes; updates ride on the failure detector's protocol messages.)

Slide 39: Infection-style Dissemination
Epidemic-style dissemination
After O(log(N)) protocol periods, the expected number of processes that have not heard about an update is vanishingly small
Maintain a buffer of recently joined/evicted processes
Piggyback from this buffer (see the sketch below)
Prefer recent updates
Buffer elements are garbage-collected after a while
After O(log(N)) protocol periods; this defines weak consistency
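A sketch of the piggyback buffer described on this slide; the structure and the logarithmic per-update gossip budget are illustrative:

```python
class PiggybackBuffer:
    """Buffer of recent membership updates (joins, suspicions, failures),
    piggybacked on the failure detector's ping/ack traffic."""

    def __init__(self, max_periods):
        self.max_periods = max_periods   # ~ c * log(N): gossip budget per update
        self.entries = []                # [update, times_piggybacked] pairs

    def add(self, update):
        self.entries.append([update, 0])

    def next_payload(self, room):
        # Prefer the least-gossiped (i.e., most recent) updates, per the slide.
        self.entries.sort(key=lambda entry: entry[1])
        chosen = self.entries[:room]
        for entry in chosen:
            entry[1] += 1
        payload = [update for update, _ in chosen]
        # Garbage-collect entries that have used up their gossip budget;
        # this is what makes the membership view only weakly consistent.
        self.entries = [e for e in self.entries if e[1] < self.max_periods]
        return payload
```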

Slide 40: Suspicion Mechanism
False detections, due to:
Perturbed processes
Packet losses, e.g., from congestion
Indirect pinging may not solve the problem
e.g., correlated message losses near the pinged host
Key: suspect a process before declaring it failed in the group

Slide 41: Suspicion Mechanism
pi's state machine for a pj view element:
Alive -> Suspected: FD:: pi's ping fails, or Dissmn::(Suspect pj) received
Suspected -> Alive: FD:: pi's ping succeeds, or Dissmn::(Alive pj) received
Suspected -> Failed: timeout, or Dissmn::(Failed pj) received
Each transition is itself disseminated: Dissmn (Suspect pj), Dissmn (Alive pj), Dissmn (Failed pj)
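A sketch of the per-member state machine above, with event names of our own choosing:

```python
ALIVE, SUSPECTED, FAILED = "alive", "suspected", "failed"

def step(state, event):
    """Events: local FD outcomes ('ping_failed', 'ping_success', 'timeout')
    and received dissemination messages ('suspect', 'alive', 'failed')."""
    if state == ALIVE and event in ("ping_failed", "suspect"):
        return SUSPECTED      # start suspecting; disseminate (Suspect pj)
    if state == SUSPECTED:
        if event in ("ping_success", "alive"):
            return ALIVE      # suspicion refuted; disseminate (Alive pj)
        if event in ("timeout", "failed"):
            return FAILED     # declare failure; disseminate (Failed pj)
    return state              # FAILED is terminal
```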

Slide 42: Suspicion Mechanism
Distinguish multiple suspicions of a process
Per-process incarnation number
Inc# for pi can be incremented only by pi
e.g., when it receives a (Suspect, pi) message
Somewhat similar to DSDV
Higher-inc# notifications override lower inc#'s
Within an inc#: (Suspect, inc#) > (Alive, inc#)
Nothing overrides a (Failed, inc#)
See paper
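A sketch of the override rule, combining the three bullets above (the representation is our own):

```python
RANK = {"alive": 0, "suspect": 1, "failed": 2}

def overrides(new, old):
    """new and old are (kind, incarnation) pairs about the same process.
    Returns True iff `new` should replace `old` in the local view."""
    if old[0] == "failed":
        return False                # nothing overrides a (Failed, inc#)
    if new[0] == "failed":
        return True
    if new[1] != old[1]:
        return new[1] > old[1]      # higher inc# notifications win
    return RANK[new[0]] > RANK[old[0]]   # same inc#: Suspect > Alive
```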

Slide 43: Time-bounded Completeness
Key: select each membership element once as a ping target in a traversal
Round-robin pinging (see the sketch below)
Random permutation of the list after each traversal
Each failure is detected in worst case 2N-1 (local) protocol periods
Preserves FD properties
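A sketch of the round-robin target selection that yields the 2N-1 bound (the class name is illustrative):

```python
import random

class PingOrder:
    def __init__(self, members):
        self.order = list(members)
        random.shuffle(self.order)
        self.index = 0

    def next_target(self):
        # Traverse the whole list before re-randomizing, so every member is
        # selected exactly once per traversal; a process that fails right
        # after being pinged waits out the rest of this traversal plus the
        # next one, hence the worst case of 2N-1 local protocol periods.
        if self.index == len(self.order):
            random.shuffle(self.order)   # new random permutation per traversal
            self.index = 0
        target = self.order[self.index]
        self.index += 1
        return target
```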

Slide 44: Results from an Implementation
Current implementation:
Win2K, uses Winsock 2
Uses only UDP messaging
900 semicolons of code (including testing)
Experimental platform:
Galaxy cluster: a diverse collection of commodity PCs
100 Mbps Ethernet
Default protocol settings:
Protocol period = 2 s; K = 1; G.C. and suspicion timeouts = 3*ceil[log(N+1)]
No partial membership lists observed in experiments

Slide 45: Per-process send and receive loads are independent of group size

Slide 46: Time to First Detection of a Process Failure
(Plot: the T1 component of the total T1+T2+T3.)

Slide 47: Time to First Detection of a Process Failure
T1 is apparently uncorrelated to group size
(Plot: the T1 component of the total T1+T2+T3.)

Slide 48: Membership Update Dissemination Time
T2 is low at high group sizes
(Plot: the T2 component of the total T1+T2+T3.)

Slide 49: Excess Time Taken by the Suspicion Mechanism
(Plot: the T3 component of the total T1+T2+T3.)

Slide 50: Benefit of the Suspicion Mechanism
(Plot: per-process 10% synthetic packet loss.)

Slide 51: More Discussion Points
It turns out that with a partial membership list that is uniformly random, gossiping retains the same properties as with complete membership lists
Why? (Think of the equation)
Partial membership protocols: SCAMP, Cyclon, T-MAN, ...
Gossip-style failure detection underlies:
Astrolabe
Amazon EC2/S3 (rumored!)
SWIM used in:
CoralCDN/Oasis anycast service: http://oasis.coralcdn.org
Mike Freedman used the suspicion mechanism to blackmark frequently-failing nodes

Slide 52: Reminder: Due this Sunday, April 3rd, at 11.59 PM
Project Midterm Report due, 11.59 pm [12pt font, single-sided, 8 + 1 page Business Plan max]
Wiki Term Paper: Second Draft due (individual)
Reviews: you only have to submit reviews for 15 sessions (any 15 sessions) from 2/10 to 4/28. Keep track of your count! Take a breather!

Slide 53: Questions
