NAMD-BluegeneL


Slide 1: Achieving Strong Scaling on Blue Gene/L: Case Study with NAMD
Sameer Kumar
Blue Gene Software Group, IBM T. J. Watson Research Center, Yorktown Heights, NY
sameerk@us.ibm.com

Slide 2: Outline
Motivation
NAMD and Charm++
BGL techniques:
Problem mapping
Overlap of communication with computation
Grain size
Load-balancing
Communication optimizations
Summary


Slide 3: Blue Gene/L


Slide 4: Blue Gene/L
Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
Node Card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2), 90/180 GF/s, 16 GB
Rack: 32 Node Cards, 2.8/5.6 TF/s, 512 GB
System: 64 Racks (64x32x32), 180/360 TF/s, 32 TB


Slide 5: Application Scaling
Weak scaling: problem size increases with the number of processors
Strong scaling: constant problem size
Linear to sub-linear decrease in computation time with processors
Cache performance
Communication overhead
Communication-to-computation ratio
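For reference, the strong-scaling figures quoted later in the talk (speedups of 2100 on 4k and 4000+ on 8k processors) use the standard definitions, stated here in LaTeX:

S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}

where T(p) is the time per step on p processors with the problem size held fixed; weak scaling instead grows the problem with p and tracks whether T(p) stays roughly constant.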

Slide 6: Scaling on Blue Gene/L
Several applications have demonstrated weak scaling
Strong scaling on a large number of benchmarks still needs to be achieved


Slide 7: NAMD and Charm++


Slide 8: NAMD: A Production MD Program
Fully featured program
NIH-funded development
Distributed free of charge (thousands of downloads so far)
Binaries and source code
Installed at NSF centers
User training and support
Large published simulations (e.g., the aquaporin simulation featured in the keynote)

Slide 9: NAMD, CHARMM27, PME
Aquaporin simulation
NpT ensemble at 310 or 298 K
1 ns equilibration, 4 ns production
Protein: ~15,000 atoms
Lipids (POPE): ~40,000 atoms
Water: ~51,000 atoms
Total: ~106,000 atoms
3.5 days/ns - 128 O2000 CPUs
11 days/ns - 32 Linux CPUs
0.35 days/ns - 512 LeMieux CPUs

F. Zhu, E.T., K. Schulten, FEBS Lett. 504, 212 (2001)
M. Jensen, E.T., K. Schulten, Structure 9, 1083 (2001)


Slide 10: Molecular Dynamics in NAMD
Collection of [charged] atoms, with bonds
Newtonian mechanics
Thousands of atoms (10,000 - 500,000)
At each time-step:
Calculate forces on each atom
Bonds
Non-bonded: electrostatic and van der Waals
Short-distance: every timestep
Long-distance: using PME (3D FFT)
Multiple time stepping: PME every 4 timesteps
Calculate velocities and advance positions
Challenge: femtosecond time-step, millions needed!
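A minimal sketch of the time-step loop described on this slide, with multiple time stepping (PME only every fourth step). The Atom struct and the compute* functions are hypothetical placeholders for illustration, not NAMD code:

#include <vector>

struct Atom { double pos[3], vel[3], force[3], mass; };

// Placeholder force terms; the real evaluations are omitted in this sketch.
void computeBonded(std::vector<Atom>&) {}      // bonds, angles, dihedrals
void computeShortRange(std::vector<Atom>&) {}  // cutoff electrostatics + van der Waals
void computePME(std::vector<Atom>&) {}         // long-range electrostatics via 3D FFT

void runMD(std::vector<Atom>& atoms, int numSteps, double dt) {
  const int pmePeriod = 4;                     // PME every 4 timesteps
  for (int step = 0; step < numSteps; ++step) {
    for (Atom& a : atoms)                      // clear force accumulators
      a.force[0] = a.force[1] = a.force[2] = 0.0;
    computeBonded(atoms);
    computeShortRange(atoms);                  // every timestep
    if (step % pmePeriod == 0)
      computePME(atoms);                       // multiple time stepping
    for (Atom& a : atoms)                      // advance velocities and positions
      for (int d = 0; d < 3; ++d) {
        a.vel[d] += dt * a.force[d] / a.mass;
        a.pos[d] += dt * a.vel[d];
      }
  }
}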

Slide 11: NAMD Benchmarks
BPTI: 3K atoms
Estrogen Receptor: 36K atoms (1996)
ATP Synthase: 327K atoms (2001)


Slide 12: Parallel MD: Easy or Hard?
Easy:
Tiny working data
Spatial locality
Uniform atom density
Persistent repetition
Multiple time-stepping
Hard:
Sequential timesteps
Very short iteration time
Full electrostatics
Fixed problem size
Dynamic variations


Slide 13: NAMD Computation
Application data divided into data objects called patches
Sub-grids determined by the cutoff
Computation performed by migratable computes
13 computes per patch pair, and hence much more parallelism
Computes can be further split to increase parallelism

Slide 14: NAMD
Scalable molecular dynamics simulation
2 types of objects, patches and computes, to expose more parallelism
Requires more careful load balancing

Slide 15: Communication to Computation Ratio
Scalable: constant with the number of processors
In practice it grows at a very small rate


Slide 16: Charm++ and Converse
Charm++: object-based, asynchronous, message-driven parallel programming paradigm
Converse: communication layer for Charm++
Send, recv, progress, at the node level (see the sketch below)
User view

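To make the node-level send/recv/progress interface concrete, here is a minimal Converse-style ping example. It sticks to commonly documented Converse calls (ConverseInit, CmiRegisterHandler, CmiSetHandler, CmiSyncSendAndFree), but it is an illustrative sketch of the idiom, not code from this talk, and the exact usage should be checked against the Converse manual:

#include "converse.h"

// Message layout: Converse header followed by the user payload.
struct PingMsg {
  char header[CmiMsgHeaderSizeBytes];
  int  sourcePe;
};

static int pingHandlerIdx;                       // handler index, same on every PE

// Invoked by the Converse scheduler when a PingMsg arrives.
static void pingHandler(void* vmsg) {
  PingMsg* msg = static_cast<PingMsg*>(vmsg);
  CmiPrintf("PE %d got ping from PE %d\n", CmiMyPe(), msg->sourcePe);
  CmiFree(msg);
}

static void initCode(int argc, char** argv) {    // runs on every PE
  pingHandlerIdx = CmiRegisterHandler((CmiHandler)pingHandler);
  if (CmiMyPe() == 0 && CmiNumPes() > 1) {       // PE 0 pings PE 1
    PingMsg* msg = (PingMsg*)CmiAlloc(sizeof(PingMsg));
    msg->sourcePe = CmiMyPe();
    CmiSetHandler(msg, pingHandlerIdx);
    CmiSyncSendAndFree(1, sizeof(PingMsg), (char*)msg);
  }
}

int main(int argc, char** argv) {
  ConverseInit(argc, argv, initCode, 0, 0);      // enters the scheduler loop
  return 0;
}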

Slide 17: Optimizing NAMD on Blue Gene/L


Slide 18: Single Processor Performance
Worked with IBM Toronto for 3 weeks
Inner loops slightly altered to enable software pipelining
Aliasing issues resolved through the use of #pragma disjoint(*ptr1, *ptr2) (example below)
40% serial speedup
Current best performance is with the 440
Continued efforts with Toronto to get good 440d performance
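A small illustration of the aliasing hint mentioned above. #pragma disjoint is the IBM XL C/C++ directive that promises the listed pointers never refer to the same storage, which frees the compiler to software-pipeline the loop; the loop itself is a made-up example, not NAMD's inner loop:

// Hypothetical inner loop: with the pragma, the XL compiler may keep values
// in registers and overlap iterations, since *forces and *coords are
// promised not to alias each other.
void scaleAdd(double* forces, const double* coords, int n, double k) {
#pragma disjoint(*forces, *coords)
  for (int i = 0; i < n; ++i)
    forces[i] += k * coords[i];
}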

Slide 19: NAMD on BGL
Advantages:
Both the application and the hardware are 3D grids
Large 4 MB L3 cache
On a large number of processors NAMD will run from L3
Higher bandwidth for short messages
Midpoint of peak bandwidth is achieved quickly
Six outgoing links from each node
No OS daemons

Slide 20: NAMD on BGL
Disadvantages:
Slow embedded CPU
Small memory per node
Low bisection bandwidth
Hard to scale full electrostatics
Limited support for overlap of computation and communication
No cache coherence

Slide 21: BGL Parallelization
Topology-driven problem mapping
Load-balancing schemes
Overlap of computation and communication
Communication optimizations

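The next few slides show topology-driven mapping as diagrams. As a rough sketch of the idea (my own illustration, not the mapping algorithm actually used in NAMD), each axis of the application's 3D patch grid can be scaled proportionally onto the corresponding axis of the torus, so nearby patches land on nearby nodes:

#include <array>

// Hypothetical mapping sketch: proportionally stretch/shrink the patch grid
// (PX x PY x PZ patches) onto the torus (TX x TY x TZ nodes), axis by axis.
std::array<int, 3> mapPatchToNode(std::array<int, 3> patch,
                                  std::array<int, 3> patchDim,
                                  std::array<int, 3> torusDim) {
  std::array<int, 3> node;
  for (int d = 0; d < 3; ++d)
    node[d] = static_cast<int>((static_cast<long long>(patch[d]) * torusDim[d])
                               / patchDim[d]);   // floor of proportional scaling
  return node;
}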

Slide 22: Problem Mapping
(Diagram: the X, Y, Z axes of the application data space mapped onto the X, Y, Z axes of the processor grid.)


Slide 23: Problem Mapping
(Diagram: application data space X, Y, Z mapped onto processor grid X, Y, Z.)


Slide 24: Problem Mapping
(Diagram: application data space.)


Slide 25: Problem Mapping


Slide 26: Two-Away Computation
Each data object (patch) is split along a dimension
Patches now interact with neighbors of neighbors (see the sketch below)
Makes the application more fine-grained
Improves load balancing
Messages of smaller size are sent to more processors
Improves torus bandwidth
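A small illustration of the bookkeeping this implies (my own sketch, not NAMD code): once patches are halved along X, the cutoff spans two patch widths in X, so a patch interacts with offsets up to two away along X and one away along Y and Z:

#include <array>
#include <vector>

// Hypothetical enumeration of the neighbor offsets one patch interacts with.
// One-away: extent 1 in every dimension (27 offsets including self).
// Two-away X: extent 2 along X only (45 offsets including self).
std::vector<std::array<int, 3>> interactionOffsets(bool twoAwayX) {
  const int ex = twoAwayX ? 2 : 1;            // extent along X
  std::vector<std::array<int, 3>> offsets;
  for (int dx = -ex; dx <= ex; ++dx)
    for (int dy = -1; dy <= 1; ++dy)
      for (int dz = -1; dz <= 1; ++dz)
        offsets.push_back({dx, dy, dz});      // {0,0,0} is the self-compute
  return offsets;
}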

Slide 27: Two-Away X
(Diagram: patches split along the X dimension.)

Slide 28: Load Balancing Steps
Regular timesteps
Instrumented timesteps
Detailed, aggressive load balancing
Refinement load balancing

Slide 29: Load-Balancing Metrics
Balancing load
Minimizing communication hop-bytes (see the sketch below)
Place computes close to patches
Biased through placement of proxies on near neighbors
Minimizing the number of proxies
Affects the connectivity of each data object
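Hop-bytes weight each message's size by the number of torus hops it travels, so minimizing them pulls computes toward their patches. A minimal sketch of the metric for one message on a 3D torus (my own illustration; the types and dimension layout are assumptions):

#include <cstdlib>
#include <algorithm>

// Shortest hop count between two coordinates on one wrap-around (torus) axis.
static int torusHops(int a, int b, int dimSize) {
  int direct = std::abs(a - b);
  return std::min(direct, dimSize - direct);
}

// Hop-bytes for a message of `bytes` bytes between nodes src and dst on a
// TX x TY x TZ torus: message size weighted by the Manhattan hop distance.
long long hopBytes(const int src[3], const int dst[3],
                   const int torusDim[3], long long bytes) {
  int hops = 0;
  for (int d = 0; d < 3; ++d)
    hops += torusHops(src[d], dst[d], torusDim[d]);
  return hops * bytes;
}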

Slide 30: Overlap of Computation and Communication
Each FIFO has 4 packet buffers
The progress engine should therefore be called about every 4400 cycles
Each call has an overhead of about 200 cycles, roughly a 5% increase in computation (200/4400 ≈ 4.5%)
The remaining time can be used for computation

Slide 31: Network Progress Calls
NAMD makes progress-engine calls from the compute loops
Typical frequency is 10000 cycles, dynamically tunable

// Inner compute loop: poll the network periodically while computing forces.
for (i = 0; i < (i_upper SELF(- 1)); ++i) {
  CmiNetworkProgress();                     // keep the torus FIFOs draining
  const CompAtom &p_i = p_0[i];
  // ... compute pairlists ...
  for (k = 0; k < npairs; ++k) {            // npairs: pairlist length (bound elided on the slide)
    // ... compute forces ...
  }
}

void CmiNetworkProgress() {
  new_time = rts_get_timebase();            // read the cycle counter
  if (new_time < lastProgress + PERIOD)
    return;                                 // called again too soon; skip
  lastProgress = new_time;
  AdvanceCommunication();                   // advance the message layer
}


Slide 32: MPI Scalability
Charm++ MPI driver
Iprobe-based implementation
Higher progress overhead of MPI_Test
Statically pinned FIFOs for point-to-point communication

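For reference, this is roughly what an Iprobe-driven progress poll looks like in plain MPI (standard MPI calls; the dispatch step is a placeholder, not the actual Charm++ MPI machine layer):

#include <mpi.h>
#include <vector>

// Poll for any incoming message without blocking; if one is pending,
// receive it and hand it to the runtime. Returns true if a message arrived.
bool pumpMessages() {
  int flag = 0;
  MPI_Status status;
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
  if (!flag) return false;

  int count = 0;
  MPI_Get_count(&status, MPI_BYTE, &count);
  std::vector<char> buffer(count);
  MPI_Recv(buffer.data(), count, MPI_BYTE, status.MPI_SOURCE,
           status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  // dispatchToRuntime(buffer) would go here (hypothetical placeholder).
  return true;
}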

Slide 33: Charm++ Native Driver
BGX message layer (developed by George Almasi)
Lower progress overhead
Active messages
Easy to design complex communication protocols
Dynamic FIFO mapping
Low-overhead remote memory access
Interrupts
The Charm++ BGX driver was developed by Chao Huang over this summer

Slide 34: BG/L Msglayer
(Diagram of the msglayer internals: the advance loop posts messages to per-protocol message queues (scratchpad, torus, collective); packets are pinned to network FIFOs (torus FIFOs x+, x-, y+, y-, z+, z-, H, plus the collective-network FIFO); incoming packets are dispatched through the torus packet registry and the collective packet dispatcher; message templates such as TorusDirectMessage sit on top.)
(This slide is taken from G. Almási's talk on the "new" msglayer.)

Slide 35: Optimized Multicast
pinFifo algorithms
Decide which of the 6 FIFOs to use when sending a message to {x,y,z,t}
Cones, chessboard
Dynamic FIFO mapping
A special send queue from which a message can go out on whichever FIFO is not full
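As a rough illustration of what a pinFifo heuristic can look like (my own sketch under simple assumptions, not the actual Cones or Chessboard algorithm): pick the injection FIFO for the axis and direction along which the destination is farthest away on the torus:

// Injection FIFO indices for the six torus directions (assumed ordering).
enum Fifo { XPLUS, XMINUS, YPLUS, YMINUS, ZPLUS, ZMINUS };

// Hypothetical pinFifo heuristic: choose the FIFO for the axis with the
// largest remaining hop distance, signed according to the shorter way
// around the wrap-around link.
Fifo pinFifo(const int src[3], const int dst[3], const int torusDim[3]) {
  int bestAxis = 0, bestHops = -1, bestSign = +1;
  for (int d = 0; d < 3; ++d) {
    int forward  = (dst[d] - src[d] + torusDim[d]) % torusDim[d];  // hops going +
    int backward = torusDim[d] - forward;                          // hops going -
    int hops = forward <= backward ? forward : backward;
    int sign = forward <= backward ? +1 : -1;
    if (hops > bestHops) { bestHops = hops; bestAxis = d; bestSign = sign; }
  }
  static const Fifo plus[3]  = { XPLUS, YPLUS, ZPLUS };
  static const Fifo minus[3] = { XMINUS, YMINUS, ZMINUS };
  return bestSign > 0 ? plus[bestAxis] : minus[bestAxis];
}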

Slide 36: Communication Pattern in PME
(Diagram: PME communication pattern, 108 processors by 108 processors.)


Slide 37: PME
Plane decomposition for the 3D FFT (sketch below)
PME objects placed close to patch objects on the torus
PME optimized through an asynchronous all-to-all with dynamic FIFO mapping
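A minimal sketch of the plane-decomposition bookkeeping (my own illustration of the general technique, not NAMD's PME code): each of P plane objects owns a contiguous slab of grid planes, performs local 2D FFTs on its slab, and the asynchronous all-to-all then transposes the grid so the remaining 1D FFTs become local:

// Hypothetical plane-decomposition bookkeeping for an N x N x N PME grid
// spread over P plane objects (assumes P <= N): planes are dealt out in
// contiguous slabs, with the remainder spread over the first ranks.
struct Slab { int firstPlane, numPlanes; };

Slab slabForRank(int rank, int N, int P) {
  int base = N / P, extra = N % P;
  int first = rank * base + (rank < extra ? rank : extra);
  int count = base + (rank < extra ? 1 : 0);
  return { first, count };
}

int ownerOfPlane(int z, int N, int P) {
  // Inverse of slabForRank: which plane object holds grid plane z.
  int base = N / P, extra = N % P;
  int boundary = extra * (base + 1);     // planes below this are in the larger slabs
  return z < boundary ? z / (base + 1) : extra + (z - boundary) / base;
}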

Slide 38: Performance Results


Slide 39: BGX Message Layer vs MPI
NAMD co-processor mode performance (ms/step), APoA1 benchmark
The message layer uses sender-side blocking communication here
The fully non-blocking version performed below par on MPI
Polling overhead is high for a list of posted receives
The BGX message layer works well with asynchronous communication

Slide 40: Blocking vs Overlap
APoA1 benchmark in co-processor mode


Slide 41: Effect of Network Progress
(Projections timeline of a 1024-node run without aggressive network progress)
Network progress is not aggressive enough: communication gaps eat up utilization



Slide 42: Effect of Network Progress (2)
(Projections timeline of a 1024-node run with aggressive network progress)
More frequent advance calls close the gaps


Slide 43: Virtual Node Mode
(Plot: APoA1 step time with PME; step time in ms vs. number of processors.)


Slide 44: Spring vs Now
(Plot: APoA1 step time with PME; step time in ms vs. number of processors, spring results vs. current results.)


Slide 45: Summary


Slide 46: Summary
Demonstrated good scaling to 4k processors for APoA1, with a speedup of 2100
Still working on 8k results
ATPase scales well to 8k processors, with a speedup of 4000+

Slide 47: Lessons Learnt
Eager messages lead to contention
Rendezvous messages don't perform well with mid-size messages
Topology optimizations are a big winner
Overlap of computation and communication is possible
Overlap, however, makes the compute load less predictable
The lack of operating-system daemons enables massive scaling

Slide 48: Future Plans
Experiment with new communication protocols:
Remote memory access
Adaptive eager
Fast asynchronous collectives
Improve load-balancing:
Newer distributed strategies
Heavily loaded processors dynamically unload work to neighbors
Pencil decomposition for PME
Using the double hummer (the 440d double FPU)
