Slide 1: Achieving Strong Scaling on Blue Gene/L: Case Study with NAMD
Sameer Kumar, Blue Gene Software Group, IBM T J Watson Research Center, Yorktown Heights, NY
sameerk@us.ibm.com
Slide 2: Outline
Motivation
NAMD and Charm++
BGL Techniques
Problem mapping
Overlap of communication with computation
Grain size
Load-balancing
Communication optimizations
Summary
Slide 4: Blue Gene/L
Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
Node Card: 32 chips (4x4x2), 16 compute cards, 0-2 I/O cards, 90/180 GF/s, 16 GB
Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
Slide 5: Application Scaling
Weak scaling
Problem size increases with the number of processors
Strong scaling
Constant problem size
Linear to sub-linear decrease in computation time with processors
Cache performance
Communication overhead
Communication-to-computation ratio
Slide 6: Scaling on Blue Gene/L
Several applications have demonstrated weak scaling
Strong scaling on a large number of benchmarks still needs to be achieved
Slide 8: NAMD: A Production MD program
NAMD
Fully featured program
NIH-funded development
Distributed free of charge (thousands of downloads so far)
Binaries and source code
Installed at NSF centers
User training and support
Large published simulations (e.g., aquaporin simulation featured in keynote)
Slide 9: NAMD, CHARMM27, PME
NpT ensemble at 310 or 298 K
1 ns equilibration, 4 ns production
Protein: ~15,000 atoms
Lipids (POPE): ~40,000 atoms
Water: ~51,000 atoms
Total: ~106,000 atoms
3.5 days/ns on 128 O2000 CPUs
11 days/ns on 32 Linux CPUs
0.35 days/ns on 512 LeMieux CPUs
Aquaporin simulation
F. Zhu, E.T., K. Schulten, FEBS Lett. 504, 212 (2001)
M. Jensen, E.T., K. Schulten, Structure 9, 1083 (2001)
Slide 10: Molecular Dynamics in NAMD
Collection of [charged] atoms, with bonds
Newtonian mechanics
Thousands of atoms (10,000 - 500,000)
At each timestep (a minimal sketch follows below)
Calculate forces on each atom
Bonded forces
Non-bonded: electrostatic and van der Waals
Short-range: every timestep
Long-range: using PME (3D FFT)
Multiple time stepping: PME every 4 timesteps
Calculate velocities and advance positions
Challenge: femtosecond timestep, millions of steps needed!
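A minimal sketch of one such timestep with a simple kick-drift update, assuming illustrative names (Atom, computeBondedForces, and so on); this is not NAMD source code, only the structure described above:

#include <vector>

struct Vec3 { double x, y, z; };
struct Atom { Vec3 pos, vel, force; double invMass; };

const int    PME_PERIOD = 4;   // multiple time stepping: PME every 4 steps
const double DT = 1.0;         // timestep in femtoseconds (illustrative units)

// Placeholder force kernels (the real ones are the bulk of the work).
void computeBondedForces(std::vector<Atom>& atoms)       { /* bonds, angles, ... */ }
void computeShortRangeNonbonded(std::vector<Atom>& atoms) { /* within the cutoff */ }
void computeLongRangePME(std::vector<Atom>& atoms)        { /* 3D-FFT based */ }

void mdStep(std::vector<Atom>& atoms, int step) {
    for (Atom& a : atoms) a.force = {0.0, 0.0, 0.0};
    computeBondedForces(atoms);
    computeShortRangeNonbonded(atoms);        // every timestep
    if (step % PME_PERIOD == 0)
        computeLongRangePME(atoms);           // only every 4th timestep
    for (Atom& a : atoms) {                   // advance velocities, then positions
        a.vel.x += DT * a.force.x * a.invMass;
        a.vel.y += DT * a.force.y * a.invMass;
        a.vel.z += DT * a.force.z * a.invMass;
        a.pos.x += DT * a.vel.x;
        a.pos.y += DT * a.vel.y;
        a.pos.z += DT * a.vel.z;
    }
}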
Slide 11: NAMD Benchmarks
BPTI: 3K atoms
Estrogen Receptor: 36K atoms (1996)
ATP Synthase: 327K atoms (2001)
Slide 12: Parallel MD: Easy or Hard?
Easy
Tiny working data
Spatial locality
Uniform atom density
Persistent repetition
Multiple time-stepping
Hard
Sequential timesteps
Very short iteration time
Full electrostatics
Fixed problem size
Dynamic variations
Slide 13: NAMD Computation
Application data divided into data objects called patches
Sub-grids determined by the cutoff distance
Computation performed by migratable computes
13 computes per patch pair, and hence much more parallelism (see the enumeration sketch below)
Computes can be further split to increase parallelism
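A hedged sketch (not NAMD source) of one way the pairwise computes around a patch can be counted: keeping only the 13 lexicographically positive neighbor offsets assigns each neighboring patch pair to exactly one compute, with a separate self compute for interactions inside the patch.

#include <vector>

struct Offset { int dx, dy, dz; };

// The 13 canonical neighbor offsets: exactly half of the 26 neighbors,
// chosen so each neighboring patch pair is enumerated once.
std::vector<Offset> canonicalNeighborOffsets() {
    std::vector<Offset> offs;
    for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
                if (dx == 0 && dy == 0 && dz == 0) continue;   // self compute handled separately
                if (dx > 0 || (dx == 0 && (dy > 0 || (dy == 0 && dz > 0))))
                    offs.push_back({dx, dy, dz});              // keep one of each (+/-) pair
            }
    return offs;   // offs.size() == 13
}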
Slide 14: NAMD
Scalable molecular dynamics simulation
Two types of objects, patches and computes, to expose more parallelism
Requires more careful load balancing
Slide 15: Communication-to-Computation Ratio
Scalable: constant with the number of processors
In practice it grows at a very small rate (a rough argument follows below)
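A hedged back-of-envelope for why the ratio can stay roughly constant under the patch/compute decomposition (my paraphrase, not text from the slides): patch size is fixed by the cutoff, so each pairwise compute consumes a fixed amount of patch data d and a fixed amount of work t_c, and spreading the C computes over P processors scales both sides by the same factor:

T_\text{comp}(P) \approx (C/P)\, t_c, \qquad V_\text{comm}(P) \approx (C/P)\, d, \qquad V_\text{comm}/T_\text{comp} \approx d/t_c \ \ (\text{independent of } P)

In practice, proxy and multicast overheads add a slowly growing term, which is the small growth noted above.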
Slide 16: Charm++ and Converse
Charm++: object-based, asynchronous, message-driven parallel programming paradigm
Converse: communication layer for Charm++
Send, recv, progress at the node level
User view
Slide 18: Single Processor Performance
Worked with IBM Toronto for 3 weeks
Inner loops slightly altered to enable software pipelining
Aliasing issues resolved through the use of #pragma disjoint (*ptr1, *ptr2)
40% serial speedup
Current best performance is with the 440
Continued efforts with Toronto to get good 440d performance
Slide 19: NAMD on BGL
Advantages
Both application and hardware are 3D grids
Large 4 MB L3 cache
On a large number of processors NAMD will run from L3
Higher bandwidth for short messages
Midpoint of peak bandwidth achieved quickly
Six outgoing links from each node
No OS daemons
Slide 20: NAMD on BGL
Disadvantages
Slow embedded CPU
Small memory per node
Low bisection bandwidth
Hard to scale full electrostatics
Limited support for overlap of computation and communication
No cache coherence
Slide 21: BGL Parallelization
Topology-driven problem mapping
Load-balancing schemes
Overlap of computation and communication
Communication optimizations
Slide 22: Problem Mapping
(Figure: the 3D application data space, with axes X, Y, Z, and the 3D processor grid.)
Slide 23: Problem Mapping
(Figure: application data space and processor grid, continued from the previous slide.)
Slide 24: Problem Mapping
(Figure: application data space; a mapping sketch follows below.)
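A hedged sketch of topology-driven mapping (illustrative names, not the NAMD code): scale each patch coordinate into torus coordinates so that neighboring patches land on nearby nodes.

// Patch (px,py,pz) in a PX x PY x PZ patch grid is placed on torus node
// (tx,ty,tz) in a TX x TY x TZ torus by proportional scaling, so spatial
// neighbors map to torus neighbors.
struct Coord3 { int x, y, z; };

Coord3 mapPatchToTorus(Coord3 patch, Coord3 patchDims, Coord3 torusDims) {
    Coord3 node;
    node.x = (patch.x * torusDims.x) / patchDims.x;
    node.y = (patch.y * torusDims.y) / patchDims.y;
    node.z = (patch.z * torusDims.z) / patchDims.z;
    return node;
}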
Slide 26: Two-Away Computation
Each data object (patch) is split along a dimension
Patches now interact with neighbors of neighbors
Makes the application more fine-grained
Improves load balancing
Messages of smaller size are sent to more processors
Improves torus bandwidth utilization (a sketch follows below)
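A hedged sketch of the interaction rule after a two-away split along X, assuming patches indexed on a 3D grid and ignoring periodic wraparound; names are illustrative, not NAMD source:

#include <cstdlib>

// After halving patches along X, a patch is half a cutoff wide in X, so it
// must exchange data with patches up to two grid cells away in X while the
// Y and Z ranges stay at one.
bool patchesInteractTwoAwayX(int ax, int ay, int az, int bx, int by, int bz) {
    return std::abs(ax - bx) <= 2 &&
           std::abs(ay - by) <= 1 &&
           std::abs(az - bz) <= 1;
}

With two-away X, each patch therefore talks to up to 5 x 3 x 3 - 1 = 44 neighbors instead of 26, with each message roughly half the size.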
Slide 28: Load Balancing Steps
Regular Timesteps
Instrumented Timesteps
Detailed, aggressive Load Balancing
Refinement Load Balancing
Slide 29: Load-balancing Metrics
Balancing load
Minimizing communication hop-bytes
Place computes close to patches
Biased through placement of proxies on near neighbors
Minimizing the number of proxies
Affects the connectivity of each data object
(A hop-bytes sketch follows below.)
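A hedged sketch of the hop-bytes metric on a 3D torus (illustrative names, not the Charm++ load balancer): each message contributes its size times its Manhattan hop distance, with wraparound links taken into account, and the balancer tries to keep the total small.

#include <cstdlib>
#include <algorithm>
#include <vector>

// Shortest hop count between two coordinates on a ring of the given size.
static int ringHops(int a, int b, int size) {
    int d = std::abs(a - b);
    return std::min(d, size - d);       // the wraparound link may be shorter
}

struct Msg { int sx, sy, sz, dx, dy, dz; long bytes; };

// Total hop-bytes of a communication pattern on an X x Y x Z torus.
long totalHopBytes(const std::vector<Msg>& msgs, int X, int Y, int Z) {
    long total = 0;
    for (const Msg& m : msgs) {
        int hops = ringHops(m.sx, m.dx, X)
                 + ringHops(m.sy, m.dy, Y)
                 + ringHops(m.sz, m.dz, Z);
        total += (long)hops * m.bytes;
    }
    return total;
}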
Slide 30: Overlap of Computation and Communication
Each FIFO has 4 packet buffers
The progress engine should be called every 4400 cycles
Overhead of about 200 cycles per call, roughly a 5% increase in computation (200/4400 ≈ 4.5%)
The remaining time can be used for computation
Slide 31: Network Progress Calls
NAMD makes progress-engine calls from the compute loops
Typically called every 10,000 cycles, dynamically tunable

for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    CmiNetworkProgress();              // pump the network inside the compute loop
    const CompAtom &p_i = p_0[i];
    // Compute pairlists
    for (k = 0; k < npairs; ++k) {     // npairs: pairlist length (loop bound truncated on the original slide)
        // Compute forces
    }
}

void CmiNetworkProgress() {
    new_time = rts_get_timebase();          // read the cycle counter
    if (new_time < lastProgress + PERIOD)   // called again too soon: do nothing yet
        return;
    lastProgress = new_time;
    AdvanceCommunication();                 // advance the message layer / torus FIFOs
}
Slide 32: MPI Scalability
Charm++ MPI driver
Iprobe-based implementation
Higher progress overhead of MPI_Test
Statically pinned FIFOs for point-to-point communication
Slide 33: Charm++ Native Driver
BGX message layer (developed by George Almasi)
Lower progress overhead
Active messages
Easy to design complex communication protocols
Dynamic FIFO mapping
Low-overhead remote memory access
Interrupts
The Charm++ BGX driver was developed by Chao Huang over this summer
Slide 34: BG/L Msglayer
(Figure: BGX message-layer architecture: an advance loop serving the scratchpad, torus, and collective message queues; torus injection and reception FIFOs (x+, x-, y+, y-, z+, z-, plus a high-priority FIFO); the collective-network FIFO; network FIFO pinning; the torus packet registry and packet dispatching; and message templates such as TorusDirectMessage.)
(This slide is taken from G. Almási's talk on the "new" msglayer.)
Slide 35: Optimized Multicast
pinFifo algorithms
Decide which of the 6 FIFOs to use when sending a message to {x,y,z,t}
Cones, chessboard
Dynamic FIFO mapping
A special send queue from which a message can go out on whichever FIFO is not full (a sketch follows below)
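A hedged sketch of one plausible pinFifo rule plus the dynamic-FIFO fallback (not the actual BGX algorithm; the function names and the hasRoom availability flags are illustrative): prefer the injection FIFO matching the axis with the largest remaining distance along the shortest torus path, and fall back to any FIFO that still has free packet slots.

#include <cstdlib>

enum Fifo { XPLUS, XMINUS, YPLUS, YMINUS, ZPLUS, ZMINUS, NUM_FIFOS };

// Signed shortest displacement from a to b on a ring of the given size.
static int ringDelta(int a, int b, int size) {
    int d = (b - a + size) % size;
    return (d <= size / 2) ? d : d - size;
}

// Choose an injection FIFO for a packet from (sx,sy,sz) to (dx,dy,dz) on an
// X x Y x Z torus.  hasRoom[f] says whether FIFO f has free packet slots.
int pinFifo(int sx, int sy, int sz, int dx, int dy, int dz,
            int X, int Y, int Z, const bool hasRoom[NUM_FIFOS]) {
    int ddx = ringDelta(sx, dx, X);
    int ddy = ringDelta(sy, dy, Y);
    int ddz = ringDelta(sz, dz, Z);

    int preferred;
    if (std::abs(ddx) >= std::abs(ddy) && std::abs(ddx) >= std::abs(ddz))
        preferred = (ddx >= 0) ? XPLUS : XMINUS;
    else if (std::abs(ddy) >= std::abs(ddz))
        preferred = (ddy >= 0) ? YPLUS : YMINUS;
    else
        preferred = (ddz >= 0) ? ZPLUS : ZMINUS;

    if (hasRoom[preferred]) return preferred;
    for (int f = 0; f < NUM_FIFOS; ++f)      // dynamic FIFO mapping: any free FIFO
        if (hasRoom[f]) return f;
    return preferred;                        // all full; caller must retry later
}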
Slide 36: Communication Pattern in PME
(Figure: PME communication patterns across 108 processors.)
Slide 37: PME
Plane decomposition for the 3D FFT
PME objects placed close to patch objects on the torus
PME optimized through an asynchronous all-to-all with dynamic FIFO mapping (a decomposition sketch follows below)
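A hedged sketch of the plane decomposition (illustrative names, not the NAMD PME code): each processor owns a contiguous slab of z-planes of the K1 x K2 x K3 charge grid, performs 2D FFTs on its planes, and an all-to-all transpose then regroups the data so the remaining 1D FFTs can be done locally.

// Which z-planes of a K1 x K2 x K3 PME grid does processor p own, out of P?
struct PlaneRange { int begin, end; };   // planes [begin, end)

PlaneRange planesOnProc(int p, int P, int K3) {
    int base = K3 / P, extra = K3 % P;                  // spread the remainder evenly
    int begin = p * base + (p < extra ? p : extra);
    int end   = begin + base + (p < extra ? 1 : 0);
    return {begin, end};
}

// After the local 2D FFT of each owned plane, an asynchronous all-to-all
// transpose sends plane fragments so that each processor ends up with full
// lines along z, on which the final 1D FFTs are performed.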
Slide 39: BGX Message Layer vs MPI
NAMD co-processor mode performance (ms/step)
Here the message layer uses sender-side blocking communication
APoA1 benchmark
The fully non-blocking version performed below par on MPI
Polling overhead is high for a list of posted receives
The BGX message layer works well with asynchronous communication
Slide 40: Blocking vs Overlap
APoA1 benchmark in co-processor mode
Slide 41: Effect of Network Progress
(Projections timeline of a 1024-node run without aggressive network progress)
Network progress is not aggressive enough: communication gaps eat up utilization
Slide 42: Effect of Network Progress (2)
(Projections timeline of a 1024-node run with aggressive network progress)
More frequent advance calls close the gaps
Slide 43: Virtual Node Mode
(Plot: APoA1 step time with PME; step time in ms vs. number of processors.)
Slide 44: Spring vs Now
(Plot: APoA1 step time with PME; step time in ms vs. number of processors.)
Slide 46: Summary
Demonstrated good scaling to 4K processors for APoA1, with a speedup of 2100
Still working on 8K results
ATPase scales well to 8K processors with a speedup of 4000+
Slide 47: Lessons Learnt
Eager messages lead to contention
Rendezvous messages don't perform well with mid-size messages
Topology optimizations are a big winner
Overlap of computation and communication is possible
Overlap, however, makes the compute load less predictable
The lack of operating-system daemons enables massive scaling
Slide 48: Future Plans
Experiment with new communication protocols
Remote memory access
Adaptive eager
Fast asynchronous collectives
Improve load balancing
Newer distributed strategies
Heavy processors dynamically offload work to neighbors
Pencil decomposition for PME
Using the double hummer