Slide 1: Achieving Strong Scaling on Blue Gene/L: Case Study with NAMD
Sameer Kumar, Blue Gene Software Group, IBM T J Watson Research Center, Yorktown Heights, NY
sameerk@us.ibm.com
Slide 2: Outline
Motivation
NAMD and Charm++
BGL Techniques
Problem mapping
Overlap of communication with computation
Grain size
Load-balancing
Communication optimizations
Summary
Slide 4: Blue Gene/L
Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
Node Card: 32 chips (4x4x2), 16 compute cards, 0-2 I/O cards, 90/180 GF/s, 16 GB
Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
Slide 5: Application Scaling
Weak scaling
Problem size increases with the number of processors
Strong scaling
Constant problem size
Linear to sub-linear decrease in computation time with processors
Cache performance
Communication overhead
Communication-to-computation ratio
Slide 6: Scaling on Blue Gene/L
Several applications have demonstrated weak scaling
Strong scaling on a large number of benchmarks still needs to be achieved
Slide 8: NAMD: A Production MD program
NAMD
Fully featured program
NIH-funded development
Distributed free of charge (thousands of downloads so far)
Binaries and source code
Installed at NSF centers
User training and support
Large published simulations (e.g., aquaporin simulation featured in keynote)
Slide 9: NAMD, CHARMM27, PME
NpT ensemble at 310 or 298 K
1 ns equilibration, 4 ns production
Protein: ~15,000 atoms
Lipids (POPE): ~40,000 atoms
Water: ~51,000 atoms
Total: ~106,000 atoms
3.5 days/ns on 128 O2000 CPUs
11 days/ns on 32 Linux CPUs
0.35 days/ns on 512 LeMieux CPUs
Aquaporin simulation
F. Zhu, E.T., K. Schulten, FEBS Lett. 504, 212 (2001)
M. Jensen, E.T., K. Schulten, Structure 9, 1083 (2001)
Slide 10: Molecular Dynamics in NAMD
Collection of [charged] atoms, with bonds
Newtonian mechanics
Thousands of atoms (10,000 - 500,000)
At each timestep (a minimal sketch follows below)
Calculate forces on each atom
Bonded forces
Non-bonded: electrostatic and van der Waals
Short-range: every timestep
Long-range: using PME (3D FFT)
Multiple time stepping: PME every 4 timesteps
Calculate velocities and advance positions
Challenge: femtosecond timestep, millions of steps needed!
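A minimal sketch of one such timestep with a simple kick-drift update, assuming illustrative names (Atom, computeBondedForces, and so on); this is not NAMD source code, only the structure described above:

#include <vector>

struct Vec3 { double x, y, z; };
struct Atom { Vec3 pos, vel, force; double invMass; };

const int    PME_PERIOD = 4;   // multiple time stepping: PME every 4 steps
const double DT = 1.0;         // timestep in femtoseconds (illustrative units)

// Placeholder force kernels (the real ones are the bulk of the work).
void computeBondedForces(std::vector<Atom>& atoms)       { /* bonds, angles, ... */ }
void computeShortRangeNonbonded(std::vector<Atom>& atoms) { /* within the cutoff */ }
void computeLongRangePME(std::vector<Atom>& atoms)        { /* 3D-FFT based */ }

void mdStep(std::vector<Atom>& atoms, int step) {
    for (Atom& a : atoms) a.force = {0.0, 0.0, 0.0};
    computeBondedForces(atoms);
    computeShortRangeNonbonded(atoms);        // every timestep
    if (step % PME_PERIOD == 0)
        computeLongRangePME(atoms);           // only every 4th timestep
    for (Atom& a : atoms) {                   // advance velocities, then positions
        a.vel.x += DT * a.force.x * a.invMass;
        a.vel.y += DT * a.force.y * a.invMass;
        a.vel.z += DT * a.force.z * a.invMass;
        a.pos.x += DT * a.vel.x;
        a.pos.y += DT * a.vel.y;
        a.pos.z += DT * a.vel.z;
    }
}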
Slide 11: NAMD Benchmarks
BPTI: 3K atoms
Estrogen Receptor: 36K atoms (1996)
ATP Synthase: 327K atoms (2001)
Slide 12: Parallel MD: Easy or Hard?
Easy
Tiny working data
Spatial locality
Uniform atom density
Persistent repetition
Multiple time-stepping
Hard
Sequential timesteps
Very short iteration time
Full electrostatics
Fixed problem size
Dynamic variations
Slide 13: NAMD Computation
Application data divided into data objects called patches
Sub-grids determined by the cutoff distance
Computation performed by migratable computes
13 computes per patch pair, and hence much more parallelism (see the enumeration sketch below)
Computes can be further split to increase parallelism
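A hedged sketch (not NAMD source) of one way the pairwise computes around a patch can be counted: keeping only the 13 lexicographically positive neighbor offsets assigns each neighboring patch pair to exactly one compute, with a separate self compute for interactions inside the patch.

#include <vector>

struct Offset { int dx, dy, dz; };

// The 13 canonical neighbor offsets: exactly half of the 26 neighbors,
// chosen so each neighboring patch pair is enumerated once.
std::vector<Offset> canonicalNeighborOffsets() {
    std::vector<Offset> offs;
    for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
                if (dx == 0 && dy == 0 && dz == 0) continue;   // self compute handled separately
                if (dx > 0 || (dx == 0 && (dy > 0 || (dy == 0 && dz > 0))))
                    offs.push_back({dx, dy, dz});              // keep one of each (+/-) pair
            }
    return offs;   // offs.size() == 13
}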
Slide 14: NAMD
Scalable molecular dynamics simulation
Two types of objects, patches and computes, to expose more parallelism
Requires more careful load balancing
Slide 15: Communication-to-Computation Ratio
Scalable: constant with the number of processors
In practice it grows at a very small rate (a rough argument follows below)
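A hedged back-of-envelope for why the ratio can stay roughly constant under the patch/compute decomposition (my paraphrase, not text from the slides): patch size is fixed by the cutoff, so each pairwise compute consumes a fixed amount of patch data d and a fixed amount of work t_c, and spreading the C computes over P processors scales both sides by the same factor:

T_\text{comp}(P) \approx (C/P)\, t_c, \qquad V_\text{comm}(P) \approx (C/P)\, d, \qquad V_\text{comm}/T_\text{comp} \approx d/t_c \ \ (\text{independent of } P)

In practice, proxy and multicast overheads add a slowly growing term, which is the small growth noted above.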
Slide 16: Charm++ and Converse
Charm++: object-based, asynchronous, message-driven parallel programming paradigm
Converse: communication layer for Charm++
Send, recv, progress at the node level
User view
Slide 18: Single Processor Performance
Worked with IBM Toronto for 3 weeks
Inner loops slightly altered to enable software pipelining
Aliasing issues resolved through the use of #pragma disjoint (*ptr1, *ptr2)
40% serial speedup
Current best performance is with the 440
Continued efforts with Toronto to get good 440d performance
Slide 19: NAMD on BGL
Advantages
Both application and hardware are 3D grids
Large 4 MB L3 cache
On a large number of processors NAMD will run from L3
Higher bandwidth for short messages
Midpoint of peak bandwidth achieved quickly
Six outgoing links from each node
No OS daemons
Slide 20: NAMD on BGL
Disadvantages
Slow embedded CPU
Small memory per node
Low bisection bandwidth
Hard to scale full electrostatics
Limited support for overlap of computation and communication
No cache coherence
Slide 21: BGL Parallelization
Topology-driven problem mapping
Load-balancing schemes
Overlap of computation and communication
Communication optimizations
Slide 22: Problem Mapping
(Figure: the 3D application data space, with axes X, Y, Z, and the 3D processor grid.)
Slide 23: Problem Mapping
(Figure: application data space and processor grid, continued from the previous slide.)
Slide 24: Problem Mapping
(Figure: application data space; a mapping sketch follows below.)
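A hedged sketch of topology-driven mapping (illustrative names, not the NAMD code): scale each patch coordinate into torus coordinates so that neighboring patches land on nearby nodes.

// Patch (px,py,pz) in a PX x PY x PZ patch grid is placed on torus node
// (tx,ty,tz) in a TX x TY x TZ torus by proportional scaling, so spatial
// neighbors map to torus neighbors.
struct Coord3 { int x, y, z; };

Coord3 mapPatchToTorus(Coord3 patch, Coord3 patchDims, Coord3 torusDims) {
    Coord3 node;
    node.x = (patch.x * torusDims.x) / patchDims.x;
    node.y = (patch.y * torusDims.y) / patchDims.y;
    node.z = (patch.z * torusDims.z) / patchDims.z;
    return node;
}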
Slide 26: Two-Away Computation
Each data object (patch) is split along a dimension
Patches now interact with neighbors of neighbors
Makes the application more fine-grained
Improves load balancing
Messages of smaller size are sent to more processors
Improves torus bandwidth utilization (a sketch follows below)
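A hedged sketch of the interaction rule after a two-away split along X, assuming patches indexed on a 3D grid and ignoring periodic wraparound; names are illustrative, not NAMD source:

#include <cstdlib>

// After halving patches along X, a patch is half a cutoff wide in X, so it
// must exchange data with patches up to two grid cells away in X while the
// Y and Z ranges stay at one.
bool patchesInteractTwoAwayX(int ax, int ay, int az, int bx, int by, int bz) {
    return std::abs(ax - bx) <= 2 &&
           std::abs(ay - by) <= 1 &&
           std::abs(az - bz) <= 1;
}

With two-away X, each patch therefore talks to up to 5 x 3 x 3 - 1 = 44 neighbors instead of 26, with each message roughly half the size.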
Slide 28: Load Balancing Steps
Regular Timesteps
Instrumented Timesteps
Detailed, aggressive Load Balancing
Refinement Load Balancing
Slide 29: Load-balancing Metrics
Balancing load
Minimizing communication hop-bytes
Place computes close to patches
Biased through placement of proxies on near neighbors
Minimizing the number of proxies
Affects the connectivity of each data object
(A hop-bytes sketch follows below.)
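A hedged sketch of the hop-bytes metric on a 3D torus (illustrative names, not the Charm++ load balancer): each message contributes its size times its Manhattan hop distance, with wraparound links taken into account, and the balancer tries to keep the total small.

#include <cstdlib>
#include <algorithm>
#include <vector>

// Shortest hop count between two coordinates on a ring of the given size.
static int ringHops(int a, int b, int size) {
    int d = std::abs(a - b);
    return std::min(d, size - d);       // the wraparound link may be shorter
}

struct Msg { int sx, sy, sz, dx, dy, dz; long bytes; };

// Total hop-bytes of a communication pattern on an X x Y x Z torus.
long totalHopBytes(const std::vector<Msg>& msgs, int X, int Y, int Z) {
    long total = 0;
    for (const Msg& m : msgs) {
        int hops = ringHops(m.sx, m.dx, X)
                 + ringHops(m.sy, m.dy, Y)
                 + ringHops(m.sz, m.dz, Z);
        total += (long)hops * m.bytes;
    }
    return total;
}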
Slide 30: Overlap of Computation and Communication
Each FIFO has 4 packet buffers
The progress engine should be called every 4400 cycles
Overhead of about 200 cycles per call, roughly a 5% increase in computation (200/4400 ≈ 4.5%)
The remaining time can be used for computation
Slide 31: Network Progress Calls
NAMD makes progress-engine calls from the compute loops
Typically called every 10,000 cycles, dynamically tunable

for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    CmiNetworkProgress();              // pump the network inside the compute loop
    const CompAtom &p_i = p_0[i];
    // Compute pairlists
    for (k = 0; k < npairs; ++k) {     // npairs: pairlist length (loop bound truncated on the original slide)
        // Compute forces
    }
}

void CmiNetworkProgress() {
    new_time = rts_get_timebase();          // read the cycle counter
    if (new_time < lastProgress + PERIOD)   // called again too soon: do nothing yet
        return;
    lastProgress = new_time;
    AdvanceCommunication();                 // advance the message layer / torus FIFOs
}
Slide 32: MPI Scalability
Charm++ MPI driver
Iprobe-based implementation
Higher progress overhead of MPI_Test
Statically pinned FIFOs for point-to-point communication
Slide 33: Charm++ Native Driver
BGX message layer (developed by George Almasi)
Lower progress overhead
Active messages
Easy to design complex communication protocols
Dynamic FIFO mapping
Low-overhead remote memory access
Interrupts
The Charm++ BGX driver was developed by Chao Huang over this summer
Slide 34: BG/L Msglayer
(Figure: BGX message-layer architecture: an advance loop serving the scratchpad, torus, and collective message queues; torus injection and reception FIFOs (x+, x-, y+, y-, z+, z-, plus a high-priority FIFO); the collective-network FIFO; network FIFO pinning; the torus packet registry and packet dispatching; and message templates such as TorusDirectMessage.)
(This slide is taken from G. Almási's talk on the "new" msglayer.)
Slide 35: Optimized Multicast
pinFifo algorithms
Decide which of the 6 FIFOs to use when sending a message to {x,y,z,t}
Cones, chessboard
Dynamic FIFO mapping
A special send queue from which a message can go out on whichever FIFO is not full (a sketch follows below)
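A hedged sketch of one plausible pinFifo rule plus the dynamic-FIFO fallback (not the actual BGX algorithm; the function names and the hasRoom availability flags are illustrative): prefer the injection FIFO matching the axis with the largest remaining distance along the shortest torus path, and fall back to any FIFO that still has free packet slots.

#include <cstdlib>

enum Fifo { XPLUS, XMINUS, YPLUS, YMINUS, ZPLUS, ZMINUS, NUM_FIFOS };

// Signed shortest displacement from a to b on a ring of the given size.
static int ringDelta(int a, int b, int size) {
    int d = (b - a + size) % size;
    return (d <= size / 2) ? d : d - size;
}

// Choose an injection FIFO for a packet from (sx,sy,sz) to (dx,dy,dz) on an
// X x Y x Z torus.  hasRoom[f] says whether FIFO f has free packet slots.
int pinFifo(int sx, int sy, int sz, int dx, int dy, int dz,
            int X, int Y, int Z, const bool hasRoom[NUM_FIFOS]) {
    int ddx = ringDelta(sx, dx, X);
    int ddy = ringDelta(sy, dy, Y);
    int ddz = ringDelta(sz, dz, Z);

    int preferred;
    if (std::abs(ddx) >= std::abs(ddy) && std::abs(ddx) >= std::abs(ddz))
        preferred = (ddx >= 0) ? XPLUS : XMINUS;
    else if (std::abs(ddy) >= std::abs(ddz))
        preferred = (ddy >= 0) ? YPLUS : YMINUS;
    else
        preferred = (ddz >= 0) ? ZPLUS : ZMINUS;

    if (hasRoom[preferred]) return preferred;
    for (int f = 0; f < NUM_FIFOS; ++f)      // dynamic FIFO mapping: any free FIFO
        if (hasRoom[f]) return f;
    return preferred;                        // all full; caller must retry later
}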
Slide 36: Communication Pattern in PME
(Figure: PME communication patterns across 108 processors.)
Slide 37: PME
Plane decomposition for the 3D FFT
PME objects placed close to patch objects on the torus
PME optimized through an asynchronous all-to-all with dynamic FIFO mapping (a decomposition sketch follows below)
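A hedged sketch of the plane decomposition (illustrative names, not the NAMD PME code): each processor owns a contiguous slab of z-planes of the K1 x K2 x K3 charge grid, performs 2D FFTs on its planes, and an all-to-all transpose then regroups the data so the remaining 1D FFTs can be done locally.

// Which z-planes of a K1 x K2 x K3 PME grid does processor p own, out of P?
struct PlaneRange { int begin, end; };   // planes [begin, end)

PlaneRange planesOnProc(int p, int P, int K3) {
    int base = K3 / P, extra = K3 % P;                  // spread the remainder evenly
    int begin = p * base + (p < extra ? p : extra);
    int end   = begin + base + (p < extra ? 1 : 0);
    return {begin, end};
}

// After the local 2D FFT of each owned plane, an asynchronous all-to-all
// transpose sends plane fragments so that each processor ends up with full
// lines along z, on which the final 1D FFTs are performed.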
Slide 39: BGX Message Layer vs MPI
NAMD co-processor mode performance (ms/step)
Here the message layer uses sender-side blocking communication
APoA1 benchmark
The fully non-blocking version performed below par on MPI
Polling overhead is high for a list of posted receives
The BGX message layer works well with asynchronous communication
Slide 40: Blocking vs Overlap
APoA1 benchmark in co-processor mode
Slide 41: Effect of Network Progress
(Projections timeline of a 1024-node run without aggressive network progress)
Network progress is not aggressive enough: communication gaps eat up utilization
Slide 42: Effect of Network Progress (2)
(Projections timeline of a 1024-node run with aggressive network progress)
More frequent advance calls close the gaps
Slide 43: Virtual Node Mode
(Plot: APoA1 step time with PME; step time in ms vs. number of processors.)
Slide 44: Spring vs Now
(Plot: APoA1 step time with PME; step time in ms vs. number of processors.)
Slide 46: Summary
Demonstrated good scaling to 4K processors for APoA1, with a speedup of 2100
Still working on 8K results
ATPase scales well to 8K processors with a speedup of 4000+
Slide 47: Lessons Learnt
Eager messages lead to contention
Rendezvous messages don't perform well with mid-size messages
Topology optimizations are a big winner
Overlap of computation and communication is possible
Overlap, however, makes the compute load less predictable
The lack of operating-system daemons enables massive scaling
Slide 48: Future Plans
Experiment with new communication protocols
Remote memory access
Adaptive eager
Fast asynchronous collectives
Improve load balancing
Newer distributed strategies
Heavy processors dynamically offload work to neighbors
Pencil decomposition for PME
Using the double hummer