
Chombo Parallel Scaling Results

David Tilley


We tested the parallel performance of the Chombo parallel framework (Colella et al. 2009) combined with Balsara's ADER-MHD code (Balsara 2009) on the Notre Dame CRC cluster. The test problem was an MHD blast wave: a sphere of high-pressure plasma expanding into a low-pressure environment. This is a stringent test problem for MHD, as it involves strong shocks that propagate at oblique angles to the magnetic field. It is also a problem that should scale well under a well-designed parallelization strategy, since most of the computational effort is local to each grid.

The goal of an MHD/hydrodynamics code such as the one used here is to evolve a set of fluid variables (in this case density, pressure, velocities, and magnetic fields) in time according to a set of coupled partial differential equations. For this test problem, the equations involve first derivatives in space and time of various combinations of the fluid variables. The solution approach is to convert the spatial derivatives to discrete approximations involving the slopes of the fluid variables between zones and their neighbours (Balsara & Spicer 1999a,b; Balsara 2004; Balsara 2009). This is the origin of the parallel communication that arises when running the test problem on distributed systems: the solution requires the fluid variables not only at each zone, but also at its neighbours. We subdivide our domain into many sub-domains that can be distributed among the available parallel processors. We then add a layer of "ghost zones" around each sub-domain to hold the boundary information, and update these every time step with the fluid variables from the neighbouring sub-domains. The fraction of time spent in parallel communication is thus expected to be proportional to the ratio of the number of ghost zones to the number of zones in each sub-grid.
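
The ghost-zone overhead can be made concrete with a little arithmetic. The sketch below (plain C++) computes the fraction of stored zones that are ghosts for a cubic sub-domain of side n; the ghost width of 2 layers used here is only illustrative, since the actual stencil width of the ADER scheme is not stated above.

    #include <cstdio>
    #include <cmath>

    // Fraction of stored zones that are ghost zones for a cubic sub-domain
    // of side n wrapped in g layers of ghost zones. The ghost width g = 2
    // used below is an illustrative assumption, not the scheme's real width.
    double ghostFraction(long n, long g)
    {
        double total = std::pow(double(n + 2 * g), 3); // zones including ghosts
        double valid = std::pow(double(n), 3);         // interior (valid) zones
        return (total - valid) / total;
    }

    int main()
    {
        // As the per-core sub-grid shrinks (as in the constant-size runs of
        // Table 1), the ghost fraction, and with it the expected share of
        // time spent on communication, grows.
        const long sizes[] = {96, 48, 24, 12};
        for (long n : sizes)
            std::printf("n = %3ld: ghost fraction = %.3f\n",
                        n, ghostFraction(n, 2));
        return 0;
    }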

The initial conditions for the test consisted of a uniform density and magnetic field over the entire computational domain. The blast was initiated by creating a high-pressure sphere with a radius equal to 10% of the length of the computational domain and a pressure 10,000 times larger than the ambient pressure. The magnetic field strength was chosen such that the magnetic pressure (B²/8π) everywhere was 39.8 times the ambient pressure.
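
As a rough illustration of this setup, the following sketch fills a uniform N³ grid with the blast-wave initial state. The ambient pressure value, the State layout, the zone indexing, and the field direction are illustrative assumptions, not the actual data structures of Chombo or the ADER-MHD code.

    #include <vector>
    #include <cmath>
    #include <cstddef>

    // Sketch of the blast-wave initial conditions described above, on a
    // uniform N^3 grid spanning the unit cube.
    struct State { double rho, p, vx, vy, vz, Bx, By, Bz; };

    std::vector<State> initBlast(int N)
    {
        const double pi       = 3.14159265358979323846;
        const double pAmbient = 0.1;               // assumed ambient pressure
        const double pBlast   = 1.0e4 * pAmbient;  // 10,000 x ambient, as in the text
        const double rBlast   = 0.1;               // 10% of the domain length
        // |B| such that the magnetic pressure B^2/8pi is 39.8 x ambient
        const double B = std::sqrt(39.8 * pAmbient * 8.0 * pi);

        std::vector<State> u(std::size_t(N) * N * N);
        for (int k = 0; k < N; ++k)
            for (int j = 0; j < N; ++j)
                for (int i = 0; i < N; ++i) {
                    double x = (i + 0.5) / N - 0.5;  // zone-centred coordinates
                    double y = (j + 0.5) / N - 0.5;
                    double z = (k + 0.5) / N - 0.5;
                    double r = std::sqrt(x * x + y * y + z * z);
                    u[(std::size_t(k) * N + j) * N + i] =
                        {1.0,                            // uniform density
                         r < rBlast ? pBlast : pAmbient, // high-pressure sphere
                         0.0, 0.0, 0.0,                  // fluid initially at rest
                         B, 0.0, 0.0};                   // uniform field (along x here)
                }
        return u;
    }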

We ran the problem for 40 time steps for five sets of data: grids of constant total size of 96³ zones, divided into sub-boxes so that approximately the same number of boxes was updated by each parallel process; grids sized such that approximately the same amount of computational work was done per process; and grids with exactly the same computational load per process, at three different loads (60³, 76³, and 96³ zones per process).

In the following tables, we report two measures of the time taken for the program to execute. We report a CPU time, as measured by the C++ function clock(). We also report a wall clock time, as measured by the TraceTimer package included with Chombo, which also gives a detailed breakdown of the wall time spent in various important function calls.
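
A minimal sketch of the two timers is given below: clock() accumulates CPU time for each process, while MPI_Wtime() stands in here for the wall clock provided by Chombo's TraceTimer. Summing the per-process CPU times onto rank 0 gives an aggregate CPU time of the kind tabulated below, whose ratio to the wall time defines the parallel efficiency.

    #include <ctime>
    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        std::clock_t c0 = std::clock();   // start of CPU time for this process
        double       w0 = MPI_Wtime();    // start of wall clock time

        // ... evolve the fluid variables for 40 time steps here ...

        double cpuLocal = double(std::clock() - c0) / CLOCKS_PER_SEC;
        double wallTime = MPI_Wtime() - w0;

        // Aggregate CPU time across all processes onto rank 0.
        double cpuTotal = 0.0;
        MPI_Reduce(&cpuLocal, &cpuTotal, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            std::printf("CPU %.2f s, wall %.2f s\n", cpuTotal, wallTime);

        MPI_Finalize();
        return 0;
    }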

From these two times, we calculate two derived quantities:

Parallel Efficiency = CPU Time / Wall Time

% Communication = (1 − Parallel Efficiency / Number of cores) × 100

For each data set, we also plot the number of millions of zones updated per second, a measure of how fast the computation proceeds. The total number of zones used here does not include ghost zones, which are not considered part of the computational domain.

Mzones/s = (Total number of zones) × 40 × 10⁻⁶ / (Wall clock time)
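
As a sanity check, here is a direct transcription of these three formulas into C++, applied to the 8-core row of Table 1; the computed values agree with the tabulated ones to within the rounding of the inputs.

    #include <cstdio>

    // The derived quantities defined above. totalZones excludes ghost
    // zones, and 40 is the number of time steps taken in each test.
    double parallelEfficiency(double cpuTime, double wallTime)
    {
        return cpuTime / wallTime;
    }

    double percentCommunication(double cpuTime, double wallTime, int nCores)
    {
        return (1.0 - parallelEfficiency(cpuTime, wallTime) / nCores) * 100.0;
    }

    double mzonesPerSecond(double totalZones, double wallTime)
    {
        return totalZones * 40.0 * 1.0e-6 / wallTime;
    }

    int main()
    {
        // Example: the 8-core row of Table 1 (96^3 zones).
        double zones = 96.0 * 96.0 * 96.0;
        std::printf("efficiency = %.3f\n", parallelEfficiency(1127.0, 145.6));
        std::printf("%% comm     = %.3f\n", percentCommunication(1127.0, 145.6, 8));
        std::printf("Mzones/s   = %.3f\n", mzonesPerSecond(zones, 145.6));
        return 0;
    }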

Results:

The simulations with a constant-size domain, described in Table 1 and Figure 1, exhibited very poor scaling. This was expected: the grids on individual processors become very small as the number of processors increases, which raises the ratio of ghost zones to valid zones and hence the fraction of the total time spent communicating.

The simulations on a variable-size domain, which contain approximately 48³ zones per core, exhibited better scaling. These results are contained in Table 2, and the scaling is illustrated in Figure 2. In many cases an exactly equal division between processors was not possible; as a result, some processors may have had up to twice the work of others in the same simulation. This increases communication costs, as the processors with smaller loads wait for those with larger loads. Still, these simulations may be representative of production simulations with AMR, where one cannot in general guarantee precisely equal loads on each processor. The factor of ~2 loss in performance at 50+ processors may be acceptable for many such tasks.

In Tables 3, 4, and 5 we record the data from the three sets of simulations with exactly equal load per process: 60³ (Table 3), 76³ (Table 4), and 96³ (Table 5) zones per processor. These all exhibit very good scaling up to 125 processors. We did start to see some variance in the execution times with many processors. In particular, the first run on 64 processors in Table 5 took an extremely long time to complete. Analysis of the profiling information that was written out showed that nearly all of the excess time was spent in MPI_Waitall, a message-passing function used to synchronize the processors by delaying further code execution until all of them have signalled that they have reached that point. A few of the processes took significantly longer wall times to execute the portions of the code that update the fluid variables (note that these blocks of code are communication-free), while having CPU times approximately the same as the other, faster-executing processes. This suggests that these slow processes were sharing processor time with jobs from other users.
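
To make the role of MPI_Waitall concrete, here is a minimal non-blocking ghost-zone exchange for a one-dimensional decomposition. This is not Chombo's implementation; it simply shows that every rank sits in MPI_Waitall until its neighbours' messages have arrived, so a single slow process inflates the MPI_Waitall time of all the others.

    #include <mpi.h>
    #include <vector>

    // Illustrative ghost-zone exchange with non-blocking sends/receives
    // between a rank and its left/right neighbours.
    void exchangeGhosts(std::vector<double>& sendLeft, std::vector<double>& sendRight,
                        std::vector<double>& recvLeft, std::vector<double>& recvRight,
                        int left, int right)   // neighbour ranks
    {
        MPI_Request req[4];
        MPI_Irecv(recvLeft.data(),  int(recvLeft.size()),  MPI_DOUBLE,
                  left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recvRight.data(), int(recvRight.size()), MPI_DOUBLE,
                  right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(sendLeft.data(),  int(sendLeft.size()),  MPI_DOUBLE,
                  left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(sendRight.data(), int(sendRight.size()), MPI_DOUBLE,
                  right, 0, MPI_COMM_WORLD, &req[3]);

        // Blocks until all four operations complete; a slow neighbour
        // (e.g. one sharing its CPU with another user's job) stalls us here,
        // which appears in the profile as time spent in MPI_Waitall.
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }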


References:

Balsara, D. S. 2004, ApJS, 151, 149
Balsara, D. S. 2009, J. Comput. Phys., 228, 2480
Balsara, D. S., & Spicer, D. 1999a, J. Comput. Phys., 148, 133
Balsara, D. S., & Spicer, D. 1999b, J. Comput. Phys., 149, 270
Colella, P., et al. 2009, "Chombo Software Package for AMR Applications Design Document" (https://seesar.lbl.gov/ANAG/chombo/index.html)

--Dtilley 18:06, 19 November 2009 (EST)

Table 1: Constant size problems, Intel compiler, OpenMPI parallel libraries, on a mixture of Opteron and Nehalem machines; times reported per time step by the code.

# of cores   # of zones   CPU time (s)   Wall time (s)   Parallel Efficiency   % Communication   Head machine
1            96³          262.7          265.2           –                     –                 –
4            96³          573.6          147.4           3.889                 2.770             –
6            96³          941.2          161.8           5.817                 3.052             ddcopt128
8            96³          1127.          145.6           7.739                 3.264             ddcopt056
12           96³          2304.          196.8           11.70                 2.475             ddcopt045
16           96³          2654.          172.1           15.43                 3.584             ddcopt037
24           96³          3556.          154.3           23.05                 3.945             dqcneh003
32           96³          5528.          181.8           30.43                 4.921             ddcopt113
64           96³          11320          188.0           60.21                 5.917             dqcneh008
128          96³          24280          202.2           120.1                 6.180             ddcopt135

Scaling simple const.jpg

Figure 1: The number of millions of zones updated per second, as a function of the number of processors used in the calculation, for a total grid size of 96³.

Table 2: Variable size problems (approximately equal load per process), Intel compiler, OpenMPI parallel libraries, on a mixture of Opteron and Nehalem machines; times reported per time step by the code.

# of cores   # of zones   CPU time (s)   Wall time (s)   Parallel Efficiency   % Communication   Head machine
1            48³          70.20          71.4            –                     –                 –
4            76³          245.4          63.64           3.855                 3.615             ddcopt052
6            88³          872.0          148.5           5.869                 2.181             ddcopt055
8            96³          1127.          145.6           7.739                 3.264             ddcopt056
12           108³         2291.          196.6           11.65                 3.879             ddcopt109
16           120³         5064.          328.5           15.42                 3.647             ddcopt115
24           144³         6204.          269.1           23.05                 3.946             ddcopt144
32           152³         8908.          291.5           30.56                 4.516             ddcopt138
64           192³         18330          307.4           59.62                 6.843             ddcopt038
128          240³         48840          392.5           124.4                 3.788             ddcopt060

Scaling simple variable.jpg

Figure 2: The number of millions of zones updated per second, as a function of the number of processors, for approximately equal load per processor. The red line shows the ideal scaling, normalized to the number of zones updated per second for 8 processors.

Table 3: Exactly constant load (60³ zones) per process. Intel compiler, OpenMPI parallel libraries, on a mixture of Opteron and Nehalem machines; total times reported.

# Processors            1          8          27         64         125
Grid size               60³        120³       180³       240³       300³
#zones updated (×10⁶)   8.640      69.120     233.280    552.960    1080.000
Wall time (s)           89.37      188.67     248.95     361.10     379.41
CPU time (s)            88.97      1494.63    6626.04    22783.50   46954.70
Parallel Efficiency     –          7.9221     26.616     63.095     123.76
% Comm.                 –          0.97434    1.4309     1.4146     0.99347
Head machine            ddcopt099  ddcopt099  ddcopt099  ddcopt099  ddcopt099

# Processors            216        343        512        729        1000
Grid size               360³       420³       480³       540³       600³
#zones updated (×10⁶)   1866.240   2963.520   4423.680   6298.560   8640.000
Wall time (s)           341.833    622.6414   511.6779   517.5079   774.3829

Chombo scaling 60cubed.jpg

Figure 3: The number of Mzones updated per second as a function of the number of processors, for an exactly equal load on each processor. The red line shows the ideal scaling, normalized to the result for 8 processors.

Table 4: Exactly constant load (76³ zones) per process. Intel compiler, OpenMPI parallel libraries, on a mixture of Opteron and Nehalem machines; total times reported.

# Processors            1            8            27           64           125
Grid size               76³          152³         228³         304³         380³
#zones updated (×10⁶)   17.55904     140.47232    474.09408    1123.77856   2194.88000
Wall time #1 (s)        177.3207     234.8639     392.8518     443.5226     628.8480
Wall time #2 (s)        –            –            390.0954     450.2569     450.4608

# Processors            216          343          512          729          1000
Grid size               456³         532³         608³         684³         760³
#zones updated (×10⁶)   3792.75264   6022.75072   8990.22848   12800.54067  17559.04000
Wall time #1 (s)        460.9112     533.1435     744.6929     1110.3699    1624.9462
Wall time #2 (s)        994.9976     1039.2158    1026.0387    1028.6404    1101.0686


Table 5: Exactly constant load (96³ zones) per process. Intel compiler, OpenMPI parallel libraries, on a mixture of Opteron and Nehalem machines; total times reported.

# Processors            1           8           27          64 (run 1)  64 (run 2)  125
Grid size               96³         192³        288³        384³        384³        480³
#zones updated (×10⁶)   35.389      283.115     955.514     2264.924    2264.924    4423.680
Wall time (s)           457.29      737.67      727.44      7109.98     863.90      1173.33
CPU time (s)            457.07      5869.07     19532.80    435867.59   54748.60    145538.88
Parallel Efficiency     –           7.9562      26.851      61.304      63.374      124.039
% Comm.                 –           0.54717     0.55095     4.2131      0.97811     0.76889
Head machine            ddcopt084   ddcopt099   ddcopt099   ddcopt099   ddcopt099   ddcopt099

Process distribution:
  1 core:            1 process on 1 Opteron
  8 cores:           8 processes on 3 Opterons
  27 cores:          27 processes on 11 Opterons
  64 cores (run 1):  39 processes on 15 Opterons, 25 processes on 6 Nehalems
  64 cores (run 2):  59 processes on 47 Opterons, 5 processes on 1 Nehalem
  125 cores:         89 processes on 48 Opterons, 36 processes on 9 Nehalems

Chombo scaling 96cubed.jpg

Figure 4: The number of Mzones updated per second, as a function of the number of processors used. The red line again denotes ideal scaling. The run at 64 processors was re-run, as the original run spent an unusually long amount of time in MPI_Waitall calls.