It’s one thing to have a powerful supercomputer cluster, it’s quite another thing to use it at its full potential. For anyone who has ever chopped wood, you know that slight changes in one’s stance or grip can dramatically increase the amount force the axe can delivered. Similarly, when dealing with a multi-core computer, over 5,000 cores in our case, slight adjustments in how you run a complex calculation can greatly impact the processing time.
Here’s an example using simplified numbers. Imagine you have 4 cores and each can run 10 instructions per second (IPS). Potentially, your machine could carry out a set of 100 instructions in 2.5 seconds. (100 instructions divided evenly among 4 processors=25 each; divided by 10 IPS=2.5 sec) However, computers don’t intuitively know how to divvy up tasks. Worst case scenario, all the instructions get sent to one core while the others are left idle. This would take the job 10 seconds to process, which is 4 times as long. The practice of optimizing the distribution of the work load is known as load balancing. When epidemiological simulations are taking hours, days and weeks to process across thousands of cores, effective load balancing becomes crucial.
Applying Load Balancing to Epidemiological Modeling
A few months ago, we received the results of the static load balancing script, which makes sure that each core does an equal amount of work over the epidemiological simulation of Madagascar . While doing further analysis, we found that each core was also spending considerable time waiting for other cores, even though the static load balancing did provide substantial efficiency improvements. This waiting time is due to dynamic load imbalances that average to zero.
Below is an estimation of how we initially split up Madagascar along with the figure of the corresponding relative load by core over time, with no load balancing other than giving equal land areas to each processor.

Equal land area load distribution
Notice the big gap between cores 30 and 50. In this approach, where the core/processor number corresponds to a specific block region on the map, 30-50 are doing a lot less work than the cores. This is due to the island’s central plateau, which has very little malaria and thus is an inexpensive simulation. In addition, there are also varying dips along the time axis depending on the location. These are caused by seasonal differences across the country. The north is always fairly expensive, while the southern arid region has deep troughs during dry season. In a sense, the spatio-temporal patterns of malaria from the north to south of Madagascar are visible in processor loading.
These inconsistencies informed our static load balancing strategy where we divided Madagascar by average malaria prominence. This approach improves matters, and all core rows add up to the same value, but the columns differ at each point in time.

Static load balancing by average malaria prominence
Much better, and the overall job runs much faster, since the more even processor loading reduces the waiting time.
“Loading” is a relatively smooth function of location, however, also dependent on population density, temperature and rainfall. These factors vary over distances that are much larger than our simulation resolution. So if we grid off the country and divvy up ”pixels” between cores, then every core has an equal sample of every area in Madagascar. This results in automatic static and dynamic load balancing.

Checkerboard balancing
However, there is an overhead cost to checkerboard balancing. With large blocked regions, such as in static load balancing, almost all migration of individuals happens within one of these large blocks and thus within a single core, being handled in local memory with pointers. With a checkerboard, almost all migration is to other cores, and individuals and their infections must be packed, sent in a message, and unpacked. Over the course of a 90 minute simulation, this adds about 6 minutes of overhead. Yet the elimination of waiting time at synchronization gains time, and the even balancing of the load improves efficiency as well.
At the moment, both static load balancing which avoids the migration overhead, and checkerboard which eliminates processor waiting time both provide good improvements over the basic simulation efficiency.
We are working on dynamic load balancing that keeps areas connected and dynamically balanced. This has the potential to achieve both types of efficiency improvements with less overhead. We’ll report back when we have the results.