COMSOL 5.2 in Hybrid Parallel Mode
Investment in HPC infrastructure is never a small deal. To achieve maximum efficiency, one wants to utilize all levels of computing resources, be they CPU cores, CPUs, or servers, at the same time. This is known as parallel computing, which can be roughly classified into two types: symmetric multiprocessing (SMP) and distributed memory parallelism (DMP). Although each of them can operate on its own, the highest performance is very often achieved by leveraging both. This practice is called hybrid parallel computing, and it has long been supported by COMSOL, including the current release, version 5.2.
In COMSOL, the SMP and DMP configuration is defined by the NP and NNHOSTS parameters, indicating the number of threads per process and the number of processes per server node, respectively. NP times NNHOSTS must be less than or equal to the total number of physical CPU cores on a node. The following list shows the four parallel execution modes defined by NP and NNHOSTS (a command-line sketch follows the list).
- If NP = 1 and NNHOSTS = 1, then COMSOL runs in serial-execution mode.
- If NP = 1 and NNHOSTS > 1, then COMSOL runs in pure DMP mode.
- If NP > 1 and NNHOSTS = 1, then COMSOL runs in pure SMP mode.
- If NP > 1 and NNHOSTS > 1, then COMSOL runs in hybrid parallel mode.
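As a rough illustration, the sketch below shows how such a configuration might be passed to a COMSOL batch job on a Linux cluster. The flag names (-f, -nn, -nnhost, -np) follow COMSOL's cluster-computing command-line options; the hostfile path, node count, and model file names are placeholders.

```
#!/bin/bash
# Hypothetical launch of a COMSOL 5.2 batch job in hybrid parallel mode.
# Assumptions: 8 nodes listed in ./hostfile, 28 physical cores per node,
# NNHOSTS = 4 processes per node, NP = 7 threads per process (4 x 7 = 28).
# -f: machine file, -nn: total processes (8 nodes x NNHOSTS = 32),
# -nnhost: processes per node (NNHOSTS), -np: threads per process (NP).
comsol batch -f ./hostfile -nn 32 -nnhost 4 -np 7 \
    -inputfile model.mph -outputfile model_solved.mph
```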
Field measurements
To illustrate the influence of the NP and NNHOSTS parameters, we measured the performance of the standard 8517_RIM_SUBMODEL_PARDISO model with COMSOL 5.2 on one and on eight servers, each equipped with two Intel Xeon E5-2690v4 CPUs and connected via the Intel Omni-Path 100 Gbit/s high-speed network. We alternated among several combinations of NP and NNHOSTS while making sure that all cores on each node were used. The following diagram shows the measured runtimes.
Looking at the single-node performance on the left side of the chart, we can clearly see that the yellow bar, representing {NP, NNHOSTS} = {2, 14}, has the shortest runtime. With this best configuration, one saves 10–50% of the runtime compared to the other parameter choices.
On the right side, we scale out to eight nodes. There the runtime is reduced, as expected. However, the speedup factor differs significantly from one choice of {NP, NNHOSTS} to another. Clearly, the grey bar, representing {NP, NNHOSTS} = {4, 7}, delivers the best performance and shows a fairly decent speedup of 6.4 with respect to one node. The single-node optimum of {NP, NNHOSTS} = {2, 14}, however, becomes a much inferior choice, yielding a speedup of only 2.0.
In general, one can observe that the runtime is a convex function of NNHOSTS, i.e. it first decreases and then increases. This has two causes:
- The larger NNHOSTS is (and hence the smaller NP is), the fewer threads each process runs, and hence the lower the thread-synchronization overhead. This explains why the runtime decreases when NNHOSTS increases from 2 to 14 on a single node (or from 2 to 7 on eight nodes).
- The larger NNHOSTS is, the higher the inter-process communication overhead. This explains why the runtime increases when NNHOSTS increases from 14 to 28 on a single node (or from 7 to 28 on eight nodes).
As a rule of thumb, one should first try all feasible combinations of NNHOSTS and NP on a single node to find the sweet spot. Then, as the simulation scales out to more nodes, gradually decrease NNHOSTS and increase NP. A simple sweep is sketched below.
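The following sketch automates such a single-node parameter sweep. It assumes a node with 28 physical cores, a hypothetical model file model.mph, standard COMSOL batch flags, and GNU time for measurement; the factor pairs are illustrative and should be adapted to your CPU.

```
#!/bin/bash
# Illustrative single-node sweep over {NP, NNHOSTS} pairs whose product
# equals the 28 physical cores of this node; adjust the pairs to your CPU.
# On a single node, the total process count (-nn) equals NNHOSTS (-nnhost).
for pair in "28 1" "14 2" "7 4" "4 7" "2 14" "1 28"; do
    set -- $pair                      # $1 = NP, $2 = NNHOSTS
    echo "Running with NP=$1, NNHOSTS=$2"
    /usr/bin/time -f "elapsed: %e s" \
        comsol batch -nn "$2" -nnhost "$2" -np "$1" \
            -inputfile model.mph -outputfile "out_np${1}_nnhosts${2}.mph"
done
```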
Unfortunately, what we just concluded for the 8517_RIM_SUBMODEL_PARDISO model is not universally applicable. Instead, the optimal choice of NP and NNHOSTS depends on the simulated model and the solver it uses. For example, running the BLOCK_BENCHMARK_MUMPS model yields the following result.
According to the figure, {NP, NNHOSTS} = {7, 4} is the optimal choice on one server and {NP, NNHOSTS} = {14, 2} is the best choice on eight server nodes, both different from the optimum for 8517_RIM_SUBMODEL_PARDISO. In conclusion, one should not blindly rely on the best parameter values learned from a previously run model, but always explore the hybrid-parallel parameter space, unless the difference between the models is insignificant.
In this article, all examples have utilized all cores on a node. In practice, it is possible to use fewer cores than available, for example if the server has to be shared with other tasks at the same time. In such cases, attention must be paid to the exact CPU-core utilization scheme in order to achieve the best performance. For instance, the threads of one process should not be spread across two CPU sockets, since thread-synchronization overhead is much higher across CPUs than within a CPU. A pinning sketch follows below.
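As a minimal sketch of such a scheme, the command below uses the Linux numactl utility to pin a hypothetical two-thread, single-process COMSOL run to the cores and memory of one socket; the NUMA node ID and model file names are placeholders.

```
#!/bin/bash
# Pin a two-thread, single-process COMSOL job to NUMA node 0 (one socket),
# so both threads synchronize within one CPU instead of across sockets.
numactl --cpunodebind=0 --membind=0 \
    comsol batch -np 2 -inputfile model.mph -outputfile out.mph
```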
As a final note, BIOS settings can also impact the performance of hybrid-parallel COMSOL. Generally speaking, disabling both "Early Snoop" and "Cluster on Die" brings the best result.
Conclusion
While the best COMSOL performance is usually achieved by exploiting hybrid parallelism, choosing the best hybrid-parallel parameter values requires a thorough understanding of the simulated model, the system configuration, and the underlying hardware. At ict GmbH, Fujitsu HPC Competence Center Aachen, we deeply analyze the interaction between the COMSOL software, the operating system, and the hardware with professional performance-profiling tools, such as Intel MPI Performance Snapshot and Intel Performance Counter Monitor, in order to determine the most efficient way of executing COMSOL workloads and to maximize our customers' ROI. Furthermore, once the investigation is completed, the best-practice parameters can be configured as a working method on our Fujitsu HPC Gateway platform, further simplifying the workflow and saving computing cycles.
Regarding Fujitsu’s offering for COMSOL application performance diagnosis and its Gateway integration, please contact our sales representative.
For more information, please have a look at the COMSOL section on our website here.
Looking for the right hardware for your COMSOL simulation tasks? Our optimized and benchmarked solutions arrive ready to use with your COMSOL Multiphysics license preinstalled.
Download our COMSOL and Fujitsu Workstation Flyer here.