Xeon Phi “Knights Landing” Processor: Crafted for Massively Parallel Simulation
For high-performance simulation experts, the Intel® Xeon® series is probably the de facto standard CPU for computation nowadays. Nevertheless, if you are keen to get optimal efficiency for OpenFOAM, you should not miss another attractive alternative: the Intel® Xeon Phi™ processor family, first launched in 2013 as a co-processor and updated in 2016 as a host processor bearing the code name “Knights Landing” (KNL).
As a host processor, Xeon Phi KNL does not behave very differently from a traditional Xeon: it boots the server by itself, runs the operating system, and runs most code written for Xeon. To gain higher performance, recompilation with Xeon Phi-specific flags and modernization of the code for the Xeon Phi architecture may be needed. Luckily, for OpenFOAM, this effort has largely been done by Intel.
That said, Xeon Phi has several unique features that set it apart from Xeon:
- 64 – 72 cores are built into a Xeon Phi processor, compared with 12 – 22 cores on a Xeon “Broadwell” chip.
- High-bandwidth memory, integrated into the Xeon Phi package, provides about 4x the memory bandwidth of the traditional DDR4 memory on the Xeon platform.
- The vector units of Xeon Phi are twice as wide as those of Xeon, allowing twice as many operands to be processed at a time when the code is vectorized.
- The Intel Omni-Path 100 Gbit/s interconnect can be integrated into the CPU, eliminating the cost of an additional PCIe adapter.
The following figure depicts the architecture of Intel Xeon Phi KNL.
Source: Intel
To make efficient use of Intel Xeon Phi processors, Fujitsu has developed the PRIMERGY CX600 M1 server platform. The 2U chassis houses eight PRIMERGY CX1640 M1 compute nodes, four on the front and four on the rear side, each equipped with one KNL CPU. As of April 2017, Fujitsu has completed two large installations of this server type: at the Jülich Supercomputing Centre in Germany (QPACE3) and at the Joint Center for Advanced High Performance Computing near Tokyo, Japan (Oakforest-PACS). The rear side of the CX600 is shown below.
OpenFOAM on Xeon Phi
In my last article, “Drawing realistic CPU-performance expectations”, I wrote that an application’s performance is influenced by several characteristics of the CPU, not only its frequency. Now you may wonder how KNL’s strengths apply to OpenFOAM.
OpenFOAM typically scales much better across nodes than within a node, especially on high-core-count Xeon CPUs. The main reason for this is memory bandwidth. Since OpenFOAM usually saturates all the memory bandwidth of a Xeon node, its performance is determined more by the memory bandwidth, which is fixed on a single node, than by the computational power of the processor, which is proportional to the number of cores.
Xeon Phi KNL, with its 16 GB of integrated high-bandwidth memory, is just the right cure for OpenFOAM’s memory-bandwidth-bound behavior. You might think 16 GB is too small, but it is usually sufficient once the model has been decomposed across nodes: in our analysis, a 100-million-cell model run on 32 nodes consumes only around 18 GB per node. When the working set exceeds the high-bandwidth memory, the standard DDR4 memory is used automatically.
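As a back-of-envelope check, the per-node figures above imply a memory footprint per cell, from which one can estimate how many cells fit entirely in the 16 GB of high-bandwidth memory. A minimal sketch, using only the numbers quoted above (100 million cells, 32 nodes, ~18 GB per node):

```shell
# Back-of-envelope estimate based on the measurement quoted above:
# a 100-million-cell model on 32 nodes uses about 18 GB per node.
CELLS=100000000
NODES=32
MEM_PER_NODE_GB=18

# Implied memory footprint per cell, in bytes
BYTES_PER_CELL=$(( MEM_PER_NODE_GB * 1000000000 * NODES / CELLS ))
echo "approx ${BYTES_PER_CELL} bytes per cell"   # approx 5760 bytes per cell

# Number of cells per node whose working set fits in 16 GB of MCDRAM
FIT_CELLS=$(( 16 * 1000000000 / BYTES_PER_CELL ))
echo "up to ~${FIT_CELLS} cells per node fit in the high-bandwidth memory"
```

At roughly 2.7 million cells per node, decomposing a large model across enough nodes keeps each rank’s working set inside the fast memory, which is exactly the regime where KNL pays off.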
In fact, build rules for Xeon Phi KNL have been included in OpenFOAM since the OpenCFD release v1612+ and the OpenFOAM Foundation release 4.1. One only needs to set WM_COMPILER=GccKNL or WM_COMPILER=IccKNL in the etc/bashrc file, and the KNL-specific compilation options will be applied. On top of that, one is encouraged to use the Intel-optimized GaussSeidel and symGaussSeidel smoothers to obtain higher performance. The code is freely available at https://github.com/OpenFOAM/OpenFOAM-Intel.
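As a sketch, assuming an OpenFOAM v1612+ or 4.1 source tree is already unpacked, selecting the KNL build rule could look like this (editing the compiler line in etc/bashrc as described above, then rebuilding):

```shell
# Sketch: enable the KNL build rule in an OpenFOAM source tree (v1612+ / 4.1).
# Set the compiler selection in etc/bashrc to the KNL-specific variant:
sed -i 's/^export WM_COMPILER=.*/export WM_COMPILER=IccKNL/' etc/bashrc
# (use GccKNL instead of IccKNL for the GNU toolchain)

source etc/bashrc   # picks up the KNL-specific compilation options
./Allwmake          # rebuild the whole tree with those options
```

The IccKNL rule makes the Intel compiler target KNL’s 512-bit vector units, which is what lets the wider vector hardware described earlier actually be used.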
In summary, OpenFOAM’s biggest performance bottleneck is addressed by the high-bandwidth memory on Xeon Phi KNL. OpenFOAM’s source code is officially ready for Xeon Phi – just specify the right build rule, then you get the right performance.
Field measurements
After all, theories are only useful when proven by experiments. To show that Xeon Phi is indeed a better choice than Xeon for OpenFOAM, we evaluated the performance with a real-world simulation model: the motorbike case, shown below, from the standard tutorial suite of The OpenFOAM Foundation.
The original tutorial model has only three million cells. To simulate a more realistic workload, we refined the mesh to 88 million cells. The remaining software configuration is given below.
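For reference, a hypothetical run sequence for such a decomposed parallel case could look like the following; the exact utilities and decomposition settings depend on the case setup, and an OpenFOAM environment is assumed to be sourced:

```shell
# Hypothetical run sequence for a refined motorbike-style case
# on one 64-core KNL node (assumes the case directory is prepared).
blockMesh                          # generate the background mesh
snappyHexMesh -overwrite           # fit the mesh around the motorbike geometry
decomposePar                       # split the mesh across the MPI ranks
mpirun -np 64 pisoFoam -parallel   # one MPI rank per KNL core
reconstructPar                     # reassemble the results for post-processing
```

The decomposition step is what keeps each rank’s share of the mesh small enough to benefit from the high-bandwidth memory discussed earlier.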
Solver | pisoFoam
Category | Incompressible flow
OpenFOAM | 4.1 + Intel-optimized smoothers for KNL
Compiler | GCC 6.1.0
MPI | Intel MPI 2017 Update 1
The configurations of the studied hardware platforms, Xeon and Xeon Phi, are given in the following table. Note that the Xeon platform is about 25% more expensive than the Xeon Phi platform in this case.
Platform | Xeon | Xeon Phi
Compute Node | Fujitsu PRIMERGY CX2550 M2 | Fujitsu PRIMERGY CX1640 M1
CPU | 2x Intel Xeon E5-2690 v4, 14 cores/CPU, 2.9 GHz | 1x Intel Xeon Phi 7210 (KNL), 64 cores/CPU, 1.3 GHz
High-Bandwidth Memory | None | 16 GB
DDR Memory | 128 GB DDR4-2400 | 192 GB DDR4-2133
Interconnect | Intel Omni-Path | Intel Omni-Path
Storage | 64 GB SATA DOM | 64 GB SATA DOM
Peak Power Consumption | 394 W | 321 W
We have summarized our measurement results in the following chart. As can be seen from the height of the bars, Xeon Phi always finished the simulation significantly ahead of Xeon. On average, you can run 30% more simulations per day with Xeon Phi than with Xeon.
But Xeon Phi is not just more powerful than Xeon; it is superior in all three metrics of performance per price, performance per watt, and performance per rack unit. We have calculated these numbers in the following chart, based on single-node measurements. Note that all numbers have been normalized against the Xeon platform.
Conclusion
Fujitsu’s PRIMERGY CX1640 M1 compute node, together with Intel’s Xeon Phi processor, unleashes the performance of OpenFOAM through its massively parallel core architecture and high-bandwidth memory. The application has been optimized for Xeon Phi for maximum performance. Tested with a realistic 88-million-cell incompressible-flow simulation model, we have shown three advantages of Fujitsu’s PRIMERGY CX1640 M1 platform for Xeon Phi:
- High performance per price: one Xeon Phi node runs each simulation 1.7x faster than a 2-socket Xeon node of the same price.
- Energy efficiency: one Xeon Phi node consumes two thirds of the energy of a 2-socket Xeon node for every simulation.
- Space efficiency: one Xeon Phi node occupies only one third of the space of a 2-socket Xeon node for every simulation.
If you are interested in our Fujitsu PRIMERGY CX1640 M1/CX600 M1 offering for Xeon Phi, please contact our sales representative.
Additionally, you have the opportunity to gain first-hand experience of the power of an HPC environment with Intel® Xeon Phi™ processors, free of charge. Register for access and try a system preloaded with a set of ready-to-use applications, or bring your own codes. Fujitsu’s HPC team will provide technical support and assistance to ensure you get the highest return from your experience.
Register and experience the benefits now!