Calculate the performance of your cluster on the “truck_111m” benchmark from the “ANSYS Fluent” software suite, or find out how many cores you need in your cluster to get a specific performance rating, in tasks per day.
ANSYS Fluent 13.0.0 Simple Performance Model
I am brave and bold, get me right to the model!
Introduction
ANSYS Fluent is a computational fluid dynamics (CFD) application: computer-aided engineering (CAE) software used to design a wide range of technical devices, from concrete mixers to cars to spaceships.
ANSYS Fluent has been on the market for many years and has gained significant popularity and users' confidence. The software is successfully used on parallel computers, big and small; it makes good use of InfiniBand and works under both Linux and Windows.
What makes ANSYS Fluent special is that it is trusted by leading companies. That's why research teams around the world want to build computer clusters that would run ANSYS Fluent and deliver a particular level of performance.
But how do you design a cluster that delivers the required performance level? Technical teams from big vendors usually analyse the customer's workloads and then refer to available benchmark results to estimate how much hardware the future cluster needs in order to meet the performance goals.
It is always helpful to automate this mundane task. We present a performance model of ANSYS Fluent on a typical large-scale CFD problem, truck_111m: the airflow around a truck body. This problem contains 111 million elements. With a high-speed interconnect such as InfiniBand, it can successfully run on parallel configurations with up to 3072 cores. A less capable interconnect, 10 Gigabit Ethernet, limits scalability to about 384 cores.
Direct and Inverse Performance Modelling
This performance model takes the publicly available benchmarking results and approximates them with simple formulae; these analytic expressions constitute the performance model. The coefficients in the formulae are selected so as to provide good agreement between the model and the measured data.
There are two ways a performance model can be used. The first is direct performance modelling: given the computer parameters (in our case, the number of cores and the type of the interconnection network), calculate the performance of such a computer. The other is the inverse problem: given the required performance, find the parameters of a computer that would attain it. Solving the inverse problem involves solving the direct problem several (often many) times, so it is inherently more complex.
The inverse problem is precisely the problem that technical teams have to solve for their customers: given the desired performance, design a supercomputer. In our model, the following formulation is used: given the desired performance and the interconnection network type, calculate the required number of cores.
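As an illustration of how the inverse problem can be reduced to repeated evaluations of the direct model, here is a minimal Python sketch. The performance_of(cores, network) function is a stand-in for the actual model formulae (not reproduced here): given a configuration, it returns the predicted performance.

def required_cores(target_performance, network, performance_of, max_cores=3072):
    # Walk up the core counts and return the first configuration whose
    # predicted performance (tasks per day) meets or exceeds the target.
    for cores in range(1, max_cores + 1):
        if performance_of(cores, network) >= target_performance:
            return cores
    return None  # the target is not attainable within the scalability limit

Because the direct model is cheap to evaluate, even this brute-force search is instantaneous; bisection would also work, since performance grows with the number of cores up to the scalability limit.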
The Unit of Measurement for Performance
Traditionally, ANSYS Fluent uses the number of tasks that can be solved in a day as the performance measure of a supercomputer; this is called the "rating". For example, if a supercomputer has a rating of 48 on a particular benchmark, this means that in 24 hours it can run 48 tasks. Put differently, one benchmark run takes 0.5 hours, or 30 minutes.
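For reference, the conversion between the rating and the wall-clock time of a single run follows directly from this definition; nothing else is assumed in the snippet below.

HOURS_PER_DAY = 24

def rating_to_hours(rating):
    # A rating of 48 tasks per day means 24 / 48 = 0.5 hours per run.
    return HOURS_PER_DAY / rating

def hours_to_rating(hours_per_run):
    # The inverse conversion: a 0.5-hour run corresponds to a rating of 48.
    return HOURS_PER_DAY / hours_per_run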
The performance of a supercomputer thus depends on its hardware and software characteristics as well as on the nature of the benchmark. Benchmarks are used to represent real-life scenarios of using the software. The above-mentioned 111-million element benchmark, whose size is typical for current engineering practice, can take about a day to run in serial mode even on a modern workstation. That's why parallel runs are used to decrease the time to solution: you run as many concurrent threads as you have available CPU cores.
Throughput Computing Mode
However, you cannot increase the number of concurrent threads indefinitely, because after reaching a certain limit the time to solution stops decreasing and, in fact, starts to increase! Hence, for every combination of hardware and workload there is a limit on the number of cores that should not be exceeded.
For example, it is not reasonable to use more than 3072 cores when running the truck_111m benchmark on a computer cluster equipped with an InfiniBand network. In the case of 10 Gigabit Ethernet, the upper limit on the number of cores is significantly smaller: only 384 cores.
Smaller problems have smaller limits: a 500,000-cell problem (turbo_500k, a turbomachinery flow) can hardly utilize more than about 200 cores effectively, even with the fastest networks.
If you have more cores in your cluster than the above limits, you can run several problems simultaneously. This is often helpful in engineering practice, when several variants of a technical device need to be considered. This mode is called throughput computing mode.
You still obtain a solution to every individual problem as fast as possible by running it on a portion of the cluster's cores, while the other cores are used to run other problems. The total "throughput performance" of the cluster is the sum of the performances of its partitions.
Example. Suppose you have 500 cores in the cluster, and your problem can only effectively utilize 200 cores or fewer. When you run this problem on 200 cores, the performance is, say, 4 tasks per day. With 100 cores, it is 2.5 tasks per day. With 50 cores, it is 1.4 tasks per day.

Scenario A: if you want to minimize the time to solution, you can assign 200 cores to the first task, another 200 cores to the second task, and the remaining 100 cores to the third task. In 24 hours, the first partition of 200 cores will solve 4 tasks, and so will the second partition of 200 cores. The last partition of 100 cores will solve 2.5 tasks. The total throughput is 10.5 tasks per day. The minimal time to solution is obtained on the first and second partitions and equals 6 hours (because 4 tasks are solved in 24 hours).

Scenario B: if you want to maximize the throughput and are not interested in obtaining the results of each individual problem as soon as possible, you can assign 10 tasks to 10 partitions, each consisting of 50 cores. In a day, these 10 partitions will solve 14 tasks (which is 33% more than the 10.5 of Scenario A). The time to solution of each individual task is 24/1.4 ≈ 17.1 hours.
This example highlights that you have to choose the number of cores per task carefully, depending on your requirements.
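The arithmetic of both scenarios can be reproduced with a few lines of Python; the per-partition ratings used below (4, 2.5 and 1.4 tasks per day for 200, 100 and 50 cores) are simply the figures quoted in the example.

rating = {200: 4.0, 100: 2.5, 50: 1.4}  # tasks per day, as quoted above

# Scenario A: minimize time to solution with partitions of 200 + 200 + 100 cores
scenario_a = [200, 200, 100]
throughput_a = sum(rating[p] for p in scenario_a)        # 10.5 tasks per day
fastest_run_a = 24 / max(rating[p] for p in scenario_a)  # 6 hours per task

# Scenario B: maximize throughput with ten partitions of 50 cores each
scenario_b = [50] * 10
throughput_b = sum(rating[p] for p in scenario_b)        # 14.0 tasks per day
run_time_b = 24 / rating[50]                             # about 17.1 hours per task

print(throughput_a, fastest_run_a, throughput_b, round(run_time_b, 1))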
Model Limitations
This model was inferred from scarce and incomplete measurement data. Moreover, it uses a simple linear approximation for parallel efficiency. Therefore, we have to describe the model’s limitations:
1. The best agreement between the model and measurements is observed starting from 192 cores. For fewer than 192 cores, the model's deviation from the measured results is within ±15%.
2. Due to the lack of data, for 1 to 192 cores the performance for InfiniBand and 10GigE networks is determined by the same formula.
3. Measurement data were available only for Intel Xeon 5600 series CPUs. For other architectures, the performance may be either higher or lower. A clock speed of 3.47 GHz is used by default, and performance is assumed to scale linearly with clock speed.
4. Starting with 3072 cores for InfiniBand and 384 cores for 10GigE, the model automatically recognizes the need for "throughput computing" mode. It partitions the cores using a greedy algorithm; for example, it would divide 10,000 cores into 3072*3+784 cores and calculate the total cluster performance as the sum of the performances of the individual partitions (see the sketch after this list).
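Below is a minimal Python sketch of this greedy partitioning, assuming a hypothetical performance_of(cores, network) function that implements the direct model; it is an illustration of the idea rather than the tool's actual code.

def throughput_rating(total_cores, network, performance_of):
    # Carve off as many full-size partitions as possible, then put the
    # remaining cores into one last, smaller partition.
    limit = 3072 if network == "infiniband" else 384
    partitions = [limit] * (total_cores // limit)
    remainder = total_cores % limit
    if remainder:
        partitions.append(remainder)
    # Example: 10,000 InfiniBand cores -> partitions [3072, 3072, 3072, 784].
    # The total cluster performance is the sum over all partitions.
    return sum(performance_of(p, network) for p in partitions)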
Using the Model
To solve the direct performance modelling problem, fill in the number of cores and choose the network type. In the output you will receive the performance of the cluster.
To solve the inverse problem (the most interesting one, used to design clusters), clear the field with the number of cores and instead fill in the desired performance, in tasks solved per day. For example, a performance rating of 288 corresponds to a time to solution of 5 minutes. Then choose the network type. In the output you will find the number of cores that your cluster must have in order to attain the requested performance. For instance, with InfiniBand you would need 312 cores, while with 10GigE this grows to 374 cores.
Automated Queries
You can also programmatically query the model in non-interactive mode and obtain machine-readable output. See the help link at the bottom of the main window.
Questions? Comments?
Want to host this tool on your own computer for local use? Proceed to the “Download” section. Or get in touch!