Recently, industry analyst John Barr wrote at the ISC blog about “Potholes on the Road to Exascale”. John argues for a unified programming environment that would be able to support all sorts of future computing devices. That’s right: we need one. But we will only get to use it if we can actually build those future devices, and the cost of exascale is shaping up to be prohibitively high.
In my previous blog post, “Exascale Supercomputers: Anything but Cheap”, I calculated that an exascale computer, if built with today’s components, would cost around $6.6 billion to build and operate. That’s perhaps the biggest “pothole”, “obstacle” or “roadblock” on the road to exascale, whichever metaphor you prefer.
But that $6.6 billion figure was based on the assumption that we take a compute node from Tianhe-2: two server-grade CPUs and three accelerator cards. The accelerators do the floating-point intensive work, the CPUs mainly feed data into the accelerators and collect the results, and the motherboard physically connects the network adapter to the rest of the system.
What would be the benefit of getting rid of the CPUs and the network adapter and shifting their functionality directly onto the accelerators? Significant savings!
In current cluster supercomputers, accelerator boards (be they GPUs or Intel Xeon Phi cards) have to be plugged into a motherboard together with ordinary CPUs and a network adapter, such as InfiniBand. The accelerator chip is a powerful computer by itself, and yet it requires CPUs to get access to external storage and a network adapter to communicate with other compute nodes. It would be best if we could integrate the required hardware right on the accelerator chip; this would let us throw away a lot of redundant hardware, making computers simpler, leaner, more compact and more reliable, and in the end cheaper.
It is certainly possible to move a lot of functionality onto the chip. Calxeda integrated onto their chips (see image above) a DDR3 memory controller, a SATA controller for local storage, and a 10 Gbit Ethernet switch, the latter giving the chip 10 Gbit Ethernet connectivity to neighbouring chips and to the outside world. A manufacturer of future accelerator cards could do the same, and the resulting device would need neither a motherboard to plug into nor a separate network adapter.
In fact, integrating a network adapter is a big thing. IBM has been enjoying the benefits of this approach in all their Blue Gene chips produced to date: L, P, and Q. Imagine now an accelerator card — Nvidia Tesla or Intel Xeon Phi — that doesn’t need to be plugged into any motherboard, has its own network connector and can be connected directly to a switch. Isn’t this the bright future? And Intel has been preparing for this.
In 2011, Intel acquired Fulcrum Microsystems, a company that designed 10 Gbit and 40 Gbit Ethernet switch chips. In 2012, it acquired QLogic’s InfiniBand assets, and then bought Cray’s interconnect assets. It’s high time for Intel to offer some sort of connectivity, be it Ethernet, InfiniBand or something else, right on their chips.
So, what would a modern cluster supercomputer look like if it were built with accelerator cards, each with its own network port? There would be no need for motherboards, server CPUs or InfiniBand adapters. Each accelerator card would be an independent compute node. These cards would be very tightly packed in racks, possibly with local storage if needed (remember, you can integrate a SATA or similar controller on the chip and then connect solid-state drives). Cards could be placed into chassis that provide connectivity to a switch via a backplane. And they would be cooled with some variation of liquid cooling, such as the hot-water cooling used in IBM’s Aquasar system, or two-phase immersion cooling.
Of course, since the accelerator would now perform I/O operations itself (such as talking to a file system), you would need the corresponding blocks on the chip, too. For Nvidia Tesla, you would have to add some general-purpose cores (Nvidia is already adding ARM-based cores to its Tegra chips, but those are still marketed as mobile solutions, not HPC ones). For Intel Xeon Phi, you might not need any changes at all, because its cores are general-purpose by design and can run the Linux kernel.
How much can we save if we thus throw away CPUs, the motherboard and the InfiniBand adapter, compared to the $6.6 billion calculation from the previous blog post? Let us count.
First, let’s be generous and assume that the cost and power consumption of the accelerator board (we took the Intel Xeon Phi 3120P as an example) do not increase from integrating the InfiniBand controller onto the chip. If it still provides 1 TFLOPS of peak performance with a TDP of 300 W and a price of $1,695, we will need one million such cards to reach 1 ExaFLOPS (EFLOPS). That would cost us $1.7 billion and consume 300 MW of power.
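If you want to check the arithmetic, here is this step as a few lines of Python (just a sketch; the constants are the figures quoted above):

```python
# Accelerator cards: Intel Xeon Phi 3120P as the reference part
CARD_TFLOPS = 1.0        # peak performance per card, TFLOPS
CARD_TDP_W = 300         # thermal design power per card, W
CARD_PRICE_USD = 1695    # price per card, USD

cards = round(1_000_000 / CARD_TFLOPS)    # 1 EFLOPS = 1,000,000 TFLOPS
print(cards * CARD_PRICE_USD / 1e9)       # 1.695 -> ~$1.7 billion
print(cards * CARD_TDP_W / 1e6)           # 300.0 -> 300 MW
```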
Now we can design the interconnection network. Let’s assume a torus network built with 36-port switches; that’s the most straightforward way to connect our chips together. In this case, no matter how many dimensions the torus has, 18 ports of each switch connect to compute nodes and the other 18 to neighbouring switches. To connect one million nodes, we will therefore need roughly 56,000 switches. With a single switch costing $11,000 and consuming 200 W, the whole network will cost $616 million and consume about 11 MW.
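The same step in Python (again a sketch, using the per-switch price and power quoted above):

```python
import math

# 36-port switches: 18 ports face compute nodes, 18 face neighbouring switches
nodes = 1_000_000
nodes_per_switch = 36 // 2                       # 18

switches = math.ceil(nodes / nodes_per_switch)   # 55,556; round up to ~56,000
print(56_000 * 11_000 / 1e6)                     # 616.0 -> $616 million
print(56_000 * 200 / 1e6)                        # 11.2  -> about 11 MW
```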
The total power consumption is 311 MW. As we remember, each year has about 8,760 hours, and with an electricity price of $0.13 per kW·h, the cost of 1 MW·year is $1.14M. So our electricity costs are $1.14M/(MW·year) × 4 years × 311 MW = $1.42 billion.
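Or, as code:

```python
# Electricity over four years of operation
hours_per_year = 8760
usd_per_kwh = 0.13
usd_per_mw_year = hours_per_year * 1000 * usd_per_kwh  # 1,138,800 -> ~$1.14M

power_mw = 300 + 11                                    # cards + network
print(usd_per_mw_year * 4 * power_mw / 1e9)            # ~1.42 -> $1.42 billion
```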
Capital costs are $1.7 billion for the computing hardware and $0.62 billion for the network, for a total of $2.32 billion. Operating costs here are the electricity costs, $1.42 billion. The total cost of ownership over 4 years is 2.32 + 1.42 = 3.74 billion dollars. This is a 43% reduction compared to the previous $6.6 billion estimate, where CPUs, motherboards and network adapters were used. Power consumption is 311 MW, compared to 423 MW in the previous scenario, a 26% reduction.
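Putting it all together (a sketch using the rounded figures from the text):

```python
# Totals over four years
capital_usd = 1.70e9 + 0.62e9              # cards + network = $2.32 billion
electricity_usd = 1.42e9
tco_usd = capital_usd + electricity_usd    # $3.74 billion

print(1 - tco_usd / 6.6e9)                 # ~0.43 -> a 43% cost reduction
print(1 - 311 / 423)                       # ~0.26 -> a 26% power reduction
```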
So, throwing out unnecessary hardware and integrating the required functionality onto the chip makes the exascale future a little bit closer, and exascale supercomputers a little bit more accessible.
Thanks for reading this far, and next time we’ll see how we can further reduce costs.