My friend is working on a research project dedicated to many-core architectures, such as NVIDIA’s GPUs or Intel’s Xeon Phi, that have lots of simple cores best suited to straightforward computations.
Sometimes those cores need to communicate with neighbours that are located far away, perhaps on another accelerator board in the same compute node, or even in a different compute node. And the cores are often so simple that they are poorly suited to handling MPI communication themselves.
Hence, the key idea of the project is to offload MPI communications to the CPU of the compute node. The research team is now working on an MPI implementation that would facilitate this offloading.
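To make the offloading idea a bit more concrete, here is a toy model in Python. It is only a sketch and has nothing to do with the team's actual implementation: threads stand in for accelerator cores, a queue stands in for the link between the accelerator and the host, and `network_send` is a hypothetical placeholder for a real MPI send. All names are made up for illustration.

```python
import queue
import threading

# Hypothetical names throughout: this is a toy model, not a real MPI API.
_STOP = object()  # sentinel that tells the proxy to shut down


def host_proxy(outbox, network_send):
    """Runs on the host CPU: drains requests queued by the simple cores
    and performs the actual communication on their behalf."""
    while True:
        request = outbox.get()
        if request is _STOP:
            break
        dest, payload = request
        network_send(dest, payload)  # stands in for a real MPI send


def accelerator_core(core_id, outbox):
    """Runs on a simple core: computes, then hands the result to the host
    instead of calling the communication library itself."""
    result = core_id * core_id  # placeholder for the real computation
    outbox.put((f"node1/core{core_id}", result))


def run_demo(num_cores=4):
    outbox = queue.Queue()
    delivered = []  # records what the "network" received

    def fake_send(dest, payload):
        delivered.append((dest, payload))

    proxy = threading.Thread(target=host_proxy, args=(outbox, fake_send))
    proxy.start()

    cores = [
        threading.Thread(target=accelerator_core, args=(i, outbox))
        for i in range(num_cores)
    ]
    for core in cores:
        core.start()
    for core in cores:
        core.join()

    outbox.put(_STOP)  # all cores are done, so the proxy may exit
    proxy.join()
    return sorted(delivered)


if __name__ == "__main__":
    for dest, payload in run_demo():
        print(dest, payload)
```

The point of the design is that the simple cores only ever do a cheap enqueue, while everything that looks like a communication library lives on the general-purpose CPU.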
However, in this situation scientists have to adapt to existing hardware. If the offloading idea proves to be worth the trouble, I would suggest a few steps that vendors could take to make life easier for scientists:
1. Equip the accelerator board with a general-purpose CPU that would handle communication. (It’s much like IBM Roadrunner, where AMD Opteron CPUs were helping IBM PowerXCell CPUs by “feeding the Cells with useful data”, as Wikipedia terms it.) Currently, the CPU is located on the motherboard and communicates with accelerator boards via PCIe. The suggestion is to put the CPU as close as possible to the accelerator: on the same board, or even on the same die as the accelerator, i.e., the GPU.
2. Add an InfiniBand interface (or two) to the accelerator board, and it becomes an independent computer in its own right. It will no longer need to be plugged into a PCIe slot somewhere. It can simply be connected to the InfiniBand fabric.
3. (Optional) Now that the CPU, GPU and memory chips are all on the same board, you can cover them with a heatsink (a plate with water channels inside) and use liquid cooling to remove heat. That will produce a very compact computing unit, allowing dense installations.
4. Moving even further: dense installations with water cooling are likely to face a problem. How do you extract a failed module to replace it with a new one, given that you need to disconnect it from the water pipes first? The solution was proposed elsewhere (and in a slightly different context), and in short it is “Don’t replace it”. Disconnect the failed module from the power supply and disregard it when scheduling compute jobs. Given the current reliability of electronic components, repairs may be unprofitable: perhaps your hardware will become obsolete before 10% of its modules fail.
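The “Don’t replace it” policy from the last step can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not a model of any real scheduler: the function names are hypothetical, failures are assumed to be independent with a constant annual rate, and the 2% rate used in the example is a made-up number rather than measured data.

```python
import math


def schedulable(modules, failed):
    """Return the modules a scheduler may still use; failed ones are
    simply skipped rather than repaired."""
    return [m for m in modules if m not in failed]


def fraction_failed(annual_failure_rate, years):
    """Expected fraction of modules that have failed after `years`,
    assuming independent failures at a constant annual rate."""
    return 1.0 - (1.0 - annual_failure_rate) ** years


def years_until_fraction_fails(annual_failure_rate, threshold=0.10):
    """Time until the expected failed fraction reaches `threshold`
    under the same assumption."""
    return math.log(1.0 - threshold) / math.log(1.0 - annual_failure_rate)


if __name__ == "__main__":
    modules = [f"module{i}" for i in range(8)]
    failed = {"module3", "module6"}  # powered off and left in place
    print(schedulable(modules, failed))

    afr = 0.02  # made-up annual failure rate, for illustration only
    print(f"failed after 5 years: {fraction_failed(afr, 5):.1%}")
    print(f"years to reach 10%: {years_until_fraction_fails(afr):.1f}")
```

With that made-up 2% rate, the 10% mark is reached after a little over five years. Whether the hardware is already obsolete by then depends entirely on the real failure rate and the planned lifetime of the installation.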
Posted in the category of “[Crazy] Ideas”.
Update: One year after this blog post appeared we came up with a much more interesting proposal!