Marrying NVIDIA Tesla and InfiniBand?

My friend is working on a research project dedicated to many-core architectures, such as NVIDIA’s GPUs or Intel’s Xeon Phi, that have lots of simple cores best suited to straightforward computations.

Sometimes those cores need to communicate with logical neighbours that are physically far away: perhaps on another accelerator board in the same compute node, or even in a different compute node. And the cores are often so simple that they are ill-suited to handling MPI communications themselves.

NVIDIA Tesla GPU accelerator. If only these boards had their own CPU that could boot an operating system, how many of them could you pack into a standard 42U rack? Image by: Ray Sonho. Source: Wikipedia

Hence, the key idea of the project is to offload MPI communications to the CPU of the compute node. The research team is now working on an MPI implementation that facilitates this offloading.
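To make the offloading idea more concrete, here is a minimal sketch in plain C with MPI and pthreads. It is my own illustration, not the team’s implementation: the one-slot request structure, the busy-wait proxy loop and all the names in it are invented for this example. A worker thread stands in for the accelerator and fills a send request; the main thread, which owns MPI, performs the actual communication on its behalf.

```c
/* Illustrative host-side MPI proxy (not the project's real code):
 * a worker thread stands in for the accelerator and fills a send
 * request; the main thread, which owns MPI, services it.
 *
 * Build (typical):  mpicc -pthread proxy_sketch.c -o proxy_sketch
 * Run:              mpirun -np 2 ./proxy_sketch
 */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

typedef struct {
    double payload[4];     /* data the "accelerator" wants to send      */
    int    dest;           /* destination MPI rank                      */
    volatile int ready;    /* set by the worker once the request is up  */
} request_t;

static request_t req;      /* one-slot request queue, for brevity */

/* Worker thread: plays the role of the accelerator producing data. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 4; i++)
        req.payload[i] = 3.14 * i;
    req.dest = 1;
    __sync_synchronize();  /* publish the payload before raising the flag */
    req.ready = 1;
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED is enough: only the main (proxy) thread touches MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (nranks < 2) {
        if (rank == 0) printf("Run with at least 2 ranks.\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);

        /* Proxy loop: wait for a request, then issue the MPI call on the
           worker's behalf.  A real implementation would poll a queue of
           requests written by the accelerator over PCIe. */
        while (!req.ready)
            ;  /* busy-wait; fine for a sketch */
        MPI_Send(req.payload, 4, MPI_DOUBLE, req.dest, 0, MPI_COMM_WORLD);
        pthread_join(tid, NULL);
    } else if (rank == 1) {
        double buf[4];
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got %.2f %.2f %.2f %.2f via the host proxy\n",
               buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```

In a real system the request queue would live in host memory written by the accelerator over PCIe, and the proxy would service many requests from many cores, but the division of labour is the same: simple cores describe what to send, and the host CPU talks to MPI.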

However, in this situation the scientists have to adapt to existing hardware. If the offloading idea proves worth the trouble, here are a few steps that vendors could take to make life easier for them:

1. Equip the accelerator board with a general-purpose CPU that would handle communication. (This is much like IBM Roadrunner, where AMD Opteron CPUs helped the IBM PowerXCell processors by “feeding the Cells with useful data”, as Wikipedia puts it.) Currently, the CPU sits on the motherboard and communicates with accelerator boards via PCIe. The suggestion is to put the CPU as close as possible to the accelerator: on the same board, or even integrated into the same die as the GPU.

2. Add an InfiniBand interface (or two) to the accelerator board, and it becomes an independent computer in its own right. It will no longer need to be plugged into a PCIe slot somewhere; it can simply be connected to the InfiniBand fabric.

3. (Optional) Now that the CPU, GPU and memory chips are all on the same board, you can cover them with a heatsink (a plate with water channels inside) and use liquid cooling to remove the heat. That produces a very compact computing unit and allows dense installations.

4. Moving even further: dense water-cooled installations are likely to face a problem. How do you extract a failed module to replace it with a new one, when you first have to disconnect it from the water pipes? A solution was proposed elsewhere (in a slightly different context), and in short it is “Don’t replace it”. Disconnect the failed module from the power supply and ignore it when scheduling compute jobs. Given the current reliability of electronic components, repairs may simply not pay off: your hardware will likely become obsolete before 10% of its modules have failed (see the back-of-the-envelope sketch below).
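To put a rough number behind that last claim, here is a back-of-the-envelope calculation. The exponential failure model and the 50-year per-module MTBF are purely illustrative assumptions, not measured data; under them, fewer than 10% of modules fail within a five-year service life.

```c
/* Back-of-the-envelope check of the "don't repair" argument above.
 * Assumes a simple exponential failure model and an illustrative
 * per-module MTBF of 50 years; both numbers are assumptions.
 *
 * Build:  cc failures.c -lm -o failures && ./failures
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double mtbf_years     = 50.0;  /* assumed mean time between failures  */
    const double lifetime_years =  5.0;  /* years until the hardware is obsolete */

    /* With exponential failures, the expected fraction of modules that
       die within t years is 1 - exp(-t / MTBF). */
    double failed = 1.0 - exp(-lifetime_years / mtbf_years);

    printf("Expected failed modules after %.0f years: %.1f%%\n",
           lifetime_years, 100.0 * failed);  /* prints about 9.5% */
    return 0;
}
```

If the expected failed fraction stays below the spare capacity you budget for, powering off dead modules is likely cheaper than plumbing them out of the water loop.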

Posted in the category of “[Crazy] Ideas”.

Update: One year after this blog post appeared we came up with a much more interesting proposal!


2 Responses to Marrying NVIDIA Tesla and InfiniBand?

  1. About point 2, I don’t understand: even if the GPU has an on-board IB die, won’t it still communicate over “hard-wired” PCI-E? They could certainly share the super-fast GDDR5.
    And GPUDirect is there to ease things up. On the other hand, I really think nVidia and AMD are heading towards CPU-less GPU operation; both have released plans for integrating ARM cores. And honestly, from my own experience building GPU-centric HPC apps, the CPU’s role is diminishing; we could really save some room and power by having an OS running from within the GPU.

    • Konstantin S. Solnushkin says:

      Hi, Mohamed,

      Let me start with the end of your comment :) Yes, the idea of integrating the InfiniBand part on the GPU is all about saving space. In a typical GPU-equipped server, we have the motherboard whose main purpose is just to provide network access to the GPUs installed in the server. It doesn’t do much else.

      So the proposal here is to get rid of the motherboard. Just equip each GPU with its own InfiniBand interface, and you get a denser system. Moreover, in a typical server all GPUs have to share one network connection, which limits bandwidth, whereas with this proposal each GPU gets its own connection.

      I am glad to hear that nVidia and AMD have released plans for integrating ARM cores; I didn’t know about that! However, given the history of GPUs, I can’t understand why we still don’t have such products :) Think of it: a dual-purpose GPU card with an Ethernet interface. You plug it into your server, and it works like a regular GPU. But if you connect it to an Ethernet switch and reboot, it acts like a small but powerful computer in its own right, using the server it is plugged into merely as a source of electrical power.

      It sounds simple to implement (add an ARM core and an Ethernet block to the GPU die), so why haven’t they released such a product? It seems the software ecosystem has to be ready first, or the market will perceive it as a product that failed.

      Going back to the beginning of your question:

      About point 2, I don’t understand: even if the GPU has an on-board IB die, won’t it still communicate over “hard-wired” PCI-E?

      I guess so: if there are two chips on a printed circuit board (PCB), they will communicate via the PCI-E protocol, with the physical medium being traces on the PCB. If both blocks, the GPU and the InfiniBand interface, are implemented on the same die, they can communicate via some type of network-on-chip.
