The Pipes has been an absorbing project. The design seems competitive and the estimates line up with real chips. It was a surprise to find that the silicon seems heat limited, even though I'm not using design rules that map to the densest products now possible.
20 years ago Dennard scaling came to an end when CMOS chips got too hot to clock faster. In response, transistors did get smaller, voltages got lower (a squared benefit), and control over leakage current was substantially solved by FinFETs. Clock rates declined moderately and designs diversified, with many functional units specialized for particular tasks which didn't all run at the same time. The circuits which were not running got called dark silicon and are common in products like cell phones, watches, and even CPUs. This is not just that they're idle, like your cell phone in your pocket all day; even during use most of the function blocks run briefly and then go to stand-by. This is great! This is why we have a thriving electronics industry producing some amazing products.
Blockchain mining is an outlier compared to other silicon. There is no dark silicon. Every element of the pipeline is being used on every clock cycle. Not just the pipeline but the inputs and outputs too. Nothing idles or pauses. Even with FinFETs it looks like we're reaching heat density limits around 1.5 GHz with the equivalent of an Intel 4 process running at its minimum voltage.
However, blockchain is not alone in being like that. AI looks like it will follow much the same pattern. All gates run all the time. Those big tensors which you hear so much about are being calculated constantly, and the machines are so expensive that when you're not asking a question the machine is answering somebody else's question. The models are getting bigger and the arithmetic is getting vaster and nothing is dark. So how are those guys doing on energy density? And what is their future?
The arithmetic units in vector/array/tensor processors, the main functional parts of a GPU, are all densely designed and can be assumed to be near optimal layout. Multipliers and accumulators are naturally grids which talk to their nearest neighbors, with just the occasional carry propagation to worry about. I've used the same methods as for the blockchain pipe to design minimal energy tensor units, and I come up with about 15 fJ per Int8 MAC - just slightly better than claims for actual GPUs. The difference is probably because those GPUs have to do other things, like move data around before and after the arithmetic. They're pretty close to the energy density limit.
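As a rough sanity check on why that matters, here is the conversion from energy per operation to heat flux for one MAC array that never idles. The 15 fJ and 1.5 GHz come from the estimates above; the array size and silicon area are illustrative assumptions, not figures for any real product.

```python
# Rough feel for why ~15 fJ per MAC is still a heat problem when nothing is dark.
# The fJ/MAC and clock come from the estimates above; the array size and area
# are illustrative assumptions, not measurements of any product.
E_MAC_J  = 15e-15        # energy per Int8 MAC (estimate above)
F_CLK_HZ = 1.5e9         # heat-limited clock estimate above
N_MACS   = 256 * 256     # assumed size of one dense MAC array
AREA_MM2 = 2.5           # assumed silicon area for that array

power_w = E_MAC_J * F_CLK_HZ * N_MACS      # every MAC fires on every cycle
print(f"one 256x256 array: ~{power_w:.1f} W over ~{AREA_MM2} mm^2 "
      f"= ~{power_w / AREA_MM2:.2f} W/mm^2")
# That is already comparable to the average heat flux of today's biggest GPUs,
# and a tensor processor tiles many such arrays side by side.
```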
Also, practitioners who use machines like the Nvidia Hopper say that they can't run them at full clock; they're commonly scaled back because they reach their heat limits.
So here we are at the frontier, and those aren't even using the densest process. The brand new Blackwell chip from Nvidia, which has the industry shaking in its boots as I write this, uses only a variant of TSMC's N4 process. That's about two years behind the densest TSMC N3 process. Yet the power intensity of the Blackwell is certainly way up there, with a 72 GPU rack consuming 120 kilowatts. It will be interesting to see whether it runs at full clock or heat limited like Hopper.
What about the future CMOS logic processes? They’re bound to be better. We have a road map to CFET, in maybe 2028, which is about three times as dense as an N4 process. That’s bound to be much better, right?
Well, not so fast! If you look at the literature for CFET performance you see that the voltage will be about the same as RibbonFETs - because they are RibbonFETs. Going from FinFET to RibbonFET doesn't change capacitance much because even though the channel gets shorter, the gate now wraps 4 sides and the 4th side adds capacitance. By the time we get to CFET there may be about a 10% drop in capacitance.
A study of CFET parameters at IMEC
10% is not going to keep pace with the growing AI demand.
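The arithmetic behind that pessimism is just the dynamic switching energy, E = C·V². A minimal sketch, using only the relative factors quoted above:

```python
# Why a ~10% capacitance drop barely moves the needle: dynamic switching energy
# goes as C*V^2, and the CFET roadmap holds voltage roughly flat. Only the
# relative factors quoted above are used here.
def switch_energy(c_rel, v_rel):
    return c_rel * v_rel ** 2          # relative E = C * V^2

today = switch_energy(1.0, 1.0)
cfet  = switch_energy(0.9, 1.0)        # ~10% less capacitance, same voltage
print(f"energy per operation: {100 * cfet / today:.0f}% of today")     # ~90%
# Meanwhile the same roadmap packs ~3x more gates per mm^2, so keeping them all
# busy would mean roughly 2.7x the heat in the same area.
```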
The density design goals for classic chips - things like ARM cores - are not aligned directly to minimum energy per operation. That is because conventional chips continue to benefit from more dark silicon. Well, they will until they too need to run more inference with its dense arithmetic.
You might wonder why we care. Can't we just build more of them, and put bigger heat sinks on? We sure can and we sure will, but the industry growth rate means that we will be constrained by finding enough new power for those data centers. And we certainly aren't going to have much incentive to use expensive new processes with smaller transistors if the work done per second is just the same.
The parameters which are critical for energy - the capacitance and the voltage - only vary slightly if we follow the path of the planned processes. Even though we will have three times the density by the end of the decade, that density would mean we're pumping a lot of heat out and yet must run at ever slower clocks to avoid melting. The energy we spend for each operation is not advancing much because it has not been getting the attention it needs. It has not been a primary KPI.
One precedent for heat limitation is obviously the Pentium 4 era, where CPUs had to have a goal reset away from raw clock rate. However, there is a more recent problem which has many similarities to the upcoming arithmetic density limit. This is SRAM scaling - or rather, not scaling for more than 5 years now. SRAM has been an "ideal" design at the FinFET limit since the 7 nm node, which was the first node that could meet all the physics limits in the SRAM design. You can't reduce the fins - it only uses 1 fin for both N and P. You can't shorten the fins - the transistors will leak if the channels are too short, and N7 lithography met that limit. You can't advance by changing the wiring - it is already an ideal layout. So, while the rest of logic has been getting denser by various tricks, SRAM advances only by very tiny tweaks.
The same thing seems ready to happen with tensor arithmetic - but have we got any wiggle room if we look at device optimizations? We can look at some of the same things that optimized SRAM - fewer fins, and ideal layouts - as well as how to operate at lower voltages. We can also wonder if capacitance can be improved by the process.
There is some hope in going to fewer fins. In the summary for the Digest, we see
that about 2/3rds of the switching energy is due to gate capacitance (logic and clocking gates). This is unusually good - EDA of ordinary functions like a CPU core will generate more wiring - but it may not be far off what happens in the optimal versions of dense arithmetic. So, if we drop from 3 fins to 1 fin the gate load would fall to equal the wiring, an overall reduction of 1/3rd. This would in principle cut the clock rate in half, since the 1 fin delivers 1/3rd the drive into 2/3rds of the load. However, if the device is heat-limited we might actually end up with a faster clock and make better use of a denser process.
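Written out as a sketch, using the assumptions from the paragraph above (the 2/3 gate / 1/3 wire split, gate load falling to match the wiring at 1 fin, and drive proportional to fin count):

```python
# The fin-count arithmetic from the text above. The 2/3 gate / 1/3 wire split and
# the "gate load drops to equal the wiring" outcome are the post's assumptions;
# drive strength is taken as proportional to fin count.
gate_load, wire_load, drive = 2/3, 1/3, 1.0        # today's 3-fin cell

gate_load_1fin = 1/3          # assumed: gate load falls to match the wiring
wire_load_1fin = wire_load    # wiring unchanged
drive_1fin = drive / 3        # one fin delivers one third the current

load_ratio  = (gate_load_1fin + wire_load_1fin) / (gate_load + wire_load)
delay_ratio = ((gate_load_1fin + wire_load_1fin) / drive_1fin) / ((gate_load + wire_load) / drive)

print(f"switched load per operation: {load_ratio:.2f}x  (about a 1/3 energy saving)")
print(f"gate delay: {delay_ratio:.1f}x  (clock roughly halves - unless we were heat-limited anyway)")
```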
There is also a question of whether the wiring load shrinks. If the CFET layout succeeds in being 3x denser we might hope that it has a square root effect on the length of wires, cutting them by 30% or so. Not really that likely, but if the KPI is "reduced fins and reduced wiring" I bet there is a relaxation of layout density which would allow the new wiring to be shorter, with the better routing counted as part of the advanced process.
The EDA tools should be considered part of the technology. They optimize a chip design using goals, one of which is "logical effort" (LE). LE is a good metric for balancing conventional chips but it is not directly a lowest-energy goal. If the EDA tools allowed goal functions to be chosen which weight more heavily towards lowest energy per operation of a functional unit, it would be interesting to see how much EDA could contribute to solving the energy problem.
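To make the distinction concrete, here is a toy buffer-sizing sweep in the logical-effort style: minimizing delay picks a per-stage fanout near 4, while tolerating a slightly slower chain switches noticeably less capacitance. All numbers are illustrative assumptions, not output from any EDA tool.

```python
# Toy example of how the optimization goal changes the answer: drive a large load
# through a chain of buffers, and compare delay against total switched capacitance
# for different per-stage fanouts. Logical-effort-style delay model; all numbers
# are illustrative assumptions.
import math

C_IN, C_LOAD, P_INV = 1.0, 1000.0, 1.0    # input cap, load cap, parasitic delay per stage

def chain(target_fanout):
    n = max(1, round(math.log(C_LOAD / C_IN) / math.log(target_fanout)))
    f = (C_LOAD / C_IN) ** (1.0 / n)                         # actual fanout after rounding
    delay = n * (f + P_INV)                                  # in units of tau
    switched = sum(C_IN * f ** i for i in range(1, n + 1))   # total gate cap switched
    return delay, switched

for fo in (3, 4, 6, 8):
    d, c = chain(fo)
    print(f"fanout ~{fo}: delay {d:5.1f} tau, switched cap {c:6.0f} x C_IN")
# Fanout ~4 is fastest; fanout ~8 is ~33% slower but switches ~17% less capacitance.
```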
Can we get less capacitance per unit length of wire? We can get some improvement by changing the dielectric material, but that search has been going on for decades and its results are already in use in the latest processes. The other way to get less capacitance per unit length is to space the wires further apart. That tends to run up against density, but perhaps some relaxations in density are worth it to get looser wires.
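A very rough sketch of that trade, splitting per-unit-length wire capacitance into a ground component and a sidewall coupling component that scales inversely with the gap to the neighbor. The split and the baseline numbers are illustrative assumptions, not process data.

```python
# Rough model of wire capacitance per unit length: a roughly fixed component to
# the layers above/below plus sidewall coupling that scales inversely with the
# gap to the neighboring wire. Baseline values are illustrative assumptions.
def wire_cap_ff_per_um(gap_nm, c_ground=0.08, c_couple_ref=0.12, ref_gap_nm=24.0):
    return c_ground + c_couple_ref * (ref_gap_nm / gap_nm)

for gap in (24, 36, 48):
    print(f"gap {gap} nm: ~{wire_cap_ff_per_um(gap):.2f} fF/um")
# Doubling the gap cuts the total by ~30% in this model - real, but it costs
# routing density, which is exactly the tension described above.
```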
So, reduced fins and ideal layouts have some good potential. EDA could help. Somewhere between a 1/3rd and 1/2 reduction in capacitance load per switching pair may be possible.
What about voltage? It is interesting that designs with no dark silicon may run at lower voltages than chips which have a lot of idle functions. The reason is that a higher voltage allows transistors to switch off more completely, and it is important for idle circuits not to waste energy on leakage. A blockchain processor might run at 0.35V and leak but it does not matter because the gates operate all the time and leakage is a percent or so. A phone chip might need to operate at 0.45 or 0.5V to be sure that leakage in large amounts of idle silicon is not draining the battery.
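A sketch of that trade-off: leakage power is roughly constant while dynamic power scales with activity, so an always-busy pipeline can tolerate leaky low-voltage devices that a mostly-idle block cannot. The per-gate numbers are illustrative assumptions only.

```python
# Why activity factor decides the voltage trade-off: leakage burns power all the
# time, dynamic power only when gates switch. Per-gate numbers are illustrative
# assumptions, not any real process.
def power_split(vdd, activity, n_gates=1e8, c_gate=0.2e-15, f_clk=1.5e9, i_off=2e-9):
    dynamic = activity * n_gates * c_gate * vdd ** 2 * f_clk   # watts
    leakage = n_gates * i_off * vdd                            # watts, always there
    return dynamic, leakage

for label, activity in (("always-on pipeline", 1.00), ("mostly-idle block ", 0.02)):
    dyn, leak = power_split(0.35, activity)
    print(f"{label}: dynamic {dyn:5.2f} W, leakage {leak:4.2f} W "
          f"-> leakage is {100 * leak / (dyn + leak):.0f}% of the total")
# The same leaky 0.35V devices are a rounding error in the busy pipeline and
# roughly half the power in the idle block - hence the phone's higher voltage.
```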
Dense arithmetic is probably more like the blockchain and should be running - and leaking - at low voltage. It should be in power domains which can run lower than the more general-purpose parts of the GPU. Arithmetic units can be switched off when not in use. For example, if an inference model only needs FP4/6/8 dense arithmetic, most of the FP32 or even FP16 units might be powered down.
To get below 0.35V we will need to chill. As in liquid nitrogen. The IMEC/ARM exploration published in 2022
Cryo-Computing for Infrastructure
explores the potential for low voltage switching. Consider Fig 3 (ii)
The much steeper curves at 100K mean we can make low voltage transistors that switch their currents on and off over much less voltage than in normal GPUs (whose junctions run close to 400K, not 298K). A cryocomputer with a 100K junction temperature and optimized CMOS could use 10x less energy than a hot GPU today at "normal" temperatures. Now that is really a leap!
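The physics behind those steeper curves is the subthreshold swing, bounded by (kT/q)·ln(10) per decade of current. A minimal sketch of the ideal limit (real devices add a non-ideality factor, and the target on/off ratio here is an assumption):

```python
# Why colder junctions allow lower voltage: the subthreshold swing is bounded by
# (kT/q)*ln(10) per decade of current, so the voltage needed for a given on/off
# ratio shrinks roughly linearly with temperature. Ideal limit only; real
# devices are somewhat worse.
import math

K_OVER_Q = 8.617e-5          # Boltzmann constant / electron charge, volts per kelvin
DECADES  = 5                 # assumed target on/off current ratio of 10^5

for temp_k in (400, 300, 100):
    swing_mv = K_OVER_Q * temp_k * math.log(10) * 1e3      # mV per decade
    print(f"{temp_k:>3} K: ~{swing_mv:3.0f} mV/decade, ~{swing_mv * DECADES / 1e3:.2f} V "
          f"for {DECADES} decades of on/off ratio")
# With E ~ C*V^2, a roughly 3x lower usable voltage already gets close to the
# 10x energy claim above, before counting any capacitance improvements.
```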
Ah, but you need to cool it. Darn, yep, you need about 4x as much power to cool it as to run it, so you only gain double. But there is a twist - cold can be stored, just like heat. You can make a "cold battery" for renewable energy with liquid nitrogen:
That 4x cooling energy can be obtained from cheap and friendly renewables and stored for days. Combined with weather prediction and multiple sites, supercomputing could run almost entirely off the regular grid.
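Putting the two rough numbers together (10x less switching energy, cooling at about 4x the compute power), the back-of-envelope looks like this; treating the cooling share as something you can make from stored liquid nitrogen is what turns a 2x win into a much bigger one on the grid side.

```python
# Back-of-envelope on the net win, using the two rough figures above as assumptions:
# cryo logic at ~10x less energy per operation, and a cryocooler drawing ~4x the
# compute power to hold 100 K.
hot_power    = 1.0                        # baseline hot-GPU power for some workload
cryo_compute = hot_power / 10.0           # same work, 10x less energy per op
cryo_cooling = 4.0 * cryo_compute         # cooling overhead
print(f"total power ratio: {hot_power / (cryo_compute + cryo_cooling):.0f}x better")   # ~2x
# If the cooling share comes from liquid nitrogen made earlier with cheap
# renewables, the draw at compute time is just cryo_compute:
print(f"grid draw at compute time: {hot_power / cryo_compute:.0f}x better")            # ~10x
```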
There are other benefits of cryogenic operation. Wires have lower resistance, which makes routing easier. At these low temperatures you can make a DRAM that has almost no leakage - glacial refresh times - so the performance penalty relative to SRAM disappears, the DRAM can be made with a logic-compatible process, and DRAM remains much denser than SRAM. That has been an impossible dream at hot-computer temperatures.
The outlook
I do think there is plenty of opportunity for circuit processes to be tweaked to be more energy aware, and for EDA tools to be redirected to save energy per operation over the traditional goals. I also believe that cryogenic computing will become practical by the end of the decade, for a much greener computing world. It's just too good to ignore, and unlike other computing paradigms nothing logical really changes; we simply waste less energy for the same results. Feasibility and compatibility are not in doubt. The chips will be made in fabs that are very similar to the fabs of today. Compute in memory will be real, not a hot mirage. This is going to be a wonderful challenge.
When you visit the 2030 data center you'll need a parka in the middle of a desert. Don’t touch anything frosty. DON’T lick it (I grew up in Ontario, that happens)!
Next Week
See you in a week or two for the start of my next series, which will be on a novel microprojector for efficient, lightweight, high resolution augmented reality!