Let’s look again at the overall schematic for an SRAM array, where we can see some of the main supporting elements.
If we can make a fair estimate of the size of the supporting elements, we can add their fractional overhead to the cell and arrive at the effective size per bit. We will assume the cell itself is the dual-port design from last week’s post, where each port can either read or write, and that the cell in Mock 4 scaling is 36,000 nm^2.
[Schematic: SRAM array with data input drivers, address decode, and enable drivers at the edges]
The address decoders run across the top, since each word of data is oriented vertically. The decoder enables just one column: a binary address is broadcast to all the columns, and each column’s matching logic responds only when its own address matches.
The address-matching logic is too large to fit within the 100 nm width of a data cell, so we group sets of 4 columns to give a 400 nm width and share some of the address-matching logic to make better use of the space.
The top row of logic in the schematic matches the shared upper four bits of the address, using a pass-transistor approach which outputs a low voltage only if those address bits match (in this example matching 1 1 0 1 x x for selecting from 64 words). Each of these pass gates fits the same width as one cell. The result is then gated, to allow time for the address gates to settle, and the signal flows to an inverter that boosts it into 4 parallel selections on the 2 low bits of the address. Only one will match and pass the signal to a word enable; each inverting booster drives one of the word enables for a column in the data array.
The logic now fits within a 400 nm width and 6 rows of height, which with Mock 4 rules and 3-fin transistors is about 1.5 microns tall.
This is duplicated for the two ports on the SRAM array, making a total of 3 microns. If the word is 72 bits (64 bits plus 8 for ECC) and the cell height is 360 nm, then the array is about 26 microns high, for an overhead of a little more than 11%.
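For anyone who likes to see the decode written out, here is a small Python sketch of the two-stage selection described above. It models the logic only, not the pass-transistor circuit, and the function name and structure are mine, for illustration.

```python
# Rough behavioral sketch of the two-stage column decode described above.
# This models the logic, not the circuit; names are illustrative only.

def word_enables(address, group_high_bits, num_low_bits=2):
    """Return the 4 word-enable outputs for one group of columns.

    address         -- full binary address of the word to select (6 bits here)
    group_high_bits -- the upper address bits this group is hardwired to match
                       (the '1 1 0 1' in the 1 1 0 1 x x example)
    num_low_bits    -- low bits decoded inside the group (2 bits -> 4 columns)
    """
    high = address >> num_low_bits
    low = address & ((1 << num_low_bits) - 1)

    group_match = (high == group_high_bits)        # first stage: shared match
    return [group_match and (low == col)           # second stage: 1-of-4 select
            for col in range(1 << num_low_bits)]

# Example: address 0b110110 in a 64-word array.
# Only the group hardwired to 0b1101 fires, and within it only column 2.
for group in range(16):
    enables = word_enables(0b110110, group)
    if any(enables):
        print(f"group {group:2d} (high bits {group:04b}): enables = {enables}")
```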
Word Enable Re-drivers
The word enables will need to be re-driven as they pass down the array. Each re-driver must be non-inverting, which means two inverters in series; fitting them within the 100 nm width without sharing an output edge requires a separate row for each inverter.
That means 4 rows between groups, about 1 micron tall if they follow 3-fin Mock 4 rules. We may be able to build a word of 72 bits without needing a re-driver at all; if one is needed for every 72 bits, it adds about 4% to the area.
Data Input (DQ) Drivers
The data signals are driven in from the left of the schematic.
When write is not enabled, the driver is held in a high-impedance state with no current flowing, so the write drivers do not interfere when the port is being read. The data to be written should be set up and settled before write is enabled. The column drivers should be sequenced after the write enable, giving the data time to settle across the array. The column drivers are pulsed briefly, and then both the column and write enables are removed, leaving the new data settled into the SRAM cells of that column.
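Written as an ordered sequence, the write looks something like the sketch below. This is only an illustration of the ordering just described; the step descriptions are mine, and the real timing would come from the settling analysis of the array.

```python
# Illustrative ordering of a write to one column; no real timing values here.
write_sequence = [
    ("set up data on DQ/~DQ",           "data to be written must be settled first"),
    ("assert write enable",             "takes the write drivers out of high impedance"),
    ("pulse the column enable briefly", "sequenced after write enable, once data has settled"),
    ("remove column and write enables", "new data is left settled in the cells of that column"),
]

for step, (action, note) in enumerate(write_sequence, start=1):
    print(f"{step}. {action:34s} -- {note}")
```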
The Data Input Drivers will be 350 nm wide, and there will be 2 of them, one set for each port. There will also be a boundary for the transition between the logic and the SRAM fin and power patterns, making the total likely 800 nm. With 64 words the array is 6.4 microns wide, so this adds about 13% to the area.
DQ Re-Drivers and Partitioning
As the data propagates from left to right it is both partitioned and boosted at boundaries between cell groups. The isolation gives each section shorter DQ wires, which float when a cell is being read. The cell can then drive the short floating line with enough margin that it is not flipped by charge remaining on the DQ line when its port is opened for reading.
The re-driver, like the driver, is in a high impedance state when not being used. The difference between driver and re-driver is that the re-driver input is a DQ/~DQ complementary pair. Swapping the wiring allows the re-driver to avoid inverting the signal logically, even though there is just one inversion on each path.
The re-driver can be built using the fin and power pattern inside the SRAM array. The transistors will be 2-fin rather than 3-fin, but that is strong enough. A re-driver pair might be used for every 64 columns (data words); circuit discussions in the literature indicate that hundreds of columns might be possible between boosts, which in a 64-word design would mean no re-drivers at all. The limit is likely to be the noise margin of long floating DQ wires, which could disturb cells as they open for reading.
A pair of re-drivers, one for each port, will take up 0.7 microns of horizontal space in the Mock 4 layout rules, so there is a strong incentive to trade off some performance in return for longer distances between re-drivers. 64 columns of SRAM is only 6.4 microns wide, so a pair every 64 columns implies an overhead of about 11%.
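The partitioning trade-off is easy to see with a couple of lines of arithmetic. The sketch below uses the Mock 4 numbers from above; the function name is mine.

```python
# Area overhead of DQ re-drivers vs. segment length, using Mock 4 numbers from above.
CELL_WIDTH_NM    = 100   # width of one SRAM column
REDRIVER_PAIR_NM = 700   # one re-driver per port, both ports together

def dq_redriver_overhead(columns_per_segment):
    """Fractional area added by one re-driver pair per segment of this many columns."""
    segment_width_nm = columns_per_segment * CELL_WIDTH_NM
    return REDRIVER_PAIR_NM / segment_width_nm

for cols in (64, 128, 256):
    print(f"{cols:3d} columns per segment -> {dq_redriver_overhead(cols):.1%} overhead")
# A re-driver pair every 64 columns (only 6.4 microns) costs about 11%.
```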
Data Output
In this design the output is a latch, a D Flip Flop with the DQ/~DQ pair as input.
The enable pulse gates the input from the DQ wires, and the result is stored in the feedback pair, giving a stable output after the enable pulse is complete. The enable is timed to wait until the DQ wires should have settled after the word enables open (with the Data Input drivers left disabled).
This circuit will be 600 nm wide in Mock 4 rules, and it comes with a pattern shift from the SRAM fin and power pattern to the logic pattern, so figure 700 nm total, for about 11% overhead relative to 64 columns.
Adding all the Overheads
Let’s assume we build a 64 word x 72 bit SRAM, which is about as small as you will ever see for a dual-port SRAM. Smaller-capacity arrays are likely to be register sets, which are optimized more for speed and function than for size. Let’s say no re-drivers are needed at this minimal size, so the overheads are:
11% for column addressing and drivers
13% for data input enabling and drivers
11% for data output
These add 35% to the effective size of the cells, for an average of around 50,000 nm^2 per bit. This is in a process where the nominal best SRAM cell size was 24,000 nm^2; it jumped to 36,000 nm^2 to add a second port, and the rest comes from the edge circuits.
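As a cross-check, here is the arithmetic behind those three percentages in one place, using the Mock 4 dimensions from the sections above (my own variable names, and only the three edge overheads counted so far).

```python
# Cross-check of the edge-circuit overheads for the 64-word x 72-bit dual-port array.
CELL_AREA_NM2 = 36_000            # dual-port cell in Mock 4 scaling
CELL_W_NM, CELL_H_NM = 100, 360   # cell width and height

WORDS, BITS = 64, 72
array_w_nm = WORDS * CELL_W_NM    # 6,400 nm
array_h_nm = BITS * CELL_H_NM     # 25,920 nm

overheads = {
    "column addressing (2 ports, 1.5 um each)": 3_000 / array_h_nm,
    "data input drivers + pattern boundary":      800 / array_w_nm,
    "data output latches + pattern boundary":     700 / array_w_nm,
}

total = sum(overheads.values())
for name, frac in overheads.items():
    print(f"{name:44s} {frac:5.1%}")
print(f"{'total':44s} {total:5.1%}")
print(f"effective bit size ~ {CELL_AREA_NM2 * (1 + total):,.0f} nm^2")
```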
If the array size is doubled in each direction, to 128 words of 144 bits, and we assume there are re-drivers in both the DQ and word-enable directions, we can roughly halve the primary overheads, to 19% (the addressing section is one rail taller, which is why it is not 18%), but we add:
2% (of 144 bits in a column) for word re-drivers
6% (of 128 words width) for DQ re-drivers
The overhead is now 27%. If the size is increased to 256 words of 288 bits, the overhead drops to 18%. At 512 words of 576 bits (32kB of cache or scratchpad) the overhead could be under 15%. However, I have not counted the cost of the ECC circuitry, the timing generation, or the drivers for the address and enables.
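Using the overhead percentages estimated above (and taking the "under 15%" figure at face value), the effective bit size at each array size works out as in the sketch below; ECC, timing generation, and the global drivers are still excluded.

```python
# Effective bit size for the array sizes discussed above, using the overhead
# estimates from this post (ECC, timing, and global drivers not counted).
CELL_AREA_NM2 = 36_000
sizes = {           # (words, bits): estimated edge overhead
    (64, 72):   0.35,
    (128, 144): 0.27,
    (256, 288): 0.18,
    (512, 576): 0.15,
}
for (words, bits), ovh in sizes.items():
    eff = CELL_AREA_NM2 * (1 + ovh)
    print(f"{words:3d} words x {bits:3d} bits: {ovh:4.0%} overhead -> ~{eff:,.0f} nm^2 per bit")
```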
What is the real world?
On a real AMD Genoa die, produced on TSMC N5, the SRAM blocks in the cache section are roughly 50,000 nm^2 per bit including the array edges, while TSMC reports the smallest SRAM cell in that process to be around 21,000 nm^2. The cache cells are quite likely dual-ported, and the measured block dimensions include everything at the edges, so they will have timing, clocks, various signal drivers, and so on. That is reasonably close to the numbers estimated in these last blog posts, so this is likely to be a useful study of the size and overheads of SRAM arrays.
Comparing DRAM to SRAM
This exercise also helps explain why SRAM is so different from DRAM. SRAM has access times on the order of a nanosecond, while DRAM is around 20 ns latency at best. The latest DRAM chips have densities, including all overheads, of around 400 bits per square micron, and are produced on cost-optimized processes that can make a wafer for $3,000 holding 2.5 terabytes of yielded chips. SRAM on a recent 5nm logic wafer is about 20 bits per square micron on a wafer costing maybe $15,000, which is about a 100x increase in cost per bit.
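The roughly 100x figure follows directly from the density and wafer-cost ratios just quoted, treating those round numbers as exact:

```python
# Rough cost-per-bit comparison using the densities and wafer costs quoted above.
dram_bits_per_um2, dram_wafer_cost = 400, 3_000
sram_bits_per_um2, sram_wafer_cost = 20, 15_000

# Cost per bit scales as (wafer cost) / (bits per unit area); the wafer area cancels.
ratio = (sram_wafer_cost / sram_bits_per_um2) / (dram_wafer_cost / dram_bits_per_um2)
print(f"SRAM cost per bit is roughly {ratio:.0f}x that of DRAM")   # ~100x
```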
SRAM reads and writes bits at far lower energy cost than DRAM. SRAM can be mixed with logic and run at logic speeds. DRAM can have far higher capacity per unit of silicon, and even more advantage in capacity per dollar.
The two serve very different roles, and over multiple decades computer architecture has adapted to make good use of both and to blend them. That is not to say we would not like denser, cheaper SRAM, or faster, lower-power DRAM, but in the meantime we will continue to blend them in whole systems.
What’s Next
My next project is measuring the speed of gravity using data from GPS satellites. We know from LIGO that gravitational waves from a neutron-star merger arrived here at essentially the same time as the electromagnetic radiation, so we already know the speed, but just for fun I am following up with an alternative calculation using data from much closer to home.
That will probably take a couple of weeks, and then the project after that will be about how low-energy matrix multiplication could work for i8 and FP6 operands within a microscaled data format.
So my next project will go up in the middle of June. I might have some other posts in the meantime, perhaps a bit more about future DRAM.