Memory Wall Gets Higher

With SRAM failing to scale in recent process nodes, the industry must assess its impact on all forms of computing. There are no easy solutions on the horizon.


Key Takeaways

  • With each node shrink, the same amount of SRAM consumes an increasing percentage of the chip area.
  • The problem is not limited to leading-edge AI, as it will eventually impact even small MCUs and MPUs.
  • Architectural changes may be required. Stacking SRAM chiplets on logic is possible but expensive.

SRAM is a vital piece of all computing systems, but its failure to keep pace with logic scaling has created increasingly difficult problems, which over the past five years have become much worse.

Back in 1990, Hennessy and Patterson published the book “Computer Architecture: A Quantitative Approach.” It was already clear to the authors that memory would be a key impediment to the future of processing, in terms of both capacity and performance (see Fig. 1). For decades, hardware architects danced around this problem, typically using SRAM as a cache backed by a larger off-chip DRAM. That made memory appear larger, but any access that missed the cache was much slower. This gap became known as the memory wall.

Fig. 1: Early identification of memory wall. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach.
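
The cache-plus-DRAM hierarchy described above is commonly quantified with the average memory access time (AMAT) model from the same book. A minimal sketch, with illustrative cycle counts rather than measurements from any specific chip:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for a two-level hierarchy:
    AMAT = hit_time + miss_rate * miss_penalty (all in cycles)."""
    return hit_time + miss_rate * miss_penalty

# A 4-cycle SRAM cache backed by a 200-cycle DRAM access:
print(amat(4, 0.02, 200))   # 2% miss rate  -> 8.0 cycles on average
print(amat(4, 0.10, 200))   # 10% miss rate -> 24.0 cycles on average
```

Even a modest miss rate lets the slow DRAM dominate the average, which is the wall the figure depicts.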

In all forms of computing, the program and data are held in SRAM. The processor reads instructions from this memory. Those instructions tell the processor what operations to perform on data, which are also stored in that memory.

SRAM is cheaper than the registers that hold data temporarily within the processor. While a register cell can use the same number of transistors as an SRAM cell, the register file uses a more expensive decode and access mechanism that does not scale well with register bank size.

An SRAM memory is composed of an array of memory cells surrounded by circuitry that enables the data to be fetched and stored in a random manner. In many cases, the surrounding logic is semi-custom because it changes as the memory array gets larger. In fact, many of the improvements in memory speed have come from improvements in this circuitry instead of in the memory array itself.

The future looks increasingly bleak, as improvements in SRAM size and performance remain close to non-existent. With each node shrink, the same amount of SRAM consumes an increasing percentage of the chip area. As more chips reach the reticle limit, they cannot afford this, and have to fall back on external memory more than in the past. That memory is orders of magnitude slower.

In the era of AI, where access patterns are different, memory has quickly become the major limiter.

TSMC agrees there are problems associated with SRAM scaling, but the foundry claims to have made improvements with its new 2nm nanosheet technology (see Fig. 2). However, it is difficult to get hard data to support this, and in the past, actual results have fallen short of the figures published before wide-scale utilization.

Fig. 2: TSMC SRAM cell size from public sources. Source: Semiconductor Engineering

While this can be seen as a memory problem, it’s ultimately a compute problem. “Performance is not limited by compute,” says Ramin Farjadrad, CEO and co-founder of Eliyan. “In many cases, we see 20% utilization of the processor for most functions, if not less. It is mainly limited by the memory and memory bandwidth. That is called the memory wall.”

SRAM scaling
It is logical to think that when the transistor size shrinks, so would the size and performance of an SRAM cell that is made up of six transistors. “SRAM scaling stalled because classic 6T bitcells hit physical and variability limits,” says Daryl Seitzer, principal product manager for embedded memory IP at Synopsys. “The SRAM bitcell was invented to be dense, but it has an inherent flaw of conflicting read and write requirements. The access transistor fights with the storage transistor, and this battle between the two needs to be carefully balanced and account for process variations. As the geometries become smaller, variations become big percentages of the bitcell read and write characteristics.”

The problems do not end there. “As nodes shrank, electrostatic control and random variation became dominant constraints, preventing proportional reductions in cell area,” says Andre Bonnardot, senior manager of product management at Arteris. “Moreover, SRAM speed has plateaued because the wire resistance and bit line capacitance have increased, while Vdd has hardly decreased with recent process nodes. Logic could continue scaling through device and routing innovations, but SRAM could not.”

These problems are getting worse with recent nodes. “SRAM bitcell scaling on leading-edge 2nm and below technologies has fallen to less than 15% density improvement,” says Gopi Ranganathan, fellow in the Silicon Solutions Group at Cadence. “This is far below the dramatic 50% to 100% node-to-node shrink we experienced during process technology generations from 65nm to 5nm. This decline can be attributed to the extremely narrow dimensions of the devices, gate contacts, MEOL, and V0/V1 being printed on leading-edge technology nodes, where further meaningful scaling is limited by tools and the ability to deliver good silicon yields.”

The impact is higher cost and less performance. “Primarily, it has manifested as memory falling behind in density scaling,” says Nigel Drego, CTO for Quadric. “Gate/mm2 has been advancing faster than MB/mm2. In addition, access speeds have suffered because wire delays and the laws of physics are unsympathetic to the needs of SoC designers. Smart architecture adaptations, however, can mitigate the dependencies between logic and SRAM speeds.”

Given that the divide has been growing ever since the 1980s, how does computing today compare to that of two decades ago? “The performance of the computer or processor has gone up by almost five orders of magnitude,” says Eliyan’s Farjadrad. “But these guys need to do processing on the data that comes from a memory. The bandwidth from the memory has not even gone up by 100X, so there’s a larger than 1,000X delta between the amount of data that these guys are processing, or can process, versus the data coming in.”
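
Farjadrad's round numbers can be sanity-checked directly. The growth factors below are the ones quoted above, not independent measurements:

```python
compute_growth = 1e5     # processor throughput: ~five orders of magnitude
bandwidth_growth = 1e2   # memory bandwidth: "not even 100X"

gap = compute_growth / bandwidth_growth
print(f"compute has outgrown memory bandwidth by ~{gap:,.0f}x")
```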

This is not just a problem for leading-edge AI technologies. Eventually it will impact everything, even down to small MCUs and MPUs, especially as AI moves to the edge. “At some point it becomes non-scalable, and then SRAM starts to occupy a larger percentage of the total die size,” says Kavita Char, principal product marketing manager at Renesas. “That is something we have to take into consideration. It also impacts the users of the chips because they have to consider what can be done on-chip, and at what point they have to move to external memory. It does start to get more and more expensive as you go down to finer geometries.”

It is not clear if bitcell area is better in N2 compared to the previous generation. “Recent SRAM gains come from utilizing the logic shrink and applying it to the decode and control circuitry of the SRAM macros,” says Rahul Thukral, senior director of product management for embedded memory IP at Synopsys. “This requires design innovations, and we have been able to show such area advantages despite non-scaling of bitcells. Future gains are expected as the gate‑all‑around (GAA) technology improves, and more flexibility in device width controls is expected. There are additional improvements expected with GAA transistors providing better electrostatic control, which reduces leakage and improves read and write behavior. For initial 2nm processes, memory area is improving, with much of the gains coming from logic devices in the decode and datapath circuits. However, further bitcell area scaling with GAA transistors is on the horizon, and more bitcell area reduction is expected in follow-on nodes.”

“We view the slowdown in SRAM scaling as being at a system architecture inflection point,” says Arteris’ Bonnardot. “When memory density growth slows, simply adding more cache becomes economically inefficient.”

Implications for software
The implications for software are wide-ranging, challenging the long-standing notion that software productivity is the most important thing to be optimized. This is being questioned in many areas these days, especially as more products become software-defined. “Processor architectures that rely upon massive local SRAMs with layers of fast caches will suffer the most,” says Quadric’s Drego. “CPUs cannot avoid those hardware-heavy memory architectures because the CPUs found in our phones, laptops, and data centers are designed to run random user code with unstructured memory references and juggle dozens of threads at the same time.”

For companies like these, there are few choices. “SRAM now consumes a larger fraction of die area and cost,” says Bonnardot. “Large register files and cache hierarchies no longer scale for free, increasing pressure on die size and yield, power efficiency, and data movement efficiency. This shifts the bottleneck from compute density to memory architecture and interconnect efficiency. Software must assume memory is more hierarchical and less uniformly fast. Locality, tiling, partitioning, and traffic predictability matter more, while latency variance becomes a system-level performance limiter.”
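
The locality and tiling Bonnardot mentions can be illustrated with a blocked matrix multiply, which reuses each small sub-block many times while it is resident in cache instead of streaming whole rows from DRAM. This is a sketch of the idea, not a tuned kernel, and the tile size is an arbitrary choice:

```python
import numpy as np

def matmul_tiled(A, B, tile=64):
    """Blocked matrix multiply. Each (tile x tile) sub-block of A, B, and C
    is reused many times while it is hot in cache, cutting DRAM traffic."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # NumPy slices clamp at array edges, so ragged tiles are fine.
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C
```

Choosing the tile so that three sub-blocks fit in a cache level is exactly the kind of memory-aware decision the quote says software can no longer avoid.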

AI cannot escape the problems. “As AI model sizes and context lengths grow, memory bandwidth and on‑chip caching dominate performance,” says Synopsys’ Seitzer. “This shows up clearly in LLM inference, where KV‑cache bandwidth becomes the bottleneck. As a result, software must optimize data locality, memory‑aware scheduling, quantization, sparsity, and memory tiering, because compute improvements no longer compensate for slow memory scaling.”
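
Why the KV cache dominates is easy to see from its size, which every generated token must stream through the core. The model shape below is a set of hypothetical round numbers for illustration, not any specific LLM:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Keys plus values: two tensors per layer, each holding
    context_len * kv_heads * head_dim elements."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# 32 layers, 8 KV heads of dimension 128, 32K context, fp16 (2 bytes):
cache = kv_cache_bytes(32, 8, 128, 32_768)
print(f"{cache / 2**30:.1f} GiB of KV cache per sequence")  # 4.0 GiB
```

At that size, tokens per second is set by how fast the memory system can deliver the cache, not by how fast the arithmetic units can multiply.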

Some differences in AI architectures can be exploited. “AI engines, particularly AI inference processors, have the luxury of working on well-structured code for periods of time that are orders of magnitude longer than a task-switching CPU,” says Drego. “Smart AI architectures push memory management into offline compilers that can schedule explicit code-driven DMA transfers of AI model weights and activations. Entire AI inference processing engines can be built that don’t need data caches of any sort. That relieves the pressure to design with the highest-speed, highest-power-burning SRAMs for layered caches and cache tags and translation buffers. And as more workloads rely upon AI models, a larger and larger percentage of the chip area of advanced SoCs can avoid SRAM density/speed bottlenecks, isolating that design challenge only in the critical CPU sub-blocks.”
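
The compiler-scheduled, code-driven DMA Drego describes typically takes the form of double buffering: while the compute unit works on one tile in local SRAM, the DMA engine fills the other. A minimal sketch, where `dma_start`, `dma_wait`, and `compute` are hypothetical stand-ins for an accelerator's real primitives:

```python
def run_tiles(tiles, dma_start, dma_wait, compute):
    """Ping-pong between two SRAM buffers so the DMA transfer of the next
    tile overlaps with compute on the current one. No cache is required,
    because the compiler knows the access pattern in advance."""
    buf = [None, None]
    buf[0] = dma_start(tiles[0])                        # prefetch first tile
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            buf[(i + 1) % 2] = dma_start(tiles[i + 1])  # start next transfer
        data = dma_wait(buf[i % 2])                     # wait for current tile
        compute(data)                                   # overlaps the transfer
```

Because the schedule is fixed at compile time, the hardware needs neither cache tags nor translation buffers for this traffic, which is the saving the quote points to.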

Not everyone in the industry may be listening yet. “With AI models, there’s something called arithmetic intensity,” says Eliyan’s Farjadrad. “This means the number of functions, or operations, that the processor runs on the memory. Unfortunately, the arithmetic intensity of recent AI models is much less than it used to be. As a result, there is more demand for bandwidth from the memory into the processor.”
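
Arithmetic intensity is simply operations performed per byte moved from memory, and a rough calculation shows why low-intensity workloads stress bandwidth. The matrix sizes here are illustrative:

```python
def intensity(flops, bytes_moved):
    """Arithmetic intensity in FLOPs per byte of memory traffic."""
    return flops / bytes_moved

n = 4096
# Dense n x n matmul: 2n^3 FLOPs over three n x n fp16 matrices (2 bytes each).
matmul = intensity(2 * n**3, 3 * n * n * 2)
# Matrix-vector product (decode-style inference): 2n^2 FLOPs over ~n^2 weights.
matvec = intensity(2 * n * n, n * n * 2)
print(f"matmul: {matmul:.0f} FLOPs/byte, matvec: {matvec:.0f} FLOPs/byte")
```

A workload near 1 FLOP/byte leaves the arithmetic units idle unless memory bandwidth keeps pace, which is exactly the demand Farjadrad describes.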

3D SRAM
If SRAM does not scale, there is little point using the most expensive node for it. There is growing pressure to treat SRAM as something that should be pushed onto a chiplet and mounted above processors. “SoC designers are exploring disaggregation, where a small amount of SRAM is placed on the die being designed on the leading-edge technology node,” says Cadence’s Ranganathan. “The most critical need is for CPU/GPU/AI workloads, such as Level 1, Level 2, or even Level 3. In this scenario, the larger SRAM capacity, say Level 4, is placed on a previous technology node die, which offers better cost per transistor. The advent of faster die-to-die communication links and finer interconnect pitches makes the integration of multi-memory hierarchies easier, lowering cost with a reasonable latency impact.”

That is currently an expensive solution. “3D and chiplet‑based SRAM are currently economical for higher‑end AI/HPC chips due to high packaging cost, thermal complexity, and limited standardization,” says Seitzer. “SRAM-heavy chiplets today remain concentrated in premium devices with custom solutions squeezing SRAM in with other high value IP. A near‑term path to low‑cost, mass‑market SRAM chiplets looks unlikely.”

But that day may come. “Chiplets have been a great solution to provide maybe orders of magnitude higher bandwidth at orders of magnitude lower power,” says Farjadrad. “Everybody needs to make it work, and that’s why there’s so much focus in the industry on solving these challenges. This is how the performance wall can be removed, not just 2.5D, but also 3D.”

Alternative approaches
Whenever problems with memory arise, there is inevitably a discussion about new memory technologies that could replace SRAM. “Emerging options help in specific roles, but are not universal SRAM replacements,” says Bonnardot. “Most future systems will use more tiers of memory, not fewer.”

Those future systems also may be architected differently. “The notion of in-memory compute, or near-memory compute, which is what AI is driving toward, means that the traditional models will change a bit,” said Nandan Nayampally, chief commercial officer at Baya Systems. “The traditional models are built around giant compute engines that are trying to pull data from memory that is somewhat close by. So there is a consistent evolution that will start using different memory, because finally we can say SRAM has failed to scale. That’s one way to look at it. The other way to look at it is, have you run out of the architectural limits of how we use SRAM today? And I think that’s more the case. Cerebras took a big step in doing wafer-scale, which is to concentrate more memory on-die, to change some of those constraints.”

Even with these advancements, there are limits to the size of the model that can fit on a single die. “This leads to the main question, ‘What can actually be accomplished efficiently on one wafer?’ And if you start stacking wafers or producing larger ones, does the architecture continue to scale properly, or do you eventually encounter the same limitations? The so-called memory wall isn’t just a one-time obstacle,” Nayampally said. “If the architecture remains unchanged, every increase in model size simply introduces a new barrier. Therefore, design decisions must focus on how well the system can scale, from a single die to multiple dies and beyond. Initially, we saw CPU clusters. Later, chiplet clusters. Then that escalated to board-level clusters. Now, scaling up means trying to make an entire rack function as a unified computing resource, and pushing further than that. At every stage, whether at nanometer, millimeter, centimeter, meter, or even kilometer scales, new challenges will arise. Ultimately, how you divide and manage resources determines your ability to overcome these recurring barriers.”

New memories also are gaining a foothold. “Some emerging embedded memories do show real traction, especially where SRAM or embedded flash struggle,” says Seitzer. “For example, MRAM scales well, has low leakage, high endurance, and is expected to replace parts of embedded FLASH/SRAM in SoCs. ReRAM is gaining adoption due to easy integration and lower cost structure for embedded non‑volatile storage. These technologies augment, but do not replace, high‑performance SRAM in L1/L2 caches, but they are candidates to replace embedded memories in some controllers, MCUs, and accelerators.”

There has been a lot of focus on high-bandwidth memory (HBM), which has significantly increased the bandwidth of DRAM. HBM is a stack of DRAM dies, where the bottom layer historically has been a die-to-die PHY to the processor. That base die was limited by power and thermal density because it used the same process technology as the bitcell layers, a process optimized for memory cells rather than logic. Switching the base die to a logic-optimized process enables many more potential functions and delivers higher performance.

“By doing that, we can implement a much higher bandwidth die-to-die interface between the HBM base die and the GPU,” says Farjadrad. “We can use the excess bandwidth on the other side of the base die to connect additional things. These additional things could be another row of HBM, potentially doubling the amount of HBM that a GPU can access. Or you could use this for I/O chiplets to provide higher bandwidth externally, or a combination of the two.”

There is also more room for managing cache. “In an era where SRAM scaling is no longer automatic, architectural efficiency, especially at the fabric and coherency level, becomes the primary lever for performance per mm² and performance per watt,” says Bonnardot. “By intelligently managing cache placement and traffic behavior, the cache can deliver adequate memory capacity and bandwidth gains without proportional increases in SRAM area.”

Conclusion
The memory wall is growing taller, and there are few signs that this will change in the near future. SRAM scaling is unlikely to regain the momentum it once had, meaning that alternatives must be found. 3D stacking will probably become more prevalent, especially if prices drop. But there are no silver bullets. If fast memory becomes the limiter to compute, then compute must start using the available memory more efficiently.


