The consensus
Memory is cyclical, the stocks have run, and the ending is obvious.
Living thesis / June 2026
The lazy bear case is that memory always mean-reverts. The bull case is that AI is turning memory from a commodity input into a bottleneck attached to every useful unit of frontier compute.
AI demand is not one demand shock. It is tokens, model size, context, turns, and memory attach.
Compute can scale and still be starved if enough state cannot sit close to the accelerator.
Routing and optimization are real. The question is whether they outrun demand.
NVIDIA H100/Rubin specs, Goldman token forecast, TrendForce HBM capacity comments.
Citi-linked Rubin NAND attach, Michael Dell 625x framing, Jeremy Werner context-growth framing.
2030 EB demand, KV share, supply gap, China modern-equivalent adjustment, scenario outputs.
The case in five numbers
Tokens create traffic. Larger models create resident weight demand. Longer context creates KV-cache demand. More agent turns create more inference per useful output. Supply has to answer all of that with HBM that is fast enough, qualified enough, and close enough to the accelerator to matter.
"Memory is cyclical, everyone knows that, and the recent run up in memory names is an obvious bubble."
That is the easy view. It might even be right for a trade. Memory stocks have always been dangerous when they look cheap. Supply catches up, pricing rolls over, margins collapse, and the market remembers why it never wanted to pay a growth multiple for commodity bits.
But I think that reflex may be blinding people to the sheer scale of what AI is doing to memory demand.
The old cycle was driven by PCs, phones, servers, inventory, and capital discipline. The AI cycle is different because the demand unit is not just "more devices." It is more accelerators, more memory per accelerator, more memory bandwidth per accelerator, larger resident models, longer context, and more inference turns per completed task.
The first clue that there might be more to the story came when reporting around NVIDIA's next-generation Rubin platform started to show just how much memory the system would need. Citi-linked industry reporting said Rubin could require roughly 16 TB of NAND per GPU, or 1,152 TB per NVL72 rack. NVIDIA's own Rubin materials point in the same direction on HBM: the next generation is built around HBM4, up to 288 GB of HBM per GPU, and up to 22 TB/s of HBM bandwidth.
Accelerator memory step-up
Roadmap pressure
The thesis starts with observable attach rates before it gets to model assumptions. NVIDIA discloses H100 at 80 GB HBM and Rubin at up to 288 GB HBM4. TrendForce expects Rubin Ultra to move to 384 GB per GPU in 2027.
That is where the memory story starts to look different. It is not just "AI needs more GPUs." It is "known GPU roadmaps are pulling a much larger and much faster memory stack behind them."
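The step-up implied by those figures is easy to check by hand. Here is a minimal sketch that uses only the capacities and attach rates quoted above; the variable and function names are mine, not anything from NVIDIA, Citi, or TrendForce.

```python
# HBM per GPU in GB, as quoted in the text.
HBM_GB = {"H100": 80, "Rubin": 288, "Rubin Ultra": 384}

# NAND attach from the Citi-linked reporting quoted in the text.
NAND_TB_PER_GPU = 16
GPUS_PER_NVL72_RACK = 72


def step_up(new_gb: float, old_gb: float) -> float:
    """Return how many times larger new_gb is than old_gb."""
    return new_gb / old_gb


rubin_vs_h100 = step_up(HBM_GB["Rubin"], HBM_GB["H100"])
ultra_vs_h100 = step_up(HBM_GB["Rubin Ultra"], HBM_GB["H100"])
nand_tb_per_rack = NAND_TB_PER_GPU * GPUS_PER_NVL72_RACK

print(f"Rubin vs H100 HBM:       {rubin_vs_h100:.1f}x")
print(f"Rubin Ultra vs H100 HBM: {ultra_vs_h100:.1f}x")
print(f"NAND per NVL72 rack:     {nand_tb_per_rack} TB")
```

On those inputs, HBM per GPU rises 3.6x from H100 to Rubin and 4.8x to Rubin Ultra, and the rack-level NAND figure ties out to the 1,152 TB quoted above.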
This is the old memory wall problem with a new forcing function. Compute can keep improving, but if models need more state to be close to the accelerator, memory manufacturers do not just need more bits. They need faster bits, stacked tighter, packaged better, qualified into frontier systems, and delivered into a supply chain that is already constrained.
The first AI infrastructure debate was mostly about training clusters and raw compute. The next debate is about inference systems that serve persistent, agentic, long-context workloads. Those systems are not just doing math. They are holding state.
That state shows up in two buckets. The first is resident model weights: the parameters and experts that need to be loaded across serving replicas. The second is KV cache: the memory used to keep previous tokens available so the model can continue generating without recomputing the whole conversation or task history.
Agents make this much more extreme. A normal chat can be short. A useful agent may pull a codebase, Slack history, contract set, research corpus, or spreadsheet model into context, then iterate over the task many times. Longer context means more memory per active workload. More turns means more workload per finished output.
Goldman Sachs frames token consumption as rising 24x to 120 quadrillion tokens per month by 2030. Micron SVP Jeremy Werner has reportedly framed current context growth as roughly 30x per year. I do not assume that context keeps compounding anywhere near that rate. I do think it tells you which way the pressure points.
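A quick compounding check shows why I do not extrapolate the context figure. This is a sketch under one assumption of mine: that the 24x token multiple spans the four years from 2026 to 2030, which is the window the base case uses. The function names are illustrative.

```python
def annualized_multiple(total_multiple: float, years: int) -> float:
    """Convert a total growth multiple into an equivalent per-year multiple."""
    return total_multiple ** (1.0 / years)


def compounded_multiple(annual_multiple: float, years: int) -> float:
    """Compound a per-year growth multiple over a number of years."""
    return annual_multiple ** years


YEARS = 4  # assumption: 2026 -> 2030, the base-case window

token_annual = annualized_multiple(24, YEARS)
context_if_sustained = compounded_multiple(30, YEARS)

print(f"24x over {YEARS} years is about {token_annual:.2f}x per year")
print(f"30x per year for {YEARS} years is {context_if_sustained:,}x")
```

Token growth of 24x works out to roughly 2.2x per year. Context growth of 30x per year, if sustained for the same four years, would compound to 810,000x, which is why the reported figure is better read as a direction than as a forecast.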
Memory mechanics
At the request level, KV cache grows roughly with active tokens, active sessions, model depth and width, and bytes per stored value. For a fixed model, KV cache grows close to linearly with context length. Across model generations, bigger models can also raise the KV bytes required per token, so context and model size are not perfectly independent in the real world.
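To make that scaling concrete, here is a sketch of the standard per-request KV-cache estimate for a transformer: one key vector and one value vector per token, per layer. The model dimensions below are hypothetical and do not describe any specific product. The estimate also ignores paging, compression, and reuse, which the calculator treats as separate efficiency dials.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Estimate KV-cache bytes for one session holding `tokens` in context."""
    # Factor of 2: one key and one value vector per token, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 80, 8, 128, 2

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)
at_128k = kv_cache_bytes(131_072, LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)

print(f"KV bytes per token: {per_token:,}")
print(f"KV at 128k context: {at_128k / 2**30:.0f} GiB per session")

# Linear in context for a fixed model: doubling tokens doubles the cache.
assert kv_cache_bytes(2_000, LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES) == \
    2 * kv_cache_bytes(1_000, LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)
```

With this made-up shape, each token costs 327,680 bytes of KV cache, so a single 128k-token session holds about 40 GiB before any optimization. Fleet-level demand multiplies that by concurrent active sessions. Raising `layers`, `kv_heads`, or `head_dim` raises the per-token cost, which is how model size leaks into the context bucket.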
The simple version is that data center memory demand is driven by five things: token demand, model size, context, agent turns, and optimization. But those variables are not all multiplied together in one naive index.
Model size primarily pressures the resident-weight bucket. Context primarily pressures the KV-cache bucket. Token growth increases the number of workloads served. Agent turns increase the number of inference passes per useful output. Optimization divides against all of it through routing, quantization, compression, paging, offload, reuse, better kernels, and better scheduling.
This is less explosive than multiplying token growth by model growth by context growth. It is also more honest. But the additive model can still produce a very large number because the KV bucket gets big quickly once token demand, active sessions, and average useful context all rise at the same time.
Base-case bridge
This is the current base case, not a proof. The point is to show which intermediate assumptions do the work.
The bridge shows why the base case is mostly a state thesis. Resident weights matter, but KV cache becomes roughly 70% of modeled 2030 demand after gross-up.
Methodology
The important choice is that model size and context are not blindly multiplied together. They pressure different memory buckets.
Start with tokens, then haircut by the share of traffic served on HBM-heavy systems.
(tokens / 5Q) x HBM-served share
Scale the 2026 resident-weight bucket by serving replicas, model footprint, and weight efficiency.
1.75 EB x replica index x model scale / weight efficiency
Scale the 2026 KV bucket by active session pressure and the context bucket.
1.27 EB x session index x context bucket
Add scratch memory, redundancy, and fixed utilization so the chart ties to deployable HBM.
(weights + KV + scratch) x redundancy / 74%
Compare demand against modern-equivalent HBM supply, including a separate China adjustment.
RoW supply + China gross x modern-equivalent share
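The five steps above can be sketched as a handful of small functions. This is my reading of the formulas as written, not the calculator's actual source. The dial names are illustrative, and the sketch leaves the replica and session indices as independent inputs because the text does not say how they are derived from the traffic index.

```python
from dataclasses import dataclass

# 2026 anchors quoted in the methodology steps.
BASE_TOKENS_Q = 5.0      # quadrillion tokens per month
BASE_WEIGHTS_EB = 1.75   # resident-weight bucket, EB
BASE_KV_EB = 1.27        # KV-cache bucket, EB


@dataclass
class Dials:
    tokens_q: float            # monthly tokens, in quadrillions
    hbm_served_share: float    # share of traffic on HBM-heavy systems (0..1)
    replica_index: float       # serving replicas relative to 2026
    model_scale: float         # model footprint relative to 2026
    weight_efficiency: float   # >1 means fewer bytes per weight
    session_index: float       # active session pressure relative to 2026
    context_bucket: float      # context pressure relative to 2026
    scratch_eb: float          # scratch memory, EB
    redundancy: float          # redundancy multiplier (>= 1)
    utilization: float = 0.74  # fixed utilization from step 4


def traffic_index(d: Dials) -> float:
    """Step 1: token growth, haircut by HBM-served share."""
    return d.tokens_q / BASE_TOKENS_Q * d.hbm_served_share


def weights_eb(d: Dials) -> float:
    """Step 2: resident-weight bucket."""
    return BASE_WEIGHTS_EB * d.replica_index * d.model_scale / d.weight_efficiency


def kv_eb(d: Dials) -> float:
    """Step 3: KV-cache bucket."""
    return BASE_KV_EB * d.session_index * d.context_bucket


def grossed_demand_eb(d: Dials) -> float:
    """Step 4: add scratch, then gross up for redundancy and utilization."""
    return (weights_eb(d) + kv_eb(d) + d.scratch_eb) * d.redundancy / d.utilization


def effective_supply_eb(row_eb: float, china_gross_eb: float,
                        modern_equiv_share: float) -> float:
    """Step 5: rest-of-world supply plus the modern-equivalent part of China."""
    return row_eb + china_gross_eb * modern_equiv_share


# Neutral dials (placeholders, not the base case) reproduce the 2026 anchors.
neutral = Dials(tokens_q=5.0, hbm_served_share=1.0, replica_index=1.0,
                model_scale=1.0, weight_efficiency=1.0, session_index=1.0,
                context_bucket=1.0, scratch_eb=0.0, redundancy=1.0)
print(f"2026 grossed demand at neutral dials: {grossed_demand_eb(neutral):.2f} EB")
```

The structure matters more than the sample values. Model scale only touches `weights_eb`, context only touches `kv_eb`, and the two are added rather than multiplied before the gross-up.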
Formula audit
The most bullish napkin math multiplies tokens by model size by context. I do not think that is the best core formula. The cleaner approach is to separate memory into buckets, then let each bucket respond to the assumption that actually drives it.
Model size mostly drives resident weights. Context mostly drives KV cache. Treating them as separate buckets avoids a false triple-count.
Routing, quantization, KV paging, reuse, compression, and offload are not ignored. They are visible dials.
Bigger models often have more layers and wider activations, which can increase memory per context token. The calculator only partly captures that through buckets.
The model approximates active sessions. Real deployments depend on batching, latency targets, prompt reuse, offload, and how sticky agent sessions become.
In the current base case, token demand rises from 5 quadrillion tokens per month in 2026 to 120 quadrillion by 2030. The model assumes routing and efficiency matter, so resident weights rise only to 5.2 EB of grossed HBM demand by 2030. The problem is that KV cache rises to 20.8 EB. The base case is not a model-size story. It is mostly a state story.
Calculator base case
Base demand reaches 29.9 EB against 18.5 EB of effective supply, leaving an 11.4 EB deficit. The useful question is not whether that exact number is right. It is which assumptions you have to change to make the deficit disappear.
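The base-case figures can be tied together with simple arithmetic. The sketch below uses only the numbers quoted in the text. The residual bucket and the break-even KV cut are derived by me, not stated outputs of the calculator, and the break-even holds every other dial fixed.

```python
# 2030 base-case figures quoted in the text, in EB.
WEIGHTS_EB = 5.2
KV_EB = 20.8
DEMAND_EB = 29.9
SUPPLY_EB = 18.5

# Implied residual: demand not explained by weights or KV (not broken out in the text).
other_eb = DEMAND_EB - WEIGHTS_EB - KV_EB

# Sign convention from the sensitivity table: positive means supply exceeds demand.
gap_eb = SUPPLY_EB - DEMAND_EB

kv_share = KV_EB / DEMAND_EB

# Fraction of KV demand that would have to vanish to close the gap, all else equal.
kv_cut_to_balance = -gap_eb / KV_EB

print(f"Implied other bucket: {other_eb:.1f} EB")
print(f"Gap:                  {gap_eb:+.1f} EB")
print(f"KV share of demand:   {kv_share:.0%}")
print(f"KV cut to balance:    {kv_cut_to_balance:.0%}")
```

On these numbers, KV cache is about 70% of demand, and the KV bucket alone would have to shrink by roughly 55% to erase the 11.4 EB deficit. That is one way to size how much routing and KV efficiency the bear case needs.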
The thesis should live or die on visible assumptions, not vibes. Here is the current base case in plain English:
None of that is meant to be sacred. If you think routing takes most token growth away from frontier systems, use the routing-bear case. If you think KV cache optimization is spectacular, use the context-disciplined case. If you think context becomes the core workload of the AI economy, use the memory-bull or stress cases. The calculator is there because the debate is the sensitivity.
The calculator produces very different outcomes depending on what you believe about routing, context, and supply. That is not a bug. It is the thesis in miniature.
Calculator sensitivity
Green is surplus. Amber is tight. Red is deficit. Each cell uses the same calculator v2.2 bucket formula.
Positive gap means modeled supply exceeds demand. Negative gap means the calculator is short that many exabytes in 2030. The 30x stress row is included as an upper-tail test, not the base forecast.
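The sign convention and color coding can be expressed as a small classifier. This is a sketch only: the text does not say where "tight" begins, so the `tight_band_eb` threshold below is a hypothetical placeholder rather than the calculator's actual cutoff.

```python
def classify_gap(gap_eb: float, tight_band_eb: float = 1.0) -> str:
    """Label a 2030 gap (supply minus demand, in EB).

    tight_band_eb is a hypothetical threshold: gaps within +/- this band
    are treated as tight rather than a clear surplus or deficit.
    """
    if gap_eb > tight_band_eb:
        return "surplus"   # green
    if gap_eb < -tight_band_eb:
        return "deficit"   # red
    return "tight"         # amber


for gap in (3.0, 0.5, -11.4):
    print(f"{gap:+6.1f} EB -> {classify_gap(gap)}")
```

With that placeholder band, the base-case gap of -11.4 EB lands firmly in deficit, while a scenario within 1 EB of balance reads as tight.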
The bear case is not dumb. In the routing-bear case, memory demand collapses relative to supply because most traffic never touches the most HBM-heavy systems and efficiency compounds. The bull case is that the world does not stay that clean. Agents, coding, research, and enterprise workflow are exactly the use cases that want state, context, and iteration.
The supply side is not just "more HBM exists." Usable HBM for frontier AI depends on capacity, yields, packaging, qualification, customer allocation, and whether the memory is close enough in performance to substitute for modern HBM in real clusters.
TrendForce says AI infrastructure should sustain strong HBM demand through 2026 and 2027, with HBM capacity per AI chip moving from 96/192 GB to 216/288 GB, Rubin Ultra expected at 384 GB per GPU, and HBM wafer input among the top three suppliers rising toward roughly 30% of total DRAM wafer input by the end of 2027. That is not a normal memory cycle. That is a capacity reallocation problem.
This is why China matters, but not in a simple way. China can add gross memory supply. The relevant question for frontier AI is how much of that supply is modern-equivalent HBM that can actually relieve NVIDIA, ASIC, and hyperscaler clusters. The calculator splits rest-of-world modern HBM from China gross HBM for that reason.
How much DRAM capacity shifts toward HBM without starving conventional markets?
How much of that output reaches the right stack height, speed, and quality?
Can the memory be integrated into accelerators and AI ASIC packages at scale?
Does it qualify for the systems that actually set frontier AI capacity?
The honest bear case is not "memory is cyclical." That is a label, not an argument. The real bear case is that AI demand routes around HBM faster than HBM demand compounds.
Most token growth moves to small models, ASICs, or memory-light systems.
Watch: HBM-served token share
Compression, reuse, offload, and paging outrun context demand.
Watch: KV efficiency and context bucket
Average resident context stays far below advertised max windows.
Watch: average useful context
HBM4 qualification, packaging, yields, and capacity ramp faster than expected.
Watch: RoW supply and China equivalent share
Those are real risks. They are also measurable. If the base case is wrong, it should show up in the dials: lower HBM-served token share, lower context bucket, higher KV efficiency, faster supply, or more modern-equivalent China supply.
Everyone knows memory stocks are cyclical, and they always look cheap right before the bubble bursts. But what if the structurally important product is no longer generic memory? What if the relevant product is frontier HBM, qualified into AI systems, attached to accelerators, and constrained by packaging, yields, bandwidth, and customer allocation?
We have already seen one traditionally cyclical semiconductor company rerate into a growth story because AI turned its product into the bottleneck. It is now the most valuable company in the world.
The point is not that memory becomes NVIDIA. It is that the old commodity prior may be too lazy. If AI compute keeps scaling, and if useful AI work becomes more stateful, then memory may become less cyclical, more strategic, and more margin-rich than investors are used to believing.
The calculator is the living version of this thesis. Change token growth, context pressure, model weights, supply, and China modern-equivalent assumptions, then watch which bucket dominates. The point is not one exact forecast. The point is seeing which assumptions carry the argument.
Memory Analyst is an independent research site for discussion and education. Nothing on this site is investment, legal, tax, accounting, or procurement advice, and nothing should be read as a recommendation to buy or sell any security, private investment, memory product, GPU, contract, or related asset.
The essay, calculator, charts, and model outputs are scenario analysis built from public information, estimates, simplifications, and user-selected assumptions. They may be wrong, stale, incomplete, internally inconsistent, or inappropriate for any specific decision. HBM supply, AI demand, model architecture, export controls, pricing, yields, packaging capacity, and serving efficiency can change quickly.
Do your own work, consult appropriate advisers, and check primary sources before making financial, technical, operational, or strategic decisions. Any companies, products, or securities mentioned are included for research context only. Memory Analyst may revise, replace, or remove assumptions without notice.