Living thesis / June 2026

Memory is cyclical. AI may be big enough to break the reflex.

The lazy bear case is that memory always mean-reverts. The bull case is that AI is turning memory from a commodity input into a bottleneck attached to every useful unit of frontier compute.

01

The consensus

Memory is cyclical, the stocks have run, and the ending is obvious.

02

The inversion

AI demand is not one demand shock. It is tokens, model size, context, turns, and memory attach.

03

The bottleneck

Compute can scale and still be starved if enough state cannot sit close to the accelerator.

04

The debate

Routing and optimization are real. The question is whether they outrun demand.

Official

NVIDIA H100/Rubin specs, Goldman token forecast, TrendForce HBM capacity comments.

Reported

Citi-linked Rubin NAND attach, Michael Dell 625x framing, Jeremy Werner context-growth framing.

Modeled

2030 EB demand, KV share, supply gap, China modern-equivalent adjustment, scenario outputs.

The case in five numbers

The thesis is not that every memory cycle is gone. It is that this cycle may be pulled by several AI curves at once.

Tokens create traffic. Larger models create resident weight demand. Longer context creates KV-cache demand. More agent turns create more inference per useful output. Supply has to answer all of that with HBM that is fast enough, qualified enough, and close enough to the accelerator to matter.

Token frame: 24x (Goldman-style 2030 token growth)
Rubin HBM: 22 TB/s (NVIDIA stated bandwidth per GPU)
Base demand: 29.9 EB (calculator 2030 HBM requirement)
Base supply: 18.5 EB (modern-equivalent 2030 supply)
KV share: ~70% (of modeled 2030 demand)

What has to be true

  • AI usage keeps shifting from short chats toward agents, coding, research, and enterprise workflows.
  • Average useful context rises enough that KV cache becomes a normal serving constraint.
  • Frontier systems still need meaningful resident model footprint despite MoE, distillation, and routing.
  • HBM4, advanced packaging, and customer qualification remain hard enough to keep modern-equivalent supply scarce.

What would change my mind

  • Most incremental tokens route to memory-light models without hurting product quality.
  • KV cache reuse, compression, paging, and offload improve faster than context demand.
  • Advertised context windows do not turn into large resident context in real workloads.
  • HBM supply expands quickly without margin discipline, yield friction, or packaging bottlenecks.

Start with the reflex

"Memory is cyclical, everyone knows that, and the recent run up in memory names is an obvious bubble."

That is the easy view. It might even be right for a trade. Memory stocks have always been dangerous when they look cheap. Supply catches up, pricing rolls over, margins collapse, and the market remembers why it never wanted to pay a growth multiple for commodity bits.

But I think that reflex may be blinding people to the simple scale of what AI is doing to memory demand.

The old cycle was driven by PCs, phones, servers, inventory, and capital discipline. The AI cycle is different because the demand unit is not just "more devices." It is more accelerators, more memory per accelerator, more memory bandwidth per accelerator, larger resident models, longer context, and more inference turns per completed task.

The question is not whether memory is cyclical. It is whether AI is shocking the demand curve hard enough that the cycle stays tighter, higher margin, and more strategically important than investors expect.

The first clue was system memory attach

The first clue that there might be more to the story came when reporting around NVIDIA's next-generation Rubin platform started to show just how much memory the system would need. Citi-linked industry reporting said Rubin could require roughly 16 TB of NAND per GPU, or 1,152 TB per NVL72 rack. NVIDIA's own Rubin materials point in the same direction on HBM: the next generation is built around HBM4, up to 288 GB of HBM per GPU, and up to 22 TB/s of HBM bandwidth.

Accelerator memory step-up

H100 was already memory rich. Rubin is designed around much more bandwidth.

NVIDIA H100 SXM5: 80 GB HBM3, over 3 TB/s bandwidth
NVIDIA Rubin GPU: up to 288 GB HBM4, up to 22 TB/s bandwidth

Roadmap pressure

The physical product is already moving toward much larger memory stacks.

The thesis starts with observable attach rates before it gets to model assumptions. NVIDIA discloses H100 at 80 GB HBM and Rubin at up to 288 GB HBM4. TrendForce expects Rubin Ultra to move to 384 GB per GPU in 2027.

H100 SXM5: 80 GB, over 3 TB/s
Rubin GPU: 288 GB, up to 22 TB/s
Rubin Ultra: 384 GB (TrendForce 2027 expectation)

Share of top-three suppliers' DRAM wafer input going to HBM:
2025: 18%
2026: 22% (TrendForce estimate)
2027: 30% (capacity crowd-out starts to bite)

That is where the memory story starts to look different. It is not just "AI needs more GPUs." It is "known GPU roadmaps are pulling a much larger and much faster memory stack behind them."
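
The step-up is easier to see as ratios. The following is a minimal sketch using the figures quoted above. Two inputs are assumptions rather than quotes from the text: the 3.35 TB/s H100 figure is NVIDIA's published SXM5 bandwidth (the text says only "over 3 TB/s"), and the 72-GPU rack count is inferred from the reported 1,152 TB and 16 TB NAND figures.

```python
# Memory attach per GPU and per rack, from figures quoted in the text.
H100_HBM_GB = 80
RUBIN_HBM_GB = 288            # "up to" figure
H100_BW_TBS = 3.35            # assumption: NVIDIA's published H100 SXM5 bandwidth
RUBIN_BW_TBS = 22.0           # "up to" figure

GPUS_PER_RACK = 72            # assumption: inferred from 1,152 TB / 16 TB per GPU
NAND_PER_GPU_TB = 16          # Citi-linked reporting, not an NVIDIA disclosure

capacity_step = RUBIN_HBM_GB / H100_HBM_GB
bandwidth_step = RUBIN_BW_TBS / H100_BW_TBS
rack_hbm_tb = GPUS_PER_RACK * RUBIN_HBM_GB / 1000
rack_nand_tb = GPUS_PER_RACK * NAND_PER_GPU_TB

print(f"HBM capacity per GPU:  {capacity_step:.1f}x H100")
print(f"HBM bandwidth per GPU: {bandwidth_step:.1f}x H100")
print(f"HBM per 72-GPU rack:   {rack_hbm_tb:.1f} TB")
print(f"NAND per 72-GPU rack:  {rack_nand_tb:,} TB")
```

On these inputs bandwidth grows faster than capacity, which is consistent with the point that the roadmap needs faster bits, not only more of them.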

This is the old memory wall problem with a new forcing function. Compute can keep improving, but if models need more state to be close to the accelerator, memory manufacturers do not just need more bits. They need faster bits, stacked tighter, packaged better, qualified into frontier systems, and delivered into a supply chain that is already constrained.

Inference is where the math gets non-linear

The first AI infrastructure debate was mostly about training clusters and raw compute. The next debate is about inference systems that serve persistent, agentic, long-context workloads. Those systems are not just doing math. They are holding state.

That state shows up in two buckets. The first is resident model weights: the parameters and experts that need to be loaded across serving replicas. The second is KV cache: the memory used to keep previous tokens available so the model can continue generating without recomputing the whole conversation or task history.

Agents make this much more extreme. A normal chat can be short. A useful agent may pull a codebase, Slack history, contract set, research corpus, or spreadsheet model into context, then iterate over the task many times. Longer context means more memory per active workload. More turns means more workload per finished output.

Goldman Sachs frames token consumption as rising 24x to 120 quadrillion tokens per month by 2030. Micron SVP Jeremy Werner has reportedly framed current context growth as roughly 30x per year. I do not assume that context keeps compounding anywhere near that rate. I do think it tells you which way the pressure points.
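
To see why the 30x framing is a direction rather than a forecast, it helps to put both figures on a per-year basis. This is a rough sketch, assuming the 24x token multiple spans the four years from a 2026 base to 2030, and treating the reported 30x per year purely as a stress input.

```python
# Compare the two growth framings on a per-year basis.
TOKEN_MULTIPLE = 24.0            # 5Q -> 120Q tokens per month
YEARS = 4                        # assumption: 2026 base year through 2030

# Annual growth implied by the token frame.
token_growth_per_year = TOKEN_MULTIPLE ** (1 / YEARS)

# What the reported context framing would imply if it compounded unchanged.
CONTEXT_GROWTH_PER_YEAR = 30.0   # reported framing, not a forecast
context_if_compounded = CONTEXT_GROWTH_PER_YEAR ** YEARS

print(f"Implied token growth: {token_growth_per_year:.2f}x per year")
print(f"30x per year compounded over {YEARS} years: {context_if_compounded:,.0f}x")
```

The compounded context figure is several orders of magnitude larger than the token multiple, so no base case can plausibly carry it forward unchanged.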

Memory mechanics

Context is not magic. It is state that has to sit somewhere.

At the request level, KV cache grows roughly with active tokens, active sessions, model depth and width, and bytes per stored value. For a fixed model, longer context is close to linear. Across model generations, bigger models can also raise the KV bytes required per token, so context and model size are not perfectly independent in the real world.

Simple mental model:

KV memory ~= active context x active sessions x memory per token

Optimization reduces the multiplier, but the workload still wants state.
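
The "memory per token" term can be made concrete with the standard accounting for a transformer KV cache: one key vector and one value vector per layer, per KV head, per token. The sketch below is illustrative only. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) and the session counts are hypothetical and do not describe any specific product or the calculator's inputs.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache stored per token: a key and a value per layer per head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(context_tokens: int, sessions: int, per_token_bytes: int) -> float:
    """KV memory ~= active context x active sessions x memory per token."""
    return context_tokens * sessions * per_token_bytes / 1e9


# Hypothetical model shape, chosen only to show the scaling.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)

short = kv_cache_gb(context_tokens=32_768, sessions=1_000, per_token_bytes=per_token)
long = kv_cache_gb(context_tokens=131_072, sessions=1_000, per_token_bytes=per_token)

print(f"KV bytes per token: {per_token:,}")
print(f"1,000 sessions at 32k context:  {short:,.0f} GB")
print(f"1,000 sessions at 128k context: {long:,.0f} GB")
```

For a fixed model the relationship is linear in context, as the text says: four times the context is four times the KV memory. Quantizing stored values or sharing KV heads lowers the per-token term but does not change the shape of the curve.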

Show the work

The simple version is that data center memory demand is driven by four things: token demand, model size, context, and optimization. But those variables are not all multiplied together in one naive index.

Model size primarily pressures the resident-weight bucket. Context primarily pressures the KV-cache bucket. Token growth increases the number of workloads served. Agent turns increase the number of inference passes per useful output. Optimization divides against all of it through routing, quantization, compression, paging, offload, reuse, better kernels, and better scheduling.

Traffic: tokens x HBM-served share
Weights: resident model footprint
Context: KV cache and sessions
Gross-up: scratch, redundancy, utilization
Output: required HBM

Required HBM = (resident weights + KV cache + scratch) x redundancy / usable utilization

This is less explosive than multiplying token growth by model growth by context growth. It is also more honest. But the additive model can still produce a very large number because the KV bucket gets big quickly once token demand, active sessions, and average useful context all rise at the same time.
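
A rough comparison shows how far apart the two approaches land. The sketch below multiplies the base-case dials published elsewhere in this piece (24x tokens, 1.6x model footprint, 1.92x context bucket) and sets the result against the growth implied by the calculator's own 2026 and 2030 outputs (4.8 EB and 29.9 EB). Treating the naive index as a straight product of those three dials is my simplification; napkin math built on raw context growth would be more extreme still.

```python
# Naive multiplicative index vs. the growth the bucket model actually produces.
TOKEN_GROWTH = 24.0       # 120Q / 5Q
MODEL_FOOTPRINT = 1.6     # base-case model footprint dial
CONTEXT_BUCKET = 1.92     # base-case context bucket dial

DEMAND_2026_EB = 4.8      # calculator output
DEMAND_2030_EB = 29.9     # calculator output

naive_multiple = TOKEN_GROWTH * MODEL_FOOTPRINT * CONTEXT_BUCKET
bucket_multiple = DEMAND_2030_EB / DEMAND_2026_EB

print(f"Naive product of dials: {naive_multiple:.1f}x")
print(f"Bucket model, 2026 to 2030: {bucket_multiple:.1f}x")
```

The bucket model grows by a single-digit multiple where the naive product grows by more than ten times that, and the additive version still ends in a deficit in the base case.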

Base-case bridge

How the calculator gets from token growth to 29.9 EB of 2030 HBM demand.

This is the current base case, not a proof. The point is to show which intermediate assumptions do the work.

Traffic: 24.0x (120Q tokens / 5Q base)
Serving replicas: 2.18x (traffic elasticity after throughput gains)
Session pressure: 5.74x (traffic translated into active state)
Context bucket: 1.92x (average resident context pressure)
Raw KV cache: 14.0 EB (1.27 EB x 5.74 x 1.92)
Gross-up: 1.49x (1.10 redundancy / 74% utilization)
Raw resident buckets: 20.1 EB (3.5 EB weights + 14.0 EB KV + 2.6 EB scratch)
Deployable HBM demand after redundancy and fixed utilization: 29.9 EB

The bridge shows why the base case is mostly a state thesis. Resident weights matter, but KV cache becomes roughly 70% of modeled 2030 demand after gross-up.
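
The last two steps of the bridge can be checked in a few lines. This is a sketch of the gross-up arithmetic only, using the raw bucket values, redundancy, and utilization stated above; the function and variable names are mine, not the calculator's.

```python
# Gross up the raw 2030 buckets and split deployable demand by bucket.
RAW_EB = {"weights": 3.5, "kv_cache": 14.0, "scratch": 2.6}
REDUNDANCY = 1.10
UTILIZATION = 0.74


def required_hbm(raw_buckets: dict, redundancy: float, utilization: float) -> dict:
    """Apply (bucket x redundancy / utilization) to every raw bucket."""
    gross_up = redundancy / utilization
    return {name: eb * gross_up for name, eb in raw_buckets.items()}


grossed = required_hbm(RAW_EB, REDUNDANCY, UTILIZATION)
total = sum(grossed.values())
kv_share = grossed["kv_cache"] / total

for name, eb in grossed.items():
    print(f"{name:>9}: {eb:5.1f} EB")
print(f"    total: {total:5.1f} EB, KV share {kv_share:.0%}")
```

Because the gross-up is a single multiplier applied to every bucket, the KV share is the same before and after it. The gross-up changes the level of demand, not the mix.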

Methodology

How the calculator turns assumptions into HBM demand

The important choice is that model size and context are not blindly multiplied together. They pressure different memory buckets.

1 Traffic index

Start with tokens, then haircut by the share of traffic served on HBM-heavy systems.

(tokens / 5Q) x HBM-served share

2 Resident weights

Scale the 2026 resident-weight bucket by serving replicas, model footprint, and weight efficiency.

1.75 EB x replica index x model scale / weight efficiency

3 KV cache

Scale the 2026 KV bucket by active session pressure and the context bucket.

1.27 EB x session index x context bucket

4 Gross-up

Add scratch memory, redundancy, and fixed utilization so the chart ties to deployable HBM.

(weights + KV + scratch) x redundancy / 74%

5 Supply

Compare demand against modern-equivalent HBM supply, including a separate China adjustment.

RoW supply + China gross x modern-equivalent share
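
Steps 1 through 3 can be sketched as small functions over a set of dials. This is an illustration of the published formulas, not the calculator's code. The replica and session indices are taken as direct inputs because the text does not publish how they are derived from traffic, and the HBM-served share of 1.0 is an assumption chosen so that the traffic index matches the 24.0x shown in the base-case bridge.

```python
from dataclasses import dataclass

WEIGHTS_2026_EB = 1.75   # 2026 resident-weight anchor from the text
KV_2026_EB = 1.27        # 2026 KV-cache anchor from the text


@dataclass
class Dials:
    tokens_q: float = 120.0          # 2030 tokens per month, quadrillions
    base_tokens_q: float = 5.0       # 2026 tokens per month, quadrillions
    hbm_served_share: float = 1.0    # assumption: base value is not published
    replica_index: float = 2.18      # taken as an input, derivation not published
    model_scale: float = 1.6
    weight_efficiency: float = 1.75
    session_index: float = 5.74      # taken as an input, derivation not published
    context_bucket: float = 1.92


def traffic_index(d: Dials) -> float:
    """Step 1: (tokens / 5Q) x HBM-served share."""
    return d.tokens_q / d.base_tokens_q * d.hbm_served_share


def raw_weights_eb(d: Dials) -> float:
    """Step 2: 1.75 EB x replica index x model scale / weight efficiency."""
    return WEIGHTS_2026_EB * d.replica_index * d.model_scale / d.weight_efficiency


def raw_kv_eb(d: Dials) -> float:
    """Step 3: 1.27 EB x session index x context bucket."""
    return KV_2026_EB * d.session_index * d.context_bucket


base = Dials()
print(f"Traffic index: {traffic_index(base):.1f}x")
print(f"Raw weights:   {raw_weights_eb(base):.1f} EB")
print(f"Raw KV cache:  {raw_kv_eb(base):.1f} EB")
```

Changing one dial at a time, for example `Dials(context_bucket=1.0)`, shows which bucket each assumption actually moves.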

Formula audit

The model is intentionally less explosive than the napkin math.

The most bullish napkin math multiplies tokens by model size by context. I do not think that is the best core formula. The cleaner approach is to separate memory into buckets, then let each bucket respond to the assumption that actually drives it.

Good: Weights and KV are separate

Model size mostly drives resident weights. Context mostly drives KV cache. Treating them as separate buckets avoids a false triple-count.

Good: Optimization is explicit

Routing, quantization, KV paging, reuse, compression, and offload are not ignored. They are visible dials.

Weakness: KV bytes per token can rise

Bigger models often have more layers and wider activations, which can increase memory per context token. The calculator only partly captures that through buckets.

Weakness: Concurrency is the hardest unknown

The model approximates active sessions. Real deployments depend on batching, latency targets, prompt reuse, offload, and how sticky agent sessions become.

What the calculator says

In the current base case, token demand rises from 5 quadrillion tokens per month in 2026 to 120 quadrillion by 2030. The model assumes routing and efficiency matter, so resident weights rise only to 5.2 EB of grossed HBM demand by 2030. The problem is that KV cache rises to 20.8 EB. The base case is not a model-size story. It is mostly a state story.

Calculator base case

By 2030, KV cache is almost 70% of modeled demand.

Base demand reaches 29.9 EB against 18.5 EB of effective supply, leaving an 11.4 EB deficit. The useful question is not whether that exact number is right. It is which assumptions you have to change to make the deficit disappear.

Chart: modeled HBM demand by year, stacked by weights, KV cache, and scratch.

2026: 4.8 EB
2027: 12.8 EB
2028: 18.2 EB
2029: 23.8 EB
2030: 29.9 EB
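
The shape of that path matters as much as the endpoint. A short sketch, using only the yearly totals above, computes year-over-year growth and the compound annual rate from 2026 to 2030.

```python
# Year-over-year growth and CAGR from the published base-case path.
DEMAND_EB = {2026: 4.8, 2027: 12.8, 2028: 18.2, 2029: 23.8, 2030: 29.9}

years = sorted(DEMAND_EB)
yoy = {y: DEMAND_EB[y] / DEMAND_EB[y - 1] for y in years[1:]}
cagr = (DEMAND_EB[years[-1]] / DEMAND_EB[years[0]]) ** (1 / (years[-1] - years[0])) - 1

for year, multiple in yoy.items():
    print(f"{year}: {multiple:.2f}x vs prior year")
print(f"CAGR {years[0]} to {years[-1]}: {cagr:.0%}")
```

In the published path, growth is front-loaded: the largest step is 2026 to 2027, and each later year grows by a smaller multiple than the one before.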

The assumptions are the product

The thesis should live or die on visible assumptions, not vibes. Here is the current base case in plain English:

Token demand: 5Q to 120Q/month (Goldman-style 24x frame)
Model footprint: 1.6x (larger resident models, partly offset by efficiency)
Weight efficiency: 1.75x (quantization, MoE, serving improvements)
Context bucket: 1.92x (active context grows, but far below 30x stress)
KV efficiency: 2.5x (paging, reuse, compression, offload)
Supply: 18.5 EB (18.0 EB RoW plus 0.5 EB China modern-equivalent)

None of that is meant to be sacred. If you think routing takes most token growth away from frontier systems, use the routing-bear case. If you think KV cache optimization is spectacular, use the context-disciplined case. If you think context becomes the core workload of the AI economy, use the memory-bull or stress cases. The calculator is there because the debate is the sensitivity.
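
One way to frame that sensitivity is to ask how much raw KV cache the base-case supply could absorb. The sketch below inverts the gross-up formula, holding raw weights at 3.5 EB and scratch at 2.6 EB. Holding scratch fixed is my simplification; in the real calculator scratch may move with the other buckets.

```python
# Break-even raw KV cache for the base case, holding weights and scratch fixed.
SUPPLY_EB = 18.5
WEIGHTS_EB = 3.5
KV_EB = 14.0
SCRATCH_EB = 2.6          # assumption: held fixed for this exercise
GROSS_UP = 1.10 / 0.74    # redundancy / utilization


def breakeven_raw_kv(supply_eb: float, weights_eb: float,
                     scratch_eb: float, gross_up: float) -> float:
    """Largest raw KV bucket for which grossed demand does not exceed supply."""
    return supply_eb / gross_up - weights_eb - scratch_eb


kv_max = breakeven_raw_kv(SUPPLY_EB, WEIGHTS_EB, SCRATCH_EB, GROSS_UP)
required_cut = 1 - kv_max / KV_EB
extra_efficiency = KV_EB / kv_max

print(f"Break-even raw KV: {kv_max:.1f} EB")
print(f"Required cut vs base: {required_cut:.0%}")
print(f"Equivalent extra KV efficiency: {extra_efficiency:.2f}x")
```

On these inputs the KV bucket would have to fall by a bit more than half to close the base-case deficit, which is roughly a further doubling of KV efficiency on top of the 2.5x already assumed.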

The range is the point

The calculator produces very different outcomes depending on what you believe about routing, context, and supply. That is not a bug. It is the thesis in miniature.

Routing bear + fast supply: 0.13x supply. Optimization wins, supply ramps hard.
Context disciplined: 0.38x supply. Strong token growth, but KV is aggressively managed.
Base case: 1.62x supply. 20x-plus token frame, moderate context growth, real efficiency.
Memory bull: 3.52x supply. Context becomes a normal enterprise and coding workload.
30x stress: 15.76x supply. Not a base case. A reminder of what the upper tail looks like.

Calculator sensitivity

Demand case x supply case, shown as 2030 demand divided by usable HBM supply.

Green is surplus. Amber is tight. Red is deficit. Each cell uses the same calculator v2.2 bucket formula.

Demand preset (2030 demand)       Tight supply (13.4 EB)   Base supply (18.5 EB)   Fast supply (27.0 EB)
Routing bear (3.6 EB)             0.27x, +9.8 EB           0.19x, +14.9 EB         0.13x, +23.4 EB
Context disciplined (7.0 EB)      0.53x, +6.4 EB           0.38x, +11.5 EB         0.26x, +20.0 EB
Base case (29.9 EB)               2.23x, -16.5 EB          1.62x, -11.4 EB         1.11x, -2.9 EB
Memory bull (65.1 EB)             4.86x, -51.7 EB          3.52x, -46.6 EB         2.41x, -38.1 EB
30x stress (291.5 EB)             21.75x, -278.1 EB        15.76x, -273.0 EB       10.80x, -264.5 EB

Positive gap means modeled supply exceeds demand. Negative gap means the calculator is short that many exabytes in 2030. The 30x stress row is included as an upper-tail test, not the base forecast.
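
The grid itself is simple arithmetic once the demand presets and supply cases are fixed. A minimal sketch that rebuilds each cell from the rounded totals shown above follows. Because it starts from rounded inputs, a cell can differ from the published grid in the last digit.

```python
# Rebuild the sensitivity grid: demand / supply ratio and supply - demand gap.
DEMAND_EB = {
    "Routing bear": 3.6,
    "Context disciplined": 7.0,
    "Base case": 29.9,
    "Memory bull": 65.1,
    "30x stress": 291.5,
}
SUPPLY_EB = {"Tight": 13.4, "Base": 18.5, "Fast": 27.0}


def cell(demand_eb: float, supply_eb: float) -> tuple:
    """Return (demand / supply, supply - demand); a positive gap is a surplus."""
    return round(demand_eb / supply_eb, 2), round(supply_eb - demand_eb, 1)


grid = {
    (d_name, s_name): cell(d_eb, s_eb)
    for d_name, d_eb in DEMAND_EB.items()
    for s_name, s_eb in SUPPLY_EB.items()
}

for d_name in DEMAND_EB:
    row = "  ".join(
        f"{grid[(d_name, s)][0]:6.2f}x {grid[(d_name, s)][1]:+8.1f} EB"
        for s in SUPPLY_EB
    )
    print(f"{d_name:<20} {row}")
```

Only the demand preset moves a cell from surplus to deficit here. Within the published supply range, no supply case flips the sign of any row.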

The bear case is not dumb. In the routing-bear case, memory demand collapses relative to supply because most traffic never touches the most HBM-heavy systems and efficiency compounds. The bull case is that the world does not stay that clean. Agents, coding, research, and enterprise workflows are exactly the use cases that want state, context, and iteration.

The supply side is not generic DRAM

The supply side is not just "more HBM exists." Usable HBM for frontier AI depends on capacity, yields, packaging, qualification, customer allocation, and whether the memory is close enough in performance to substitute for modern HBM in real clusters.

TrendForce says AI infrastructure should sustain strong HBM demand through 2026 and 2027, with HBM capacity per AI chip moving from 96/192 GB to 216/288 GB, Rubin Ultra expected at 384 GB per GPU, and HBM wafer input among the top three suppliers rising toward roughly 30% of total DRAM wafer input by the end of 2027. That is not a normal memory cycle. That is a capacity reallocation problem.

This is why China matters, but not in a simple way. China can add gross memory supply. The relevant question for frontier AI is how much of that supply is modern-equivalent HBM that can actually relieve NVIDIA, ASIC, and hyperscaler clusters. The calculator splits rest-of-world modern HBM from China gross HBM for that reason.
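
The split can be written as a one-line supply formula. The sketch below is illustrative: the base case publishes only the 18.0 EB rest-of-world figure and the 0.5 EB China modern-equivalent result, so the 2.0 EB gross figure and 25% equivalent share used here are hypothetical values chosen to reproduce that 0.5 EB.

```python
# Modern-equivalent supply: RoW supply + China gross x modern-equivalent share.
ROW_SUPPLY_EB = 18.0
CHINA_GROSS_EB = 2.0        # hypothetical: only the 0.5 EB product is published
CHINA_EQUIV_SHARE = 0.25    # hypothetical: only the 0.5 EB product is published
BASE_DEMAND_EB = 29.9


def modern_equivalent_supply(row_eb: float, china_gross_eb: float,
                             china_equiv_share: float) -> float:
    """Count China supply only to the extent it substitutes for modern HBM."""
    return row_eb + china_gross_eb * china_equiv_share


base_supply = modern_equivalent_supply(ROW_SUPPLY_EB, CHINA_GROSS_EB, CHINA_EQUIV_SHARE)
full_credit = modern_equivalent_supply(ROW_SUPPLY_EB, CHINA_GROSS_EB, 1.0)

print(f"Base supply: {base_supply:.1f} EB")
print(f"Supply if all China gross counted: {full_credit:.1f} EB")
print(f"Deficit at full credit: {BASE_DEMAND_EB - full_credit:.1f} EB")
```

Under these hypothetical inputs, giving China full credit narrows the base-case deficit but does not close it, which is why the equivalent share is kept as a separate dial rather than folded into gross supply.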

1 Wafer allocation

How much DRAM capacity shifts toward HBM without starving conventional markets?

2 Stack and yield

How much of that output reaches the right stack height, speed, and quality?

3 Packaging

Can the memory be integrated into accelerators and AI ASIC packages at scale?

4 Qualification

Does it qualify for the systems that actually set frontier AI capacity?

What would make this wrong?

The honest bear case is not "memory is cyclical." That is a label, not an argument. The real bear case is that AI demand routes around HBM faster than HBM demand compounds.

Routing wins

Most token growth moves to small models, ASICs, or memory-light systems.

Watch: HBM-served token share

KV gets solved

Compression, reuse, offload, and paging outrun context demand.

Watch: KV efficiency and context bucket

Context disappoints

Average resident context stays far below advertised max windows.

Watch: average useful context

Supply catches up

HBM4 qualification, packaging, yields, and capacity ramp faster than expected.

Watch: RoW supply and China equivalent share

Those are real risks. They are also measurable. If the base case is wrong, it should show up in the dials: lower HBM-served token share, lower context bucket, higher KV efficiency, faster supply, or more modern-equivalent China supply.

Why the rerating question is fair

Everyone knows memory stocks are cyclical, and they always look cheap right before the bubble bursts. But what if the structurally important product is no longer generic memory? What if the relevant product is frontier HBM, qualified into AI systems, attached to accelerators, and constrained by packaging, yields, bandwidth, and customer allocation?

We have already seen one traditionally cyclical semiconductor company rerate into a growth story because AI turned its product into the bottleneck. It is now the most valuable company in the world.

The point is not that memory becomes NVIDIA. It is that the old commodity prior may be too lazy. If AI compute keeps scaling, and if useful AI work becomes more stateful, then memory may become less cyclical, more strategic, and more margin-rich than investors are used to believing.

Use the calculator

The calculator is the living version of this thesis. Change token growth, context pressure, model weights, supply, and China modern-equivalent assumptions, then watch which bucket dominates. The point is not one exact forecast. The point is seeing which assumptions carry the argument.

Open the Memory Analyst calculator

Disclaimer

Memory Analyst is an independent research site for discussion and education. Nothing on this site is investment, legal, tax, accounting, or procurement advice, and nothing should be read as a recommendation to buy or sell any security, private investment, memory product, GPU, contract, or related asset.

The essay, calculator, charts, and model outputs are scenario analysis built from public information, estimates, simplifications, and user-selected assumptions. They may be wrong, stale, incomplete, internally inconsistent, or inappropriate for any specific decision. HBM supply, AI demand, model architecture, export controls, pricing, yields, packaging capacity, and serving efficiency can change quickly.

Do your own work, consult appropriate advisers, and check primary sources before making financial, technical, operational, or strategic decisions. Any companies, products, or securities mentioned are included for research context only. Memory Analyst may revise, replace, or remove assumptions without notice.