Living thesis / June 2026

Memory is cyclical. AI may be big enough to break the reflex.

The lazy bear case is that memory always mean-reverts. The bull case is that AI is turning memory from a commodity input into a bottleneck attached to every useful unit of frontier compute.

The consensus

Memory is cyclical, the stocks have run, and the ending is obvious.

The inversion

AI demand is not one demand shock. It is tokens, model size, context, turns, and memory attach.

The bottleneck

Compute can scale and still be starved if enough state cannot sit close to the accelerator.

The debate

Routing and optimization are real. The question is whether they outrun demand.

Primary

NVIDIA H100/Rubin specs, Goldman token forecast, TrendForce HBM capacity comments.

Reported

Citi-linked Rubin NAND attach, Michael Dell 625x framing, Jeremy Werner context-growth framing.

Modeled

2030 EB demand, KV share, supply gap, China modern-equivalent adjustment, scenario outputs.

Evidence ledger

The thesis is strongest when each source is allowed to prove only what it actually proves.

There is a temptation in AI infrastructure work to turn every datapoint into a conclusion. I think the cleaner method is to separate facts, reported datapoints, and model outputs, then ask which calculator dial each one should move.

Token demand: Goldman Sachs Research

Goldman gives the 24x token frame and the 120Q tokens/month 2030 anchor. It supports the traffic curve, not the assumption that every token is served on HBM-heavy systems.

Calculator dials: tokens, HBM-heavy share

Memory attach: NVIDIA H100 and Rubin specs

H100 and Rubin show the platform direction: much more HBM capacity and bandwidth per accelerator. They do not forecast total accelerator shipments.

Calculator dials: resident weights, bandwidth pressure, supply quality

KV mechanics: NVIDIA Dynamo and vLLM

Dynamo and vLLM support the idea that KV cache is real serving state. They also support the bear case because offload, paging, and reuse can reduce GPU memory pressure.

Calculator dials: context bucket, KV efficiency, KV residency

Supply tightness: TrendForce HBM work

TrendForce supports the near-term supply and wafer-allocation tension, including rising HBM wafer input and Rubin Ultra memory expectations. It does not settle 2030 modern-equivalent supply.

Calculator dials: RoW supply, China effective share, fast-supply case

System constraint: AI and Memory Wall

The memory-wall literature supports the broader claim that compute can scale faster than memory bandwidth and locality. It is a mechanism source, not a market-sizing source.

Calculator dials: weight efficiency, throughput, context pressure

Model output: Memory Analyst calculator

The calculator is the bridge between public evidence and the thesis. It can show what assumptions matter, but it cannot make uncertain inputs certain.

Use it to argue, not to outsource judgment.

The case in five numbers

The thesis is not that every memory cycle is gone. It is that this cycle may be pulled by several AI curves at once.

Tokens create traffic. Larger models create resident weight demand. Longer context creates KV-cache demand. More agent turns create more inference per useful output. Supply has to answer all of that with HBM that is fast enough, qualified enough, and close enough to the accelerator to matter.

Token frame: 24x (Goldman-style 2030 token growth)

Rubin HBM: 22 TB/s (NVIDIA-stated bandwidth per GPU)

Base demand: 29.9 EB (calculator 2030 HBM requirement)

Base supply: 18.5 EB (modern-equivalent 2030 supply)

KV share: ~70% (of modeled 2030 demand)

What has to be true

AI usage keeps shifting from short chats toward agents, coding, research, and enterprise workflows.
Average useful context rises enough that KV cache becomes a normal serving constraint.
Frontier systems still need meaningful resident model footprint despite MoE, distillation, and routing.
HBM4, advanced packaging, and customer qualification remain hard enough to keep modern-equivalent supply scarce.

What would change my mind

Most incremental tokens route to memory-light models without hurting product quality.
KV cache reuse, compression, paging, and offload improve faster than context demand.
Advertised context windows do not turn into large resident context in real workloads.
HBM supply expands quickly without margin discipline, yield friction, or packaging bottlenecks.

Memory Analyst essay, updated June 27, 2026

Start with the reflex

"Memory is cyclical, everyone knows that, and the recent run-up in memory names is an obvious bubble."

That is the easy view. It might even be right for a trade. Memory stocks have always been dangerous when they look cheap. Supply catches up, pricing rolls over, margins collapse, and the market remembers why it never wanted to pay a growth multiple for commodity bits.

But I think that reflex may be blinding people to the simple scale of what AI is doing to memory demand.

The old cycle was driven by PCs, phones, servers, inventory, and capital discipline. The AI cycle is different because the demand unit is not just "more devices." It is more accelerators, more memory per accelerator, more memory bandwidth per accelerator, larger resident models, longer context, and more inference turns per completed task.

The question is not whether memory is cyclical. It is whether AI is shocking the demand curve hard enough that the cycle stays tighter, higher margin, and more strategically important than investors expect.

The first clue was system memory attach

The first clue that there might be more to the story came when reporting around NVIDIA's next-generation Rubin platform started to show just how much memory the system would need. Citi-linked industry reporting said Rubin could require roughly 16 TB of NAND per GPU, or 1,152 TB per NVL72 rack. NVIDIA's own Rubin materials point in the same direction on HBM: the next generation is built around HBM4, up to 288 GB of HBM per GPU, and up to 22 TB/s of HBM bandwidth.

Accelerator memory step-up

H100 was already memory rich. Rubin is designed around much more bandwidth.

NVIDIA H100 SXM5: 80 GB HBM3, over 3 TB/s bandwidth

NVIDIA Rubin GPU: up to 288 GB HBM4, up to 22 TB/s bandwidth

Roadmap pressure

The physical product is already moving toward much larger memory stacks.

The thesis starts with observable attach rates before it gets to model assumptions. NVIDIA discloses H100 at 80 GB HBM and Rubin at up to 288 GB HBM4. TrendForce expects Rubin Ultra to move to 384 GB per GPU in 2027.

H100 SXM5: 80 GB, over 3 TB/s

Rubin GPU: 288 GB, up to 22 TB/s

Rubin Ultra: 384 GB (TrendForce 2027 expectation)

Top-three DRAM wafer input to HBM: 18% in 2025, 22% in 2026 (TrendForce estimate), 30% in 2027 (capacity crowd-out starts to bite)

That is where the memory story starts to look different. It is not just "AI needs more GPUs." It is "known GPU roadmaps are pulling a much larger and much faster memory stack behind them."
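The step-up is easy to check by hand. Here is a minimal sketch of the arithmetic, using only the figures quoted above. Two things in it are my assumptions: the 72-GPU rack count is read off the NVL72 name (it is also what 1,152 TB / 16 TB implies), and H100 bandwidth is entered as 3.0 TB/s even though NVIDIA says "over 3 TB/s", so the bandwidth ratio is an upper bound. The per-rack HBM figure is derived here, not reported by any source.

```python
# Figures quoted in the essay. "Over 3 TB/s" is entered as 3.0, so any ratio
# that divides by it is an upper bound rather than an exact multiple.
H100_HBM_GB = 80
H100_BANDWIDTH_TBS = 3.0
RUBIN_HBM_GB = 288           # "up to"
RUBIN_BANDWIDTH_TBS = 22.0   # "up to"
RUBIN_ULTRA_HBM_GB = 384     # TrendForce 2027 expectation
NAND_TB_PER_GPU = 16         # Citi-linked reporting, not NVIDIA guidance

# Assumption: an NVL72 rack holds 72 GPUs.
GPUS_PER_RACK = 72

capacity_step = RUBIN_HBM_GB / H100_HBM_GB
ultra_capacity_step = RUBIN_ULTRA_HBM_GB / H100_HBM_GB
bandwidth_step_max = RUBIN_BANDWIDTH_TBS / H100_BANDWIDTH_TBS
nand_tb_per_rack = NAND_TB_PER_GPU * GPUS_PER_RACK
hbm_tb_per_rack = RUBIN_HBM_GB * GPUS_PER_RACK / 1000  # derived, decimal TB

print(f"HBM capacity, H100 -> Rubin: {capacity_step:.1f}x")
print(f"HBM capacity, H100 -> Rubin Ultra: {ultra_capacity_step:.1f}x")
print(f"HBM bandwidth, H100 -> Rubin: up to {bandwidth_step_max:.1f}x")
print(f"NAND per rack: {nand_tb_per_rack} TB")
print(f"HBM per rack: {hbm_tb_per_rack:.1f} TB")
```

On these inputs capacity per GPU rises 3.6x while bandwidth rises by up to about 7.3x, which is why I describe Rubin as designed around bandwidth first.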

This is the old memory wall problem with a new forcing function. Compute can keep improving, but if models need more state to be close to the accelerator, memory manufacturers do not just need more bits. They need faster bits, stacked tighter, packaged better, qualified into frontier systems, and delivered into a supply chain that is already constrained.

Inference is where the math gets non-linear

The first AI infrastructure debate was mostly about training clusters and raw compute. The next debate is about inference systems that serve persistent, agentic, long-context workloads. Those systems are not just doing math. They are holding state.

That state shows up in two buckets. The first is resident model weights: the parameters and experts that need to be loaded across serving replicas. The second is KV cache: the memory used to keep previous tokens available so the model can continue generating without recomputing the whole conversation or task history.

Agents make this much more extreme. A normal chat can be short. A useful agent may pull a codebase, Slack history, contract set, research corpus, or spreadsheet model into context, then iterate over the task many times. Longer context means more memory per active workload. More turns means more workload per finished output.

Goldman Sachs frames token consumption as rising 24x to 120 quadrillion tokens per month by 2030. Third-party summaries of a recent episode of The Circuit with Micron SVP Jeremy Werner report a roughly 30x-per-year context-growth comment. I treat that as a directional anecdote, not a base-case input, and I do not assume context keeps compounding anywhere near that rate.

Memory mechanics

Context is not magic. It is state that has to sit somewhere.

At the request level, KV cache grows roughly with active tokens, active sessions, model depth and width, and bytes per stored value. For a fixed model, longer context is close to linear. Across model generations, bigger models can also raise the KV bytes required per token, so context and model size are not perfectly independent in the real world.

Simple mental model: KV memory ~= active context x active sessions x memory per token

Optimization reduces the multiplier, but the workload still wants state.
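To make "memory per token" concrete, here is a minimal sketch of the standard per-token KV footprint for a transformer decoder: one key vector and one value vector per layer. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) and the 131,072-token context are hypothetical, chosen for round numbers. They are not taken from any source in this essay, and the function names are mine.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(context_tokens: int, sessions: int, per_token_bytes: int) -> int:
    """The mental model: active context x active sessions x memory per token."""
    return context_tokens * sessions * per_token_bytes


# Hypothetical model shape and context length, for illustration only.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128,
                               bytes_per_value=2)
one_session = kv_cache_bytes(context_tokens=131_072, sessions=1,
                             per_token_bytes=per_token)

print(per_token)             # 327680 bytes, i.e. 320 KiB per token
print(one_session / 2**30)   # 40.0 GiB for a single full-context session
```

For a fixed model shape the result is linear in both context and sessions. Halving bytes per value or cutting KV heads shrinks the multiplier, which is the optimization lever described above.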

Show the work

The simple version is that data center memory demand is driven by four things: token demand, model size, context, and optimization. But those variables are not all multiplied together in one naive index.

Model size primarily pressures the resident-weight bucket. Context primarily pressures the KV-cache bucket. Token growth increases the number of workloads served. Agent turns increase the number of inference passes per useful output. Optimization divides against all of it through routing, quantization, compression, paging, offload, reuse, better kernels, and better scheduling.

Traffic: tokens x HBM-served share

Weights: resident model footprint

Context: KV cache and sessions

Gross-up: scratch, redundancy, utilization

Output: required HBM

Required HBM = (resident weights + KV cache + scratch) x redundancy / usable utilization

This is less explosive than multiplying token growth by model growth by context growth. It is also more honest. But the additive model can still produce a very large number because the KV bucket gets big quickly once token demand, active sessions, and average useful context all rise at the same time.

First-principles demand stack

Top-level demand is additive. The pressure inside each bucket is multiplicative.

The calculator is built this way because HBM is not one generic pool in the model. It is live serving state. Resident weights, KV cache, scratch space, redundancy, and usable utilization are different constraints, so they should not all be collapsed into one giant multiplier.

Traffic: Tokens create work

Token growth raises the amount of inference the system has to serve, but throughput gains and routing decide how much of that work turns into HBM-heavy replicas.

tokens x HBM-heavy share

Weights: Models create resident footprint

Larger frontier systems increase the model state that has to be available for serving, partly offset by quantization, MoE routing, distillation, and better layout.

base weights x replicas x model scale / weight efficiency

KV cache: Context creates live memory

Longer resident context, more stateful sessions, and more agent turns increase the amount of prior-token state kept close to the accelerator.

base KV x session pressure x context bucket

System overhead: Deployable systems need slack

Scratch memory, redundancy, and practical utilization turn raw resident state into the amount of HBM capacity a deployed fleet needs.

(weights + KV + scratch) x redundancy / utilization

Additive: Weights + KV + scratch

A longer context window does not make the model weights larger. A larger model does not automatically mean every token uses a proportionally larger KV cache.

Multiplicative: Context x sessions x KV bytes

Inside the KV bucket, the terms multiply. Bigger architectures can also raise KV bytes per token, so model size can still matter indirectly. NVIDIA's Dynamo work and vLLM's PagedAttention both exist because this state can become a real serving bottleneck.

Base-case bridge

How the calculator gets from token growth to 29.9 EB of 2030 HBM demand.

This is the current base case, not a proof. The point is to show which intermediate assumptions do the work.

Traffic: 24.0x (120Q tokens / 5Q base)

Serving replicas: 2.18x (traffic elasticity after throughput gains)

Session pressure: 5.74x (traffic translated into active state)

Context bucket: 1.92x (average resident context pressure)

Raw KV cache: 14.0 EB (1.27 EB x 5.74 x 1.92)

Gross-up: 1.49x (1.10 redundancy / 74% utilization)

Raw resident buckets: 20.1 EB (3.5 EB weights + 14.0 EB KV + 2.6 EB scratch)

Deployable HBM demand after redundancy and fixed utilization: 29.9 EB

The bridge shows why the base case is mostly a state thesis. Resident weights matter, but KV cache becomes roughly 70% of modeled 2030 demand after gross-up.
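The bridge can be replayed from its own published intermediate values. This is a minimal sketch of that arithmetic, not the calculator's code. The variable names are mine, and the weights and scratch figures are taken as the rounded values shown in the bridge.

```python
# Published bridge values (rounded as shown in the essay).
BASE_KV_EB = 1.27        # 2026 KV bucket
SESSION_PRESSURE = 5.74
CONTEXT_BUCKET = 1.92
WEIGHTS_EB = 3.5
SCRATCH_EB = 2.6
REDUNDANCY = 1.10
UTILIZATION = 0.74

raw_kv_eb = BASE_KV_EB * SESSION_PRESSURE * CONTEXT_BUCKET
raw_total_eb = WEIGHTS_EB + raw_kv_eb + SCRATCH_EB
gross_up = REDUNDANCY / UTILIZATION
deployable_eb = raw_total_eb * gross_up

# The gross-up scales every bucket equally, so KV's share is the same
# before and after it.
kv_share = raw_kv_eb / raw_total_eb

print(f"Raw KV cache: {raw_kv_eb:.1f} EB")
print(f"Raw resident buckets: {raw_total_eb:.1f} EB")
print(f"Gross-up: {gross_up:.2f}x")
print(f"Deployable HBM demand: {deployable_eb:.1f} EB")
print(f"KV share of demand: {kv_share:.0%}")
```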

Methodology

How the calculator turns assumptions into HBM demand

The important choice is that model size and context are not blindly multiplied together. They pressure different memory buckets.

1. Traffic index

Start with tokens, then haircut by the share of traffic served on HBM-heavy systems.

(tokens / 5Q) x HBM-served share

2. Resident weights

Scale the 2026 resident-weight bucket by serving replicas, model footprint, and weight efficiency.

1.75 EB x replica index x model scale / weight efficiency

3. KV cache

Scale the 2026 KV bucket by active session pressure and the context bucket.

1.27 EB x session index x context bucket

4. Gross-up

Add scratch memory, redundancy, and fixed utilization so the chart ties to deployable HBM.

(weights + KV + scratch) x redundancy / 74%

5. Supply

Compare demand against modern-equivalent HBM supply, including a separate China adjustment.

RoW supply + China gross x modern-equivalent share
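The supply step is the simplest of the five, and a minimal sketch shows how the China adjustment enters. The essay publishes only the product for China (0.5 EB of modern-equivalent supply in the base case), not the gross figure or the share separately. The 2.0 EB gross and 25% share below are placeholders I picked so the product matches; they are not sourced numbers.

```python
def effective_supply_eb(row_supply_eb: float, china_gross_eb: float,
                        china_modern_equivalent_share: float) -> float:
    """RoW supply + China gross supply x China modern-equivalent share."""
    if not 0.0 <= china_modern_equivalent_share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return row_supply_eb + china_gross_eb * china_modern_equivalent_share


# Base case: 18.0 EB RoW plus 0.5 EB China modern-equivalent.
# Hypothetical split of the 0.5 EB: 2.0 EB gross x 25% share.
base_supply = effective_supply_eb(18.0, china_gross_eb=2.0,
                                  china_modern_equivalent_share=0.25)
print(f"Base effective supply: {base_supply:.1f} EB")

# Same gross bits, higher quality: only the share dial moves.
print(f"With a 50% share: {effective_supply_eb(18.0, 2.0, 0.50):.1f} EB")
```

The design point is that gross China bits and usable China bits are separate dials, so a reader can argue about quality without touching the volume assumption.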

Model discipline

The point is not to choose the biggest formula. It is to choose the formula that maps to memory actually being used.

The original bull case can be written as tokens x model size x context. That is useful as an upper-tail warning, but it overstates the base case because model weights and KV cache are different memory buckets.

Upper-tail napkin math: 20x x 30x x 30x = 18,000x index

Good for showing how violent the tail could be if every curve compounds together. Too blunt for the base model.

Simplified base product: 24x x 1.6x x 1.92x = ~74x index

Still too aggressive because all three inputs do not multiply one shared memory pool.

Calculator bucket model: weights + KV + scratch = 6.2x HBM demand

Less explosive, but still produces 29.9 EB of 2030 demand against 18.5 EB of effective supply.

The first two are intuition indexes, not exabyte forecasts. The calculator turns the same debate into resident-weight, KV-cache, scratch, redundancy, utilization, and supply buckets.
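The three indexes can be recomputed in a few lines. This is a sketch of the comparison only. The bucket-model multiple is taken as the ratio of the published 2030 and 2026 demand figures (29.9 EB and 4.8 EB), which is my reading of where the 6.2x comes from.

```python
from math import prod

# Naive indexes: every growth curve multiplied into one shared pool.
upper_tail_index = prod([20, 30, 30])        # tokens x model size x context
simplified_index = prod([24, 1.6, 1.92])     # base-case inputs, still one pool

# Bucket model: growth in deployable HBM demand, 2026 -> 2030.
bucket_multiple = 29.9 / 4.8

print(f"Upper-tail napkin math: {upper_tail_index:,}x")
print(f"Simplified base product: {simplified_index:.1f}x")
print(f"Calculator bucket model: {bucket_multiple:.1f}x")
```

The same base-case inputs give roughly 74x when multiplied into one pool and roughly 6x when each input only pressures its own bucket, after elasticities and efficiency offsets.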

Formula audit

The model is intentionally less explosive than the napkin math.

The most bullish napkin math multiplies tokens by model size by context. I do not think that is the best core formula. The cleaner approach is to separate memory into buckets, then let each bucket respond to the assumption that actually drives it.

Good: Weights and KV are separate

Model size mostly drives resident weights. Context mostly drives KV cache. Treating them as separate buckets avoids a false triple-count.

Good: Optimization is explicit

Routing, quantization, KV paging, reuse, compression, and offload are not ignored. They are visible dials.

Weakness: KV bytes per token can rise

Bigger models often have more layers and wider activations, which can increase memory per context token. The calculator only partly captures that through buckets.

Weakness: Concurrency is the hardest unknown

The model approximates active sessions. Real deployments depend on batching, latency targets, prompt reuse, offload, and how sticky agent sessions become.

Formula appendix

The exact calculator formulas are deliberately simple enough to audit.

Every year from 2026 to 2030 is interpolated between the 2026 baseline and the selected 2030 assumption. The 2030 base case uses the formulas below.

Traffic: effectiveTraffic = (tokens / 5Q) x HBM-heavy share

Turns token demand into the portion of traffic exposed to frontier-style memory pressure.

Replicas: replicaIndex = effectiveTraffic^replicaElasticity / throughputEfficiency

Lets hardware and serving efficiency reduce the number of resident model copies needed for traffic.

Weights: weights = 1.75 EB x replicaIndex x modelScale / weightEfficiency

Scales the resident-weight bucket by serving footprint, model size, and compression/quantization gains.

KV cache: KV = 1.27 EB x effectiveTraffic^sessionElasticity x contextBucket

Bucket mode rolls stateful concurrency, resident context, KV bytes, residency, and KV efficiency into one dial.

Scratch: scratch = (weights + KV) x scratchPct

Models buffers, fragmentation, scheduling slack, and other serving overhead.

Deployable demand: demand = (weights + KV + scratch) x redundancy / 74%

Converts raw resident state into practical HBM capacity under fixed utilization.

Converts raw resident state into practical HBM capacity under fixed utilization.

In detailed context mode, the KV term expands to 1.27 EB x sessionIndex x contextScale x kvBytesScale x kvResidency / kvEfficiency. In detailed supply mode, usable supply is RoW supply + China gross supply x China modern-equivalent share.
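The bucket-mode formulas are short enough to write out end to end. What follows is a sketch, not the calculator's source. Three parameters are not stated anywhere in this essay, and I back-solved them from the published bridge values: replicaElasticity of about 0.90, sessionElasticity of about 0.55, and scratchPct of about 15%. Treat those as my assumptions. The other defaults (8x throughput, 1.6x model scale, 1.75x weight efficiency, 1.92x context bucket, 1.10 redundancy, 74% utilization, 100% HBM-heavy share) are quoted in the essay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumptions:
    tokens_q: float = 120.0             # 2030 tokens per month, quadrillions
    base_tokens_q: float = 5.0          # 2026 baseline
    hbm_heavy_share: float = 1.0        # base case: 100%
    replica_elasticity: float = 0.90    # ASSUMPTION: back-solved, not published
    throughput_efficiency: float = 8.0
    model_scale: float = 1.6
    weight_efficiency: float = 1.75
    session_elasticity: float = 0.55    # ASSUMPTION: back-solved, not published
    context_bucket: float = 1.92
    scratch_pct: float = 0.15           # ASSUMPTION: back-solved, not published
    redundancy: float = 1.10
    utilization: float = 0.74


def hbm_demand(a: Assumptions) -> dict:
    """Apply the bucket-mode formulas in order; all memory figures in EB."""
    effective_traffic = (a.tokens_q / a.base_tokens_q) * a.hbm_heavy_share
    replica_index = (effective_traffic ** a.replica_elasticity
                     / a.throughput_efficiency)
    session_index = effective_traffic ** a.session_elasticity
    weights = 1.75 * replica_index * a.model_scale / a.weight_efficiency
    kv = 1.27 * session_index * a.context_bucket
    scratch = (weights + kv) * a.scratch_pct
    demand = (weights + kv + scratch) * a.redundancy / a.utilization
    return {
        "effective_traffic": effective_traffic,
        "replica_index": replica_index,
        "session_index": session_index,
        "weights_eb": weights,
        "kv_eb": kv,
        "scratch_eb": scratch,
        "demand_eb": demand,
    }


base = hbm_demand(Assumptions())
for name, value in base.items():
    print(f"{name}: {value:.2f}")
```

With the back-solved parameters the sketch lands on the bridge figures to the published rounding: a 2.18x replica index, 5.74x session pressure, and about 29.9 EB of deployable demand. That agreement is a consistency check on my reading of the formulas, not confirmation of the calculator's internals.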

What the calculator says

In the current base case, token demand rises from 5 quadrillion tokens per month in 2026 to 120 quadrillion by 2030. The model assumes routing and efficiency matter, so resident weights only rise to 5.2 EB of grossed HBM demand by 2030. The problem is that KV cache rises to 20.8 EB. The base case is not a model-size story. It is mostly a state story.

Calculator base case

By 2030, KV cache is almost 70% of modeled demand.

Base demand reaches 29.9 EB against 18.5 EB of effective supply, leaving an 11.4 EB deficit. The useful question is not whether that exact number is right. It is which assumptions you have to change to make the deficit disappear.

Modeled HBM demand by year (chart series: weights, KV cache, scratch):

2026: 4.8 EB
2027: 7.1 EB
2028: 11.0 EB
2029: 17.8 EB
2030: 29.9 EB

Calculator-derived driver view

The base case does not require supply to stand still. It requires demand to outrun a nearly 5x supply ramp.

From 2026 to 2030, the calculator's base case adds 25.1 EB of HBM demand and 14.8 EB of modern-equivalent supply. The gap widens because roughly three quarters of the incremental demand comes from KV cache.

Demand growth: 4.8 EB -> 29.9 EB (+25.1 EB, or 6.2x)

Supply growth: 3.75 EB -> 18.5 EB (+14.8 EB, or 4.9x)

Gap growth: -1.0 EB -> -11.4 EB (gap widens by 10.4 EB)

Incremental 2026 to 2030 demand: +25.1 EB total

Weights: +2.7 EB; KV cache: +19.0 EB; scratch: +3.4 EB

This is the cleanest version of the thesis: even after explicit efficiency offsets, the base case is mostly a bet that useful AI workloads become more stateful.
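The driver view reduces to a few ratios. Here is a minimal sketch using the published endpoints. The compound annual growth rates are derived here and do not appear in the calculator output, so treat them as my arithmetic rather than a sourced figure.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two levels."""
    return (end / start) ** (1 / years) - 1


YEARS = 2030 - 2026
demand_2026, demand_2030 = 4.8, 29.9      # EB
supply_2026, supply_2030 = 3.75, 18.5     # EB
incremental_eb = {"weights": 2.7, "kv_cache": 19.0, "scratch": 3.4}

demand_multiple = demand_2030 / demand_2026
supply_multiple = supply_2030 / supply_2026
demand_cagr = cagr(demand_2026, demand_2030, YEARS)
supply_cagr = cagr(supply_2026, supply_2030, YEARS)
gap_2030 = supply_2030 - demand_2030
kv_share_of_increment = incremental_eb["kv_cache"] / sum(incremental_eb.values())

print(f"Demand: {demand_multiple:.1f}x, {demand_cagr:.0%} per year")
print(f"Supply: {supply_multiple:.1f}x, {supply_cagr:.0%} per year")
print(f"2030 gap: {gap_2030:+.1f} EB")
print(f"KV share of incremental demand: {kv_share_of_increment:.0%}")
```

On these endpoints demand compounds at roughly 58% a year against roughly 49% for supply, and KV cache accounts for about 76% of the incremental demand.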

The assumptions are the product

The thesis should live or die on visible assumptions, not vibes. Here is the current base case in plain English:

Token demand: 5Q to 120Q/month (Goldman-style 24x frame)

Model footprint: 1.6x (larger resident models, partly offset by efficiency)

Weight efficiency: 1.75x (quantization, MoE, serving improvements)

Context bucket: 1.92x (active context grows, but far below 30x stress)

KV efficiency: 2.5x (paging, reuse, compression, offload)

Supply: 18.5 EB (18.0 EB RoW plus 0.5 EB China modern-equivalent)

What the base case is not assuming

The base case is a bull case, but it is not the most explosive version of the argument.

The important discipline is that the calculator does not simply take the wildest recent datapoint and compound it for five years. The stress cases exist, but the base case tries to underwrite a more survivable version of the memory thesis.

Model size: Not 3x per year

The base case uses a 1.6x resident model-footprint increase by 2030, not a 30x model-size shock. The 30x case is kept as an explicit stress row.

Context: Not 30x every year

The base context bucket is 1.92x after efficiency and residency offsets. That is deliberately far below the agentic-context anecdotes.

Efficiency: Not ignored

The base case gives throughput an 8x improvement, weight memory a 1.75x improvement, and KV management a 2.5x improvement.

Supply: Not frozen

Base supply rises to 18.5 EB of modern-equivalent HBM by 2030, including a China adjustment, before the model declares a gap.

None of that is meant to be sacred. If you think routing takes most token growth away from frontier systems, use the routing-bear case. If you think KV cache optimization is spectacular, use the context-disciplined case. If you think context becomes the core workload of the AI economy, use the memory-bull or stress cases. The calculator is there because the debate is the sensitivity.

The range is the point

The calculator produces very different outcomes depending on what you believe about routing, context, and supply. That is not a bug. It is the thesis in miniature.

Routing bear + fast supply: 0.13x supply. Optimization wins, supply ramps hard.

Context disciplined: 0.38x supply. Strong token growth, but KV is aggressively managed.

Base case: 1.62x supply. 20x-plus token frame, moderate context growth, real efficiency.

Memory bull: 3.52x supply. Context becomes a normal enterprise and coding workload.

30x stress: 15.76x supply. Not a base case. A reminder of what the upper tail looks like.

Open the exact cases

Every scenario in the essay can be opened in the live calculator.

These links load the exact demand and supply presets behind the sensitivity table, so readers can change one dial and see where the thesis breaks.

Routing bear + fast supply (0.13x supply): the cleanest bear case
Context disciplined (0.38x supply): the KV-optimization win
Base case (1.62x supply): the current thesis case
Base + fast supply (1.11x supply): the supply-catches-up test
Memory bull (3.52x supply): the stateful-agent case
30x stress (15.76x supply): the upper-tail warning

Calculator sensitivity

Demand case x supply case, shown as 2030 demand divided by usable HBM supply.

Green is surplus. Amber is tight. Red is deficit. Each cell uses the same calculator v2.2 bucket formula.

Demand preset | Tight supply (13.4 EB) | Base supply (18.5 EB) | Fast supply (27.0 EB)
Routing bear (3.6 EB demand) | 0.27x, +9.8 EB | 0.19x, +14.9 EB | 0.13x, +23.4 EB
Context disciplined (7.0 EB demand) | 0.53x, +6.4 EB | 0.38x, +11.5 EB | 0.26x, +20.0 EB
Base case (29.9 EB demand) | 2.23x, -16.5 EB | 1.62x, -11.4 EB | 1.11x, -2.9 EB
Memory bull (65.1 EB demand) | 4.86x, -51.7 EB | 3.52x, -46.6 EB | 2.41x, -38.1 EB
30x stress (291.5 EB demand) | 21.75x, -278.1 EB | 15.76x, -273.0 EB | 10.80x, -264.5 EB

Positive gap means modeled supply exceeds demand. Negative gap means the calculator is short that many exabytes in 2030. The 30x stress row is included as an upper-tail test, not the base forecast.
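Each cell is two numbers: demand divided by supply, and supply minus demand. A minimal sketch that rebuilds the matrix from the published demand presets and supply cases follows. It takes the 2030 demand figures as given rather than recomputing them from the bucket formula.

```python
DEMAND_EB = {
    "Routing bear": 3.6,
    "Context disciplined": 7.0,
    "Base case": 29.9,
    "Memory bull": 65.1,
    "30x stress": 291.5,
}
SUPPLY_EB = {"Tight": 13.4, "Base": 18.5, "Fast": 27.0}


def matrix_cell(demand_eb: float, supply_eb: float) -> tuple:
    """Return (demand / supply, supply - demand). A positive gap is a surplus."""
    return demand_eb / supply_eb, supply_eb - demand_eb


for demand_name, demand in DEMAND_EB.items():
    cells = []
    for supply_name, supply in SUPPLY_EB.items():
        ratio, gap = matrix_cell(demand, supply)
        cells.append(f"{supply_name}: {ratio:.2f}x, {gap:+.1f} EB")
    print(f"{demand_name} ({demand} EB): " + "; ".join(cells))
```

Because the demand presets here are the rounded figures from the table, a cell can differ from the published one in the last digit. For example, 7.0 / 13.4 gives 0.52x rather than 0.53x, which suggests the calculator carries slightly more than 7.0 EB for that preset.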

Breakeven hurdles

What has to change to erase the base-case 2030 gap?

These are one-variable calculator solves. They are not forecasts. They show how far a single assumption has to move, holding the rest of the base case constant, for 18.5 EB of supply to cover demand.

Supply answer: 18.5 EB -> 29.9 EB

Base supply needs +11.4 EB, or roughly +62%.

Context answer: 1.92x -> 1.00x

Average context/KV pressure has to fall about 48%.

Routing answer: 100% -> 46%

More than half of HBM-heavy token exposure has to route away.

Token answer: 120Q -> 55Q/month

The Goldman-style 24x token frame has to become about 11x.

Efficiency-only answer: Not enough

Even very high throughput or weight efficiency leaves roughly 25 EB of demand, because KV still dominates.

The base case is fragile if routing and context discipline are both much better than assumed. It is not fragile to weight-efficiency improvements alone.
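A one-variable solve is just root finding on the demand formula: hold everything else at the base case and move one dial until demand equals 18.5 EB. The sketch below uses bisection, which is my choice of method and not necessarily how the calculator does it. The demand function is a compact version of the bucket formula, with the same back-solved assumptions as elsewhere (replica elasticity 0.90, session elasticity 0.55, scratch 15%), none of which are published.

```python
def demand_eb(tokens_q: float = 120.0, share: float = 1.0,
              context: float = 1.92) -> float:
    """Compact bucket formula. 0.90, 0.55 and 0.15 are back-solved assumptions."""
    traffic = tokens_q / 5.0 * share
    weights = 1.75 * (traffic ** 0.90 / 8.0) * 1.6 / 1.75
    kv = 1.27 * traffic ** 0.55 * context
    return (weights + kv) * (1 + 0.15) * 1.10 / 0.74


def solve_increasing(f, lo: float, hi: float, target: float,
                     tol: float = 1e-9) -> float:
    """Bisection for f(x) == target, assuming f is increasing on [lo, hi]."""
    if not f(lo) <= target <= f(hi):
        raise ValueError("target is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


SUPPLY_EB = 18.5

share = solve_increasing(lambda s: demand_eb(share=s), 0.01, 1.0, SUPPLY_EB)
context = solve_increasing(lambda c: demand_eb(context=c), 0.1, 1.92, SUPPLY_EB)
tokens = solve_increasing(lambda t: demand_eb(tokens_q=t), 1.0, 120.0, SUPPLY_EB)
supply_increase = demand_eb() / SUPPLY_EB - 1

print(f"Routing breakeven: HBM-heavy share of {share:.0%}")
print(f"Context breakeven: context bucket of {context:.2f}x")
print(f"Token breakeven: {tokens:.0f}Q tokens/month")
print(f"Supply breakeven: +{supply_increase:.0%} over base supply")
```

Under these assumptions the solves land close to the published hurdles: a share near 46%, a context bucket near 1.00x, and a token path near 55Q/month. The routing and token answers coincide because both enter the formula only through effective traffic.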

The bear case is not dumb. In the routing-bear case, memory demand collapses relative to supply because most traffic never touches the most HBM-heavy systems and efficiency compounds. The bull case is that the world does not stay that clean. Agents, coding, research, and enterprise workflow are exactly the use cases that want state, context, and iteration.

The supply side is not generic DRAM

The supply side is not just "more HBM exists." Usable HBM for frontier AI depends on capacity, yields, packaging, qualification, customer allocation, and whether the memory is close enough in performance to substitute for modern HBM in real clusters.

TrendForce says AI infrastructure should sustain strong HBM demand through 2026 and 2027, with HBM capacity per AI chip moving from 96/192 GB to 216/288 GB, Rubin Ultra expected at 384 GB per GPU, and HBM wafer input among the top three suppliers rising toward roughly 30% of total DRAM wafer input by the end of 2027. That is not a normal memory cycle. That is a capacity reallocation problem.

This is why China matters, but not in a simple way. China can add gross memory supply. The relevant question for frontier AI is how much of that supply is modern-equivalent HBM that can actually relieve NVIDIA, ASIC, and hyperscaler clusters. The calculator splits rest-of-world modern HBM from China gross HBM for that reason.

1. Wafer allocation

How much DRAM capacity shifts toward HBM without starving conventional markets?

2. Stack and yield

How much of that output reaches the right stack height, speed, and quality?

3. Packaging

Can the memory be integrated into accelerators and AI ASIC packages at scale?

4. Qualification

Does it qualify for the systems that actually set frontier AI capacity?

What would make this wrong?

The honest bear case is not "memory is cyclical." That is a label, not an argument. The real bear case is that AI demand routes around HBM faster than HBM demand compounds.

Routing wins

Most token growth moves to small models, ASICs, or memory-light systems.

Watch: HBM-served token share

KV gets solved

Compression, reuse, offload, and paging outrun context demand.

Watch: KV efficiency and context bucket

Context disappoints

Average resident context stays far below advertised max windows.

Watch: average useful context

Supply catches up

HBM4 qualification, packaging, yields, and capacity ramp faster than expected.

Watch: RoW supply and China equivalent share

Those are real risks. They are also measurable. If the base case is wrong, it should show up in the dials: lower HBM-served token share, lower context bucket, higher KV efficiency, faster supply, or more modern-equivalent China supply.

Living monitor

This thesis should get updated by evidence, not defended like a slogan.

The calculator tells us where the argument is fragile. The monitoring job is to watch the real-world signals that map to those dials.

Traffic: Bull signal

Agent, coding, search, and enterprise workflows keep token demand tracking the 20x-plus frame.

Bear trigger: 2030 token path looks closer to 55Q/month than 120Q/month.

Context: Bull signal

Real workloads use long resident context, repeated turns, and sticky sessions rather than just advertised max windows.

Bear trigger: average context pressure falls toward 1.0x or below.

Routing: Bear signal

Most incremental tokens route to small models, memory-light ASICs, or cached/offloaded systems without hurting product quality.

Base-case breakeven: HBM-heavy share falls from 100% toward 46%.

KV efficiency: Bear signal

KV reuse, compression, paging, and offload improve faster than agentic context and concurrency grow.

Watch: KV cache stops dominating incremental HBM demand.

Supply: Bull signal

HBM4 yields, packaging, and qualification remain hard enough that modern-equivalent supply does not approach 30 EB by 2030.

Bear trigger: fast supply becomes the conservative case and pricing rolls over.

China: Bear signal

China brings meaningful modern-equivalent HBM online and it becomes a real substitute for constrained frontier systems.

Watch: the China effective share, not just gross China bits.
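The monitor can be written down as a small set of thresholds. This is a sketch of one way to do it, not a feature of the calculator. It uses the one-variable breakevens quoted in this essay (55Q tokens/month, a 46% HBM-heavy share, a 1.00x context bucket, 29.9 EB of supply) and treats each as a hard trigger, which is a simplification: in practice several dials would move at once and none would need to move that far. The dial names are mine.

```python
# One-variable breakevens from the essay. Each one alone erases the
# base-case 2030 gap if every other assumption is held constant.
BREAKEVENS = {
    "tokens_q_per_month": ("at_or_below", 55.0),
    "hbm_heavy_share": ("at_or_below", 0.46),
    "context_bucket": ("at_or_below", 1.00),
    "effective_supply_eb": ("at_or_above", 29.9),
}


def crossed_breakevens(observed: dict) -> list:
    """Return the dials whose observed value has crossed its breakeven."""
    crossed = []
    for dial, (direction, threshold) in BREAKEVENS.items():
        if dial not in observed:
            continue  # no reading for this dial yet
        value = observed[dial]
        if direction == "at_or_below" and value <= threshold:
            crossed.append(dial)
        elif direction == "at_or_above" and value >= threshold:
            crossed.append(dial)
    return crossed


base_case = {
    "tokens_q_per_month": 120.0,
    "hbm_heavy_share": 1.0,
    "context_bucket": 1.92,
    "effective_supply_eb": 18.5,
}

print(crossed_breakevens(base_case))                               # []
print(crossed_breakevens({**base_case, "hbm_heavy_share": 0.40}))  # ['hbm_heavy_share']
```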

Why the rerating question is fair

Everyone knows memory stocks are cyclical, and they always look cheap right before the bubble bursts. But what if the structurally important product is no longer generic memory? What if the relevant product is frontier HBM, qualified into AI systems, attached to accelerators, and constrained by packaging, yields, bandwidth, and customer allocation?

We have already seen one traditionally cyclical semiconductor company rerate into a growth story because AI turned its product into the bottleneck. It is now the most valuable company in the world.

The point is not that memory becomes NVIDIA. It is that the old commodity prior may be too lazy. If AI compute keeps scaling, and if useful AI work becomes more stateful, then memory may become less cyclical, more strategic, and more margin-rich than investors are used to believing.

Source notes

Goldman Sachs Research frames agentic AI token consumption as potentially rising 24x to 120 quadrillion tokens per month by 2030.
NVIDIA's H100 architecture note says H100 SXM5 supports 80 GB of HBM3 and over 3 TB/s of memory bandwidth.
NVIDIA's Vera Rubin platform note describes up to 288 GB of HBM4 and up to 22 TB/s of memory bandwidth per Rubin GPU.
Wccftech and TechTarget summarize Citi-linked reporting on 16 TB NAND per Rubin GPU / 1,152 TB per NVL72 rack. I treat that as reported industry data, not audited NVIDIA guidance.
The Circuit episode with Micron's Jeremy Werner is the primary podcast reference for the memory/context discussion. BigGo's third-party episode summary reports the 30x context-growth quote; I treat that as a directional anecdote, not a calculator input.
NVIDIA Dynamo, vLLM PagedAttention, and LMCache research are the source spine for the KV-cache and context mechanics.
TrendForce is the source for the HBM supply, wafer allocation, pricing power, and 2026-2027 capacity framing.
Michael Dell's market framing is useful intuition for memory-per-accelerator x accelerator-count growth, but I do not use it as a bottom-up forecast.
The AI and Memory Wall is useful background for why memory bandwidth and locality can gate AI system performance.

Use the calculator

The calculator is the living version of this thesis. Change token growth, context pressure, model weights, supply, and China modern-equivalent assumptions, then watch which bucket dominates. The point is not one exact forecast. The point is seeing which assumptions carry the argument.

Open the Memory Analyst calculator

Thesis in one line

AI turns memory from a cyclical commodity input into a bottleneck attached to frontier compute.

Base case output

2030 HBM demand: 29.9 EB
Effective supply: 18.5 EB
Gap: -11.4 EB

Biggest swing factors

HBM-served token share
Average resident context
KV cache efficiency
Modern-equivalent supply

Best objection

Routing, distillation, KV offload, and smaller specialist models could absorb more token growth than the base case assumes.

Disclaimer

Memory Analyst is an independent research site for discussion and education. Nothing on this site is investment, legal, tax, accounting, or procurement advice, and nothing should be read as a recommendation to buy or sell any security, private investment, memory product, GPU, contract, or related asset.

The essay, calculator, charts, and model outputs are scenario analysis built from public information, estimates, simplifications, and user-selected assumptions. They may be wrong, stale, incomplete, internally inconsistent, or inappropriate for any specific decision. HBM supply, AI demand, model architecture, export controls, pricing, yields, packaging capacity, and serving efficiency can change quickly.

Do your own work, consult appropriate advisers, and check primary sources before making financial, technical, operational, or strategic decisions. Any companies, products, or securities mentioned are included for research context only. Memory Analyst may revise, replace, or remove assumptions without notice.