The next big chip from Korea – flash successor to HBM: KAIST expert

By Candice Kim Posted : February 10, 2026, 16:29 Updated : February 10, 2026, 16:51
Graphics by AJP Song Ji-yoon

SEOUL, February 10 (AJP) - High-bandwidth memory (HBM) has powered the current AI boom, but soaring costs and capacity limits are shifting attention to what comes next. The next big chip to come from Korea's memory powerhouse will be a flash-based successor designed for the inference-heavy phase of AI, according to a chip expert.

“Now is the time to move beyond HBM. The era of high-bandwidth flash is coming — and it will unfold within the next decade,” Kim Jung-ho, a professor at Korea Advanced Institute of Science and Technology (KAIST), said Tuesday during a livestreamed session hosted by the KAIST Tera Lab, which he leads.

Kim’s forecast reflects a growing industry concern that memory — not compute — is becoming the primary bottleneck in scaling next-generation AI systems. As large language models expand and inference workloads multiply, the challenge is no longer just faster processors, but how much data can be stored close enough to feed them efficiently.

At the center of the problem is the explosive growth of so-called key-value (KV) caches — temporary memory that stores intermediate data during AI inference in transformer-based models. These caches expand rapidly with longer context windows, pushing HBM to its practical limits in both capacity and cost.

HBM functions as ultra-fast “working memory” placed next to GPUs, but its capacity growth has lagged behind model demands. According to publicly available specifications from Nvidia, the H100 accelerator carries 80 gigabytes of HBM, while the H200 offers 141 gigabytes. Industry tracker TrendForce estimates the upcoming Blackwell-based B200 will reach roughly 192 gigabytes, with the next-generation Rubin platform expected to approach 288 gigabytes.

 
Graphics by AJP Song Ji-yoon

Even those increases fall short. A single modern AI model — once intermediate data is included — can already consume hundreds of gigabytes of memory. Running a 70-billion-parameter model such as Llama-3 in FP16, for example, requires about 140 gigabytes just for model weights. When long-context inference is added, memory demand quickly exceeds what a single GPU can supply, limiting batch size and the number of users that can be served concurrently.
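
The weight figure is simple arithmetic. As an illustrative sketch (not from the article), assuming 2 bytes per FP16 parameter:

    # Illustrative sketch: memory needed just to hold model weights in FP16.
    # Assumes 2 bytes per parameter; runtime overhead and activations are excluded.
    def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
        return num_params * bytes_per_param / 1e9

    print(weight_memory_gb(70e9))  # ~140.0 GB for a 70-billion-parameter model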

The strain is especially acute in transformer architectures, where KV caches scale linearly with context length. For a 70-billion-parameter model, KV cache requirements are estimated at roughly 45 to 64 gigabytes for a 128,000-token context window. At the million-token scale, that figure swells to roughly 327 to 512 gigabytes — two to three times larger than the model weights themselves. For 400-billion-parameter-class models, supporting million-token contexts would push memory needs into the terabyte range, far beyond today’s HBM-only configurations.
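
Those estimates follow from standard KV-cache arithmetic. The sketch below is illustrative only; the layer count, key-value head count and head dimension are assumptions modeled loosely on a 70-billion-parameter configuration with grouped-query attention, not figures cited in the article, and different head counts or precisions shift the result across the quoted ranges:

    # Illustrative sketch: KV-cache size grows linearly with context length.
    # Per token, each layer stores one key and one value vector per KV head.
    # The defaults (80 layers, 8 KV heads, head dimension 128, 2-byte FP16 values)
    # are assumptions for a 70B-class model, not figures from the article.
    def kv_cache_gb(context_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
        bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # keys and values
        return context_len * bytes_per_token / 1e9

    print(kv_cache_gb(128_000))    # ~42 GB for a 128,000-token window
    print(kv_cache_gb(1_000_000))  # ~328 GB at the million-token scale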

“This is why performance is no longer determined by compute alone,” Kim said. “Once decoding begins, throughput becomes memory-bound. Capacity matters as much as bandwidth.”
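
Kim's memory-bound point can be illustrated with a rough roofline-style estimate. At batch size one, each decoded token requires streaming roughly the full set of weights plus the accumulated KV cache from memory, so sustained bandwidth rather than peak compute caps token throughput. The numbers below are placeholders for illustration, not figures cited in the session:

    # Illustrative roofline-style estimate for memory-bound decoding at batch size 1.
    # Assumes each decode step streams the full weights plus the KV cache from memory;
    # the bandwidth value is a placeholder, not a measured or vendor-quoted figure.
    def max_tokens_per_sec(weights_gb, kv_cache_gb, mem_bandwidth_gb_per_s):
        return mem_bandwidth_gb_per_s / (weights_gb + kv_cache_gb)

    # e.g. 140 GB of weights, 328 GB of KV cache, 5,000 GB/s of aggregate HBM bandwidth
    print(max_tokens_per_sec(140, 328, 5_000))  # ~10.7 tokens/s per sequence, at best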

To bridge that gap, Kim argues that the industry will need high-bandwidth flash (HBF) — a stacked, NAND-based memory technology designed to offer far larger capacity than HBM while delivering much higher bandwidth than conventional storage. Rather than replacing HBM, HBF would serve as a complementary tier within a hierarchical memory system.

Kim likens the relationship to a workspace: HBM is “the bookshelf next to your desk,” optimized for speed but limited in size, while HBF functions as “the library,” holding bulky data such as KV caches and long-context inputs that can be streamed back to the GPU when needed.

The push toward longer context windows is already visible across the AI industry. Google’s Gemini 1.5 supports up to one to two million tokens, Anthropic’s Claude models target 200,000 to one million tokens, and Meta has signaled even longer contexts for future Llama models. Combined with retrieval-augmented generation and agent-based AI that accumulates session history, such workloads dramatically expand memory footprints.

From a performance standpoint, HBM3e delivers roughly 1.2 terabytes per second of bandwidth per stack with latencies measured in tens of nanoseconds — but at a steep cost per gigabyte. Enterprise-grade solid-state drives, by contrast, offer only 10 to 30 gigabytes per second and latencies in the tens of microseconds, albeit at much lower cost. HBF aims to occupy the middle ground, targeting hundreds of gigabytes per second with microsecond-level latency, making it suitable for tiered AI memory architectures.
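
The practical gap between those tiers shows up in a simple transfer-time estimate. The sketch below uses the ballpark figures above; the HBF value is an assumed midpoint of the hundreds-of-gigabytes-per-second target, not a shipping product's specification:

    # Illustrative sketch: time to stream a 328 GB KV cache at each tier's ballpark bandwidth.
    # Bandwidth values are rough figures for illustration, not benchmark results.
    tiers_gb_per_s = {
        "HBM3e (per stack)": 1_200,
        "HBF (assumed target)": 400,
        "Enterprise SSD": 20,
    }

    kv_cache_gb = 328
    for name, bandwidth in tiers_gb_per_s.items():
        print(f"{name}: {kv_cache_gb / bandwidth:.2f} s to stream the cache")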
 
Graphics by AJP Song Ji-yoon

Industry engineers increasingly expect future AI servers to adopt such layered designs, combining HBM, conventional DRAM and flash-based memory pools through high-speed interconnects such as CXL. In that setup, Kim said, HBM would remain dominant for latency-critical tasks, while HBF would emerge as the backbone for long-context and large-scale inference.
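
A minimal sketch of that kind of tiering is shown below: latency-critical, recently used data stays in a small fast tier while bulk context spills to a larger capacity tier. The tier sizes and the simple spill-oldest policy are illustrative assumptions, not a description of any announced product or of Kim's proposal in detail:

    # Illustrative two-tier KV-cache store: a small fast tier (HBM-like) for the most
    # recent tokens and a large capacity tier (HBF/flash-like) for older context.
    # Capacities and the spill-oldest eviction policy are assumptions for illustration.
    from collections import OrderedDict

    class TieredKVCache:
        def __init__(self, fast_capacity_tokens):
            self.fast = OrderedDict()   # token index -> (key, value) data
            self.capacity_tier = {}     # overflow store, e.g. flash-backed
            self.fast_capacity = fast_capacity_tokens

        def append(self, token_index, kv_pair):
            self.fast[token_index] = kv_pair
            if len(self.fast) > self.fast_capacity:
                oldest, spilled = self.fast.popitem(last=False)
                self.capacity_tier[oldest] = spilled   # spill the oldest entry downward

        def get(self, token_index):
            # Serve from the fast tier when possible, otherwise read from the capacity tier.
            if token_index in self.fast:
                return self.fast[token_index]
            return self.capacity_tier.get(token_index)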

“The question is no longer whether we need more memory,” Kim said. “It’s how we restructure the memory system itself. In the AI era, memory is becoming the real factory floor.”

For memory makers, the shift opens a new front in the AI arms race — not only in faster chips, but in how much data those chips can store and move.

Copyright ⓒ Aju Press All rights reserved.