KAIST technology cuts AI service costs by 67 percent using personal GPUs

By Park Sae-jin Posted : December 28, 2025, 13:38 Updated : December 28, 2025, 13:38
This AI-generated image depicts the SpecEdge system. Courtesy of KAIST

SEOUL, December 28 (AJP) - Researchers at the Korea Advanced Institute of Science and Technology (KAIST) have developed a technology that significantly reduces the operating costs of large language models (LLMs) by utilizing consumer-grade graphics processing units (GPUs) found in personal computers and mobile devices.

The university announced on December 28 that a team led by Professor Han Dong-su from the School of Electrical Engineering has created "SpecEdge," a framework that integrates low-cost edge GPUs into the infrastructure typically reserved for expensive data centers.

AI services currently rely heavily on high-performance GPUs housed in centralized data centers. This dependency results in high operational expenses and creates significant barriers to entry for new AI technologies. While consumer-grade hardware, such as the NVIDIA RTX 4090, offers substantial computing power at a fraction of the hourly cost of data center equipment, existing systems have struggled to effectively coordinate these distributed resources with central servers.

SpecEdge addresses this by distributing the inference workload. The system employs a technique known as "speculative decoding." In this process, a smaller language model running on a local device—such as a personal computer or smartphone—rapidly generates a sequence of draft words, or "tokens." The massive language model in the data center then verifies these drafts in a single batch.
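The control flow of speculative decoding can be sketched in a few lines of Python. The sketch below is a toy illustration of the general greedy draft-then-verify loop, not KAIST's SpecEdge code: the draft and target "models" are simulated functions, and names such as speculative_step are invented for illustration.

```python
import random

# Toy stand-ins for the two models. In a SpecEdge-style setup the draft model
# would run on a consumer edge GPU and the target model in the data center;
# here both are simulated so the loop can run anywhere.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def draft_next_token(context):
    """Small 'edge' model: cheap to run, sometimes wrong."""
    random.seed(hash(tuple(context)) % 10_000)
    return random.choice(VOCAB)

def target_next_token(context):
    """Large 'server' model: treated as ground truth in this sketch."""
    random.seed(hash(tuple(context)) % 10_000 + 1)
    return random.choice(VOCAB)

def speculative_step(context, k=4):
    """One round: the draft model proposes k tokens, the target model checks
    the whole proposal and keeps the longest verified prefix, plus its own
    correction for the first mismatch."""
    # 1) Edge side: generate a k-token draft autoregressively.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next_token(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2) Server side: verify the draft (in practice one batched forward pass
    #    scores all k positions at once).
    accepted, ctx = [], list(context)
    for tok in draft:
        expected = target_next_token(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)  # server's correction ends the round
            break
    return accepted

if __name__ == "__main__":
    context = ["the"]
    for _ in range(5):
        context += speculative_step(context)
    print(" ".join(context))
```

Because the large model only checks drafts instead of generating every token itself, each round of verification can cover several tokens at once, which is where the efficiency gain comes from.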

To maximize efficiency, the local device does not wait for the server's validation before proceeding. Instead, it continues to generate subsequent words, eliminating idle time. This allows the system to function effectively even over standard internet connections without requiring specialized high-speed networks.
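The pipelining idea can be illustrated with a short Python sketch. This is not the SpecEdge implementation; the latency figures, helper names, and the single-threaded executor are all assumptions made purely to show how drafting the next chunk overlaps with an in-flight verification request.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NETWORK_ROUND_TRIP = 0.08    # seconds; assumed WAN latency to the server
DRAFT_TIME_PER_TOKEN = 0.01  # seconds; assumed edge-GPU drafting speed

def verify_on_server(draft):
    """Simulated server call: batched verification plus network latency."""
    time.sleep(NETWORK_ROUND_TRIP)
    return draft  # for the sketch, pretend every draft token is accepted

def draft_tokens(start, k=4):
    """Simulated edge-side drafting of k tokens."""
    time.sleep(k * DRAFT_TIME_PER_TOKEN)
    return [f"tok{start + i}" for i in range(k)]

if __name__ == "__main__":
    output = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        draft = draft_tokens(0)
        t0 = time.time()
        for round_no in range(1, 4):
            # Send the current draft off for verification...
            pending = pool.submit(verify_on_server, draft)
            # ...and immediately start drafting the next chunk instead of idling.
            draft = draft_tokens(round_no * 4)
            output += pending.result()
        elapsed = time.time() - t0
        print(f"{len(output)} tokens in {elapsed:.2f}s "
              "(drafting overlapped with network round trips)")
```

In a real system, a draft produced ahead of verification would be discarded if the server rejects part of the previous one; the sketch omits that bookkeeping and only shows how the network round trip is hidden behind local computation.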

By offloading a portion of the computation to local devices, the research team reduced the cost per token by approximately 67.6 percent compared to systems relying solely on data center GPUs. The approach also improved server throughput 2.22-fold and cost efficiency 1.91-fold compared to performing speculative decoding exclusively on the server.

"Our goal is to utilize edge resources around users as part of the LLM infrastructure, going beyond data centers," said Professor Han Dong-su. "We aim to lower the cost of providing AI services and create an environment where anyone can utilize high-quality AI."

The research team included Dr. Park Jin-woo and Cho Seung-geun, a master's student at KAIST. The findings were presented as a "Spotlight" paper—a distinction awarded to the top 3.2 percent of submissions—at the Conference on Neural Information Processing Systems (NeurIPS 2025), held in San Diego from December 2 to December 7.

The project was supported by the Institute for Information & Communications Technology Planning & Evaluation (IITP) under the "AI-Native application service support 6G system technology development" project.

Copyright ⓒ Aju Press All rights reserved.