Huawei Achieves Up to 372% Increase in Long-Sequence AI Inference Speed

Huawei showcased its AI inference acceleration solution at MWC Shanghai 2026. [Photo: Huawei Korea]

Huawei has increased token throughput for long-sequence AI inference by up to 372% in a commercial network environment, a first for China's telecommunications industry.

On June 24, during the MWC Shanghai 2026 event, Huawei announced the results of its AI inference acceleration solution, developed in collaboration with China Mobile Hubei.

This solution is built on Huawei's OceanStor A800 storage, Ascend A3 SuperPod, and an integrated cache manager, providing a critical technological foundation for telecom operators to efficiently deploy large-scale AI computing services.

As AI services shift toward agent-based interactions, scenarios that require long-context handling, such as code generation and multi-turn conversations, are becoming more common. However, the limits of existing on-chip memory and DRAM delay data processing and have become a bottleneck.

To address this issue, Huawei introduced its UCM (Unified Cache Manager) technology, which uses external high-performance storage to implement a petabyte-scale KV cache and eliminate redundant computation, significantly reducing inference costs.
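The general idea of reusing cached KV data to avoid redundant computation can be illustrated with a toy sketch. This is not Huawei's implementation: the block size, the `ExternalKVStore` class, and the hashing scheme are all hypothetical, and a plain dictionary stands in for the external storage tier. The sketch only counts how many prompt tokens would need to be recomputed when a new request shares a prefix with an earlier one.

```python
import hashlib

BLOCK = 4  # tokens per cached block (hypothetical value chosen for illustration)


def block_keys(tokens):
    """Return one key per full block; each key hashes the entire prefix up to that block."""
    keys = []
    h = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(",".join(map(str, tokens[i:i + BLOCK])).encode() + b";")
        keys.append(h.copy().hexdigest())
    return keys


class ExternalKVStore:
    """Stand-in for an external storage tier that holds KV blocks (a dict here)."""

    def __init__(self):
        self.blocks = {}

    def prefill(self, tokens):
        """Return (tokens_reused, tokens_computed) for one request."""
        keys = block_keys(tokens)
        reused = 0
        for key in keys:
            if key not in self.blocks:
                break  # the cached prefix ends at the first missing block
            reused += BLOCK
        # Blocks past the cached prefix are computed now and stored for later requests.
        for key in keys[reused // BLOCK:]:
            self.blocks[key] = "kv-placeholder"
        return reused, len(tokens) - reused


store = ExternalKVStore()
turn1 = list(range(10))          # 10-token conversation history
turn2 = turn1 + [100, 101, 102]  # same history plus a new user message
first = store.prefill(turn1)
second = store.prefill(turn2)
print(first, second)  # (0, 10) (8, 5)
```

In this toy run, the second turn recomputes only 5 of its 13 tokens because the first two blocks of the shared history are already stored. The longer the shared context, the larger the share of work that can be skipped.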

The validation, conducted on China Mobile Hubei's commercial network, simulated long-sequence inputs ranging from 8K to 190K tokens using major AI models, including MiniMax M2.5 and GLM-5.1.

The results showed that the time to first token (TTFT) for the GLM-5.1 model was reduced by up to 93%, and its throughput in tokens per second (TPS) improved by as much as 372% in a 128K long-sequence environment. The MiniMax M2.5 model also recorded a 78% increase in TPS in the same environment, with the acceleration effect becoming more pronounced as the context window lengthened.
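To put these percentages in perspective, they can be converted into multipliers with simple arithmetic. The helper names below are hypothetical, and the calculation uses only the figures reported above.

```python
def increase_to_multiplier(pct_increase):
    """A P% increase means the new value is (1 + P/100) times the old one."""
    return 1 + pct_increase / 100


def reduction_to_speedup(pct_reduction):
    """A P% reduction in time means the task finishes 1 / (1 - P/100) times faster."""
    return 1 / (1 - pct_reduction / 100)


glm_tps = increase_to_multiplier(372)     # GLM-5.1 throughput at 128K
minimax_tps = increase_to_multiplier(78)  # MiniMax M2.5 throughput
glm_ttft = reduction_to_speedup(93)       # GLM-5.1 time to first token

print(f"GLM-5.1 TPS: {glm_tps:.2f}x")
print(f"MiniMax M2.5 TPS: {minimax_tps:.2f}x")
print(f"GLM-5.1 TTFT: about {glm_ttft:.1f}x faster")
```

In other words, a 372% increase corresponds to roughly 4.7 times the baseline throughput, and a 93% reduction in TTFT means the first token arrives in 7% of the original time.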

Michael Chu, Huawei's Global Data Storage Marketing and Solution Sales President, stated, "The AI inference acceleration solution not only significantly reduces response times but also contributes to token cost savings. We are committed to supporting telecom operators in building efficient and environmentally friendly AI computing infrastructures."

* This article has been translated by AI.