AI Developed to Change Facial Expressions Based on Voice Emotion Analysis

Comparison of emotion editing results between C-MET and existing methods [Photo=Yonhap News • Ulsan National Institute of Science and Technology]

A South Korean research team has developed an artificial intelligence (AI) technology that analyzes the emotion in a speaker's voice and naturally changes the facial expressions of people in videos to match, Yonhap News reported on June 18.
According to Yonhap, a team led by Professor Kim Tae-hwan at the Ulsan National Institute of Science and Technology (UNIST) has created an AI module called C-MET (Cross-Modal Emotion Transfer) that can alter the facial expressions of a speaker in a video to reflect desired emotions without needing a reference image.
This technology stands out from traditional methods that simply label emotions like 'joy' or 'sadness' for training. Instead, it focuses on the variations between emotions. The researchers calculated the differences between neutral speech and emotionally charged speech in vector form, allowing the AI to learn how these changes manifest as facial expressions.
As a result, the AI can extract only the emotional signals necessary for facial expression changes from speech that contains both content and emotion. Even with the same sentence, the AI can adjust facial expressions based on variations in tone and inflection, altering movements of the mouth corners, eyebrows, and around the eyes.
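The difference-vector idea described above can be illustrated with a toy sketch. It assumes, purely for illustration, that a speech embedding is the sum of a content part and an emotion part; the article does not describe C-MET's actual encoder, and every name and number below is hypothetical.

```python
# Toy illustration only: these vectors are made-up numbers, not outputs
# of C-MET or of any real speech encoder.

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def emotion_delta(neutral, emotional):
    """Element-wise difference between emotional and neutral speech vectors."""
    if len(neutral) != len(emotional):
        raise ValueError("embeddings must have the same length")
    return [e - n for n, e in zip(neutral, emotional)]

# Hypothetical content embeddings for two different sentences.
content_a = [1.0, 2.0, 0.5]
content_b = [3.0, -1.0, 0.25]

# Hypothetical offset that a "happy" tone adds (assumed additive).
happy_offset = [0.5, -0.25, 0.75]

# Each sentence spoken neutrally and happily.
neutral_a, happy_a = content_a, add(content_a, happy_offset)
neutral_b, happy_b = content_b, add(content_b, happy_offset)

# Subtracting the neutral version cancels the content and leaves the emotion.
delta_a = emotion_delta(neutral_a, happy_a)
delta_b = emotion_delta(neutral_b, happy_b)
```

Under this additive assumption, `delta_a` and `delta_b` are identical even though the two sentences differ, which is the sense in which a difference vector isolates the emotional signal from the spoken content. According to the article, the model then learns how such differences map onto changes in facial expression; that learned mapping is not shown here.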
Notably, the technology can express emotions that were not directly encountered during the training process. The research team explained that by analyzing the variations between two emotions, subtle feelings such as sarcasm, empathy, and charisma can also be reflected in facial expressions.
For instance, the phrase "Well done" can be either sincere praise or sarcasm, and the AI can generate a different expression for each based on tone alone.
Additionally, because the technology does not require high-quality reference images, such as frontal photos of a person expressing an emotion, its potential applications are broader.
Performance has also improved compared to existing technologies. In experiments where the expression encoder of EDTalk, a recent expression editing model, was replaced with C-MET, the accuracy of emotion expression increased from 41.99% to 55.91%, a rise of nearly 14 percentage points.
When applied to another facial generation model, PD-FGC, the accuracy improved from 33.36% to 36.82%. This indicates that C-MET is not limited to specific models and can be applied across various facial generation AI systems, according to the researchers.
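The reported figures can be checked with a few lines of arithmetic. The sketch below uses only the four accuracy values quoted above; the relative-gain number is derived here for comparison and is not a figure reported by the researchers, and the function name `gains` is just an illustrative choice.

```python
def gains(before, after):
    """Return (percentage-point gain, relative gain in percent)."""
    points = after - before
    relative = points / before * 100
    return round(points, 2), round(relative, 1)

# Accuracy values as quoted in the article.
edtalk = gains(41.99, 55.91)  # EDTalk with its expression encoder swapped for C-MET
pdfgc = gains(33.36, 36.82)   # PD-FGC with C-MET applied

print("EDTalk:", edtalk)
print("PD-FGC:", pdfgc)
```

The percentage-point gain is the simple difference between the two accuracies, while the relative gain expresses that difference as a share of the starting accuracy. For EDTalk the relative gain works out to roughly a third, and for PD-FGC to roughly a tenth.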
Professor Kim Tae-hwan told Yonhap, "This research effectively addresses the limitations of existing methods by allowing facial emotions in video to be changed using only voice, without reference images. It serves as a foundational technology that can be utilized in various fields, including virtual human creation, film and content post-production, and emotion recognition AI."
The research findings have been accepted for presentation at the 2026 Conference on Computer Vision and Pattern Recognition (CVPR 2026), an international conference in the field of AI and computer vision.

* This article has been translated by AI.