Qfin Holdings’ Paper Accepted by ASRU 2025, Completing Recognition Across the Three Top Speech Conferences and Demonstrating Independent R&D Strength

SHANGHAI, Aug. 21, 2025 /PRNewswire/ — Recently, Qfin Holdings’ Intelligent Speech Team has achieved another milestone: its multimodal emotion computing research paper, “Qieemo: Multimodal Emotion Recognition Based on the ASR Backbone,” has been officially accepted by ASRU 2025, a flagship conference in the global speech technology industry. This achievement cements Qfin Holdings’ status as one of the rare fintech pioneers with recognition across the “Big Three” top-tier speech conferences: ICASSP, Interspeech, and ASRU, further solidifying its position in the global first echelon of speech technology research and development.

ASRU (IEEE Workshop on Automatic Speech Recognition and Understanding), a biennial gathering, stands as a pinnacle event in audio understanding, showcasing the world’s foremost research in the field.

The core value of the paper accepted by ASRU 2025 lies in its establishment of a theoretical framework with broad applicability, rather than merely presenting a task-specific model. From a mathematical modeling perspective, the paper pioneers a general feature fusion framework built around a pre-trained ASR model as its backbone, and systematically analyzes how multi-level features from the ASR encoder contribute to downstream audio understanding tasks. The proposed framework moves beyond the conventional approach of adding network layers or fine-tuning parameters on existing models; instead, it examines the essence of speech representation and the underlying logic of its cross-modal applications, thereby providing a novel and robust theoretical foundation for multimodal emotion recognition and for broader speech understanding tasks.
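To make the idea of multi-level feature fusion on an ASR backbone concrete, the sketch below shows one plausible, deliberately simplified reading of that design: hidden states from every encoder layer are combined with learnable weights and fed to an emotion classifier. This is not the paper’s implementation; the placeholder Transformer encoder, the layer-weighting scheme, and all dimensions are illustrative assumptions.

```python
# Minimal sketch (assumed design, not Qieemo's actual code): fuse features
# from every layer of a pre-trained ASR encoder for emotion classification.
import torch
import torch.nn as nn

class MultiLevelFusionClassifier(nn.Module):
    def __init__(self, d_model=256, n_layers=6, n_emotions=4):
        super().__init__()
        # Stand-in for a pre-trained ASR encoder; in practice its weights
        # would be loaded from an ASR checkpoint and possibly frozen.
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        # One learnable scalar weight per encoder layer output.
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, feats):                       # feats: (B, T, d_model)
        hidden_states = []
        x = feats
        for layer in self.layers:
            x = layer(x)
            hidden_states.append(x)                 # keep every level
        # Softmax-normalized weighted sum over layers, then mean-pool time.
        w = torch.softmax(self.layer_weights, dim=0)
        fused = sum(wi * h for wi, h in zip(w, hidden_states))
        pooled = fused.mean(dim=1)                  # (B, d_model)
        return self.classifier(pooled)              # emotion logits

if __name__ == "__main__":
    model = MultiLevelFusionClassifier()
    dummy = torch.randn(2, 100, 256)                # frame-level acoustic features
    print(model(dummy).shape)                       # torch.Size([2, 4])
```

The learnable per-layer weights are one simple way to let the model decide which encoder depths carry emotion-relevant cues, reflecting the paper’s emphasis on how multi-level ASR features contribute to downstream tasks; the actual fusion mechanism in Qieemo may differ.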

This breakthrough technology has boosted recognition accuracy by over 15% compared with traditional methods, while achieving notable gains in complex scenarios: it delivers a 3.04% relative improvement over MSMSER, the existing SOTA single-modal method. This gives intelligent customer service genuine “emotional understanding” capability for the first time, establishing a new “SOTA+” benchmark in the field of emotion computing. Such a performance leap stems from a deep understanding of the underlying speech features and their mechanisms, rather than from simply making the model more complex.

Unlike most fintech companies, which rely on open-source technologies or external cooperation, Qfin Holdings adheres to end-to-end independent R&D in the core field of artificial intelligence. It continuously invests in cutting-edge domains such as speech recognition and emotion computing, and has built a complete system spanning algorithm design to engineering implementation. Particularly crucial is that Qfin Holdings has opted for a deeper, more foundational R&D path. While the industry generally focuses on adding layers to existing neural network architectures or experimenting with different combinations of components, Qfin Holdings has chosen to return to the essence of the problem and deeply explore the mathematical principles and mechanisms of speech signal processing and of multimodal feature representation and fusion. This persistent focus on basic theory and original frameworks gives it significant advantages in technical depth, application flexibility, and long-term competitiveness.