Decision-Level Fusion for Facial and Speech Emotion Recognition: A CNN-Based Web Application

  • Hind Mestouri, Cadi Ayyad University
  • Abdelilah Jraifi
  • Kamal Baraka
Keywords: Emotion Recognition, Convolutional Neural Networks (CNN), Artificial Intelligence (AI), Facial Expression, Vocal Analysis, Web-Based Application

Abstract

This paper presents a real-time, web-based emotion recognition system that combines unimodal deep learning models for facial and speech analysis through decision-level score aggregation. Facial emotion recognition is performed using convolutional neural networks (CNNs), while speech emotion recognition relies on a CNN–BiLSTM architecture to capture both spatial and temporal speech patterns. These models are chosen for their effectiveness and low computational cost, which make them suitable for web-based deployment. The facial model is trained on the FER2013 dataset, and the speech model is trained on the RAVDESS corpus using MFCC-based audio features. Rather than performing multimodal representation learning, this work demonstrates decision-level fusion by aggregating unimodal prediction scores, improving robustness when combining facial and speech information. Experimental results show competitive recognition performance and support the applicability of the proposed system to human-computer interaction in real-time, web-based affective applications.
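The decision-level fusion described in the abstract can be sketched as a weighted average of the per-modality prediction scores, followed by an argmax over the fused distribution. The sketch below is illustrative only: the fusion weights, the `fuse_decisions` helper, and the seven-class label set (shared by FER2013 and a common RAVDESS subset) are assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative label set; the paper does not enumerate its exact classes.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def fuse_decisions(face_scores, speech_scores, w_face=0.6, w_speech=0.4):
    """Decision-level fusion: weighted average of two softmax score vectors.

    face_scores / speech_scores: per-class probabilities from the unimodal
    models. The weights are hypothetical tuning parameters.
    """
    face = np.asarray(face_scores, dtype=float)
    speech = np.asarray(speech_scores, dtype=float)
    fused = w_face * face + w_speech * speech
    fused /= fused.sum()  # renormalise so the result is still a distribution
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: facial model leans "happy", speech model is less certain.
label, fused = fuse_decisions(
    [0.10, 0.00, 0.10, 0.60, 0.10, 0.05, 0.05],
    [0.20, 0.00, 0.10, 0.30, 0.20, 0.10, 0.10],
)
```

Because fusion happens on output scores rather than shared features, each unimodal model can be trained, updated, or deployed independently, which is part of what keeps the system lightweight for web use.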
Published
2026-02-18
How to Cite
Mestouri, H., Jraifi, A., & Baraka, K. (2026). Decision-Level Fusion for Facial and Speech Emotion Recognition: A CNN-Based Web Application. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-3341
Section
Research Articles