AI-based visual speech recognition towards realistic avatars and lip-reading applications in the metaverse

Li, Ying; Hashim, Ahmad Sobri; Lin, Yun; Nohuddin, Puteri N. E.; Venkatachalam, K.; Ahmadian, Ali

Date

2024

Authors

Li, Ying

Hashim, Ahmad Sobri

Lin, Yun

Nohuddin, Puteri N. E.

Venkatachalam, K.

Ahmadian, Ali

Publisher

Elsevier

Abstract

The metaverse, a shared virtual world where individuals interact, create, and explore, has seen rapid evolution and widespread adoption. Communication between avatars is central to how they act in the metaverse, and advances in natural language processing have enabled significant progress in producing spoken conversations. Within this digital landscape, Visual Speech Recognition (VSR) powered by deep learning emerges as a transformative application. This study examines the concept and implications of VSR in the metaverse, focusing on the development of realistic avatars and a lip-reading application that use Artificial Intelligence (AI) techniques for visual speech recognition. VSR in the metaverse refers to using deep learning techniques to comprehend and respond to spoken language, relying on the visual cues provided by users' avatars. This multidisciplinary approach combines computer vision and natural language processing, enabling avatars to understand spoken words by analyzing lip movements and facial expressions. Key components include the collection of extensive video datasets; a model called CFS-DCTCN, which combines 3D Convolutional Neural Networks (3D CNNs) with ShuffleNet and Densely Connected Temporal Convolutional Neural Networks (DC-TCN) to model visual and temporal features; and the integration of contextual understanding mechanisms. Two datasets, Lip Reading in the Wild (LRW) and the GRID Corpus, are used to validate the proposed model. As the metaverse continues to gain prominence, integrating VSR through deep learning represents a pivotal step towards immersive and dynamic virtual worlds where communication transcends physical boundaries. This paper contributes to the foundation of technology-driven metaverse development and fosters a future where digital interactions mirror the complexities of human communication. The proposed model achieves 99.5% on LRW and 98.8% on the GRID dataset.
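
The abstract names densely connected temporal convolutions as the component that models temporal features. The sketch below illustrates only that general idea: dilated 1D convolutions over per-frame feature vectors, with each layer's output concatenated onto its input. It is not the authors' CFS-DCTCN implementation. The function names, the constant placeholder weights, the odd kernel size, and the toy input are all assumptions made for illustration, and the 3D CNN and ShuffleNet front end is omitted.

```python
from typing import List

Sequence = List[List[float]]  # T frames, each a list of channel values


def dilated_temporal_conv(x: Sequence, weights: List[List[List[float]]],
                          dilation: int) -> Sequence:
    """1D convolution over time with zero padding and ReLU.

    weights[k][i][o] maps input channel i at kernel tap k to output channel o.
    Assumes an odd kernel size so the kernel is centred on each frame.
    """
    num_frames = len(x)
    kernel_size = len(weights)
    out_channels = len(weights[0][0])
    center = kernel_size // 2
    out = []
    for t in range(num_frames):
        frame = [0.0] * out_channels
        for k in range(kernel_size):
            src = t + (k - center) * dilation
            if src < 0 or src >= num_frames:
                continue  # zero padding outside the clip
            for i, value in enumerate(x[src]):
                for o in range(out_channels):
                    frame[o] += value * weights[k][i][o]
        out.append([max(0.0, v) for v in frame])  # ReLU
    return out


def dense_temporal_block(x: Sequence, growth: int, dilations: List[int],
                         kernel_size: int = 3, init: float = 0.1) -> Sequence:
    """Stack dilated convolutions with dense (concatenative) connectivity.

    Weights are constant placeholders (hypothetical), not trained values.
    """
    for dilation in dilations:
        in_channels = len(x[0])
        weights = [[[init] * growth for _ in range(in_channels)]
                   for _ in range(kernel_size)]
        new = dilated_temporal_conv(x, weights, dilation)
        # Dense connection: keep all earlier channels and append the new ones.
        x = [old + add for old, add in zip(x, new)]
    return x


# Toy input: 5 frames with 2 feature channels each (made-up values).
frames = [[float(t), 1.0] for t in range(5)]
features = dense_temporal_block(frames, growth=3, dilations=[1, 2])
print(len(features), len(features[0]))  # 5 frames, 2 + 3 + 3 = 8 channels
```

Because every layer sees the concatenation of all earlier outputs, the channel count grows by `growth` per layer while the number of frames stays fixed; increasing the dilation widens the temporal context without adding kernel taps.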

Keywords

Artificial intelligence, Metaverse, Visual speech recognition, Deep learning, Avatars, Virtual communication

WoS Q

Q1

Scopus Q

Q1

Volume

164

URI

https://doi.org/10.1016/j.asoc.2024.111906
https://hdl.handle.net/20.500.14517/6232

Collections

WoS İndeksli Yayınlar Koleksiyonu / WoS Indexed Publications Collection
Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection
