Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition.

Authors :
Nadeem, Mohammad
Sohail, Shahab Saquib
Javed, Laeeba
Anwer, Faisal
Saudagar, Abdul Khader Jilani
Muhammad, Khan
Source :
Cognitive Computation; Sep 2024, Vol. 16, Issue 5, p2566-2579, 14p
Publication Year :
2024

Abstract

The significant advancements in the capabilities, reasoning, and efficiency of artificial intelligence (AI)-based tools and systems are evident. Noteworthy examples include generative AI-based large language models (LLMs) such as generative pretrained transformer 3.5 (GPT-3.5), generative pretrained transformer 4 (GPT-4), and Bard. LLMs are versatile and effective for various tasks such as composing poetry, writing code, generating essays, and solving puzzles. Until recently, LLMs could only process text-based input effectively; however, recent advancements have enabled them to handle multimodal inputs, such as text, images, and audio, making them highly general-purpose tools. Because LLMs have achieved decent performance in pattern recognition tasks (such as classification), there is curiosity about whether general-purpose LLMs can perform comparably to, or even better than, specialized deep learning models (DLMs) trained specifically for a given task. In this study, we compared the performance of fine-tuned DLMs with that of general-purpose LLMs for image-based emotion recognition. We trained DLMs, namely, a convolutional neural network (CNN) (two CNN models were used: CNN1 and CNN2), ResNet50, and VGG-16 models, using an image dataset for emotion recognition, and then tested their performance on another dataset. Subsequently, we subjected the same testing dataset to two vision-enabled LLMs (LLaVA and GPT-4). CNN2 was found to be the superior model, with an accuracy of 62%, while VGG-16 produced the lowest accuracy, at 31%. Among the LLMs, GPT-4 performed the best, with an accuracy of 55.81%. The LLaVA LLM achieved a higher accuracy than the CNN1 and VGG-16 models. The other performance metrics, such as precision, recall, and F1-score, followed similar trends. Notably, GPT-4 performed the best with small datasets.
The weaker results observed for the LLMs can be attributed to their general-purpose nature: despite extensive pretraining, they may not capture the features required for specific tasks, such as emotion recognition in images, as effectively as models fine-tuned for those tasks. Although the LLMs did not surpass the specialized models, they achieved comparable performance, making them a viable option for specific tasks without additional training. In addition, LLMs can be considered a good alternative when the available dataset is small. [ABSTRACT FROM AUTHOR]
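The abstract compares models by accuracy, precision, recall, and F1-score. As a minimal sketch of how such a comparison could be scored (the emotion label set and helper name here are illustrative, not taken from the paper), the macro-averaged metrics can be computed directly from the true and predicted labels:

```python
EMOTIONS = ["happy", "sad", "angry", "neutral"]  # illustrative label set

def evaluate(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    assert len(y_true) == len(y_pred) and y_true
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for label in EMOTIONS:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(EMOTIONS)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,  # macro average over classes
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

The same function would be applied to each model's predictions (CNN, ResNet50, VGG-16, or an LLM's parsed label output) on the shared test set, so that all models are ranked on identical metrics.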

Details

Language :
English
ISSN :
1866-9956
Volume :
16
Issue :
5
Database :
Complementary Index
Journal :
Cognitive Computation
Publication Type :
Academic Journal
Accession number :
178877051
Full Text :
https://doi.org/10.1007/s12559-024-10281-5