OHiFormer: Object-Wise Hierarchical Dependency-Based Transformer for Screen Summarization

Authors :
Ye Ji Han
Soyeon Lee
Jin Sob Kim
Byung Hoon Lee
Sung Won Han
Source :
IEEE Access, Vol 12, Pp 101313-101324 (2024)
Publication Year :
2024
Publisher :
IEEE, 2024.

Abstract

Screen summarization aims to generate concise textual descriptions that communicate the crucial contents and functionalities of a mobile user interface (UI) screen. A UI screen consists of objects with a hierarchical structure that are tightly interconnected, and each object contains multimodal data such as images, texts, and bounding boxes. Considering these characteristics, previous works encoded the absolute position of objects in the view hierarchy to extract the semantic representation of the UI screen. However, the importance of the hierarchical dependency between objects in the UI structure was overlooked. In this study, we propose an object-wise hierarchical dependency-based Transformer named OHiFormer. OHiFormer treats the objects on the UI screen like tokens in natural language processing and leverages the Transformer to capture the mutual relationships between objects. Moreover, OHiFormer includes a modified self-attention mechanism using structural relative position encoding to represent the hierarchically connected UI. Experimental results demonstrate that OHiFormer outperforms benchmark models in the BLEU 1, BLEU 2, BLEU 3, BLEU 4, ROUGE-L, and CIDEr metrics by 3.63%, 2.1%, 0.12%, 1.8%, 2.38%, and 17.58%, respectively, on the Screen Summarization dataset. Furthermore, our proposed UI structural representation method achieves remarkable performance on complex UIs with numerous objects compared to other structural position encoding methods. Finally, a visualization of the self-attention heatmaps demonstrates how OHiFormer reflects the hierarchical dependencies between objects. By reflecting hierarchical dependencies hidden in the visual layout of the UI, OHiFormer not only improves the quality of summaries but also offers the potential for applications in mobile apps and systems containing numerous interactive objects.
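
The abstract does not give the exact form of the structural relative position encoding, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that each UI object is a node in a view-hierarchy tree, that the "structural relative position" of two objects can be approximated by the number of edges between them in that tree, and that this distance selects an additive bias on the usual scaled dot-product attention score. The names `tree_distance`, `structural_attention`, `parent`, and `bias_by_distance` are hypothetical and chosen for illustration; in a trained model the bias values would be learned rather than fixed.

```python
import math


def ancestors(parent, i):
    """Return the path from node i up to the root, inclusive."""
    path = [i]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def tree_distance(parent, i, j):
    """Number of edges between nodes i and j in the hierarchy (assumed metric)."""
    depth_from_i = {node: d for d, node in enumerate(ancestors(parent, i))}
    for d_j, node in enumerate(ancestors(parent, j)):
        if node in depth_from_i:
            # First shared ancestor: hops up from i plus hops up from j.
            return depth_from_i[node] + d_j
    raise ValueError("nodes are not in the same tree")


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def structural_attention(queries, keys, values, parent, bias_by_distance):
    """Single-head self-attention with an additive bias indexed by tree distance.

    bias_by_distance[d] is the (hypothetical) bias for objects d edges apart;
    distances beyond the table are clipped to its last entry.
    """
    n, d_model = len(queries), len(queries[0])
    max_dist = len(bias_by_distance) - 1
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            dot = sum(q * k for q, k in zip(queries[i], keys[j]))
            dist = min(tree_distance(parent, i, j), max_dist)
            scores.append(dot / math.sqrt(d_model) + bias_by_distance[dist])
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(row[j] * values[j][c] for j in range(n))
             for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy view hierarchy: node 0 is the root, nodes 1 and 2 are containers,
# nodes 3 and 4 are sibling widgets inside container 1.
parent = [None, 0, 0, 1, 1]
features = [[1.0, 0.0] for _ in parent]  # identical content, so only structure matters
bias = [0.0, -0.5, -1.0, -1.5]           # assumed: closer in the tree gets a larger bias
_, attn = structural_attention(features, features, features, parent, bias)
```

With identical object features, the attention row for widget 3 depends only on the hierarchy: it attends most to itself, then to its parent container, and least to the unrelated container 2. This shows how a relative structural bias can inject hierarchical dependency that absolute position encodings do not express directly.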

Details

Language :
English
ISSN :
21693536
Volume :
12
Database :
Directory of Open Access Journals
Journal :
IEEE Access
Publication Type :
Academic Journal
Accession number :
edsdoj.0b833eef58d84477b09fa7619903073f
Document Type :
article
Full Text :
https://doi.org/10.1109/ACCESS.2024.3431711