6 results for "Transformer Network"
Search Results
2. GraphFusion: Integrating multi-level semantic information with graph computing for enhanced 3D instance segmentation.
- Authors: Pan, Lei, Luan, Wuyang, Zheng, Yuan, Li, Junhui, Tao, Linwei, and Xu, Chang
- Subjects: TRANSFORMER models; POINT cloud; SOCIAL dominance
- Abstract
Graph computing has emerged as a focal point in recent research across various fields, including 3D instance segmentation, where it aids in detecting and segmenting objects within volumetric data. Our study introduces GraphFusion, a state-of-the-art network that harnesses graph computing to enhance the segmentation of 3D point clouds. GraphFusion is equipped with a Multi-Level Semantic Aggregation Module, architected like a graph, to capture comprehensive features from 3D point clouds. Using graph-based methodologies, this module aggregates multi-scale semantic information, drawing on both global and local contexts. Additionally, our Parallel Feature Fusion Transformer Module leverages graph-transformer techniques to process complex spatial relationships within point clouds, yielding a more cohesive feature representation. Rigorous experiments on the ScanNetv2 dataset confirm the strength of GraphFusion, which surpasses current methods by 2.2% in mean Average Precision (mAP) on the hidden test set. The model's code is accessible at https://github.com/3171228612/GraphFusion. [ABSTRACT FROM AUTHOR] (An illustrative sketch of multi-scale graph aggregation follows this entry.)
- Published: 2024
- Full Text: View/download PDF
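As a concrete illustration of the multi-scale, graph-style aggregation the GraphFusion abstract describes, here is a minimal PyTorch sketch: neighbour features from k-NN graphs at several scales are mean-pooled and concatenated. The function names (knn_indices, multi_scale_aggregate) and the scale choices are illustrative assumptions, not the authors' published implementation.

```python
# Minimal sketch: multi-scale neighbour aggregation over a point cloud,
# in the spirit of a multi-level semantic aggregation module.
import torch

def knn_indices(xyz: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k nearest neighbours for each point. xyz: (N, 3)."""
    dists = torch.cdist(xyz, xyz)                # (N, N) pairwise distances
    return dists.topk(k, largest=False).indices  # (N, k); includes the point itself

def multi_scale_aggregate(xyz, feats, scales=(8, 16, 32)):
    """Concatenate mean-pooled neighbour features at several graph scales.
    feats: (N, C) per-point features -> returns (N, C * len(scales))."""
    outs = []
    for k in scales:
        idx = knn_indices(xyz, k)                # graph at scale k
        outs.append(feats[idx].mean(dim=1))      # local context at scale k
    return torch.cat(outs, dim=-1)

xyz = torch.randn(1024, 3)
feats = torch.randn(1024, 64)
fused = multi_scale_aggregate(xyz, feats)        # (1024, 192)
```

Small scales capture local geometry while large scales approximate global context; concatenation lets a downstream head weigh both, which is the rough intuition behind fusing multi-level semantics.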
3. Bi-syntax guided transformer network for aspect sentiment triplet extraction.
- Authors: Hao, Shufeng, Zhou, Yu, Liu, Ping, and Xu, Shuang
- Subjects: SENTIMENT analysis; USER-generated content; END-to-end delay
- Abstract
Aspect Sentiment Triplet Extraction is an emerging and challenging task that attempts to present a complete picture of aspect-based sentiment analysis. Prior research mostly leverages various tagging schemes to extract the three elements of a triplet. However, these methods fail to explicitly model the complicated relations between aspects and opinions, as well as the boundaries of multi-word aspects and opinions. In this paper, we propose an end-to-end bi-syntax guided transformer network to address these challenges. First, we devise three types of representations, namely sequence distance representation, constituency distance representation, and dependency distance representation, to learn a comprehensive language representation. Specifically, the sequence distance representation uses the sequence distance between words to enhance the contextual representation. The constituency distance representation adopts the constituency distance between words in a constituency tree to capture intra-span relations between words. The dependency distance representation employs the dependency distance between words in a dependency tree to capture long-distance relations between aspects and opinions. Extensive experiments on four benchmark datasets validate the effectiveness of our method. The results demonstrate that the proposed approach outperforms baseline methods. Further detailed analysis shows that our method effectively handles multi-word terms and overlapping triplets. [ABSTRACT FROM AUTHOR] (A small sketch of the three word-pair distances follows this entry.)
- Published: 2024
- Full Text: View/download PDF
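To make the three distances named in the abstract concrete, here is a hedged pure-Python sketch: sequence distance is just positional offset, while constituency and dependency distances are shortest-path lengths over the respective tree's edges. The edges here are hand-written toy values; a real system would obtain them from a parser.

```python
# Sketch of the word-pair distances: sequence distance and tree distance
# (the latter serves for both constituency and dependency trees).
from collections import deque

def sequence_distance(i: int, j: int) -> int:
    """Distance between token positions i and j in the sentence."""
    return abs(i - j)

def tree_distance(edges, n, i, j):
    """Shortest-path distance between tokens i and j over tree edges (BFS)."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [-1] * n
    dist[i] = 0
    q = deque([i])
    while q:
        u = q.popleft()
        if u == j:
            return dist[u]
        for v in adj[u]:
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                q.append(v)
    return -1  # unreachable (should not happen in a tree)

# Toy example: "the food was great" (tokens 0..3) with hypothetical
# dependency edges; a real parser would supply these.
dep_edges = [(3, 1), (1, 0), (3, 2)]
print(sequence_distance(0, 3))            # 3
print(tree_distance(dep_edges, 4, 0, 3))  # 2 (path 0 -> 1 -> 3)
```

The point of the dependency distance is visible even in the toy case: an aspect and opinion far apart in the sequence can be only a hop or two apart in the tree.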
4. Blind face restoration: Benchmark datasets and a baseline model.
- Authors: Zhang, Puyang, Zhang, Kaihao, Luo, Wenhan, Li, Changsheng, and Wang, Guoren
- Subjects: TRANSFORMER models; JPEG (Image coding standard)
- Abstract
Blind Face Restoration (BFR) aims to generate high-quality face images from low-quality inputs. However, existing BFR methods often use private datasets for training and evaluation, making it difficult for subsequent approaches to compare fairly. To address this issue, we introduce two benchmark datasets, BFRBD128 and BFRBD512, for evaluating state-of-the-art methods in five scenarios: blur, noise, low resolution, JPEG compression artifacts, and full degradation. We use seven standard quantitative metrics and two task-specific metrics, AFLD and AFICS. Additionally, we propose an efficient baseline model called Swin Transformer U-Net (STUNet), which outperforms state-of-the-art methods across BFR tasks. The codes, datasets, and trained models are publicly available at: https://github.com/bitzpy/Blind-Face-Restoration-Benchmark-Datasets-and-a-Baseline-Model. [ABSTRACT FROM AUTHOR] (A sketch of the degradation pipeline follows this entry.)
- Published: 2024
- Full Text: View/download PDF
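The abstract names four synthetic degradations plus their combination ("full degradation"). Below is a minimal PIL/NumPy sketch of such a pipeline; the specific kernel sizes, noise levels, and JPEG quality are assumptions for illustration, not the published BFRBD128/BFRBD512 settings.

```python
# Sketch: synthesizing a low-quality face image via blur, downscaling,
# additive noise, and JPEG re-encoding (a generic "full degradation").
import io
import numpy as np
from PIL import Image, ImageFilter

def degrade(img: Image.Image, scale=4, sigma=10, quality=30) -> Image.Image:
    img = img.filter(ImageFilter.GaussianBlur(radius=2))        # blur
    w, h = img.size
    img = img.resize((w // scale, h // scale), Image.BICUBIC)   # low resolution
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, sigma, arr.shape)                # additive noise
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)               # JPEG artifacts
    buf.seek(0)
    return Image.open(buf)

lq = degrade(Image.new("RGB", (512, 512), "gray"))  # full degradation example
```

Isolating each stage (blur only, noise only, and so on) yields the five evaluation scenarios the benchmark describes.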
5. Boundary-guided part reasoning network for human parsing.
- Authors: Su, Zhuo, Guan, Huiqiang, Lai, Yuntian, Zhou, Fan, and Liang, Yun
- Subjects: TRANSFORMER models; HUMAN body; HUMAN beings
- Abstract
The task of human parsing aims to segment the human body into different semantic regions. Despite advancements in this field, two issues remain in current work: boundary indistinction and parsing inconsistency. In this paper, we investigate how to utilize structural information and auxiliary information to jointly solve these two problems. Drawing inspiration from the Transformer architecture, a Boundary-guided Part Reasoning Network (BPRNet) is proposed that combines edge information and the associated semantics of body parts for human parsing. Specifically, we design a part representation module to represent human body parts as part features. Building on the Transformer decoder, multi-head self-attention is used to capture semantic correlations among human body parts. Moreover, we propose a boundary-guided module consisting of absolute boundary attention and reinforced boundary attention, which exploit edge information and multi-scale image features to jointly constrain cross-attention when extracting global features. Experiments on three public datasets show that the proposed method performs favorably against state-of-the-art methods. [ABSTRACT FROM AUTHOR] (A minimal sketch of self-attention over part features follows this entry.)
- Published: 2023
- Full Text: View/download PDF
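A minimal PyTorch sketch of the mechanism the abstract describes for part reasoning: multi-head self-attention applied to a set of per-part features so every part can attend to every other. The part count, embedding dimension, and use of learned queries are assumptions, not BPRNet's actual configuration.

```python
# Sketch: multi-head self-attention over learned body-part queries,
# modelling correlations such as "arm adjoins torso".
import torch
import torch.nn as nn

num_parts, dim = 20, 256                    # hypothetical: 20 part classes
part_queries = nn.Parameter(torch.randn(1, num_parts, dim))
self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

# Each part feature attends to all others; attn_weights exposes the
# learned part-to-part correlation matrix.
parts, attn_weights = self_attn(part_queries, part_queries, part_queries)
print(parts.shape, attn_weights.shape)      # (1, 20, 256), (1, 20, 20)
```

In a full model these refined part features would then cross-attend to image features, with boundary attention constraining where that cross-attention looks.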
6. A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows.
- Authors: Sharma, Mayank, Joshi, Sandeep, Chatterjee, Tamojit, and Hamid, Raffay
- Subjects: TELEVISION programs; ENVIRONMENTAL music; SIGNAL-to-noise ratio; MUSIC scores; SPEECH
- Abstract
A robust and language-agnostic Voice Activity Detection (VAD) system is crucial for Digital Entertainment Content (DEC), whose primary examples are movies and TV series. VAD systems are used in DEC creation for tasks such as augmenting subtitle creation, detecting and correcting subtitle drift, and audio diarisation. The majority of previous work on VAD focuses on scenarios that: (a) have minimal background noise, and (b) deliver the audio content in English. However, movies and TV shows can: (a) contain substantial amounts of non-voice background signal (e.g. musical score and environmental sounds), and (b) are released worldwide in a variety of languages. This makes most standard VAD approaches not readily applicable to DEC-related applications. Furthermore, no comprehensive analysis exists of Deep Neural Network (DNN) performance on VAD applied to DEC. In this work, we present a thorough survey of DNN-based VADs on DEC data in terms of their accuracy, Area Under Curve (AUC), noise sensitivity, and language-agnostic behaviour. For our analysis we use 1100 proprietary DEC videos spanning 450 h of content in 9 languages and 5+ genres, making our study the largest of its kind published to date. The key findings of our analysis are: (a) even high-quality timed-text or subtitle files (the terms subtitles and timed-text are used interchangeably in the manuscript) contain significant levels of label noise (up to 15%); despite this, deep networks are robust and retain high AUCs (∼0.94). (b) Using a larger labelled dataset can substantially increase a neural VAD model's True Positive Rate (TPR), with up to 1.3% and 18% relative improvement over the current state-of-the-art methods of Hebbar et al. (2019) and Chaudhuri et al. (2018) respectively; this effect is more pronounced in noisy environments such as music and environmental sounds, and is particularly instructive when prioritizing domain-specific labelled data acquisition versus exploring model structure and complexity. (c) Currently available sequence-based neural models show similar levels of competence in their language-agnostic behaviour for VAD at high Signal-to-Noise Ratios (SNRs) and for clean speech. (d) Deep models exhibit varied performance across SNRs, with CLDNN (Zazo et al., 2016) being the most robust. (e) Models with a comparatively larger number of parameters (∼2M) are less robust to input noise than models with a smaller number of parameters (∼0.5M). [ABSTRACT FROM AUTHOR] (A small sketch of the reported frame-level metrics follows this entry.)
- Published: 2022
- Full Text: View/download PDF
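The survey evaluates VAD models by frame-level AUC and TPR. Here is a hedged sketch of how those two metrics are typically computed from per-frame speech scores; the synthetic labels, score model, and 0.5 threshold are assumptions for illustration, not the paper's evaluation protocol.

```python
# Sketch: frame-level AUC and TPR for a VAD model, on synthetic data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 10_000)                 # 1 = speech frame
scores = 0.4 * labels + rng.random(10_000) * 0.6    # noisy, overlapping scores

auc = roc_auc_score(labels, scores)                 # threshold-free quality
preds = scores >= 0.5                               # fixed operating point
tpr = (preds & (labels == 1)).sum() / (labels == 1).sum()
print(f"AUC={auc:.3f}  TPR@0.5={tpr:.3f}")
```

The paper's label-noise finding can be probed with the same harness by randomly flipping a fraction of `labels` and observing how slowly AUC degrades.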