Author: "Chen, Guikun" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Chen, Guikun"' showing total 15 results

1. Scene Graph Generation with Role-Playing Large Language Models

Author: Chen, Guikun, Li, Jin, and Wang, Wenguan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline -- computing similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic as they remain unchanged across contexts. Using such fixed text classifiers not only struggles to model visual relations with high variance, but also falls short in adapting to distinct contexts. To plug these intrinsic shortcomings, we devise SDSGG, a scene-specific description based OVSGG framework where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM is asked to play different roles (e.g., biologist and engineer) to analyze and discuss the descriptive features of a given scene from different views. Unlike previous efforts simply treating the generated descriptions as mutually equivalent text classifiers, SDSGG is equipped with an advanced renormalization mechanism to adjust the influence of each text classifier based on its relevance to the presented scene (this is what the term "specific" means). Furthermore, to capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter. It refines CLIP's ability to recognize relations by learning an interaction-aware semantic space. Extensive experiments on prevalent benchmarks show that SDSGG outperforms top-leading methods by a clear margin.
Comment: NeurIPS 2024. Code: https://github.com/guikunchen/SDSGG
Published: 2024

2. A Survey on Multimodal Benchmarks: In the Era of Large AI Models

Author: Li, Lin, Chen, Guikun, Shi, Hanrong, Xiao, Jun, and Chen, Long
Subjects: Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial advancements in artificial intelligence, significantly enhancing the capability to understand and generate multimodal content. While prior studies have largely concentrated on model architectures and training methodologies, a thorough analysis of the benchmarks used for evaluating these models remains underexplored. This survey addresses this gap by systematically reviewing 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application. We provide a detailed analysis of task designs, evaluation metrics, and dataset constructions, across diverse modalities. We hope that this survey will contribute to the ongoing advancement of MLLM research by offering a comprehensive overview of benchmarking practices and identifying promising directions for future work. An associated GitHub repository collecting the latest papers is available.
Comment: Ongoing project
Published: 2024

3. Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation

Author: Chen, Minghan, Chen, Guikun, Wang, Wenguan, and Yang, Yi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: DETR introduces a simplified one-stage framework for scene graph generation (SGG). However, DETR-based SGG models face two challenges: i) Sparse supervision, as each image typically contains fewer than 10 relation annotations, while the models employ over 100 relation queries. This sparsity arises because each ground truth relation is assigned to only one single query during training. ii) False negative samples, since one ground truth relation may have multiple queries with similar matching scores. These suboptimally matched queries are simply treated as negative samples, causing the loss of valuable supervisory signals. As a response, we devise Hydra-SGG, a one-stage SGG method that adopts a new Hybrid Relation Assignment. This assignment combines a One-to-One Relation Assignment with a newly introduced IoU-based One-to-Many Relation Assignment. Specifically, each ground truth is assigned to multiple relation queries with high IoU subject-object boxes. This Hybrid Relation Assignment increases the number of positive training samples, alleviating sparse supervision. Moreover, we, for the first time, empirically show that self-attention over relation queries helps reduce duplicated relation predictions. We, therefore, propose Hydra Branch, a parameter-sharing auxiliary decoder without a self-attention layer. This design promotes One-to-Many Relation Assignment by enabling different queries to predict the same relation. Hydra-SGG achieves state-of-the-art performance with 10.6 mR@20 and 16.0 mR@50 on VG150, while only requiring 12 training epochs. It also sets a new state-of-the-art on Open Images V6 and GQA.
Published: 2024

4. Neural Clustering based Visual Representation Learning

Author: Chen, Guikun, Li, Xia, Yang, Yi, and Wang, Wenguan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We investigate a fundamental aspect of machine vision: the measurement of features, by revisiting clustering, one of the most classic approaches in machine learning and data analysis. Existing visual feature extractors, including ConvNets, ViTs, and MLPs, represent an image as rectangular regions. Though prevalent, such a grid-style paradigm is built upon engineering practice and lacks explicit modeling of data distribution. In this work, we propose feature extraction with clustering (FEC), a conceptually elegant yet surprisingly ad-hoc interpretable neural clustering framework, which views feature extraction as a process of selecting representatives from data and thus automatically captures the underlying data distribution. Given an image, FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives. Such an iterative working mechanism is implemented in the form of several neural layers and the final representatives can be used for downstream tasks. The cluster assignments across layers, which can be viewed and inspected by humans, make the forward process of FEC fully transparent and empower it with promising ad-hoc interpretability. Extensive experiments on various visual recognition models and tasks verify the effectiveness, generality, and interpretability of FEC. We expect this work will provoke a rethink of the current de facto grid-style paradigm.
Comment: CVPR 2024. Code: https://github.com/guikunchen/FEC/
Published: 2024

5. DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)

Author: Yang, Zongxin, Chen, Guikun, Li, Xiaodi, Wang, Wenguan, and Yang, Yi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Recent LLM-driven visual agents mainly focus on solving image-based tasks, which limits their ability to understand dynamic scenes, making it far from real-life applications like guiding students in laboratory experiments and identifying their mistakes. Hence, this paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Considering the video modality better reflects the ever-changing nature of real-world scenarios, we exemplify DoraemonGPT as a video agent. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. This structured representation allows for spatial-temporal querying and reasoning by well-designed sub-task tools, resulting in concise intermediate results. Recognizing that LLMs have limited internal knowledge when it comes to specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to assess external knowledge and address tasks across different domains. Moreover, a novel LLM-driven planner based on Monte Carlo Tree Search is introduced to explore the large planning space for scheduling various tools. The planner iteratively finds feasible solutions by backpropagating the result's reward, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios. The code will be released at https://github.com/z-x-yang/DoraemonGPT.
Published: 2024

6. A Survey on 3D Gaussian Splatting

Author: Chen, Guikun and Wang, Wenguan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Graphics, Computer Science - Multimedia
Abstract: 3D Gaussian splatting (GS) has recently emerged as a transformative technique in the realm of explicit radiance field and computer graphics. This innovative approach, characterized by the utilization of millions of learnable 3D Gaussians, represents a significant departure from mainstream neural radiance field approaches, which predominantly use implicit, coordinate-based models to map spatial coordinates to pixel values. 3D GS, with its explicit scene representation and differentiable rendering algorithm, not only promises real-time rendering capability but also introduces unprecedented levels of editability. This positions 3D GS as a potential game-changer for the next generation of 3D reconstruction and representation. In the present paper, we provide the first systematic overview of the recent developments and critical contributions in the domain of 3D GS. We begin with a detailed exploration of the underlying principles and the driving forces behind the emergence of 3D GS, laying the groundwork for understanding its significance. A focal point of our discussion is the practical applicability of 3D GS. By enabling unprecedented rendering speed, 3D GS opens up a plethora of applications, ranging from virtual reality to interactive media and beyond. This is complemented by a comparative analysis of leading 3D GS models, evaluated across various benchmark tasks to highlight their performance and practical utility. The survey concludes by identifying current challenges and suggesting potential avenues for future research in this domain. Through this survey, we aim to provide a valuable resource for both newcomers and seasoned researchers, fostering further exploration and advancement in applicable and explicit radiance field representation.
Comment: Ongoing project
Published: 2024

7. Compositional Feature Augmentation for Unbiased Scene Graph Generation

Author: Li, Lin, Chen, Guikun, Xiao, Jun, Yang, Yi, Wang, Chunping, and Chen, Long
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene Graph Generation (SGG) aims to detect all the visual relation triplets in a given image. With the emergence of various advanced techniques for better utilizing both the intrinsic and extrinsic information in each relation triplet, SGG has achieved great progress over the recent years. However, due to the ubiquitous long-tailed predicate distributions, today's SGG models are still easily biased to the head predicates. Currently, the most prevalent debiasing solutions for SGG are re-balancing methods, e.g., changing the distributions of original training samples. In this paper, we argue that all existing re-balancing strategies fail to increase the diversity of the relation triplet features of each predicate, which is critical for robust SGG. To this end, we propose a novel Compositional Feature Augmentation (CFA) strategy, which is the first unbiased SGG work to mitigate the bias issue from the perspective of increasing the diversity of triplet features. Specifically, we first decompose each relation triplet feature into two components: intrinsic feature and extrinsic feature, which correspond to the intrinsic characteristics and extrinsic contexts of a relation triplet, respectively. Then, we design two different feature augmentation modules to enrich the feature diversity of original relation triplets by replacing or mixing up either their intrinsic or extrinsic features from other samples. Due to its model-agnostic nature, CFA can be seamlessly incorporated into various SGG frameworks. Extensive ablations have shown that CFA achieves a new state-of-the-art performance on the trade-off between different metrics.
Comment: Accepted by ICCV 2023
Published: 2023

8. Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models

Author: Li, Lin, Xiao, Jun, Chen, Guikun, Shao, Jian, Zhuang, Yueting, and Chen, Long
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE.
Published: 2023

9. Decomposed Prototype Learning for Few-Shot Scene Graph Generation

Author: Li, Xingchen, Chen, Long, Chen, Guikun, Feng, Yinfu, Yang, Yi, and Xiao, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Today's scene graph generation (SGG) models typically require abundant manual annotations to learn new predicate types. Thus, it is difficult to apply them to real-world applications with a long-tailed distribution of predicates. In this paper, we focus on a new promising task of SGG: few-shot SGG (FSSGG). FSSGG encourages models to be able to quickly transfer previous knowledge and recognize novel predicates well with only a few examples. Although many advanced approaches have achieved great success on few-shot learning (FSL) tasks, straightforwardly extending them into FSSGG is not applicable due to two intrinsic characteristics of predicate concepts: 1) Each predicate category commonly has multiple semantic meanings under different contexts. 2) The visual appearance of relation triplets with the same predicate differs greatly under different subject-object pairs. Both issues make it hard to model conventional latent representations for predicate categories with state-of-the-art FSL methods. To this end, we propose a novel Decomposed Prototype Learning (DPL). Specifically, we first construct a decomposable prototype space to capture intrinsic visual patterns of subjects and objects for predicates, and enhance their feature representations with these decomposed prototypes. Then, we devise an intelligent metric learner to assign adaptive weights to each support sample by considering the relevance of their subject-object pairs. We further re-split the VG dataset and compare DPL with various FSL methods to benchmark this task. Extensive results show that DPL achieves excellent performance in both base and novel categories.
Published: 2023

10. Multiple nonlinear regression prediction model for process parameters of Al alloy self-piercing riveting

Author: Chen, Guikun, Zeng, Kai, Xing, Baoying, and He, Xiaocong
Published: 2022
Full Text: View/download PDF

11. Compositional Zero-shot Learning via Progressive Language-based Observations

Author: Li, Lin, Chen, Guikun, Xiao, Jun, and Chen, Long
Abstract: Compositional zero-shot learning aims to recognize unseen state-object compositions by leveraging known primitives (state and object) during training. However, effectively modeling interactions between primitives and generalizing knowledge to novel compositions remains a perennial challenge. There are two key factors: object-conditioned and state-conditioned variance, i.e., the appearance of states (or objects) can vary significantly when combined with different objects (or states). For instance, the state "old" can signify a vintage design for a "car" or an advanced age for a "cat". In this paper, we argue that these variances can be mitigated by predicting composition categories based on pre-observed primitive. To this end, we propose Progressive Language-based Observations (PLO), which can dynamically determine a better observation order of primitives. These observations comprise a series of concepts or languages that allow the model to understand image content in a step-by-step manner. Specifically, PLO adopts pre-trained vision-language models (VLMs) to empower the model with observation capabilities. We further devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing classifier dynamically determines the observation order of two primitives. 2) PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observing. Extensive ablations on three challenging datasets demonstrate the superiority of PLO compared with state-of-the-art methods, affirming its abilities in compositional recognition.
Published: 2023

12. 3-D HANet: A Flexible 3-D Heatmap Auxiliary Network for Object Detection

Author: Xia, Qiming, primary, Chen, Yidong, additional, Cai, Guorong, additional, Chen, Guikun, additional, Xie, Daoshun, additional, Su, Jinhe, additional, and Wang, Zongyue, additional
Published: 2023
Full Text: View/download PDF

13. An efficient tea quality classification algorithm based on near infrared spectroscopy and random Forest

Author: Zebiao Wu, Chen Guikun, Guorong Cai, Xiangchen Zhang, and Jinhe Su
Subjects: Quality (physics), General Chemical Engineering, Near-infrared spectroscopy, Food Science, Mathematics, Remote sensing, Random forest
Published: 2020

14. An efficient tea quality classification algorithm based on near infrared spectroscopy and random Forest

Author: Chen, Guikun, primary, Zhang, Xiangchen, additional, Wu, Zebiao, additional, Su, Jinhe, additional, and Cai, Guorong, additional
Published: 2020
Full Text: View/download PDF

15. An efficient tea quality classification algorithm based on near infrared spectroscopy and random Forest.

Author: Chen, Guikun, Zhang, Xiangchen, Wu, Zebiao, Su, Jinhe, and Cai, Guorong
Subjects: RANDOM forest algorithms, CLASSIFICATION algorithms, HIGH performance liquid chromatography, TEA, CHEMICAL testing
Abstract: Traditional tea quality evaluation methods are based on chemical testing, such as gas chromatography‐mass spectrometry (GCMS) and high‐performance liquid chromatography (HPLC). However, the process of extracting chemical components is generally time‐consuming and expensive, which makes it unsuitable for a wide range of applications. Therefore, this paper presents a new approach to evaluate tea quality based on Near‐infrared Spectroscopy (NIRS) devices. In our method, a factor analysis compression algorithm is first applied to compress the input NIRS vectors, which are acquired from tea samples with high dimensional data. Then, a random forest algorithm is employed to construct a voting strategy. More precisely, we propose a low‐cost and convenient tea quality estimation scheme that can be widely used in the tea industry. The proposed approach has been verified using tea NIRS datasets which were acquired from Fujian Province. Experiments show that the proposed NIRS‐based approach significantly outperforms the GCMS‐based and HPLC‐based methods. Specifically, we achieved a highly competitive performance (AP = 0.989) on the comprehensive data set that contains 869 annotated Chinese tea samples, which means that tea quality can be estimated in a convenient and cheaper way. Practical Applications: The proposed tea classification approach, based on artificial intelligence, lends new perspectives to the insight and decision‐making of tea merchants and consumers. The approach can perform preference adjustments in various conditions such as regions, crowd habits, seasons, etc. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF
