Author: "Zhu, Yousong" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhu, Yousong"' showing total 41 results

1. Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models

Author: Zhan, Yufei, Zhao, Hongyin, Zhu, Yousong, Yang, Fan, Tang, Ming, and Wang, Jinqiao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Large Multimodal Models (LMMs) have achieved significant breakthroughs in various vision-language and vision-centric tasks based on auto-regressive modeling. However, these models typically focus on either vision-centric tasks, such as visual grounding and region description, or vision-language tasks, such as image captioning and multi-scenario VQA. No LMM has yet comprehensively unified both types of tasks within a single model, as Large Language Models have done in the natural language processing field. Furthermore, even with abundant multi-task instruction-following data, directly stacking these data to extend universal capabilities remains challenging. To address these issues, we introduce a novel multi-dimension curated and consolidated multimodal dataset, named CCMD-8M, which overcomes the data barriers of unifying vision-centric and vision-language tasks through multi-level data curation and multi-task consolidation. More importantly, we present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm. Griffon-G resolves the training collapse issue encountered during the joint optimization of these tasks, achieving better training efficiency. Evaluations across multimodal benchmarks, general Visual Question Answering (VQA) tasks, scene text-centric VQA tasks, document-related VQA tasks, Referring Expression Comprehension, and object detection demonstrate that Griffon-G surpasses the advanced LMMs and achieves expert-level performance in complicated vision-centric tasks.
Comment: This work has been submitted to the IEEE for possible publication. Codes and data will be released later at https://github.com/jefferyZhan/Griffon
Published: 2024

2. Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Author: Zhan, Yufei, Zhu, Yousong, Zhao, Hongyin, Yang, Fan, Tang, Ming, and Wang, Jinqiao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Large Vision Language Models have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents and counting. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete contexts and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, codes and models will be released at https://github.com/jefferyZhan/Griffon.
Comment: Tech report, work in progress. Codes, models and datasets will be released at https://github.com/jefferyZhan/Griffon
Published: 2024

3. Griffon: Spelling Out All Object Locations at Any Granularity with Large Language Models

Author: Zhan, Yufei, Zhu, Yousong, Chen, Zhiyang, Yang, Fan, Tang, Ming, and Wang, Jinqiao
Editor: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa (Editorial Board Member); Gao, Wen (Editorial Board Member); Steffen, Bernhard (Editorial Board Member); Yung, Moti (Editorial Board Member); Leonardis, Aleš (editor); Ricci, Elisa (editor); Roth, Stefan (editor); Russakovsky, Olga (editor); Sattler, Torsten (editor); and Varol, Gül (editor)
Published: 2025
Full Text: View/download PDF

4. Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Author: Zhan, Yufei, Zhu, Yousong, Chen, Zhiyang, Yang, Fan, Tang, Ming, and Wang, Jinqiao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Large Vision Language Models (LVLMs). Current LVLMs are predominantly constrained to locate a single, pre-existing object. This limitation leads to a compromise in model design, necessitating the introduction of visual expert models or customized head structures. Beyond these constraints, our research uncovers LVLMs' capability for basic object perception, allowing them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel Language-prompted Localization Dataset to fully unleash the capabilities of LVLMs in fine-grained object perception and precise location awareness. More importantly, we present Griffon, a purely LVLM-based baseline, which does not introduce any special tokens, expert models, or additional detection modules. It simply maintains a consistent structure with popular LVLMs by unifying data formats across various localization-related scenarios and is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that Griffon not only achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities but also approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO. Data, codes, and models are released at https://github.com/jefferyZhan/Griffon.
Comment: ECCV2024, Github: https://github.com/jefferyZhan/Griffon
Published: 2023

5. Efficient Masked Autoencoders with Self-Consistency

Author: Li, Zhaowen, Zhu, Yousong, Chen, Zhiyang, Li, Wei, Zhao, Chaoyang, Zhao, Rui, Tang, Ming, and Wang, Jinqiao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Inspired by masked language modeling (MLM) in natural language processing tasks, masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision. However, the high random mask ratio of MIM results in two serious problems: 1) the inadequate data utilization of images within each iteration prolongs pre-training, and 2) the high inconsistency of predictions results in unreliable generations, i.e., the prediction of the identical patch may be inconsistent in different mask rounds, leading to divergent semantics in the ultimately generated outcomes. To tackle these problems, we propose the efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency and increase the consistency of MIM. In particular, we present a parallel mask strategy that divides the image into K non-overlapping parts, each of which is generated by a random mask with the same mask ratio. Then the MIM task is conducted in parallel on all parts in an iteration and the model minimizes the loss between the predictions and the masked patches. In addition, we design self-consistency learning to further maintain the consistency of predictions of overlapping masked patches among parts. Overall, our method is able to exploit the data more efficiently and obtains reliable representations. Experiments on ImageNet show that EMAE achieves the best performance on ViT-Large with only 13% of MAE pre-training time using NVIDIA A100 GPUs. After pre-training on diverse datasets, EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Published: 2023

6. Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Author: Qi, Yu, Yang, Fan, Zhu, Yousong, Liu, Yufei, Wu, Liwei, Zhao, Rui, and Li, Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Autoregressive language modeling (ALM) has been successfully used in self-supervised pre-training in natural language processing (NLP). However, this paradigm has not achieved results comparable with other self-supervised approaches in computer vision (e.g., contrastive learning, masked image modeling). In this paper, we try to find the reason why autoregressive modeling does not work well on vision tasks. To tackle this problem, we fully analyze the limitations of visual autoregressive methods and propose a novel stochastic autoregressive image modeling method (named SAIM) with two simple designs. First, we employ a stochastic permutation strategy to generate effective and robust image context, which is critical for vision tasks. Second, we create a parallel encoder-decoder training process in which the encoder serves a role similar to the standard vision transformer, focusing on learning the whole contextual information, while the decoder predicts the content of the current position, so that the encoder and decoder can reinforce each other. By introducing stochastic prediction and the parallel encoder-decoder, SAIM significantly improves the performance of autoregressive image modeling. Our method achieves the best accuracy (83.9%) on the vanilla ViT-Base model among methods using only ImageNet-1K data. Transfer performance in downstream tasks also shows that our model achieves competitive performance.
Comment: Accepted by AAAI 2023
Published: 2022

7. Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

Author: Shu, Fangxun, Chen, Biaolong, Liao, Yue, Xiao, Shuwen, Sun, Wenyu, Li, Xiaobo, Zhu, Yousong, Wang, Jinqiao, and Liu, Si
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for video-text retrieval tasks. Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model through a mask sampling mechanism to improve pre-training efficiency. In contrast to conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and only feed visible regions into the encoder as sparse spatial sampling. Similarly, we adopt the mask sampling technique for text inputs for consistency. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn a robust and general multimodal representation from incomplete and unstable inputs. Coupling these designs enables efficient end-to-end pre-training: it reduces FLOPs (60% off), accelerates pre-training (by 3x), and improves performance. Our MAC achieves state-of-the-art results on various video-text retrieval datasets, including MSR-VTT, DiDeMo, and ActivityNet. Our approach is omnivorous to input modalities. With minimal modifications, we achieve competitive results on image-text retrieval tasks.
Comment: Technical Report
Published: 2022

8. Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks

Author: Chen, Zhiyang, Zhu, Yousong, Li, Zhaowen, Yang, Fan, Li, Wei, Wang, Haixin, Zhao, Chaoyang, Wu, Liwei, Zhao, Rui, Wang, Jinqiao, and Tang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual tasks vary a lot in their output formats and the contents they concern, so it is hard to process them with an identical structure. One main obstacle lies in the high-dimensional outputs in object-level visual tasks. In this paper, we propose an object-centric vision framework, Obj2Seq. Obj2Seq takes objects as basic units, and regards most object-level visual tasks as sequence generation problems of objects. Therefore, these visual tasks can be decoupled into two steps: first recognize objects of given categories, and then generate a sequence for each of these objects. The definition of the output sequences varies for different tasks, and the model is supervised by matching these sequences with ground-truth targets. Obj2Seq is able to flexibly determine input categories to satisfy customized requirements, and can be easily extended to different visual tasks. When experimenting on MS COCO, Obj2Seq achieves 45.7% AP on object detection, 89.0% AP on multi-label classification and 65.0% AP on human pose estimation. These results demonstrate its potential to be generally applied to different visual tasks. Code has been made available at: https://github.com/CASIA-IVA-Lab/Obj2Seq.
Comment: Accepted by NeurIPS 2022. Code available at https://github.com/CASIA-IVA-Lab/Obj2Seq
Published: 2022

9. UniVIP: A Unified Framework for Self-Supervised Visual Pre-training

Author: Li, Zhaowen, Zhu, Yousong, Yang, Fan, Li, Wei, Zhao, Chaoyang, Chen, Yingying, Chen, Zhiyang, Xie, Jiahao, Wu, Liwei, Zhao, Rui, Tang, Ming, and Wang, Jinqiao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Self-supervised learning (SSL) holds promise in leveraging large amounts of unlabeled data. However, the success of popular SSL methods has been limited to single-centric-object images like those in ImageNet, and these methods ignore the correlation among the scene and instances, as well as the semantic difference of instances in the scene. To address the above problems, we propose Unified Self-supervised Visual Pre-training (UniVIP), a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic datasets. The framework takes into account representation learning at three levels: 1) the similarity of scene-scene, 2) the correlation of scene-instance, 3) the discrimination of instance-instance. During learning, we adopt the optimal transport algorithm to automatically measure the discrimination of instances. Extensive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance on a variety of downstream tasks, such as image classification, semi-supervised learning, object detection and segmentation. Furthermore, our method can also exploit single-centric-object datasets such as ImageNet, outperforms BYOL by 2.5% with the same pre-training epochs in linear probing, and surpasses current self-supervised object detection methods on the COCO dataset, demonstrating its universality and potential.
Comment: Accepted by CVPR2022
Published: 2022

10. PASS: Part-Aware Self-Supervised Pre-Training for Person Re-Identification

Author: Zhu, Kuan, Guo, Haiyun, Yan, Tianyi, Zhu, Yousong, Wang, Jinqiao, and Tang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In person re-identification (ReID), very recent studies have validated that pre-training the models on unlabelled person images is much better than on ImageNet. However, these studies directly apply the existing self-supervised learning (SSL) methods designed for image classification to ReID without any adaptation in the framework. These SSL methods match the outputs of local views (e.g., red T-shirt, blue shorts) to those of the global views at the same time, losing lots of details. In this paper, we propose a ReID-specific pre-training method, Part-Aware Self-Supervised pre-training (PASS), which can generate part-level features to offer fine-grained information and is more suitable for ReID. PASS divides the images into several local areas, and the local views randomly cropped from each area are assigned a specific learnable [PART] token. On the other hand, the [PART]s of all local areas are also appended to the global views. PASS learns to match the output of the local views and global views on the same [PART]. That is, the learned [PART] of the local views from a local area is only matched with the corresponding [PART] learned from the global views. As a result, each [PART] can focus on a specific local area of the image and extract fine-grained information of this area. Experiments show PASS sets new state-of-the-art performance on Market1501 and MSMT17 on various ReID tasks, e.g., vanilla ViT-S/16 pre-trained by PASS achieves 92.2%/90.2%/88.5% mAP accuracy on Market1501 for supervised/UDA/USL ReID. Our codes are available at https://github.com/CASIA-IVA-Lab/PASS-reID.
Comment: Accepted by ECCV2022. Codes are available at https://github.com/CASIA-IVA-Lab/PASS-reID
Published: 2022

11. Study on the role of Dihuang Yinzi in regulating the AMPK/SIRT1/PGC-1α pathway to promote mitochondrial biogenesis and improve Alzheimer's disease

Author: Zhu, Chao, Zhang, Zheng, Zhu, Yousong, Du, Yuzhong, Han, Cheng, Zhao, Qiong, Li, Qinqing, Hou, Jiangqi, Zhang, Junlong, He, Wenbin, and Qin, Yali
Published: 2025
Full Text: View/download PDF

12. DPT: Deformable Patch-based Transformer for Visual Recognition

Author: Chen, Zhiyang, Zhu, Yousong, Zhao, Chaoyang, Hu, Guosheng, Zeng, Wei, Wang, Jinqiao, and Tang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Transformers have achieved great success in computer vision, but how to split an image into patches remains a problem. Existing methods usually use a fixed-size patch embedding, which might destroy the semantics of objects. To address this problem, we propose a new Deformable Patch (DePatch) module which learns to adaptively split the images into patches with different positions and scales in a data-driven way rather than using predefined fixed patches. In this way, our method can well preserve the semantics in patches. The DePatch module can work as a plug-and-play module, which can easily be incorporated into different transformers to achieve end-to-end training. We term this DePatch-embedded transformer the Deformable Patch-based Transformer (DPT) and conduct extensive evaluations of DPT on image classification and object detection. Results show DPT can achieve 81.9% top-1 accuracy on ImageNet classification, and 43.7% box mAP with RetinaNet, 44.3% with Mask R-CNN on MSCOCO object detection. Code has been made available at: https://github.com/CASIA-IVA-Lab/DPT.
Comment: In Proceedings of the 29th ACM International Conference on Multimedia (MM '21)
Published: 2021
Full Text: View/download PDF

13. MST: Masked Self-Supervised Transformer for Visual Representation

Author: Li, Zhaowen, Chen, Zhiyang, Yang, Fan, Li, Wei, Zhu, Yousong, Zhao, Chaoyang, Deng, Rui, Wu, Liwei, Zhao, Rui, Tang, Ming, and Wang, Jinqiao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and has achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks focusing on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to downstream dense prediction tasks. The experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves Top-1 accuracy of 76.9% with DeiT-S using only 300-epoch pre-training by linear evaluation, which outperforms supervised methods with the same epoch by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training.
Comment: Accepted in NeurIPS 2021
Published: 2021

14. Adaptive Class Suppression Loss for Long-Tail Object Detection

Author: Wang, Tong, Zhu, Yousong, Zhao, Chaoyang, Zeng, Wei, Wang, Jinqiao, and Tang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: To address the problem of long-tail distribution for the large vocabulary object detection task, existing methods usually divide the whole categories into several groups and treat each group with different strategies. These methods bring the following two problems. One is the training inconsistency between adjacent categories of similar sizes, and the other is that the learned model lacks discrimination for tail categories which are semantically similar to some of the head categories. In this paper, we devise a novel Adaptive Class Suppression Loss (ACSL) to effectively tackle the above problems and improve the detection performance of tail categories. Specifically, we introduce a statistic-free perspective to analyze the long-tail distribution, breaking the limitation of manual grouping. According to this perspective, our ACSL adjusts the suppression gradients for each sample of each class adaptively, ensuring the training consistency and boosting the discrimination for rare categories. Extensive experiments on long-tail datasets LVIS and Open Images show that our ACSL achieves 5.18% and 5.2% improvements with ResNet50-FPN, and sets a new state of the art. Code and models are available at https://github.com/CASIA-IVA-Lab/ACSL.
Comment: CVPR2021 camera ready version
Published: 2021

15. CoupleNet: Coupling Global Structure with Local Parts for Object Detection

Author: Zhu, Yousong, Zhao, Chaoyang, Wang, Jinqiao, Zhao, Xu, Wu, Yi, and Lu, Hanqing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Region-based Convolutional Neural Network (CNN) detectors such as Faster R-CNN or R-FCN have already shown promising results for object detection by combining the region proposal subnetwork and the classification subnetwork together. Although R-FCN has achieved higher detection speed while keeping the detection performance, the global structure information is ignored by the position-sensitive score maps. To fully explore the local and global properties, in this paper, we propose a novel fully convolutional network, named CoupleNet, to couple the global structure with local parts for object detection. Specifically, the object proposals obtained by the Region Proposal Network (RPN) are fed into the coupling module, which consists of two branches. One branch adopts the position-sensitive RoI (PSRoI) pooling to capture the local part information of the object, while the other employs the RoI pooling to encode the global and context information. Next, we design different coupling strategies and normalization ways to make full use of the complementary advantages between the global and local branches. Extensive experiments demonstrate the effectiveness of our approach. We achieve state-of-the-art results on all three challenging datasets, i.e., a mAP of 82.7% on VOC07, 80.4% on VOC12, and 34.4% on COCO. Codes will be made publicly available.
Comment: Accepted by ICCV 2017
Published: 2017

16. The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers

Author: Chen, Zhiyang, primary, Zhu, Yousong, additional, Li, Zhaowen, additional, Yang, Fan, additional, Zhao, Chaoyang, additional, Wang, Jinqiao, additional, and Tang, Ming, additional
Published: 2024
Full Text: View/download PDF

17. Large Batch Optimization for Object Detection: Training COCO in 12 minutes

Author: Wang, Tong, Zhu, Yousong, Zhao, Chaoyang, Zeng, Wei, Wang, Yaowei, Wang, Jinqiao, and Tang, Ming
Editor: Goos, Gerhard (Founding Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa (Editorial Board Member); Gao, Wen (Editorial Board Member); Steffen, Bernhard (Editorial Board Member); Woeginger, Gerhard (Editorial Board Member); Yung, Moti (Editorial Board Member); Vedaldi, Andrea (editor); Bischof, Horst (editor); Brox, Thomas (editor); and Frahm, Jan-Michael (editor)
Published: 2020
Full Text: View/download PDF

18. PASS: Part-Aware Self-Supervised Pre-Training for Person Re-Identification

Author: Zhu, Kuan, primary, Guo, Haiyun, additional, Yan, Tianyi, additional, Zhu, Yousong, additional, Wang, Jinqiao, additional, and Tang, Ming, additional
Published: 2022
Full Text: View/download PDF

19. A novel data augmentation scheme for pedestrian detection with attribute preserving GAN

Author: Liu, Songyan, Guo, Haiyun, Hu, Jian-Guo, Zhao, Xu, Zhao, Chaoyang, Wang, Tong, Zhu, Yousong, Wang, Jinqiao, and Tang, Ming
Published: 2020
Full Text: View/download PDF

20. Food det: Detecting foods in refrigerator with supervised transformer network

Author: Zhu, Yousong, Zhao, Xu, Zhao, Chaoyang, Wang, Jinqiao, and Lu, Hanqing
Published: 2020
Full Text: View/download PDF

21. Efficient Masked Autoencoders With Self-Consistency

Author: Li, Zhaowen, primary, Zhu, Yousong, additional, Chen, Zhiyang, additional, Li, Wei, additional, Zhao, Rui, additional, Zhao, Chaoyang, additional, Tang, Ming, additional, and Wang, Jinqiao, additional
Published: 2024
Full Text: View/download PDF

22. Large Batch Optimization for Object Detection: Training COCO in 12 minutes

Author: Wang, Tong, primary, Zhu, Yousong, additional, Zhao, Chaoyang, additional, Zeng, Wei, additional, Wang, Yaowei, additional, Wang, Jinqiao, additional, and Tang, Ming, additional
Published: 2020
Full Text: View/download PDF

23. Scale-Adaptive Deconvolutional Regression Network for Pedestrian Detection

Author: Zhu, Yousong, Wang, Jinqiao, Zhao, Chaoyang, Guo, Haiyun, and Lu, Hanqing
Editor: Hutchison, David (Series editor); Kanade, Takeo (Series editor); Kittler, Josef (Series editor); Kleinberg, Jon M. (Series editor); Mattern, Friedemann (Series editor); Mitchell, John C. (Series editor); Naor, Moni (Series editor); Pandu Rangan, C. (Series editor); Steffen, Bernhard (Series editor); Terzopoulos, Demetri (Series editor); Tygar, Doug (Series editor); Weikum, Gerhard (Series editor); Lai, Shang-Hong (editor); Lepetit, Vincent (editor); Nishino, Ko (editor); and Sato, Yoichi (editor)
Published: 2017
Full Text: View/download PDF

24. Effects of Decoction of Dihuang Yinzi on Alzheimer's disease: a systematic review and meta-analysis

Author: ZHANG, Zheng, primary, ZHU, Yousong, additional, ZHU, Chao, additional, LI, Shaochuang, additional, ZHAO, Yuting, additional, Yang, Jie, additional, QIN, Yali, additional, HOU, Jiangqi, additional, ZHANG, Junlong, additional, and HAN, Cheng, additional
Published: 2023
Full Text: View/download PDF

25. Exploring Stochastic Autoregressive Image Modeling for Visual Representation

Author: Qi, Yu, primary, Yang, Fan, additional, Zhu, Yousong, additional, Liu, Yufei, additional, Wu, Liwei, additional, Zhao, Rui, additional, and Li, Wei, additional
Published: 2023
Full Text: View/download PDF

26. Advancements in the study of synaptic plasticity and mitochondrial autophagy relationship.

Author: Zhu, Yousong, Hui, Qinlong, Zhang, Zheng, Fu, Hao, Qin, Yali, Zhao, Qiong, Li, Qinqing, Zhang, Junlong, Guo, Lei, He, Wenbin, and Han, Cheng
Published: 2024
Full Text: View/download PDF

27. Mitigating Hallucination in Visual Language Models with Visual Supervision

Author: Chen, Zhiyang, Zhu, Yousong, Zhan, Yufei, Li, Zhaowen, Zhao, Chaoyang, Wang, Jinqiao, and Tang, Ming
Abstract: Large vision-language models (LVLMs) suffer heavily from hallucination, occasionally generating responses that clearly contradict the image content. The key problem lies in their weak ability to comprehend detailed content in a multi-modal context, which can be mainly attributed to two factors: the training data and the loss function. The vision instruction dataset primarily focuses on global description, and the auto-regressive loss function favors text modeling rather than image understanding. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs, so that they can generate more precise responses without hallucinating. On one hand, we generate image-text pairs with detailed relationship annotations from the panoptic scene graph dataset (PSG). These conversations pay more attention to detailed facts in the image, encouraging the model to answer questions based on multi-modal contexts. On the other hand, we integrate SAM and a mask prediction loss as auxiliary supervision, forcing the LVLMs to have the capacity to identify context-related objects, so that they can generate more accurate responses, mitigating hallucination. Moreover, to provide a deeper evaluation of hallucination in LVLMs, we propose a new benchmark, RAH-Bench. It divides vision hallucination into three different types that contradict the image with wrong categories, attributes or relations, and introduces False Positive Rate as a detailed sub-metric for each type. On this benchmark, our approach demonstrates a +8.4% enhancement compared to the original LLaVA and achieves widespread performance improvements across other models.
Published: 2023

28. Effects of Dihuang Yinzi Decoction on Alzheimer's Disease: A Systematic Review and Meta-Analysis.

Author: Zhang, Zheng, Zhu, Yousong, Zhu, Chao, Li, Shaochuang, Zhao, Yuting, Yang, Jie, Qin, Yali, Hou, Jiangqi, Zhang, Junlong, and Han, Cheng
Published: 2023
Full Text: View/download PDF

29. Scale-Adaptive Deconvolutional Regression Network for Pedestrian Detection

Author: Zhu, Yousong, primary, Wang, Jinqiao, additional, Zhao, Chaoyang, additional, Guo, Haiyun, additional, and Lu, Hanqing, additional
Published: 2017
Full Text: View/download PDF

30. UniVIP: A Unified Framework for Self-Supervised Visual Pre-training

Author: Li, Zhaowen, primary, Zhu, Yousong, additional, Yang, Fan, additional, Li, Wei, additional, Zhao, Chaoyang, additional, Chen, Yingying, additional, Chen, Zhiyang, additional, Xie, Jiahao, additional, Wu, Liwei, additional, Zhao, Rui, additional, Tang, Ming, additional, and Wang, Jinqiao, additional
Published: 2022
Full Text: View/download PDF

31. C2AM Loss: Chasing a Better Decision Boundary for Long-Tail Object Detection

Author: Wang, Tong, primary, Zhu, Yousong, additional, Chen, Yingying, additional, Zhao, Chaoyang, additional, Yu, Bin, additional, Wang, Jinqiao, additional, and Tang, Ming, additional
Published: 2022
Full Text: View/download PDF

32. The Devil is in Details: Delving into Lite FFN Design for Vision Transformers

Author: Chen, Zhiyang, primary, Zhu, Yousong, additional, Yang, Fan, additional, Li, Zhaowen, additional, Zhao, Chaoyang, additional, Wang, Jinqiao, additional, and Tang, Ming, additional
Published: 2022
Full Text: View/download PDF

33. DPT: Deformable Patch-based Transformer for Visual Recognition

Author: Chen, Zhiyang, primary, Zhu, Yousong, additional, Zhao, Chaoyang, additional, Hu, Guosheng, additional, Zeng, Wei, additional, Wang, Jinqiao, additional, and Tang, Ming, additional
Published: 2021
Full Text: View/download PDF

34. Attention-Guided Knowledge Distillation for Efficient Single-Stage Detector

Author: Wang, Tong, primary, Zhu, Yousong, additional, Zhao, Chaoyang, additional, Zhao, Xu, additional, Wang, Jinqiao, additional, and Tang, Ming, additional
Published: 2021
Full Text: View/download PDF

35. Adaptive Class Suppression Loss for Long-Tail Object Detection

Author: Wang, Tong, primary, Zhu, Yousong, additional, Zhao, Chaoyang, additional, Zeng, Wei, additional, Wang, Jinqiao, additional, and Tang, Ming, additional
Published: 2021
Full Text: View/download PDF

36. Dual Super-Resolution Learning for Semantic Segmentation

Author: Wang, Li, primary, Li, Dong, additional, Zhu, Yousong, additional, Tian, Lu, additional, and Shan, Yi, additional
Published: 2020
Full Text: View/download PDF

37. Mask Guided Knowledge Distillation for Single Shot Detector

Author: Zhu, Yousong, primary, Zhao, Chaoyang, additional, Han, Chenxia, additional, Wang, Jinqiao, additional, and Lu, Hanqing, additional
Published: 2019
Full Text: View/download PDF

38. Elite Loss for scene text detection

Author: Zhao, Xu, primary, Zhao, Chaoyang, additional, Guo, Haiyun, additional, Zhu, Yousong, additional, Tang, Ming, additional, and Wang, Jinqiao, additional
Published: 2019
Full Text: View/download PDF

39. Attention CoupleNet: Fully Convolutional Attention Coupling Network for Object Detection

Author: Zhu, Yousong, primary, Zhao, Chaoyang, additional, Guo, Haiyun, additional, Wang, Jinqiao, additional, Zhao, Xu, additional, and Lu, Hanqing, additional
Published: 2019
Full Text: View/download PDF

40. Improved Single Shot Object Detector Using Enhanced Features and Predicting Heads

Author: Zhao, Xu, primary, Zhao, Chaoyang, additional, Zhu, Yousong, additional, Tang, Ming, additional, and Wang, Jinqiao, additional
Published: 2018
Full Text: View/download PDF

41. CoupleNet: Coupling Global Structure with Local Parts for Object Detection

Author: Zhu, Yousong, primary, Zhao, Chaoyang, additional, Wang, Jinqiao, additional, Zhao, Xu, additional, Wu, Yi, additional, and Lu, Hanqing, additional
Published: 2017
Full Text: View/download PDF
