Author: "Hao, Shaozhe" - Searchworks@Jio Institute Digital Library Search Results

Search keyword "Hao, Shaozhe": 14 results in total


1. Elucidating the design space of language models for image generation

Author: Liu, Xuantong, Hao, Shaozhe, Qi, Xianbiao, Hu, Tianyang, Wang, Jun, Xiao, Rong, and Yao, Yuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The success of autoregressive (AR) language models in text generation has inspired the computer vision community to adopt Large Language Models (LLMs) for image generation. However, considering the essential differences between text and image modalities, the design space of language models for image generation remains underexplored. We observe that image tokens exhibit greater randomness compared to text tokens, which presents challenges when training with token prediction. Nevertheless, AR models demonstrate their potential by effectively learning patterns even from a seemingly suboptimal optimization problem. Our analysis also reveals that while all models successfully grasp the importance of local information in image generation, smaller models struggle to capture the global context. In contrast, larger models showcase improved capabilities in this area, helping to explain the performance gains achieved when scaling up model size. We further elucidate the design space of language models for vision generation, including tokenizer choice, model choice, model scalability, vocabulary design, and sampling strategy through extensive comparative experiments. Our work is the first to analyze the optimization behavior of language models in vision generation, and we believe it can inspire more effective designs when applying LMs to other domains. Finally, our elucidated language model for image generation, termed as ELM, achieves state-of-the-art performance on the ImageNet 256×256 benchmark. The code is available at https://github.com/Pepperlll/LMforImageGeneration.git.
Comment: Project page: https://pepper-lll.github.io/LMforImageGeneration/
Published: 2024

2. BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

Author: Hao, Shaozhe, Liu, Xuantong, Qi, Xianbiao, Zhao, Shihao, Zi, Bojia, Xiao, Rong, Han, Kai, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We introduce BiGR, a novel conditional image generation model using compact binary latent codes for generative training, focusing on enhancing both generation and representation capabilities. BiGR is the first conditional generative model that unifies generation and discrimination within the same framework. BiGR features a binary tokenizer, a masked modeling mechanism, and a binary transcoder for binary code prediction. Additionally, we introduce a novel entropy-ordered sampling method to enable efficient image generation. Extensive experiments validate BiGR's superior performance in generation quality, as measured by FID-50k, and representation capabilities, as evidenced by linear-probe accuracy. Moreover, BiGR showcases zero-shot generalization across various vision tasks, enabling applications such as image inpainting, outpainting, editing, interpolation, and enrichment, without the need for structural modifications. Our findings suggest that BiGR unifies generative and discriminative tasks effectively, paving the way for further advancements in the field.
Comment: Project page: https://haoosz.github.io/BiGR
Published: 2024

3. CusConcept: Customized Visual Concept Decomposition with Diffusion Models

Author: Xu, Zhi, Hao, Shaozhe, and Han, Kai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Enabling generative models to decompose visual concepts from a single image is a complex and challenging problem. In this paper, we study a new and challenging task, customized concept decomposition, wherein the objective is to leverage diffusion models to decompose a single image and generate visual concepts from various perspectives. To address this challenge, we propose a two-stage framework, CusConcept (short for Customized Visual Concept Decomposition), to extract customized visual concept embedding vectors that can be embedded into prompts for text-to-image generation. In the first stage, CusConcept employs a vocabulary-guided concept decomposition mechanism to build vocabularies along human-specified conceptual axes. The decomposed concepts are obtained by retrieving corresponding vocabularies and learning anchor weights. In the second stage, joint concept refinement is performed to enhance the fidelity and quality of generated images. We further curate an evaluation benchmark for assessing the performance of the open-world concept decomposition task. Our approach can effectively generate high-quality images of the decomposed concepts and produce related lexical predictions as secondary outcomes. Extensive qualitative and quantitative experiments demonstrate the effectiveness of CusConcept.
Published: 2024

4. ArtiFade: Learning to Generate High-quality Subject from Blemished Images

Author: Yang, Shuya, Hao, Shaozhe, Cao, Yukang, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Subject-driven text-to-image generation has witnessed remarkable advancements in its ability to learn and capture characteristics of a subject using only a limited number of images. However, existing methods commonly rely on high-quality images for training and may struggle to generate reasonable images when the input images are blemished by artifacts. This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. Specifically, ArtiFade exploits fine-tuning of a pre-trained text-to-image model, aiming to remove artifacts. The elimination of artifacts is achieved by utilizing a specialized dataset that encompasses both unblemished images and their corresponding blemished counterparts during fine-tuning. ArtiFade also ensures the preservation of the original generative capabilities inherent within the diffusion model, thereby enhancing the overall performance of subject-driven methods in generating high-quality and artifact-free images. We further devise evaluation benchmarks tailored for this task. Through extensive qualitative and quantitative experiments, we demonstrate the generalizability of ArtiFade in effective artifact removal under both in-distribution and out-of-distribution scenarios.
Published: 2024

5. ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

Author: Hao, Shaozhe, Han, Kai, Lv, Zhengyao, Zhao, Shihao, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts. Given an image that contains multiple concepts, the task aims to extract and recreate individual concepts solely relying on the existing knowledge from pretrained diffusion models. To achieve this, we present ConceptExpress that tackles UCE by unleashing the inherent capabilities of pretrained diffusion models in two aspects. Specifically, a concept localization approach automatically locates and disentangles salient concepts by leveraging spatial correspondence from diffusion self-attention; and based on the lookup association between a concept and a conceptual token, a concept-wise optimization process learns discriminative tokens that represent each individual concept. Finally, we establish an evaluation protocol tailored for the UCE task. Extensive experiments demonstrate that ConceptExpress is a promising solution to the UCE task. Our code and data are available at: https://github.com/haoosz/ConceptExpress
Comment: ECCV 2024, Project page: https://haoosz.github.io/ConceptExpress/
Published: 2024

6. Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

Author: Zhao, Shihao, Hao, Shaozhe, Zi, Bojia, Xu, Huaizhe, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-image generation has made significant advancements with the introduction of text-to-image diffusion models. These models typically consist of a language model that interprets user prompts and a vision model that generates corresponding images. As language and vision models continue to progress in their respective domains, there is a great potential in exploring the replacement of components in text-to-image diffusion models with more advanced counterparts. A broader research objective would therefore be to investigate the integration of any two unrelated language and generative vision models for text-to-image generation. In this paper, we explore this objective and propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and plug-and-play approach without requiring modifications to the original weights of the language and vision models. Our pipeline is compatible with various language models and generative vision models, accommodating different structures. Within this framework, we demonstrate that incorporating superior modules, such as more advanced language models or generative vision models, results in notable improvements in capabilities like text alignment or image quality. Extensive evaluations have been conducted to verify the effectiveness of LaVi-Bridge. Code is available at https://github.com/ShihaoZhaoZSH/LaVi-Bridge.
Published: 2024

7. ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation

Author: Hao, Shaozhe, Han, Kai, Zhao, Shihao, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Personalized text-to-image generation using diffusion models has recently emerged and garnered significant interest. This task learns a novel concept (e.g., a unique toy), illustrated in a handful of images, into a generative model that captures fine visual details and generates photorealistic images based on textual embeddings. In this paper, we present ViCo, a novel lightweight plug-and-play method that seamlessly integrates visual condition into personalized text-to-image generation. ViCo stands out for its unique feature of not requiring any fine-tuning of the original diffusion model parameters, thereby facilitating more flexible and scalable model deployment. This key advantage distinguishes ViCo from most existing models that necessitate partial or full diffusion fine-tuning. ViCo incorporates an image attention module that conditions the diffusion process on patch-wise visual semantics, and an attention-based object mask that comes at no extra cost from the attention module. Despite only requiring light parameter training (~6% compared to the diffusion U-Net), ViCo delivers performance that is on par with, or even surpasses, all state-of-the-art models, both qualitatively and quantitatively. This underscores the efficacy of ViCo, making it a highly promising solution for personalized text-to-image generation without the need for diffusion model fine-tuning. Code: https://github.com/haoosz/ViCo
Comment: Under review
Published: 2023

8. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Author: Zhao, Shihao, Chen, Dongdong, Chen, Yen-Chun, Bao, Jianmin, Hao, Shaozhe, Yuan, Lu, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images based on open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex texts. Moreover, recent studies have also shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need to enable more control modes beyond text description. In this paper, we introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth map, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one single model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet.
Comment: Camera Ready, Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet
Published: 2023

9. CiPR: An Efficient Framework with Cross-instance Positive Relations for Generalized Category Discovery

Author: Hao, Shaozhe, Han, Kai, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We tackle the issue of generalized category discovery (GCD). GCD considers the open-world problem of automatically clustering a partially labelled dataset, in which the unlabelled data may contain instances from both novel categories and labelled classes. In this paper, we address the GCD problem with an unknown category number for the unlabelled data. We propose a framework, named CiPR, to bootstrap the representation by exploiting Cross-instance Positive Relations in the partially labelled data for contrastive learning, which have been neglected in existing methods. To obtain reliable cross-instance relations to facilitate representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which can produce a clustering hierarchy directly from the connected components of a graph constructed from selective neighbors. We further present a method to estimate the unknown class number using SNC with a joint reference score that considers clustering indexes of both labelled and unlabelled data, and extend SNC to allow label assignment for the unlabelled instances with a given class number. We thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, and establish a new state-of-the-art. Code: https://github.com/haoosz/CiPR
Comment: Accepted to TMLR. Code: https://github.com/haoosz/CiPR
Published: 2023

10. Learning Attention as Disentangler for Compositional Zero-shot Learning

Author: Hao, Shaozhe, Han, Kai, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Compositional zero-shot learning (CZSL) aims at learning visual concepts (i.e., attributes and objects) from seen compositions and combining concept knowledge into unseen compositions. The key to CZSL is learning the disentanglement of the attribute-object composition. To this end, we propose to exploit cross-attentions as compositional disentanglers to learn disentangled concept embeddings. For example, if we want to recognize an unseen composition "yellow flower", we can learn the attribute concept "yellow" and object concept "flower" from different yellow objects and different flowers respectively. To further constrain the disentanglers to learn the concept of interest, we employ a regularization at the attention level. Specifically, we adapt the earth mover's distance (EMD) as a feature similarity metric in the cross-attention module. Moreover, benefiting from concept disentanglement, we improve the inference process and tune the prediction score by combining multiple concept probabilities. Comprehensive experiments on three CZSL benchmark datasets demonstrate that our method significantly outperforms previous works in both closed- and open-world settings, establishing a new state-of-the-art.
Comment: CVPR 2023, available at https://haoosz.github.io/ade-czsl/
Published: 2023

11. A Unified Framework for Masked and Mask-Free Face Recognition via Feature Rectification

Author: Hao, Shaozhe, Chen, Chaofeng, Chen, Zhenfang, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Face recognition under ideal conditions is now considered a well-solved problem with advances in deep learning. Recognizing faces under occlusion, however, still remains a challenge. Existing techniques often fail to recognize faces with both the mouth and nose covered by a mask, which is now very common under the COVID-19 pandemic. Common approaches to tackle this problem include 1) discarding information from the masked regions during recognition and 2) restoring the masked regions before recognition. Very few works considered the consistency between features extracted from masked faces and from their mask-free counterparts. This resulted in models trained for recognizing masked faces often showing degraded performance on mask-free faces. In this paper, we propose a unified framework, named Face Feature Rectification Network (FFR-Net), for recognizing both masked and mask-free faces alike. We introduce rectification blocks to rectify features extracted by a state-of-the-art recognition model, in both spatial and channel dimensions, to minimize the distance between a masked face and its mask-free counterpart in the rectified feature space. Experiments show that our unified framework can learn a rectified feature space for recognizing both masked and mask-free faces effectively, achieving state-of-the-art results. Project code: https://github.com/haoosz/FFR-Net
Comment: 5 pages, 4 figures, conference
Published: 2022

12. Learning Attention as Disentangler for Compositional Zero-Shot Learning

Author: Hao, Shaozhe, primary, Han, Kai, additional, and Wong, Kwan-Yee K., additional
Published: 2023
Full Text: View/download PDF

13. ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

Author: Hao, Shaozhe, Han, Kai, Zhao, Shihao, and Wong, Kwan-Yee K.
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Personalized text-to-image generation using diffusion models has recently been proposed and attracted lots of attention. Given a handful of images containing a novel concept (e.g., a unique toy), we aim to tune the generative model to capture fine visual details of the novel concept and generate photorealistic images following a text condition. We present a plug-in method, named ViCo, for fast and lightweight personalized generation. Specifically, we propose an image attention module to condition the diffusion process on the patch-wise visual semantics. We introduce an attention-based object mask that comes almost at no cost from the attention module. In addition, we design a simple regularization based on the intrinsic properties of text-image attention maps to alleviate the common overfitting degradation. Unlike many existing models, our method does not finetune any parameters of the original diffusion model. This allows more flexible and transferable model deployment. With only light parameter training (~6% of the diffusion U-Net), our method achieves comparable or even better performance than all state-of-the-art models both qualitatively and quantitatively.
Comment: Under review
Published: 2023

14. A Unified Framework for Masked and Mask-Free Face Recognition Via Feature Rectification

Author: Hao, Shaozhe, primary, Chen, Chaofeng, additional, Chen, Zhenfang, additional, and Wong, Kwan-Yee K., additional
Published: 2022
Full Text: View/download PDF
