Author: "Wu, Xiaoshi" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wu, Xiaoshi"' showing total 21 results

Start Over Author "Wu, Xiaoshi"

21 results on '"Wu, Xiaoshi"'

1. Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

Author: Wu, Xiaoshi, Hao, Yiming, Zhang, Manyuan, Sun, Keqiang, Huang, Zhaoyang, Song, Guanglu, Liu, Yu, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Optimizing a text-to-image diffusion model with a given reward function is an important but underexplored research area. In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-to-image diffusion model and back-propagates through the iterative sampling process to the input noise. We find that training earlier steps in the sampling process is crucial for low-level rewards, and deep supervision can be achieved efficiently and effectively by stopping the gradient of the denoising network input. DRTune is extensively evaluated on various reward models. It consistently outperforms other algorithms, particularly for low-level control signals, where all shallow supervision methods fail. Additionally, we fine-tune Stable Diffusion XL 1.0 (SDXL 1.0) model via DRTune to optimize Human Preference Score v2.1, resulting in the Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances image quality compared to SDXL 1.0 and reaches comparable quality compared with Midjourney v5.2., Comment: N/A
Published: 2024

2. CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Author: Jiang, Dongzhi, Song, Guanglu, Wu, Xiaoshi, Zhang, Renrui, Shen, Dazhong, Zong, Zhuofan, Liu, Yu, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL in two text-to-image alignment benchmarks and achieves start-of-the-art performance., Comment: Project Page: https://caraj7.github.io/comat
Published: 2024

3. ECNet: Effective Controllable Text-to-Image Diffusion Models

Author: Li, Sicheng, Sun, Keqiang, Lai, Zhixin, Wu, Xiaoshi, Qiu, Feng, Xie, Haoran, Miyata, Kazunori, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised mainly for two reasons, ambiguous condition input and inadequate condition guidance over single denoising loss. To address the challenges, we introduce two innovative solutions. Firstly, we propose a Spatial Guidance Injector (SGI) which enhances conditional detail by encoding text inputs with precise annotation information. This method directly tackles the issue of ambiguous control inputs by providing clear, annotated guidance to the model. Secondly, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss (DCL), which applies supervision on the denoised latent code at any given time step. This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output. The combination of SGI and DCL results in our Effective Controllable Network (ECNet), which offers a more accurate controllable end-to-end text-to-image generation framework with a more precise conditioning input and stronger controllable supervision. We validate our approach through extensive experiments on generation under various conditions, such as human body skeletons, facial landmarks, and sketches of general objects. The results consistently demonstrate that our method significantly enhances the controllability and robustness of the generated images, outperforming existing state-of-the-art controllable text-to-image models.
Published: 2024

4. Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

Author: Wang, Fu-Yun, Wu, Xiaoshi, Huang, Zhaoyang, Shi, Xiaoyu, Shen, Dazhong, Song, Guanglu, Liu, Yu, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA Mastering Video Outpainting Through Input-Specific Adaptation, a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase involves conducting efficient and effective pseudo outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video, as well as bridging the gap between standard generative processes and outpainting. The subsequent phase, pattern-aware outpainting, is dedicated to the generalization of these learned patterns to generate outpainting outcomes. Additional strategies including spatial-aware insertion and noise travel are proposed to better leverage the diffusion model's generative prior and the acquired video patterns from source videos. Extensive evaluations underscore MOTIA's superiority, outperforming existing state-of-the-art methods in widely recognized benchmarks. Notably, these advancements are achieved without necessitating extensive, task-specific tuning., Comment: Code will be available at https://github.com/G-U-N/Be-Your-Outpainter
Published: 2024

5. Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

Author: Wu, Xiaoshi, Hao, Yiming, Zhang, Manyuan, Sun, Keqiang, Huang, Zhaoyang, Song, Guanglu, Liu, Yu, Li, Hongsheng, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

6. Be-Your-Outpainter: Mastering Video Outpainting Through Input-Specific Adaptation

Author: Wang, Fu-Yun, Wu, Xiaoshi, Huang, Zhaoyang, Shi, Xiaoyu, Shen, Dazhong, Song, Guanglu, Liu, Yu, Li, Hongsheng, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

7. JourneyDB: A Benchmark for Generative Image Understanding

Author: Sun, Keqiang, Pan, Junting, Ge, Yuying, Li, Hao, Duan, Haodong, Wu, Xiaoshi, Zhang, Renrui, Zhou, Aojun, Qin, Zipeng, Wang, Yi, Dai, Jifeng, Qiao, Yu, Wang, Limin, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: While recent advancements in vision-language models have had a transformative impact on multi-modal comprehension, the extent to which these models possess the ability to comprehend generated images remains uncertain. Synthetic images, in comparison to real data, encompass a higher level of diversity in terms of both content and style, thereby presenting significant challenges for the models to fully grasp. In light of this challenge, we introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images within the context of multi-modal visual understanding. Our meticulously curated dataset comprises 4 million distinct and high-quality generated images, each paired with the corresponding text prompts that were employed in their creation. Furthermore, we additionally introduce an external subset with results of another 22 text-to-image generative models, which makes JourneyDB a comprehensive benchmark for evaluating the comprehension of generated images. On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension in relation to both content and style interpretation. These benchmarks encompass prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, we evaluate the performance of state-of-the-art multi-modal models when applied to the JourneyDB dataset, providing a comprehensive analysis of their strengths and limitations in comprehending generated content. We anticipate that the proposed dataset and benchmarks will facilitate further research in the field of generative content understanding. The dataset is publicly available at https://journeydb.github.io., Comment: Accepted to the Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)
Published: 2023

8. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Author: Wu, Xiaoshi, Hao, Yiming, Sun, Keqiang, Chen, Yixiong, Zhu, Feng, Zhao, Rui, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Databases
Abstract: Recent text-to-image generative models can generate high-fidelity images from text inputs, but the quality of these generated images cannot be accurately evaluated by existing evaluation metrics. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, which is a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that can more accurately predict human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair and easy-to-use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from the academic, community and industry. The code and dataset is available at https://github.com/tgxs002/HPSv2 ., Comment: Revision
Published: 2023

9. Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

Author: Wu, Xiaoshi, Sun, Keqiang, Zhu, Feng, Zhao, Rui, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Recent years have witnessed a rapid growth of deep generative models, with text-to-image models gaining significant attention from the public. However, existing models often generate images that do not align well with human preferences, such as awkward combinations of limbs and facial expressions. To address this issue, we collect a dataset of human choices on generated images from the Stable Foundation Discord channel. Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices. Thus, we train a human preference classifier with the collected dataset and derive a Human Preference Score (HPS) based on the classifier. Using HPS, we propose a simple yet effective method to adapt Stable Diffusion to better align with human preferences. Our experiments show that HPS outperforms CLIP in predicting human choices and has good generalization capability toward images generated from other models. By tuning Stable Diffusion with the guidance of HPS, the adapted model is able to generate images that are more preferred by human users. The project page is available here: https://tgxs002.github.io/align_sd_web/ ., Comment: Accepted by ICCV 2023
Published: 2023

10. CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

Author: Wu, Xiaoshi, Zhu, Feng, Zhao, Rui, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learning generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA$^+$ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark., Comment: 11 pages, 4 figures. Accepted by CVPR 2023
Published: 2023

11. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

Author: Zhu, Xizhou, Zhu, Jinguo, Li, Hao, Wu, Xiaoshi, Wang, Xiaogang, Li, Hongsheng, Wang, Xiaohua, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Biological intelligence systems of animals perceive the world by integrating information in different modalities and processing simultaneously for various tasks. In contrast, current machine learning research follows a task-specific paradigm, leading to inefficient collaboration between tasks and high marginal costs of developing perception models for new tasks. In this paper, we present a generic perception architecture named Uni-Perceiver, which processes a variety of modalities and tasks with unified modeling and shared parameters. Specifically, Uni-Perceiver encodes different task inputs and targets from arbitrary modalities into a unified representation space with a modality-agnostic Transformer encoder and lightweight modality-specific tokenizers. Different perception tasks are modeled as the same formulation, that is, finding the maximum likelihood target for each input through the similarity of their representations. The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage. Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks. The performance can be improved to a level close to state-of-the-art methods by conducting prompt tuning on 1% of downstream task data. Full-data fine-tuning further delivers results on par with or better than state-of-the-art results. Code shall be released.
Published: 2021

12. Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Author: Wu, Xiaoshi, Averbuch-Elor, Hadar, Sun, Jin, and Snavely, Noah
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos. However, a major source of information available for these 3D-augmented collections---namely language, e.g., from image captions---has been virtually untapped. In this work, we present WikiScenes, a new, large-scale dataset of landmark photo collections that contains descriptive text in the form of captions and hierarchical category names. WikiScenes forms a new testbed for multimodal reasoning involving images, text, and 3D geometry. We demonstrate the utility of WikiScenes for learning semantic concepts over images and 3D models. Our weakly-supervised framework connects images, 3D structure, and semantics---utilizing the strong constraints provided by 3D geometry---to associate semantic concepts to image pixels and 3D points., Comment: Published in ICCV 2021; Project webpage: https://www.cs.cornell.edu/projects/babel/
Published: 2021

13. CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

Author: Wu, Xiaoshi, primary, Zhu, Feng, additional, Zhao, Rui, additional, and Li, Hongsheng, additional
Published: 2023
Full Text: View/download PDF

14. Better Aligning Text-to-Image Models with Human Preference

Author: Wu, Xiaoshi, Sun, Keqiang, Zhu, Feng, Zhao, Rui, and Li, Hongsheng
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent years have witnessed a rapid growth of deep generative models, with text-to-image models gaining significant attention from the public. However, existing models often generate images that do not align well with human aesthetic preferences, such as awkward combinations of limbs and facial expressions. To address this issue, we collect a dataset of human choices on generated images from the Stable Foundation Discord channel. Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices. Thus, we train a human preference classifier with the collected dataset and derive a Human Preference Score (HPS) based on the classifier. Using the HPS, we propose a simple yet effective method to adapt Stable Diffusion to better align with human aesthetic preferences. Our experiments show that the HPS outperforms CLIP in predicting human choices and has good generalization capability towards images generated from other models. By tuning Stable Diffusion with the guidance of the HPS, the adapted model is able to generate images that are more preferred by human users., 15 pages, 11 figures
Published: 2023

15. JourneyDB: A Benchmark for Generative Image Understanding

Author: Pan, Junting, Sun, Keqiang, Ge, Yuying, Li, Hao, Duan, Haodong, Wu, Xiaoshi, Zhang, Renrui, Zhou, Aojun, Qin, Zipeng, Wang, Yi, Dai, Jifeng, Qiao, Yu, and Li, Hongsheng
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: While recent advancements in vision-language models have revolutionized multi-modal understanding, it remains unclear whether they possess the capabilities of comprehending the generated images. Compared to real data, synthetic images exhibit a higher degree of diversity in both content and style, for which there are significant difficulties for the models to fully apprehend. To this end, we present a large-scale dataset, JourneyDB, for multi-modal visual understanding in generative images. Our curated dataset covers 4 million diverse and high-quality generated images paired with the text prompts used to produce them. We further design 4 benchmarks to quantify the performance of generated image understanding in terms of both content and style interpretation. These benchmarks include prompt inversion, style retrieval, image captioning and visual question answering. Lastly, we assess the performance of current state-of-the-art multi-modal models when applied to JourneyDB, and provide an in-depth analysis of their strengths and limitations in generated content understanding. We hope the proposed dataset and benchmarks will facilitate the research in the field of generative content understanding. The dataset will be available on https://journeydb.github.io.
Published: 2023
Full Text: View/download PDF

16. Children with low effortful control benefit in high-quality home learning environment: Evidence from China.

Author: Li, Xiaowei, primary, Wu, Xiaoshi, additional, and Liu, Qianqian, additional
Published: 2022
Full Text: View/download PDF

17. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

Author: Zhu, Xizhou, primary, Zhu, Jinguo, additional, Li, Hao, additional, Wu, Xiaoshi, additional, Li, Hongsheng, additional, Wang, Xiaohua, additional, and Dai, Jifeng, additional
Published: 2022
Full Text: View/download PDF

18. Effect of Terminalia chebula Retz. extraction with water on Staphylococcus epidermidis activity and its biofilm formation

Author: Zheng, ShiQian, primary, Wu, XiaoShi, additional, Cai, ZhiLing, additional, Deng, RongRong, additional, and Shen, ZhiBin, additional
Published: 2022
Full Text: View/download PDF

19. Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Author: Wu, Xiaoshi, primary, Averbuch-Elor, Hadar, additional, Sun, Jin, additional, and Snavely, Noah, additional
Published: 2021
Full Text: View/download PDF

20. Quantitative Impact and Research on Water Supply Management and Demand in Beijing under the WEAP Model

Author: Yang, Liu, primary, Hao, Meng, additional, Cao, Qingyi, additional, Liu, Keling, additional, Xiao, Liangpeng, additional, Pei, Lin, additional, and Wu, Xiaoshi, additional
Published: 2020
Full Text: View/download PDF

21. Enrichment Process and Efficient Removal of Thallium from Steel Plant Desulfurization Wastewater

Author: Liu, Juan, primary, Lin, Yuyang, additional, Zhang, Weilong, additional, Yin, Meiling, additional, Wang, Jin, additional, Li, Nuo, additional, Luo, Xuwen, additional, Wei, Xudong, additional, Chen, Yongheng, additional, Wu, Yang, additional, Liu, Siyu, additional, Yu, Xiaoxiang, additional, Wu, Xiaoshi, additional, Zhang, Wenhui, additional, and Huang, Chunlin, additional
Published: 2019
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

21 results on '"Wu, Xiaoshi"'

1. Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

2. CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

3. ECNet: Effective Controllable Text-to-Image Diffusion Models

4. Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

5. Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

6. Be-Your-Outpainter: Mastering Video Outpainting Through Input-Specific Adaptation

7. JourneyDB: A Benchmark for Generative Image Understanding

8. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

9. Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

10. CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

11. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

12. Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

13. CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

14. Better Aligning Text-to-Image Models with Human Preference

15. JourneyDB: A Benchmark for Generative Image Understanding

16. Children with low effortful control benefit in high-quality home learning environment: Evidence from China.

17. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

18. Effect of Terminalia chebula Retz. extraction with water on Staphylococcus epidermidis activity and its biofilm formation

19. Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

20. Quantitative Impact and Research on Water Supply Management and Demand in Beijing under the WEAP Model

21. Enrichment Process and Efficient Removal of Thallium from Steel Plant Desulfurization Wastewater

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

21 results on '"Wu, Xiaoshi"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources