Author: "Yang, Ceyuan" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Yang, Ceyuan"' showing total 136 results


1. Scaling Laws For Diffusion Transformers

Author: Liang, Zhengyang, He, Hao, Yang, Ceyuan, and Dai, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content recreation, e.g., image and video generation. However, scaling laws of DiT are less explored, which usually offer precise predictions regarding optimal model size and data requirements given a specific compute budget. Therefore, experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs are conducted to confirm the existence of scaling laws in DiT for the first time. Concretely, the loss of pretraining DiT also follows a power-law relationship with the involved compute. Based on the scaling law, we can not only determine the optimal model size and required data but also accurately predict the text-to-image generation loss given a model with 1B parameters and a compute budget of 1e21 FLOPs. Additionally, we also demonstrate that the trend of pre-training loss matches the generation performances (e.g., FID), even across various datasets, which complements the mapping from compute to synthesis quality and thus provides a predictable benchmark that assesses model performance and data quality at a reduced cost.
Published: 2024

2. 3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting

Author: Zhang, Qihang, Xu, Yinghao, Wang, Chaoyang, Lee, Hsin-Ying, Wetzstein, Gordon, Zhou, Bolei, and Yang, Ceyuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene image editing is crucial for entertainment, photography, and advertising design. Existing methods solely focus on either 2D individual object or 3D global scene editing. This results in a lack of a unified approach to effectively control and manipulate scenes at the 3D level with different levels of granularity. In this work, we propose 3DitScene, a novel and unified scene editing framework leveraging language-guided disentangled Gaussian Splatting that enables seamless editing from 2D to 3D, allowing precise control over scene composition and individual objects. We first incorporate 3D Gaussians that are refined through generative priors and optimization techniques. Language features from CLIP then introduce semantics into 3D geometry for object disentanglement. With the disentangled Gaussians, 3DitScene allows for manipulation at both the global and individual levels, revolutionizing creative expression and empowering control over scenes and objects. Experimental results demonstrate the effectiveness and versatility of 3DitScene in scene image editing. Code and online demo can be found at our project homepage: https://zqh0253.github.io/3DitScene/.
Published: 2024

3. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Author: He, Hao, Xu, Yinghao, Guo, Yuwei, Wetzstein, Gordon, Dai, Bo, Li, Hongsheng, and Yang, Ceyuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Controllability plays a crucial role in video generation since it allows users to create desired content. However, existing models largely overlooked the precise control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for text-to-video (T2V) models. After precisely parameterizing the camera trajectory, a plug-and-play camera module is then trained on a T2V model, leaving others untouched. Additionally, a comprehensive study on the effect of various datasets is also conducted, suggesting that videos with diverse camera distribution and similar appearances indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise and domain-adaptive camera control, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs. Our project website is at: https://hehao13.github.io/projects-CameraCtrl/., Comment: Project page: https://hehao13.github.io/projects-CameraCtrl/ Code: https://github.com/hehao13/CameraCtrl
Published: 2024

4. GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation

Author: Xu, Yinghao, Shi, Zifan, Yifan, Wang, Chen, Hansheng, Yang, Ceyuan, Peng, Sida, Shen, Yujun, and Wetzstein, Gordon
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s. GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information to translate the input pixels into pixel-aligned Gaussians, which are unprojected to create a set of densely distributed 3D Gaussians representing a scene. Together, our transformer architecture and the use of 3D Gaussians unlock a scalable and efficient reconstruction framework. Extensive experimental results demonstrate the superiority of our method over alternatives regarding both reconstruction quality and efficiency. We also showcase the potential of GRM in generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models. Our project website is at: https://justimyhxu.github.io/projects/grm/., Comment: Project page: https://justimyhxu.github.io/projects/grm/ Code: https://github.com/justimyhxu/GRM
Published: 2024

5. Real-time 3D-aware Portrait Editing from a Single Image

Author: Bai, Qingyan, Shi, Zifan, Xu, Yinghao, Ouyang, Hao, Wang, Qiuyu, Yang, Ceyuan, Wang, Xuan, Wetzstein, Gordon, Shen, Yujun, and Chen, Qifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work presents 3DPE, a practical method that can efficiently edit a face image following given prompts, like reference images or text descriptions, in a 3D-aware manner. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and superior editing capability, respectively. Such a design brings two compelling advantages over existing approaches. First, our method achieves real-time editing with a feedforward network (i.e., ~0.04s per image), over 100x faster than the second competitor. Second, thanks to the powerful priors, our module could focus on the learning of editing-related variations, such that it manages to handle various types of editing simultaneously in the training phase and further supports fast adaptation to user-specified customized types of editing during inference (e.g., with ~5min fine-tuning per style)., Comment: ECCV 2024 camera-ready version. Project page: https://github.com/EzioBy/3dpe
Published: 2024

6. SceneWiz3D: Towards Text-guided 3D Scene Composition

Author: Zhang, Qihang, Wang, Chaoyang, Siarohin, Aliaksandr, Zhuang, Peiye, Xu, Yinghao, Yang, Ceyuan, Lin, Dahua, Zhou, Bolei, Tulyakov, Sergey, and Lee, Hsin-Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We are witnessing significant breakthroughs in the technology for generating 3D objects from text. Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets. Generating entire scenes, however, remains very challenging as a scene contains multiple 3D objects, diverse and scattered. In this work, we introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text. We marry the locality of objects with globality of scenes by introducing a hybrid 3D representation: explicit for objects and implicit for scenes. Remarkably, an object, being represented explicitly, can be either generated from text using conventional text-to-3D approaches, or provided by users. To configure the layout of the scene and automatically place objects, we apply the Particle Swarm Optimization technique during the optimization process. Furthermore, it is difficult for certain parts of the scene (e.g., corners, occlusion) to receive multi-view supervision, leading to inferior geometry. We incorporate an RGBD panorama diffusion model to mitigate it, resulting in high-quality geometry. Extensive evaluation supports that our approach achieves superior quality over previous approaches, enabling the generation of detailed and view-consistent 3D scenes., Comment: Project page: https://zqh0253.github.io/SceneWiz3D/
Published: 2023

7. GenDeF: Learning Generative Deformation Field for Video Generation

Author: Wang, Wen, Zheng, Kecheng, Wang, Qiuyu, Chen, Hao, Shi, Zifan, Yang, Ceyuan, Shen, Yujun, and Shen, Chunhua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We offer a new perspective on approaching the task of video generation. Instead of directly synthesizing a sequence of frames, we propose to render a video by warping one static image with a generative deformation field (GenDeF). Such a pipeline enjoys three appealing advantages. First, we can sufficiently reuse a well-trained image generator to synthesize the static image (also called canonical image), alleviating the difficulty in producing a video and thereby resulting in better visual quality. Second, we can easily convert a deformation field to optical flows, making it possible to apply explicit structural regularizations for motion modeling, leading to temporally consistent results. Third, the disentanglement between content and motion allows users to process a synthesized video through processing its corresponding static image without any tuning, facilitating many applications like video editing, keypoint tracking, and video segmentation. Both qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of our GenDeF method., Comment: Project page: https://aim-uofa.github.io/GenDeF/
Published: 2023

8. BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation

Author: Zhang, Qihang, Xu, Yinghao, Shen, Yujun, Dai, Bo, Zhou, Bolei, and Yang, Ceyuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating large-scale 3D scenes cannot simply apply existing 3D object synthesis technique since 3D scenes usually hold complex spatial configurations and consist of a number of objects at varying scales. We thus propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view (BEV) map. Concretely, objects of synthesized 3D scenes could be easily manipulated through steering the corresponding BEV maps. Moreover, by adequately incorporating positional encoding and low-pass filters into the generator, the representation becomes equivariant to the given BEV map. Such equivariance allows us to produce large-scale, even infinite-scale, 3D scenes via synthesizing local scenes and then stitching them with smooth consistency. Extensive experiments on 3D scene datasets demonstrate the effectiveness of our approach. Our project website is at https://zqh0253.github.io/BerfScene/.
Published: 2023

9. Exploring Guided Sampling of Conditional GANs

Author: Zhang, Yifei, Xia, Mengfei, Shen, Yujun, Zhu, Jiapeng, Yang, Ceyuan, Zheng, Kecheng, Huang, Lianghua, Liu, Yu, and Cheng, Fan
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series: Goos, Gerhard (Series Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

10. Real-Time 3D-Aware Portrait Editing from a Single Image

Author: Bai, Qingyan, Shi, Zifan, Xu, Yinghao, Ouyang, Hao, Wang, Qiuyu, Yang, Ceyuan, Wang, Xuan, Wetzstein, Gordon, Shen, Yujun, and Chen, Qifeng
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series: Goos, Gerhard (Series Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

11. SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

Author: Guo, Yuwei, Yang, Ceyuan, Rao, Anyi, Agrawala, Maneesh, Lin, Dahua, and Dai, Bo
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series: Goos, Gerhard (Series Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

12. SMaRt: Improving GANs with Score Matching Regularity

Author: Xia, Mengfei, Shen, Yujun, Yang, Ceyuan, Yi, Ran, Wang, Wenping, and Liu, Yong-jin
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Generative adversarial networks (GANs) usually struggle in learning from highly diverse data, whose underlying manifold is complex. In this work, we revisit the mathematical foundations of GANs, and theoretically reveal that the native adversarial loss for GAN training is insufficient to fix the problem of subsets with positive Lebesgue measure of the generated data manifold lying out of the real data manifold. Instead, we find that score matching serves as a promising solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold. We thereby propose to improve the optimization of GANs with score matching regularity (SMaRt). Regarding the empirical evidences, we first design a toy example to show that training GANs by the aid of a ground-truth score function can help reproduce the real data distribution more accurately, and then confirm that our approach can consistently boost the synthesis performance of various state-of-the-art GANs on real-world datasets with pre-trained diffusion models acting as the approximate score function. For instance, when training Aurora on the ImageNet 64x64 dataset, we manage to improve FID from 8.87 to 7.11, on par with the performance of one-step consistency model. The source code will be made public.
Published: 2023

13. LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

Author: Wang, Yaohui, Chen, Xinyuan, Ma, Xin, Zhou, Shangchen, Huang, Ziqi, Wang, Yi, Yang, Ceyuan, He, Yinan, Yu, Jiashuo, Yang, Peiqing, Guo, Yuwei, Wu, Tianxing, Si, Chenyang, Jiang, Yuming, Chen, Cunjian, Loy, Chen Change, Dai, Bo, Lin, Dahua, Qiao, Yu, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications., Comment: Project webpage: https://vchitect.github.io/LaVie-project/
Published: 2023

14. Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

Author: Zhu, Jiapeng, Yang, Ceyuan, Zheng, Kecheng, Xu, Yinghao, Shi, Zifan, and Shen, Yujun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling from grace on the task of text-conditioned image synthesis. Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited computational resources. Inspired by such a philosophy, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to help select the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition to the final synthesis, our router adaptively makes its decision by taking into account the text-integrated global latent code. At 64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves 6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate the community for further development., Comment: Technical report
Published: 2023

15. Learning Modulated Transformation in GANs

Author: Yang, Ceyuan, Zhang, Qihang, Xu, Yinghao, Zhu, Jiapeng, Shen, Yujun, and Dai, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The success of style-based generators largely benefits from style modulation, which helps take care of the cross-instance variation within data. However, the instance-wise stochasticity is typically introduced via regular convolution, where kernels interact with features at some fixed locations, limiting its capacity for modeling geometric variation. To alleviate this problem, we equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed as modulated transformation module (MTM). This module predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations for different instances, and hence offers the model an additional degree of freedom to handle geometry deformation. Extensive experiments suggest that our approach can be faithfully generalized to various generative tasks, including image generation, 3D-aware image synthesis, and video generation, and get compatible with state-of-the-art frameworks without any hyper-parameter tuning. It is noteworthy that, towards human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation., Comment: Technical report
Published: 2023

16. Improving Out-of-Distribution Robustness of Classifiers via Generative Interpolation

Author: Bai, Haoyue, Yang, Ceyuan, Xu, Yinghao, Chan, S. -H. Gary, and Zhou, Bolei
Subjects: Computer Science - Machine Learning
Abstract: Deep neural networks achieve superior performance for learning from independent and identically distributed (i.i.d.) data. However, their performance deteriorates significantly when handling out-of-distribution (OoD) data, where the training and test are drawn from different distributions. In this paper, we explore utilizing the generative models as a data augmentation source for improving out-of-distribution robustness of neural classifiers. Specifically, we develop a simple yet effective method called Generative Interpolation to fuse generative models trained from multiple domains for synthesizing diverse OoD samples. Training a generative model directly on the source domains tends to suffer from mode collapse and sometimes amplifies the data bias. Instead, we first train a StyleGAN model on one source domain and then fine-tune it on the other domains, resulting in many correlated generators where their model parameters have the same initialization thus are aligned. We then linearly interpolate the model parameters of the generators to spawn new sets of generators. Such interpolated generators are used as an extra data augmentation source to train the classifiers. The interpolation coefficients can flexibly control the augmentation direction and strength. In addition, a style-mixing mechanism is applied to further improve the diversity of the generated OoD samples. Our experiments show that the proposed method explicitly increases the diversity of training domains and achieves consistent improvements over baselines across datasets and multiple different distribution shifts.
Published: 2023

17. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Author: Guo, Yuwei, Yang, Ceyuan, Rao, Anyi, Liang, Zhengyang, Wang, Yaohui, Qiao, Yu, Agrawala, Maneesh, Lin, Dahua, and Dai, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. However, adding motion dynamics to existing high-quality personalized T2Is and enabling them to generate animations remains an open challenge. In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning. At the core of our framework is a plug-and-play motion module that can be trained once and seamlessly integrated into any personalized T2Is originating from the same base T2I. Through our proposed training strategy, the motion module effectively learns transferable motion priors from real-world videos. Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator. We further propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns, such as different shot types, at a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA on several public representative personalized T2I models collected from the community. The results demonstrate that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity. Codes and pre-trained weights are available at https://github.com/guoyww/AnimateDiff., Comment: Codes and Supplementary Material: https://github.com/guoyww/AnimateDiff
Published: 2023

18. Revisiting the Evaluation of Image Synthesis with GANs

Author: Yang, Mengping, Yang, Ceyuan, Zhang, Yichi, Bai, Qingyan, Shen, Yujun, and Dai, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: A good metric, which promises a reliable comparison between solutions, is essential for any well-defined task. Unlike most vision tasks that have per-sample ground-truth, image synthesis tasks target generating unseen data and hence are usually evaluated through a distributional distance between one set of real samples and another set of generated samples. This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models. In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set. Extensive experiments conducted on multiple datasets and settings reveal several important findings. Firstly, a group of models that include both CNN-based and ViT-based architectures serve as reliable and robust feature extractors for measurement evaluation. Secondly, Centered Kernel Alignment (CKA) provides a better comparison across various extractors and hierarchical layers in one model. Finally, CKA is more sample-efficient and enjoys better agreement with human judgment in characterizing the similarity between two internal data correlations. These findings contribute to the development of a new measurement system, which enables a consistent and reliable re-evaluation of current state-of-the-art generative models., Comment: NeurIPS 2023 datasets and benchmarks track
Published: 2023

19. Spatial Steerability of GANs via Self-Supervision from Discriminator

Author: Wang, Jianyuan, Bhagat, Lalit, Yang, Ceyuan, Xu, Yinghao, Shen, Yujun, Li, Hongdong, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generative models have made huge progress in photorealistic image synthesis in recent years. To enable humans to steer the image generation process and customize the output, many works explore the interpretable dimensions of the latent space in GANs. Existing methods edit the attributes of the output image such as orientation or color scheme by varying the latent code along certain directions. However, these methods usually require additional human annotations for each pretrained model, and they mostly focus on editing global attributes. In this work, we propose a self-supervised approach to improve the spatial steerability of GANs without searching for steerable directions in the latent space or requiring extra annotations. Specifically, we design randomly sampled Gaussian heatmaps to be encoded into the intermediate layers of generative models as spatial inductive bias. Along with training the GAN model from scratch, these heatmaps are aligned with the emerging attention of the GAN's discriminator in a self-supervised learning manner. During inference, users can interact with the spatial heatmaps in an intuitive manner, enabling them to edit the output image by adjusting the scene layout, moving, or removing objects. Moreover, we incorporate DragGAN into our framework, which facilitates fine-grained manipulation within a reasonable time and supports a coarse-to-fine editing process. Extensive experiments show that the proposed method not only enables spatial editing over human faces, animal faces, outdoor scenes, and complicated multi-object indoor scenes but also brings improvement in synthesis quality. Code, models, and demo video are available at https://genforce.github.io/SpatialGAN/., Comment: This manuscript is a journal extension of our previous conference work (arXiv:2112.00718), submitted to TPAMI
Published: 2023

20. GH-Feat: Learning Versatile Generative Hierarchical Features from GANs

Author: Xu, Yinghao, Shen, Yujun, Zhu, Jiapeng, Yang, Ceyuan, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent years witness the tremendous success of generative adversarial networks (GANs) in synthesizing photo-realistic images. GAN generator learns to compose realistic images and reproduce the real data distribution. Through that, a hierarchical visual feature with multi-level semantics spontaneously emerges. In this work we investigate that such a generative feature learned from image synthesis exhibits great potentials in solving a wide range of computer vision tasks, including both generative ones and more importantly discriminative ones. We first train an encoder by considering the pretrained StyleGAN generator as a learned loss function. The visual features produced by our encoder, termed as Generative Hierarchical Features (GH-Feat), highly align with the layer-wise GAN representations, and hence describe the input image adequately from the reconstruction perspective. Extensive experiments support the versatile transferability of GH-Feat across a range of applications, such as image editing, image processing, image harmonization, face verification, landmark detection, layout prediction, image retrieval, etc. We further show that, through a proper spatial expansion, our developed GH-Feat can also facilitate fine-grained semantic segmentation using only a few annotations. Both qualitative and quantitative results demonstrate the appealing performance of GH-Feat., Comment: Accepted by TPAMI 2022. arXiv admin note: text overlap with arXiv:2007.10379
Published: 2023

21. LinkGAN: Linking GAN Latents to Pixels for Controllable Image Synthesis

Author: Zhu, Jiapeng, Yang, Ceyuan, Shen, Yujun, Shi, Zifan, Dai, Bo, Zhao, Deli, and Chen, Qifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work presents an easy-to-use regularizer for GAN training, which helps explicitly link some axes of the latent space to a set of pixels in the synthesized image. Establishing such a connection facilitates a more convenient local control of GAN generation, where users can alter the image content only within a spatial area simply by partially resampling the latent code. Experimental results confirm four appealing properties of our regularizer, which we call LinkGAN. (1) The latent-pixel linkage is applicable to either a fixed region (i.e., same for all instances) or a particular semantic category (i.e., varying across instances), like the sky. (2) Two or multiple regions can be independently linked to different latent axes, which further supports joint control. (3) Our regularizer can improve the spatial controllability of both 2D and 3D-aware GAN models, barely sacrificing the synthesis performance. (4) The models trained with our regularizer are compatible with GAN inversion techniques and maintain editability on real images.
Published: 2023

22. DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis

Author: Xu, Yinghao, Chai, Menglei, Shi, Zifan, Peng, Sida, Skorokhodov, Ivan, Siarohin, Aliaksandr, Yang, Ceyuan, Shen, Yujun, Lee, Hsin-Ying, Zhou, Bolei, and Tulyakov, Sergey
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing 3D-aware image synthesis approaches mainly focus on generating a single canonical object and show limited capacity in composing a complex scene containing a variety of objects. This work presents DisCoScene: a 3D-aware generative model for high-quality and controllable scene synthesis. The key ingredient of our method is a very abstract object-level representation (i.e., 3D bounding boxes without semantic annotation) as the scene layout prior, which is simple to obtain, general to describe various scene contents, and yet informative to disentangle objects and background. Moreover, it serves as an intuitive user control for scene editing. Based on such a prior, the proposed model spatially disentangles the whole scene into object-centric generative radiance fields by learning on only 2D images with the global-local discrimination. Our model obtains the generation fidelity and editing flexibility of individual objects while being able to efficiently compose objects and the background into a complete scene. We demonstrate state-of-the-art performance on many scene datasets, including the challenging Waymo outdoor dataset. Project page: https://snap-research.github.io/discoscene/, Comment: Project page: https://snap-research.github.io/discoscene/
Published: 2022

23. Towards Smooth Video Composition

Author: Zhang, Qihang, Yang, Ceyuan, Shen, Yujun, Xu, Yinghao, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video generation requires synthesizing consistent and persistent frames with dynamic content over time. This work investigates modeling the temporal relations for composing video with arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs). First, towards composing adjacent frames, we show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality. Second, by incorporating the temporal shift module (TSM), originally designed for video understanding, into the discriminator, we manage to advance the generator in synthesizing more consistent dynamics. Third, we develop a novel B-Spline based motion representation to ensure temporal smoothness to achieve infinite-length video generation. It can go beyond the frame number used in training. A low-rank temporal modulation is also proposed to alleviate repeating contents for long video generation. We evaluate our approach on various datasets and show substantial improvements over video generation baselines. Code and models will be publicly available at https://genforce.github.io/StyleSV.
Published: 2022

24. GLeaD: Improving GANs with A Generator-Leading Task

Author: Bai, Qingyan, Yang, Ceyuan, Xu, Yinghao, Liu, Xihui, Yang, Yujiu, and Shen, Yujun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Generative adversarial network (GAN) is formulated as a two-player game between a generator (G) and a discriminator (D), where D is asked to differentiate whether an image comes from real data or is produced by G. Under such a formulation, D plays as the rule maker and hence tends to dominate the competition. Towards a fairer game in GANs, we propose a new paradigm for adversarial training, which makes G assign a task to D as well. Specifically, given an image, we expect D to extract representative features that can be adequately decoded by G to reconstruct the input. That way, instead of learning freely, D is urged to align with the view of G for domain classification. Experimental results on various datasets demonstrate the substantial superiority of our approach over the baselines. For instance, we improve the FID of StyleGAN2 from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church. We believe that the pioneering attempt present in this work could inspire the community with better designed generator-leading tasks for GAN improvement., Comment: CVPR2023. Project page: https://ezioby.github.io/glead/ Code: https://github.com/EzioBy/glead/
Published: 2022

25. Improving GANs with A Dynamic Discriminator

Author: Yang, Ceyuan, Shen, Yujun, Xu, Yinghao, Zhao, Deli, Dai, Bo, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Discriminator plays a vital role in training generative adversarial networks (GANs) via distinguishing real and synthesized samples. While the real data distribution remains the same, the synthesis distribution keeps varying because of the evolving generator, and thus effects a corresponding change to the bi-classification task for the discriminator. We argue that a discriminator with an on-the-fly adjustment on its capacity can better accommodate such a time-varying task. A comprehensive empirical study confirms that the proposed training strategy, termed as DynamicD, improves the synthesis performance without incurring any additional computation cost or training objectives. Two capacity adjusting schemes are developed for training GANs under different data regimes: i) given a sufficient amount of training data, the discriminator benefits from a progressively increased learning capacity, and ii) when the training data is limited, gradually decreasing the layer width mitigates the over-fitting issue of the discriminator. Experiments on both 2D and 3D-aware image synthesis tasks conducted on a range of datasets substantiate the generalizability of our DynamicD as well as its substantial improvement over the baselines. Furthermore, DynamicD is synergistic to other discriminator-improving approaches (including data augmentation, regularizers, and pre-training), and brings continuous performance gain when combined for learning GANs., Comment: To appear in NeurIPS 2022
Published: 2022

26. Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation

Author: Jiang, Zhengkai, Li, Yuxi, Yang, Ceyuan, Gao, Peng, Wang, Yabiao, Tai, Ying, and Wang, Chengjie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Unsupervised Domain Adaptation (UDA) aims to adapt the model trained on the labeled source domain to an unlabeled target domain. In this paper, we present Prototypical Contrast Adaptation (ProCA), a simple and efficient contrastive learning method for unsupervised domain adaptive semantic segmentation. Previous domain adaptation methods merely consider the alignment of the intra-class representational distributions across various domains, while the inter-class structural relationship is insufficiently explored, resulting in the aligned representations on the target domain might not be as easily discriminated as done on the source domain anymore. Instead, ProCA incorporates inter-class information into class-wise prototypes, and adopts the class-centered distribution alignment for adaptation. By considering the same class prototypes as positives and other class prototypes as negatives to achieve class-centered distribution alignment, ProCA achieves state-of-the-art performance on classical domain adaptation tasks, i.e., GTA5 → Cityscapes and SYNTHIA → Cityscapes. Code is available at https://github.com/jiangzhengkai/ProCA
Published: 2022

27. Accelerating Diffusion Models via Early Stop of the Diffusion Process

Author: Lyu, Zhaoyang, Xu, Xudong, Yang, Ceyuan, Lin, Dahua, and Dai, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have achieved impressive performance on various generation tasks. By modeling the reverse process of gradually diffusing the data distribution into a Gaussian distribution, generating a sample in DDPMs can be regarded as iteratively denoising a randomly sampled Gaussian noise. However, in practice DDPMs often need hundreds or even thousands of denoising steps to obtain a high-quality sample from the Gaussian noise, leading to extremely low inference efficiency. In this work, we propose a principled acceleration strategy, referred to as Early-Stopped DDPM (ES-DDPM), for DDPMs. The key idea is to stop the diffusion process early where only the few initial diffusing steps are considered and the reverse denoising process starts from a non-Gaussian distribution. By further adopting a powerful pre-trained generative model, such as GAN and VAE, in ES-DDPM, sampling from the target non-Gaussian distribution can be efficiently achieved by diffusing samples obtained from the pre-trained generative model. In this way, the number of required denoising steps is significantly reduced. In the meantime, the sample quality of ES-DDPM also improves substantially, outperforming both the vanilla DDPM and the adopted pre-trained generative model. On extensive experiments across CIFAR-10, CelebA, ImageNet, LSUN-Bedroom and LSUN-Cat, ES-DDPM obtains promising acceleration effect and performance improvement over representative baseline methods. Moreover, ES-DDPM also demonstrates several attractive properties, including being orthogonal to existing acceleration methods, as well as simultaneously enabling both global semantic and local pixel-level control in image generation., Comment: Code is released at https://github.com/ZhaoyangLyu/Early_Stopped_DDPM
Published: 2022

28. 3D-aware Image Synthesis via Learning Structural and Textural Representations

Author: Xu, Yinghao, Peng, Sida, Yang, Ceyuan, Shen, Yujun, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Making generative models 3D-aware bridges the 2D image space and the 3D physical world yet remains challenging. Recent attempts equip a Generative Adversarial Network (GAN) with a Neural Radiance Field (NeRF), which maps 3D coordinates to pixel values, as a 3D prior. However, the implicit function in NeRF has a very local receptive field, making the generator hard to become aware of the global structure. Meanwhile, NeRF is built on volume rendering which can be too costly to produce high-resolution results, increasing the optimization difficulty. To alleviate these two problems, we propose a novel framework, termed as VolumeGAN, for high-fidelity 3D-aware image synthesis, through explicitly learning a structural representation and a textural representation. We first learn a feature volume to represent the underlying structure, which is then converted to a feature field using a NeRF-like model. The feature field is further accumulated into a 2D feature map as the textural representation, followed by a neural renderer for appearance synthesis. Such a design enables independent control of the shape and the appearance. Extensive experiments on a wide range of datasets show that our approach achieves sufficiently higher image quality and better 3D control than the previous methods., Comment: CVPR 2022 camera-ready, Project page: https://genforce.github.io/volumegan/
Published: 2021

29. Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition

Author: Xu, Yinghao, Wei, Fangyun, Sun, Xiao, Yang, Ceyuan, Shen, Yujun, Dai, Bo, Zhou, Bolei, and Lin, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Semi-supervised action recognition is a challenging but important task due to the high cost of data annotation. A common approach to this problem is to assign unlabeled data with pseudo-labels, which are then used as additional supervision in training. Typically in recent work, the pseudo-labels are obtained by training a model on the labeled data, and then using confident predictions from the model to teach itself. In this work, we propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL). Concretely, we introduce a lightweight auxiliary network in addition to the primary backbone, and ask them to predict pseudo-labels for each other. We observe that, due to their different structural biases, these two models tend to learn complementary representations from the same video clips. Each model can thus benefit from its counterpart by utilizing cross-model predictions as supervision. Experiments on different data partition protocols demonstrate the significant improvement of our framework over existing alternatives. For example, CMPL achieves 17.6% and 25.1% Top-1 accuracy on Kinetics-400 and UCF-101 using only the RGB modality and 1% labeled data, outperforming our baseline model, FixMatch, by 9.0% and 10.3%, respectively., Comment: CVPR 2022 camera-ready, Project webpage: https://justimyhxu.github.io/projects/cmpl/
Published: 2021

30. Improving GAN Equilibrium by Raising Spatial Awareness

Author: Wang, Jianyuan, Yang, Ceyuan, Xu, Yinghao, Shen, Yujun, Li, Hongdong, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The success of Generative Adversarial Networks (GANs) is largely built upon the adversarial training between a generator (G) and a discriminator (D). They are expected to reach a certain equilibrium where D cannot distinguish the generated images from the real ones. However, such an equilibrium is rarely achieved in practical GAN training, instead, D almost always surpasses G. We attribute one of its sources to the information asymmetry between D and G. We observe that D learns its own visual attention when determining whether an image is real or fake, but G has no explicit clue on which regions to focus on for a particular synthesis. To alleviate the issue of D dominating the competition in GANs, we aim to raise the spatial awareness of G. Randomly sampled multi-level heatmaps are encoded into the intermediate layers of G as an inductive bias. Thus G can purposefully improve the synthesis of certain image regions. We further propose to align the spatial awareness of G with the attention map induced from D. Through this way we effectively lessen the information gap between D and G. Extensive results show that our method pushes the two-player game in GANs closer to the equilibrium, leading to a better synthesis performance. As a byproduct, the introduced spatial awareness facilitates interactive editing over the output synthesis. Demo video and code are available at https://genforce.github.io/eqgan-sa/.
Published: 2021

31. One-Shot Generative Domain Adaptation

Author: Yang, Ceyuan, Shen, Yujun, Zhang, Zhiyi, Xu, Yinghao, Zhu, Jiapeng, Wu, Zhirong, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work aims at transferring a Generative Adversarial Network (GAN) pre-trained on one image domain to a new domain referring to as few as just one target image. The main challenge is that, under limited supervision, it is extremely difficult to synthesize photo-realistic and highly diverse images, while acquiring representative characters of the target. Different from existing approaches that adopt the vanilla fine-tuning strategy, we import two lightweight modules to the generator and the discriminator respectively. Concretely, we introduce an attribute adaptor into the generator yet freeze its original parameters, through which it can reuse the prior knowledge to the most extent and hence maintain the synthesis quality and diversity. We then equip the well-learned discriminator backbone with an attribute classifier to ensure that the generator captures the appropriate characters from the reference. Furthermore, considering the poor diversity of the training data (i.e., as few as only one image), we propose to also constrain the diversity of the generative domain in the training process, alleviating the optimization difficulty. Our approach brings appealing results under various settings, substantially surpassing state-of-the-art alternatives, especially in terms of synthesis diversity. Noticeably, our method works well even with large domain gaps, and robustly converges within a few minutes for each experiment., Comment: Technical Report
Published: 2021

32. Data-Efficient Instance Generation from Instance Discrimination

Author: Yang, Ceyuan, Shen, Yujun, Xu, Yinghao, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generative Adversarial Networks (GANs) have significantly advanced image synthesis, however, the synthesis quality drops significantly given a limited amount of training data. To improve the data efficiency of GAN training, prior work typically employs data augmentation to mitigate the overfitting of the discriminator yet still learn the discriminator with a bi-classification (i.e., real vs. fake) task. In this work, we propose a data-efficient Instance Generation (InsGen) method based on instance discrimination. Concretely, besides differentiating the real domain from the fake domain, the discriminator is required to distinguish every individual image, no matter it comes from the training set or from the generator. In this way, the discriminator can benefit from the infinite synthesized samples for training, alleviating the overfitting problem caused by insufficient training data. A noise perturbation strategy is further introduced to improve its discriminative power. Meanwhile, the learned instance discrimination capability from the discriminator is in turn exploited to encourage the generator for diverse generation. Extensive experiments demonstrate the effectiveness of our method on a variety of datasets and training settings. Noticeably, on the setting of 2K training images from the FFHQ dataset, we outperform the state-of-the-art approach with 23.5% FID improvement., Comment: Technical report
Published: 2021

33. Instance Localization for Self-supervised Detection Pretraining

Author: Yang, Ceyuan, Wu, Zhirong, Zhou, Bolei, and Lin, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Prior research on self-supervised learning has led to considerable progress on image classification, but often with degraded transfer performance on object detection. The objective of this paper is to advance self-supervised pretrained models specifically for object detection. Based on the inherent difference between classification and detection, we propose a new self-supervised pretext task, called instance localization. Image instances are pasted at various locations and scales onto background images. The pretext task is to predict the instance category given the composited images as well as the foreground bounding boxes. We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning. In addition, we propose an augmentation method on the bounding boxes to further enhance the feature alignment. As a result, our model becomes weaker at Imagenet semantic classification but stronger at image patch localization, with an overall stronger pretrained model for object detection. Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection on PASCAL VOC and MSCOCO., Comment: To appear in CVPR 2021. Code is available at https://github.com/limbo0000/InstanceLoc
Published: 2021

34. Generative Hierarchical Features from Synthesizing Images

Author: Xu, Yinghao, Shen, Yujun, Zhu, Jiapeng, Yang, Ceyuan, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generative Adversarial Networks (GANs) have recently advanced image synthesis by learning the underlying distribution of the observed data. However, how the features learned from solving the task of image generation are applicable to other vision tasks remains seldom explored. In this work, we show that learning to synthesize images can bring remarkable hierarchical visual features that are generalizable across a wide range of applications. Specifically, we consider the pre-trained StyleGAN generator as a learned loss function and utilize its layer-wise representation to train a novel hierarchical encoder. The visual feature produced by our encoder, termed as Generative Hierarchical Feature (GH-Feat), has strong transferability to both generative and discriminative tasks, including image editing, image harmonization, image classification, face verification, landmark detection, and layout prediction. Extensive qualitative and quantitative experimental results demonstrate the appealing performance of GH-Feat., Comment: CVPR 2021 camera-ready
Published: 2020

35. Unsupervised Landmark Learning from Unpaired Data

Author: Xu, Yinghao, Yang, Ceyuan, Liu, Ziwei, Dai, Bo, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent attempts for unsupervised landmark learning leverage synthesized image pairs that are similar in appearance but different in poses. These methods learn landmarks by encouraging the consistency between the original images and the images reconstructed from swapped appearances and poses. While synthesized image pairs are created by applying pre-defined transformations, they cannot fully reflect the real variances in both appearances and poses. In this paper, we aim to open the possibility of learning landmarks on unpaired data (i.e., unaligned image pairs) sampled from a natural image collection, so that they can be different in both appearances and poses. To this end, we propose a cross-image cycle consistency framework (C³) which applies the swapping-reconstruction strategy twice to obtain the final supervision. Moreover, a cross-image flow module is further introduced to impose the equivariance between estimated landmarks across images. Through comprehensive experiments, our proposed framework is shown to outperform strong baselines by a large margin. Besides quantitative results, we also provide visualization and interpretation on our learned models, which not only verifies the effectiveness of the learned landmarks, but also leads to important insights that are beneficial for future research.
Published: 2020

36. Video Representation Learning with Visual Tempo Consistency

Author: Yang, Ceyuan, Xu, Yinghao, Dai, Bo, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates respectively, we can obtain slow and fast video frames which share the same semantics but contain different visual tempos. Video representations learned from VTHCL achieve the competitive performances under the self-supervision evaluation protocol for action recognition on UCF-101 (82.1%) and HMDB-51 (49.2%). Moreover, comprehensive experiments suggest that the learned representations are generalized well to other downstream tasks including action detection on AVA and action anticipation on Epic-Kitchen. Finally, we propose Instance Correspondence Map (ICM) to visualize the shared semantics captured by contrastive learning., Comment: Technical report. Models are available at https://github.com/decisionforce/VTHCL
Published: 2020

37. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs

Author: Shen, Yujun, Yang, Ceyuan, Tang, Xiaoou, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Although Generative Adversarial Networks (GANs) have made significant progress in face synthesis, there lacks enough understanding of what GANs have learned in the latent representation to map a random code to a photo-realistic image. In this work, we propose a framework called InterFaceGAN to interpret the disentangled face representation learned by the state-of-the-art GAN models and study the properties of the facial semantics encoded in the latent space. We first find that GANs learn various semantics in some linear subspaces of the latent space. After identifying these subspaces, we can realistically manipulate the corresponding facial attributes without retraining the model. We then conduct a detailed study on the correlation between different semantics and manage to better disentangle them via subspace projection, resulting in more precise control of the attribute manipulation. Besides manipulating the gender, age, expression, and presence of eyeglasses, we can even alter the face pose and fix the artifacts accidentally made by GANs. Furthermore, we perform an in-depth face identity analysis and a layer-wise analysis to evaluate the editing results quantitatively. Finally, we apply our approach to real face editing by employing GAN inversion approaches and explicitly training feed-forward models based on the synthetic data established by InterFaceGAN. Extensive experimental results suggest that learning to synthesize faces spontaneously brings a disentangled and controllable face representation., Comment: Accepted by TPAMI 2020
Published: 2020

38. Temporal Pyramid Network for Action Recognition

Author: Yang, Ceyuan, Xu, Yinghao, Shi, Jianping, Dai, Bo, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling such visual tempos of different actions facilitates their recognition. Previous works often capture the visual tempo through sampling raw videos at multiple rates and constructing an input-level frame pyramid, which usually requires a costly multi-branch network to handle. In this work we propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. Two essential components of TPN, the source of features and the fusion of features, form a feature hierarchy for the backbone so that it can capture action instances at various tempos. TPN also shows consistent improvements over other challenging baselines on several action recognition datasets. Specifically, when equipped with TPN, the 3D ResNet-50 with dense sampling obtains a 2% gain on the validation set of Kinetics-400. A further analysis also reveals that TPN gains most of its improvements on action classes that have large variances in their visual tempos, validating the effectiveness of TPN., Comment: To appear in CVPR 2020. Code is available at https://github.com/decisionforce/TPN
Published: 2020

39. Semantic Hierarchy Emerges in Deep Generative Representations for Scene Synthesis

Author: Yang, Ceyuan, Shen, Yujun, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: Despite the success of Generative Adversarial Networks (GANs) in image synthesis, there lacks enough understanding on what generative models have learned inside the deep generative representations and how photo-realistic images are able to be composed of the layer-wise stochasticity introduced in recent GANs. In this work, we show that highly-structured semantic hierarchy emerges as variation factors from synthesizing scenes from the generative representations in state-of-the-art GAN models, like StyleGAN and BigGAN. By probing the layer-wise representations with a broad set of semantics at different abstraction levels, we are able to quantify the causality between the activations and semantics occurring in the output image. Such a quantification identifies the human-understandable variation factors learned by GANs to compose scenes. The qualitative and quantitative results further suggest that the generative representations learned by the GANs with layer-wise latent codes are specialized to synthesize different hierarchical semantics: the early layers tend to determine the spatial layout and configuration, the middle layers control the categorical objects, and the later layers finally render the scene attributes as well as color scheme. Identifying such a set of manipulatable latent variation factors facilitates semantic scene manipulation., Comment: 15 pages, 20 figures
Published: 2019

40. Learning Where to Focus for Efficient Video Object Detection

Author: Jiang, Zhengkai, Liu, Yu, Yang, Ceyuan, Liu, Jihao, Gao, Peng, Zhang, Qian, Xiang, Shiming, and Pan, Chunhong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Transferring existing image-based detectors to the video is non-trivial since the quality of frames is always deteriorated by part occlusion, rare pose, and motion blur. Previous approaches exploit to propagate and aggregate features across video frames by using optical flow-warping. However, directly applying image-level optical flow onto the high-level features might not establish accurate spatial correspondences. Therefore, a novel module called Learnable Spatio-Temporal Sampling (LSTS) has been proposed to learn semantic-level correspondences among adjacent frame features accurately. The sampled locations are first randomly initialized, then updated iteratively to find better spatial correspondences guided by detection supervision progressively. Besides, Sparsely Recursive Feature Updating (SRFU) module and Dense Feature Aggregation (DFA) module are also introduced to model temporal relations and enhance per-frame features, respectively. Without bells and whistles, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset with less computational complexity and real-time speed. Code will be made available at https://github.com/jiangzhengkai/LSTS., Comment: Accepted to ECCV 2020
Published: 2019

41. Penalizing Top Performers: Conservative Loss for Semantic Segmentation Adaptation

Author: Zhu, Xinge, Zhou, Hui, Yang, Ceyuan, Shi, Jianping, and Lin, Dahua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Due to the expensive and time-consuming annotations (e.g., segmentation) for real-world images, recent works in computer vision resort to synthetic data. However, the performance on the real image often drops significantly because of the domain shift between the synthetic data and the real images. In this setting, domain adaptation brings an appealing option. The effective approaches of domain adaptation shape the representations that (1) are discriminative for the main task and (2) have good generalization capability for domain shift. To this end, we propose a novel loss function, i.e., Conservative Loss, which penalizes the extreme good and bad cases while encouraging the moderate examples. More specifically, it enables the network to learn features that are discriminative by gradient descent and are invariant to the change of domains via gradient ascend method. Extensive experiments on synthetic to real segmentation adaptation show our proposed method achieves state of the art results. Ablation studies give more insights into properties of the Conservative Loss. Exploratory experiments and discussion demonstrate that our Conservative Loss has good flexibility rather than restricting an exact form., Comment: ECCV 2018
Published: 2018

42. Pose Guided Human Video Generation

Author: Yang, Ceyuan, Wang, Zhe, Zhu, Xinge, Huang, Chen, Shi, Jianping, and Lin, Dahua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Due to the emergence of Generative Adversarial Networks, video synthesis has witnessed exceptional breakthroughs. However, existing methods lack a proper representation to explicitly control the dynamics in videos. Human pose, on the other hand, can represent motion patterns intrinsically and interpretably, and impose the geometric constraints regardless of appearance. In this paper, we propose a pose guided method to synthesize human videos in a disentangled way: plausible motion prediction and coherent appearance generation. In the first stage, a Pose Sequence Generative Adversarial Network (PSGAN) learns in an adversarial manner to yield pose sequences conditioned on the class label. In the second stage, a Semantic Consistent Generative Adversarial Network (SCGAN) generates video frames from the poses while preserving coherent appearances in the input image. By enforcing semantic consistency between the generated and ground-truth poses at a high feature level, our SCGAN is robust to noisy or abnormal poses. Extensive experiments on both human action and human face datasets manifest the superiority of the proposed method over other state-of-the-arts., Comment: Accepted to ECCV 2018
Published: 2018

43. Learning Where to Focus for Efficient Video Object Detection

Author: Jiang, Zhengkai, Liu, Yu, Yang, Ceyuan, Liu, Jihao, Gao, Peng, Zhang, Qian, Xiang, Shiming, Pan, Chunhong, Vedaldi, Andrea, editor, Bischof, Horst, editor, Brox, Thomas, editor, and Frahm, Jan-Michael, editor
Published: 2020

44. Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation

Author: Jiang, Zhengkai, primary, Li, Yuxi, additional, Yang, Ceyuan, additional, Gao, Peng, additional, Wang, Yabiao, additional, Tai, Ying, additional, and Wang, Chengjie, additional
Published: 2022

45. Semantic Hierarchy Emerges in Deep Generative Representations for Scene Synthesis

Author: Yang, Ceyuan, Shen, Yujun, and Zhou, Bolei
Published: 2021

46. Learning Where to Focus for Efficient Video Object Detection

Author: Jiang, Zhengkai, primary, Liu, Yu, additional, Yang, Ceyuan, additional, Liu, Jihao, additional, Gao, Peng, additional, Zhang, Qian, additional, Xiang, Shiming, additional, and Pan, Chunhong, additional
Published: 2020

47. DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis

Author: Xu, Yinghao, primary, Chai, Menglei, additional, Shi, Zifan, additional, Peng, Sida, additional, Skorokhodov, Ivan, additional, Siarohin, Aliaksandr, additional, Yang, Ceyuan, additional, Shen, Yujun, additional, Lee, Hsin-Ying, additional, Zhou, Bolei, additional, and Tulyakov, Sergey, additional
Published: 2023

48. GLeaD: Improving GANs with A Generator-Leading Task

Author: Bai, Qingyan, primary, Yang, Ceyuan, additional, Xu, Yinghao, additional, Liu, Xihui, additional, Yang, Yujiu, additional, and Shen, Yujun, additional
Published: 2023

49. GH-Feat: Learning Versatile Generative Hierarchical Features From GANs

Author: Xu, Yinghao, primary, Shen, Yujun, additional, Zhu, Jiapeng, additional, Yang, Ceyuan, additional, and Zhou, Bolei, additional
Published: 2023

50. DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis

Author: Xu, Yinghao, Chai, Menglei, Shi, Zifan, Peng, Sida, Skorokhodov, Ivan, Siarohin, Aliaksandr, Yang, Ceyuan, Shen, Yujun, Lee, Hsin-Ying, Zhou, Bolei, and Tulyakov, Sergey
Abstract: Existing 3D-aware image synthesis approaches mainly focus on generating a single canonical object and show limited capacity in composing a complex scene containing a variety of objects. This work presents DisCoScene: a 3D-aware generative model for high-quality and controllable scene synthesis. The key ingredient of our method is a very abstract object-level representation (i.e., 3D bounding boxes without semantic annotation) as the scene layout prior, which is simple to obtain, general to describe various scene contents, and yet informative to disentangle objects and background. Moreover, it serves as an intuitive user control for scene editing. Based on such a prior, the proposed model spatially disentangles the whole scene into object-centric generative radiance fields by learning on only 2D images with the global-local discrimination. Our model obtains the generation fidelity and editing flexibility of individual objects while being able to efficiently compose objects and the background into a complete scene. We demonstrate state-of-the-art performance on many scene datasets, including the challenging Waymo outdoor dataset. © 2023 IEEE.
Published: 2023
