Author: "Wu, Wayne" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wu, Wayne"' showing total 350 results

1. Learning to Generate Diverse Pedestrian Movements from Web Videos with Noisy Labels

Author: Liu, Zhizheng, Lin, Joe, Wu, Wayne, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Understanding and modeling pedestrian movements in the real world is crucial for applications like motion forecasting and scene simulation. Many factors influence pedestrian movements, such as scene context, individual characteristics, and goals, which are often ignored by the existing human generation methods. Web videos contain natural pedestrian behavior and rich motion context, but annotating them with pre-trained predictors leads to noisy labels. In this work, we propose learning diverse pedestrian movements from web videos. We first curate a large-scale dataset called CityWalkers that captures diverse real-world pedestrian movements in urban scenes. Then, based on CityWalkers, we propose a generative model called PedGen for diverse pedestrian movement generation. PedGen introduces automatic label filtering to remove the low-quality labels and a mask embedding to train with partial labels. It also contains a novel context encoder that lifts the 2D scene context to 3D and can incorporate various context factors in generating realistic pedestrian movements in urban scenes. Experiments show that PedGen outperforms existing baseline methods for pedestrian movement generation by learning from noisy labels and incorporating the context factors. In addition, PedGen achieves zero-shot generalization in both real-world and simulated environments. The code, model, and data will be made publicly available at https://genforce.github.io/PedGen/., Comment: Project Page: https://genforce.github.io/PedGen/
Published: 2024

2. MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility

Author: Wu, Wayne, He, Honglin, He, Jack, Wang, Yiran, Duan, Chenda, Liu, Zhizheng, Li, Quanyi, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while robot dogs and humanoids have recently emerged in the street. Micromobility enabled by AI for short-distance travel in public urban spaces is a crucial component of the future transportation system. Ensuring the generalizability and safety of AI models maneuvering mobile machines is essential. In this work, we present MetaUrban, a compositional simulation platform for AI-driven urban micromobility research. MetaUrban can construct an infinite number of interactive urban scenes from compositional elements, covering a vast array of ground plans, object placements, pedestrians, vulnerable road users, and other mobile agents' appearances and dynamics. We design point navigation and social navigation tasks as the pilot study using MetaUrban for urban micromobility research and establish various baselines of Reinforcement Learning and Imitation Learning. We conduct extensive evaluation across mobile machines, demonstrating that heterogeneous mechanical structures significantly influence the learning and execution of AI policies. We perform a thorough ablation study, showing that the compositional nature of the simulated environments can substantially improve the generalizability and safety of the trained mobile agents. MetaUrban will be made publicly available to provide research opportunities and foster safe and trustworthy embodied AI and micromobility in cities. The code and dataset will be publicly available., Comment: Technical report. Project page: https://metadriverse.github.io/metaurban/
Published: 2024

3. CosmicMan: A Text-to-Image Foundation Model for Humans

Author: Li, Shikai, Fu, Jianglin, Liu, Kaiyuan, Wang, Wentao, Lin, Kwan-Yee, and Wu, Wayne
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images at a mean resolution of 1488x1255, accompanied by precise text annotations derived from 115 Million attributes in diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic -- easy to integrate into downstream tasks while effective in producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion models, and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem in a breeze., Comment: Accepted by CVPR 2024. The supplementary material is included. Project Page: https://cosmicman-cvpr2024.github.io
Published: 2024

4. VLG: General Video Recognition with Web Textual Knowledge

Author: Lin, Jintao, Liu, Zhaoyang, Wang, Wenhai, Wu, Wayne, and Wang, Limin
Published: 2024

5. ReliTalk: Relightable Talking Portrait Generation from a Single Video

Author: Qiu, Haonan, Chen, Zhaoxi, Jiang, Yuming, Zhou, Hang, Fan, Xiangyu, Yang, Lei, Wu, Wayne, and Liu, Ziwei
Published: 2024

6. Attention and cognitive penetration: reflections on Dustin Stokes’ Thinking and Perceiving

Author: Wu, Wayne
Published: 2024

7. PaintHuman: Towards High-fidelity Text-to-3D Human Texturing via Denoised Score Distillation

Author: Yu, Jianhui, Zhu, Hao, Jiang, Liming, Loy, Chen Change, Cai, Weidong, and Wu, Wayne
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in zero-shot text-to-3D human generation, which employ the human model prior (e.g., SMPL) or Score Distillation Sampling (SDS) with pre-trained text-to-image diffusion models, have been groundbreaking. However, SDS may provide inaccurate gradient directions under the weak diffusion guidance, as it tends to produce over-smoothed results and generate body textures that are inconsistent with the detailed mesh geometry. Therefore, directly leveraging existing strategies for high-fidelity text-to-3D human texturing is challenging. In this work, we propose a model called PaintHuman to address the challenges from two aspects. We first propose a novel score function, Denoised Score Distillation (DSD), which directly modifies the SDS by introducing negative gradient components to iteratively correct the gradient direction and generate high-quality textures. In addition, we use the depth map as geometric guidance to ensure the texture is semantically aligned to human mesh surfaces. To guarantee the quality of rendered results, we employ geometry-aware networks to predict surface materials and render realistic human textures. Extensive experiments, benchmarked against state-of-the-art methods, validate the efficacy of our approach.
Published: 2023

8. Parameterization-driven Neural Surface Reconstruction for Object-oriented Editing in Neural Rendering

Author: Xu, Baixin, Hu, Jiangbei, Hou, Fei, Lin, Kwan-Yee, Wu, Wayne, Qian, Chen, and He, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The advancements in neural rendering have increased the need for techniques that enable intuitive editing of 3D objects represented as neural implicit surfaces. This paper introduces a novel neural algorithm for parameterizing neural implicit surfaces to simple parametric domains like spheres and polycubes. Our method allows users to specify the number of cubes in the parametric domain, learning a configuration that closely resembles the target 3D object's geometry. It computes bi-directional deformation between the object and the domain using a forward mapping from the object's zero level set and an inverse deformation for backward mapping. We ensure nearly bijective mapping with a cycle loss and optimize deformation smoothness. The parameterization quality, assessed by angle and area distortions, is guaranteed using a Laplacian regularizer and an optimized learned parametric domain. Our framework integrates with existing neural rendering pipelines, using multi-view images of a single object or multiple objects of similar geometries to reconstruct 3D geometry and compute texture maps automatically, eliminating the need for any prior information. We demonstrate the method's effectiveness on images of human heads and man-made objects., Comment: ECCV24, see https://xubaixinxbx.github.io/neuparam
Published: 2023

9. OrthoPlanes: A Novel Representation for Better 3D-Awareness of GANs

Author: He, Honglin, Yang, Zhuoqian, Li, Shikai, Dai, Bo, and Wu, Wayne
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We present a new method for generating realistic and view-consistent images with fine geometry from 2D image collections. Our method proposes a hybrid explicit-implicit representation called OrthoPlanes, which encodes fine-grained 3D information in feature maps that can be efficiently generated by modifying 2D StyleGANs. Compared to previous representations, our method has better scalability and expressiveness with clear and explicit information. As a result, our method can handle more challenging view-angles and synthesize articulated objects with high spatial degree of freedom. Experiments demonstrate that our method achieves state-of-the-art results on FFHQ and SHHQ datasets, both quantitatively and qualitatively. Project page: https://orthoplanes.github.io/.
Published: 2023

10. UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation

Author: Fu, Jianglin, Li, Shikai, Jiang, Yuming, Lin, Kwan-Yee, Wu, Wayne, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Human generation has achieved significant progress. Nonetheless, existing methods still struggle to synthesize specific regions such as faces and hands. We argue that the main reason is rooted in the training data. A holistic human dataset inevitably has insufficient and low-resolution information on local parts. Therefore, we propose to use multi-source datasets with various resolution images to jointly learn a high-resolution human generative model. However, multi-source data inherently a) contains different parts that do not spatially align into a coherent human, and b) comes with different scales. To tackle these challenges, we propose an end-to-end framework, UnitedHuman, that empowers continuous GAN with the ability to effectively utilize multi-source data for high-resolution human generation. Specifically, 1) we design a Multi-Source Spatial Transformer that spatially aligns multi-source images to full-body space with a human parametric model. 2) Next, a continuous GAN is proposed with global-structural guidance and CutMix consistency. Patches from different datasets are then sampled and transformed to supervise the training of this scale-invariant generative model. Extensive experiments demonstrate that our model jointly learned from multi-source data achieves higher quality than models learned from a holistic dataset., Comment: Accepted by ICCV2023. Project page: https://unitedhuman.github.io/ Github: https://github.com/UnitedHuman/UnitedHuman
Published: 2023

11. Innovative Digital Storytelling with AIGC: Exploration and Discussion of Recent Advances

Author: Gu, Rongzhang, Li, Hui, Su, Changyue, and Wu, Wayne
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Multimedia
Abstract: Digital storytelling, as an art form, has struggled with cost-quality balance. The emergence of AI-generated Content (AIGC) is considered as a potential solution for efficient digital storytelling production. However, the specific form, effects, and impacts of this fusion remain unclear, leaving the boundaries of AIGC combined with storytelling undefined. This work explores the current integration state of AIGC and digital storytelling, investigates the artistic value of their fusion in a sample project, and addresses common issues through interviews. Through our study, we conclude that AIGC, while proficient in image creation, voiceover production, and music composition, falls short of replacing humans due to the irreplaceable elements of human creativity and aesthetic sensibilities at present, especially in complex character animations, facial expressions, and sound effects. The research objective is to increase public awareness of the current state, limitations, and challenges arising from combining AIGC and digital storytelling., Comment: Project page: https://lsgm-demo.github.io/Leveraging-recent-advances-of-foundation-models-for-story-telling/
Published: 2023

12. ReliTalk: Relightable Talking Portrait Generation from a Single Video

Author: Qiu, Haonan, Chen, Zhaoxi, Jiang, Yuming, Zhou, Hang, Fan, Xiangyu, Yang, Lei, Wu, Wayne, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Recent years have witnessed great progress in creating vivid audio-driven portraits from monocular videos. However, how to seamlessly adapt the created video avatars to other scenarios with different backgrounds and lighting conditions remains unsolved. On the other hand, existing relighting studies mostly rely on dynamically lighted or multi-view data, which are too expensive for creating video portraits. To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images. Specifically, we involve 3D facial priors derived from audio features to predict delicate normal maps through implicit functions. These initially predicted normals then take a crucial part in reflectance decomposition by dynamically estimating the lighting condition of the given video. Moreover, the stereoscopic face representation is refined using the identity-consistent loss under simulated multiple lighting conditions, addressing the ill-posed problem caused by limited views available from a single monocular video. Extensive experiments validate the superiority of our proposed framework on both real and synthetic datasets. Our code is released in https://github.com/arthur-qiu/ReliTalk.
Published: 2023

13. Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

Author: Song, Linsen, Wu, Wayne, Fu, Chaoyou, Loy, Chen Change, and He, Ran
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. There are two unique challenges to design a method for UGC: 1) the appearances of speakers are diverse and arbitrary as the method needs to generalize across users; 2) the available video data of one speaker are very limited. In order to tackle the above challenges, we first introduce a new Style Translation Network to integrate the speaking style of the target and the speaking content of the source via a cross-modal AdaIN module. It enables our model to quickly adapt to a new speaker. Then, we further develop a semi-parametric video renderer, which takes full advantage of the limited training data of the unseen speaker via a video-level retrieve-warp-refine pipeline. Finally, we propose a temporal regularization for the semi-parametric renderer, generating more continuous videos. Extensive experiments show that our method generates videos that accurately preserve various speaking styles, yet with considerably lower amount of training data and training time in comparison to existing methods. Besides, our method achieves a faster testing speed than most recent methods., Comment: TCSVT 2022
Published: 2023

14. Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis

Author: Wang, Yuxin, Wu, Wayne, and Xu, Dan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Implicit neural representations have shown powerful capacity in modeling real-world 3D scenes, offering superior performance in novel view synthesis. In this paper, we target a more challenging scenario, i.e., joint scene novel view synthesis and editing based on implicit neural scene representations. State-of-the-art methods in this direction typically consider building separate networks for these two tasks (i.e., view synthesis and editing). Thus, the modeling of interactions and correlations between these two tasks is very limited, which, however, is critical for learning high-quality scene representations. To tackle this problem, in this paper, we propose a unified Neural Radiance Field (NeRF) framework to effectively perform joint scene decomposition and composition for modeling real-world scenes. The decomposition aims at learning disentangled 3D representations of different objects and the background, allowing for scene editing, while scene composition models an entire scene representation for novel view synthesis. Specifically, with a two-stage NeRF framework, we learn a coarse stage for predicting a global radiance field as guidance for point sampling, and in the second fine-grained stage, we perform scene decomposition by a novel one-hot object radiance field regularization module and a pseudo supervision via inpainting to handle ambiguous background regions occluded by objects. The decomposed object-level radiance fields are further composed by using activations from the decomposition module. Extensive quantitative and qualitative results show the effectiveness of our method for scene decomposition and composition, outperforming state-of-the-art methods for both novel-view synthesis and editing tasks., Comment: ICCV2023, Project Page: https://w-ted.github.io/publications/udc-nerf
Published: 2023

15. DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering

Author: Cheng, Wei, Chen, Ruixiang, Yin, Wanqi, Fan, Siming, Chen, Keyu, He, Honglin, Luo, Huiwen, Cai, Zhongang, Wang, Jingbo, Gao, Yang, Yu, Zhengming, Lin, Zhengyu, Ren, Daxuan, Yang, Lei, Liu, Ziwei, Loy, Chen Change, Qian, Chen, Wu, Wayne, Lin, Dahua, Dai, Bo, and Lin, Kwan-Yee
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Realistic human-centric rendering plays a key role in both computer vision and computer graphics. Rapid progress has been made in the algorithm aspect over the years, yet existing human-centric rendering datasets and benchmarks are rather impoverished in terms of diversity, which is crucial for rendering effect. Researchers are usually constrained to explore and evaluate a small set of rendering problems on current datasets, while real-world applications require methods to be robust across different scenarios. In this work, we present DNA-Rendering, a large-scale, high-fidelity repository of human performance data for neural actor rendering. DNA-Rendering presents several alluring attributes. First, our dataset contains over 1500 human subjects, 5000 motion sequences, and 67.5M frames of data. Second, we provide rich assets for each subject -- 2D/3D human body keypoints, foreground masks, SMPLX models, cloth/accessory materials, multi-view images, and videos. These assets boost the current method's accuracy on downstream rendering tasks. Third, we construct a professional multi-view system to capture data, which contains 60 synchronous cameras with max 4096 x 3000 resolution, 15 fps speed, and strict camera calibration steps, ensuring high-quality resources for task training and evaluation. Along with the dataset, we provide a large-scale and quantitative benchmark in full-scale, with multiple tasks to evaluate the existing progress of novel view synthesis, novel pose animation synthesis, and novel identity rendering methods. In this manuscript, we describe our DNA-Rendering effort as a revealing of new observations, challenges, and future directions to human-centric rendering. The dataset, code, and benchmarks will be publicly available at https://dna-rendering.github.io/, Comment: This paper is accepted by ICCV2023. Project page: https://dna-rendering.github.io/
Published: 2023

16. RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

Author: Pan, Dongwei, Zhuo, Long, Piao, Jingtan, Luo, Huiwen, Cheng, Wei, Wang, Yuxin, Fan, Siming, Liu, Shengqi, Yang, Lei, Dai, Bo, Liu, Ziwei, Loy, Chen Change, Qian, Chen, Wu, Wayne, Lin, Dahua, and Lin, Kwan-Yee
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Synthesizing high-fidelity head avatars is a central problem for computer vision and graphics. While head avatar synthesis algorithms have advanced rapidly, the best ones still face great obstacles in real-world scenarios. One of the vital causes is inadequate datasets -- 1) current public datasets can only support researchers to explore high-fidelity head avatars in one or two task directions; 2) these datasets usually contain digital head assets with limited data volume, and narrow distribution over different attributes. In this paper, we present RenderMe-360, a comprehensive 4D human head dataset to drive advances in head avatar research. It contains massive data assets, with 243+ million complete head frames, and over 800k video sequences from 500 different identities captured by synchronized multi-view cameras at 30 FPS. It is a large-scale digital library for head avatars with three key attributes: 1) High Fidelity: all subjects are captured by 60 synchronized, high-resolution 2K cameras in 360 degrees. 2) High Diversity: The collected subjects vary across ages, eras, ethnicities, and cultures, providing abundant materials with distinctive styles in appearance and geometry. Moreover, each subject is asked to perform various motions, such as expressions and head rotations, which further extend the richness of assets. 3) Rich Annotations: we provide annotations with different granularities: cameras' parameters, matting, scan, 2D/3D facial landmarks, FLAME fitting, and text description. Based on the dataset, we build a comprehensive benchmark for head avatar research, with 16 state-of-the-art methods performed on five main tasks: novel view synthesis, novel expression synthesis, hair rendering, hair editing, and talking head generation. Our experiments uncover the strengths and weaknesses of current methods. RenderMe-360 opens the door for future exploration in head avatars., Comment: Technical Report; Project Page: 36; Github Link: https://github.com/RenderMe-360/RenderMe-360
Published: 2023

17. HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks

Author: Chen, Zhuo, Xu, Xudong, Yan, Yichao, Pan, Ye, Zhu, Wenhan, Wu, Wayne, Dai, Bo, and Yang, Xiaokang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Portrait stylization is a long-standing task enabling extensive applications. Although 2D-based methods have made great progress in recent years, real-world applications such as metaverse and games often demand 3D content. On the other hand, the requirement of 3D data, which is costly to acquire, significantly impedes the development of 3D portrait stylization methods. In this paper, inspired by the success of 3D-aware GANs that bridge 2D and 3D domains with 3D fields as the intermediate representation for rendering 2D images, we propose a novel method, dubbed HyperStyle3D, based on 3D-aware GANs for 3D portrait stylization. At the core of our method is a hyper-network learned to manipulate the parameters of the generator in a single forward pass. It not only offers a strong capacity to handle multiple styles with a single model, but also enables flexible fine-grained stylization that affects only texture, shape, or local part of the portrait. While the use of 3D-aware GANs bypasses the requirement of 3D data, we further alleviate the necessity of style images with the CLIP model being the stylization guidance. We conduct an extensive set of experiments across the style, attribute, and shape, and meanwhile, measure the 3D consistency. These experiments demonstrate the superior capability of our HyperStyle3D model in rendering 3D-consistent images in diverse styles, deforming the face shape, and editing various attributes.
Published: 2023

18. Text2Performer: Text-Driven Human Video Generation

Author: Jiang, Yuming, Yang, Shuai, Koh, Tong Liang, Wu, Wayne, Loy, Chen Change, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-driven content creation has evolved to be a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation requires maintaining the appearance of synthesized human while performing complex motions. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts. Text2Performer has two novel designs: 1) decomposed human representation and 2) diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representation in an unsupervised manner by utilizing the nature of human videos. In this way, the appearance is well maintained along the generated frames. Then, we propose continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in the discrete space, continuous VQ-diffuser directly outputs the continuous pose embeddings for better motion modeling. Finally, motion-aware masking strategy is designed to mask the pose embeddings spatial-temporally to enhance the temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute a Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512x256 resolution) with diverse appearances and flexible motions., Comment: Project Page: https://yumingj.github.io/projects/Text2Performer.html, Github: https://github.com/yumingj/Text2Performer
Published: 2023

19. MonoHuman: Animatable Human Neural Field from Monocular Video

Author: Yu, Zhengming, Cheng, Wei, Liu, Xian, Wu, Wayne, and Lin, Kwan-Yee
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Animating virtual avatars with free-view control is crucial for various applications like virtual reality and digital entertainment. Previous studies have attempted to utilize the representation power of the neural radiance field (NeRF) to reconstruct the human body from monocular videos. Recent works propose to graft a deformation network into the NeRF to further model the dynamics of the human neural field for animating vivid human motions. However, such pipelines either rely on pose-dependent representations or fall short of motion coherency due to frame-independent optimization, making it difficult to generalize to unseen pose sequences realistically. In this paper, we propose a novel framework MonoHuman, which robustly renders view-consistent and high-fidelity avatars under arbitrary novel poses. Our key insight is to model the deformation field with bi-directional constraints and explicitly leverage the off-the-peg keyframe information to reason the feature correlations for coherent results. Specifically, we first propose a Shared Bidirectional Deformation module, which creates a pose-independent generalizable deformation field by disentangling backward and forward deformation correspondences into shared skeletal motion weight and separate non-rigid motions. Then, we devise a Forward Correspondence Search module, which queries the correspondence feature of keyframes to guide the rendering network. The rendered results are thus multi-view consistent with high fidelity, even under challenging novel pose settings. Extensive experiments demonstrate the superiority of our proposed MonoHuman over state-of-the-art methods., Comment: 15 pages, 14 figures. Accepted to CVPR 2023. Project page: https://yzmblog.github.io/projects/MonoHuman/
Published: 2023

20. SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling

Author: Yang, Zhitao, Cai, Zhongang, Mei, Haiyi, Liu, Shuai, Chen, Zhaoxi, Xiao, Weiye, Wei, Yukun, Qing, Zhongfei, Wei, Chen, Dai, Bo, Wu, Wayne, Qian, Chen, Lin, Dahua, Liu, Ziwei, and Yang, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Synthetic data has emerged as a promising source for 3D human research as it offers low-cost access to large-scale human datasets. To advance the diversity and annotation quality of human models, we introduce a new synthetic dataset, SynBody, with three appealing features: 1) a clothed parametric human model that can generate a diverse range of subjects; 2) the layered human representation that naturally offers high-quality 3D annotations to support multiple tasks; 3) a scalable system for producing realistic data to facilitate real-world tasks. The dataset comprises 1.2M images with corresponding accurate 3D annotations, covering 10,000 human body models, 1,187 actions, and various viewpoints. The dataset includes two subsets for human pose and shape estimation as well as human neural rendering. Extensive experiments on SynBody indicate that it substantially enhances both SMPL and SMPL-X estimation. Furthermore, the incorporation of layered annotations offers a valuable training resource for investigating the Human Neural Radiance Fields (NeRF)., Comment: Accepted by ICCV 2023. Project webpage: https://synbody.github.io/
Published: 2023

21. CelebV-Text: A Large-Scale Facial Text-Video Dataset

Author: Yu, Jianhui, Zhu, Hao, Jiang, Liming, Loy, Chen Change, Cai, Weidong, and Wu, Wayne
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-driven generation models are flourishing in video generation and editing. However, face-centric text-to-video generation remains a challenge due to the lack of a suitable dataset containing high-quality videos and highly relevant texts. This paper presents CelebV-Text, a large-scale, diverse, and high-quality dataset of facial text-video pairs, to facilitate research on facial text-to-video generation tasks. CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts are of high quality, describing both static and dynamic attributes precisely. The superiority of CelebV-Text over other datasets is demonstrated via comprehensive statistical analysis of the videos, texts, and text-video relevance. The effectiveness and potential of CelebV-Text are further shown through extensive self-evaluation. A benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task. All data and models are publicly available., Comment: Accepted by CVPR2023. Project Page: https://celebv-text.github.io/
Published: 2023

22. OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

Author: Wu, Tong, Zhang, Jiarui, Fu, Xiao, Wang, Yuxin, Ren, Jiawei, Pan, Liang, Wu, Wayne, Yang, Lei, Wang, Jiaqi, Qian, Chen, Lin, Dahua, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in modeling 3D objects mostly rely on synthetic datasets due to the lack of large-scale real-scanned 3D databases. To facilitate the development of 3D perception, reconstruction, and generation in the real world, we propose OmniObject3D, a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects. OmniObject3D has several appealing properties: 1) Large Vocabulary: It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets (e.g., ImageNet and LVIS), benefiting the pursuit of generalizable 3D representations. 2) Rich Annotations: Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos. 3) Realistic Scans: The professional scanners support high-quality object scans with precise shapes and realistic appearances. With the vast exploration space offered by OmniObject3D, we carefully set up four evaluation tracks: a) robust 3D perception, b) novel-view synthesis, c) neural surface reconstruction, and d) 3D object generation. Extensive studies are performed on these four benchmarks, revealing new observations, challenges, and opportunities for future research in realistic 3D vision., Comment: Project page: https://omniobject3d.github.io/
Published: 2023

23. 3DHumanGAN: 3D-Aware Human Image Generation with 3D Pose Mapping

Author: Yang, Zhuoqian, Li, Shikai, Wu, Wayne, and Dai, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We present 3DHumanGAN, a 3D-aware generative adversarial network that synthesizes photorealistic images of full-body humans with consistent appearances under different view-angles and body-poses. To tackle the representational and computational challenges in synthesizing the articulated structure of human bodies, we propose a novel generator architecture in which a 2D convolutional backbone is modulated by a 3D pose mapping network. The 3D pose mapping network is formulated as a renderable implicit function conditioned on a posed 3D human mesh. This design has several merits: i) it leverages the strength of 2D GANs to produce high-quality images; ii) it generates consistent images under varying view-angles and poses; iii) the model can incorporate the 3D human prior and enable pose conditioning. Project page: https://3dhumangan.github.io/., Comment: 9 pages, 8 figures
Published: 2022

24. Audio-Driven Co-Speech Gesture Video Generation

Author: Liu, Xian, Wu, Qianyi, Zhou, Hang, Du, Yuanqi, Wu, Wayne, Lin, Dahua, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequence driven by speech audio. Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representation to codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found in: https://alvinliu0.github.io/projects/ANGIE, Comment: Accepted by Advances in Neural Information Processing Systems (NeurIPS), 2022 (Spotlight Presentation). Camera-Ready Version, 19 Pages
Published: 2022

25. VLG: General Video Recognition with Web Textual Knowledge

Author: Lin, Jintao, Liu, Zhaoyang, Wang, Wenhai, Wu, Wayne, and Wang, Limin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video recognition in an open and dynamic world is quite challenging, as we need to handle different settings such as close-set, long-tail, few-shot and open-set. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework. The core contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark of Kinetics-GVR, including four sub-task datasets to cover the mentioned settings. To facilitate the research of GVR, we propose to utilize external textual knowledge from the Internet and provide multi-source text descriptions for all action classes. Second, inspired by the flexibility of language representation, we present a unified visual-linguistic framework (VLG) to solve the problem of GVR by an effective two-stage training paradigm. Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to collaborate high-level semantic concepts under different settings. Extensive results show that our VLG obtains the state-of-the-art performance under four settings. The superior performance demonstrates the effectiveness and generalization ability of our proposed framework. We hope our work makes a step towards the general video recognition and could serve as a baseline for future research. The code and models will be available at https://github.com/MCG-NJU/VLG.
Published: 2022

26. MotionBERT: A Unified Perspective on Learning Human Motion Representations

Author: Zhu, Wentao, Ma, Xiaoxuan, Liu, Zhaoyang, Liu, Libin, Wu, Wayne, and Wang, Yizhou
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion, which can be easily transferred to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It could capture long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, exemplified by the lowest 3D pose estimation error so far when trained from scratch. Furthermore, our proposed framework achieves state-of-the-art performance on all three downstream tasks by simply finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations. Code and models are available at https://motionbert.github.io/, Comment: ICCV 2023 Camera Ready
Published: 2022

27. StyleFaceV: Face Video Generation via Decomposing and Recomposing Pretrained StyleGAN3

Author: Qiu, Haonan, Jiang, Yuming, Zhou, Hang, Wu, Wayne, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Realistic generative face video synthesis has long been a pursuit in both computer vision and graphics community. However, existing face video generation methods tend to produce low-quality frames with drifted facial identities and unnatural movements. To tackle these challenges, we propose a principled framework named StyleFaceV, which produces high-fidelity identity-preserving face videos with vivid movements. Our core insight is to decompose appearance and pose information and recompose them in the latent space of StyleGAN3 to produce stable and dynamic results. Specifically, StyleGAN3 provides strong priors for high-fidelity facial image generation, but the latent space is intrinsically entangled. By carefully examining its latent properties, we propose our decomposition and recomposition designs which allow for the disentangled combination of facial appearance and movements. Moreover, a temporal-dependent model is built upon the decomposed latent features, and samples reasonable sequences of motions that are capable of generating realistic and temporally coherent face videos. Particularly, our pipeline is trained with a joint training strategy on both static images and high-quality video data, which is of higher data efficiency. Extensive experiments demonstrate that our framework achieves state-of-the-art face video generation results both qualitatively and quantitatively. Notably, StyleFaceV is capable of generating realistic 1024×1024 face videos even without high-resolution training videos., Comment: Project Page: http://haonanqiu.com/projects/StyleFaceV.html; Code Repo: https://github.com/arthur-qiu/StyleFaceV
Published: 2022

28. CelebV-HQ: A Large-Scale Video Facial Attributes Dataset

Author: Zhu, Hao, Wu, Wayne, Zhu, Wentao, Jiang, Liming, Tang, Siwei, Zhang, Li, Liu, Ziwei, and Loy, Chen Change
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Large-scale datasets have played indispensable roles in the recent success of face generation/editing and significantly facilitated the advances of emerging research fields. However, the academic community still lacks a video dataset with diverse facial attribute annotations, which is crucial for the research on face-related videos. In this work, we propose a large-scale, high-quality, and diverse video dataset with rich facial attribute annotations, named the High-Quality Celebrity Video Dataset (CelebV-HQ). CelebV-HQ contains 35,666 video clips with the resolution of 512x512 at least, involving 15,653 identities. All clips are labeled manually with 83 facial attributes, covering appearance, action, and emotion. We conduct a comprehensive analysis in terms of age, ethnicity, brightness stability, motion smoothness, head pose diversity, and data quality to demonstrate the diversity and temporal coherence of CelebV-HQ. Besides, its versatility and potential are validated on two representative tasks, i.e., unconditional video generation and video facial attribute editing. Furthermore, we envision the future potential of CelebV-HQ, as well as the new opportunities and challenges it would bring to related research directions. Data, code, and models are publicly available. Project page: https://celebv-hq.github.io., Comment: ECCV 2022. Project Page: https://celebv-hq.github.io/ ; Dataset: https://github.com/CelebV-HQ/CelebV-HQ
Published: 2022

29. Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis

Author: Zhuo, Long, Wang, Guangcong, Li, Shikai, Wu, Wayne, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Video-to-Video synthesis (Vid2Vid) has achieved remarkable results in generating a photo-realistic video from a sequence of semantic maps. However, this pipeline suffers from high computational cost and long inference latency, which largely depends on two essential factors: 1) network architecture parameters, 2) sequential data stream. Recently, the parameters of image-based generative models have been significantly compressed via more efficient network architectures. Nevertheless, existing methods mainly focus on slimming network architectures and ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for the compression of the video task. In this paper, we present a spatial-temporal compression framework, Fast-Vid2Vid, which focuses on data aspects of generative models. It makes the first attempt at time dimension to reduce computational resources and accelerate inference. Specifically, we compress the input data stream spatially and reduce the temporal redundancy. After the proposed spatial-temporal knowledge distillation, our model can synthesize key-frames using the low-resolution data stream. Finally, Fast-Vid2Vid interpolates intermediate frames by motion compensation with slight latency. On standard benchmarks, Fast-Vid2Vid achieves around real-time performance as 20 FPS and saves around 8x computational cost on a single V100 GPU., Comment: ECCV 2022, Project Page: https://fast-vid2vid.github.io/ , Code: https://github.com/fast-vid2vid/fast-vid2vid
Published: 2022

30. Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

Author: Tang, Jiaqi, Liu, Zhaoyang, Tan, Jing, Qian, Chen, Wu, Wayne, and Wang, Limin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generic event boundary detection (GEBD) is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries. In this paper, we present a local context modeling and global boundary decoding approach for GEBD task. Local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries, and it generates powerful video representations and reliable boundary confidence. Based on them, global boundary decoding sub-network is exploited to decode event boundaries from a global view. Our proposed method achieves 85.13% F1-score on Kinetics-GEBD testing set, which achieves a more than 22% F1-score boost compared to the baseline method. The code is available at https://github.com/JackyTown/GEBD_Challenge_CVPR2022., Comment: arXiv admin note: text overlap with arXiv:2112.04771
Published: 2022

31. Production of high protein yeast using enzymatically liquefied almond hulls.

Author: Hitomi, Alex, Wu, Wayne, Wu, Angela, Jeoh, Tina, Boundy-Mills, Kyria, and Sitepu, Irnayuli
Subjects: Animals, Cattle, Prunus dulcis, Saccharomyces cerevisiae, Proteins, Agriculture, Sugars, Animal Feed
Abstract: Animal feed ingredients, especially those abundant in high quality protein, are the most expensive component of livestock production. Sustainable alternative feedstocks may be sourced from abundant, low value agricultural byproducts. California almond production generates nearly 3 Mtons of biomass per year with about 50% in the form of hulls. Almond hulls are a low-value byproduct currently used primarily for animal feed for dairy cattle. However, the protein and essential amino acid content are low, at ~30% d.b.. The purpose of this study was to improve the protein content and quality using yeast. To achieve this, the almond hulls were liquefied to liberate soluble and structural sugars. A multi-phase screening approach was used to identify yeasts that can consume a large proportion of the sugars in almond hulls while accumulating high concentrations of amino acids essential for livestock feed. Compositional analysis showed that almond hulls are rich in polygalacturonic acid (pectin) and soluble sucrose. A pectinase-assisted process was optimized to liquefy and release soluble sugars from almond hulls. The resulting almond hull slurry containing solubilized sugars was subsequently used to grow high-protein yeasts that could consume nutrients in almond hulls while accumulating high concentrations of high-quality protein rich in essential amino acids needed for livestock feed, yielding a process that would produce 72 mg protein/g almond hull. Further work is needed to achieve conversion of galacturonic acid to yeast cell biomass.
Published: 2023

32. Text2Human: Text-Driven Controllable Human Image Generation

Author: Jiang, Yuming, Yang, Shuai, Qiu, Haonan, Wu, Wayne, Loy, Chen Change, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Furthermore, the generation process is even desired to be intuitively controllable for layman users. In this work, we present a text-driven controllable framework, Text2Human, for a high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose with two dedicated steps. 1) With some texts describing the shapes of clothes, the given human pose is first translated to a human parsing map. 2) The final human image is then generated by providing the system with more attributes about the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level includes the structural representations of textures, while the codebook at the fine level focuses on the details of textures. To make use of the learned hierarchical codebook to synthesize desired images, a diffusion-based transformer sampler with mixture of experts is firstly employed to sample indices from the coarsest level of the codebook, which then is used to predict the indices of the codebook at finer levels. The predicted indices at different levels are translated to human images by the decoder learned accompanied with hierarchical codebooks. The use of mixture-of-experts allows for the generated image conditioned on the fine-grained text input. The prediction for finer level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework can generate more diverse and realistic human images compared to state-of-the-art methods., Comment: SIGGRAPH 2022; Project Page: https://yumingj.github.io/projects/Text2Human.html, Codes available at https://github.com/yumingj/Text2Human
Published: 2022

33. EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model

Author: Ji, Xinya, Zhou, Hang, Wang, Kaisiyuan, Wu, Qianyi, Wu, Wayne, Xu, Feng, and Cao, Xun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Although significant progress has been made to audio-driven talking face generation, existing methods either neglect facial emotion or cannot be applied to arbitrary subjects. In this paper, we propose the Emotion-Aware Motion Model (EAMM) to generate one-shot emotional talking faces by involving an emotion source video. Specifically, we first propose an Audio2Facial-Dynamics module, which renders talking faces from audio-driven unsupervised zero- and first-order key-points motion. Then through exploring the motion model's properties, we further propose an Implicit Emotion Displacement Learner to represent emotion-related facial dynamics as linearly additive displacements to the previously acquired motion representations. Comprehensive experiments demonstrate that by incorporating the results from both modules, our method can generate satisfactory talking face results on arbitrary subjects with realistic emotion patterns., Comment: Accepted by SIGGRAPH 2022 Conference Proceedings. For demo video and codes, see https://jixinya.github.io/projects/EAMM/
Published: 2022

34. Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

Author: Cheng, Haoyue, Liu, Zhaoyang, Zhou, Hang, Qian, Chen, Wu, Wayne, and Wang, Limin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries. This task is challenging because only overall labels indicating the video events are provided for training. However, an event might be labeled but not appear in one of the modalities, which results in a modality-specific noisy label problem. In this work, we propose a training strategy to identify and remove modality-specific noisy labels dynamically. It is motivated by two key observations: 1) networks tend to learn clean samples first; and 2) a labeled event would appear in at least one modality. Specifically, we sort the losses of all instances within a mini-batch individually in each modality, and then select noisy samples according to the relationships between intra-modal and inter-modal losses. Besides, we also propose a simple but valid noise ratio estimation method by calculating the proportion of instances whose confidence is below a preset threshold. Our method makes large improvements over the previous state of the arts (e.g. from 60.0% to 63.8% in segment-level visual metric), which demonstrates the effectiveness of our approach. Code and trained models are publicly available at https://github.com/MCG-NJU/JoMoLD., Comment: Accepted by ECCV 2022
Published: 2022

35. StyleGAN-Human: A Data-Centric Odyssey of Human Generation

Author: Fu, Jianglin, Li, Shikai, Jiang, Yuming, Lin, Kwan-Yee, Qian, Chen, Loy, Chen Change, Wu, Wayne, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Unconditional human image generation is an important task in vision and graphics, which enables various applications in the creative industry. Existing studies in this field mainly focus on "network engineering" such as designing new components and objective functions. This work takes a data-centric perspective and investigates multiple critical aspects in "data engineering", which we believe would complement the current practice. To facilitate a comprehensive study, we collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures. Equipped with this large dataset, we rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment. Extensive experiments reveal several valuable observations w.r.t. these aspects: 1) Large-scale data, more than 40K images, are needed to train a high-fidelity unconditional human generation model with vanilla StyleGAN. 2) A balanced training set helps improve the generation quality with rare face poses compared to the long-tailed counterpart, whereas simply balancing the clothing texture distribution does not effectively bring an improvement. 3) Human GAN models with body centers for alignment outperform models trained using face centers or pelvis points as alignment anchors. In addition, a model zoo and human editing applications are demonstrated to facilitate future research in the community., Comment: Technical Report. Project page: https://stylegan-human.github.io/ Code and models: https://github.com/stylegan-human/StyleGAN-Human/
Published: 2022

36. Generalizable Neural Performer: Learning Robust Radiance Fields for Human Novel View Synthesis

Author: Cheng, Wei, Xu, Su, Piao, Jingtan, Qian, Chen, Wu, Wayne, Lin, Kwan-Yee, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work targets at using a general deep learning framework to synthesize free-viewpoint images of arbitrary human performers, only requiring a sparse number of camera views as inputs and skirting per-case fine-tuning. The large variation of geometry and appearance, caused by articulated body poses, shapes and clothing types, are the key bottlenecks of this task. To overcome these challenges, we present a simple yet powerful framework, named Generalizable Neural Performer (GNR), that learns a generalizable and robust neural body representation over various geometry and appearance. Specifically, we compress the light fields for novel view human rendering as conditional implicit neural radiance fields from both geometry and appearance aspects. We first introduce an Implicit Geometric Body Embedding strategy to enhance the robustness based on both parametric 3D human body model and multi-view images hints. We further propose a Screen-Space Occlusion-Aware Appearance Blending technique to preserve the high-quality appearance, through interpolating source view appearance to the radiance fields with a relax but approximate geometric guidance. To evaluate our method, we present our ongoing effort of constructing a dataset with remarkable complexity and diversity. The dataset GeneBody-1.0, includes over 360M frames of 370 subjects under multi-view cameras capturing, performing a large variety of pose actions, along with diverse body shapes, clothing, accessories and hairdos. Experiments on GeneBody-1.0 and ZJU-Mocap show better robustness of our methods than recent state-of-the-art generalizable methods among all cross-dataset, unseen subjects and unseen poses settings. We also demonstrate the competitiveness of our model compared with cutting-edge case-specific ones. Dataset, code and model will be made publicly available., Comment: Project Page: https://generalizable-neural-performer.github.io/ Dataset: https://generalizable-neural-performer.github.io/genebody.html/
Published: 2022

37. TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing

Author: Xu, Yanbo, Yin, Yueqin, Jiang, Liming, Wu, Qianyi, Zheng, Chengyao, Loy, Chen Change, Dai, Bo, and Wu, Wayne
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent advances like StyleGAN have promoted the growth of controllable facial editing. To address its core challenge of attribute decoupling in a single latent space, attempts have been made to adopt dual-space GAN for better disentanglement of style and content representations. Nonetheless, these methods are still incompetent to obtain plausible editing results with high controllability, especially for complicated attributes. In this study, we highlight the importance of interaction in a dual-space GAN for more controllable editing. We propose TransEditor, a novel Transformer-based framework to enhance such interaction. Besides, we develop a new dual-space editing and inversion strategy to provide additional editing flexibility. Extensive experiments demonstrate the superiority of the proposed framework in image quality and editing capability, suggesting the effectiveness of TransEditor for highly controllable facial editing., Comment: CVPR 2022. Code: https://github.com/BillyXYB/TransEditor Project page: https://billyxyb.github.io/TransEditor/
Published: 2022

38. Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

Author: Liu, Xian, Wu, Qianyi, Zhou, Hang, Xu, Yinghao, Qian, Rui, Lin, Xinyi, Zhou, Xiaowei, Wu, Wayne, Dai, Bo, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. Previous studies often synthesize pose movement in a holistic manner, where poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics in speech and the hierarchical structures of human gestures can be naturally described into multiple granularities and associated together. To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation. In HA2G, a Hierarchical Audio Learner extracts audio representations across semantic granularities. A Hierarchical Pose Inferer subsequently renders the entire human pose gradually in a hierarchical manner. To enhance the quality of synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment for better audio representations. Extensive experiments and human evaluation demonstrate that the proposed method renders realistic co-speech gestures and outperforms previous methods in a clear margin. Project page: https://alvinliu0.github.io/projects/HA2G, Comment: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022. Camera-Ready Version, 19 Pages
Published: 2022

39. Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation

Author: Liu, Xian, Xu, Yinghao, Wu, Qianyi, Zhou, Hang, Wu, Wayne, and Zhou, Bolei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Animating high-fidelity video portrait with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation. In order to capture the inconsistent motions as well as the semantic difference between human head and torso, some work models them via two individual sets of NeRF, leading to unnatural results. In this work, we propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF. The proposed model can handle the detailed local facial semantics and the global head-torso relationship through two semantic-aware modules. Specifically, we first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering. Moreover, to enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions. Extensive evaluations demonstrate that our proposed approach renders more realistic video portraits compared to previous methods. Project page: https://alvinliu0.github.io/projects/SSP-NeRF, Comment: 12 pages, 3 figures. Project page: https://alvinliu0.github.io/projects/SSP-NeRF
Published: 2022

40. MoCaNet: Motion Retargeting in-the-wild via Canonicalization Networks

Author: Zhu, Wentao, Yang, Zhuoqian, Di, Ziang, Wu, Wayne, Wang, Yizhou, and Loy, Chen Change
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a novel framework that brings the 3D motion retargeting task from controlled environments to in-the-wild scenarios. In particular, our method is capable of retargeting body motion from a character in a 2D monocular video to a 3D character without using any motion capture system or 3D reconstruction procedure. It is designed to leverage massive online videos for unsupervised training, needless of 3D annotations or motion-body pairing information. The proposed method is built upon two novel canonicalization operations, structure canonicalization and view canonicalization. Trained with the canonicalization operations and the derived regularizations, our method learns to factorize a skeleton sequence into three independent semantic subspaces, i.e., motion, structure, and view angle. The disentangled representation enables motion retargeting from 2D to 3D with high precision. Our method achieves superior performance on motion transfer benchmarks with large body variations and challenging actions. Notably, the canonicalized skeleton sequence could serve as a disentangled and interpretable representation of human motion that benefits action analysis and motion retrieval., Comment: Accepted by AAAI 2022. The first two authors contributed equally. Project page: https://yzhq97.github.io/mocanet/
Published: 2021

41. Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection

Author: Tang, Jiaqi, Liu, Zhaoyang, Qian, Chen, Wu, Wayne, and Wang, Limin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries. The main challenge of this task is perceiving various temporal variations of diverse event boundaries. To this end, this paper presents an effective and end-to-end learnable framework (DDM-Net). To tackle the diversity and complicated semantics of event boundaries, we make three notable improvements. First, we construct a feature bank to store multi-level features of space and time, prepared for difference calculation at multiple scales. Second, to alleviate inadequate temporal modeling of previous methods, we present dense difference maps (DDM) to comprehensively characterize the motion pattern. Finally, we exploit progressive attention on multi-level DDM to jointly aggregate appearance and motion clues. As a result, DDM-Net respectively achieves a significant boost of 14% and 8% on Kinetics-GEBD and TAPOS benchmark, and outperforms the top-1 winner solution of LOVEU Challenge@CVPR 2021 without bells and whistles. The state-of-the-art result demonstrates the effectiveness of richer motion representation and more sophisticated aggregation, in handling the diversity of generic event boundary detection. The code is made available at https://github.com/MCG-NJU/DDM., Comment: CVPR 2022 camera-ready version
Published: 2021

42. Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data

Author: Jiang, Liming, Dai, Bo, Wu, Wayne, and Loy, Chen Change
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Generative adversarial networks (GANs) typically require ample data for training in order to synthesize high-fidelity images. Recent studies have shown that training GANs with limited data remains formidable due to discriminator overfitting, the underlying cause that impedes the generator's convergence. This paper introduces a novel strategy called Adaptive Pseudo Augmentation (APA) to encourage healthy competition between the generator and the discriminator. As an alternative method to existing approaches that rely on standard data augmentations or model regularization, APA alleviates overfitting by employing the generator itself to augment the real data distribution with generated images, which deceives the discriminator adaptively. Extensive experiments demonstrate the effectiveness of APA in improving synthesis quality in the low-data regime. We provide a theoretical analysis to examine the convergence and rationality of our new training strategy. APA is simple and effective. It can be added seamlessly to powerful contemporary GANs, such as StyleGAN2, with negligible computational cost., Comment: NeurIPS 2021. Code: https://github.com/EndlessSora/DeceiveD Project page: https://www.mmlab-ntu.com/project/apa/index.html
Published: 2021

43. We know what attention is!

Author: Wu, Wayne
Published: 2024
Full Text: View/download PDF

44. Targeting autophagy drug discovery: Targets, indications and development trends

Author: Jiang, Mengjia, Wu, Wayne, Xiong, Zijie, Yu, Xiaoping, Ye, Zihong, and Wu, Zhiping
Published: 2024
Full Text: View/download PDF

45. Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation

Author: Zhou, Hang, Sun, Yasheng, Wu, Wayne, Loy, Chen Change, Wang, Xiaogang, and Liu, Ziwei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: While accurate lip synchronization has been achieved for arbitrary-subject audio-driven talking face generation, the problem of how to efficiently drive the head pose remains. Previous methods rely on pre-estimated structural information such as landmarks and 3D parameters, aiming to generate personalized rhythmic movements. However, the inaccuracy of such estimated information under extreme conditions would lead to degradation problems. In this paper, we propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. The key is to modularize audio-visual representations by devising an implicit low-dimension pose code. Substantially, both speech content and head pose information lie in a joint non-identity embedding space. While speech content information can be defined by learning the intrinsic synchronization between audio-visual modalities, we identify that a pose code will be complementarily learned in a modulated convolution-based reconstruction framework. Extensive experiments show that our method generates accurately lip-synced talking faces whose poses are controllable by other videos. Moreover, our model has multiple advanced capabilities including extreme view robustness and talking face frontalization. Code, models, and demo videos are available at https://hangz-nju-cuhk.github.io/projects/PC-AVS., Comment: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. Code and models are available at https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS
Published: 2021

46. Audio-Driven Emotional Video Portraits

Author: Ji, Xinya, Zhou, Hang, Wang, Kaisiyuan, Wu, Wayne, Loy, Chen Change, Cao, Xun, and Xu, Feng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite previous success in generating audio-driven talking heads, most of the previous studies focus on the correlation between speech content and the mouth shape. Facial emotion, which is one of the most important features on natural human faces, is always neglected in their methods. In this work, we present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audios. Specifically, we propose the Cross-Reconstructed Emotion Disentanglement technique to decompose speech into two decoupled spaces, i.e., a duration-independent emotion space and a duration dependent content space. With the disentangled features, dynamic 2D emotional facial landmarks can be deduced. Then we propose the Target-Adaptive Face Synthesis technique to generate the final high-quality video portraits, by bridging the gap between the deduced landmarks and the natural head poses of target videos. Extensive experiments demonstrate the effectiveness of our method both qualitatively and quantitatively., Comment: Accepted by CVPR2021
Published: 2021

47. Everything's Talkin': Pareidolia Face Reenactment

Author: Song, Linsen, Wu, Wayne, Fu, Chaoyou, Qian, Chen, Loy, Chen Change, and He, Ran
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a new application direction named Pareidolia Face Reenactment, which is defined as animating a static illusory face to move in tandem with a human face in the video. For the large differences between pareidolia face reenactment and traditional human face reenactment, two main challenges are introduced, i.e., shape variance and texture variance. In this work, we propose a novel Parametric Unsupervised Reenactment Algorithm to tackle these two challenges. Specifically, we propose to decompose the reenactment into three catenate processes: shape modeling, motion transfer and texture synthesis. With the decomposition, we introduce three crucial components, i.e., Parametric Shape Modeling, Expansionary Motion Transfer and Unsupervised Texture Synthesizer, to overcome the problems brought by the remarkable variances on pareidolia faces. Extensive experiments show the superior performance of our method both qualitatively and quantitatively. Code, model and data are available on our project page., Comment: Accepted by CVPR2021
Published: 2021

48. DeeperForensics Challenge 2020 on Real-World Face Forgery Detection: Methods and Results

Author: Jiang, Liming, Guo, Zhengkui, Wu, Wayne, Liu, Zhaoyang, Liu, Ziwei, Loy, Chen Change, Yang, Shuo, Xiong, Yuanjun, Xia, Wei, Chen, Baoying, Zhuang, Peiyu, Li, Sili, Chen, Shen, Yao, Taiping, Ding, Shouhong, Li, Jilin, Huang, Feiyue, Cao, Liujuan, Ji, Rongrong, Lu, Changlei, and Tan, Ganchao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This paper reports methods and results in the DeeperForensics Challenge 2020 on real-world face forgery detection. The challenge employs the DeeperForensics-1.0 dataset, one of the most extensive publicly available real-world face forgery detection datasets, with 60,000 videos constituted by a total of 17.6 million frames. The model evaluation is conducted online on a high-quality hidden test set with multiple sources and diverse distortions. A total of 115 participants registered for the competition, and 25 teams made valid submissions. We will summarize the winning solutions and present some discussions on potential research directions., Comment: Technical report. Challenge website: https://competitions.codalab.org/competitions/25228
Published: 2021

49. Focal Frequency Loss for Image Reconstruction and Synthesis

Author: Jiang, Liming, Dai, Bo, Wu, Wayne, and Loy, Chen Change
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Image reconstruction and synthesis have witnessed remarkable progress thanks to the development of generative models. Nonetheless, gaps could still exist between the real and generated images, especially in the frequency domain. In this study, we show that narrowing gaps in the frequency domain can ameliorate image reconstruction and synthesis quality further. We propose a novel focal frequency loss, which allows a model to adaptively focus on frequency components that are hard to synthesize by down-weighting the easy ones. This objective function is complementary to existing spatial losses, offering great impedance against the loss of important frequency information due to the inherent bias of neural networks. We demonstrate the versatility and effectiveness of focal frequency loss to improve popular models, such as VAE, pix2pix, and SPADE, in both perceptual quality and quantitative performance. We further show its potential on StyleGAN2., Comment: ICCV 2021. GitHub: https://github.com/EndlessSora/focal-frequency-loss Project page: https://www.mmlab-ntu.com/project/ffl/index.html
Published: 2020

50. Deducing, Skill and Knowledge

Author: Wu, Wayne, primary
Published: 2023
Full Text: View/download PDF
