Author: "Wang, Huayan" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wang, Huayan"' showing total 12 results

1. Movie Genre Classification by Language Augmentation and Shot Sampling

Author: Zhang, Zhongping, Gu, Yiwen, Plummer, Bryan A., Miao, Xin, Liu, Jiayi, and Wang, Huayan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video-based movie genre classification has garnered considerable attention due to its various applications in recommendation systems. Prior work has typically addressed this task by adapting models from traditional video classification tasks, such as action recognition or event detection. However, these models often neglect language elements (e.g., narrations or conversations) present in videos, which can implicitly convey high-level semantics of movie genres, like storylines or background context. Additionally, existing approaches are primarily designed to encode the entire content of the input video, leading to inefficiencies in predicting movie genres. Movie genre prediction may require only a few shots to accurately determine the genres, rendering a comprehensive understanding of the entire video unnecessary. To address these challenges, we propose a Movie genre Classification method based on Language augmentatIon and shot samPling (Movie-CLIP). Movie-CLIP mainly consists of two parts: a language augmentation module to recognize language elements from the input audio, and a shot sampling module to select representative shots from the entire video. We evaluate our method on MovieNet and Condensed Movies datasets, achieving approximately 6-9% improvement in mean Average Precision (mAP) over the baselines. We also generalize Movie-CLIP to the scene boundary detection task, achieving 1.1% improvement in Average Precision (AP) over the state-of-the-art. We release our implementation at github.com/Zhongping-Zhang/Movie-CLIP.
Comment: Accepted at WACV2024
Published: 2022

2. Complex Scene Image Editing by Scene Graph Comprehension

Author: Zhang, Zhongping, He, Huiwen, Plummer, Bryan A., Liao, Zhenyu, and Wang, Huayan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Conditional diffusion models have demonstrated impressive performance on various tasks like text-guided semantic image editing. Prior work requires image regions to be identified manually by human users or uses an object detector that only performs well for object-centric manipulations. For example, if an input image contains multiple objects with the same semantic meaning (such as a group of birds), object detectors may struggle to recognize and localize the target object, let alone accurately manipulate it. To address these challenges, we propose a two-stage method for achieving complex scene image editing by Scene Graph Comprehension (SGC-Net). In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs and predicts the locations of the target objects. Unlike object detection methods based solely on object category, our method can accurately recognize the target object by comprehending the objects and their semantic relationships within a complex scene. The second stage uses a conditional diffusion model to edit the image based on our RoI predictions. We evaluate the effectiveness of our approach on the CLEVR and Visual Genome datasets. We report an 8-point improvement in SSIM on CLEVR, and our edited images were preferred by human users by 9-33% over prior work on Visual Genome, validating the effectiveness of our proposed method. Code is available at github.com/Zhongping-Zhang/SGC_Net.
Comment: Accepted to BMVC 2023
Published: 2022

3. ImageSubject: A Large-scale Dataset for Subject Detection

Author: Miao, Xin, Liu, Jiayi, Wang, Huayan, and Fu, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Main subjects usually exist in images or videos, as they are the objects that the photographer wants to highlight. Human viewers can easily identify them, but algorithms often confuse them with other objects. Detecting the main subjects is an important technique to help machines understand the content of images and videos. We present a new dataset with the goal of training models to understand the layout of the objects and the context of the image, and then to find the main subjects among them. This is achieved in three aspects. By gathering images from movie shots created by directors with professional shooting skills, we collect the dataset with strong diversity; specifically, it contains 107,700 images from 21,540 movie shots. We labeled them with bounding box labels for two classes: subject and non-subject foreground object. We present a detailed analysis of the dataset and compare the task with saliency detection and object detection. ImageSubject is the first dataset that tries to localize the subject in an image that the photographer wants to highlight. Moreover, we find the transformer-based detection model offers the best result among other popular model architectures. Finally, we discuss the potential applications and conclude with the importance of the dataset.
Published: 2022

4. Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction

Author: Li, Jiachen, Cheng, Shuo, Liao, Zhenyu, Wang, Huayan, Wang, William Yang, and Bai, Qinxun
Subjects: Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Improving the sample efficiency of reinforcement learning algorithms requires effective exploration. Following the principle of optimism in the face of uncertainty (OFU), we train a separate exploration policy to maximize the approximate upper confidence bound of the critics in an off-policy actor-critic framework. However, this introduces extra differences between the replay buffer and the target policy regarding their stationary state-action distributions. To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy RL training. In particular, we correct the training distribution for both policies and critics. Empirically, we evaluate our proposed method in several challenging continuous control tasks and show superior performance compared to state-of-the-art methods. We also conduct extensive ablation studies to demonstrate the effectiveness and rationality of the proposed method.
Comment: Deep RL Workshop, NeurIPS 2022
Published: 2021
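Note on entry 4: the abstract describes an exploration policy that maximizes an approximate upper confidence bound (UCB) of the critics, without giving the formula. The sketch below assumes one common form, the mean of an ensemble of critic estimates plus a multiple of their standard deviation. The names `optimistic_q`, `pick_exploratory_action`, and `beta` are hypothetical, and choosing greedily from a candidate list stands in for the trained exploration policy.

```python
import statistics


def optimistic_q(critic_values, beta=1.0):
    """Assumed UCB form: ensemble mean plus beta times ensemble std."""
    mean = statistics.fmean(critic_values)
    std = statistics.pstdev(critic_values)  # disagreement between critics
    return mean + beta * std


def pick_exploratory_action(candidates, critics, beta=1.0):
    """Return the candidate action with the highest optimistic estimate.

    `critics` is a list of callables mapping an action to a Q-value
    (a simplification: a real critic also takes the state).
    """
    return max(
        candidates,
        key=lambda a: optimistic_q([q(a) for q in critics], beta),
    )
```

With `beta = 0` this reduces to acting on the mean critic estimate; larger values favor actions on which the critics disagree.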

5. Fine-Grained Control of Artistic Styles in Image Generation

Author: Miao, Xin, Wang, Huayan, Fu, Jun, Liu, Jiayi, Wang, Shen, and Liao, Zhenyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in generative models and adversarial training have enabled artificially generating artworks in various artistic styles. It is highly desirable to gain more control over the generated style in practice. However, artistic styles are unlike object categories -- there is a continuous spectrum of styles distinguished by subtle differences. Few works have explored how to capture the continuous spectrum of styles and apply it to a style generation task. In this paper, we propose to achieve this by embedding original artwork examples into a continuous style space. The style vectors are fed to the generator and discriminator to achieve fine-grained control. Our method can be used with common generative adversarial networks (such as StyleGAN). Experiments show that our method not only precisely controls the fine-grained artistic style but also improves image quality over vanilla StyleGAN as measured by FID.
Published: 2021

6. EVOQUER: Enhancing Temporal Grounding with Video-Pivoted BackQuery Generation

Author: Gao, Yanjun, Liu, Lulu, Wang, Jason, Chen, Xin, Wang, Huayan, and Zhang, Rui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Temporal grounding aims to predict a time interval of a video clip corresponding to a natural language query input. In this work, we present EVOQUER, a temporal grounding framework incorporating an existing text-to-video grounding model and a video-assisted query generation network. Given a query and an untrimmed video, the temporal grounding model predicts the target interval, and the predicted video clip is fed into a video translation task by generating a simplified version of the input query. EVOQUER forms closed-loop learning by incorporating loss functions from both temporal grounding and query generation serving as feedback. Our experiments on two widely used datasets, Charades-STA and ActivityNet, show that EVOQUER achieves promising improvements of 1.05 and 1.31 at R@0.7. We also discuss how the query generation task could facilitate error analysis by explaining temporal grounding model behavior.
Comment: Accepted by Visually Grounded Interaction and Language (ViGIL) Workshop at NAACL 2021
Published: 2021
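Note on entry 6: the abstract reports results at R@0.7, which in temporal grounding is commonly read as recall at a temporal IoU threshold of 0.7. The sketch below assumes that reading and one top-ranked prediction per query; the function names are hypothetical and not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))  # overlap length, 0 if disjoint
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.7):
    """Fraction of queries whose predicted interval reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, so it would not count as a hit at threshold 0.7.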

7. Transforming the Latent Space of StyleGAN for Real Face Editing

Author: Li, Heyi, Liu, Jinlong, Zhang, Xinyu, Bai, Yunzhi, Wang, Huayan, and Mueller, Klaus
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite recent advances in semantic manipulation using StyleGAN, semantic editing of real faces remains challenging. The gap between the $W$ space and the $W$+ space demands an undesirable trade-off between reconstruction quality and editing quality. To solve this problem, we propose to expand the latent space by replacing fully-connected layers in StyleGAN's mapping network with attention-based transformers. This simple and effective technique integrates the aforementioned two spaces and transforms them into one new latent space called $W$++. Our modified StyleGAN maintains the state-of-the-art generation quality of the original StyleGAN with moderately better diversity. But more importantly, the proposed $W$++ space achieves superior performance in both reconstruction quality and editing quality. Despite these significant advantages, our $W$++ space supports existing inversion algorithms and editing methods with only negligible modifications thanks to its structural similarity with the $W/W$+ space. Extensive experiments on the FFHQ dataset show that our proposed $W$++ space is evidently preferable to the previous $W/W$+ space for real face editing. The code is publicly available for research purposes at https://github.com/AnonSubm2021/TransStyleGAN.
Comment: 28 pages, 15 figures
Published: 2021

8. Camera-Space Hand Mesh Recovery via Semantic Aggregation and Adaptive 2D-1D Registration

Author: Chen, Xingyu, Liu, Yufeng, Ma, Chongyang, Chang, Jianlong, Wang, Huayan, Chen, Tian, Guo, Xiaoyan, Wan, Pengfei, and Zheng, Wen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent years have witnessed significant progress in 3D hand mesh recovery. Nevertheless, because of the intrinsic 2D-to-3D ambiguity, recovering camera-space 3D information from a single RGB image remains challenging. To tackle this problem, we divide camera-space mesh recovery into two sub-tasks, i.e., root-relative mesh recovery and root recovery. First, joint landmarks and silhouette are extracted from a single input image to provide 2D cues for the 3D tasks. In the root-relative mesh recovery task, we exploit semantic relations among joints to generate a 3D mesh from the extracted 2D cues. Such generated 3D mesh coordinates are expressed relative to a root position, i.e., wrist of the hand. In the root recovery task, the root position is registered to the camera space by aligning the generated 3D mesh back to 2D cues, thereby completing camera-space 3D mesh recovery. Our pipeline is novel in that (1) it explicitly makes use of known semantic relations among joints and (2) it exploits 1D projections of the silhouette and mesh to achieve robust registration. Extensive experiments on popular datasets such as FreiHAND, RHD, and Human3.6M demonstrate that our approach achieves state-of-the-art performance on both root-relative mesh recovery and root recovery. Our code is publicly available at https://github.com/SeanChenxy/HandMesh.
Comment: CVPR2021
Published: 2021

9. Improving Monocular Depth Estimation by Leveraging Structural Awareness and Complementary Datasets

Author: Chen, Tian, An, Shijie, Zhang, Yuan, Ma, Chongyang, Wang, Huayan, Guo, Xiaoyan, and Zheng, Wen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Monocular depth estimation plays a crucial role in 3D recognition and understanding. One key limitation of existing approaches lies in their lack of structural information exploitation, which leads to inaccurate spatial layout, discontinuous surface, and ambiguous boundaries. In this paper, we tackle this problem in three aspects. First, to exploit the spatial relationship of visual features, we propose a structure-aware neural network with spatial attention blocks. These blocks guide the network attention to global structures or local details across different feature layers. Second, we introduce a global focal relative loss for uniform point pairs to enhance spatial constraint in the prediction, and explicitly increase the penalty on errors in depth-wise discontinuous regions, which helps preserve the sharpness of estimation results. Finally, based on analysis of failure cases for prior methods, we collect a new Hard Case (HC) Depth dataset of challenging scenes, such as special lighting conditions, dynamic objects, and tilted camera angles. The new dataset is leveraged by an informed learning curriculum that mixes training examples incrementally to handle diverse data distributions. Experimental results show that our method outperforms state-of-the-art approaches by a large margin in terms of both prediction accuracy on the NYUDv2 dataset and generalization performance on unseen datasets.
Comment: 14 pages, 8 figures
Published: 2020

10. Understanding Why Neural Networks Generalize Well Through GSNR of Parameters

Author: Liu, Jinlong, Jiang, Guoqing, Bai, Yunzhi, Chen, Ting, and Wang, Huayan
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: As deep neural networks (DNNs) achieve tremendous success across many application domains, researchers have explored from many aspects why they generalize well. In this paper, we provide a novel perspective on these issues using the gradient signal to noise ratio (GSNR) of parameters during the training process of DNNs. The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance, over the data distribution. Based on several approximations, we establish a quantitative relationship between model parameters' GSNR and the generalization gap. This relationship indicates that larger GSNR during the training process leads to better generalization performance. Moreover, we show that, different from that of shallow models (e.g. logistic regression, support vector machines), the gradient descent optimization dynamics of DNNs naturally produce large GSNR during training, which is probably the key to DNNs' remarkable generalization ability.
Comment: 14 pages, 8 figures, ICLR2020 accepted as spotlight presentation
Published: 2020
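Note on entry 10: the abstract defines the GSNR of a parameter as the ratio between its gradient's squared mean and its variance over the data distribution. A minimal sketch of that definition follows, assuming a finite list of per-sample gradients stands in for the data distribution; the function name `gsnr` and the handling of the zero-variance case are choices made for this sketch, not taken from the paper.

```python
import statistics


def gsnr(per_sample_grads):
    """Gradient signal-to-noise ratio of a single parameter.

    per_sample_grads: the gradient of the loss with respect to one
    parameter, evaluated separately on each sample.
    """
    mean = statistics.fmean(per_sample_grads)
    var = statistics.pvariance(per_sample_grads)  # population variance
    if var == 0.0:
        # Degenerate case: all samples agree. Returning infinity is a
        # convention chosen for this sketch.
        return float("inf")
    return mean ** 2 / var
```

Gradients that agree across samples give a large ratio, while gradients that cancel out on average give a ratio near zero.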

11. Teaching Compositionality to CNNs

Author: Stone, Austin, Wang, Huayan, Stark, Michael, Liu, Yi, Phoenix, D. Scott, and George, Dileep
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: Convolutional neural networks (CNNs) have shown great success in computer vision, approaching human-level performance when trained for specific tasks via application-specific loss functions. In this paper, we propose a method for augmenting and training CNNs so that their learned features are compositional. It encourages networks to form representations that disentangle objects from their surroundings and from each other, thereby promoting better generalization. Our method is agnostic to the specific details of the underlying CNN to which it is applied and can in principle be used with any CNN. As we show in our experiments, the learned representations lead to feature activations that are more localized and improve performance over non-compositional baselines in object recognition tasks.
Comment: Preprint appearing in CVPR 2017
Published: 2017

12. A backward pass through a CNN using a generative model of its activations

Author: Wang, Huayan, Chen, Anna, Liu, Yi, George, Dileep, and Phoenix, D. Scott
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Neural networks have been shown to be a practical way of building a very complex mapping between a pre-specified input space and output space. For example, a convolutional neural network (CNN) mapping an image into one of a thousand object labels is approaching human performance in this particular task. However, the mapping (neural network) does not automatically lend itself to other forms of queries, for example, to detect/reconstruct object instances, to enforce top-down signal on ambiguous inputs, or to recover object instances from occlusion. One way to address these queries is a backward pass through the network that fuses top-down and bottom-up information. In this paper, we show a way of building such a backward pass by defining a generative model of the neural network's activations. Approximate inference of the model would naturally take the form of a backward pass through the CNN layers, and it addresses the aforementioned queries in a unified framework.
Published: 2016
