Author: "Duan, Nan" / Topic: computer vision and pattern recognition (cs.cv) - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Duan, Nan"' showing total 23 results

Start Over Author "Duan, Nan" Topic computer vision and pattern recognition (cs.cv)

23 results on '"Duan, Nan"'

1. GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

Author: Hou, Zhijian, Ji, Lei, Gao, Difei, Zhong, Wanjun, Yan, Kun, Li, Chao, Chan, Wing-Kwong, Ngo, Chong-Wah, Duan, Nan, and Shou, Mike Zheng
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: In this report, we present our champion solution for Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2023. Essentially, to accurately ground in a video, an effective egocentric feature extractor and a powerful grounding model are required. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data. In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module for effective video and text fusion and various temporal intervals, especially for long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, and surpasses all other teams by a noticeable margin. Our code will be released at\url{https://github.com/houzhijian/GroundNLQ}., 5 pages, 2 figures, 4 tables, the champion solution for Ego4D Natural Language Queries Challenge in CVPR 2023
Published: 2023

2. NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

Author: Yin, Shengming, Wu, Chenfei, Yang, Huan, Wang, Jianfeng, Wang, Xiaodong, Ni, Minheng, Yang, Zhengyuan, Li, Linjie, Liu, Shuguang, Yang, Fan, Fu, Jianlong, Ming, Gong, Wang, Lijuan, Liu, Zicheng, Li, Houqiang, and Duan, Nan
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation. Most current work generates long videos segment by segment sequentially, which normally leads to the gap between training on short videos and inferring long videos, and the sequential generation is inefficient. Instead, our approach adopts a ``coarse-to-fine'' process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows us to directly train on long videos (3376 frames) to reduce the training-inference gap, and makes it possible to generate all segments in parallel. To evaluate our model, we build FlintstonesHD dataset, a new benchmark for long video generation. Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55min to 26s (by 94.26\%) at the same hardware setting when generating 1024 frames. The homepage link is \url{https://msra-nuwa.azurewebsites.net/}
Published: 2023

3. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Author: Wu, Chenfei, Yin, Shengming, Qi, Weizhen, Wang, Xiaodong, Tang, Zecheng, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: ChatGPT is attracting a cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained with languages, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, although showing great visual understanding and generation capabilities, they are only experts on specific tasks with one-round fixed inputs and outputs. To this end, We build a system called \textbf{Visual ChatGPT}, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only languages but also images 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models with multi-steps. 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models of multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at \url{https://github.com/microsoft/visual-chatgpt}.
Published: 2023

4. Learning 3D Photography Videos via Self-supervised Diffusion on Single Images

Author: Wang, Xiaodong, Wu, Chenfei, Yin, Shengming, Ni, Minheng, Wang, Jianfeng, Li, Linjie, Yang, Zhengyuan, Yang, Fan, Wang, Lijuan, Liu, Zicheng, Fang, Yuejian, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: 3D photography renders a static image into a video with appealing 3D visual effects. Existing approaches typically first conduct monocular depth estimation, then render the input frame to subsequent frames with various viewpoints, and finally use an inpainting model to fill those missing/occluded regions. The inpainting model plays a crucial role in rendering quality, but it is normally trained on out-of-domain data. To reduce the training and inference gap, we propose a novel self-supervised diffusion model as the inpainting module. Given a single input image, we automatically construct a training pair of the masked occluded image and the ground-truth image with random cycle-rendering. The constructed training samples are closely aligned to the testing instances, without the need of data annotation. To make full use of the masked images, we design a Masked Enhanced Block (MEB), which can be easily plugged into the UNet and enhance the semantic conditions. Towards real-world animation, we present a novel task: out-animation, which extends the space and time of input objects. Extensive experiments on real datasets show that our method achieves competitive results with existing SOTA methods., Comment: 10 pages, 7 figures
Published: 2023
Full Text: View/download PDF

5. ReCo: Region-Controlled Text-to-Image Generation

Author: Yang, Zhengyuan, Wang, Jianfeng, Gan, Zhe, Li, Linjie, Lin, Kevin, Wu, Chenfei, Duan, Nan, Liu, Zicheng, Liu, Ce, Zeng, Michael, and Wang, Lijuan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, large-scale text-to-image (T2I) models have shown impressive performance in generating high-fidelity images, but with limited controllability, e.g., precisely specifying the content in a specific region with a free-form text description. In this paper, we propose an effective technique for such regional control in T2I generation. We augment T2I models' inputs with an extra set of position tokens, which represent the quantized spatial coordinates. Each region is specified by four position tokens to represent the top-left and bottom-right corners, followed by an open-ended natural language regional description. Then, we fine-tune a pre-trained T2I model with such new input interface. Our model, dubbed as ReCo (Region-Controlled T2I), enables the region control for arbitrary objects described by open-ended regional texts rather than by object labels from a constrained category set. Empirically, ReCo achieves better image quality than the T2I model strengthened by positional words (FID: 8.82->7.36, SceneFID: 15.54->6.51 on COCO), together with objects being more accurately placed, amounting to a 20.40% region classification accuracy improvement on COCO. Furthermore, we demonstrate that ReCo can better control the object count, spatial relationship, and region attributes such as color/size, with the free-form regional description. Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model.
Published: 2022

6. An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022

Author: Hou, Zhijian, Zhong, Wanjun, Ji, Lei, Gao, Difei, Yan, Kun, Chan, Wing-Kwong, Ngo, Chong-Wah, Shou, Zheng, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Information Retrieval (cs.IR), Computer Science - Information Retrieval
Abstract: This technical report describes the CONE approach for Ego4D Natural Language Queries (NLQ) Challenge in ECCV 2022. We leverage our model CONE, an efficient window-centric COarse-to-fiNE alignment framework. Specifically, CONE dynamically slices the long video into candidate windows via a sliding window approach. Centering at windows, CONE (1) learns the inter-window (coarse-grained) semantic variance through contrastive learning and speeds up inference by pre-filtering the candidate windows relevant to the NL query, and (2) conducts intra-window (fine-grained) candidate moments ranking utilizing the powerful multi-modal alignment ability of the contrastive vision-text pre-trained model EgoVLP. On the blind test set, CONE achieves 15.26 and 9.24 for R1@IoU=0.3 and R1@IoU=0.5, respectively., Technical report for ECCV 2022 Ego4D workshop, 4 pages, 2 figures, 2 tables. arXiv admin note: substantial text overlap with arXiv:2209.10918
Published: 2022

7. HORIZON: A High-Resolution Panorama Synthesis Framework

Author: Yan, Kun, Ji, Lei, Wu, Chenfei, Liang, Jian, Zhou, Ming, Duan, Nan, and Ma, Shuai
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Panorama synthesis aims to generate a visual scene with all 360-degree views and enables an immersive virtual world. If the panorama synthesis process can be semantically controlled, we can then build an interactive virtual world and form an unprecedented human-computer interaction experience. Existing panoramic synthesis methods mainly focus on dealing with the inherent challenges brought by panoramas' spherical structure such as the projection distortion and the in-continuity problem when stitching edges, but is hard to effectively control semantics. The recent success of visual synthesis like DALL.E generates promising 2D flat images with semantic control, however, it is hard to directly be applied to panorama synthesis which inevitably generates distorted content. Besides, both of the above methods can not effectively synthesize high-resolution panoramas either because of quality or inference speed. In this work, we propose a new generation framework for high-resolution panorama images. The contributions include 1) alleviating the spherical distortion and edge in-continuity problem through spherical modeling, 2) supporting semantic control through both image and text hints, and 3) effectively generating high-resolution panoramas through parallel decoding. Our experimental results on a large-scale high-resolution Street View dataset validated the superiority of our approach quantitatively and qualitatively.
Published: 2022

8. CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Author: Hou, Zhijian, Zhong, Wanjun, Ji, Lei, Gao, Difei, Yan, Kun, Chan, Wing-Kwong, Ngo, Chong-Wah, Shou, Zheng, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL), Information Retrieval (cs.IR), Computer Science - Information Retrieval
Abstract: This paper tackles an emerging and challenging problem of long video temporal grounding~(VTG) that localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are also highly demanded but less explored, which brings new challenges in higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models to handle long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency as the query-guided window selection mechanism accelerates inference time by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Codes have been released at https://github.com/houzhijian/CONE., ACL 2023 Camera Ready. 14 pages, 7 figures, 4 tables
Published: 2022

9. NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

Author: Wu, Chenfei, Liang, Jian, Hu, Xiaowei, Gan, Zhe, Wang, Jianfeng, Wang, Lijuan, Liu, Zicheng, Fang, Yuejian, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we present NUWA-Infinity, a generative model for infinite visual synthesis, which is defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos. An autoregressive over autoregressive generation mechanism is proposed to deal with this variable-size generation task, where a global patch-level autoregressive model considers the dependencies between patches, and a local token-level autoregressive model considers dependencies between visual tokens within each patch. A Nearby Context Pool (NCP) is introduced to cache-related patches already generated as the context for the current patch being generated, which can significantly save computation costs without sacrificing patch-level dependency modeling. An Arbitrary Direction Controller (ADC) is used to decide suitable generation orders for different visual synthesis tasks and learn order-aware positional embeddings. Compared to DALL-E, Imagen and Parti, NUWA-Infinity can generate high-resolution images with arbitrary sizes and support long-duration video generation additionally. Compared to NUWA, which also covers images and videos, NUWA-Infinity has superior visual synthesis capabilities in terms of resolution and variable-size generation. The GitHub link is https://github.com/microsoft/NUWA. The homepage link is https://nuwa-infinity.microsoft.com., 24 pages, 19 figures
Published: 2022

10. BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning

Author: Xu, Xiao, Wu, Chenfei, Rosenman, Shachar, Lal, Vasudev, Che, Wanxiang, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at https://github.com/microsoft/BridgeTower., Comment: Accepted by AAAI 2023, Oral
Published: 2022
Full Text: View/download PDF

11. N��WA-LIP: Language Guided Image Inpainting with Defect-free VQGAN

Author: Ni, Minheng, Wu, Chenfei, Huang, Haoyang, Jiang, Daxin, Zuo, Wangmeng, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), I.5.4
Abstract: Language guided image inpainting aims to fill in the defective regions of an image under the guidance of text while keeping non-defective regions unchanged. However, the encoding process of existing models suffers from either receptive spreading of defective regions or information loss of non-defective regions, giving rise to visually unappealing inpainting results. To address the above issues, this paper proposes N��WA-LIP by incorporating defect-free VQGAN (DF-VQGAN) with multi-perspective sequence to sequence (MP-S2S). In particular, DF-VQGAN introduces relative estimation to control receptive spreading and adopts symmetrical connections to protect information. MP-S2S further enhances visual information from complementary perspectives, including both low-level pixels and high-level tokens. Experiments show that DF-VQGAN performs more robustness than VQGAN. To evaluate the inpainting performance of our model, we built up 3 open-domain benchmarks, where N��WA-LIP is also superior to recent strong baselines., 10 pages, 6 figures
Published: 2022
Full Text: View/download PDF

12. DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder

Author: Shi, Jie, Wu, Chenfei, Liang, Jian, Liu, Xiang, and Duan, Nan
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Abstract: Recently most successful image synthesis models are multi stage process to combine the advantages of different methods, which always includes a VAE-like model for faithfully reconstructing embedding to image and a prior model to generate image embedding. At the same time, diffusion models have shown be capacity to generate high-quality synthetic images. Our work proposes a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis. We explore how to input image embedding into diffusion model for excellent performance and find that simple modification on diffusion's UNet can achieve it. Training on ImageNet, Our model achieves state-of-the-art results and generates more photorealistic images specifically. In addition, we apply the DiVAE with an Auto-regressive generator on conditional synthesis tasks to perform more human-feeling and detailed samples.
Published: 2022
Full Text: View/download PDF

13. KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

Author: Liu, Yongfei, Wu, Chenfei, Tseng, Shao-yen, Lal, Vasudev, He, Xuming, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy relying on external object detectors to encode images in a multi-modal Transformer framework, which suffer from restrictive object concept space, limited image context and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision: 1.) Object-guided masked vision modeling task focuses on enforcing object-aware representation learning in the multi-modal Transformer; 2.) Phrase-region alignment task aims to improve cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performances over the existing pretraining strategies.
Published: 2021

14. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Author: Luo, Huaishao, Ji, Lei, Zhong, Ming, Chen, Yang, Lei, Wen, Duan, Nan, and Li, Tianrui
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computer Science - Computer Vision and Pattern Recognition
Abstract: Video-text retrieval plays an essential role in multi-modal research and has been widely used in many real-world web applications. The CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of visual concepts learning from web collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Whether image feature is enough for video-text retrieval? 2) How a post-pretraining on a large-scale video-text dataset based on the CLIP affect the performance? 3) What is the practical mechanism to model temporal dependency between video frames? And 4) The Hyper-parameters sensitivity of the model on video-text retrieval task. Extensive experimental results present that the CLIP4Clip model transferred from the CLIP can achieve SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVC, LSMDC, ActivityNet, and DiDeMo. We release our code at https://github.com/ArrowLuo/CLIP4Clip.
Published: 2021

15. N��WA: Visual Synthesis Pre-training for Neural visUal World creAtion

Author: Wu, Chenfei, Liang, Jian, Ji, Lei, Yang, Fan, Fang, Yuejian, Jiang, Daxin, and Duan, Nan
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Abstract: This paper presents a unified multimodal pre-trained model called N��WA that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate N��WA on 8 downstream tasks. Compared to several strong baselines, N��WA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is https://github.com/microsoft/NUWA.
Published: 2021
Full Text: View/download PDF

16. GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

Author: Wu, Chenfei, Huang, Lun, Zhang, Qianxi, Li, Binyang, Ji, Lei, Yang, Fan, Sapiro, Guillermo, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation. Existing works typically experiment on simple or small datasets, where the generalization ability is quite limited. In this work, we propose GODIVA, an open-domain text-to-video pretrained model that can generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism. We pretrain our model on Howto100M, a large-scale text-video dataset that contains more than 136 million text-video pairs. Experiments show that GODIVA not only can be fine-tuned on downstream video generation tasks, but also has a good zero-shot capability on unseen texts. We also propose a new metric called Relative Matching (RM) to automatically evaluate the video generation quality. Several challenges are listed and discussed as future work.
Published: 2021
Full Text: View/download PDF

17. M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

Author: Ni, Minheng, Huang, Haoyang, Su, Lin, Cui, Edward, Bharti, Taroon, Wang, Lijuan, Gao, Jianfeng, Zhang, Dongdong, and Duan, Nan
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: We present M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training. Our goal is to learn universal representations that can map objects occurred in different modalities or texts expressed in different languages into a common semantic space. In addition, to explicitly encourage fine-grained alignment between images and non-English languages, we also propose Multimodal Code-switched Training (MCT) to combine monolingual pre-training and multimodal pre-training via a code-switch strategy. Experiments are performed on the multilingual image retrieval task across two benchmark datasets, including MSCOCO and Multi30K. M3P can achieve comparable results for English and new state-of-the-art results for non-English languages., Accepted to CVPR 2021
Published: 2020

18. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Author: Luo, Huaishao, Ji, Lei, Shi, Botian, Huang, Haoyang, Duan, Nan, Li, Tianrui, Li, Jason, Bharti, Taroon, and Zhou, Ming
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), Computer Vision and Pattern Recognition (cs.CV), Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing, Computation and Language (cs.CL), Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: With the recent success of the pre-training technique for NLP and image-linguistic tasks, some video-linguistic pre-training works are gradually developed to improve video-text related downstream tasks. However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone. Five objectives, including video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of the UniVL more effective. The pre-train is carried out on a sizeable instructional video dataset HowTo100M. Experimental results demonstrate that the UniVL can learn strong video-text representation and achieves state-of-the-art results on five downstream tasks.
Published: 2020

19. XGPT: Cross-modal Generative Pre-Training for Image Captioning

Author: Xia, Qiaolin, Huang, Haoyang, Duan, Nan, Zhang, Dongdong, Ji, Lei, Sui, Zhifang, Cui, Edward, Bharti, Taroon, Liu, Xin, and Zhou, Ming
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate new image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics., Comment: 12 pages, 3 figures, 7 tables
Published: 2020
Full Text: View/download PDF

20. PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph

Author: Li, Yikang, Ma, Tao, Bai, Yeqi, Duan, Nan, Wei, Sining, and Wang, Xiaogang
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite some exciting progress on high-quality image generation from structured(scene graphs) or free-form(sentences) descriptions, most of them only guarantee the image-level semantical consistency, i.e. the generated image matching the semantic meaning of the description. They still lack the investigations on synthesizing the images in a more controllable way, like finely manipulating the visual appearance of every object. Therefore, to generate the images with preferred objects and rich interactions, we propose a semi-parametric method, PasteGAN, for generating the image from the scene graph and the image crops, where spatial arrangements of the objects and their pair-wise relationships are defined by the scene graph and the object appearances are determined by the given object crops. To enhance the interactions of the objects in the output, we design a Crop Refining Network and an Object-Image Fuser to embed the objects as well as their relationships into one map. Multiple losses work collaboratively to guarantee the generated images highly respecting the crops and complying with the scene graphs while maintaining excellent image quality. A crop selector is also proposed to pick the most-compatible crops from our external object tank by encoding the interactions around the objects in the scene graph if the crops are not provided. Evaluated on Visual Genome and COCO-Stuff dataset, our proposed method significantly outperforms the SOTA methods on Inception Score, Diversity Score and Fr\'echet Inception Distance. Extensive experiments also demonstrate our method's ability to generate complex and diverse images with given objects., Comment: 10 pages, 6 figures; Accepted by NeurIPS 2019
Published: 2019

21. Deep Reason: A Strong Baseline for Real-World Visual Reasoning

Author: Wu, Chenfei, Zhou, Yanzhao, Li, Gen, Duan, Nan, Tang, Duyu, and Wang, Xiaojie
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents a strong baseline for real-world visual reasoning (GQA), which achieves 60.93% in GQA 2019 challenge and won the sixth place. GQA is a large dataset with 22M questions involving spatial understanding and multi-step inference. To help further research in this area, we identified three crucial parts that improve the performance, namely: multi-source features, fine-grained encoder, and score-weighted ensemble. We provide a series of analysis on their impact on performance., Comment: CVPR 2019 Visual Question Answering and Dialog Workshop
Published: 2019
Full Text: View/download PDF

22. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Author: Li, Gen, Duan, Nan, Fang, Yuejian, Gong, Ming, Jiang, Daxin, and Zhou, Ming
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrow ideas from cross-lingual pre-trained models, such as XLM and Unicoder, both visual and linguistic contents are fed into a multi-layer Transformer for the cross-modal pre-training, where three pre-trained tasks are employed, including Masked Language Modeling (MLM), Masked Object Classification (MOC) and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both two tasks and show the powerful ability of the cross-modal pre-training., Comment: accepted by AAAI-2020. arXiv admin note: text overlap with arXiv:1909.11740 by other authors
Published: 2019
Full Text: View/download PDF

23. Visual Question Generation as Dual Task of Visual Question Answering

Author: Li, Yikang, Duan, Nan, Zhou, Bolei, Chu, Xiao, Ouyang, Wanli, and Wang, Xiaogang
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently visual question answering (VQA) and visual question generation (VQG) are two trending topics in the computer vision, which have been explored separately. In this work, we propose an end-to-end unified framework, the Invertible Question Answering Network (iQAN), to leverage the complementary relations between questions and answers in images by jointly training the model on VQA and VQG tasks. Corresponding parameter sharing scheme and regular terms are proposed as constraints to explicitly leverage Q,A's dependencies to guide the training process. After training, iQAN can take either question or answer as input, then output the counterpart. Evaluated on the large-scale visual question answering datasets CLEVR and VQA2, our iQAN improves the VQA accuracy over the baselines. We also show the dual learning framework of iQAN can be generalized to other VQA architectures and consistently improve the results over both the VQA and VQG tasks., Comment: 9 pages
Published: 2017
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

23 results on '"Duan, Nan"'

1. GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

2. NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

3. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

4. Learning 3D Photography Videos via Self-supervised Diffusion on Single Images

5. ReCo: Region-Controlled Text-to-Image Generation

6. An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022

7. HORIZON: A High-Resolution Panorama Synthesis Framework

8. CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

9. NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

10. BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning

11. N��WA-LIP: Language Guided Image Inpainting with Defect-free VQGAN

12. DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder

13. KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

14. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

15. N��WA: Visual Synthesis Pre-training for Neural visUal World creAtion

16. GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

17. M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

18. UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

19. XGPT: Cross-modal Generative Pre-Training for Image Captioning

20. PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph

21. Deep Reason: A Strong Baseline for Real-World Visual Reasoning

22. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

23. Visual Question Generation as Dual Task of Visual Question Answering

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

23 results on '"Duan, Nan"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources