Author: "Xu, Mengmeng" / Topic: computer science - computer vision and pattern recognition - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Xu, Mengmeng"' showing total 23 results

1. MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Author: Liu, Haozhe, Liu, Shikun, Zhou, Zijian, Xu, Mengmeng, Xie, Yanping, Han, Xiao, Pérez, Juan C., Liu, Ding, Kahatapitiya, Kumara, Jia, Menglin, Wu, Jui-Chieh, He, Sen, Xiang, Tao, Schmidhuber, Jürgen, and Pérez-Rúa, Juan-Manuel
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models., Comment: Project Page: https://mardini-vidgen.github.io
Published: 2024

2. Move Anything with Layered Scene Diffusion

Author: Ren, Jiawei, Xu, Mengmeng, Wu, Jui-Chieh, Liu, Ziwei, Xiang, Tao, and Toisoul, Antoine
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second., Comment: CVPR 2024 camera-ready
Published: 2024

3. Faster Diffusion via Temporal Attention Decomposition

Author: Liu, Haozhe, Zhang, Wentian, Xie, Jinheng, Faccio, Francesco, Xu, Mengmeng, Xiang, Tao, Shou, Mike Zheng, Perez-Rua, Juan-Manuel, and Schmidhuber, Jürgen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. However, self-attention initially plays a minor role but becomes crucial in the second phase. These findings yield a simple and training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show that, when widely applied to various existing text-conditional diffusion models, TGATE accelerates these models by 10%-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
Published: 2024

4. Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks

Author: Simon, Christian, He, Sen, Perez-Rua, Juan-Manuel, Xu, Mengmeng, Benhalloum, Amine, and Xiang, Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Solving image-to-3D from a single view is an ill-posed problem, and current neural reconstruction methods addressing it through diffusion models still rely on scene-specific optimization, constraining their generalization capability. To overcome the limitations of existing approaches regarding generalization and consistency, we introduce a novel neural rendering technique. Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks. Specifically, our method builds neural encoding volumes from generated multi-view inputs. We adjust the weights of the SDF network conditioned on an input image at test-time to allow model adaptation to novel scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts derived from the synthesized views, we propose the use of a volume transformer module to improve the aggregation of image features instead of processing each viewpoint separately. Through our proposed method, dubbed as Hyper-VolTran, we avoid the bottleneck of scene-specific optimization and maintain consistency across the images generated from multiple viewpoints. Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
Published: 2023

5. GenTron: Diffusion Transformers for Image and Video Generation

Author: Chen, Shoufa, Xu, Mengmeng, Ren, Jiawei, Cong, Yuren, He, Sen, Xie, Yanping, Sinha, Animesh, Luo, Ping, Xiang, Tao, and Perez-Rua, Juan-Manuel
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research., Comment: CVPR2024 Camera Ready. Website: https://www.shoufachen.com/gentron_website/
Published: 2023

6. FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Author: Cong, Yuren, Xu, Mengmeng, Simon, Christian, Chen, Shoufa, Ren, Jiawei, Xie, Yanping, Perez-Rua, Juan-Manuel, Rosenhahn, Bodo, Xiang, Tao, and He, Sen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos., Comment: Accepted by ICLR2024. Project page: https://flatten-video-editing.github.io/
Published: 2023

7. Mindstorms in Natural Language-Based Societies of Mind

Author: Zhuge, Mingchen, Liu, Haozhe, Faccio, Francesco, Ashley, Dylan R., Csordás, Róbert, Gopalakrishnan, Anand, Hamdi, Abdullah, Hammoud, Hasan Abed Al Kader, Herrmann, Vincent, Irie, Kazuki, Kirsch, Louis, Li, Bing, Li, Guohao, Liu, Shuming, Mai, Jinjie, Piękos, Piotr, Ramesh, Aditya, Schlag, Imanol, Shi, Weimin, Stanić, Aleksandar, Wang, Wenyi, Wang, Yuhui, Xu, Mengmeng, Fan, Deng-Ping, Ghanem, Bernard, and Schmidhuber, Jürgen
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multiagent Systems, 68T07, I.2.6, I.2.11
Abstract: Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents -- all communicating through the same universal symbolic language -- are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents-some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions., Comment: 9 pages in main text + 7 pages of references + 38 pages of appendices, 14 figures in main text + 13 in appendices, 7 tables in appendices
Published: 2023

8. Boundary-Denoising for Video Activity Localization

Author: Xu, Mengmeng, Soldan, Mattia, Gao, Jialin, Liu, Shuming, Pérez-Rúa, Juan-Manuel, and Ghanem, Bernard
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenoiseLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence speed. Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64% mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves state-of-the-art performance on the TACoS and MAD datasets, but with far fewer predictions compared to other current methods.
Published: 2023

9. Multi-Modal Few-Shot Temporal Action Detection

Author: Nag, Sauradip, Xu, Mengmeng, Zhu, Xiatian, Perez-Rua, Juan-Manuel, Ghanem, Bernard, Song, Yi-Zhe, and Xiang, Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem and again achieves the state-of-the-art performance on MS-COCO dataset. The code will be available in https://github.com/sauradip/MUPPET, Comment: Technical Report
Published: 2022

10. Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization

Author: Xu, Mengmeng, Li, Yanghao, Fu, Cheng-Yang, Ghanem, Bernard, Xiang, Tao, and Perez-Rua, Juan-Manuel
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper deals with the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify grave implicit biases in current query-conditioned model design and visual query datasets. Then, we directly tackle such biases at both frame and object set levels. Concretely, our method solves these issues by expanding limited annotations and dynamically dropping object proposals during training. Additionally, we propose a novel transformer-based module that allows for object-proposal set context to be considered while incorporating query information. We name our module Conditioned Contextual Transformer or CocoFormer. Our experiments show the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations. Thus, we are able to improve frame-level detection performance from 26.28% to 31.26% in AP, which correspondingly improves the VQ2D and VQ3D localization scores by significant margins. Our improved context-aware query object detector ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D challenge. In addition to this, we showcase the relevance of our proposed model in the Few-Shot Detection (FSD) task, where we also achieve SOTA results. Our code is available at https://github.com/facebookresearch/vq2d_cvpr., Comment: We ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D challenge
Published: 2022

11. Negative Frames Matter in Egocentric Visual Query 2D Localization

Author: Xu, Mengmeng, Fu, Cheng-Yang, Li, Yanghao, Ghanem, Bernard, Perez-Rua, Juan-Manuel, and Xiang, Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The recently released Ego4D dataset and benchmark significantly scale and diversify the first-person visual perception data. In Ego4D, the Visual Queries 2D Localization task aims to retrieve objects that appeared in the past from the recording in the first-person view. This task requires a system to spatially and temporally localize the most recent appearance of a given object query, where the query is registered by a single tight visual crop of the object in a different scene. Our study is based on the three-stage baseline introduced in the Episodic Memory benchmark. The baseline solves the problem by detection and tracking: detect similar objects in all the frames, then run a tracker from the most confident detection result. In the VQ2D challenge, we identified two limitations of the current baseline. (1) The training configuration has redundant computation. Although the training set has millions of instances, most of them are repetitive and the number of unique objects is only around 14.6k. The repeated gradient computation of the same object leads to inefficient training; (2) The false positive rate is high on background frames. This is due to the distribution gap between training and evaluation. During training, the model is only able to see the clean, stable, and labeled frames, but the egocentric videos also have noisy, blurry, or unlabeled background frames. To this end, we developed a more efficient and effective solution. Concretely, we bring the training loop from ~15 days to less than 24 hours, and we achieve 0.17% spatial-temporal AP, which is 31% higher than the baseline. Our solution ranked first on the public leaderboard. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr., Comment: First place winning solution for VQ2D task in CVPR-2022 Ego4D Challenge. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr
Published: 2022

12. ETAD: Training Action Detection End to End on a Laptop

Author: Liu, Shuming, Xu, Mengmeng, Zhao, Chen, Zhao, Xu, and Ghanem, Bernard
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporal action detection (TAD) with end-to-end training often suffers from a huge demand for computing resources due to long video duration. In this work, we propose an efficient temporal action detector (ETAD) that can train directly from video frames with extremely low GPU memory consumption. Our main idea is to minimize and balance the heavy computation among features and gradients in each training iteration. We propose to sequentially forward the snippet frames through the video encoder, and to backpropagate only a small necessary portion of gradients to update the encoder. To further alleviate the computational redundancy in training, we propose to dynamically sample only a small subset of proposals during training. Moreover, various sampling strategies and ratios are studied for both the encoder and detector. ETAD achieves state-of-the-art performance on TAD benchmarks with remarkable efficiency. On ActivityNet-1.3, training ETAD in 18 hours can reach 38.25% average mAP with only 1.3 GB memory consumption per video under end-to-end training. Our code will be publicly released.
Published: 2022

13. Contrastive Language-Action Pre-training for Temporal Localization

Author: Xu, Mengmeng, Gundogdu, Erhan, Lapin, Maksim, Ghanem, Bernard, Donoser, Michael, and Bazzani, Loris
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by the compute device memory constraints and the lack of temporal annotations at large scale. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. Therefore, the video encoder does not learn temporal boundaries and unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents capturing the relations between different action categories and the background context in a video clip, which results in limited generalization capacity. To address these limitations, we propose a novel post-pre-training approach without freezing the video encoder which leverages language. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video language grounding tasks., Comment: 18 pages, 4 figures
Published: 2022

14. SegTAD: Precise Temporal Action Detection via Semantic Segmentation

Author: Zhao, Chen, Ramazanova, Merey, Xu, Mengmeng, and Ghanem, Bernard
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporal action detection (TAD) is an important yet challenging task in video analysis. Most existing works draw inspiration from image object detection and tend to reformulate it as a proposal generation - classification problem. However, there are two caveats with this paradigm. First, proposals are not equipped with annotated labels, which have to be empirically compiled, thus the information in the annotations is not necessarily precisely employed in the model training process. Second, there are large variations in the temporal scale of actions, and neglecting this fact may lead to deficient representation in the video features. To address these issues and precisely model temporal action detection, we formulate the task of temporal action detection in a novel perspective of semantic segmentation. Owing to the 1-dimensional property of TAD, we are able to convert the coarse-grained detection annotations to fine-grained semantic segmentation annotations for free. We take advantage of them to provide precise supervision so as to mitigate the impact induced by the imprecise proposal labels. We propose an end-to-end framework SegTAD composed of a 1D semantic segmentation network (1D-SSN) and a proposal detection network (PDN).
Published: 2022

15. Ego4D: Around the World in 3,000 Hours of Egocentric Video

Author: Grauman, Kristen, Westbury, Andrew, Byrne, Eugene, Chavis, Zachary, Furnari, Antonino, Girdhar, Rohit, Hamburger, Jackson, Jiang, Hao, Liu, Miao, Liu, Xingyu, Martin, Miguel, Nagarajan, Tushar, Radosavovic, Ilija, Ramakrishnan, Santhosh Kumar, Ryan, Fiona, Sharma, Jayant, Wray, Michael, Xu, Mengmeng, Xu, Eric Zhongcong, Zhao, Chen, Bansal, Siddhant, Batra, Dhruv, Cartillier, Vincent, Crane, Sean, Do, Tien, Doulaty, Morrie, Erapalli, Akshay, Feichtenhofer, Christoph, Fragomeni, Adriano, Fu, Qichen, Gebreselasie, Abrham, Gonzalez, Cristina, Hillis, James, Huang, Xuhua, Huang, Yifei, Jia, Wenqi, Khoo, Weslie, Kolar, Jachym, Kottur, Satwik, Kumar, Anurag, Landini, Federico, Li, Chao, Li, Yanghao, Li, Zhenqiang, Mangalam, Karttikeya, Modhugu, Raghava, Munro, Jonathan, Murrell, Tullie, Nishiyasu, Takumi, Price, Will, Puentes, Paola Ruiz, Ramazanova, Merey, Sari, Leda, Somasundaram, Kiran, Southerland, Audrey, Sugano, Yusuke, Tao, Ruijie, Vo, Minh, Wang, Yuchen, Wu, Xindi, Yagi, Takuma, Zhao, Ziwei, Zhu, Yunyi, Arbelaez, Pablo, Crandall, David, Damen, Dima, Farinella, Giovanni Maria, Fuegen, Christian, Ghanem, Bernard, Ithapu, Vamsi Krishna, Jawahar, C. V., Joo, Hanbyul, Kitani, Kris, Li, Haizhou, Newcombe, Richard, Oliva, Aude, Park, Hyun Soo, Rehg, James M., Sato, Yoichi, Shi, Jianbo, Shou, Mike Zheng, Torralba, Antonio, Torresani, Lorenzo, Yan, Mingfei, and Malik, Jitendra
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/, Comment: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)
Published: 2021

16. Relation-aware Video Reading Comprehension for Temporal Language Grounding

Author: Gao, Jialin, Sun, Xin, Xu, Mengmeng, Zhou, Xi, and Ghanem, Bernard
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework aims to select a video moment choice from the predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced by leveraging graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code is available., Comment: Accepted by EMNLP-21
Published: 2021

17. Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization

Author: Xu, Mengmeng, Perez-Rua, Juan-Manuel, Zhu, Xiatian, Ghanem, Bernard, and Martinez, Brais
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder -- trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity end-to-end (LoFi) video encoder pre-training method. Instead of always using the full training configurations for TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial or spatio-temporal resolution so that end-to-end optimization for the video encoder becomes operable under the memory conditions of a mid-range hardware budget. Crucially, this enables the gradient to flow backward through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi pre-training approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18 based video encoder in a single RGB stream, our method surpasses two-stream ResNet50 based alternatives with expensive optical flow, often by a good margin., Comment: To appear at NeurIPS 2021. 15 pages, 1 figure
Published: 2021

18. Boundary-sensitive Pre-training for Temporal Localization in Videos

Author: Xu, Mengmeng, Perez-Rua, Juan-Manuel, Escorcia, Victor, Martinez, Brais, Zhu, Xiatian, Zhang, Li, Ghanem, Bernard, and Xiang, Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Many video analysis tasks require temporal localization and thus detection of content changes. However, most existing models developed for these tasks are pre-trained on general video action classification tasks. This is because large-scale annotation of temporal boundaries in untrimmed videos is expensive. Therefore, no suitable datasets exist for temporal boundary-sensitive pre-training. In this paper, for the first time, we investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task. Instead of relying on costly manual annotations of temporal boundaries, we propose to synthesize temporal boundaries in existing video action classification datasets. With the synthesized boundaries, BSP can be simply conducted via classifying the boundary types. This enables the learning of video representations that are much more transferable to downstream temporal localization tasks. Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart, and achieves new state-of-the-art performance on several temporal localization tasks., Comment: 11 pages, 4 figures
Published: 2020

19. VLG-Net: Video-Language Graph Matching Network for Video Grounding

Author: Soldan, Mattia, Xu, Mengmeng, Qu, Sisi, Tegner, Jesper, and Ghanem, Bernard
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding videos' and queries' semantic content and the fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs built atop video snippets and query tokens separately and used to model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked moment attention pooling by fusing the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo.
Comment: 14 pages, 7 figures. In proceedings of the ICCV21 workshop: AI for Creative Video Editing and Understanding 2021
Published: 2020

20. LC-NAS: Latency Constrained Neural Architecture Search for Point Cloud Networks

Author: Li, Guohao, Xu, Mengmeng, Giancola, Silvio, Thabet, Ali, and Ghanem, Bernard
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Point cloud architecture design has become a crucial problem for 3D deep learning. Several efforts exist to manually design architectures with high accuracy in point cloud tasks such as classification, segmentation, and detection. Recent progress in automatic Neural Architecture Search (NAS) minimizes the human effort in network design and optimizes high performing architectures. However, these efforts fail to consider important factors such as latency during inference. Latency is of high importance in time critical applications like self-driving cars, robot navigation, and mobile applications, that are generally bound by the available hardware. In this paper, we introduce a new NAS framework, dubbed LC-NAS, where we search for point cloud architectures that are constrained to a target latency. We implement a novel latency constraint formulation to trade off between accuracy and latency in our architecture search. Contrary to previous works, our latency loss guarantees that the final network achieves latency under a specified target value. This is crucial when the end task is to be deployed in a limited hardware setting. Extensive experiments show that LC-NAS is able to find state-of-the-art architectures for point cloud classification in ModelNet40 with minimal computational cost. We also show how our searched architectures achieve any desired latency with a reasonably low drop in accuracy. Finally, we show how our searched architectures easily transfer to a different task, part segmentation on PartNet, where we achieve state-of-the-art results while lowering latency by a factor of 10.
Comment: Originally submitted to ECCV'2020 but rejected. This work was filed with the United States Patent and Trademark Office (USPTO) on May 19, 2020 and assigned Serial No. 63/027,241
Published: 2020

21. G-TAD: Sub-Graph Localization for Temporal Action Detection

Author: Xu, Mengmeng, Zhao, Chen, Rojas, David S., Thabet, Ali, and Ghanem, Bernard
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly focus on temporal context, while neglecting semantic context as well as other important context properties. In this work, we propose a graph convolutional network (GCN) model to adaptively incorporate multi-level semantic context into video features and cast temporal action detection as a sub-graph localization problem. Specifically, we formulate video snippets as graph nodes, snippet-snippet correlations as edges, and actions associated with context as target sub-graphs. With graph convolution as the basic operation, we design a GCN block called GCNeXt, which learns the features of each node by aggregating its context and dynamically updates the edges in the graph. To localize each sub-graph, we also design an SGAlign layer to embed each sub-graph into the Euclidean space. Extensive experiments show that G-TAD is capable of finding effective video context without extra supervision and achieves state-of-the-art performance on two detection benchmarks. On ActivityNet-1.3, it obtains an average mAP of 34.09%; on THUMOS14, it reaches 51.6% at IoU@0.5 when combined with a proposal processing method. G-TAD code is publicly available at https://github.com/frostinassiky/gtad.
Comment: Accepted by CVPR2020. 8 pages, 9 figures, 2 pages appendix
Published: 2019

22. BAOD: Budget-Aware Object Detection

Author: Pardo, Alejandro, Xu, Mengmeng, Thabet, Ali, Arbelaez, Pablo, and Ghanem, Bernard
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the problem of object detection from a novel perspective in which annotation budget constraints are taken into consideration, appropriately coined Budget Aware Object Detection (BAOD). When provided with a fixed budget, we propose a strategy for building a diverse and informative dataset that can be used to optimally train a robust detector. We investigate both optimization and learning-based methods to sample which images to annotate and what type of annotation (strongly or weakly supervised) to annotate them with. We adopt a hybrid supervised learning framework to train the object detector from both these types of annotation. We conduct a comprehensive empirical study showing that a handcrafted optimization method outperforms other selection techniques including random sampling, uncertainty sampling and active learning. By combining an optimal image/annotation selection scheme with hybrid supervised learning to solve the BAOD problem, we show that one can achieve the performance of a strongly supervised detector on PASCAL-VOC 2007 while saving 12.8% of its original annotation budget. Furthermore, when 100% of the budget is used, it surpasses this performance by 2.0 mAP percentage points.
Published: 2019

23. Boundary-sensitive Pre-training for Temporal Localization in Videos

Author: Xu, Mengmeng, Perez-Rua, Juan-Manuel, Escorcia, Victor, Martinez, Brais, Zhu, Xiatian, Zhang, Li, Ghanem, Bernard, and Xiang, Tao
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Many video analysis tasks require temporal localization, and thus detection of content changes. However, most existing models developed for these tasks are pre-trained on general video action classification tasks. This is because large scale annotation of temporal boundaries in untrimmed videos is expensive. Therefore, no suitable datasets exist for temporal boundary-sensitive pre-training. In this paper, for the first time, we investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task. Instead of relying on costly manual annotations of temporal boundaries, we propose to synthesize temporal boundaries in existing video action classification datasets. With the synthesized boundaries, BSP can be simply conducted via classifying the boundary types. This enables the learning of video representations that are much more transferable to downstream temporal localization tasks. Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart, and achieves new state-of-the-art performance on several temporal localization tasks.
Comment: 11 pages, 4 figures
Published: 2021
