1,275 results for "Chang, Shih-Fu"
Search Results
52. Task-Adaptive Negative Envision for Few-Shot Open-Set Recognition
- Author
-
Huang, Shiyuan, Ma, Jiawei, Han, Guangxing, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We study the problem of few-shot open-set recognition (FSOR), which learns a recognition system capable of both fast adaptation to new classes with limited labeled examples and rejection of unknown negative samples. Traditional large-scale open-set methods have been shown to be ineffective for the FSOR problem due to data limitations. Current FSOR methods typically calibrate few-shot closed-set classifiers to be sensitive to negative samples so that they can be rejected via thresholding. However, threshold tuning is a challenging process, as different FSOR tasks may require different rejection powers. In this paper, we instead propose task-adaptive negative class envision for FSOR to integrate threshold tuning into the learning process. Specifically, we augment the few-shot closed-set classifier with additional negative prototypes generated from few-shot examples. By incorporating few-shot class correlations in the negative generation process, we are able to learn dynamic rejection boundaries for FSOR tasks. In addition, we extend our method to generalized few-shot open-set recognition (GFSOR), which requires classification on both many-shot and few-shot classes as well as rejection of negative samples. Extensive experiments on public benchmarks validate our methods on both problems., Comment: Accepted by CVPR 2022
- Published
- 2020
53. Open-Vocabulary Object Detection Using Captions
- Author
-
Zareian, Alireza, Rosa, Kevin Dela, Hu, Derek Hao, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning - Abstract
Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. Particularly, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at a significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotation can be detected almost as accurately as supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection., Comment: To be presented at CVPR 2021 (oral paper)
- Published
- 2020
54. Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language
- Author
-
Akbari, Hassan, Palangi, Hamid, Yang, Jianwei, Rao, Sudha, Celikyilmaz, Asli, Fernandez, Roland, Smolensky, Paul, Gao, Jianfeng, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Image and Video Processing - Abstract
Neuro-symbolic representations have proved effective in learning structural information in vision and language. In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning. Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions. We refer to these relations as relative roles and leverage them to make each token role-aware using attention. This results in a more structured and interpretable architecture that incorporates modality-specific inductive biases for the captioning task. Intuitively, the model is able to learn spatial, temporal, and cross-modal relations in a given pair of video and text. The disentanglement achieved by our proposal gives the model more capacity to capture multi-modal structure, which results in higher-quality captions for videos. Our experiments on two established video captioning datasets verify the effectiveness of the proposed approach based on automatic metrics. We further conduct a human evaluation to measure the grounding and relevance of the generated captions and observe consistent improvement for the proposed model. The code and trained models can be found at https://github.com/hassanhub/R3Transformer
- Published
- 2020
55. Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
- Author
-
Li, Liunian Harold, You, Haoxuan, Wang, Zhecan, Zareian, Alireza, Chang, Shih-Fu, and Chang, Kai-Wei
- Subjects
Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks. However, existing models require a large amount of parallel image-caption data for pre-training. Such data are costly to collect and require cumbersome curation. Inspired by unsupervised machine translation, we investigate whether a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora. In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora and introduce the object tags detected by an object recognition model as anchor points to bridge the two modalities. We find that such a simple approach achieves performance close to a model pre-trained with aligned data, on four English V&L benchmarks. Our work challenges the widely held notion that aligned data is necessary for V&L pre-training, while significantly reducing the amount of supervision needed for V&L models., Comment: NAACL 2021 Camera Ready
- Published
- 2020
56. Uncertainty-Aware Few-Shot Image Classification
- Author
-
Zhang, Zhizheng, Lan, Cuiling, Zeng, Wenjun, Chen, Zhibo, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Few-shot image classification learns to recognize new categories from limited labelled data. Metric learning based approaches have been widely investigated, where a query sample is classified by finding the nearest prototype from the support set based on their feature similarities. A neural network has different uncertainties on its calculated similarities of different pairs. Understanding and modeling the uncertainty on the similarity could promote the exploitation of limited samples in few-shot optimization. In this work, we propose an Uncertainty-Aware Few-Shot framework for image classification by modeling the uncertainty of the similarities of query-support pairs and performing uncertainty-aware optimization. In particular, we exploit such uncertainty by converting observed similarities to probabilistic representations and incorporating them into the loss for more effective optimization. In order to jointly consider the similarities between a query and the prototypes in a support set, a graph-based model is utilized to estimate the uncertainty of the pairs. Extensive experiments show our proposed method brings significant improvements on top of a strong baseline and achieves state-of-the-art performance., Comment: Accepted by IJCAI 2021
- Published
- 2020
57. Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
- Author
-
Chen, Long, Ma, Wenbo, Xiao, Jun, Zhang, Hanwang, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia - Abstract
The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., expression-agnostic), hoping that the proposals contain all the right instances mentioned in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer from a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores can guide the NMS operation to filter out the boxes irrelevant to the expression, increasing the recall of critical objects and significantly improving grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS. Codes are available at: https://github.com/ChopinSharp/ref-nms., Comment: Camera ready version at AAAI 2021, Codes are available at: https://github.com/ChopinSharp/ref-nms
- Published
- 2020
58. Analogical Reasoning for Visually Grounded Language Acquisition
- Author
-
Wu, Bo, Qin, Haoyu, Zareian, Alireza, Vondrick, Carl, and Chang, Shih-Fu
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics, 68T07, 68T45, 68T50, 68T40, 68T27, I.2.10, I.2.6, I.2.7, I.2.9 - Abstract
Children acquire language subconsciously by observing the surrounding world and listening to descriptions. They can discover the meaning of words even without explicit language knowledge, and generalize to novel compositions effortlessly. In this paper, we bring this ability to AI, by studying the task of Visually grounded Language Acquisition (VLA). We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning, which approximates novel compositions by learning semantic mapping and reasoning operations from previously seen compositions. Our proposed method, Analogical Reasoning Transformer Networks (ARTNet), is trained on raw multimedia data (video frames and transcripts), and after observing a set of compositions such as "washing apple" or "cutting carrot", it can generalize and recognize new compositions in new video frames, such as "washing carrot" or "cutting apple". To this end, ARTNet refers to relevant instances in the training data and uses their visual features and captions to establish analogies with the query image. Then it chooses the suitable verb and noun to create a new composition that describes the new image best. Extensive experiments on an instructional video dataset demonstrate that the proposed method achieves significantly better generalization capability and recognition accuracy compared to state-of-the-art transformer models., Comment: 12 pages
- Published
- 2020
59. COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation
- Author
-
Wang, Qingyun, Li, Manling, Wang, Xuan, Parulian, Nikolaus, Han, Guangxing, Ma, Jiawei, Tu, Jingxuan, Lin, Ying, Zhang, Haoran, Liu, Weili, Chauhan, Aabhas, Guan, Yingjun, Li, Bangzheng, Li, Ruisong, Song, Xiangchen, Fung, Yi R., Ji, Heng, Han, Jiawei, Chang, Shih-Fu, Pustejovsky, James, Rah, Jasmine, Liem, David, Elsayed, Ahmed, Palmer, Martha, Voss, Clare, Schneider, Cynthia, and Onyshkevych, Boyan
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence - Abstract
To combat COVID-19, both clinicians and scientists need to digest vast amounts of relevant biomedical knowledge in scientific literature to understand the disease mechanism and related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, to extract fine-grained multimedia knowledge elements (entities and their visual chemical structures, relations, and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures, and knowledge subgraphs as evidence., Comment: 12 pages, Accepted by Proceedings of 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics System Demonstrations, for resources see http://blender.cs.illinois.edu/covid19/, for video see http://159.89.180.81/demo/covid/Covid-KG_DemoVideo.mp4, for slides see https://eaglew.github.io/files/Covid-KG_DemoVideo_with_ethics.pdf
- Published
- 2020
60. Learning Visual Commonsense for Robust Scene Graph Generation
- Author
-
Zareian, Alireza, Wang, Zhecan, You, Haoxuan, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild. Perception errors often lead to nonsensical compositions in the output scene graph, which do not follow real-world rules and patterns, and can be corrected using commonsense knowledge. We propose the first method to acquire visual commonsense such as affordance and intuitive physics automatically from data, and use that to improve the robustness of scene understanding. To this end, we extend Transformer models to incorporate the structure of scene graphs, and train our Global-Local Attention Transformer on a scene graph corpus. Once trained, our model can be applied on any scene graph generation model and correct its obvious mistakes, resulting in more semantically plausible scene graphs. Through extensive experiments, we show our model learns commonsense better than any alternative, and improves the accuracy of state-of-the-art scene graph generation methods., Comment: To be presented at ECCV 2020
- Published
- 2020
61. Beyond Triplet Loss: Meta Prototypical N-tuple Loss for Person Re-identification
- Author
-
Zhang, Zhizheng, Lan, Cuiling, Zeng, Wenjun, Chen, Zhibo, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Person Re-identification (ReID) aims at matching a person of interest across images. In convolutional neural network (CNN) based approaches, loss design plays a vital role in pulling closer features of the same identity and pushing far apart features of different identities. In recent years, triplet loss achieves superior performance and is predominant in ReID. However, triplet loss considers only three instances of two classes in per-query optimization (with an anchor sample as the query) and is actually equivalent to a two-class classification. There is a lack of loss designs that enable the joint optimization of multiple instances (of multiple classes) within per-query optimization for person ReID. In this paper, we introduce a multi-class classification loss, i.e., N-tuple loss, to jointly consider multiple (N) instances for per-query optimization. This in fact aligns better with the ReID test/inference process, which conducts the ranking/comparisons among multiple instances. Furthermore, for more efficient multi-class classification, we propose a new meta prototypical N-tuple loss. With the multi-class classification incorporated, our model achieves the state-of-the-art performance on the benchmark person ReID datasets., Comment: Accepted by IEEE Transactions on Multimedia. (A toy sketch of the N-tuple loss idea follows this entry.)
- Published
- 2020
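The N-tuple idea in entry 61 above -- comparing a query against multiple class prototypes at once rather than a single positive/negative pair -- can be illustrated with a small forward-only sketch. This is not the paper's meta prototypical formulation; the cosine normalization, scale factor, and toy shapes are assumptions.

```python
import numpy as np

def n_tuple_loss(query, prototypes, target, scale=10.0):
    """Toy N-tuple (multi-class) loss: the query is compared against N class
    prototypes at once and trained with a softmax cross-entropy, instead of
    one positive/negative pair as in triplet loss.  `scale` is an assumed
    similarity scaling factor."""
    query = query / np.linalg.norm(query)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = scale * protos @ query          # cosine similarity to each of the N classes
    logits = sims - sims.max()             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target])          # cross-entropy against the true identity

rng = np.random.default_rng(0)
q = rng.normal(size=128)                   # query feature
P = rng.normal(size=(8, 128))              # prototypes of N=8 identities
print(n_tuple_loss(q, P, target=3))
```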
62. Deep Learning Guided Building Reconstruction from Satellite Imagery-derived Point Clouds
- Author
-
Xu, Bo, Zhang, Xu, Li, Zhixin, Leotta, Matt, Chang, Shih-Fu, and Shan, Jie
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing - Abstract
3D urban reconstruction of buildings from remotely sensed imagery has drawn significant attention during the past two decades. While aerial imagery and LiDAR provide higher resolution, satellite imagery is cheaper and more efficient to acquire for large-scale needs. However, the high orbital altitude of satellite observation brings intrinsic challenges, such as unpredictable atmospheric effects, multiple view angles, significant radiometric differences across the necessary multiple views, diverse land cover and urban structures in a scene, and a small base-height ratio or narrow field of view, all of which may degrade 3D reconstruction quality. To address these major challenges, we present a reliable and effective approach for building model reconstruction from the point clouds generated from multi-view satellite images. We utilize multiple types of primitive shapes to fit the input point cloud. Specifically, a deep-learning approach is adopted to distinguish the shape of building roofs in complex and yet noisy scenes. For points that belong to the same roof shape, a multi-cue, hierarchical RANSAC approach is proposed for efficiently and reliably segmenting and reconstructing the building point cloud. Experimental results over four selected urban areas (0.34 to 2.04 sq km in size) demonstrate the proposed method can generate detailed roof structures under noisy data environments. The average success rate for building shape recognition is 83.0%, while the overall completeness and correctness are over 70% with reference to ground truth created from airborne LiDAR. As the first effort to address the public need for large-scale city model generation, the development is deployed as open-source software.
- Published
- 2020
63. Cross-media Structured Common Space for Multimedia Event Extraction
- Author
-
Li, Manling, Zareian, Alireza, Zeng, Qi, Whitehead, Spencer, Lu, Di, Ji, Heng, and Chang, Shih-Fu
- Subjects
Computer Science - Multimedia, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods., Comment: Accepted as an oral paper at ACL 2020
- Published
- 2020
64. Unifying Specialist Image Embedding into Universal Image Embedding
- Author
-
Feng, Yang, Peng, Futang, Zhang, Xu, Zhu, Wei, Zhang, Shanfeng, Zhou, Howard, Li, Zhen, Duerig, Tom, Chang, Shih-Fu, and Luo, Jiebo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Deep image embedding provides a way to measure the semantic similarity of two images. It plays a central role in many applications such as image search, face verification, and zero-shot learning. It is desirable to have a universal deep embedding model applicable to various domains of images. However, existing methods mainly rely on training specialist embedding models, each of which is applicable to images from a single domain. In this paper, we study an important but unexplored task: how to train a single universal image embedding model to match the performance of several specialists on each specialist's domain. Simply fusing the training data from multiple domains cannot solve this problem because some domains overfit sooner than others when trained together using existing methods. Therefore, we propose to distill the knowledge in multiple specialists into a universal embedding to solve this problem. In contrast to existing embedding distillation methods that distill the absolute distances between images, we transform the absolute distances between images into a probabilistic distribution and minimize the KL-divergence between the distributions of the specialists and the universal embedding. Using several public datasets, we validate that our proposed method accomplishes the goal of universal image embedding. (A toy sketch of the distance-distribution distillation idea follows this entry.)
- Published
- 2020
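A minimal sketch of the kind of distillation objective described in entry 64 above: distances from an anchor image to the rest of a batch are turned into probability distributions for the specialist and the universal model, and the KL-divergence between the two distributions is the loss. The temperature, batch layout, and Euclidean distance choice are assumptions, not values from the paper.

```python
import numpy as np

def distance_distribution(embeddings, anchor_idx, temperature=1.0):
    """Turn the distances from one anchor image to all other images in the
    batch into a probability distribution (softmax over negative distances)."""
    d = np.linalg.norm(embeddings - embeddings[anchor_idx], axis=1)
    d = np.delete(d, anchor_idx)                  # drop the self-distance
    logits = -d / temperature
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def kl_distillation(specialist_emb, universal_emb, anchor_idx):
    """KL(specialist || universal) over the two distance distributions."""
    p = distance_distribution(specialist_emb, anchor_idx)
    q = distance_distribution(universal_emb, anchor_idx)
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
spec = rng.normal(size=(16, 64))    # specialist embeddings of a mini-batch
univ = rng.normal(size=(16, 64))    # universal-model embeddings of the same images
print(kl_distillation(spec, univ, anchor_idx=0))
```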
65. Training with Streaming Annotation
- Author
-
Zhang, Tongtao, Ji, Heng, Chang, Shih-Fu, and Freedman, Marjorie
- Subjects
Computer Science - Computation and Language - Abstract
In this paper, we address a practical scenario where training data is released in a sequence of small-scale batches and annotation in earlier phases has lower quality than the later counterparts. To tackle the situation, we utilize a pre-trained transformer network to preserve and integrate the most salient document information from the earlier batches while focusing on the annotation (presumably with higher quality) from the current batch. Using event extraction as a case study, we demonstrate in the experiments that our proposed framework can perform better than conventional approaches (the improvement ranges from 3.6 to 14.9% absolute F-score gain), especially when there is more noise in the early annotation; and our approach saves 19.1% of time compared with the best conventional method.
- Published
- 2020
66. Weakly Supervised Visual Semantic Parsing
- Author
-
Zareian, Alireza, Karaman, Svebor, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images, enabling deep understanding of visual content, with many applications such as visual reasoning and image retrieval. Nevertheless, existing SGG methods require millions of manually annotated bounding boxes for training, and are computationally inefficient, as they exhaustively process all pairs of object proposals to detect predicates. In this paper, we address those two limitations by first proposing a generalized formulation of SGG, namely Visual Semantic Parsing, which disentangles entity and predicate recognition, and enables sub-quadratic performance. Then we propose the Visual Semantic Parsing Network, VSPNet, based on a dynamic, attention-based, bipartite message passing framework that jointly infers graph nodes and edges through an iterative process. Additionally, we propose the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding box annotations. Through extensive experiments, we show that VSPNet outperforms weakly supervised baselines significantly and approaches fully supervised performance, while being several times faster. We publicly release the source code of our method., Comment: To be presented at CVPR 2020 (oral paper)
- Published
- 2020
67. Bridging Knowledge Graphs to Generate Scene Graphs
- Author
-
Zareian, Alireza, Karaman, Svebor, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Scene graphs are powerful representations that parse images into their abstract semantic elements, i.e., objects and their interactions, which facilitates visual comprehension and explainable reasoning. On the other hand, commonsense knowledge graphs are rich repositories that encode how the world is structured, and how general concepts interact. In this paper, we present a unified formulation of these two constructs, where a scene graph is seen as an image-conditioned instantiation of a commonsense knowledge graph. Based on this new perspective, we re-formulate scene graph generation as the inference of a bridge between the scene and commonsense graphs, where each entity or predicate instance in the scene graph has to be linked to its corresponding entity or predicate class in the commonsense graph. To this end, we propose a novel graph-based neural network that iteratively propagates information between the two graphs, as well as within each of them, while gradually refining their bridge in each iteration. Our Graph Bridging Network, GB-Net, successively infers edges and nodes, allowing to simultaneously exploit and refine the rich, heterogeneous structure of the interconnected scene and commonsense graphs. Through extensive experimentation, we showcase the superior accuracy of GB-Net compared to the most recent methods, resulting in a new state of the art. We publicly release the source code of our method., Comment: To be presented at ECCV 2020
- Published
- 2020
68. General Partial Label Learning via Dual Bipartite Graph Autoencoder
- Author
-
Chen, Brian, Wu, Bo, Zareian, Alireza, Zhang, Hanwang, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We formulate a practical yet challenging problem: General Partial Label Learning (GPLL). Compared to the traditional Partial Label Learning (PLL) problem, GPLL relaxes the supervision assumption from instance-level -- a label set partially labels an instance -- to group-level: 1) a label set partially labels a group of instances, where the within-group instance-label link annotations are missing, and 2) cross-group links are allowed -- instances in a group may be partially linked to the label set from another group. Such ambiguous group-level supervision is more practical in real-world scenarios as additional annotation on the instance-level is no longer required, e.g., face-naming in videos where the group consists of faces in a frame, labeled by a name set in the corresponding caption. In this paper, we propose a novel graph convolutional network (GCN) called Dual Bipartite Graph Autoencoder (DB-GAE) to tackle the label ambiguity challenge of GPLL. First, we exploit the cross-group correlations to represent the instance groups as dual bipartite graphs: within-group and cross-group, which reciprocally complement each other to resolve the linking ambiguities. Second, we design a GCN autoencoder to encode and decode them, where the decodings are considered as the refined results. It is worth noting that DB-GAE is self-supervised and transductive, as it only uses the group-level supervision without a separate offline training stage. Extensive experiments on two real-world datasets demonstrate that DB-GAE significantly outperforms the best baseline by an absolute 0.159 F1-score and 24.8% accuracy. We further offer analysis on various levels of label ambiguities., Comment: 8 pages
- Published
- 2020
- Full Text
- View/download PDF
69. Rapidly adaptable automated interpretation of point-of-care COVID-19 diagnostics
- Author
-
Arumugam, Siddarth, Ma, Jiawei, Macar, Uzay, Han, Guangxing, McAulay, Kathrine, Ingram, Darrell, Ying, Alex, Chellani, Harshit Harpaldas, Chern, Terry, Reilly, Kenta, Colburn, David A. M., Stanciu, Robert, Duffy, Craig, Williams, Ashley, Grys, Thomas, Chang, Shih-Fu, and Sia, Samuel K.
- Published
- 2023
- Full Text
- View/download PDF
70. Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition
- Author
-
Huang, Shiyuan, Lin, Xudong, Karaman, Svebor, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Two-stream networks have achieved great success in video recognition. A two-stream network combines a spatial stream of RGB frames and a temporal stream of optical flow to make predictions. However, the temporal redundancy of RGB frames as well as the high cost of optical flow computation creates challenges for both performance and efficiency. Recent works instead use modern compressed video modalities as an alternative to the RGB spatial stream and improve the inference speed by orders of magnitude. Previous works create one stream for each modality, and these are combined with an additional temporal stream through late fusion. This is redundant since some modalities, like motion vectors, already contain temporal information. Based on this observation, we propose a compressed-domain two-stream network, IP TSN, for compressed video recognition, where the two streams are represented by the two types of frames (I and P frames) in compressed videos, without needing a separate temporal stream. With this goal, we propose to fully exploit the motion information of the P-stream through generalized distillation from optical flow, which largely improves the efficiency and accuracy. Our P-stream runs 60 times faster than using optical flow while achieving higher accuracy. Our full IP TSN, evaluated over public action recognition benchmarks (UCF101, HMDB51 and a subset of Kinetics), outperforms other compressed-domain methods by large margins while improving the total inference speed by 20%.
- Published
- 2019
71. Learning to Learn Words from Visual Scenes
- Author
-
Surís, Dídac, Epstein, Dave, Ji, Heng, Chang, Shih-Fu, and Vondrick, Carl
- Subjects
Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
Language acquisition is the process of learning words from the surrounding scene. We introduce a meta-learning framework that learns how to learn word representations from unconstrained scenes. We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. Experiments on two datasets show that our approach is able to more rapidly acquire novel words as well as more robustly generalize to unseen compositions, significantly outperforming established baselines. A key advantage of our approach is that it is data efficient, allowing representations to be learned from scratch without language pre-training. Visualizations and analysis suggest visual information helps our approach learn a rich cross-modal representation from minimal examples. Project webpage is available at https://expert.cs.columbia.edu/, Comment: 26 pages, 12 figures
- Published
- 2019
72. Towards Train-Test Consistency for Semi-supervised Temporal Action Localization
- Author
-
Lin, Xudong, Shou, Zheng, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recently, Weakly-supervised Temporal Action Localization (WTAL) has been extensively studied, but there is still a large gap between weakly-supervised models and fully-supervised models. It is practical and intuitive to annotate temporal boundaries of a few examples and utilize them to help WTAL models better detect actions. However, the train-test discrepancy of the action localization strategy prevents WTAL models from leveraging semi-supervision for further improvement. At training time, attention or multiple instance learning is used to aggregate predictions of each snippet for video-level classification; at test time, they first obtain action score sequences over time, then extract segments whose scores are higher than a fixed threshold, and post-process the action segments. The inconsistent strategy makes it hard to explicitly supervise the action localization model with temporal boundary annotations at training time. In this paper, we propose a Train-Test Consistent framework, TTC-Loc. In both training and testing time, our TTC-Loc localizes actions by comparing scores of action classes and a predicted threshold, which enables it to be trained with semi-supervision. By fixing the train-test discrepancy, our TTC-Loc significantly outperforms the state-of-the-art performance on THUMOS'14, ActivityNet 1.2 and 1.3 when only video-level labels are provided for training. With full annotations of only one video per class and video-level labels for the other videos, our TTC-Loc further boosts the performance and achieves 33.4% mAP (IoU threshold 0.5) on THUMOS'14., Comment: Work in progress
- Published
- 2019
73. Context-Gated Convolution
- Author
-
Lin, Xudong, Ma, Lin, Liu, Wei, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language - Abstract
As the basic building block of Convolutional Neural Networks (CNNs), the convolutional layer is designed to extract local patterns and by its nature lacks the ability to model global context. Many efforts have been recently devoted to complementing CNNs with global modeling ability, especially by a family of works on global feature interaction. In these works, the global context information is incorporated into local features before they are fed into convolutional layers. However, research in neuroscience reveals that neurons' ability to modify their functions dynamically according to context is essential for perceptual tasks, and this has been overlooked in most CNNs. Motivated by this, we propose a novel Context-Gated Convolution (CGC) to explicitly modify the weights of convolutional layers adaptively under the guidance of global context. As such, being aware of the global context, the modulated convolution kernel of our proposed CGC can better extract representative local patterns and compose discriminative features. Moreover, our proposed CGC is lightweight and compatible with modern CNN architectures, and consistently improves the performance of CNNs according to extensive experiments on image classification, action recognition, and machine translation. Our code for this paper is available at https://github.com/XudongLinthu/context-gated-convolution., Comment: ECCV 2020 camera ready version with appendix
- Published
- 2019
74. Report of 2017 NSF Workshop on Multimedia Challenges, Opportunities and Research Roadmaps
- Author
-
Chang, Shih-Fu, Hauptmann, Alex, Morency, Louis-Philippe, Antani, Sameer, Bulterman, Dick, Busso, Carlos, Chai, Joyce, Hirschberg, Julia, Jain, Ramesh, Mayer-Patel, Ketan, Meth, Reuven, Mooney, Raymond, Nahrstedt, Klara, Narayanan, Shri, Natarajan, Prem, Oviatt, Sharon, Prabhakaran, Balakrishnan, Smeulders, Arnold, Sundaram, Hari, Zhang, Zhengyou, and Zhou, Michelle
- Subjects
Computer Science - Multimedia - Abstract
With the transformative technologies and the rapidly changing global R&D landscape, the multimedia and multimodal community is now faced with many new opportunities and uncertainties. With the open source dissemination platform and pervasive computing resources, new research results are being discovered at an unprecedented pace. In addition, the rapid exchange and influence of ideas across traditional discipline boundaries have made the emphasis on multimedia multimodal research even more important than before. To seize these opportunities and respond to the challenges, we have organized a workshop to specifically address and brainstorm the challenges, opportunities, and research roadmaps for MM research. The two-day workshop, held on March 30 and 31, 2017 in Washington DC, was sponsored by the Information and Intelligent Systems Division of the National Science Foundation of the United States. Twenty-three (23) invited participants were asked to review and identify research areas in the MM field that are most important over the next 10-15 year timeframe. Important topics were selected through discussion and consensus, and then discussed in depth in breakout groups. Breakout groups reported initial discussion results to the whole group, who continued with further extensive deliberation. For each identified topic, a summary was produced after the workshop to describe the main findings, including the state of the art, challenges, and research roadmaps planned for the next 5, 10, and 15 years in the identified area., Comment: Long Report of NSF Workshop on Multimedia Challenges, Opportunities and Research Roadmaps, held in March 2017, Washington DC. Short report available separately
- Published
- 2019
75. Detecting and Simulating Artifacts in GAN Fake Images
- Author
-
Zhang, Xu, Karaman, Svebor, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing - Abstract
To detect GAN-generated images, conventional supervised machine learning algorithms require the collection of a number of real and fake images from the targeted GAN model. However, the specific model used by the attacker is often unavailable. To address this, we propose a GAN simulator, AutoGAN, which can simulate the artifacts produced by the common pipeline shared by several popular GAN models. Additionally, we identify a unique artifact caused by the up-sampling component included in the common GAN pipeline. We show theoretically that such artifacts are manifested as replications of spectra in the frequency domain, and thus propose a classifier model based on the spectrum input rather than the pixel input. By using the simulated images to train a spectrum-based classifier, even without seeing the fake images produced by the targeted GAN model during training, our approach achieves state-of-the-art performance on detecting fake images generated by popular GAN models such as CycleGAN., Comment: This is an extended version of our original AutoGAN paper, which will appear in WIFS 2019. (A toy sketch of the spectrum input follows this entry.)
- Published
- 2019
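Entry 75 above proposes classifying images by their frequency spectrum rather than their pixels, since up-sampling artifacts appear as replicated spectral peaks. Below is a minimal sketch of such a spectrum feature, assuming a grayscale image in [0, 1]; the actual AutoGAN simulator and classifier are not reproduced here.

```python
import numpy as np

def spectrum_feature(image):
    """Convert an image into the log-magnitude of its 2-D Fourier spectrum,
    which can then replace raw pixels as the classifier input.
    `image` is assumed to be an (H, W) grayscale array in [0, 1]."""
    f = np.fft.fft2(image)
    f = np.fft.fftshift(f)                 # move the zero frequency to the center
    return np.log(np.abs(f) + 1e-8)        # log magnitude spectrum

# Usage sketch: the spectrum replaces the pixel image as classifier input.
rng = np.random.default_rng(0)
fake = rng.random((256, 256))
x = spectrum_feature(fake)
print(x.shape, x.mean())
```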
76. Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions
- Author
-
Niu, Yulei, Zhang, Hanwang, Lu, Zhiwu, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task since it requires not only the localization of objects, but also the multimodal comprehension of context -- visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity involved in modeling the context associated with multiple image regions, existing work oversimplifies this task to pairwise region modeling by multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Specifically, our framework exploits the reciprocal relation between the referent and context, i.e., either of them influences estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced. In addition to reciprocity, our framework considers the semantic information of context, i.e., the referring expression can be reproduced based on the estimated context. We also extend the model to the unsupervised setting where no annotation for the referent is available. Extensive experiments on various benchmarks show consistent improvement over state-of-the-art methods in both supervised and unsupervised settings., Comment: Accepted as a regular paper in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Substantial text overlap with arXiv:1712.01892
- Published
- 2019
- Full Text
- View/download PDF
77. CDSA: Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation
- Author
-
Ma, Jiawei, Shou, Zheng, Zareian, Alireza, Mansour, Hassan, Vetro, Anthony, and Chang, Shih-Fu
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning - Abstract
Many real-world applications involve multivariate, geo-tagged time series data: at each location, multiple sensors record corresponding measurements. For example, an air quality monitoring system records PM2.5, CO, etc. The resulting time-series data often has missing values due to device outages or communication errors. In order to impute the missing values, state-of-the-art methods are built on Recurrent Neural Networks (RNNs), which process each time stamp sequentially, prohibiting the direct modeling of the relationship between distant time stamps. Recently, the self-attention mechanism has been proposed for sequence modeling tasks such as machine translation, significantly outperforming RNNs because the relationship between any two time stamps can be modeled explicitly. In this paper, we are the first to adapt the self-attention mechanism for multivariate, geo-tagged time series data. In order to jointly capture the self-attention across multiple dimensions, including time, location and the sensor measurements, while maintaining low computational complexity, we propose a novel approach called Cross-Dimensional Self-Attention (CDSA) to process each dimension sequentially, yet in an order-independent manner. Our extensive experiments on four real-world datasets, including three standard benchmarks and our newly collected NYC-traffic dataset, demonstrate that our approach outperforms the state-of-the-art imputation and forecasting methods. A detailed systematic analysis confirms the effectiveness of our design choices. (A toy sketch of attention applied along one dimension at a time follows this entry.)
- Published
- 2019
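A rough sketch of the "one dimension at a time" idea behind CDSA in entry 77 above: plain scaled dot-product self-attention applied in turn along the time, location, and measurement axes of a multivariate tensor. Learned query/key/value projections, handling of missing values, and the order-independence mechanism of the actual model are omitted; shapes are toy assumptions.

```python
import numpy as np

def self_attention_along_axis(x, axis):
    """Plain scaled dot-product self-attention applied independently along one
    axis of a (time, location, measurement, feature) tensor.  A toy stand-in
    for one cross-dimensional step; the real model uses learned projections."""
    x = np.moveaxis(x, axis, -2)                       # (..., axis_len, d)
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # attention logits
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    out = w @ x
    return np.moveaxis(out, -2, axis)

rng = np.random.default_rng(0)
x = rng.normal(size=(24, 5, 3, 16))   # 24 time steps, 5 locations, 3 sensors, 16-d features
for ax in (0, 1, 2):                  # attend over time, then location, then measurement
    x = self_attention_along_axis(x, ax)
print(x.shape)
```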
78. Unsupervised Embedding Learning via Invariant and Spreading Instance Feature
- Author
-
Ye, Mang, Zhang, Xu, Yuen, Pong C., and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper studies the unsupervised embedding learning problem, which requires an effective similarity measurement between samples in a low-dimensional embedding space. Motivated by the positive-concentrated and negative-separated properties observed in category-wise supervised learning, we propose to utilize instance-wise supervision to approximate these properties, which aims at learning data augmentation invariant and instance spread-out features. To achieve this goal, we propose a novel instance-based softmax embedding method, which directly optimizes the 'real' instance features on top of the softmax function. It achieves significantly faster learning speed and higher accuracy than all existing methods. The proposed method performs well for both seen and unseen testing categories with cosine similarity. It also achieves competitive performance even without a pre-trained network, on samples from fine-grained categories., Comment: CVPR 2019. (A toy sketch of the instance-wise softmax loss follows this entry.)
- Published
- 2019
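A toy forward pass of an instance-wise softmax embedding loss in the spirit of entry 78 above: the augmented view of an instance is the positive, the other instances in the batch are negatives, and cosine similarities are scaled by a temperature. The temperature value and batch sizes are assumptions.

```python
import numpy as np

def instance_softmax_loss(anchor, augmented, others, tau=0.1):
    """Instance-wise softmax sketch: pull the augmented view of the same
    instance toward the anchor and spread out all other instances in the
    batch.  Features are L2-normalized so logits are cosine similarities
    divided by an assumed temperature `tau`."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(augmented), norm(others)
    logits = np.concatenate(([a @ p], n @ a)) / tau   # positive first, then negatives
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                          # cross-entropy with the positive at index 0

rng = np.random.default_rng(0)
anchor = rng.normal(size=64)
augmented = anchor + 0.05 * rng.normal(size=64)   # feature of an augmented view
others = rng.normal(size=(31, 64))                # other instances in the batch
print(instance_softmax_loss(anchor, augmented, others))
```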
79. Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval
- Author
-
Karaman, Svebor, Lin, Xudong, Hu, Xuefeng, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Multimedia - Abstract
We propose an unsupervised hashing method which aims to produce binary codes that preserve the ranking induced by a real-valued representation. Such compact hash codes enable the complete elimination of real-valued feature storage and allow for significant reduction of the computation complexity and storage cost of large-scale image retrieval applications. Specifically, we learn a neural network-based model, which transforms the input representation into a binary representation. We formalize the training objective of the network in an intuitive and effective way, considering each training sample as a query and aiming to obtain the same retrieval results using the produced hash codes as those obtained with the original features. This training formulation directly optimizes the hashing model for the target usage of the hash codes it produces. We further explore the addition of a decoder trained to obtain an approximated reconstruction of the original features. At test time, we retrieve the most promising database samples with an efficient graph-based search procedure using only our hash codes and perform re-ranking using the reconstructed features, without needing to access the original features at all. Experiments conducted on multiple publicly available large-scale datasets show that our method consistently outperforms all compared state-of-the-art unsupervised hashing methods and that the reconstruction procedure can effectively boost the search accuracy with a minimal constant additional cost.
- Published
- 2019
80. DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition
- Author
-
Shou, Zheng, Lin, Xudong, Kalantidis, Yannis, Sevilla-Lara, Laura, Rohrbach, Marcus, Chang, Shih-Fu, and Yan, Zhicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Motion has been shown to be useful for video understanding, where it is typically represented by optical flow. However, computing flow from video frames is very time-consuming. Recent works directly leverage the motion vectors and residuals readily available in the compressed video to represent motion at no cost. While this avoids flow computation, it also hurts accuracy, since the motion vector is noisy and has substantially reduced resolution, which makes it a less discriminative motion representation. To remedy these issues, we propose a lightweight generator network, which reduces noise in motion vectors and captures fine motion details, achieving a more Discriminative Motion Cue (DMC) representation. Since optical flow is a more accurate motion representation, we train the DMC generator to approximate flow using a reconstruction loss and a generative adversarial loss, jointly with the downstream action classification task. Extensive evaluations on three action recognition benchmarks (HMDB-51, UCF-101, and a subset of Kinetics) confirm the effectiveness of our method. Our full system, consisting of the generator and the classifier, is coined DMC-Net; it obtains accuracy close to that of using flow and runs two orders of magnitude faster than using optical flow at inference time., Comment: Accepted by CVPR'19. (A toy sketch of the combined training objective follows this entry.)
- Published
- 2019
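Entry 80 above trains a motion-cue generator with a flow reconstruction loss and an adversarial loss jointly with action classification. Below is a rough sketch of how such terms could be combined; the non-saturating adversarial form, the loss weights, and the tensor shapes are assumptions rather than the paper's exact formulation.

```python
import numpy as np

def dmc_style_loss(refined_motion, optical_flow, disc_score, class_probs, label,
                   w_adv=0.1, w_cls=1.0):
    """Toy combination of three objectives: flow reconstruction (MSE), an
    adversarial term that rewards fooling a discriminator, and the downstream
    classification loss.  The weights are assumed, not taken from the paper."""
    recon = np.mean((refined_motion - optical_flow) ** 2)       # approximate optical flow
    adv = -np.log(disc_score + 1e-8)                            # non-saturating generator term
    cls = -np.log(class_probs[label] + 1e-8)                    # action classification
    return recon + w_adv * adv + w_cls * cls

rng = np.random.default_rng(0)
motion = rng.normal(size=(2, 56, 56))      # refined motion cues from the generator
flow = rng.normal(size=(2, 56, 56))        # pre-computed optical flow (training only)
print(dmc_style_loss(motion, flow, disc_score=0.7,
                     class_probs=np.full(51, 1 / 51), label=3))
```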
81. Few-Shot End-to-End Object Detection via Constantly Concentrated Encoding Across Heads
- Author
-
Ma, Jiawei, Han, Guangxing, Huang, Shiyuan, Yang, Yuncong, and Chang, Shih-Fu
- Published
- 2022
- Full Text
- View/download PDF
82. Counterfactual Critic Multi-Agent Training for Scene Graph Generation
- Author
-
Chen, Long, Zhang, Hanwang, Xiao, Jun, He, Xiangnan, Pu, Shiliang, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Scene graphs -- objects as nodes and visual relationships as edges -- describe the whereabouts and interactions of the things and stuff in an image for comprehensive scene understanding. To generate coherent scene graphs, almost all existing methods exploit the fruitful visual context by modeling message passing among objects, fitting the dynamic nature of reasoning with visual context, e.g., "person" on "bike" can help to determine the relationship "ride", which in turn contributes to the category confidence of the two objects. However, we argue that the scene dynamics are not properly learned by the prevailing cross-entropy-based supervised learning paradigm, which is not sensitive to graph inconsistency: errors at hub and non-hub nodes are unfortunately penalized equally. To this end, we propose a Counterfactual critic Multi-Agent Training (CMAT) approach to resolve the mismatch. CMAT is a multi-agent policy gradient method that frames objects as cooperative agents, and then directly maximizes a graph-level metric as the reward. In particular, to assign the reward properly to each agent, CMAT uses a counterfactual baseline that disentangles the agent-specific reward by fixing the dynamics of other agents. Extensive validations on the challenging Visual Genome benchmark show that CMAT achieves a new state of the art with significant performance gains under various settings and metrics., Comment: International Conference on Computer Vision (ICCV), 2019 (oral)
- Published
- 2018
83. Multi-granularity Generator for Temporal Action Proposal
- Author
-
Liu, Yuan, Ma, Lin, Zhang, Yifeng, Liu, Wei, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Temporal action proposal generation is an important task, aiming to localize the video segments containing human actions in an untrimmed video. In this paper, we propose a multi-granularity generator (MGG) to perform temporal action proposal generation from different granularity perspectives, relying on video visual features equipped with position embedding information. First, we propose to use a bilinear matching model to exploit the rich local information within the video sequence. Afterwards, two components, namely the segment proposal producer (SPP) and the frame actionness producer (FAP), are combined to perform the task of temporal action proposal at two distinct granularities. SPP considers the whole video in the form of a feature pyramid and generates segment proposals from one coarse perspective, while FAP carries out a finer actionness evaluation for each video frame. Our proposed MGG can be trained in an end-to-end fashion. By temporally adjusting the segment proposals with fine-grained frame actionness information, MGG achieves superior performance over state-of-the-art methods on the public THUMOS-14 and ActivityNet-1.3 datasets. Moreover, we employ existing action classifiers to perform classification of the proposals generated by MGG, leading to significant improvements compared with competing methods on the video detection task., Comment: Accepted to CVPR 2019
- Published
- 2018
84. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding
- Author
-
Akbari, Hassan, Karaman, Svebor, Bhargava, Surabhi, Chen, Brian, Vondrick, Carl, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing - Abstract
We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model. Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content are performed with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at each level. The best level is chosen to be compared with text content for maximizing the pertinence scores of image-sentence pairs of the ground truth. Experiments conducted on three publicly available datasets show significant performance gains (20%-60% relative) over the state-of-the-art in phrase localization and set a new performance record on those datasets. We provide a detailed ablation study to show the contribution of each element of our approach and release our code on GitHub., Comment: Accepted in CVPR 2019
- Published
- 2018
85. Low-shot Learning via Covariance-Preserving Adversarial Augmentation Networks
- Author
-
Gao, Hang, Shou, Zheng, Zareian, Alireza, Zhang, Hanwang, and Chang, Shih-Fu
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning - Abstract
Deep neural networks suffer from over-fitting and catastrophic forgetting when trained with small data. One natural remedy for this problem is data augmentation, which has been recently shown to be effective. However, previous works either assume that intra-class variances can always be generalized to new classes, or employ naive generation methods to hallucinate finite examples without modeling their latent distributions. In this work, we propose Covariance-Preserving Adversarial Augmentation Networks to overcome existing limits of low-shot learning. Specifically, a novel Generative Adversarial Network is designed to model the latent distribution of each novel class given its related base counterparts. Since direct estimation of novel classes can be inductively biased, we explicitly preserve covariance information as the 'variability' of base examples during the generation process. Empirical results show that our model can generate realistic yet diverse examples, leading to substantial improvements on the ImageNet benchmark over the state of the art.
- Published
- 2018
86. Heated-Up Softmax Embedding
- Author
-
Zhang, Xu, Yu, Felix Xinnan, Karaman, Svebor, Zhang, Wei, and Chang, Shih-Fu
- Subjects
Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning - Abstract
Metric learning aims at learning a distance which is consistent with the semantic meaning of the samples. The problem is generally solved by learning an embedding for each sample such that the embeddings of samples of the same category are compact while the embeddings of samples of different categories are spread out in the feature space. We study the features extracted from the second-to-last layer of a deep neural network-based classifier trained with the cross-entropy loss on top of the softmax layer. We show that training classifiers with different temperature values of the softmax function leads to features with different levels of compactness. Leveraging these insights, we propose a "heating-up" strategy to train a classifier with increasing temperatures, leading the corresponding embeddings to achieve state-of-the-art performance on a variety of metric learning benchmarks., Comment: 11 pages, 4 figures. (A toy sketch of a temperature-scaled softmax follows this entry.)
- Published
- 2018
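A minimal sketch of the temperature-scaled softmax classifier discussed in entry 86 above: dividing the logits by a temperature controls how sharp the softmax is, and a "heating-up" schedule raises the temperature as training proceeds. The schedule values and shapes below are assumptions.

```python
import numpy as np

def softmax_cross_entropy(features, weights, label, temperature):
    """Classifier loss on top of a temperature-scaled softmax.  Lower
    temperatures sharpen the distribution; raising the temperature over
    training changes how compact the learned embeddings become."""
    logits = (weights @ features) / temperature
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])

# Toy heating-up schedule over training stages (values are assumptions).
rng = np.random.default_rng(0)
f = rng.normal(size=128)            # feature from the second-to-last layer
W = rng.normal(size=(10, 128))      # classifier weights for 10 classes
for t in (0.1, 0.5, 1.0):           # temperature increases as training proceeds
    print(t, softmax_cross_entropy(f, W, label=2, temperature=t))
```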
87. Multimodal Social Media Analysis for Gang Violence Prevention
- Author
-
Blandfort, Philipp, Patton, Desmond, Frey, William R., Karaman, Svebor, Bhargava, Surabhi, Lee, Fei-Tzin, Varia, Siddharth, Kedzie, Chris, Gaskell, Michael B., Schifanella, Rossano, McKeown, Kathleen, and Chang, Shih-Fu
- Subjects
Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning - Abstract
Gang violence is a severe issue in major cities across the U.S. and recent studies [Patton et al. 2017] have found evidence of social media communications that can be linked to such violence in communities with high rates of exposure to gang activity. In this paper we partnered computer scientists with social work researchers, who have domain expertise in gang violence, to analyze how public tweets with images posted by youth who mention gang associations on Twitter can be leveraged to automatically detect psychosocial factors and conditions that could potentially assist social workers and violence outreach workers in prevention and early intervention programs. To this end, we developed a rigorous methodology for collecting and annotating tweets. We gathered 1,851 tweets and accompanying annotations related to visual concepts and the psychosocial codes: aggression, loss, and substance use. These codes are relevant to social work interventions, as they represent possible pathways to violence on social media. We compare various methods for classifying tweets into these three classes, using only the text of the tweet, only the image of the tweet, or both modalities as input to the classifier. In particular, we analyze the usefulness of mid-level visual concepts and the role of different modalities for this tweet classification task. Our experiments show that individually, text information dominates classification performance of the loss class, while image information dominates the aggression and substance use classes. Our multimodal approach provides a very promising improvement (18% relative in mean average precision) over the best single modality approach. Finally, we also illustrate the complexity of understanding social media data and elaborate on open challenges.
- Published
- 2018
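As a concrete, deliberately simplified illustration of the multimodal setup above, the sketch below late-fuses pre-computed text and image embeddings into a multi-label prediction over the three psychosocial codes; the dimensions and architecture are assumptions, not the paper's classifiers:

```python
import torch
import torch.nn as nn

class LateFusionTweetClassifier(nn.Module):
    """Concatenate a text embedding and an image embedding, then predict the
    three codes (aggression, loss, substance use) as independent labels."""
    def __init__(self, text_dim=768, image_dim=2048, n_codes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_codes),
        )

    def forward(self, text_emb, image_emb):
        fused = torch.cat([text_emb, image_emb], dim=-1)
        return torch.sigmoid(self.head(fused))   # per-code probabilities
```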
88. AutoLoc: Weakly-supervised Temporal Action Localization
- Author
-
Shou, Zheng, Gao, Hang, Zhang, Lei, Miyazawa, Kazuyuki, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Temporal Action Localization (TAL) in untrimmed video is important for many applications. But it is very expensive to annotate the segment-level ground truth (action class and temporal boundary). This raises interest in addressing TAL with weak supervision, where only video-level annotations are available during training. However, the state-of-the-art weakly-supervised TAL methods only focus on generating a good Class Activation Sequence (CAS) over time but conduct simple thresholding on CAS to localize actions. In this paper, we first develop a novel weakly-supervised TAL framework called AutoLoc to directly predict the temporal boundary of each action instance. We propose a novel Outer-Inner-Contrastive (OIC) loss to automatically discover the needed segment-level supervision for training such a boundary predictor. Our method achieves dramatically improved performance: under the IoU threshold 0.5, our method improves mAP on THUMOS'14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. It is also very encouraging to see that our weakly-supervised method achieves comparable results with some fully-supervised methods., Comment: Accepted by ECCV'18
- Published
- 2018
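One plausible reading of the Outer-Inner-Contrastive (OIC) idea is sketched below: a candidate boundary scores well when the class activations inside it are high and those in a small surrounding ('outer') area are low. The inflation ratio and interface are assumptions for illustration:

```python
import numpy as np

def oic_loss(cas, start, end, inflation=0.25):
    """OIC-style loss for one candidate segment on a 1-D class activation
    sequence (CAS): outer average minus inner average (lower is better).

    start, end: candidate temporal boundary as frame indices (end exclusive)
    """
    pad = max(1, int(inflation * (end - start)))
    lo, hi = max(0, start - pad), min(len(cas), end + pad)
    inner = cas[start:end].mean()
    outer_vals = np.concatenate([cas[lo:start], cas[end:hi]])
    outer = outer_vals.mean() if outer_vals.size else 0.0
    return outer - inner
```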
89. Entity-aware Image Caption Generation
- Author
-
Lu, Di, Whitehead, Spencer, Huang, Lifu, Ji, Heng, and Chang, Shih-Fu
- Subjects
Computer Science - Computation and Language - Abstract
Current image captioning approaches generate descriptions which lack specific information, such as named entities that are involved in the images. In this paper we propose a new task which aims to generate informative image captions, given images and hashtags as input. We propose a simple but effective approach to tackle this problem. We first train a convolutional neural network - long short-term memory network (CNN-LSTM) model to generate a template caption based on the input image. Then we use a knowledge graph based collective inference algorithm to fill in the template with specific named entities retrieved via the hashtags. Experiments on a new benchmark dataset collected from Flickr show that our model generates news-style image descriptions with much richer information. Our model significantly outperforms unimodal baselines on various evaluation metrics., Comment: In proceedings of EMNLP 2018
- Published
- 2018
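A toy illustration of the template-then-fill pipeline described above (the real system generates the template with a CNN-LSTM and fills it with a knowledge-graph-based collective inference algorithm; the placeholder convention here is hypothetical):

```python
def fill_template(template_caption, entities_by_type):
    """Replace placeholder tokens such as <person> or <location> with named
    entities retrieved via the hashtags (toy version, first-come-first-served)."""
    tokens = []
    for tok in template_caption.split():
        key = tok.strip("<>").lower()
        if tok.startswith("<") and entities_by_type.get(key):
            tokens.append(entities_by_type[key].pop(0))
        else:
            tokens.append(tok)
    return " ".join(tokens)

# Example: fill_template("<person> arrives in <location> for the summit",
#                        {"person": ["Angela Merkel"], "location": ["Paris"]})
```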
90. Online Detection of Action Start in Untrimmed, Streaming Videos
- Author
-
Shou, Zheng, Pan, Junting, Chan, Jonathan, Miyazawa, Kazuyuki, Mansour, Hassan, Vetro, Anthony, Giro-i-Nieto, Xavier, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We aim to tackle a novel task in action detection - Online Detection of Action Start (ODAS) in untrimmed, streaming videos. The goal of ODAS is to detect the start of an action instance, with high categorization accuracy and low detection latency. ODAS is important in many applications such as early alert generation to allow timely security or emergency response. We propose three novel methods to specifically address the challenges in training ODAS models: (1) hard negative samples generation based on Generative Adversarial Network (GAN) to distinguish ambiguous background, (2) explicitly modeling the temporal consistency between data around action start and data succeeding action start, and (3) adaptive sampling strategy to handle the scarcity of training data. We conduct extensive experiments using THUMOS'14 and ActivityNet. We show that our proposed methods lead to significant performance gains and improve the state-of-the-art methods. An ablation study confirms the effectiveness of each proposed method., Comment: Accepted by ECCV'18
- Published
- 2018
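For intuition, a naive online baseline for ODAS could simply watch per-frame class probabilities and fire when the prediction switches from background to an action with enough confidence; the paper's contribution is making such decisions far more reliable, so treat the sketch below as an illustrative strawman only:

```python
import numpy as np

def detect_action_starts(frame_probs, threshold=0.7, background=0):
    """frame_probs: iterable of per-frame class probability vectors (streamed).
    Returns a list of (frame_index, action_class) start points."""
    starts, prev = [], background
    for t, probs in enumerate(frame_probs):
        cls = int(np.argmax(probs))
        if cls != background and prev == background and probs[cls] >= threshold:
            starts.append((t, cls))
        prev = cls
    return starts
```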
91. Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks
- Author
-
Chen, Long, Zhang, Hanwang, Xiao, Jun, Liu, Wei, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We propose a novel framework called Semantics-Preserving Adversarial Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test images and their classes are both unseen during training. SP-AEN aims to tackle the inherent problem --- semantic loss --- in the prevailing family of embedding-based ZSL, where some semantics would be discarded during training if they are non-discriminative for training classes, but could become critical for recognizing test classes. Specifically, SP-AEN prevents the semantic loss by introducing an independent visual-to-semantic space embedder which disentangles the semantic space into two subspaces for the two arguably conflicting objectives: classification and reconstruction. Through adversarial learning of the two subspaces, SP-AEN can transfer the semantics from the reconstructive subspace to the discriminative one, accomplishing the improved zero-shot recognition of unseen classes. Compared with prior works, SP-AEN can not only improve classification but also generate photo-realistic images, demonstrating the effectiveness of semantic preservation. On four popular benchmarks: CUB, AWA, SUN and aPY, SP-AEN considerably outperforms other state-of-the-art methods by an absolute performance difference of 12.2%, 9.3%, 4.0%, and 3.6% in terms of harmonic mean values.
- Published
- 2017
92. Grounding Referring Expressions in Images by Variational Context
- Author
-
Zhang, Hanwang, Niu, Yulei, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task since it requires not only the localization of objects, but also the multimodal comprehension of context --- visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity involved in modeling the context associated with multiple image regions, existing work oversimplifies this task to pairwise region modeling by multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Our model exploits the reciprocal relation between the referent and context, i.e., either of them influences the estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced, resulting in better localization of the referent. We develop a novel cue-specific language-vision embedding network that learns this reciprocity model end-to-end. We also extend the model to the unsupervised setting where no annotation for the referent is available. Extensive experiments on various benchmarks show consistent improvement over state-of-the-art methods in both supervised and unsupervised settings., Comment: in 2018 Conference on Computer Vision and Pattern Recognition (CVPR'18)
- Published
- 2017
93. Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation
- Author
-
Niu, Yulei, Lu, Zhiwu, Wen, Ji-Rong, Xiang, Tao, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from objects and scenes to abstract concepts; 2) how to annotate an image with the optimal number of class labels. To address the first issue, we propose a novel multi-scale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Specifically, a novel two-branch deep neural network architecture is proposed which comprises a very deep main network branch and a companion feature fusion network branch designed for fusing the multi-scale features computed from the main branch. The deep model is also made multi-modal by taking noisy user-provided tags as model input to complement the image input. For tackling the second issue, we introduce a label quantity prediction auxiliary task to the main label prediction task to explicitly estimate the optimal label number for a given image. Extensive experiments are carried out on two large-scale image annotation benchmark datasets and the results show that our method significantly outperforms the state-of-the-art., Comment: Submitted to IEEE TIP
- Published
- 2017
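To show how the label-quantity auxiliary task can be used at inference time, here is a minimal sketch (the interface is assumed; the paper's network and selection details differ): the main head scores all labels and the auxiliary head's predicted count selects how many to keep.

```python
import torch

def annotate_image(label_scores, predicted_count):
    """label_scores:    (n_labels,) relevance scores from the main prediction task
    predicted_count: scalar tensor from the label-quantity auxiliary task"""
    k = max(1, min(int(predicted_count.round().item()), label_scores.numel()))
    return torch.topk(label_scores, k).indices.tolist()

# Example: annotate_image(torch.tensor([0.9, 0.2, 0.7, 0.1]), torch.tensor(2.3))
# returns [0, 2]
```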
94. Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks
- Author
-
Campos, Victor, Jou, Brendan, Giro-i-Nieto, Xavier, Torres, Jordi, and Chang, Shih-Fu
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often faces challenges such as slow inference, vanishing gradients and difficulty in capturing long-term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates, thus shortening the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline RNN models. Source code is publicly available at https://imatge-upc.github.io/skiprnn-2017-telecombcn/ ., Comment: Accepted as conference paper at ICLR 2018
- Published
- 2017
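A much-simplified sketch of the skip-update mechanism (the actual model accumulates an update probability across skipped steps and trains the binary gate with a straight-through estimator; the version below only illustrates the update-or-copy behaviour):

```python
import torch
import torch.nn as nn

class SkipGRUSketch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.cell = nn.GRUCell(input_dim, hidden_dim)
        self.update_gate = nn.Linear(hidden_dim, 1)

    def forward(self, x):                        # x: (seq_len, batch, input_dim)
        h = x.new_zeros(x.size(1), self.cell.hidden_size)
        outputs = []
        for x_t in x:
            # Hard binary gate; not differentiable as written (the paper uses
            # a straight-through estimator during training).
            u_t = (torch.sigmoid(self.update_gate(h)) >= 0.5).float()
            h = u_t * self.cell(x_t, h) + (1.0 - u_t) * h   # update or copy
            outputs.append(h)
        return torch.stack(outputs)
```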
95. Learning Spread-out Local Feature Descriptors
- Author
-
Zhang, Xu, Yu, Felix X., Kumar, Sanjiv, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We propose a simple, yet powerful regularization technique that can be used to significantly improve both the pairwise and triplet losses in learning local feature descriptors. The idea is that in order to fully utilize the expressive power of the descriptor space, good local feature descriptors should be sufficiently "spread-out" over the space. In this work, we propose a regularization term, inspired by the properties of the uniform distribution, that maximizes the spread of the feature descriptors. We show that the proposed regularization with triplet loss outperforms existing Euclidean distance based descriptor learning techniques by a large margin. As an extension, the proposed regularization technique can also be used to improve image-level deep feature embedding., Comment: ICCV 2017. 9 pages, 7 figures
- Published
- 2017
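One plausible form of the spread-out regularizer, assuming L2-normalized descriptors of non-matching pairs: push the mean inner product toward 0 and its second moment toward 1/d, the values one would expect if descriptors were uniformly spread on the unit sphere (treat the exact formula as an assumption of this sketch):

```python
import torch

def spread_out_regularizer(desc_a, desc_b):
    """desc_a, desc_b: (n, d) L2-normalized descriptors of non-matching pairs."""
    d = desc_a.size(1)
    dots = (desc_a * desc_b).sum(dim=1)           # inner products of non-matches
    m1 = dots.mean()
    m2 = (dots ** 2).mean()
    return m1 ** 2 + torch.clamp(m2 - 1.0 / d, min=0.0)
```

In practice such a penalty would be added to the pairwise or triplet loss with a weighting factor.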
96. More cat than cute? Interpretable Prediction of Adjective-Noun Pairs
- Author
-
Fernandez, Delia, Woodward, Alejandro, Campos, Victor, Giro-i-Nieto, Xavier, Jou, Brendan, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Multimedia - Abstract
The increasing availability of affect-rich multimedia resources has bolstered interest in understanding sentiment and emotions in and from visual content. Adjective-noun pairs (ANP) are a popular mid-level semantic construct for capturing affect via visually detectable concepts such as "cute dog" or "beautiful landscape". Current state-of-the-art methods approach ANP prediction by considering each of these compound concepts as individual tokens, ignoring the underlying relationships in ANPs. This work aims at disentangling the contributions of the 'adjectives' and 'nouns' in the visual prediction of ANPs. Two specialised classifiers, one trained for detecting adjectives and another for nouns, are fused to predict 553 different ANPs. The resulting ANP prediction model is more interpretable as it allows us to study contributions of the adjective and noun components. Source code and models are available at https://imatge-upc.github.io/affective-2017-musa2/ ., Comment: Oral paper at ACM Multimedia 2017 Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes (MUSA2)
- Published
- 2017
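To illustrate how two specialised classifiers can be fused into ANP predictions, here is one simple scoring scheme (product of the adjective and noun probabilities); the paper studies several fusion strategies, so treat this as a sketch with an assumed interface:

```python
import torch

def anp_scores(adj_probs, noun_probs, anp_index):
    """adj_probs:  (batch, n_adj) adjective classifier probabilities
    noun_probs: (batch, n_noun) noun classifier probabilities
    anp_index:  list of (adj_id, noun_id) tuples, one per ANP class"""
    adj_ids = torch.tensor([a for a, _ in anp_index])
    noun_ids = torch.tensor([n for _, n in anp_index])
    return adj_probs[:, adj_ids] * noun_probs[:, noun_ids]   # (batch, n_anp)
```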
97. ConvNet Architecture Search for Spatiotemporal Feature Learning
- Author
-
Tran, Du, Ray, Jamie, Shou, Zheng, Chang, Shih-Fu, and Paluri, Manohar
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks including object detection, semantic segmentation, and image captioning. Although any image representation can be applied to video frames, a dedicated spatiotemporal representation is still vital in order to incorporate motion patterns that cannot be captured by appearance based models alone. This paper presents an empirical ConvNet architecture search for spatiotemporal feature learning, culminating in a deep 3-dimensional (3D) Residual ConvNet. Our proposed architecture outperforms C3D by a good margin on Sports-1M, UCF101, HMDB51, THUMOS14, and ASLAN while being 2 times faster at inference time, 2 times smaller in model size, and having a more compact representation.
- Published
- 2017
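For readers less familiar with spatiotemporal ConvNets, a basic 3D residual block of the kind such an architecture search composes looks roughly like the sketch below (illustrative only; not the exact searched configuration):

```python
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """3D residual block: two 3x3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (N, C, T, H, W)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                # residual connection
```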
98. PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN
- Author
-
Zhang, Hanwang, Kyaw, Zawlin, Yu, Jinyang, and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD) to detect "subject-predicate-object" relations in an image with object relation groundtruths available only at the image level. This is motivated by the fact that it is extremely expensive to label the combinatorial relations between objects at the instance level. Compared to the extensively studied problem, Weakly Supervised Object Detection (WSOD), WSVRD is more challenging as it needs to examine a large set of region pairs, which is computationally prohibitive and more likely to get stuck in a locally optimal solution, such as one involving the wrong spatial context. To this end, we present a Parallel, Pairwise Region-based, Fully Convolutional Network (PPR-FCN) for WSVRD. It uses a parallel FCN architecture that simultaneously performs pair selection and classification of single regions and region pairs for object and relation detection, while sharing almost all computation over the entire image. In particular, we propose a novel position-role-sensitive score map with pairwise RoI pooling to efficiently capture the crucial context associated with a pair of objects. We demonstrate the superiority of PPR-FCN over all baselines in solving the WSVRD challenge by using results of extensive experiments over two visual relation benchmarks., Comment: To appear in International Conference on Computer Vision (ICCV) 2017, Venice, Italy
- Published
- 2017
99. Localizing Actions from Video Labels and Pseudo-Annotations
- Author
-
Mettes, Pascal, Snoek, Cees G. M., and Chang, Shih-Fu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The goal of this paper is to determine the spatio-temporal location of actions in video. Where training from hard-to-obtain box annotations is the norm, we propose an intuitive and effective algorithm that localizes actions from their class label only. We are inspired by recent work showing that unsupervised action proposals selected with human point-supervision perform as well as using expensive box annotations. Rather than asking users to provide point supervision, we propose fully automatic visual cues that replace manual point annotations. We call the cues pseudo-annotations, introduce five of them, and propose a correlation metric for automatically selecting and combining them. Thorough evaluation on challenging action localization datasets shows that we reach results comparable to those obtained with full box supervision. We also show that pseudo-annotations can be leveraged during testing to improve weakly- and strongly-supervised localizers., Comment: BMVC
- Published
- 2017
100. Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification
- Author
-
Jiang, Yu-Gang, Wu, Zuxuan, Tang, Jinhui, Li, Zechao, Xue, Xiangyang, and Chang, Shih-Fu
- Subjects
Computer Science - Multimedia ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Videos are inherently multimodal. This paper studies the problem of how to fully exploit the abundant multimodal clues for improved video categorization. We introduce a hybrid deep learning framework that integrates useful clues from multiple modalities, including static spatial appearance information, motion patterns within a short time window, audio information as well as long-range temporal dynamics. More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion and audio signals to extract their corresponding features. We then employ a feature fusion network to derive a unified representation with an aim to capture the relationships among features. Furthermore, to exploit the long-range temporal dynamics in videos, we apply two Long Short Term Memory networks with extracted appearance and motion features as inputs. Finally, we also propose to refine the prediction scores by leveraging contextual relationships among video semantics. The hybrid deep learning framework is able to exploit a comprehensive set of multimodal features for video classification. Through an extensive set of experiments, we demonstrate that (1) LSTM networks which model sequences in an explicitly recurrent manner are highly complementary with CNN models; (2) the feature fusion network which produces a fused representation through modeling feature relationships outperforms alternative fusion strategies; (3) the semantic context of video classes can help further refine the predictions for improved performance. Experimental results on two challenging benchmarks, the UCF-101 and the Columbia Consumer Videos (CCV), provide strong quantitative evidence that our framework achieves promising results: 93.1% on UCF-101 and 84.5% on CCV, outperforming competing methods with clear margins.
- Published
- 2017