234 results for "Wong, Yongkang"
Search Results
2. STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting
- Author
- Chai, Zenghao, Tang, Chen, Wong, Yongkang, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Multimedia
- Abstract
The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Distillation Sampling (SDS) exhibit domain gap and cannot preserve view-consistency using only T2I priors, and (2) For post hoc animation, simply applying the source motions to target 3D avatars yields translation artifacts and misalignment. To address these issues, we propose Skeleton-aware Text-based 4D Avatar generation with in-network motion Retargeting (STAR). STAR considers the geometry and skeleton differences between the template mesh and target avatar, and corrects the mismatched source motion by resorting to pretrained motion retargeting techniques. With the informatively retargeted and occlusion-aware skeleton, we embrace the skeleton-conditioned T2I and text-to-video (T2V) priors, and propose a hybrid SDS module to coherently provide multi-view and frame-consistent supervision signals. Hence, STAR can progressively optimize the geometry, texture, and motion in an end-to-end manner. The quantitative and qualitative experiments demonstrate that our proposed STAR can synthesize high-quality 4D avatars with vivid animations that align well with the text description. Additional ablation studies show the contributions of each component in STAR. The source code and demos are available at: \href{https://star-avatar.github.io}{https://star-avatar.github.io}., Comment: Tech report. (A schematic sketch of the score-distillation objective referenced here follows this entry.)
- Published
- 2024
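The abstract above builds on Score Distillation Sampling (SDS). As context, the following is a minimal, framework-agnostic sketch of the standard SDS surrogate loss used in text-to-3D work, not STAR's skeleton-conditioned hybrid variant; `diffusion_eps_model` and the timestep weighting are placeholders.

```python
import torch

def sds_surrogate_loss(rendered_rgb, text_emb, diffusion_eps_model, alphas_cumprod, t):
    """Standard Score Distillation Sampling surrogate loss.

    rendered_rgb:  (B, 3, H, W) image rendered from the 3D/4D representation,
                   requires_grad so the gradient flows back to its parameters.
    text_emb:      conditioning embedding expected by the noise predictor.
    diffusion_eps_model(x_t, t, text_emb) -> predicted noise (placeholder).
    alphas_cumprod: (T,) cumulative alpha-bar schedule of the diffusion model.
    t:             (B,) sampled timesteps.
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(rendered_rgb)
    # Forward diffusion: noise the rendering to timestep t.
    x_t = a_bar.sqrt() * rendered_rgb + (1.0 - a_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = diffusion_eps_model(x_t, t, text_emb)
    w = 1.0 - a_bar                      # one common choice of timestep weighting
    # SDS treats w * (eps_pred - noise) as the gradient w.r.t. the rendering.
    grad = w * (eps_pred - noise)
    # Surrogate loss whose gradient w.r.t. rendered_rgb equals `grad`.
    return (grad.detach() * rendered_rgb).sum()

# Usage sketch with a dummy noise predictor.
dummy_eps = lambda x_t, t, emb: torch.zeros_like(x_t)
img = torch.rand(1, 3, 64, 64, requires_grad=True)
loss = sds_surrogate_loss(img, None, dummy_eps,
                          torch.linspace(0.999, 0.01, 1000), torch.tensor([500]))
loss.backward()
```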
3. TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment
- Author
- Li, Wei, Fan, Hehe, Wong, Yongkang, Kankanhalli, Mohan, and Yang, Yi
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
- Abstract
Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises from the inherent complexity of videos and the inefficient language supervision in recent web-collected video-text datasets. In this paper, we introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding, without the need for pre-training on real video data. Specifically, we first employ an advanced LLM to automatically generate Textual Videos comprising continuous textual frames, along with corresponding annotations to simulate real video-text data. Then, these annotated textual videos are used to pre-align a language-only LLM with the video modality. To bridge the gap between textual and real videos, we employ the CLIP model as the feature extractor to align image and text modalities. During text-only pre-alignment, the continuous textual frames, encoded as a sequence of CLIP text features, are analogous to continuous CLIP image features, thus aligning the LLM with real video representation. Extensive experiments, including zero-shot evaluation and finetuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video content with LLMs. In particular, without training on any video data, the TOPA-Llama2-13B model achieves a Top-1 accuracy of 51.0% on the challenging long-form video understanding benchmark, Egoschema. This performance surpasses previous video-text pre-training approaches and proves competitive with recent GPT-3.5-based video agents., Comment: NeurIPS 2024 (Spotlight). (A schematic sketch of the text-only pre-alignment idea follows this entry.)
- Published
- 2024
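As a reading aid for the abstract above, here is a minimal sketch of the pre-alignment idea: per-frame CLIP text features of a "textual video" are projected into an LLM's token-embedding space, and the same projector can later consume CLIP image features thanks to CLIP's shared embedding space. The module and dimension names (`clip_dim`, `llm_embed_dim`) are illustrative assumptions, not TOPA's released code.

```python
import torch
import torch.nn as nn

class TextualVideoPreAligner(nn.Module):
    """Project per-frame CLIP features into an LLM's token-embedding space.

    During pre-alignment the input is CLIP *text* features of textual frames;
    at inference the same projector is fed CLIP *image* features of real
    video frames, exploiting the shared CLIP embedding space.
    """
    def __init__(self, clip_dim=512, llm_embed_dim=4096):
        super().__init__()
        self.proj = nn.Linear(clip_dim, llm_embed_dim)

    def forward(self, frame_features):
        # frame_features: (T, clip_dim) CLIP text (or image) features.
        # Output: (T, llm_embed_dim) soft "visual" tokens to prepend to the
        # LLM's token embeddings.
        return self.proj(frame_features)

# Usage sketch with random stand-ins for CLIP features of 8 textual frames.
aligner = TextualVideoPreAligner()
fake_clip_feats = torch.randn(8, 512)
visual_prefix = aligner(fake_clip_feats)
print(visual_prefix.shape)  # torch.Size([8, 4096])
```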
4. Bridging the Intent Gap: Knowledge-Enhanced Visual Generation
- Author
- Cheng, Yi, Xu, Ziwei, Lin, Dongyun, Cheng, Harry, Wong, Yongkang, Sun, Ying, Lim, Joo Hwee, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not fully captured by input prompts. The absence of such details makes it challenging for generative models to accurately reflect the intended meaning, leading to a mismatch between the desired and generated output. Second, generative models trained on visual-label pairs lack the comprehensive knowledge to accurately represent all aspects of the input data in their generated outputs. To address these challenges, we propose a knowledge-enhanced iterative refinement framework for visual content generation. We begin by analyzing and identifying the key challenges faced by existing generative models. Then, we introduce various knowledge sources, including human insights, pre-trained models, logic rules, and world knowledge, which can be leveraged to address these challenges. Furthermore, we propose a novel visual generation framework that incorporates a knowledge-based feedback module to iteratively refine the generation process. This module gradually improves the alignment between the generated content and user intentions. We demonstrate the efficacy of the proposed framework through preliminary results, highlighting the potential of knowledge-enhanced generative models for intention-aligned content generation.
- Published
- 2024
5. Finetuning Text-to-Image Diffusion Models for Fairness
- Author
- Shen, Xudong, Du, Chao, Pang, Tianyu, Lin, Min, Wong, Yongkang, and Kankanhalli, Mohan
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society
- Abstract
The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases could propagate a skewed worldview and restrict opportunities for minority groups. In this work, we frame fairness as a distributional alignment problem. Our solution consists of two main technical contributions: (1) a distributional alignment loss that steers specific characteristics of the generated images towards a user-defined target distribution, and (2) adjusted direct finetuning of the diffusion model's sampling process (adjusted DFT), which leverages an adjusted gradient to directly optimize losses defined on the generated images. Empirically, our method markedly reduces gender, racial, and their intersectional biases for occupational prompts. Gender bias is significantly reduced even when finetuning just five soft tokens. Crucially, our method supports diverse perspectives of fairness beyond absolute equality, which is demonstrated by controlling age to a $75\%$ young and $25\%$ old distribution while simultaneously debiasing gender and race. Finally, our method is scalable: it can debias multiple concepts at once by simply including these prompts in the finetuning data. We share code and various fair diffusion model adaptors at https://sail-sg.github.io/finetune-fair-diffusion/., Comment: ICLR 2024 oral presentation. (A schematic sketch of a distributional alignment loss follows this entry.)
- Published
- 2023
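A hedged sketch of the distributional alignment idea from the abstract above: a frozen attribute classifier scores a batch of generated images, and the batch's soft attribute histogram is pulled toward a user-defined target. The classifier, the KL form, and all names are illustrative; the paper's adjusted direct finetuning (adjusted DFT) is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distributional_alignment_loss(images, attribute_classifier, target_dist):
    """Steer the empirical attribute distribution of a generated batch
    toward a user-specified target distribution.

    images:               (B, 3, H, W) generated images (differentiable).
    attribute_classifier: callable -> (B, K) logits over K attribute classes
                          (placeholder, e.g. a frozen gender/age classifier).
    target_dist:          (K,) desired class proportions, e.g. [0.75, 0.25].
    """
    probs = F.softmax(attribute_classifier(images), dim=-1)   # (B, K)
    empirical = probs.mean(dim=0)                             # soft batch histogram
    # KL(target || empirical) as one simple alignment penalty.
    return F.kl_div(empirical.clamp_min(1e-8).log(), target_dist, reduction="sum")

# Usage sketch with a toy classifier over K=2 attribute classes.
imgs = torch.rand(8, 3, 64, 64, requires_grad=True)
clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 2))
target = torch.tensor([0.75, 0.25])
print(float(distributional_alignment_loss(imgs, clf, target)))
```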
6. ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
- Author
- Guo, Yangyang, Zhang, Haoyu, Wong, Yongkang, Nie, Liqiang, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into \emph{efficient language-image pre-training}, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs. Our method is designed with several strengths, such as being computation-efficient, memory-efficient, and trainable-parameter-free, and is distinguished from previous vision-only token pruning approaches by its alignment with task objectives. We implement this method in a progressive pruning manner using several sequential blocks. To evaluate its generalization performance, we apply ELIP to three commonly used language-image pre-training models and utilize public image-caption pairs with 4M images for pre-training. Our experiments demonstrate that with the removal of ~30$\%$ vision tokens across 12 ViT layers, ELIP maintains performance comparable to the baselines ($\sim$0.32 accuracy drop on average) over various downstream tasks including cross-modal retrieval, VQA, image captioning, \emph{etc}. In addition, the GPU resources spared by ELIP allow us to scale up with larger batch sizes, thereby accelerating model pre-training and even sometimes enhancing downstream model performance. (A schematic sketch of score-based token pruning follows this entry.)
- Published
- 2023
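To make the token-pruning idea in the abstract above concrete, the sketch below keeps the top-scoring vision tokens per sample given some language-derived importance score. The scoring signal and keep ratio are assumptions; ELIP's progressive placement across blocks and its merging step are omitted.

```python
import torch

def prune_vision_tokens(vision_tokens, scores, keep_ratio=0.7):
    """Keep the top-`keep_ratio` vision tokens per sample by importance score.

    vision_tokens: (B, N, D) ViT patch tokens.
    scores:        (B, N) language-derived importance per token (placeholder;
                   e.g. cross-attention weight from the language supervision).
    """
    B, N, D = vision_tokens.shape
    k = max(1, int(N * keep_ratio))
    topk = scores.topk(k, dim=1).indices                # (B, k) indices to keep
    idx = topk.unsqueeze(-1).expand(-1, -1, D)          # (B, k, D)
    return vision_tokens.gather(1, idx)                 # (B, k, D)

# Example: drop ~30% of 196 patch tokens, as in the abstract.
tokens, scores = torch.randn(2, 196, 768), torch.rand(2, 196)
print(prune_vision_tokens(tokens, scores, keep_ratio=0.7).shape)  # (2, 137, 768)
```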
7. MCM: Multi-condition Motion Synthesis Framework for Multi-scenario
- Author
- Ling, Zeyu, Han, Bo, Wong, Yongkang, Kankanhalli, Mohan, and Geng, Weidong
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
The objective of the multi-condition human motion synthesis task is to incorporate diverse conditional inputs, encompassing various forms such as text, music, speech, and more. This endows the task with the capability to adapt across multiple scenarios, ranging from text-to-motion to music-to-dance, among others. While existing research has primarily focused on single conditions, multi-condition human motion generation remains underexplored. In this paper, we address these challenges by introducing MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions. The MCM framework is able to integrate with any DDPM-like diffusion model to accommodate multi-conditional information input while preserving its generative capabilities. Specifically, MCM employs a two-branch architecture consisting of a main branch and a control branch. The control branch shares the same structure as the main branch and is initialized with the parameters of the main branch, effectively maintaining the generation ability of the main branch and supporting multi-condition input. We also introduce MWNet, a Transformer-based diffusion model (DDPM-like), as our main branch that can capture the spatial complexity and inter-joint correlations in motion sequences through a channel-dimension self-attention module. Quantitative comparisons demonstrate that our approach achieves state-of-the-art results in text-to-motion and competitive results in music-to-dance, comparable to task-specific methods. Furthermore, the qualitative evaluation shows that MCM not only streamlines the adaptation of methodologies originally designed for text-to-motion tasks to domains such as music-to-dance and speech-to-gesture, eliminating the need for extensive network re-configuration, but also enables effective multi-condition modal control, realizing "once trained is motion need".
- Published
- 2023
8. A Study on Differentiable Logic and LLMs for EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2023
- Author
- Cheng, Yi, Xu, Ziwei, Fang, Fen, Lin, Dongyun, Fan, Hehe, Wong, Yongkang, Sun, Ying, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
In this technical report, we present our findings from a study conducted on the EPIC-KITCHENS-100 Unsupervised Domain Adaptation task for Action Recognition. Our research focuses on the application of a differentiable logic loss during training to leverage the co-occurrence relations between verbs and nouns, as well as pre-trained Large Language Models (LLMs) to generate logic rules for adaptation to unseen action labels. Specifically, the model's predictions are treated as the truth assignment of a co-occurrence logic formula to compute the logic loss, which measures the consistency between the predictions and the logic constraints. By using the verb-noun co-occurrence matrix generated from the dataset, we observe a moderate improvement in model performance compared to our baseline framework. To further enhance the model's adaptability to novel action labels, we experiment with rules generated using GPT-3.5, which leads to a slight decrease in performance. These findings shed light on the potential and challenges of incorporating differentiable logic and LLMs for knowledge extraction in unsupervised domain adaptation for action recognition. Our final submission (entitled `NS-LLM') achieved first place in terms of top-1 action recognition accuracy., Comment: Technical report submitted to CVPR 2023 EPIC-Kitchens challenges. (A schematic sketch of a co-occurrence logic loss follows this entry.)
- Published
- 2023
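One straightforward rendering of the differentiable logic loss described above: treat the verb and noun predictions as (independent) probabilities and penalize joint mass placed on verb-noun pairs that never co-occur in the dataset. This is an illustrative formulation, not necessarily the team's exact loss.

```python
import torch

def cooccurrence_logic_loss(verb_logits, noun_logits, cooc_matrix):
    """Penalize probability mass on verb-noun pairs that never co-occur.

    verb_logits: (B, V), noun_logits: (B, N)
    cooc_matrix: (V, N) binary matrix; 1 where the pair is allowed.
    """
    p_verb = verb_logits.softmax(dim=-1)                 # (B, V)
    p_noun = noun_logits.softmax(dim=-1)                 # (B, N)
    joint = p_verb.unsqueeze(2) * p_noun.unsqueeze(1)    # (B, V, N), independence assumption
    violation = joint * (1.0 - cooc_matrix)              # mass on forbidden pairs
    return violation.sum(dim=(1, 2)).mean()

# Toy example: 3 verbs, 4 nouns, a sparse allowed-pair matrix.
cooc = torch.tensor([[1., 1., 0., 0.],
                     [0., 1., 1., 0.],
                     [0., 0., 0., 1.]])
loss = cooccurrence_logic_loss(torch.randn(2, 3), torch.randn(2, 4), cooc)
print(float(loss))
```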
9. Chairs Can be Stood on: Overcoming Object Bias in Human-Object Interaction Detection
- Author
- Wang, Guangzhi, Guo, Yangyang, Wong, Yongkang, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
- Abstract
Detecting Human-Object Interaction (HOI) in images is an important step towards high-level visual comprehension. Existing work often focuses on improving either human and object detection or interaction recognition. However, due to the limitation of datasets, these methods tend to fit well on frequent interactions conditioned on the detected objects, while largely ignoring the rare ones, which we refer to as the object bias problem in this paper. In this work, we, for the first time, uncover the problem from two aspects: unbalanced interaction distribution and biased model learning. To overcome the object bias problem, we propose a novel plug-and-play Object-wise Debiasing Memory (ODM) method for re-balancing the distribution of interactions under detected objects. Equipped with carefully designed read and write strategies, the proposed ODM allows rare interaction instances to be more frequently sampled for training, thereby alleviating the object bias induced by the unbalanced interaction distribution. We apply this method to three advanced baselines and conduct experiments on the HICO-DET and HOI-COCO datasets. To quantitatively study the object bias problem, we advocate a new protocol for evaluating model performance. As demonstrated in the experimental results, our method brings consistent and significant improvements over baselines, especially on rare interactions under each object. In addition, when evaluated under the conventional standard setting, our method achieves a new state-of-the-art on the two benchmarks.
- Published
- 2022
10. Distance Matters in Human-Object Interaction Detection
- Author
- Wang, Guangzhi, Guo, Yangyang, Wong, Yongkang, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
- Abstract
Human-Object Interaction (HOI) detection has received considerable attention in the context of scene understanding. Despite the growing progress on benchmarks, we observe that existing methods often perform unsatisfactorily on distant interactions, where the leading causes are two-fold: 1) Distant interactions are by nature more difficult to recognize than close ones. A natural scene often involves multiple humans and objects with intricate spatial relations, making interaction recognition for distant human-object pairs largely affected by complex visual context. 2) The insufficient number of distant interactions in benchmark datasets results in under-fitting on these instances. To address these problems, in this paper, we propose a novel two-stage method for better handling distant interactions in HOI detection. One essential component in our method is a novel Far Near Distance Attention module. It enables information propagation between humans and objects, whereby the spatial distance is skillfully taken into consideration. Besides, we devise a novel Distance-Aware loss function which leads the model to focus more on distant yet rare interactions. We conduct extensive experiments on two challenging datasets - HICO-DET and V-COCO. The results demonstrate that the proposed method can surpass existing approaches by a large margin, resulting in new state-of-the-art performance. (An illustrative sketch of distance-aware loss weighting follows this entry.)
- Published
- 2022
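Purely as an illustration of the Distance-Aware loss mentioned above, the sketch below up-weights the per-pair interaction loss by a monotone function of the normalized human-object distance; the functional form and `gamma` are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def distance_weighted_bce(logits, targets, ho_distance, gamma=2.0):
    """Weight a per-pair interaction BCE loss by normalized human-object distance.

    logits, targets: (P, C) interaction scores / labels for P human-object pairs.
    ho_distance:     (P,) center distance between each human and object box,
                     normalized to [0, 1] by the image diagonal.
    gamma:           controls how strongly distant pairs are up-weighted
                     (illustrative hyper-parameter).
    """
    per_pair = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none").mean(dim=1)   # (P,)
    weight = 1.0 + gamma * ho_distance                   # farther pair => larger weight
    return (weight * per_pair).mean()

# Usage sketch with 4 pairs and 6 interaction classes.
logits, targets = torch.randn(4, 6), torch.randint(0, 2, (4, 6)).float()
print(float(distance_weighted_bce(logits, targets, torch.tensor([0.1, 0.4, 0.7, 0.9]))))
```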
11. A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA
- Author
- Guo, Yangyang, Nie, Liqiang, Wong, Yongkang, Liu, Yibing, Cheng, Zhiyong, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Knowledge-based Visual Question Answering (VQA) expects models to rely on external knowledge for robust answer prediction. Despite its significance, this paper identifies several leading factors impeding the advancement of current state-of-the-art methods. On the one hand, methods that exploit explicit knowledge treat it as a complement to a coarsely trained VQA model. Despite their effectiveness, these approaches often suffer from noise incorporation and error propagation. On the other hand, multi-modal implicit knowledge for knowledge-based VQA remains largely unexplored. This work presents a unified end-to-end retriever-reader framework towards knowledge-based VQA. In particular, we shed light on the multi-modal implicit knowledge from vision-language pre-training models to mine its potential in knowledge reasoning. As for the noise problem encountered by the retrieval operation on explicit knowledge, we design a novel scheme to create pseudo labels for effective knowledge supervision. This scheme is able not only to provide guidance for knowledge retrieval, but also to drop instances that are potentially error-prone for question answering. To validate the effectiveness of the proposed method, we conduct extensive experiments on the benchmark dataset. The experimental results reveal that our method outperforms existing baselines by a noticeable margin. Beyond the reported numbers, this paper further offers several insights on knowledge utilization for future research with some empirical findings.
- Published
- 2022
12. Learning to Predict Gradients for Semi-Supervised Continual Learning
- Author
- Luo, Yan, Wong, Yongkang, Kankanhalli, Mohan, and Zhao, Qi
- Subjects
Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
- Abstract
A key challenge for machine intelligence is to learn new visual concepts without forgetting the previously acquired knowledge. Continual learning aims to address this challenge. However, there is a gap between existing supervised continual learning and human-like intelligence, where humans are able to learn from both labeled and unlabeled data. How unlabeled data affects learning and catastrophic forgetting in the continual learning process remains unknown. To explore these issues, we formulate a new semi-supervised continual learning method, which can be generically applied to existing continual learning models. Specifically, a novel gradient learner learns from labeled data to predict gradients on unlabeled data. Hence, the unlabeled data can fit into the supervised continual learning method. Different from conventional semi-supervised settings, we do not hypothesize that the underlying classes, which are associated with the unlabeled data, are known to the learning process. In other words, the unlabeled data could be very distinct from the labeled data. We evaluate the proposed method on mainstream continual learning, adversarial continual learning, and semi-supervised learning tasks. The proposed method achieves state-of-the-art performance on classification accuracy and backward transfer in the continual learning setting while achieving the desired performance on classification accuracy in the semi-supervised learning setting. This implies that the unlabeled images can enhance the generalizability of continual learning models in their predictive ability on unseen data and significantly alleviate catastrophic forgetting. The code is available at \url{https://github.com/luoyan407/grad_prediction.git}., Comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS). (A minimal sketch of a gradient-prediction module follows this entry.)
- Published
- 2022
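A minimal sketch of the gradient-learner idea in the abstract above: a small network is fit on labeled data to regress the gradient of the cross-entropy loss with respect to the logits (softmax minus one-hot), so that it can later supply pseudo-gradients for unlabeled examples. The architecture and regression target are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientLearner(nn.Module):
    """Predict d(loss)/d(logits) from features, so unlabeled data can supply
    pseudo-gradients in a (semi-supervised) continual-learning update."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_classes))

    def forward(self, feats):
        return self.net(feats)

def gradient_learner_step(feats, logits, labels, grad_learner, opt):
    """One supervised step: regress the true logit-gradient (softmax - onehot)."""
    with torch.no_grad():
        true_grad = F.softmax(logits, dim=-1) - F.one_hot(
            labels, logits.size(-1)).float()
    pred_grad = grad_learner(feats)
    loss = F.mse_loss(pred_grad, true_grad)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage sketch
learner = GradientLearner(feat_dim=128, num_classes=10)
opt = torch.optim.SGD(learner.parameters(), lr=0.01)
feats, logits = torch.randn(32, 128), torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
print(gradient_learner_step(feats, logits, labels, learner, opt))
```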
13. Learning to Minimize the Remainder in Supervised Learning
- Author
- Luo, Yan, Wong, Yongkang, Kankanhalli, Mohan S., and Zhao, Qi
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
The learning process of deep learning methods usually updates the model's parameters in multiple iterations. Each iteration can be viewed as the first-order approximation of Taylor's series expansion. The remainder, which consists of higher-order terms, is usually ignored in the learning process for simplicity. This learning scheme empowers various multimedia based applications, such as image retrieval, recommendation system, and video search. Generally, multimedia data (e.g., images) are semantics-rich and high-dimensional, hence the remainders of approximations are possibly non-zero. In this work, we consider the remainder to be informative and study how it affects the learning process. To this end, we propose a new learning approach, namely gradient adjustment learning (GAL), to leverage the knowledge learned from the past training iterations to adjust vanilla gradients, such that the remainders are minimized and the approximations are improved. The proposed GAL is model- and optimizer-agnostic, and is easy to adapt to the standard learning framework. It is evaluated on three tasks, i.e., image classification, object detection, and regression, with state-of-the-art models and optimizers. The experiments show that the proposed GAL consistently enhances the evaluated models, whereas the ablation studies validate various aspects of the proposed GAL. The code is available at \url{https://github.com/luoyan407/gradient_adjustment.git}., Comment: Accepted to IEEE TMM
- Published
- 2022
14. Unsupervised Motion Representation Learning with Capsule Autoencoders
- Author
- Xu, Ziwei, Shen, Xudong, Wong, Yongkang, and Kankanhalli, Mohan S.
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
We propose the Motion Capsule Autoencoder (MCAE), which addresses a key challenge in the unsupervised learning of motion representations: transformation invariance. MCAE models motion in a two-level hierarchy. In the lower level, a spatio-temporal motion signal is divided into short, local, and semantic-agnostic snippets. In the higher level, the snippets are aggregated to form full-length semantic-aware segments. For both levels, we represent motion with a set of learned transformation invariant templates and the corresponding geometric transformations by using capsule autoencoders of a novel design. This leads to a robust and efficient encoding of viewpoint changes. MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets. Notably, it achieves better results than baselines on Trajectory20 with considerably fewer parameters and state-of-the-art performance on the unsupervised skeleton-based action recognition task., Comment: Accepted by NeurIPS 2021
- Published
- 2021
15. Learning to Predict Trustworthiness with Steep Slope Loss
- Author
- Luo, Yan, Wong, Yongkang, Kankanhalli, Mohan S., and Zhao, Qi
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Understanding the trustworthiness of a prediction yielded by a classifier is critical for the safe and effective use of AI models. Prior efforts have been proven to be reliable on small-scale datasets. In this work, we study the problem of predicting trustworthiness on real-world large-scale datasets, where the task is more challenging due to high-dimensional features, diverse visual concepts, and large-scale samples. In such a setting, we observe that the trustworthiness predictors trained with prior-art loss functions, i.e., the cross entropy loss, focal loss, and true class probability confidence loss, are prone to view both correct predictions and incorrect predictions to be trustworthy. The reasons are two-fold. Firstly, correct predictions are generally dominant over incorrect predictions. Secondly, due to the data complexity, it is challenging to differentiate the incorrect predictions from the correct ones on real-world large-scale datasets. To improve the generalizability of trustworthiness predictors, we propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other. The proposed loss is evaluated with two representative deep learning models, i.e., Vision Transformer and ResNet, as trustworthiness predictors. We conduct comprehensive experiments and analyses on ImageNet, which show that the proposed loss effectively improves the generalizability of trustworthiness predictors. The code and pre-trained trustworthiness predictors for reproducibility are available at https://github.com/luoyan407/predict_trustworthiness., Comment: NeurIPS 2021
- Published
- 2021
16. Fair Representation: Guaranteeing Approximate Multiple Group Fairness for Unknown Tasks
- Author
- Shen, Xudong, Wong, Yongkang, and Kankanhalli, Mohan
- Subjects
Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society
- Abstract
Motivated by scenarios where data is used for diverse prediction tasks, we study whether fair representation can be used to guarantee fairness for unknown tasks and for multiple fairness notions simultaneously. We consider seven group fairness notions that cover the concepts of independence, separation, and calibration. Against the backdrop of the fairness impossibility results, we explore approximate fairness. We prove that, although fair representation might not guarantee fairness for all prediction tasks, it does guarantee fairness for an important subset of tasks -- the tasks for which the representation is discriminative. Specifically, all seven group fairness notions are linearly controlled by the fairness and discriminativeness of the representation. When an incompatibility exists between different fairness notions, fair and discriminative representation hits the sweet spot that approximately satisfies all notions. Motivated by our theoretical findings, we propose to learn both fair and discriminative representations using a self-supervised pretext loss and Maximum Mean Discrepancy as a fair regularizer. Experiments on tabular, image, and face datasets show that, using the learned representation, downstream predictions that we are unaware of when learning the representation indeed become fairer for seven group fairness notions, and the fairness guarantees computed from our theoretical results are all valid., Comment: published in TPAMI. (A sketch of an MMD-based fairness regularizer follows this entry.)
- Published
- 2021
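The abstract above names Maximum Mean Discrepancy (MMD) as the fairness regularizer; below is the standard biased RBF-kernel MMD^2 estimate as a sketch of how such a regularizer between two demographic groups' representations can be computed and added to the training objective.

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased MMD^2 estimate with an RBF kernel between two sets of
    representations x: (n, d) and y: (m, d) (e.g., two demographic groups)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Usage sketch: add lambda * rbf_mmd2(z[group == 0], z[group == 1]) to the
# representation-learning objective alongside a self-supervised pretext loss.
z = torch.randn(64, 128)
group = torch.randint(0, 2, (64,))
print(float(rbf_mmd2(z[group == 0], z[group == 1])))
```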
17. Relation-aware Compositional Zero-shot Learning for Attribute-Object Pair Recognition
- Author
- Xu, Ziwei, Wang, Guangzhi, Wong, Yongkang, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
- Abstract
This paper proposes a novel model for recognizing images with composite attribute-object concepts, notably for composite concepts that are unseen during model training. We aim to explore the three key properties required by the task --- relation-aware, consistent, and decoupled --- to learn rich and robust features for primitive concepts that compose attribute-object pairs. To this end, we propose the Blocked Message Passing Network (BMP-Net). The model consists of two modules. The concept module generates semantically meaningful features for primitive concepts, whereas the visual module extracts visual features for attributes and objects from input images. A message passing mechanism is used in the concept module to capture the relations between primitive concepts. Furthermore, to prevent the model from being biased towards seen composite concepts and reduce the entanglement between attributes and objects, we propose a blocking mechanism that equalizes the information available to the model for both seen and unseen concepts. Extensive experiments and ablation studies on two benchmarks show the efficacy of the proposed model., Comment: Accepted by IEEE Transactions on Multimedia
- Published
- 2021
18. Multi2Human: Controllable human image generation with multimodal controls
- Author
- Gu, Xiaoling, Xu, Shengwenzhuo, Wong, Yongkang, Wu, Zizhao, Yu, Jun, Fan, Jianping, and Kankanhalli, Mohan S.
- Published
- 2024
19. $n$-Reference Transfer Learning for Saliency Prediction
- Author
- Luo, Yan, Wong, Yongkang, Kankanhalli, Mohan S., and Zhao, Qi
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
Benefiting from deep learning research and large-scale datasets, saliency prediction has achieved significant success in the past decade. However, it still remains challenging to predict saliency maps on images in new domains that lack sufficient data for data-hungry models. To solve this problem, we propose a few-shot transfer learning paradigm for saliency prediction, which enables efficient transfer of knowledge learned from the existing large-scale saliency datasets to a target domain with limited labeled examples. Specifically, very few target domain examples are used as the reference to train a model with a source domain dataset such that the training process can converge to a local minimum in favor of the target domain. Then, the learned model is further fine-tuned with the reference. The proposed framework is gradient-based and model-agnostic. We conduct comprehensive experiments and ablation study on various source domain and target domain pairs. The results show that the proposed framework achieves a significant performance improvement. The code is publicly available at \url{https://github.com/luoyan407/n-reference}., Comment: ECCV 2020
- Published
- 2020
20. Weakly-Supervised Multi-Person Action Recognition in 360$^{\circ}$ Videos
- Author
- Li, Junnan, Liu, Jianquan, Wong, Yongkang, Nishimura, Shoji, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
The recent development of commodity 360$^{\circ}$ cameras has enabled a single video to capture an entire scene, which endows promising potential in surveillance scenarios. However, research in omnidirectional video analysis has lagged behind the hardware advances. In this work, we address the important problem of action recognition in top-view 360$^{\circ}$ videos. Due to the wide field-of-view, 360$^{\circ}$ videos usually capture multiple people performing actions at the same time. Furthermore, the appearance of people is deformed. The proposed framework first transforms omnidirectional videos into panoramic videos, then it extracts spatial-temporal features using region-based 3D CNNs for action recognition. We propose a weakly-supervised method based on multi-instance multi-label learning, which trains the model to recognize and localize multiple actions in a video using only video-level action labels as supervision. We perform experiments to quantitatively validate the efficacy of the proposed method and qualitatively demonstrate action localization results. To enable research in this direction, we introduce 360Action, the first omnidirectional video dataset for multi-person action recognition.
- Published
- 2020
21. GradMix: Multi-source Transfer across Domains and Tasks
- Author
- Li, Junnan, Xu, Ziwei, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
The computer vision community is witnessing an unprecedented rate of new tasks being proposed and addressed, thanks to the deep convolutional networks' capability to find complex mappings from X to Y. The advent of each task often accompanies the release of a large-scale annotated dataset, for supervised training of deep network. However, it is expensive and time-consuming to manually label sufficient amount of training data. Therefore, it is important to develop algorithms that can leverage off-the-shelf labeled dataset to learn useful knowledge for the target task. While previous works mostly focus on transfer learning from a single source, we study multi-source transfer across domains and tasks (MS-DTT), in a semi-supervised setting. We propose GradMix, a model-agnostic method applicable to any model trained with gradient-based learning rule, to transfer knowledge via gradient descent by weighting and mixing the gradients from all sources during training. GradMix follows a meta-learning objective, which assigns layer-wise weights to the source gradients, such that the combined gradient follows the direction that minimize the loss for a small set of samples from the target dataset. In addition, we propose to adaptively adjust the learning rate for each mini-batch based on its importance to the target task, and a pseudo-labeling method to leverage the unlabeled samples in the target domain. We conduct MS-DTT experiments on two tasks: digit recognition and action recognition, and demonstrate the advantageous performance of the proposed method against multiple baselines.
- Published
- 2020
22. Direction Concentration Learning: Enhancing Congruency in Machine Learning
- Author
- Luo, Yan, Wong, Yongkang, Kankanhalli, Mohan S., and Zhao, Qi
- Subjects
Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
- Abstract
One of the well-known challenges in computer vision tasks is the visual diversity of images, which could result in an agreement or disagreement between the learned knowledge and the visual content exhibited by the current observation. In this work, we first define such an agreement in a concepts learning process as congruency. Formally, given a particular task and sufficiently large dataset, the congruency issue occurs in the learning process whereby the task-specific semantics in the training data are highly varying. We propose a Direction Concentration Learning (DCL) method to improve congruency in the learning process, where enhancing congruency influences the convergence path to be less circuitous. The experimental results show that the proposed DCL method generalizes to state-of-the-art models and optimizers, as well as improves the performances of saliency prediction task, continual learning task, and classification task. Moreover, it helps mitigate the catastrophic forgetting problem in the continual learning task. The code is publicly available at https://github.com/luoyan407/congruency., Comment: This is a preprint and the formal version has been published in TPAMI
- Published
- 2019
23. Explainable Video Action Reasoning via Prior Knowledge and State Transitions
- Author
- Zhuo, Tao, Cheng, Zhiyong, Zhang, Peng, Wong, Yongkang, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Human action analysis and understanding in videos is an important and challenging task. Although substantial progress has been made in past years, the explainability of existing methods is still limited. In this work, we propose a novel action reasoning framework that uses prior knowledge to explain semantic-level observations of video state changes. Our method takes advantage of both classical reasoning and modern deep learning approaches. Specifically, prior knowledge is defined as the information of a target video domain, including its objects, attributes, and relationships, as well as the relevant actions defined by temporal attribute and relationship changes (i.e., state transitions). Given a video sequence, we first generate a scene graph on each frame to represent the objects, attributes, and relationships of interest. Then those scene graphs are linked by tracking objects across frames to form a spatio-temporal graph (also called a video graph), which represents semantic-level video states. Finally, by sequentially examining each state transition in the video graph, our method can detect and explain how those actions are executed with prior knowledge, resembling the logical manner of human reasoning. Compared to previous works, the action reasoning results of our method can be explained by both logical rules and semantic-level observations of video content changes. Besides, the proposed method can be used to detect multiple concurrent actions with detailed information, such as who (particular objects), when (time), where (object locations), and how (what kind of changes). Experiments on a re-annotated dataset, CAD-120, show the effectiveness of our method. (A toy sketch of rule-based state-transition reasoning follows this entry.)
- Published
- 2019
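A toy rendering of the state-transition reasoning described above: each frame is summarized as a set of (subject, relation, object) triples, and an action is reported when a hand-written transition rule matches the change between consecutive frames. The rules and graphs here are illustrative stand-ins for the paper's scene-graph pipeline.

```python
# Toy action reasoning over per-frame relation sets (hand-written prior knowledge).
FRAME_GRAPHS = [
    {("hand", "near", "cup")},
    {("hand", "holding", "cup")},
    {("hand", "holding", "cup"), ("cup", "above", "table")},
]

# Prior knowledge: action <- (relation that must appear, relation that must disappear).
TRANSITION_RULES = {
    "pick_up_cup": {"gained": ("hand", "holding", "cup"),
                    "lost": ("hand", "near", "cup")},
}

def detect_actions(frames, rules):
    events = []
    for t in range(1, len(frames)):
        gained, lost = frames[t] - frames[t - 1], frames[t - 1] - frames[t]
        for action, rule in rules.items():
            if rule["gained"] in gained and rule["lost"] in lost:
                events.append((t, action))   # the "who/when" come from the triples
    return events

print(detect_actions(FRAME_GRAPHS, TRANSITION_RULES))  # [(1, 'pick_up_cup')]
```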
24. $\mathcal{G}$-softmax: Improving Intra-class Compactness and Inter-class Separability of Features
- Author
- Luo, Yan, Wong, Yongkang, Kankanhalli, Mohan, and Zhao, Qi
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
Intra-class compactness and inter-class separability are crucial indicators to measure the effectiveness of a model to produce discriminative features, where intra-class compactness indicates how close the features with the same label are to each other and inter-class separability indicates how far away the features with different labels are. In this work, we investigate intra-class compactness and inter-class separability of features learned by convolutional networks and propose a Gaussian-based softmax ($\mathcal{G}$-softmax) function that can effectively improve intra-class compactness and inter-class separability. The proposed function is simple to implement and can easily replace the softmax function. We evaluate the proposed $\mathcal{G}$-softmax function on classification datasets (i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet) and on multi-label classification datasets (i.e., MS COCO and NUS-WIDE). The experimental results show that the proposed $\mathcal{G}$-softmax function improves the state-of-the-art models across all evaluated datasets. In addition, analysis of the intra-class compactness and inter-class separability demonstrates the advantages of the proposed function over the softmax function, which is consistent with the performance improvement. More importantly, we observe that high intra-class compactness and inter-class separability are linearly correlated to average precision on MS COCO and NUS-WIDE. This implies that improvement of intra-class compactness and inter-class separability would lead to improvement of average precision., Comment: 15 pages, published in TNNLS
- Published
- 2019
25. Visual Social Relationship Recognition
- Author
- Li, Junnan, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan S.
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Social relationships form the basis of the social structure of humans. Developing computational models to understand social relationships from visual data is essential for building intelligent machines that can better interact with humans in a social environment. In this work, we study the problem of visual social relationship recognition in images. We propose a Dual-Glance model for social relationship recognition, where the first glance fixates at the person of interest and the second glance deploys an attention mechanism to exploit contextual cues. To enable this study, we curated a large-scale People in Social Context (PISC) dataset, which comprises 23,311 images and 79,244 person pairs with annotated social relationships. Since visually identifying social relationships bears a certain degree of uncertainty, we further propose an Adaptive Focal Loss to leverage the ambiguous annotations for more effective learning. We conduct extensive experiments to quantitatively and qualitatively demonstrate the efficacy of our proposed method, which yields state-of-the-art performance on social relationship recognition., Comment: arXiv admin note: text overlap with arXiv:1708.00634
- Published
- 2018
26. Learning to Learn from Noisy Labeled Data
- Author
- Li, Junnan, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan
- Subjects
Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
- Abstract
Despite the success of deep neural networks (DNNs) in image classification tasks, human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect. There exist many inexpensive data sources on the web, but they tend to contain inaccurate labels. Training on noisy labeled datasets causes performance degradation because DNNs can easily overfit to the label noise. To overcome this problem, we propose a noise-tolerant training algorithm, where a meta-learning update is performed prior to the conventional gradient update. The proposed meta-learning method simulates actual training by generating synthetic noisy labels, and trains the model such that, after one gradient update using each set of synthetic noisy labels, the model does not overfit to the specific noise. We conduct extensive experiments on the noisy CIFAR-10 dataset and the Clothing1M dataset. The results demonstrate the advantageous performance of the proposed method compared to several state-of-the-art baselines. (A compact sketch of such a meta-step follows this entry.)
- Published
- 2018
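A compact, simplified sketch of the "meta-update before the normal update" idea above, written for a linear classifier: synthetic label noise is injected, one inner SGD step is simulated with higher-order gradients, and the post-update predictions are asked to stay consistent with the pre-update ones. The consistency target and single-layer model are simplifications, not the paper's full algorithm.

```python
import torch
import torch.nn.functional as F

def meta_noise_tolerant_step(W, b, x, y, inner_lr=0.1, noise_rate=0.2):
    """One meta-iteration for a linear classifier (W: (C, D), b: (C,)).

    1) Corrupt a fraction of labels to synthesize noise.
    2) Simulate one inner SGD step on the noisy labels (create_graph=True).
    3) Meta-loss: post-update predictions should stay consistent with the
       pre-update predictions, discouraging overfitting to label noise.
    Backpropagate the returned loss, then run the usual update on clean data.
    """
    logits = x @ W.t() + b
    with torch.no_grad():
        teacher = F.softmax(logits, dim=-1)
        noisy_y = y.clone()
        flip = torch.rand_like(y.float()) < noise_rate
        noisy_y[flip] = torch.randint(0, W.size(0), (int(flip.sum()),))
    inner_loss = F.cross_entropy(logits, noisy_y)
    gW, gb = torch.autograd.grad(inner_loss, (W, b), create_graph=True)
    W2, b2 = W - inner_lr * gW, b - inner_lr * gb         # simulated update
    student = F.log_softmax(x @ W2.t() + b2, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Usage sketch
D, C = 16, 4
W = torch.randn(C, D, requires_grad=True)
b = torch.zeros(C, requires_grad=True)
x, y = torch.randn(32, D), torch.randint(0, C, (32,))
meta_loss = meta_noise_tolerant_step(W, b, x, y)
meta_loss.backward()
```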
27. Unsupervised Online Video Object Segmentation with Motion Property Understanding
- Author
- Zhuo, Tao, Cheng, Zhiyong, Zhang, Peng, Wong, Yongkang, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
- Abstract
Unsupervised video object segmentation aims to automatically segment moving objects over an unconstrained video without any user annotation. So far, only few unsupervised online methods have been reported in literature and their performance is still far from satisfactory, because the complementary information from future frames cannot be processed under online setting. To solve this challenging problem, in this paper, we propose a novel Unsupervised Online Video Object Segmentation (UOVOS) framework by construing the motion property to mean moving in concurrence with a generic object for segmented regions. By incorporating salient motion detection and object proposal, a pixel-wise fusion strategy is developed to effectively remove detection noise such as dynamic background and stationary objects. Furthermore, by leveraging the obtained segmentation from immediately preceding frames, a forward propagation algorithm is employed to deal with unreliable motion detection and object proposals. Experimental results on several benchmark datasets demonstrate the efficacy of the proposed method. Compared to the state-of-the-art unsupervised online segmentation algorithms, the proposed method achieves an absolute gain of 6.2%. Moreover, our method achieves better performance than the best unsupervised offline algorithm on the DAVIS-2016 benchmark dataset. Our code is available on the project website: https://github.com/visiontao/uovos.
- Published
- 2018
28. Unsupervised Learning of View-invariant Action Representations
- Author
- Li, Junnan, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan S.
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
The recent success in human action recognition with deep learning methods mostly adopt the supervised learning paradigm, which requires significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework, which exploits unlabeled data to learn video representations. Different from previous works in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics which is discriminative for the action. In addition, we propose a view-adversarial training method to enhance learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets., Comment: NIPS 2018
- Published
- 2018
29. Interact as You Intend: Intention-Driven Human-Object Interaction Detection
- Author
- Xu, Bingjie, Li, Junnan, Wong, Yongkang, Kankanhalli, Mohan S., and Zhao, Qi
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
The recent advances in instance-level detection tasks lay strong foundation for genuine comprehension of the visual scenes. However, the ability to fully comprehend a social scene is still in its preliminary stage. In this work, we focus on detecting human-object interactions (HOIs) in social scene images, which is demanding in terms of research and increasingly useful for practical applications. To undertake social tasks interacting with objects, humans direct their attention and move their body based on their intention. Based on this observation, we provide a unique computational perspective to explore human intention in HOI detection. Specifically, the proposed human intention-driven HOI detection (iHOI) framework models human pose with the relative distances from body joints to the object instances. It also utilizes human gaze to guide the attended contextual regions in a weakly-supervised setting. In addition, we propose a hard negative sampling strategy to address the problem of mis-grouping. We perform extensive experiments on two benchmark datasets, namely V-COCO and HICO-DET. The efficacy of each proposed component has also been validated.
- Published
- 2018
30. Video Storytelling: Textual Summaries for Events
- Author
- Li, Junnan, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan S.
- Subjects
Computer Science - Multimedia, Computer Science - Computer Vision and Pattern Recognition
- Abstract
Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this work, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address the challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a Residual Bidirectional Recurrent Neural Network to leverage contextual information from past and future. Second, we propose a Narrator model to discover the underlying storyline. The Narrator is formulated as a reinforcement learning agent which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the Video Story dataset, a new dataset that we have collected to enable the study. We compare our method with multiple state-of-the-art baselines, and show that our method achieves better performance, in terms of quantitative measures and user study., Comment: Published in IEEE Transactions on Multimedia
- Published
- 2018
31. Attention Transfer from Web Images for Video Recognition
- Author
- Li, Junnan, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
- Abstract
Training deep learning based video classifiers for action recognition requires a large amount of labeled videos. The labeling process is labor-intensive and time-consuming. On the other hand, large amounts of weakly-labeled images are uploaded to the Internet by users every day. To harness this rich and highly diverse set of Web images, a scalable approach is to crawl these images to train deep learning based classifiers, such as Convolutional Neural Networks (CNN). However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when directly deployed to videos. One way to address this problem is to fine-tune the trained models on videos, but a sufficient amount of annotated videos is still required. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e. video data) with a limited amount of training data. Our method maps the video frames into a low-dimensional feature space using the class-discriminative spatial attention map for CNNs. We design a novel Siamese EnergyNet structure to learn energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept would have higher energy. We conduct extensive experiments on two challenging video recognition datasets (i.e. TVHI and UCF101), and demonstrate the efficacy of our proposed method., Comment: ACM Multimedia, 2017
- Published
- 2017
32. Dual-Glance Model for Deciphering Social Relationships
- Author
- Li, Junnan, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan S.
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Since the beginning of early civilizations, social relationships derived from each individual have fundamentally formed the basis of the social structure in our daily life. In the computer vision literature, much progress has been made in scene understanding, such as object detection and scene parsing. Recent research focuses on the relationship between objects based on their functionality and geometrical relations. In this work, we aim to study the problem of social relationship recognition in still images. We propose a dual-glance model for social relationship recognition, where the first glance fixates at the individual pair of interest and the second glance deploys an attention mechanism to explore contextual cues. We also collected a new large-scale People in Social Context (PISC) dataset, which comprises 22,670 images and 76,568 annotated samples from 9 types of social relationship. We provide benchmark results on the PISC dataset, and qualitatively demonstrate the efficacy of the proposed model., Comment: IEEE International Conference on Computer Vision (ICCV), 2017
- Published
- 2017
33. Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking
- Author
- Li, Wenhui, Wong, Yongkang, Liu, An-An, Li, Yang, Su, Yu-Ting, and Kankanhalli, Mohan
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Action recognition has received increasing attention from the computer vision and machine learning communities in the last decade. To enable the study of this problem, there exists a vast number of action datasets, which are recorded under controlled laboratory settings, real-world surveillance environments, or crawled from the Internet. Apart from the "in-the-wild" datasets, the training and test splits of conventional datasets often possess similar environmental conditions, which leads to close-to-perfect performance on constrained datasets. In this paper, we introduce a new dataset, namely the Multi-Camera Action Dataset (MCAD), which is designed to evaluate the open-view classification problem under the surveillance environment. In total, MCAD contains 14,298 action samples from 18 action categories, which are performed by 20 subjects and independently recorded with 5 cameras. Inspired by the well-received evaluation approach on the LFW dataset, we designed a standard evaluation protocol and benchmarked MCAD under several scenarios. The benchmark shows that while an average of 85% accuracy is achieved under the closed-view scenario, the performance suffers from a significant drop under the cross-view scenario. In the worst-case scenario, the performance of 10-fold cross validation drops from 87.0% to 47.4%.
- Published
- 2016
34. n-Reference Transfer Learning for Saliency Prediction
- Author
- Luo, Yan, Wong, Yongkang, Kankanhalli, Mohan S., and Zhao, Qi; edited by Vedaldi, Andrea, Bischof, Horst, Brox, Thomas, and Frahm, Jan-Michael
- Published
- 2020
35. KF-VTON: Keypoints-Driven Flow Based Virtual Try-On Network.
- Author
- Wu, Zizhao, Liu, Siyu, Lu, Peioyan, Yang, Ping, Wong, Yongkang, Gu, Xiaoling, and Kankanhalli, Mohan S.
- Subjects
VIRTUAL networks, PRODUCT image, CLOTHING & dress, CLEANING compounds, SPLINES
- Abstract
Image-based virtual try-on aims to fit a target garment to a reference person. Most existing methods are limited to solving the Garment-To-Person (G2P) try-on task that transfers a garment from a clean product image to the reference person and do not consider the Person-To-Person (P2P) try-on task that transfers a garment from a clothed person image to the reference person, which limits the practical applicability. The P2P try-on task is more challenging due to spatial discrepancies caused by different poses, body shapes, and views between the reference person and the target person. To address this issue, we propose a novel Keypoints-Driven Flow Based Virtual Try-On Network (KF-VTON) for handling both the G2P and P2P try-on tasks. Our KF-VTON has two key innovations: (1) We propose a new keypoints-driven flow based deformation model to warp the garment. This model establishes spatial correspondences between the target garment and reference person by combining the robustness of Thin-plate Spline (TPS) based deformation and the flexibility of appearance flow based deformation. (2) We investigate a powerful Context-aware Spatially Adaptive Normalization (CSAN) generative module to synthesize the final try-on image. Particularly, CSAN integrates rich contextual information with semantic parsing guidance to properly infer unobserved garment appearances. Extensive experiments demonstrate that our KF-VTON is capable of producing photo-realistic and high-fidelity try-on results for the G2P as well as P2P try-on tasks and surpasses previous state-of-the-art methods both quantitatively and qualitatively. Our code is available at https://github.com/OIUIU/KF-VTON.
- Published
- 2024
36. Recurrent Appearance Flow for Occlusion-Free Virtual Try-On.
- Author
- Gu, Xiaoling, Zhu, Junkai, Wong, Yongkang, Wu, Zizhao, Yu, Jun, Fan, Jianping, and Kankanhalli, Mohan
- Subjects
SCIENTIFIC community, VIRTUAL networks, LEARNING strategies, CLOTHING & dress, POSTURE
- Abstract
Image-based virtual try-on aims at transferring a target in-shop garment onto a reference person, and has garnered significant attention from the research communities recently. However, previous methods have faced severe challenges in handling occlusion problems. To address this limitation, we classify occlusion problems into three types based on the reference person's arm postures: single-arm occlusion, two-arm non-crossed occlusion, and two-arm crossed occlusion. Specifically, we propose a novel Occlusion-Free Virtual Try-On Network (OF-VTON) that effectively overcomes these occlusion challenges. The OF-VTON framework consists of two core components: (i) a new Recurrent Appearance Flow based Deformation (RAFD) model that robustly aligns the in-shop garment to the reference person by adopting a multi-task learning strategy. This model jointly produces the dense appearance flow to warp the garment and predicts a human segmentation map to provide semantic guidance for the subsequent image synthesis model. (ii) a powerful Multi-mask Image SynthesiS (MISS) model that generates photo-realistic try-on results by introducing a new mask generation and selection mechanism. Experimental results demonstrate that our proposed OF-VTON significantly outperforms existing state-of-the-art methods by mitigating the impact of occlusion problems. Our code is available at https://github.com/gxl-groups/OF-VTON.
- Published
- 2024
- Full Text
- View/download PDF
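As a rough illustration of the multi-task strategy described for RAFD, the toy module below (assumed layer sizes and input channels, not the authors' architecture) shares one encoder between a flow head and a segmentation head, so the dense appearance flow and the human parsing map are predicted jointly.

```python
import torch
import torch.nn as nn

class MultiTaskRAFD(nn.Module):
    """Toy shared-encoder network with a flow head and a segmentation head."""

    def __init__(self, in_ch: int = 6, num_parts: int = 20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.flow_head = nn.Conv2d(64, 2, 3, padding=1)          # per-pixel (dx, dy)
        self.seg_head = nn.Conv2d(64, num_parts, 3, padding=1)   # body-part logits

    def forward(self, garment: torch.Tensor, person: torch.Tensor):
        feat = self.encoder(torch.cat([garment, person], dim=1))
        return self.flow_head(feat), self.seg_head(feat)

if __name__ == "__main__":
    net = MultiTaskRAFD()
    flow, seg = net(torch.rand(1, 3, 256, 192), torch.rand(1, 3, 256, 192))
    print(flow.shape, seg.shape)  # (1, 2, 256, 192) (1, 20, 256, 192)
```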
37. Automatic Classification of Human Epithelial Type 2 Cell Indirect Immunofluorescence Images using Cell Pyramid Matching
- Author
-
Wiliem, Arnold, Sanderson, Conrad, Wong, Yongkang, Hobson, Peter, Minchin, Rodney F., and Lovell, Brian C.
- Subjects
Quantitative Biology - Cell Behavior ,Computer Science - Computer Vision and Pattern Recognition ,Quantitative Biology - Quantitative Methods ,J.3 ,I.4.7 ,I.4.9 ,I.5.1 ,I.5.4 ,G.3 - Abstract
This paper describes a novel system for automatic classification of images obtained from Anti-Nuclear Antibody (ANA) pathology tests on Human Epithelial type 2 (HEp-2) cells using the Indirect Immunofluorescence (IIF) protocol. The IIF protocol on HEp-2 cells has been the hallmark method to identify the presence of ANAs, due to its high sensitivity and the large range of antigens that can be detected. However, it suffers from numerous shortcomings, such as being subjective as well as time and labour intensive. Computer Aided Diagnostic (CAD) systems have been developed to address these problems; they automatically classify a HEp-2 cell image into one of its known patterns (e.g., speckled, homogeneous). Most existing CAD systems use handpicked features to represent a HEp-2 cell image, which may only work in limited scenarios. We propose a novel automatic cell image classification method termed Cell Pyramid Matching (CPM), which combines regional histograms of visual words with the Multiple Kernel Learning framework. We present a study of several variations of generating histograms and show the efficacy of the system on two publicly available datasets: the ICPR HEp-2 cell classification contest dataset and the SNPHEp-2 dataset., Comment: arXiv admin note: substantial text overlap with arXiv:1304.1262 (A toy regional-histogram sketch follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
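To make the regional-histogram idea concrete, the snippet below is a simplified two-level stand-in (the patch descriptors, codebook size, and central-region split are illustrative assumptions; the paper's cell pyramid and its Multiple Kernel Learning combination are richer than this): patches are hard-assigned to visual words, and histograms pooled over nested regions are concatenated.

```python
import numpy as np

def pyramid_histogram(patch_feats, patch_centers, codebook, image_size):
    """patch_feats: (P, D); patch_centers: (P, 2) (row, col) pixel coords; codebook: (K, D)."""
    # Hard-assign each patch to its nearest visual word.
    dists = np.linalg.norm(patch_feats[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    k = codebook.shape[0]
    h, w = image_size
    # Level 0: whole image; level 1: central region (a crude two-level "pyramid").
    whole = np.bincount(words, minlength=k).astype(float)
    cy, cx = patch_centers[:, 0], patch_centers[:, 1]
    inner = (cy > h / 4) & (cy < 3 * h / 4) & (cx > w / 4) & (cx < 3 * w / 4)
    central = np.bincount(words[inner], minlength=k).astype(float)
    # L1-normalise each region histogram before concatenation.
    return np.concatenate([whole / max(whole.sum(), 1), central / max(central.sum(), 1)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 64))           # toy SIFT-like patch descriptors
    centers = rng.uniform(0, 96, size=(200, 2))  # patch locations in a 96x96 cell image
    codebook = rng.normal(size=(32, 64))         # K = 32 visual words
    print(pyramid_histogram(feats, centers, codebook, (96, 96)).shape)  # (64,)
```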
38. Visual Social Relationship Recognition
- Author
-
Li, Junnan, Wong, Yongkang, Zhao, Qi, and Kankanhalli, Mohan S.
- Published
- 2020
- Full Text
- View/download PDF
39. Learning Controllable Face Generator from Disjoint Datasets
- Author
-
Li, Jing, Wong, Yongkang, Sim, Terence, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Vento, Mario, editor, and Percannella, Gennaro, editor
- Published
- 2019
- Full Text
- View/download PDF
40. Dynamic Amelioration of Resolution Mismatches for Local Feature Based Identity Inference
- Author
-
Wong, Yongkang, Sanderson, Conrad, Mau, Sandra, and Lovell, Brian C.
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Information Retrieval ,I.5.4 ,I.4 - Abstract
While existing face recognition systems based on local features are robust to issues such as misalignment, they can exhibit accuracy degradation when comparing images of differing resolutions. This is common in surveillance environments where a gallery of high resolution mugshots is compared to low resolution CCTV probe images, or where the size of a given image is not a reliable indicator of the underlying resolution (e.g., due to poor optics). To alleviate this degradation, we propose a compensation framework which dynamically chooses the most appropriate face recognition system for a given pair of image resolutions. This framework applies a novel resolution detection method which does not rely on the size of the input images, but instead exploits the sensitivity of local features to resolution using a probabilistic multi-region histogram approach. Experiments on a resolution-modified version of the "Labeled Faces in the Wild" dataset show that the proposed resolution detector frontend obtains a 99% average accuracy in selecting the most appropriate face recognition system, resulting in higher overall face discrimination accuracy (across several resolutions) compared to the individual baseline face recognition systems. (A toy resolution-aware dispatcher sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
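The dynamic selection step can be pictured with the toy dispatcher below (the gradient-based sharpness proxy, the two buckets, and the threshold are assumptions made for illustration; the paper uses a learned probabilistic multi-region histogram detector): each image is mapped to an effective-resolution bucket, and the pair of buckets indexes the recogniser to use.

```python
import numpy as np

def effective_resolution_bucket(image: np.ndarray) -> str:
    """Crude proxy: mean gradient magnitude stands in for the learned detector."""
    gy, gx = np.gradient(image.astype(float))
    sharpness = np.mean(np.hypot(gx, gy))
    return "high" if sharpness > 10.0 else "low"   # threshold chosen for illustration

def compare_faces(gallery_img, probe_img, systems):
    """systems: dict mapping (gallery_bucket, probe_bucket) -> scoring function."""
    key = (effective_resolution_bucket(gallery_img),
           effective_resolution_bucket(probe_img))
    return systems[key](gallery_img, probe_img)

if __name__ == "__main__":
    # Toy scorers: each stands in for a recogniser trained for that resolution pairing.
    systems = {pair: (lambda g, p: float(-np.mean(np.abs(g.astype(float) - p.astype(float)))))
               for pair in [("high", "high"), ("high", "low"),
                            ("low", "high"), ("low", "low")]}
    rng = np.random.default_rng(0)
    gallery = rng.integers(0, 256, size=(64, 64))
    probe = rng.integers(0, 256, size=(64, 64))
    print(compare_faces(gallery, probe, systems))
```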
41. Classification of Human Epithelial Type 2 Cell Indirect Immunofluoresence Images via Codebook Based Descriptors
- Author
-
Wiliem, Arnold, Wong, Yongkang, Sanderson, Conrad, Hobson, Peter, Chen, Shaokang, and Lovell, Brian C.
- Subjects
Quantitative Biology - Cell Behavior ,Computer Science - Computer Vision and Pattern Recognition ,Quantitative Biology - Quantitative Methods ,I.2.10 ,I.4.6 ,I.4.7 ,I.4.10 ,I.5.4 ,G.3 - Abstract
The Anti-Nuclear Antibody (ANA) clinical pathology test is commonly used to identify the existence of various diseases. A hallmark method for identifying the presence of ANAs is the Indirect Immunofluorescence method on Human Epithelial (HEp-2) cells, due to its high sensitivity and the large range of antigens that can be detected. However, the method suffers from numerous shortcomings, such as being subjective as well as time and labour intensive. Computer Aided Diagnostic (CAD) systems have been developed to address these problems; they automatically classify a HEp-2 cell image into one of its known patterns (e.g., speckled, homogeneous). Most of the existing CAD systems use handpicked features to represent a HEp-2 cell image, which may only work in limited scenarios. In this paper, we propose a cell classification system comprising a dual-region codebook-based descriptor combined with the Nearest Convex Hull Classifier. We evaluate the performance of several variants of the descriptor on two publicly available datasets: the ICPR HEp-2 cell classification contest dataset and the new SNPHEp-2 dataset. To our knowledge, this is the first time codebook-based descriptors have been applied and studied in this domain. Experiments show that the proposed system delivers consistently high performance and is more robust than two recent CAD systems. (A simplified nearest-convex-hull sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
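The Nearest Convex Hull Classifier mentioned above can be sketched as follows (a simplified stand-in built on SciPy's SLSQP solver; the dimensions, toy data, and solver choice are illustrative and not the paper's implementation): a query descriptor is assigned to the class whose convex hull of training descriptors lies closest.

```python
import numpy as np
from scipy.optimize import minimize

def dist_to_convex_hull(x: np.ndarray, samples: np.ndarray) -> float:
    """samples: (N, D) class descriptors; returns min ||x - samples.T @ w|| over the simplex."""
    n = samples.shape[0]
    w0 = np.full(n, 1.0 / n)
    obj = lambda w: np.sum((x - samples.T @ w) ** 2)
    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)  # weights sum to 1
    res = minimize(obj, w0, bounds=[(0.0, 1.0)] * n, constraints=cons, method="SLSQP")
    return float(np.sqrt(res.fun))

def nearest_convex_hull_classify(x, class_samples):
    """class_samples: dict label -> (N_c, D) array of training descriptors."""
    return min(class_samples, key=lambda c: dist_to_convex_hull(x, class_samples[c]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    classes = {"speckled": rng.normal(0.0, 1.0, (20, 16)),
               "homogeneous": rng.normal(3.0, 1.0, (20, 16))}
    query = rng.normal(3.0, 1.0, 16)
    print(nearest_convex_hull_classify(query, classes))  # likely "homogeneous"
```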
42. Patch-based Probabilistic Image Quality Assessment for Face Selection and Improved Video-based Face Recognition
- Author
-
Wong, Yongkang, Chen, Shaokang, Mau, Sandra, Sanderson, Conrad, and Lovell, Brian C.
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Statistics - Applications ,I.4.1 ,I.4.6 ,I.4.7 ,I.4.9 ,I.4.10 ,I.5.1 ,I.5.4 ,G.3 - Abstract
In video based face recognition, face images are typically captured over multiple frames in uncontrolled conditions, where head pose, illumination, shadowing, motion blur and focus change over the sequence. Additionally, inaccuracies in face localisation can also introduce scale and alignment variations. Using all face images, including images of poor quality, can actually degrade face recognition performance. While one solution is to use only the "best" subset of images, current face selection techniques are incapable of simultaneously handling all of the abovementioned issues. We propose an efficient patch-based face image quality assessment algorithm which quantifies the similarity of a face image to a probabilistic face model, representing an "ideal" face. Image characteristics that affect recognition are taken into account, including variations in geometric alignment (shift, rotation and scale), sharpness, head pose and cast shadows. Experiments on the FERET and PIE datasets show that the proposed algorithm is able to identify images which are simultaneously the most frontal, aligned, sharp and well illuminated. Further experiments on a new video surveillance dataset (termed ChokePoint) show that the proposed method provides better face subsets than existing face selection techniques, leading to significant improvements in recognition accuracy. (A toy patch-based quality scoring sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
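A toy version of the patch-based scoring idea is sketched below (the diagonal-Gaussian patch models, grid size, and frame-selection helper are illustrative assumptions rather than the trained model from the paper): each aligned face is scored by the average log-likelihood of its patches under per-location "ideal face" statistics, and the highest-scoring frames are kept.

```python
import numpy as np

def patch_quality_score(face: np.ndarray, means, variances, patch: int = 8) -> float:
    """face: (H, W) aligned grayscale face; means/variances: per-patch model parameters."""
    h, w = face.shape
    scores = []
    for i, r in enumerate(range(0, h - patch + 1, patch)):
        for j, c in enumerate(range(0, w - patch + 1, patch)):
            block = face[r:r + patch, c:c + patch].ravel().astype(float)
            mu, var = means[i, j], variances[i, j]
            # Diagonal-Gaussian log-likelihood of the patch under the "ideal" model.
            ll = -0.5 * np.sum((block - mu) ** 2 / var + np.log(2 * np.pi * var))
            scores.append(ll)
    return float(np.mean(scores))

def select_best_frames(frames, means, variances, top_n: int = 3):
    ranked = sorted(frames, key=lambda f: patch_quality_score(f, means, variances),
                    reverse=True)
    return ranked[:top_n]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = rng.uniform(80, 160, size=(8, 8, 64))   # 8x8 grid of 8x8-pixel patch models
    variances = np.full((8, 8, 64), 400.0)
    frames = [rng.uniform(0, 255, size=(64, 64)) for _ in range(10)]
    print(len(select_best_frames(frames, means, variances)))  # 3
```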
43. Combined Learning of Salient Local Descriptors and Distance Metrics for Image Set Face Verification
- Author
-
Sanderson, Conrad, Harandi, Mehrtash T., Wong, Yongkang, and Lovell, Brian C.
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,I.5.1 ,I.5.4 ,I.2.10 - Abstract
In contrast to comparing faces via single exemplars, matching sets of face images increases robustness and discrimination performance. Recent image set matching approaches typically measure similarities between subspaces or manifolds, while representing faces in a rigid and holistic manner. Such representations are easily affected by variations in alignment, illumination, pose and expression. While local feature based representations are considerably more robust to such variations, they have received little attention within the image set matching area. We propose a novel image set matching technique comprising three aspects: (i) robust descriptors of face regions based on local features, partly inspired by the hierarchy in the human visual system, (ii) use of several subspace and exemplar metrics to compare corresponding face regions, and (iii) jointly learning which regions are the most discriminative while finding the optimal mixing weights for combining metrics. Face recognition experiments on the LFW, PIE and MOBIO face datasets show that the proposed algorithm obtains considerably better performance than several recent state-of-the-art techniques, such as Local Principal Angle and the Kernel Affine Hull Method. (A small mixing-weight learning sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
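The joint weighting in aspect (iii) can be illustrated with the small routine below (a logistic-style update with a non-negativity projection; the score channels, learning rate, and normalisation are assumptions, not the paper's optimisation): per-region, per-metric similarity scores from labelled pairs are combined so that discriminative regions end up with larger mixing weights.

```python
import numpy as np

def learn_mixing_weights(scores: np.ndarray, labels: np.ndarray,
                         lr: float = 0.1, epochs: int = 500) -> np.ndarray:
    """scores: (P, K) similarity scores for P pairs and K region/metric channels;
    labels: (P,) 1 = same identity, 0 = different identity."""
    w = np.zeros(scores.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = scores @ w + b
        p = 1.0 / (1.0 + np.exp(-z))                 # logistic prediction per pair
        grad_w = scores.T @ (p - labels) / len(labels)
        grad_b = np.mean(p - labels)
        w = np.maximum(w - lr * grad_w, 0.0)         # keep mixing weights non-negative
        b -= lr * grad_b
    return w / max(w.sum(), 1e-8)                    # normalised mixing weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=200).astype(float)
    scores = rng.normal(size=(200, 4))
    scores[:, 0] += 2.0 * labels                     # channel 0 is informative, rest are noise
    print(np.round(learn_mixing_weights(scores, labels), 3))  # weight on channel 0 dominates
```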
44. On Robust Face Recognition via Sparse Encoding: the Good, the Bad, and the Ugly
- Author
-
Wong, Yongkang, Harandi, Mehrtash T., and Sanderson, Conrad
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In the field of face recognition, Sparse Representation (SR) has received considerable attention during the past few years. Most of the relevant literature focuses on holistic descriptors in closed-set identification applications. The underlying assumption in SR-based methods is that each class in the gallery has sufficient samples and the query lies on the subspace spanned by the gallery of the same class. Unfortunately, this assumption is easily violated in the more challenging face verification scenario, where an algorithm is required to determine if two faces (where one or both have not been seen before) belong to the same person. In this paper, we first discuss why previous attempts with SR might not be applicable to verification problems. We then propose an alternative approach to face verification via SR. Specifically, we propose to use explicit SR encoding on local image patches rather than the entire face. The obtained sparse signals are pooled via averaging to form multiple region descriptors, which are then concatenated to form an overall face descriptor. Due to the deliberate loss of spatial relations within each region (caused by averaging), the resulting descriptor is robust to misalignment and various image deformations. Within the proposed framework, we evaluate several SR encoding techniques: l1-minimisation, Sparse Autoencoder Neural Network (SANN), and an implicit probabilistic technique based on Gaussian Mixture Models. Thorough experiments on the AR, FERET, exYaleB, BANCA and ChokePoint datasets show that the proposed local SR approach obtains considerably better and more robust performance than several previous state-of-the-art holistic SR methods, in both verification and closed-set identification problems. The experiments also show that l1-minimisation based encoding has a considerably higher computational cost than the other techniques, but leads to higher recognition rates. (A small l1 sparse-coding and pooling sketch follows this record.)
- Published
- 2013
- Full Text
- View/download PDF
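For readers who want to see the local encoding pipeline in miniature, the sketch below follows the l1 route with a small ISTA solver and average pooling (the dictionary, dimensions, and solver are illustrative; the paper also evaluates SANN and GMM-based encoders, and its exact settings differ).

```python
import numpy as np

def ista_sparse_code(x, D, lam=0.1, iters=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 for one patch descriptor x."""
    L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the smooth term
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ a - x)
        a = a - grad / L
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)   # soft threshold
    return a

def region_descriptor(patches, D, lam=0.1):
    """patches: (N, d) local patch features from one face region; returns the pooled code."""
    codes = np.stack([ista_sparse_code(p, D, lam) for p in patches])
    return codes.mean(axis=0)                # average pooling discards spatial layout

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.normal(size=(32, 128))           # dictionary: 32-dim patches, 128 atoms
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    patches = rng.normal(size=(50, 32))      # 50 patches from one face region
    face_descriptor = np.concatenate([region_descriptor(patches, D) for _ in range(4)])
    print(face_descriptor.shape)             # (512,) = 4 regions x 128 atoms
```

Average pooling within each region is what discards the spatial layout and, as the abstract notes, buys robustness to misalignment.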
45. Privacy-Enhancing Person Re-identification Framework – A Dual-Stage Approach
- Author
-
Kansal, Kajal, primary, Wong, Yongkang, additional, and Kankanhalli, Mohan, additional
- Published
- 2024
- Full Text
- View/download PDF
46. Learning to Predict Gradients for Semi-Supervised Continual Learning
- Author
-
Luo, Yan, primary, Wong, Yongkang, additional, Kankanhalli, Mohan, additional, and Zhao, Qi, additional
- Published
- 2024
- Full Text
- View/download PDF
47. Rejecting Unknown Gestures based on Surface-Electromyography Using Variational Autoencoder
- Author
-
Dai, Qingfeng, primary, Wong, Yongkang, additional, Kankanhalli, Mohan, additional, Li, Xiangdong, additional, and Geng, Weidong, additional
- Published
- 2024
- Full Text
- View/download PDF
48. NarSUM '23: The 2nd Workshop on User-Centric Narrative Summarization of Long Videos
- Author
-
Kankanhalli, Mohan S., primary, Patras, Ioannis (Yiannis), additional, Liu, Jianquan, additional, Wong, Yongkang, additional, Komamizu, Takahiro, additional, Yamazaki, Satoshi, additional, Stephen, Karen, additional, and Kansal, Kajal, additional
- Published
- 2023
- Full Text
- View/download PDF
49. Improved Network and Training Scheme for Cross-Trial Surface Electromyography (sEMG)-Based Gesture Recognition
- Author
-
Dai, Qingfeng, primary, Wong, Yongkang, additional, Kankanhalli, Mohan, additional, Li, Xiangdong, additional, and Geng, Weidong, additional
- Published
- 2023
- Full Text
- View/download PDF
50. Hierarchical & multimodal video captioning: Discovering and transferring multimodal knowledge for vision to language
- Author
-
Liu, An-An, Xu, Ning, Wong, Yongkang, Li, Junnan, Su, Yu-Ting, and Kankanhalli, Mohan
- Published
- 2017
- Full Text
- View/download PDF