Author: "Chiu, Wei-Chen" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Chiu, Wei-Chen"' showing total 205 results

Start Over Author "Chiu, Wei-Chen"

205 results on '"Chiu, Wei-Chen"'

1. In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models

Author: Chin, Zhi-Yi, Mu, Kuan-Chen, Fritz, Mario, Chen, Pin-Yu, and Chiu, Wei-Chen
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. While various safety mechanisms have been developed, the field lacks systematic tools for evaluating their effectiveness against real-world misuse scenarios. In this work, we propose ICER, a novel red-teaming framework that leverages Large Language Models (LLMs) and a bandit optimization-based algorithm to generate interpretable and semantic meaningful problematic prompts by learning from past successful red-teaming attempts. Our ICER efficiently probes safety mechanisms across different T2I models without requiring internal access or additional training, making it broadly applicable to deployed systems. Through extensive experiments, we demonstrate that ICER significantly outperforms existing prompt attack methods in identifying model vulnerabilities while maintaining high semantic similarity with intended content. By uncovering that successful jailbreaking instances can systematically facilitate the discovery of new vulnerabilities, our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
Published: 2024

2. Realizing Video Summarization from the Path of Language-based Semantic Understanding

Author: Mu, Kuan-Chen, Chin, Zhi-Yi, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: The recent development of Video-based Large Language Models (VideoLLMs), has significantly advanced video summarization by aligning video features and, in some cases, audio features with Large Language Models (LLMs). Each of these VideoLLMs possesses unique strengths and weaknesses. Many recent methods have required extensive fine-tuning to overcome the limitations of these models, which can be resource-intensive. In this work, we observe that the strengths of one VideoLLM can complement the weaknesses of another. Leveraging this insight, we propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm, which operates as an inference-time algorithm without requiring any form of fine-tuning. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries. It effectively combines visual and audio content, provides detailed background descriptions, and excels at identifying keyframes, which enables more semantically meaningful retrieval compared to traditional computer vision approaches that rely solely on visual information, all without the need for additional fine-tuning. Moreover, the resulting summaries enhance performance in downstream tasks such as summary video generation, either through keyframe selection or in combination with text-to-image models. Our language-driven approach offers a semantically rich alternative to conventional methods and provides flexibility to incorporate newer VideoLLMs, enhancing adaptability and performance in video summarization tasks.
Published: 2024

3. T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Author: Yeh, Chen, Chang, You-Ming, Chiu, Wei-Chen, and Yu, Ning
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: To address the risks of encountering inappropriate or harmful content, researchers managed to incorporate several harmful contents datasets with machine learning methods to detect harmful concepts. However, existing harmful datasets are curated by the presence of a narrow range of harmful objects, and only cover real harmful content sources. This hinders the generalizability of methods based on such datasets, potentially leading to misjudgments. Therefore, we propose a comprehensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definition. We also propose a novel annotation framework by formulating the annotation process as a multi-agent Visual Question Answering (VQA) task, having 3 different VLMs "debate" about whether the given image/video is harmful, and incorporating the in-context learning strategy in the debating process. Therefore, we can ensure that the VLMs consider the context of the given image/video and both sides of the arguments thoroughly before making decisions, further reducing the likelihood of misjudgments in edge cases. Evaluation and experimental results demonstrate that (1) the great alignment between the annotation from our novel annotation framework and those from human, ensuring the reliability of VHD11K; (2) our full-spectrum harmful dataset successfully identifies the inability of existing harmful content detection methods to detect extensive harmful contents and improves the performance of existing harmfulness recognition methods; (3) VHD11K outperforms the baseline dataset, SMID, as evidenced by the superior improvement in harmfulness recognition methods. The complete dataset and code can be found at https://github.com/nctu-eva-lab/VHD11K., Comment: Accepted to NeurIPS'24 Datasets and Benchmarks Track
Published: 2024

4. Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis

Author: Lin, Chiu-Chou, Shih, Yu-Wei, Kuo, Kuei-Ting, Chen, Yu-Cheng, Chen, Chien-Hua, Chiu, Wei-Chen, and Wu, I-Chen
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Computer Science - Information Retrieval, Computer Science - Machine Learning, Computer Science - Multiagent Systems
Abstract: How can balance be quantified in game settings? This question is crucial for game designers, especially in player-versus-player (PvP) games, where analyzing the strength relations among predefined team compositions-such as hero combinations in multiplayer online battle arena (MOBA) games or decks in card games-is essential for enhancing gameplay and achieving balance. We have developed two advanced measures that extend beyond the simplistic win rate to quantify balance in zero-sum competitive scenarios. These measures are derived from win value estimations, which employ strength rating approximations via the Bradley-Terry model and counter relationship approximations via vector quantization, significantly reducing the computational complexity associated with traditional win value estimations. Throughout the learning process of these models, we identify useful categories of compositions and pinpoint their counter relationships, aligning with the experiences of human players without requiring specific game knowledge. Our methodology hinges on a simple technique to enhance codebook utilization in discrete representation with a deterministic vector quantization process for an extremely small state space. Our framework has been validated in popular online games, including Age of Empires II, Hearthstone, Brawl Stars, and League of Legends. The accuracy of the observed strength relations in these games is comparable to traditional pairwise win value predictions, while also offering a more manageable complexity for analysis. Ultimately, our findings contribute to a deeper understanding of PvP game dynamics and present a methodology that significantly improves game balance evaluation and design., Comment: TMLR 09/2024 https://openreview.net/forum?id=2D36otXvBE
Published: 2024

5. Perceptual Similarity for Measuring Decision-Making Style and Policy Diversity in Games

Author: Lin, Chiu-Chou, Chiu, Wei-Chen, and Wu, I-Chen
Subjects: Computer Science - Artificial Intelligence, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Defining and measuring decision-making styles, also known as playstyles, is crucial in gaming, where these styles reflect a broad spectrum of individuality and diversity. However, finding a universally applicable measure for these styles poses a challenge. Building on Playstyle Distance, the first unsupervised metric to measure playstyle similarity based on game screens and raw actions, we introduce three enhancements to increase accuracy: multiscale analysis with varied state granularity, a perceptual kernel rooted in psychology, and the utilization of the intersection-over-union method for efficient evaluation. These innovations not only advance measurement precision but also offer insights into human cognition of similarity. Across two racing games and seven Atari games, our techniques significantly improve the precision of zero-shot playstyle classification, achieving an accuracy exceeding 90 percent with fewer than 512 observation-action pairs, which is less than half an episode of these games. Furthermore, our experiments with 2048 and Go demonstrate the potential of discrete playstyle measures in puzzle and board games. We also develop an algorithm for assessing decision-making diversity using these measures. Our findings improve the measurement of end-to-end game analysis and the evolution of artificial intelligence for diverse playstyles., Comment: TMLR 08/2024 https://openreview.net/forum?id=30C9AWBW49
Published: 2024

6. A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting

Author: Chou, Tsung-Han, Wang, Brian, Chiu, Wei-Chen, and Chen, Jun-Cheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Class agnostic counting (CAC) is a vision task that can be used to count the total occurrence number of any given reference objects in the query image. The task is usually formulated as a density map estimation problem through similarity computation among a few image samples of the reference object and the query image. In this paper, we point out a severe issue of the existing CAC framework: Given a multi-class setting, models don't consider reference images and instead blindly match all dominant objects in the query image. Moreover, the current evaluation metrics and dataset cannot be used to faithfully assess the model's generalization performance and robustness. To this end, we discover that the combination of mosaic augmentation with generalized loss is essential for addressing the aforementioned issue of CAC models to count objects of majority (i.e. dominant objects) regardless of the references. Furthermore, we introduce a new evaluation protocol and metrics for resolving the problem behind the existing CAC evaluation scheme and better benchmarking CAC models in a more fair manner. Besides, extensive evaluation results demonstrate that our proposed recipe can consistently improve the performance of different CAC models. The code is available at https://github.com/littlepenguin89106/MGCAC., Comment: Accepted by ACCV 2024
Published: 2024

7. MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes

Author: Wang, Bor-Shiun, Wang, Chien-Yi, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent advancements in post-hoc and inherently interpretable methods have markedly enhanced the explanations of black box classifier models. These methods operate either through post-analysis or by integrating concept learning during model training. Although being effective in bridging the semantic gap between a model's latent space and human interpretation, these explanation methods only partially reveal the model's decision-making process. The outcome is typically limited to high-level semantics derived from the last feature map. We argue that the explanations lacking insights into the decision processes at low and mid-level features are neither fully faithful nor useful. Addressing this gap, we introduce the Multi-Level Concept Prototypes Classifier (MCPNet), an inherently interpretable model. MCPNet autonomously learns meaningful concept prototypes across multiple feature map levels using Centered Kernel Alignment (CKA) loss and an energy-based weighted PCA mechanism, and it does so without reliance on predefined concept labels. Further, we propose a novel classifier paradigm that learns and aligns multi-level concept prototype distributions for classification purposes via Class-aware Concept Distribution (CCD) loss. Our experiments reveal that our proposed MCPNet while being adaptable to various model architectures, offers comprehensive multi-level explanations while maintaining classification accuracy. Additionally, its concept distribution-based classification approach shows improved generalization capabilities in few-shot classification scenarios., Comment: Accepted by CVPR 2024
Published: 2024

8. MENTOR: Multilingual tExt detectioN TOward leaRning by analogy

Author: Lin, Hsin-Ju, Chung, Tsu-Chun, Hsiao, Ching-Chun, Chen, Pin-Yu, Chiu, Wei-Chen, and Huang, Ching-Chun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text detection is frequently used in vision-based mobile robots when they need to interpret texts in their surroundings to perform a given task. For instance, delivery robots in multilingual cities need to be capable of doing multilingual text detection so that the robots can read traffic signs and road markings. Moreover, the target languages change from region to region, implying the need of efficiently re-training the models to recognize the novel/new languages. However, collecting and labeling training data for novel languages are cumbersome, and the efforts to re-train an existing/trained text detector are considerable. Even worse, such a routine would repeat whenever a novel language appears. This motivates us to propose a new problem setting for tackling the aforementioned challenges in a more efficient way: "We ask for a generalizable multilingual text detection framework to detect and identify both seen and unseen language regions inside scene images without the requirement of collecting supervised training data for unseen languages as well as model re-training". To this end, we propose "MENTOR", the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection., Comment: 8 pages, 4 figures, published to IROS 2023
Published: 2024
Full Text: View/download PDF

9. Improving Robustness for Joint Optimization of Camera Poses and Decomposed Low-Rank Tensorial Radiance Fields

Author: Cheng, Bo-Yu, Chiu, Wei-Chen, and Liu, Yu-Lun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where the naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Moreover, based on the analysis of the frequency spectrum, we propose to apply convolutional Gaussian filters on 2D and 3D radiance fields for a coarse-to-fine training schedule that enables joint camera pose optimization. Leveraging the decomposition property in decomposed low-rank tensor, our method achieves an equivalent effect to brute-force 3D convolution with only incurring little computational overhead. To further improve the robustness and stability of joint optimization, we also propose techniques of smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss mask. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence for optimization., Comment: AAAI 2024. Project page: https://alex04072000.github.io/Joint-TensoRF/
Published: 2024

10. A Recipe for CAC: Mosaic-Based Generalized Loss for Improved Class-Agnostic Counting

Author: Chou, Tsung-Han, Wang, Brian, Chiu, Wei-Chen, Chen, Jun-Cheng, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Cho, Minsu, editor, Laptev, Ivan, editor, Tran, Du, editor, Yao, Angela, editor, and Zha, Hongbin, editor
Published: 2025
Full Text: View/download PDF

11. Best of Both Sides: Integration of Absolute and Relative Depth Sensing Modalities Based on iToF and RGB Cameras

Author: Fang, I-Sheng, Chiu, Wei-Chen, Chen, Yong-Sheng, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Antonacopoulos, Apostolos, editor, Chaudhuri, Subhasis, editor, Chellappa, Rama, editor, Liu, Cheng-Lin, editor, Bhattacharya, Saumik, editor, and Pal, Umapada, editor
Published: 2025
Full Text: View/download PDF

12. AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors

Author: Chang, You-Ming, Yeh, Chen, Chiu, Wei-Chen, and Yu, Ning
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep generative models can create remarkably photorealistic fake images while raising concerns about misinformation and copyright infringement, known as deepfake threats. Deepfake detection technique is developed to distinguish between real and fake images, where the existing methods typically learn classifiers in the image domain or various feature domains. However, the generalizability of deepfake detection against emerging and more advanced generative models remains challenging. In this paper, being inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach called AntifakePrompt, using VLMs (e.g., InstructBLIP) and prompt tuning techniques to improve the deepfake detection accuracy over unseen data. We formulate deepfake detection as a visual question answering problem, and tune soft prompts for InstructBLIP to answer the real/fake information of a query image. We conduct full-spectrum experiments on datasets from a diversity of 3 held-in and 20 held-out generative models, covering modern text-to-image generation, image editing and adversarial image attacks. These testing datasets provide useful benchmarks in the realm of deepfake detection for further research. Moreover, results demonstrate that (1) the deepfake detection accuracy can be significantly and consistently improved (from 71.06% to 92.11%, in average accuracy over unseen domains) using pretrained vision-language models with prompt tuning; (2) our superior performance is at less cost of training data and trainable parameters, resulting in an effective and efficient solution for deepfake detection. Code and models can be found at https://github.com/nctu-eva-lab/AntifakePrompt.
Published: 2023

13. Skin the sheep not only once: Reusing Various Depth Datasets to Drive the Learning of Optical Flow

Author: Huang, Sheng-Chi and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Optical flow estimation is crucial for various applications in vision and robotics. As the difficulty of collecting ground truth optical flow in real-world scenarios, most of the existing methods of learning optical flow still adopt synthetic dataset for supervised training or utilize photometric consistency across temporally adjacent video frames to drive the unsupervised learning, where the former typically has issues of generalizability while the latter usually performs worse than the supervised ones. To tackle such challenges, we propose to leverage the geometric connection between optical flow estimation and stereo matching (based on the similarity upon finding pixel correspondences across images) to unify various real-world depth estimation datasets for generating supervised training data upon optical flow. Specifically, we turn the monocular depth datasets into stereo ones via synthesizing virtual disparity, thus leading to the flows along the horizontal direction; moreover, we introduce virtual camera motion into stereo data to produce additional flows along the vertical direction. Furthermore, we propose applying geometric augmentations on one image of an optical flow pair, encouraging the optical flow estimator to learn from more challenging cases. Lastly, as the optical flow maps under different geometric augmentations actually exhibit distinct characteristics, an auxiliary classifier which trains to identify the type of augmentation from the appearance of the flow map is utilized to further enhance the learning of the optical flow estimator. Our proposed method is general and is not tied to any particular flow estimator, where extensive experiments based on various datasets and optical flow estimation models verify its efficacy and superiority.
Published: 2023

14. Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where

Author: Chin, Zhi-Yi, Jiang, Chieh-Ming, Huang, Ching-Chun, Chen, Pin-Yu, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: While image data starts to enjoy the simple-but-effective self-supervised learning scheme built upon masking and self-reconstruction objective thanks to the introduction of tokenization procedure and vision transformer backbone, convolutional neural networks as another important and widely-adopted architecture for image data, though having contrastive-learning techniques to drive the self-supervised learning, still face the difficulty of leveraging such straightforward and general masking operation to benefit their learning process significantly. In this work, we aim to alleviate the burden of including masking operation into the contrastive-learning framework for convolutional neural networks as an extra augmentation method. In addition to the additive but unwanted edges (between masked and unmasked regions) as well as other adverse effects caused by the masking operations for ConvNets, which have been discussed by prior works, we particularly identify the potential problem where for one view in a contrastive sample-pair the randomly-sampled masking regions could be overly concentrated on important/salient objects thus resulting in misleading contrastiveness to the other view. To this end, we propose to explicitly take the saliency constraint into consideration in which the masked regions are more evenly distributed among the foreground and background for realizing the masking-based augmentation. Moreover, we introduce hard negative samples by masking larger regions of salient patches in an input image. Extensive experiments conducted on various datasets, contrastive learning mechanisms, and downstream tasks well verify the efficacy as well as the superior performance of our proposed method with respect to several state-of-the-art baselines., Comment: WACV 2024. Source code is available at: https://github.com/joycenerd/Saliency-Guided-Masking-for-ConvNets
Published: 2023

15. Transformer-based Image Compression with Variable Image Quality Objectives

Author: Kao, Chia-Hao, Chen, Yi-Hsin, Chien, Cheng, Chiu, Wei-Chen, and Peng, Wen-Hsiao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: This paper presents a Transformer-based image compression system that allows for a variable image quality objective according to the user's preference. Optimizing a learned codec for different quality objectives leads to reconstructed images with varying visual characteristics. Our method provides the user with the flexibility to choose a trade-off between two image quality objectives using a single, shared model. Motivated by the success of prompt-tuning techniques, we introduce prompt tokens to condition our Transformer-based autoencoder. These prompt tokens are generated adaptively based on the user's preference and input image through learning a prompt generation network. Extensive experiments on commonly used quality metrics demonstrate the effectiveness of our method in adapting the encoding and/or decoding processes to a variable quality objective. While offering the additional flexibility, our proposed method performs comparably to the single-objective methods in terms of rate-distortion performance.
Published: 2023

16. Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts

Author: Chin, Zhi-Yi, Jiang, Chieh-Ming, Huang, Ching-Chun, Chen, Pin-Yu, and Chiu, Wei-Chen
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-image diffusion models, e.g. Stable Diffusion (SD), lately have shown remarkable ability in high-quality content generation, and become one of the representatives for the recent wave of transformative AI. Nevertheless, such advance comes with an intensifying concern about the misuse of this generative technology, especially for producing copyrighted or NSFW (i.e. not safe for work) images. Although efforts have been made to filter inappropriate images/prompts or remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diversified problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D) as a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models with safety mechanisms. Particularly, our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompt, and safety guidance. Our findings suggest that, without comprehensive testing, the evaluations on limited safe prompting benchmarks can lead to a false sense of safety for text-to-image models., Comment: ICML 2024 main conference paper. The source code is available at https://github.com/joycenerd/P4D
Published: 2023

17. TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception

Author: Chen, Yi-Hsin, Weng, Ying-Chieh, Kao, Chia-Hao, Chien, Cheng, Chiu, Wei-Chen, and Peng, Wen-Hsiao
Subjects: Electrical Engineering and Systems Science - Image and Video Processing
Abstract: This work aims for transferring a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts to the encoder and task-specific prompts to the decoder. Extensive experiments show that our proposed method is capable of transferring the base codec to various machine tasks and outperforms the competing methods significantly. To our best knowledge, this work is the first attempt to utilize prompting on the low-level image compression task., Comment: Accepted to ICCV 2023
Published: 2023

18. Transformer-based Variable-rate Image Compression with Region-of-interest Control

Author: Kao, Chia-Hao, Weng, Ying-Chieh, Chen, Yi-Hsin, Chiu, Wei-Chen, and Peng, Wen-Hsiao
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper proposes a transformer-based learned image compression system. It is capable of achieving variable-rate compression with a single model while supporting the region-of-interest (ROI) functionality. Inspired by prompt tuning, we introduce prompt generation networks to condition the transformer-based autoencoder of compression. Our prompt generation networks generate content-adaptive tokens according to the input image, an ROI mask, and a rate parameter. The separation of the ROI mask and the rate parameter allows an intuitive way to achieve variable-rate and ROI coding simultaneously. Extensive experiments validate the effectiveness of our proposed method and confirm its superiority over the other competing methods., Comment: Accepted to IEEE ICIP 2023
Published: 2023

19. Multimodal Prompting with Missing Modalities for Visual Recognition

Author: Lee, Yi-Lun, Tsai, Yi-Hsuan, Chiu, Wei-Chen, and Lee, Chen-Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) when missing-modality occurs either during training or testing in real-world situations; and 2) when the computation resources are not available to finetune on heavy transformer models. To this end, we propose to utilize prompt learning and mitigate the above two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 1% learnable parameters compared to training the entire model. We further explore the effect of different prompt configurations and analyze the robustness to missing modality. Extensive experiments are conducted to show the effectiveness of our prompt learning framework that improves the performance under various missing-modality cases, while alleviating the requirement of heavy model re-training. Code is available., Comment: Accepted by CVPR 2023. Codes are available at https://github.com/YiLunLee/Missing_aware_prompts
Published: 2023

20. Mitigating Forgetting in Online Continual Learning via Contrasting Semantically Distinct Augmentations

Author: Yu, Sheng-Feng and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Online continual learning (OCL) aims to enable model learning from a non-stationary data stream to continuously acquire new knowledge as well as retain the learnt one, under the constraints of having limited system size and computational cost, in which the main challenge comes from the "catastrophic forgetting" issue -- the inability to well remember the learnt knowledge while learning the new ones. With the specific focus on the class-incremental OCL scenario, i.e. OCL for classification, the recent advance incorporates the contrastive learning technique for learning more generalised feature representation to achieve the state-of-the-art performance but is still unable to fully resolve the catastrophic forgetting. In this paper, we follow the strategy of adopting contrastive learning but further introduce the semantically distinct augmentation technique, in which it leverages strong augmentation to generate more data samples, and we show that considering these samples semantically different from their original classes (thus being related to the out-of-distribution samples) in the contrastive learning mechanism contributes to alleviate forgetting and facilitate model stability. Moreover, in addition to contrastive learning, the typical classification mechanism and objective (i.e. softmax classifier and cross-entropy loss) are included in our model design for faster convergence and utilising the label information, but particularly equipped with a sampling strategy to tackle the tendency of favouring the new classes (i.e. model bias towards the recently learnt classes). Upon conducting extensive experiments on CIFAR-10, CIFAR-100, and Mini-Imagenet datasets, our proposed method is shown to achieve superior performance against various baselines.
Published: 2022

21. 3D-PL: Domain Adaptive Depth Estimation with 3D-aware Pseudo-Labeling

Author: Yen, Yu-Ting, Lu, Chia-Ni, Chiu, Wei-Chen, and Tsai, Yi-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: For monocular depth estimation, acquiring ground truths for real data is not easy, and thus domain adaptation methods are commonly adopted using the supervised synthetic data. However, this may still incur a large domain gap due to the lack of supervision from the real data. In this paper, we develop a domain adaptation framework via generating reliable pseudo ground truths of depth from real data to provide direct supervisions. Specifically, we propose two mechanisms for pseudo-labeling: 1) 2D-based pseudo-labels via measuring the consistency of depth predictions when images are with the same content but different styles; 2) 3D-aware pseudo-labels via a point cloud completion network that learns to complete the depth values in the 3D space, thus providing more structural information in a scene to refine and generate more reliable pseudo-labels. In experiments, we show that our pseudo-labeling methods improve depth estimation in various settings, including the usage of stereo pairs during training. Furthermore, the proposed method performs favorably against several state-of-the-art unsupervised domain adaptation approaches in real-world datasets., Comment: Accepted in ECCV 2022. Project page: https://ccc870206.github.io/3D-PL/
Published: 2022

22. BiFuse++: Self-supervised and Efficient Bi-projection Fusion for 360 Depth Estimation

Author: Wang, Fu-En, Yeh, Yu-Hsuan, Tsai, Yi-Hsuan, Chiu, Wei-Chen, and Sun, Min
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Due to the rise of spherical cameras, monocular 360 depth estimation becomes an important technique for many applications (e.g., autonomous systems). Thus, state-of-the-art frameworks for monocular 360 depth estimation such as bi-projection fusion in BiFuse are proposed. To train such a framework, a large number of panoramas along with the corresponding depth ground truths captured by laser sensors are required, which highly increases the cost of data collection. Moreover, since such a data collection procedure is time-consuming, the scalability of extending these methods to different scenes becomes a challenge. To this end, self-training a network for monocular depth estimation from 360 videos is one way to alleviate this issue. However, there are no existing frameworks that incorporate bi-projection fusion into the self-training scheme, which highly limits the self-supervised performance since bi-projection fusion can leverage information from different projection types. In this paper, we propose BiFuse++ to explore the combination of bi-projection fusion and the self-training scenario. To be specific, we propose a new fusion module and Contrast-Aware Photometric Loss to improve the performance of BiFuse and increase the stability of self-training on real-world videos. We conduct both supervised and self-supervised experiments on benchmark datasets and achieve state-of-the-art performance., Comment: Accepted in TPAMI 2022; Code: https://github.com/fuenwang/BiFusev2
Published: 2022
Full Text: View/download PDF

23. Adaptively-Realistic Image Generation from Stroke and Sketch with Diffusion Model

Author: Cheng, Shin-I, Chen, Yu-Jie, Chiu, Wei-Chen, Tseng, Hung-Yu, and Lee, Hsin-Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating images from hand-drawings is a crucial and fundamental task in content creation. The translation is difficult as there exist infinite possibilities and the different users usually expect different outcomes. Therefore, we propose a unified framework supporting a three-dimensional control over the image synthesis from sketches and strokes based on diffusion models. Users can not only decide the level of faithfulness to the input strokes and sketches, but also the degree of realism, as the user inputs are usually not consistent with the real images. Qualitative and quantitative experiments demonstrate that our framework achieves state-of-the-art performance while providing flexibility in generating customized images with control over shape, color, and realism. Moreover, our method unleashes applications such as editing on real images, generation with partial sketches and strokes, and multi-domain multi-modal synthesis.
Published: 2022

24. Vector Quantized Image-to-Image Translation

Author: Chen, Yu-Jie, Cheng, Shin-I, Chiu, Wei-Chen, Tseng, Hung-Yu, and Lee, Hsin-Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Current image-to-image translation methods formulate the task with conditional generation models, leading to learning only the recolorization or regional changes as being constrained by the rich structural information provided by the conditional contexts. In this work, we propose introducing the vector quantization technique into the image-to-image translation framework. The vector quantized content representation can facilitate not only the translation, but also the unconditional distribution shared among different domains. Meanwhile, along with the disentangled style representation, the proposed method further enables the capability of image extension with flexibility in both intra- and inter-domains. Qualitative and quantitative experiments demonstrate that our framework achieves comparable performance to the state-of-the-art image-to-image translation and image extension methods. Compared to methods for individual tasks, the proposed method, as a unified framework, unleashes applications combining image-to-image translation, unconditional generation, and image extension altogether. For example, it provides style variability for image generation and extension, and equips image-to-image translation with further extension capabilities.
Published: 2022

25. Self-Supervised Feature Learning from Partial Point Clouds via Pose Disentanglement

Author: Tsai, Meng-Shiun, Chiang, Pei-Ze, Tsai, Yi-Hsuan, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Self-supervised learning on point clouds has gained a lot of attention recently, since it addresses the label-efficiency and domain-gap problems on point cloud tasks. In this paper, we propose a novel self-supervised framework to learn informative representations from partial point clouds. We leverage partial point clouds scanned by LiDAR that contain both content and pose attributes, and we show that disentangling such two factors from partial point clouds enhances feature representation learning. To this end, our framework consists of three main parts: 1) a completion network to capture holistic semantics of point clouds; 2) a pose regression network to understand the viewing angle where partial data is scanned from; 3) a partial reconstruction network to encourage the model to learn content and pose features. To demonstrate the robustness of the learnt feature representations, we conduct several downstream tasks including classification, part segmentation, and registration, with comparisons against state-of-the-art methods. Our method not only outperforms existing self-supervised methods, but also shows a better generalizability across synthetic and real-world datasets., Comment: 10 pages, 4 figures and 6 tables
Published: 2022

26. Make an Omelette with Breaking Eggs: Zero-Shot Learning for Novel Attribute Synthesis

Author: Li, Yu-Hsuan, Chao, Tzu-Yin, Huang, Ching-Chun, Chen, Pin-Yu, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Most of the existing algorithms for zero-shot classification problems typically rely on the attribute-based semantic relations among categories to realize the classification of novel categories without observing any of their instances. However, training the zero-shot classification models still requires attribute labeling for each class (or even instance) in the training dataset, which is also expensive. To this end, in this paper, we bring up a new problem scenario: "Can we derive zero-shot learning for novel attribute detectors/classifiers and use them to automatically annotate the dataset for labeling efficiency?". Basically, given only a small set of detectors that are learned to recognize some manually annotated attributes (i.e., the seen attributes), we aim to synthesize the detectors of novel attributes in a zero-shot learning manner. Our proposed method, Zero-Shot Learning for Attributes (ZSLA), which is the first of its kind to the best of our knowledge, tackles this new research problem by applying the set operations to first decompose the seen attributes into their basic attributes and then recombine these basic attributes into the novel ones. Extensive experiments are conducted to verify the capacity of our synthesized detectors for accurately capturing the semantics of the novel attributes and show their superior performance in terms of detection and localization compared to other baseline approaches. Moreover, we demonstrate the application of automatic annotation using our synthesized detectors on Caltech-UCSD Birds-200-2011 dataset. Various generalized zero-shot classification algorithms trained upon the dataset re-annotated by ZSLA show comparable performance with those trained with the manual ground-truth annotations. Please refer to our project page for source code: https://yuhsuanli.github.io/ZSLA/, Comment: Accepted by the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). (* Yu-Hsuan Li and Tzu-Yin Chao contributed equally to this work.)
Published: 2021

27. An Unsupervised Video Game Playstyle Metric via State Discretization

Author: Lin, Chiu-Chou, Chiu, Wei-Chen, and Wu, I-Chen
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: On playing video games, different players usually have their own playstyles. Recently, there have been great improvements for the video game AIs on the playing strength. However, past researches for analyzing the behaviors of players still used heuristic rules or the behavior features with the game-environment support, thus being exhausted for the developers to define the features of discriminating various playstyles. In this paper, we propose the first metric for video game playstyles directly from the game observations and actions, without any prior specification on the playstyle in the target game. Our proposed method is built upon a novel scheme of learning discrete representations that can map game observations into latent discrete states, such that playstyles can be exhibited from these discrete states. Namely, we measure the playstyle distance based on game observations aligned to the same states. We demonstrate high playstyle accuracy of our metric in experiments on some video game platforms, including TORCS, RGSK, and seven Atari games, and for different agents including rule-based AI bots, learning-based AI bots, and human players., Comment: This version was also published on UAI 2021
Published: 2021

28. Towards Interpretable Deep Networks for Monocular Depth Estimation

Author: You, Zunzhi, Tsai, Yi-Hsuan, Chiu, Wei-Chen, and Li, Guanbin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep networks for Monocular Depth Estimation (MDE) have achieved promising performance recently and it is of great importance to further understand the interpretability of these networks. Existing methods attempt to provide posthoc explanations by investigating visual cues, which may not explore the internal representations learned by deep networks. In this paper, we find that some hidden units of the network are selective to certain ranges of depth, and thus such behavior can be served as a way to interpret the internal representations. Based on our observations, we quantify the interpretability of a deep MDE network by the depth selectivity of its hidden units. Moreover, we then propose a method to train interpretable MDE deep networks without changing their original architectures, by assigning a depth range for each unit to select. Experimental results demonstrate that our method is able to enhance the interpretability of deep MDE networks by largely improving the depth selectivity of their units, while not harming or even improving the depth estimation accuracy. We further provide a comprehensive analysis to show the reliability of selective units, the applicability of our method on different layers, models, and datasets, and a demonstration on analysis of model error. Source code and models are available at https://github.com/youzunzhi/InterpretableMDE ., Comment: Accepted by ICCV2021
Published: 2021

29. Learning Facial Representations from the Cycle-consistency of Face

Author: Chang, Jia-Ren, Chen, Yong-Sheng, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Faces manifest large variations in many aspects, such as identity, expression, pose, and face styling. Therefore, it is a great challenge to disentangle and extract these characteristics from facial images, especially in an unsupervised manner. In this work, we introduce cycle-consistency in facial characteristics as free supervisory signal to learn facial representations from unlabeled facial images. The learning is realized by superimposing the facial motion cycle-consistency and identity cycle-consistency constraints. The main idea of the facial motion cycle-consistency is that, given a face with expression, we can perform de-expression to a neutral face via the removal of facial motion and further perform re-expression to reconstruct back to the original face. The main idea of the identity cycle-consistency is to exploit both de-identity into mean face by depriving the given neutral face of its identity via feature re-normalization and re-identity into neutral face by adding the personal attributes to the mean face. At training time, our model learns to disentangle two distinct facial representations to be useful for performing cycle-consistent face reconstruction. At test time, we use the linear protocol scheme for evaluating facial representations on various tasks, including facial expression recognition and head pose regression. We also can directly apply the learnt facial representations to person recognition, frontalization and image-to-image translation. Our experiments show that the results of our approach is competitive with those of existing methods, demonstrating the rich and unique information embedded in the disentangled representations. Code is available at https://github.com/JiaRenChang/FaceCycle ., Comment: ICCV 2021
Published: 2021

30. MAML is a Noisy Contrastive Learner in Classification

Author: Kao, Chia-Hsiang, Chiu, Wei-Chen, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Model-agnostic meta-learning (MAML) is one of the most popular and widely adopted meta-learning algorithms, achieving remarkable success in various learning problems. Yet, with the unique design of nested inner-loop and outer-loop updates, which govern the task-specific and meta-model-centric learning, respectively, the underlying learning objective of MAML remains implicit and thus impedes a more straightforward understanding of it. In this paper, we provide a new perspective of the working mechanism of MAML. We discover that MAML is analogous to a meta-learner using a supervised contrastive objective. The query features are pulled towards the support features of the same class and against those of different classes. Such contrastiveness is experimentally verified via an analysis based on the cosine similarity. Moreover, we reveal that vanilla MAML has an undesirable interference term originating from the random initialization and the cross-task interaction. We thus propose a simple but effective technique, zeroing trick, to alleviate the interference. Extensive experiments are conducted on both mini-ImageNet and Omniglot datasets to demonstrate the consistent improvement brought by our proposed method, validating its effectiveness., Comment: 22 pages, 17 figures. Accepted by ICLR 2022
Published: 2021

31. RPG: Learning Recursive Point Cloud Generation

Author: Ko, Wei-Jan, Huang, Hui-Yu, Kuo, Yu-Liang, Chiu, Chen-Yi, Wang, Li-Heng, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper we propose a novel point cloud generator that is able to reconstruct and generate 3D point clouds composed of semantic parts. Given a latent representation of the target 3D model, the generation starts from a single point and gets expanded recursively to produce the high-resolution point cloud via a sequence of point expansion stages. During the recursive procedure of generation, we not only obtain the coarse-to-fine point clouds for the target 3D model from every expansion stage, but also unsupervisedly discover the semantic segmentation of the target model according to the hierarchical/parent-child relation between the points across expansion stages. Moreover, the expansion modules and other elements used in our recursive generator are mostly sharing weights thus making the overall framework light and efficient. Extensive experiments are conducted to demonstrate that our proposed point cloud generator has comparable or even superior performance on both generation and reconstruction tasks in comparison to various baselines, as well as provides the consistent co-segmentation among 3D instances of the same object class.
Published: 2021

32. Stylizing 3D Scene via Implicit Representation and HyperNetwork

Author: Chiang, Pei-Ze, Tsai, Meng-Shiun, Tseng, Hung-Yu, Lai, Wei-sheng, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this work, we aim to address the 3D scene stylization problem - generating stylized images of the scene at arbitrary novel view angles. A straightforward solution is to combine existing novel view synthesis and image/video style transfer approaches, which often leads to blurry results or inconsistent appearance. Inspired by the high-quality results of the neural radiance fields (NeRF) method, we propose a joint framework to directly render novel views with the desired style. Our framework consists of two components: an implicit representation of the 3D scene with the neural radiance fields model, and a hypernetwork to transfer the style information into the scene representation. In particular, our implicit representation model disentangles the scene into the geometry and appearance branches, and the hypernetwork learns to predict the parameters of the appearance branch from the reference style image. To alleviate the training difficulties and memory burden, we propose a two-stage training procedure and a patch sub-sampling approach to optimize the style and content losses with the neural radiance fields model. After optimization, our model is able to render consistent novel views at arbitrary view angles with arbitrary style. Both quantitative evaluation and human subject study have demonstrated that the proposed method generates faithful stylization results with consistent appearance across different views., Comment: Accepted to WACV2022; Project page: https://ztex08010518.github.io/3dstyletransfer/
Published: 2021

33. Robust 360-8PA: Redesigning The Normalized 8-point Algorithm for 360-FoV Images

Author: Solarte, Bolivar, Wu, Chin-Hsuan, Lu, Kuan-Wei, Sun, Min, Chiu, Wei-Chen, and Tsai, Yi-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: This paper presents a novel preconditioning strategy for the classic 8-point algorithm (8-PA) for estimating an essential matrix from 360-FoV images (i.e., equirectangular images) in spherical projection. To alleviate the effect of uneven key-feature distributions and outlier correspondences, which can potentially decrease the accuracy of an essential matrix, our method optimizes a non-rigid transformation to deform a spherical camera into a new spatial domain, defining a new constraint and a more robust and accurate solution for an essential matrix. Through several experiments using random synthetic points, 360-FoV, and fish-eye images, we demonstrate that our normalization can increase the camera pose accuracy by about 20% without significantly overhead the computation time. In addition, we present further benefits of our method through both a constant weighted least-square optimization that improves further the well known Gold Standard Method (GSM) (i.e., the non-linear optimization by using epipolar errors); and a relaxation of the number of RANSAC iterations, both showing that our normalization outcomes a more reliable, robust, and accurate solution., Comment: Accepted to ICRA 2021
Published: 2021

34. LED2-Net: Monocular 360 Layout Estimation via Differentiable Depth Rendering

Author: Wang, Fu-En, Yeh, Yu-Hsuan, Sun, Min, Chiu, Wei-Chen, and Tsai, Yi-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Although significant progress has been made in room layout estimation, most methods aim to reduce the loss in the 2D pixel coordinate rather than exploiting the room structure in the 3D space. Towards reconstructing the room layout in 3D, we formulate the task of 360 layout estimation as a problem of predicting depth on the horizon line of a panorama. Specifically, we propose the Differentiable Depth Rendering procedure to make the conversion from layout to depth prediction differentiable, thus making our proposed model end-to-end trainable while leveraging the 3D geometric information, without the need of providing the ground truth depth. Our method achieves state-of-the-art performance on numerous 360 layout benchmark datasets. Moreover, our formulation enables a pre-training step on the depth dataset, which further improves the generalizability of our layout estimation model., Comment: CVPR 2021 Oral, see https://fuenwang.ml/project/led2net
Published: 2021

35. Bridging the Visual Gap: Wide-Range Image Blending

Author: Lu, Chia-Ni, Chang, Ya-Chu, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper we propose a new problem scenario in image processing, wide-range image blending, which aims to smoothly merge two different input photos into a panorama by generating novel image content for the intermediate region between them. Although such problem is closely related to the topics of image inpainting, image outpainting, and image blending, none of the approaches from these topics is able to easily address it. We introduce an effective deep-learning model to realize wide-range image blending, where a novel Bidirectional Content Transfer module is proposed to perform the conditional prediction for the feature representation of the intermediate region via recurrent neural networks. In addition to ensuring the spatial and semantic consistency during the blending, we also adopt the contextual attention mechanism as well as the adversarial learning scheme in our proposed method for improving the visual quality of the resultant panorama. We experimentally demonstrate that our proposed method is not only able to produce visually appealing results for wide-range image blending, but also able to provide superior performance with respect to several baselines built upon the state-of-the-art image inpainting and outpainting approaches., Comment: Accepted to CVPR 2021. Project page: http://github.com/julia0607/Wide-Range-Image-Blending
Published: 2021

36. Domain Adaptation for Learning Generator from Paired Few-Shot Data

Author: Teng, Chun-Chih, Chen, Pin-Yu, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a Paired Few-shot GAN (PFS-GAN) model for learning generators with sufficient source data and a few target data. While generative model learning typically needs large-scale training data, our PFS-GAN not only uses the concept of few-shot learning but also domain shift to transfer the knowledge across domains, which alleviates the issue of obtaining low-quality generator when only trained with target domain data. The cross-domain datasets are assumed to have two properties: (1) each target-domain sample has its source-domain correspondence and (2) two domains share similar content information but different appearance. Our PFS-GAN aims to learn the disentangled representation from images, which composed of domain-invariant content features and domain-specific appearance features. Furthermore, a relation loss is introduced on the content features while shifting the appearance features to increase the structural diversity. Extensive experiments show that our method has better quantitative and qualitative results on the generated target-domain data with higher diversity in comparison to several baselines., Comment: accepted in ICASSP 2021
Published: 2021

37. Spectral Analysis for Semantic Segmentation with Applications on Feature Truncation and Weak Annotation

Author: Chen, Li-Wei, Chiu, Wei-Chen, and Wu, Chin-Tien
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: It is well known that semantic segmentation neural networks (SSNNs) produce dense segmentation maps to resolve the objects' boundaries while restrict the prediction on down-sampled grids to alleviate the computational cost. A striking balance between the accuracy and the training cost of the SSNNs such as U-Net exists. We propose a spectral analysis to investigate the correlations among the resolution of the down sampled grid, the loss function and the accuracy of the SSNNs. By analyzing the network back-propagation process in frequency domain, we discover that the traditional loss function, cross-entropy, and the key features of CNN are mainly affected by the low-frequency components of segmentation labels. Our discoveries can be applied to SSNNs in several ways including (i) determining an efficient low resolution grid for resolving the segmentation maps (ii) pruning the networks by truncating the high frequency decoder features for saving computation costs, and (iii) using block-wise weak annotation for saving the labeling time. Experimental results shown in this paper agree with our spectral analysis for the networks such as DeepLab V3+ and Deep Aggregation Net (DAN)., Comment: 16 pages
Published: 2020

38. Class-incremental Learning with Rectified Feature-Graph Preservation

Author: Lei, Cheng-Hsun, Chen, Yi-Hsin, Peng, Wen-Hsiao, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we address the problem of distillation-based class-incremental learning with a single head. A central theme of this task is to learn new classes that arrive in sequential phases over time while keeping the model's capability of recognizing seen classes with only limited memory for preserving seen data samples. Many regularization strategies have been proposed to mitigate the phenomenon of catastrophic forgetting. To understand better the essence of these regularizations, we introduce a feature-graph preservation perspective. Insights into their merits and faults motivate our weighted-Euclidean regularization for old knowledge preservation. We further propose rectified cosine normalization and show how it can work with binary cross-entropy to increase class separation for effective learning of new classes. Experimental results on both CIFAR-100 and ImageNet datasets demonstrate that our method outperforms the state-of-the-art approaches in reducing classification error, easing catastrophic forgetting, and encouraging evenly balanced accuracy over different classes. Our project page is at : https://github.com/yhchen12101/FGP-ICL., Comment: Accepted by ACCV 2020 (oral)
Published: 2020

39. Benefiting Deep Latent Variable Models via Learning the Prior and Removing Latent Regularization

Author: Morrow, Rogan and Chiu, Wei-Chen
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: There exist many forms of deep latent variable models, such as the variational autoencoder and adversarial autoencoder. Regardless of the specific class of model, there exists an implicit consensus that the latent distribution should be regularized towards the prior, even in the case where the prior distribution is learned. Upon investigating the effect of latent regularization on image generation our results indicate that in the case where a sufficiently expressive prior is learned, latent regularization is not necessary and may in fact be harmful insofar as image quality is concerned. We additionally investigate the benefit of learned priors on two common problems in computer vision: latent variable disentanglement, and diversity in image-to-image translation.
Published: 2020

40. Variational Autoencoders with Normalizing Flow Decoders

Author: Morrow, Rogan and Chiu, Wei-Chen
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Recently proposed normalizing flow models such as Glow have been shown to be able to generate high quality, high dimensional images with relatively fast sampling speed. Due to their inherently restrictive architecture, however, it is necessary that they are excessively deep in order to train effectively. In this paper we propose to combine Glow with an underlying variational autoencoder in order to counteract this issue. We demonstrate that our proposed model is competitive with Glow in terms of image quality and test likelihood while requiring far less time for training.
Published: 2020

41. LayoutMP3D: Layout Annotation of Matterport3D

Author: Wang, Fu-En, Yeh, Yu-Hsuan, Sun, Min, Chiu, Wei-Chen, and Tsai, Yi-Hsuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Inferring the information of 3D layout from a single equirectangular panorama is crucial for numerous applications of virtual reality or robotics (e.g., scene understanding and navigation). To achieve this, several datasets are collected for the task of 360 layout estimation. To facilitate the learning algorithms for autonomous systems in indoor scenarios, we consider the Matterport3D dataset with their originally provided depth map ground truths and further release our annotations for layout ground truths from a subset of Matterport3D. As Matterport3D contains accurate depth ground truths from time-of-flight (ToF) sensors, our dataset provides both the layout and depth information, which enables the opportunity to explore the environment by integrating both cues., Comment: Annotation is available at https://github.com/fuenwang/LayoutMP3D
Published: 2020

42. 360SD-Net: 360{\deg} Stereo Depth Estimation with Learnable Cost Volume

Author: Wang, Ning-Hsu, Solarte, Bolivar, Tsai, Yi-Hsuan, Chiu, Wei-Chen, and Sun, Min
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, end-to-end trainable deep neural networks have significantly improved stereo depth estimation for perspective images. However, 360{\deg} images captured under equirectangular projection cannot benefit from directly adopting existing methods due to distortion introduced (i.e., lines in 3D are not projected onto lines in 2D). To tackle this issue, we present a novel architecture specifically designed for spherical disparity using the setting of top-bottom 360{\deg} camera pairs. Moreover, we propose to mitigate the distortion issue by (1) an additional input branch capturing the position and relation of each pixel in the spherical coordinate, and (2) a cost volume built upon a learnable shifting filter. Due to the lack of 360{\deg} stereo data, we collect two 360{\deg} stereo datasets from Matterport3D and Stanford3D for training and evaluation. Extensive experiments and ablation study are provided to validate our method against existing algorithms. Finally, we show promising results on real-world environments capturing images with two consumer-level cameras., Comment: Accepted to 2020 IEEE International Conference on Robotics and Automation (ICRA 2020). Project page and code are at https://albert100121.github.io/360SD-Net-Project-Page
Published: 2019

43. Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence

Author: Lai, Hsueh-Ying, Tsai, Yi-Hsuan, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Stereo matching and flow estimation are two essential tasks for scene understanding, spatially in 3D and temporally in motion. Existing approaches have been focused on the unsupervised setting due to the limited resource to obtain the large-scale ground truth data. To construct a self-learnable objective, co-related tasks are often linked together to form a joint framework. However, the prior work usually utilizes independent networks for each task, thus not allowing to learn shared feature representations across models. In this paper, we propose a single and principled network to jointly learn spatiotemporal correspondence for stereo matching and flow estimation, with a newly designed geometric connection as the unsupervised signal for temporally adjacent stereo pairs. We show that our method performs favorably against several state-of-the-art baselines for both unsupervised depth and flow estimation on the KITTI benchmark dataset., Comment: Accepted in CVPR'19. Code and model are available at: https://github.com/lelimite4444/BridgeDepthFlow
Published: 2019

44. 3D LiDAR and Stereo Fusion using Stereo Matching Network with Conditional Cost Volume Normalization

Author: Wang, Tsun-Hsuan, Hu, Hou-Ning, Lin, Chieh Hubert, Tsai, Yi-Hsuan, Chiu, Wei-Chen, and Sun, Min
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The complementary characteristics of active and passive depth sensing techniques motivate the fusion of the Li-DAR sensor and stereo camera for improved depth perception. Instead of directly fusing estimated depths across LiDAR and stereo modalities, we take advantages of the stereo matching network with two enhanced techniques: Input Fusion and Conditional Cost Volume Normalization (CCVNorm) on the LiDAR information. The proposed framework is generic and closely integrated with the cost volume component that is commonly utilized in stereo matching neural networks. We experimentally verify the efficacy and robustness of our method on the KITTI Stereo and Depth Completion datasets, obtaining favorable performance against various fusion strategies. Moreover, we demonstrate that, with a hierarchical extension of CCVNorm, the proposed method brings only slight overhead to the stereo matching network in terms of computation time and model size. For project page, see https://zswang666.github.io/Stereo-LiDAR-CCVNorm-Project-Page/, Comment: ver.1
Published: 2019

45. All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation

Author: Chang, Wei-Lun, Wang, Hui-Po, Peng, Wen-Hsiao, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper we tackle the problem of unsupervised domain adaptation for the task of semantic segmentation, where we attempt to transfer the knowledge learned upon synthetic datasets with ground-truth labels to real-world images without any annotation. With the hypothesis that the structural content of images is the most informative and decisive factor to semantic segmentation and can be readily shared across domains, we propose a Domain Invariant Structure Extraction (DISE) framework to disentangle images into domain-invariant structure and domain-specific texture representations, which can further realize image-translation across domains and enable label transfer to improve segmentation performance. Extensive experiments verify the effectiveness of our proposed DISE model and demonstrate its superiority over several state-of-the-art approaches.
Published: 2019

46. Plug-and-Play: Improve Depth Estimation via Sparse Data Propagation

Author: Wang, Tsun-Hsuan, Wang, Fu-En, Lin, Juan-Ting, Tsai, Yi-Hsuan, Chiu, Wei-Chen, and Sun, Min
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a novel plug-and-play (PnP) module for improving depth prediction with taking arbitrary patterns of sparse depths as input. Given any pre-trained depth prediction model, our PnP module updates the intermediate feature map such that the model outputs new depths consistent with the given sparse depths. Our method requires no additional training and can be applied to practical applications such as leveraging both RGB and sparse LiDAR points to robustly estimate dense depth map. Our approach achieves consistent improvements on various state-of-the-art methods on indoor (i.e., NYU-v2) and outdoor (i.e., KITTI) datasets. Various types of LiDARs are also synthesized in our experiments to verify the general applicability of our PnP module in practice. For project page, see https://zswang666.github.io/PnP-Depth-Project-Page/, Comment: 7 pages. 7 figures. ver.2
Published: 2018

47. Self-Contained Stylization via Steganography for Reverse and Serial Style Transfer

Author: Chen, Hung-Yu, Fang, I-Sheng, and Chiu, Wei-Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Style transfer has been widely applied to give real-world images a new artistic look. However, given a stylized image, the attempts to use typical style transfer methods for de-stylization or transferring it again into another style usually lead to artifacts or undesired results. We realize that these issues are originated from the content inconsistency between the original image and its stylized output. Therefore, in this paper we advance to keep the content information of the input image during the process of style transfer by the power of steganography, with two approaches proposed: a two-stage model and an end-to-end model. We conduct extensive experiments to successfully verify the capacity of our models, in which both of them are able to not only generate stylized images of quality comparable with the ones produced by typical style transfer methods, but also effectively eliminate the artifacts introduced in reconstructing original input from a stylized image as well as performing multiple times of style transfer in series., Comment: 21 pages, 21 figures
Published: 2018

48. Summarizing First-Person Videos from Third Persons' Points of Views

Author: Ho, Hsuan-I, Chiu, Wei-Chen, and Wang, Yu-Chiang Frank
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video highlight or summarization is among interesting topics in computer vision, which benefits a variety of applications like viewing, searching, or storage. However, most existing studies rely on training data of third-person videos, which cannot easily generalize to highlight the first-person ones. With the goal of deriving an effective model to summarize first-person videos, we propose a novel deep neural network architecture for describing and discriminating vital spatiotemporal information across videos with different points of view. Our proposed model is realized in a semi-supervised setting, in which fully annotated third-person videos, unlabeled first-person videos, and a small number of annotated first-person ones are presented during training. In our experiments, qualitative and quantitative evaluations on both benchmarks and our collected first-person video datasets are presented., Comment: 16+10 pages, ECCV 2018
Published: 2017

49. Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation

Author: Liu, Yen-Cheng, Yeh, Yu-Ying, Fu, Tzu-Chien, Wang, Sheng-De, Chiu, Wei-Chen, and Wang, Yu-Chiang Frank
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: While representation learning aims to derive interpretable features for describing visual data, representation disentanglement further results in such features so that particular image attributes can be identified and manipulated. However, one cannot easily address this task without observing ground truth annotation for the training data. To address this problem, we propose a novel deep learning model of Cross-Domain Representation Disentangler (CDRD). By observing fully annotated source-domain data and unlabeled target-domain data of interest, our model bridges the information across data domains and transfers the attribute information accordingly. Thus, cross-domain joint feature disentanglement and adaptation can be jointly performed. In the experiments, we provide qualitative results to verify our disentanglement capability. Moreover, we further confirm that our model can be applied for solving classification tasks of unsupervised domain adaptation, and performs favorably against state-of-the-art image disentanglement and translation methods., Comment: CVPR 2018 Spotlight
Published: 2017

50. Demystifying T1-MRI to FDG-PET Image Translation via Representational Similarity

Author: Kao, Chia-Hsiang, Chen, Yong-Sheng, Chen, Li-Fen, Chiu, Wei-Chen, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, de Bruijne, Marleen, editor, Cattin, Philippe C., editor, Cotin, Stéphane, editor, Padoy, Nicolas, editor, Speidel, Stefanie, editor, Zheng, Yefeng, editor, and Essert, Caroline, editor
Published: 2021
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

205 results on '"Chiu, Wei-Chen"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources