Author: "Kim, Jinkyu" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Kim, Jinkyu"' showing total 362 results

1. Unified Domain Generalization and Adaptation for Multi-View 3D Object Detection

Author: Chang, Gyusam, Lee, Jiwon, Kim, Donghyun, Kim, Jinkyu, Lee, Dongwook, Ji, Daehyun, Jang, Sujin, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in 3D object detection leveraging multi-view cameras have demonstrated their practical and economical value in various challenging vision tasks. However, typical supervised learning approaches face challenges in achieving satisfactory adaptation toward unseen and unlabeled target datasets (i.e., direct transfer) due to the inevitable geometric misalignment between the source and target domains. In practice, we also encounter constraints on resources for training models and collecting annotations for the successful deployment of 3D object detectors. In this paper, we propose Unified Domain Generalization and Adaptation (UDGA), a practical solution to mitigate those drawbacks. We first propose a Multi-view Overlap Depth Constraint that leverages the strong association between multiple views, significantly alleviating geometric gaps due to perspective view changes. Then, we present a Label-Efficient Domain Adaptation approach to handle unfamiliar targets with significantly fewer labels (i.e., 1% and 5%), while preserving well-defined source knowledge for training efficiency. Overall, the UDGA framework enables stable detection performance in both source and target domains, effectively bridging inevitable domain gaps, while demanding fewer annotations. We demonstrate the robustness of UDGA with large-scale benchmarks: nuScenes, Lyft, and Waymo, where our framework outperforms the current state-of-the-art methods., Comment: Accepted to NeurIPS 2024
Published: 2024

2. ENTP: Encoder-only Next Token Prediction

Author: Ewer, Ethan, Chae, Daewon, Zeng, Thomas, Kim, Jinkyu, and Lee, Kangwook
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent "cheating" by masking future tokens. We challenge this widely accepted notion and argue that this design choice is about efficiency rather than necessity. While decoder-only Transformers are still a good choice for practical reasons, they are not the only viable option. In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP's superior performance across various realistic tasks, such as length generalization and in-context learning.
Published: 2024
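
Illustrative note on record 2: the contrast the abstract draws between causal (decoder-only) attention and encoder-only next-token prediction can be sketched with a toy model. Everything below is an assumption made for illustration and is not taken from the paper: the single-head attention with identity projections and a residual connection, the function names, and the example vectors. The decoder makes one masked pass over the sequence, while the encoder-only variant re-encodes every prefix without a mask and keeps only the last state, so neither sees future tokens.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_layer(xs, causal):
    """Toy self-attention: identity projections, one head, residual connection."""
    out = []
    for i, query in enumerate(xs):
        # Causal: position i sees positions 0..i. Otherwise it sees every position.
        visible = range(i + 1) if causal else range(len(xs))
        weights = softmax([sum(q * k for q, k in zip(query, xs[j])) for j in visible])
        mixed = [sum(w * xs[j][d] for w, j in zip(weights, visible))
                 for d in range(len(query))]
        out.append([q + m for q, m in zip(query, mixed)])
    return out


def decoder_states(xs, num_layers):
    """Decoder-only: one pass over the full sequence with a causal mask."""
    for _ in range(num_layers):
        xs = attention_layer(xs, causal=True)
    return xs


def entp_states(xs, num_layers):
    """Encoder-only: re-encode each prefix without a mask, keep its last state."""
    states = []
    for end in range(1, len(xs) + 1):
        prefix = xs[:end]
        for _ in range(num_layers):
            prefix = attention_layer(prefix, causal=False)
        states.append(prefix[-1])
    return states


def max_gap(a, b):
    """Largest absolute difference between two lists of vectors."""
    return max(abs(x - y) for u, v in zip(a, b) for x, y in zip(u, v))


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token embeddings
print(max_gap(decoder_states(tokens, 1), entp_states(tokens, 1)))
print(max_gap(decoder_states(tokens, 2), entp_states(tokens, 2)))
```

In this toy setting the two variants agree with one layer and diverge with two, because earlier positions in a re-encoded prefix can attend to later tokens of that prefix. The per-prefix re-encoding also shows where the extra compute of the encoder-only variant comes from.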

3. Finetuning Pre-trained Model with Limited Data for LiDAR-based 3D Object Detection by Bridging Domain Gaps

Author: Jang, Jiyun, Chang, Mincheol, Park, Jongwon, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: LiDAR-based 3D object detectors have been largely utilized in various applications, including autonomous vehicles or mobile robots. However, LiDAR-based detectors often fail to adapt well to target domains with different sensor configurations (e.g., types of sensors, spatial resolution, or FOVs) and location shifts. Collecting and annotating datasets in a new setup is commonly required to reduce such gaps, but it is often expensive and time-consuming. Recent studies suggest that pre-trained backbones can be learned in a self-supervised manner with large-scale unlabeled LiDAR frames. However, despite their expressive representations, they remain challenging to generalize well without substantial amounts of data from the target domain. Thus, we propose a novel method, called Domain Adaptive Distill-Tuning (DADT), to adapt a pre-trained model with limited target data (approximately 100 LiDAR frames), retaining its representation power and preventing it from overfitting. Specifically, we use regularizers to align object-level and context-level representations between the pre-trained and finetuned models in a teacher-student architecture. Our experiments with driving benchmarks, i.e., Waymo Open dataset and KITTI, confirm that our method effectively finetunes a pre-trained model, achieving significant gains in accuracy., Comment: Accepted in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024
Published: 2024

4. Sparse-to-Dense LiDAR Point Generation by LiDAR-Camera Fusion for 3D Object Detection

Author: Lee, Minseung, Moon, Seokha, Lee, Seung Joon, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Accurately detecting objects at long distances remains a critical challenge in 3D object detection when relying solely on LiDAR sensors due to the inherent limitations of data sparsity. To address this issue, we propose the LiDAR-Camera Augmentation Network (LCANet), a novel framework that reconstructs LiDAR point cloud data by fusing 2D image features, which contain rich semantic information, generating additional points to improve detection accuracy. LCANet fuses data from LiDAR sensors and cameras by projecting image features into the 3D space, integrating semantic information into the point cloud data. This fused data is then encoded to produce 3D features that contain both semantic and spatial information, which are further refined to reconstruct final points before bounding box prediction. This fusion effectively compensates for LiDAR's weakness in detecting objects at long distances, which are often represented by sparse points. Additionally, due to the sparsity of many objects in the original dataset, which makes effective supervision for point generation challenging, we employ a point cloud completion network to create a complete point cloud dataset that supervises the generation of dense point clouds in our network. Extensive experiments on the KITTI and Waymo datasets demonstrate that LCANet significantly outperforms existing models, particularly in detecting sparse and distant objects., Comment: 7 pages
Published: 2024

5. DeepClair: Utilizing Market Forecasts for Effective Portfolio Selection

Author: Choi, Donghee, Kim, Jinkyu, Gim, Mogan, Lee, Jinho, and Kang, Jaewoo
Subjects: Computer Science - Computational Engineering, Finance, and Science, Computer Science - Artificial Intelligence
Abstract: Utilizing market forecasts is pivotal in optimizing portfolio selection strategies. We introduce DeepClair, a novel framework for portfolio selection. DeepClair leverages a transformer-based time-series forecasting model to predict market trends, facilitating more informed and adaptable portfolio decisions. To integrate the forecasting model into a deep reinforcement learning-driven portfolio selection framework, we introduced a two-step strategy: first, pre-training the time-series model on market data, followed by fine-tuning the portfolio selection architecture using this model. Additionally, we investigated the optimization technique, Low-Rank Adaptation (LoRA), to enhance the pre-trained forecasting model for fine-tuning in investment scenarios. This work bridges market forecasting and portfolio selection, facilitating the advancement of investment strategies., Comment: CIKM 2024 Accepted
Published: 2024
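
Illustrative note on record 5: the abstract names Low-Rank Adaptation (LoRA) but does not say how DeepClair applies it. The sketch below shows only the generic LoRA update, in which a frozen weight matrix W is adjusted by a scaled low-rank product, W + (alpha / r) * B * A. The function names and the small matrices are hypothetical and chosen for readability.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    w:      frozen pre-trained weight, shape (d_out, d_in)
    lora_b: trainable factor, shape (d_out, rank)
    lora_a: trainable factor, shape (rank, d_in)
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Hypothetical 2x2 weight with a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, -0.5]]
print(lora_effective_weight(w, lora_b, lora_a, alpha=2.0, rank=1))
```

Only the two small factors are trained, so the number of trainable parameters is rank * (d_out + d_in) rather than d_out * d_in. When B is initialized to zeros, the adapted weight starts out equal to the pre-trained one.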

6. VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Author: Moon, Seokha, Woo, Hyun, Park, Hongbeen, Jung, Haeji, Mahjourian, Reza, Chi, Hyung-gun, Lim, Hyerin, Kim, Sangpil, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Predicting future trajectories for other road agents is an essential task for autonomous vehicles. Established trajectory prediction methods primarily use agent tracks generated by a detection and tracking system and HD map as inputs. In this work, we propose a novel method that also incorporates visual input from surround-view cameras, allowing the model to utilize visual cues such as human gazes and gestures, road conditions, vehicle turn signals, etc, which are typically hidden from the model in prior methods. Furthermore, we use textual descriptions generated by a Vision-Language Model (VLM) and refined by a Large Language Model (LLM) as supervision during training to guide the model on what to learn from the input data. Despite using these extra inputs, our method achieves a latency of 53 ms, making it feasible for real-time processing, which is significantly faster than that of previous single-agent prediction methods with similar performance. Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance, and our qualitative analysis highlights how the model is able to exploit these additional inputs. Lastly, in this work we create and release the nuScenes-Text dataset, which augments the established nuScenes dataset with rich textual annotations for every scene, demonstrating the positive impact of utilizing VLM on trajectory prediction. Our project page is at https://moonseokha.github.io/VisionTrap/, Comment: Accepted at ECCV 2024
Published: 2024

7. Learning Temporal Cues by Predicting Objects Move for Multi-camera 3D Object Detection

Author: Moon, Seokha, Park, Hongbeen, Kwon, Jungphil, Lee, Jaekoo, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In autonomous driving and robotics, there is a growing interest in utilizing short-term historical data to enhance multi-camera 3D object detection, leveraging the continuous and correlated nature of input video streams. Recent work has focused on spatially aligning BEV-based features over timesteps. However, this is often limited as its gain does not scale well with long-term past observations. To address this, we advocate for supervising a model to predict objects' poses given past observations, thus explicitly guiding it to learn objects' temporal cues. To this end, we propose a model called DAP (Detection After Prediction), consisting of a two-branch network: (i) a branch responsible for forecasting the current objects' poses given past observations and (ii) another branch that detects objects based on the current and past observations. The features predicting the current objects from branch (i) are fused into branch (ii) to transfer predictive knowledge. We conduct extensive experiments with the large-scale nuScenes dataset, and we observe that utilizing such predictive information significantly improves the overall detection performance. Our model can be used plug-and-play, showing consistent performance gain.
Published: 2024

8. Just Add $100 More: Augmenting NeRF-based Pseudo-LiDAR Point Cloud for Resolving Class-imbalance Problem

Author: Chang, Mincheol, Lee, Siyeong, Kim, Jinkyu, and Kim, Namil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Typical LiDAR-based 3D object detection models are trained in a supervised manner with real-world data collection, which is often imbalanced over classes (or long-tailed). To deal with it, augmenting minority-class examples by sampling ground truth (GT) LiDAR points from a database and pasting them into a scene of interest is often used, but challenges still remain: inflexibility in locating GT samples and limited sample diversity. In this work, we propose to leverage pseudo-LiDAR point clouds generated (at a low cost) from videos capturing a surround view of miniatures or real-world objects of minor classes. Our method, called Pseudo Ground Truth Augmentation (PGT-Aug), consists of three main steps: (i) volumetric 3D instance reconstruction using a 2D-to-3D view synthesis model, (ii) object-level domain alignment with LiDAR intensity estimation and (iii) a hybrid context-aware placement method from ground and map information. We demonstrate the superiority and generality of our method through performance improvements in extensive experiments conducted on three popular benchmarks, i.e., nuScenes, KITTI, and Lyft, especially for the datasets with large domain gaps captured by different LiDAR configurations. Our code and data will be publicly available upon publication., Comment: 28 pages, 12 figures, 11 tables
Published: 2024

9. CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection

Author: Chang, Gyusam, Roh, Wonseok, Jang, Sujin, Lee, Dongwook, Ji, Daehyun, Oh, Gyeongrok, Park, Jinsun, Kim, Jinkyu, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus to make 3DOD models more generalizable, we introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which (i) leverages visual semantic cues from an image modality (i.e., camera images) as an effective semantic bridge to close the domain gap in the cross-modal Bird's Eye View (BEV) representations. Further, (ii) we also introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features, which disrupt the discrimination of whether a feature instance comes from a source or an unseen target domain. Overall, our CMDA framework guides the 3DOD model to generate highly informative and domain-adaptive features for novel data distributions. In our extensive experiments with large-scale benchmarks, such as nuScenes, Waymo, and KITTI, those mentioned above provide significant performance gains for UDA tasks, achieving state-of-the-art performance., Comment: Accepted by AAAI 2024
Published: 2024

10. Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer

Author: Jung, Haeji, Oh, Changdae, Kang, Jooeon, Sohn, Jimin, Song, Kyungwoo, Kim, Jinkyu, and Mortensen, David R.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While there are efforts to align the languages in a single latent space to mitigate such gaps, how different input-level representations influence such gaps has not been investigated, particularly with phonemic inputs. We hypothesize that the performance gaps are affected by representation discrepancies between these languages, and revisit the use of phonemic representations as a means to mitigate these discrepancies. To demonstrate the effectiveness of phonemic representations, we present experiments on three representative cross-lingual tasks on 12 languages in total. The results show that phonemic representations exhibit higher similarities between languages compared to orthographic representations, and that they consistently outperform the grapheme-based baseline model on languages that are relatively low-resourced. We present quantitative evidence from three cross-lingual tasks that demonstrates the effectiveness of phonemic representations, which is further justified by a theoretical analysis of the cross-lingual performance gap., Comment: Accepted to the 4th Multilingual Representation Learning (MRL) Workshop (co-located with EMNLP 2024)
Published: 2024

11. Relaxed Contrastive Learning for Federated Learning

Author: Seo, Seonguk, Kim, Jinkyu, Kim, Geeho, and Han, Bohyung
Subjects: Computer Science - Machine Learning
Abstract: We propose a novel contrastive learning framework to effectively address the challenges of data heterogeneity in federated learning. We first analyze the inconsistency of gradient updates across clients during local training and establish its dependence on the distribution of feature representations, leading to the derivation of the supervised contrastive learning (SCL) objective to mitigate local deviations. In addition, we show that a naïve adoption of SCL in federated learning leads to representation collapse, resulting in slow convergence and limited performance gains. To address this issue, we introduce a relaxed contrastive learning loss that imposes a divergence penalty on excessively similar sample pairs within each class. This strategy prevents collapsed representations and enhances feature transferability, facilitating collaborative training and leading to significant performance improvements. Extensive experiments show that our framework outperforms all existing federated learning approaches by huge margins on the standard benchmarks.
Published: 2024

12. MEVG: Multi-event Video Generation with Text-to-Video Models

Author: Oh, Gyeongrok, Jeong, Jaehwan, Kim, Sieun, Byeon, Wonmin, Kim, Jinkyu, Kim, Sungwoong, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user. Our method does not require a large-scale video dataset since it uses a pre-trained diffusion-based text-to-video generative model without a fine-tuning process. Specifically, we propose a last frame-aware diffusion process to preserve visual coherence between consecutive videos, where each video consists of different events, by initializing the latent and simultaneously adjusting noise in the latent to enhance the motion dynamics in a generated video. Furthermore, we find that the iterative update of latent vectors by referring to all the preceding frames maintains the global appearance across the frames in a video clip. To handle dynamic text input for video generation, we utilize a novel prompt generator that transforms coarse text messages from the user into multiple optimal prompts for the text-to-video diffusion model. Extensive experiments and user studies show that our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics. Video examples are available on our project page: https://kuai-lab.github.io/eccv2024mevg., Comment: Accepted by ECCV 2024
Published: 2023

13. InstructBooth: Instruction-following Personalized Text-to-Image Generation

Author: Chae, Daewon, Park, Nokyung, Kim, Jinkyu, and Lee, Kimin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often face challenges in aligning with text prompts due to overfitting to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image models without sacrificing the personalization ability. Our approach first personalizes text-to-image models with a small number of subject-specific images using a unique identifier. After personalization, we fine-tune personalized text-to-image models using reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method demonstrates superior image-text alignment compared to existing baselines, while maintaining high personalization ability. In human evaluations, InstructBooth outperforms them when considering all comprehensive factors. Our project page is at https://sites.google.com/view/instructbooth.
Published: 2023

14. Audio-guided implicit neural representation for local image stylization

Author: Lee, Seung Hyun, Kim, Sieun, Byeon, Wonmin, Oh, Gyeongrok, In, Sumin, Park, Hyeongcheol, Yoon, Sang Ho, Hong, Sung-Hee, Kim, Jinkyu, and Kim, Sangpil
Published: 2024

15. LRSLAM: Low-Rank Representation of Signed Distance Fields in Dense Visual SLAM System

Author: Park, Hongbeen, Park, Minjeong, Nam, Giljoo, and Kim, Jinkyu
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editor: Goos, Gerhard; Founding Editor: Hartmanis, Juris; Editorial Board Member: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti
Published: 2025

16. VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Author: Moon, Seokha, Woo, Hyun, Park, Hongbeen, Jung, Haeji, Mahjourian, Reza, Chi, Hyung-gun, Lim, Hyerin, Kim, Sangpil, and Kim, Jinkyu
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editor: Goos, Gerhard; Founding Editor: Hartmanis, Juris; Editorial Board Member: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti
Published: 2025

17. MEVG: Multi-event Video Generation with Text-to-Video Models

Author: Oh, Gyeongrok, Jeong, Jaehwan, Kim, Sieun, Byeon, Wonmin, Kim, Jinkyu, Kim, Sungwoong, and Kim, Sangpil
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editor: Goos, Gerhard; Founding Editor: Hartmanis, Juris; Editorial Board Member: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti
Published: 2025

18. Clustering-based Image-Text Graph Matching for Domain Generalization

Author: Park, Nokyung, Chae, Daewon, Shim, Jeongyong, Kim, Sangpil, Kim, Eun-Sol, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Learning domain-invariant visual representations is important to train a model that can generalize well to unseen target task domains. Recent works demonstrate that text descriptions contain high-level class-discriminative information and that such auxiliary semantic cues can be used as an effective pivot embedding for the domain generalization problem. However, they use the pivot embedding in a global manner (i.e., aligning an image embedding with a sentence-level text embedding), not fully utilizing the semantic cues of the given text description. In this work, we advocate for the use of local alignment between image regions and corresponding textual descriptions. To this end, we first represent image and text inputs with graphs. We subsequently cluster nodes in those graphs and match the graph-based image node features to textual graphs. This matching process is conducted globally and locally, tightly aligning visual and textual semantic sub-structures. We experiment with large-scale public datasets, such as CUB-DG and DomainBed, and our model matches or exceeds state-of-the-art performance on these datasets. Our code will be publicly available upon publication.
Published: 2023

19. The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

Author: Jeong, Yujin, Ryoo, Wonjeong, Lee, Seunghyun, Seo, Dabin, Byeon, Wonmin, Kim, Sangpil, and Kim, Jinkyu
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, video generation has become a prominent generative tool and has drawn significant attention. However, there is little consideration in audio-to-video generation, though audio contains unique qualities like temporal semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to incorporate audio input that includes both changeable temporal semantics and magnitude. To generate video frames, TPoS utilizes a latent stable diffusion model with textual semantic information, which is then guided by the sequential audio embedding from our pretrained Audio Encoder. As a result, this method produces audio reactive video contents. We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation. More examples are available at https://ku-vai.github.io/TPoS/, Comment: ICCV2023
Published: 2023

20. Soundini: Sound-Guided Diffusion for Natural Video Editing

Author: Lee, Seung Hyun, Kim, Sieun, Yoo, Innfarn, Yang, Feng, Cho, Donghyeon, Kim, Youngseo, Chang, Huiwen, Kim, Jinkyu, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a method for adding sound-guided visual effects to specific regions of videos with a zero-shot setting. Animating the appearance of the visual effect is challenging because each frame of the edited video should have visual changes while maintaining temporal consistency. Moreover, existing video editing solutions focus on temporal consistency across frames, ignoring the visual style variations over time, e.g., thunderstorm, wave, fire crackling. To overcome this limitation, we utilize temporal sound features for the dynamic style. Specifically, we guide denoising diffusion probabilistic models with an audio latent representation in the audio-visual latent space. To the best of our knowledge, our work is the first to explore sound-guided natural video editing from various sound sources with sound-specialized properties, such as intensity, timbre, and volume. Additionally, we design optical flow-based guidance to generate temporally consistent video frames, capturing the pixel-wise relationship between adjacent frames. Experimental results show that our method outperforms existing video editing techniques, producing more realistic visual effects that reflect the properties of sound. Please visit our page: https://kuai-lab.github.io/soundini-gallery/.
Published: 2023

21. FPANet: Frequency-based Video Demoireing using Frame-level Post Alignment

Author: Oh, Gyeongrok, Gu, Heon, Kim, Jinkyu, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Interference between overlapping grid patterns creates moire patterns, degrading the visual quality of an image of a digital display screen captured by an ordinary digital camera. Removing such moire patterns is challenging due to their complex patterns of diverse sizes and color distortions. Existing approaches mainly focus on filtering in the spatial domain, failing to remove large-scale moire patterns. In this paper, we propose a novel model called FPANet that learns filters in both frequency and spatial domains, improving the restoration quality by removing moire patterns of various sizes. To further enhance restoration, our model takes multiple consecutive frames, learning to extract frame-invariant content features and outputting better quality, temporally consistent images. We demonstrate the effectiveness of our proposed method with a publicly available large-scale dataset, observing that ours outperforms the state-of-the-art approaches, including ESDNet, VDmoire, MBCNN, WDNet, UNet, and DMCNN, in terms of image and video quality metrics, such as PSNR, SSIM, LPIPS, FVD, and FSIM.
Published: 2023

22. Ensuring Visual Commonsense Morality for Text-to-Image Generation

Author: Park, Seongbeom, Moon, Suhong, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society
Abstract: Text-to-image generation methods produce high-resolution and high-quality images, but these methods should not produce immoral images that may contain inappropriate content from the perspective of commonsense morality. In this paper, we aim to automatically judge the immorality of synthesized images and manipulate these images into morally acceptable alternatives. To this end, we build a model that has three main primitives: (1) recognition of the visual commonsense immorality in a given image, (2) localization or highlighting of immoral visual (and textual) attributes that contribute to the immorality of the image, and (3) manipulation of an immoral image to create a morally-qualifying alternative. We conduct experiments and human studies using the state-of-the-art Stable Diffusion text-to-image generation model, demonstrating the effectiveness of our ethical image manipulation approach., Comment: Workshop on Challenges in Deployable Generative AI at ICML 2023
Published: 2022

23. LISA: Localized Image Stylization with Audio via Implicit Neural Representation

Author: Lee, Seung Hyun, Kim, Chanyoung, Byeon, Wonmin, Yoon, Sang Ho, Kim, Jinkyu, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present a novel framework, Localized Image Stylization with Audio (LISA) which performs audio-driven localized image stylization. Sound often provides information about the specific context of the scene and is closely related to a certain part of the scene or object. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a particular part of the image based on audio input is natural but challenging. In this work, we propose a framework that a user provides an audio input to localize the sound source in the input image and another for locally stylizing the target object or scene. LISA first produces a delicate localization map with an audio-visual localization network by leveraging CLIP embedding space. We then utilize implicit neural representation (INR) along with the predicted localization map to stylize the target object or scene based on sound information. The proposed INR can manipulate the localized pixel values to be semantically consistent with the provided audio input. Through a series of experiments, we show that the proposed framework outperforms the other audio-guided stylization methods. Moreover, LISA constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.
Published: 2022

24. Zero-shot Visual Commonsense Immorality Prediction

Author: Jeong, Yujin, Park, Seongbeom, Moon, Suhong, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society
Abstract: Artificial intelligence is currently powering diverse real-world applications. These applications have shown promising performance, but raise complicated ethical issues, i.e. how to embed ethics to make AI applications behave morally. One way toward moral AI systems is by imitating human prosocial behavior and encouraging some form of good behavior in systems. However, learning such normative ethics (especially from images) is challenging mainly due to a lack of data and labeling complexity. Here, we propose a model that predicts visual commonsense immorality in a zero-shot manner. We train our model with an ETHICS dataset (a pair of text and morality annotation) via a CLIP-based image-text joint embedding. In a testing phase, the immorality of an unseen image is predicted. We evaluate our model with existing moral/immoral image datasets and show fair prediction performance consistent with human intuitions. Further, we create a visual commonsense immorality benchmark with more general and extensive immoral visual contents. Codes and dataset are available at https://github.com/ku-vai/Zero-shot-Visual-Commonsense-Immorality-Prediction. Note that this paper might contain images and descriptions that are offensive in nature., Comment: BMVC2022
Published: 2022

25. Resolving Class Imbalance for LiDAR-based Object Detector by Dynamic Weight Average and Contextual Ground Truth Sampling

Author: Lee, Daeun, Park, Jongwon, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: An autonomous driving system requires a 3D object detector, which must perceive all present road agents reliably to navigate an environment safely. However, real-world driving datasets often suffer from the problem of data imbalance, which causes difficulties in training a model that works well across all classes, resulting in undesirably imbalanced, sub-optimal performance. In this work, we propose a method to address this data imbalance problem. Our method consists of two main components: (i) a LiDAR-based 3D object detector with per-class multiple detection heads, where losses from each head are balanced by dynamic weight average, and (ii) contextual ground truth (GT) sampling, where we improve conventional GT sampling techniques by leveraging semantic information to augment the point cloud with sampled GT objects. Our experiment with KITTI and nuScenes datasets confirms our proposed method's effectiveness in dealing with the data imbalance problem, producing better detection accuracy compared to existing approaches., Comment: 10 pages
Published: 2022

26. Robust Sound-Guided Image Manipulation

Author: Lee, Seung Hyun, Oh, Gyeongrok, Byeon, Wonmin, Yoon, Sang Ho, Kim, Jinkyu, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent successes suggest that an image can be manipulated by a text prompt, e.g., a landscape scene on a sunny day is manipulated into the same scene on a rainy day driven by a text input "raining". These approaches often utilize a StyleCLIP-based image generator, which leverages multi-modal (text and image) embedding space. However, we observe that such text inputs are often bottlenecked in providing and synthesizing rich semantic cues, e.g., differentiating heavy rain from rain with thunderstorms. To address this issue, we advocate leveraging an additional modality, sound, which has notable advantages in image manipulation as it can convey more diverse semantic cues (vivid emotions or dynamic expressions of the natural world) than texts. In this paper, we propose a novel approach that first extends the image-text joint embedding space with sound and applies a direct latent optimization method to manipulate a given image based on audio input, e.g., the sound of rain. Our extensive experiments show that our sound-guided image manipulation approach produces semantically and visually more plausible manipulation results than the state-of-the-art text and sound-guided image manipulation methods, which are further confirmed by our human evaluations. Our downstream task evaluations also show that our learned image-text-sound joint embedding space effectively encodes sound inputs., Comment: arXiv admin note: text overlap with arXiv:2112.00007
Published: 2022

27. Grounding Visual Representations with Texts for Domain Generalization

Author: Min, Seonwoo, Park, Nokyung, Kim, Siwon, Park, Seunghyun, and Kim, Jinkyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Reducing the representational discrepancy between source and target domains is a key component to maximize the model generalization. In this work, we advocate for leveraging natural language supervision for the domain generalization task. We introduce two modules to ground visual representations with texts containing typical reasoning of humans: (1) Visual and Textual Joint Embedder and (2) Textual Explanation Generator. The former learns the image-text joint embedding space where we can ground high-level class-discriminative information into the model. The latter leverages an explainable model and generates explanations justifying the rationale behind its decision. To the best of our knowledge, this is the first work to leverage the vision-and-language cross-modality approach for the domain generalization task. Our experiments with a newly created CUB-DG benchmark dataset demonstrate that cross-modality supervision can be successfully used to ground domain-invariant visual representations and improve the model generalization. Furthermore, in the large-scale DomainBed benchmark, our proposed method achieves state-of-the-art results and ranks 1st in average performance for five multi-domain datasets. The dataset and codes are available at https://github.com/mswzeus/GVRT., Comment: ECCV 2022; 25 pages (including Supplementary Materials); Updated related works
Published: 2022

28. Multi-Level Branched Regularization for Federated Learning

Author: Kim, Jinkyu, Kim, Geeho, and Han, Bohyung
Subjects: Computer Science - Machine Learning
Abstract: A critical challenge of federated learning is data heterogeneity and imbalance across clients, which leads to inconsistency between local networks and unstable convergence of global models. To alleviate the limitations, we propose a novel architectural regularization technique that constructs multiple auxiliary branches in each local model by grafting local and global subnetworks at several different levels and that learns the representations of the main pathway in the local model congruent to the auxiliary hybrid pathways via online knowledge distillation. The proposed technique is effective to robustify the global model even in the non-iid setting and is applicable to various federated learning frameworks conveniently without incurring extra communication costs. We perform comprehensive empirical studies and demonstrate remarkable performance gains in terms of accuracy and efficiency compared to existing methods. The source code is available at our project page., Comment: ICML 2022
Published: 2022

29. An Embedding-Dynamic Approach to Self-supervised Learning

Author: Moon, Suhong, Buracas, Domas, Park, Seunghyun, Kim, Jinkyu, and Canny, John
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: A number of recent self-supervised learning methods have shown impressive performance on image classification and other tasks. A somewhat bewildering variety of techniques have been used, not always with a clear understanding of the reasons for their benefits, especially when used in combination. Here we treat the embeddings of images as point particles and consider model optimization as a dynamic process on this system of particles. Our dynamic model combines an attractive force for similar images, a locally dispersive force to avoid local collapse, and a global dispersive force to achieve a globally-homogeneous distribution of particles. The dynamic perspective highlights the advantage of using a delayed-parameter image embedding (a la BYOL) together with multiple views of the same image. It also uses a purely-dynamic local dispersive force (Brownian motion) that shows improved performance over other methods and does not require knowledge of other particle coordinates. The method is called MSBReg, which stands for (i) a Multiview centroid loss, which applies an attractive force to pull different image view embeddings toward their centroid, (ii) a Singular value loss, which pushes the particle system toward spatially homogeneous density, and (iii) a Brownian diffusive loss. We evaluate downstream classification performance of MSBReg on ImageNet as well as transfer learning tasks including fine-grained classification, multi-class object classification, object detection, and instance segmentation. In addition, we show that applying our regularization term to other methods further improves their performance and stabilizes the training by preventing mode collapse., Comment: 24 pages, 3 figures, submitted to CVPR 2022
Published: 2022

30. ORA3D: Overlap Region Aware Multi-view 3D Object Detection

Author: Roh, Wonseok, Chang, Gyusam, Moon, Seokha, Nam, Giljoo, Kim, Chanyoung, Kim, Younghyun, Kim, Jinkyu, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Current multi-view 3D object detection methods often fail to detect objects in the overlap region properly, and the networks' understanding of the scene is often limited to that of a monocular detection network. Moreover, objects in the overlap region are often largely occluded or suffer from deformation due to camera distortion, causing a domain shift. To mitigate this issue, we propose using the following two main modules: (1) Stereo Disparity Estimation for Weak Depth Supervision and (2) Adversarial Overlap Region Discriminator. The former utilizes the traditional stereo disparity estimation method to obtain reliable disparity information from the overlap region. Given the disparity estimates as supervision, we propose regularizing the network to fully utilize the geometric potential of binocular images and improve the overall detection accuracy accordingly. Further, the latter module minimizes the representational gap between non-overlap and overlapping regions. We demonstrate the effectiveness of the proposed method with the nuScenes large-scale multi-view 3D object detection data. Our experiments show that our proposed method outperforms current state-of-the-art models, i.e., DETR3D and BEVDet., Comment: BMVC2022
Published: 2022

31. StopNet: Scalable Trajectory and Occupancy Prediction for Urban Autonomous Driving

Author: Kim, Jinkyu, Mahjourian, Reza, Ettinger, Scott, Bansal, Mayank, White, Brandyn, Sapp, Ben, and Anguelov, Dragomir
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce a motion forecasting (behavior prediction) method that meets the latency requirements for autonomous driving in dense urban environments without sacrificing accuracy. A whole-scene sparse input representation allows StopNet to scale to predicting trajectories for hundreds of road agents with reliable latency. In addition to predicting trajectories, our scene encoder lends itself to predicting whole-scene probabilistic occupancy grids, a complementary output representation suitable for busy urban environments. Occupancy grids allow the AV to reason collectively about the behavior of groups of agents without processing their individual trajectories. We demonstrate the effectiveness of our sparse input representation and our model in terms of computation and accuracy over three datasets. We further show that co-training consistent trajectory and occupancy predictions improves upon state-of-the-art performance under standard metrics.
Published: 2022

32. Sound-Guided Semantic Video Generation

Author: Lee, Seung Hyun, Oh, Gyeongrok, Byeon, Wonmin, Kim, Chanyoung, Ryoo, Won Jeong, Yoon, Sang Ho, Cho, Hyunjun, Bae, Jihyun, Kim, Jinkyu, and Kim, Sangpil
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The recent success in StyleGAN demonstrates that pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful due to the difficulty of determining the direction and magnitude in the StyleGAN latent space. In this paper, we propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space. As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate the CLIP-based multimodal embedding space to further provide the audio-visual relationships. Finally, the proposed frame generator learns to find the trajectory in the latent space which is coherent with the corresponding sound and generates a video in a hierarchical manner. We provide the new high-resolution landscape video dataset (audio-visual pair) for the sound-guided video generation task. The experiments show that our model outperforms the state-of-the-art methods in terms of video quality. We further show several applications including image and video editing to verify the effectiveness of our method.
Published: 2022

33. Occupancy Flow Fields for Motion Forecasting in Autonomous Driving

Author: Mahjourian, Reza, Kim, Jinkyu, Chai, Yuning, Tan, Mingxing, Sapp, Ben, and Anguelov, Dragomir
Subjects: Computer Science - Robotics, Computer Science - Machine Learning
Abstract: We propose Occupancy Flow Fields, a new representation for motion forecasting of multiple agents, an important task in autonomous driving. Our representation is a spatio-temporal grid with each grid cell containing both the probability of the cell being occupied by any agent, and a two-dimensional flow vector representing the direction and magnitude of the motion in that cell. Our method successfully mitigates shortcomings of the two most commonly-used representations for motion forecasting: trajectory sets and occupancy grids. Although occupancy grids efficiently represent the probabilistic location of many agents jointly, they do not capture agent motion and lose the agent identities. To this end, we propose a deep learning architecture that generates Occupancy Flow Fields with the help of a new flow trace loss that establishes consistency between the occupancy and flow predictions. We demonstrate the effectiveness of our approach using three metrics on occupancy prediction, motion estimation, and agent ID recovery. In addition, we introduce the problem of predicting speculative agents, which are currently-occluded agents that may appear in the future through dis-occlusion or by entering the field of view. We report experimental results on a large in-house autonomous driving dataset and the public INTERACTION dataset, and show that our model outperforms state-of-the-art models.
Published: 2022

34. Communication-Efficient Federated Learning with Accelerated Client Gradient

Author: Kim, Geeho, Kim, Jinkyu, and Han, Bohyung
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Federated learning often suffers from slow and unstable convergence due to the heterogeneous characteristics of participating client datasets. Such a tendency is aggravated when the client participation ratio is low since the information collected from the clients has large variations. To address this challenge, we propose a simple but effective federated learning framework, which improves the consistency across clients and facilitates the convergence of the server model. This is achieved by making the server broadcast a global model with a lookahead gradient. This strategy enables the proposed approach to convey the projected global update information to participants effectively without additional client memory and extra communication costs. We also regularize local updates by aligning each client with the overshot global model to reduce bias and improve the stability of our algorithm. We provide the theoretical convergence rate of our algorithm and demonstrate remarkable performance gains in terms of accuracy and communication efficiency compared to the state-of-the-art methods, especially with low client participation rates. The source code is available at our project page., Comment: CVPR 2024
Published: 2022

35. Sound-Guided Semantic Image Manipulation

Author: Lee, Seung Hyun, Roh, Wonseok, Byeon, Wonmin, Yoon, Sang Ho, Kim, Chan Young, Kim, Jinkyu, and Kim, Sangpil
Subjects: Computer Science - Graphics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The recent success of the generative model shows that leveraging the multi-modal embedding space can manipulate an image using text information. However, manipulating an image with sources other than text, such as sound, is not easy due to the dynamic characteristics of the sources. In particular, sound can convey vivid emotions and dynamic expressions of the real world. Here, we propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space. Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the multi-modal embedding space. We use a direct latent optimization method based on aligned embeddings for sound-guided image manipulation. We verify the effectiveness of our sound-guided image manipulation quantitatively and qualitatively. We also show that our method can mix different modalities, i.e., text and audio, which enriches the variety of the image modification. The experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text and sound-guided state-of-the-art methods.
Published: 2021

36. A Scenario-Based Platform for Testing Autonomous Vehicle Behavior Prediction Models in Simulation

Author: Indaheng, Francis, Kim, Edward, Viswanadha, Kesav, Shenoy, Jay, Kim, Jinkyu, Fremont, Daniel J., and Seshia, Sanjit A.
Subjects: Computer Science - Artificial Intelligence
Abstract: Behavior prediction remains one of the most challenging tasks in the autonomous vehicle (AV) software stack. Forecasting the future trajectories of nearby agents plays a critical role in ensuring road safety, as it equips AVs with the necessary information to plan safe routes of travel. However, these prediction models are data-driven and trained on data collected in real life that may not represent the full range of scenarios an AV can encounter. Hence, it is important that these prediction models are extensively tested in various test scenarios involving interactive behaviors prior to deployment. To support this need, we present a simulation-based testing platform which supports (1) intuitive scenario modeling with a probabilistic programming language called Scenic, (2) specifying a multi-objective evaluation metric with a partial priority ordering, (3) falsification of the provided metric, and (4) parallelization of simulations for scalable testing. As a part of the platform, we provide a library of 25 Scenic programs that model challenging test scenarios involving interactive traffic participant behaviors. We demonstrate the effectiveness and the scalability of our platform by testing a trained behavior prediction model and searching for failure scenarios., Comment: Accepted to the NeurIPS 2021 Workshop on Machine Learning for Autonomous Driving
Published: 2021

37. Robust sound-guided image manipulation

Author: Lee, Seung Hyun, Chi, Hyung-gun, Oh, Gyeongrok, Byeon, Wonmin, Yoon, Sang Ho, Park, Hyunje, Cho, Wonjun, Kim, Jinkyu, and Kim, Sangpil
Published: 2024

38. SelfReg: Self-supervised Contrastive Regularization for Domain Generalization

Author: Kim, Daehee, Park, Seunghyun, Kim, Jinkyu, and Lee, Jaekoo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, I.5
Abstract: In general, an experimental environment for deep learning assumes that the training and the test dataset are sampled from the same distribution. However, in real-world situations, a difference in the distribution between two datasets, domain shift, may occur, which becomes a major factor impeding the generalization performance of the model. The research field to solve this problem is called domain generalization, and it alleviates the domain shift problem by extracting domain-invariant features explicitly or implicitly. In recent studies, contrastive learning-based domain generalization approaches have been proposed and achieved high performance. These approaches require sampling of the negative data pair. However, the performance of contrastive learning fundamentally depends on the quality and quantity of negative data pairs. To address this issue, we propose a new regularization method for domain generalization based on contrastive learning, self-supervised contrastive regularization (SelfReg). The proposed approach uses only positive data pairs and thus resolves various problems caused by negative pair sampling. Moreover, we propose a class-specific domain perturbation layer (CDPL), which makes it possible to effectively apply mixup augmentation even when only positive data pairs are used. The experimental results show that the techniques incorporated by SelfReg contributed to the performance in a compatible manner. In the recent benchmark, DomainBed, the proposed method shows comparable performance to the conventional state-of-the-art alternatives. Codes are available at https://github.com/dnap512/SelfReg., Comment: 14 pages
Published: 2021

39. Attentional Bottleneck: Towards an Interpretable Deep Driving Network

Author: Kim, Jinkyu and Bansal, Mayank
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Deep neural networks are a key component of behavior prediction and motion generation for self-driving cars. One of their main drawbacks is a lack of transparency: they should provide easy to interpret rationales for what triggers certain behaviors. We propose an architecture called Attentional Bottleneck with the goal of improving transparency. Our key idea is to combine visual attention, which identifies what aspects of the input the model is using, with an information bottleneck that enables the model to only use aspects of the input which are important. This not only provides sparse and interpretable attention maps (e.g. focusing only on specific vehicles in the scene), but it adds this transparency at no cost to model accuracy. In fact, we find slight improvements in accuracy when applying Attentional Bottleneck to the ChauffeurNet model, whereas we find that the accuracy deteriorates with a traditional visual attention model.
Published: 2020

40. Grounding Human-to-Vehicle Advice for Self-driving Vehicles

Author: Kim, Jinkyu, Misu, Teruhisa, Chen, Yi-Ting, Tawari, Ashish, and Canny, John
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent success suggests that deep neural control networks are likely to be a key component of self-driving vehicles. These networks are trained on large datasets to imitate human actions, but they lack semantic understanding of image contents. This makes them brittle and potentially unsafe in situations that do not match training data. Here, we propose to address this issue by augmenting training data with natural language advice from a human. Advice includes guidance about what to do and where to attend. We present the first step toward advice giving, where we train an end-to-end vehicle controller that accepts advice. The controller adapts the way it attends to the scene (visual attention) and the control (steering and speed). Attention mechanisms tie controller behavior to salient objects in the advice. We evaluate our model on a novel advisable driving dataset with manually annotated human-to-vehicle advice called Honda Research Institute-Advice Dataset (HAD). We show that taking advice improves the performance of the end-to-end network, while the network cues on a variety of visual features that are provided by advice. The dataset is available at https://usa.honda-ri.com/HAD., Comment: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019
Published: 2019

41. HATS: A Hierarchical Graph Attention Network for Stock Movement Prediction

Author: Kim, Raehyun, So, Chan Ho, Jeong, Minbyul, Lee, Sanghoon, Kim, Jinkyu, and Kang, Jaewoo
Subjects: Quantitative Finance - Statistical Finance, Computer Science - Artificial Intelligence, Computer Science - Computational Engineering, Finance, and Science
Abstract: Many researchers both in academia and industry have long been interested in the stock market. Numerous approaches were developed to accurately predict future trends in stock prices. Recently, there has been a growing interest in utilizing graph-structured data in computer science research communities. Methods that use relational data for stock market prediction have been recently proposed, but they are still in their infancy. First, the quality of collected information from different types of relations can vary considerably. No existing work has focused on the effect of using different types of relations on stock market prediction or finding an effective way to selectively aggregate information on different relation types. Furthermore, existing works have focused on only individual stock prediction which is similar to the node classification task. To address this, we propose a hierarchical attention network for stock prediction (HATS) which uses relational data for stock market prediction. Our HATS method selectively aggregates information on different relation types and adds the information to the representations of each company. Specifically, node representations are initialized with features extracted from a feature extraction module. HATS is used as a relational modeling module with initialized node representations. Then, node representations with the added information are fed into a task-specific layer. Our method is used for predicting not only individual stock prices but also market index movements, which is similar to the graph classification task. The experimental results show that performance can change depending on the relational data used. HATS which can automatically select information outperformed all the existing methods.
Published: 2019

42. Periphery-Fovea Multi-Resolution Driving Model guided by Human Attention

Author: Xia, Ye, Kim, Jinkyu, Canny, John, Zipser, Karl, and Whitney, David
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Inspired by human vision, we propose a new periphery-fovea multi-resolution driving model that predicts vehicle speed from dash camera videos. The peripheral vision module of the model processes the full video frames in low resolution. Its foveal vision module selects sub-regions and uses high-resolution input from those regions to improve its driving performance. We train the fovea selection module with supervision from driver gaze. We show that adding high-resolution input from predicted human driver gaze locations significantly improves the driving accuracy of the model. Our periphery-fovea multi-resolution model outperforms a uni-resolution periphery-only model that has the same amount of floating-point operations. More importantly, we demonstrate that our driving model achieves a significantly higher performance gain in pedestrian-involved critical situations than in other non-critical situations.
Published: 2019

43. Extended framework of Hamilton's principle applied to Duffing oscillation

Author: Kim, Jinkyu, Lee, Hyeonseok, and Shin, Jinwon
Subjects: Computer Science - Numerical Analysis, Computer Science - Computational Engineering, Finance, and Science, Physics - Computational Physics
Abstract: The paper begins with a novel variational formulation of the Duffing equation using the extended framework of Hamilton's principle (EHP). This formulation properly accounts for initial conditions, and it recovers all the governing differential equations as its Euler-Lagrange equation. Thus, it provides an elegant structure for the development of versatile temporal finite element methods. Herein, the simplest temporal finite element method is presented by adopting linear temporal shape functions. Numerical examples are included to verify and investigate the performance of the non-iterative algorithm in the developed method.
Published: 2019

44. Bridging the Domain Gap Towards Generalization in Automatic Colorization

Author: Lee, Hyejin, Kim, Daehee, Lee, Daeun, Kim, Jinkyu, and Lee, Jaekoo; Founding Editors: Goos, Gerhard and Hartmanis, Juris; Editorial Board Members: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti; Editors: Avidan, Shai, Brostow, Gabriel, Cissé, Moustapha, Farinella, Giovanni Maria, and Hassner, Tal
Published: 2022

45. Textual Explanations for Self-Driving Vehicles

Author: Kim, Jinkyu, Rohrbach, Anna, Darrell, Trevor, Canny, John, and Akata, Zeynep
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep neural perception and control networks have become key components of self-driving vehicles. User acceptance is likely to benefit from easy-to-interpret textual explanations which allow end-users to understand what triggered a particular behavior. Explanations may be triggered by the neural controller, namely introspective explanations, or informed by the neural controller's output, namely rationalizations. We propose a new approach to introspective explanations which consists of two parts. First, we use a visual (spatial) attention model to train a convolutional network end-to-end from images to the vehicle control commands, i.e., acceleration and change of course. The controller's attention identifies image regions that potentially influence the network's output. Second, we use an attention-based video-to-text model to produce textual explanations of model actions. The attention maps of controller and explanation model are aligned so that explanations are grounded in the parts of the scene that mattered to the controller. We explore two approaches to attention alignment, strong- and weak-alignment. Finally, we explore a version of our model that generates rationalizations, and compare with introspective explanations on the same video segments. We evaluate these models on a novel driving dataset with ground-truth human explanations, the Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at https://github.com/JinkyuKimUCB/explainable-deep-driving., Comment: Accepted to ECCV 2018
Published: 2018

46. Predicting Driver Attention in Critical Situations

Author: Xia, Ye, Zhang, Danqing, Kim, Jinkyu, Nakayama, Ken, Zipser, Karl, and Whitney, David
Subjects: Clinical Research, Behavioral and Social Science, Basic Behavioral and Social Science, Good Health and Well Being, Driver attention prediction, BDD-A dataset, Berkeley DeepDrive, cs.CV, Artificial Intelligence & Image Processing
Abstract: Robust driver attention prediction for critical situations is a challenging computer vision problem, yet essential for autonomous driving. Because critical driving moments are so rare, collecting enough data for these situations is difficult with the conventional in-car data collection protocol—tracking eye movements during driving. Here, we first propose a new in-lab driver attention collection protocol and introduce a new driver attention dataset, Berkeley DeepDrive Attention (BDD-A) dataset, which is built upon braking event videos selected from a large-scale, crowd-sourced driving video dataset. We further propose Human Weighted Sampling (HWS) method, which uses human gaze behavior to identify crucial frames of a driving dataset and weights them heavily during model training. With our dataset and HWS, we built a driver attention prediction model that outperforms the state-of-the-art and demonstrates sophisticated behaviors, like attending to crossing pedestrians but not giving false alarms to pedestrians safely walking on the sidewalk. Its prediction results are nearly indistinguishable from ground-truth to humans. Although only being trained with our in-lab attention data, the model also predicts in-car driver attention data of routine driving with state-of-the-art accuracy. This result not only demonstrates the performance of our model but also proves the validity and usefulness of our dataset and data collection protocol.
Published: 2019

47. Predicting Driver Attention in Critical Situations

Author: Xia, Ye, Zhang, Danqing, Kim, Jinkyu, Nakayama, Ken, Zipser, Karl, and Whitney, David
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Robust driver attention prediction for critical situations is a challenging computer vision problem, yet essential for autonomous driving. Because critical driving moments are so rare, collecting enough data for these situations is difficult with the conventional in-car data collection protocol (tracking eye movements during driving). Here, we first propose a new in-lab driver attention collection protocol and introduce a new driver attention dataset, Berkeley DeepDrive Attention (BDD-A) dataset, which is built upon braking event videos selected from a large-scale, crowd-sourced driving video dataset. We further propose Human Weighted Sampling (HWS) method, which uses human gaze behavior to identify crucial frames of a driving dataset and weights them heavily during model training. With our dataset and HWS, we built a driver attention prediction model that outperforms the state-of-the-art and demonstrates sophisticated behaviors, like attending to crossing pedestrians but not giving false alarms to pedestrians safely walking on the sidewalk. Its prediction results are nearly indistinguishable from ground-truth to humans. Although only being trained with our in-lab attention data, the model also predicts in-car driver attention data of routine driving with state-of-the-art accuracy. This result not only demonstrates the performance of our model but also proves the validity and usefulness of our dataset and data collection protocol., Comment: ACCV 2018
Published: 2017

48. Interpretable Learning for Self-Driving Cars by Visualizing Causal Attention

Author: Kim, Jinkyu and Canny, John
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: Deep neural perception and control networks are likely to be a key component of self-driving vehicles. These models need to be explainable - they should provide easy-to-interpret rationales for their behavior - so that passengers, insurance companies, law enforcement, developers etc., can understand what triggered a particular behavior. Here we explore the use of visual explanations. These explanations take the form of real-time highlighted regions of an image that causally influence the network's output (steering control). Our approach is two-stage. In the first stage, we use a visual attention model to train a convolution network end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network's output. Some of these are true influences, but some are spurious. We then apply a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and more accurately exposes the network's behavior. We demonstrate the effectiveness of our model on three datasets totaling 16 hours of driving. We first show that training with attention does not degrade the performance of the end-to-end network. Then we show that the network causally cues on a variety of features that are used by humans while driving.
Published: 2017

49. Study on the Estimation Method of Wind Resistance Considering Self-Induced Wind by Ship Advance Speed.

Author: Park, Hyounggil, Lee, Pyungkuk, Kim, Jinkyu, Kim, Heejung, Lee, Heedong, and Lee, Youngchul
Subjects: SHIP resistance, WIND tunnel testing, WIND speed, WIND pressure, CONTAINER ships
Abstract: A numerical analysis of the wind load for the purpose of evaluating the wind resistance acting on a ship and the validity of the wind profile applied to determine the wind load coefficient were conducted. Through the evaluation of estimation results by a wind tunnel test, CFD analysis, and present semi-empirical formulae, it was recognized that the difference in estimation of ship resistance due to wind could not be ignored. In order to identify the main causes of the difference, extensive analyses were performed for a container, tanker, and LNG carrier. In particular, the estimation results for a container ship with two islands showed unreliable results. The main reason for the difference is that each method reflects the wind speed in the vertical direction differently, and the wind profile applied when considering the self-induced wind effect is not a uniform wind profile. In the calculation of wind resistance by self-induced wind, wind resistance estimation results differed by about 1.5% to 3.4% depending on the application of uniform or non-uniform wind profile. The total wind resistance acting on the vessel shall be divided into wind resistance from a stationary vessel without speed and wind resistance caused by the forward speed of the vessel in no wind conditions. Therefore, it is reasonable to apply a uniform wind profile to estimate wind resistance caused by the ship's forward speed, while a wind profile that reflects the change in wind speed in the vertical direction should be applied to estimate the wind resistance acting on the stationary vessel. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

50. Preliminary study on a composite steel slit damper

Author: Kim, Jinkyu, Kim, Min-Cheol, and Kim, Dong-Keon
Published: 2021
Full Text: View/download PDF
