Author: "Sato, Yoichi" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Sato, Yoichi"' showing total 2,200 results

1. Pre-Training for 3D Hand Pose Estimation with Contrastive Learning on Large-Scale Hand Images in the Wild

Author: Lin, Nie, Ohkawa, Takehiko, Zhang, Mingfang, Huang, Yifei, Furuta, Ryosuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a contrastive learning framework based on in-the-wild hand images tailored for pre-training 3D hand pose estimators, dubbed HandCLR. Pre-training on large-scale images achieves promising results in various tasks, but prior 3D hand pose pre-training methods have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our method with contrastive learning. Specifically, we collected over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands, namely pairs of similar hand poses originating from different samples, and propose a novel contrastive learning method that embeds similar hand pairs closer in the latent space. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method in various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands.
Comment: HANDS@ECCV24 (Extended Abstracts)
Published: 2024

2. WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Author: Kong, Quan, Kawana, Yuki, Saini, Rajat, Kumar, Ashutosh, Pan, Jingjing, Gu, Ta, Ozao, Yohei, Opra, Balazs, Anastasiu, David C., Sato, Yoichi, and Kobori, Norimasa
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we address the challenge of fine-grained video event understanding in traffic scenarios, vital for autonomous driving and safety. Traditional datasets focus on driver or vehicle behavior, often neglecting pedestrian perspectives. To fill this gap, we introduce the WTS dataset, highlighting detailed behaviors of both vehicles and pedestrians across over 1.2k video events in hundreds of traffic scenarios. WTS integrates diverse perspectives from vehicle ego and fixed overhead cameras in a vehicle-infrastructure cooperative environment, enriched with comprehensive textual descriptions and unique 3D Gaze data for a synchronized 2D/3D view, focusing on pedestrian analysis. We also provide annotations for 5k publicly sourced pedestrian-related traffic videos. Additionally, we introduce LLMScorer, an LLM-based evaluation metric to align inference captions with ground truth. Using WTS, we establish a benchmark for dense video-to-text tasks, exploring state-of-the-art Vision-Language Models with an instance-aware VideoLLM method as a baseline. WTS aims to advance fine-grained video event understanding, enhancing traffic safety and autonomous driving development.
Comment: ECCV24. Website: https://woven-visionai.github.io/wts-dataset-homepage/
Published: 2024

3. ActionVOS: Actions as Prompts for Video Object Segmentation

Author: Ouyang, Liangyang, Liu, Ruicong, Huang, Yifei, Furuta, Ryosuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Delving into the realm of egocentric vision, the advancement of referring video object segmentation (RVOS) stands as pivotal in understanding human activities. However, the existing RVOS task primarily relies on static attributes such as object names to segment target objects, posing challenges in distinguishing target objects from background objects and in identifying objects undergoing state changes. To address these problems, this work proposes a novel action-aware RVOS setting called ActionVOS, aiming at segmenting only active objects in egocentric videos using human actions as a key language prompt. This is because human actions precisely describe the behavior of humans, thereby helping to identify the objects truly involved in the interaction and to understand possible state changes. We also build a method tailored to work under this specific setting. Specifically, we develop an action-aware labeling module with an efficient action-guided focal loss. Such designs enable the ActionVOS model to prioritize active objects with existing readily-available annotations. Experimental results on the VISOR dataset reveal that ActionVOS significantly reduces the mis-segmentation of inactive objects, confirming that actions help the ActionVOS model understand objects' involvement. Further evaluations on the VOST and VSCOS datasets show that the novel ActionVOS setting enhances segmentation performance when encountering challenging circumstances involving object state changes. We will make our implementation available at https://github.com/ut-vision/ActionVOS.
Comment: This paper is accepted by ECCV2024. Code will be released at https://github.com/ut-vision/ActionVOS
Published: 2024

4. Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

Author: Zhang, Mingfang, Huang, Yifei, Liu, Ruicong, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable to help egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video. Due to the scarcity of labeled multimodal data, we design an MAE-based self-supervised pretraining method, obtaining strong multi-modal representations via modeling the natural correlation between visual and motion signals. To model the complex relation of multiple IMU devices placed across the body, we exploit the collaborative dynamics in multiple IMU devices and propose to embed the relative motion features of human joints into a graph structure. Experiments show our method can achieve state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated by experiments in more challenging scenarios, including partially missing IMU devices and video quality corruption, promoting more flexible usages in the real world.
Comment: ECCV 2024
Published: 2024

5. Learning Object States from Actions via Large Language Models

Author: Tateno, Masatoshi, Yagi, Takuma, Furuta, Ryosuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporally localizing the presence of object states in videos is crucial in understanding human activities beyond actions and objects. This task has suffered from a lack of training data due to object states' inherent ambiguity and variety. To avoid exhaustive annotation, learning from transcribed narrations in instructional videos would be intriguing. However, object states are less described in narrations compared to actions, making them less effective. In this work, we propose to extract the object state information from action information included in narrations, using large language models (LLMs). Our observation is that LLMs include world knowledge on the relationship between actions and their resulting object states, and can infer the presence of object states from past action sequences. The proposed LLM-based framework offers flexibility to generate plausible pseudo-object state labels against arbitrary categories. We evaluate our method with our newly collected Multiple Object States Transition (MOST) dataset including dense temporal annotation of 60 object state categories. Our model trained by the generated pseudo-labels demonstrates significant improvement of over 29% in mAP against strong zero-shot vision-language models, showing the effectiveness of explicitly extracting object state information from actions through LLMs.
Comment: 19 pages of main content, 24 pages of supplementary material
Published: 2024

6. Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

Author: Fan, Zicong, Ohkawa, Takehiko, Yang, Linlin, Lin, Nie, Zhou, Zhishan, Zhou, Shihao, Liang, Jiajun, Gao, Zhong, Zhang, Xuanyang, Zhang, Xue, Li, Fei, Liu, Zheng, Lu, Feng, Zeid, Karim Abou, Leibe, Bastian, On, Jeongwan, Baek, Seungryul, Prakash, Aditya, Gupta, Saurabh, He, Kun, Sato, Yoichi, Hilliges, Otmar, Chang, Hyung Jin, and Yao, Angela
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We interact with the world with our hands and see it through our own (egocentric) perspective. A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. Accurately reconstructing such interactions in 3D is challenging due to heavy occlusion, viewpoint bias, camera distortion, and motion blur from the head movement. To this end, we designed the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks. Our analysis demonstrates the effectiveness of addressing distortion specific to egocentric cameras, adopting high-capacity transformers to learn complex hand-object interactions, and fusing predictions from different views. Our study further reveals challenging scenarios intractable with state-of-the-art methods, such as fast hand motion, object reconstruction from narrow egocentric views, and close contact between two hands and objects. Our efforts will enrich the community's knowledge foundation and facilitate future hand studies on egocentric hand-object interactions.
Comment: Accepted to ECCV 2024
Published: 2024

7. Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation

Author: Liu, Ruicong, Ohkawa, Takehiko, Zhang, Mingfang, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The pursuit of accurate 3D hand pose estimation stands as a keystone for understanding human activity in the realm of egocentric vision. The majority of existing estimation methods still rely on single-view images as input, leading to potential limitations, e.g., limited field-of-view and ambiguity in depth. To address these problems, adding another camera to better capture the shape of hands is a practical direction. However, existing multi-view hand pose estimation methods suffer from two main drawbacks: 1) Requiring multi-view annotations for training, which are expensive. 2) During testing, the model becomes inapplicable if camera parameters/layout are not the same as those used in training. In this paper, we propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views. Compared with existing multi-view training methods, 1) our adaptation process is unsupervised, eliminating the need for multi-view annotation. 2) Moreover, our method can handle arbitrary dual-view pairs with unknown camera parameters, making the model applicable to diverse camera settings. Specifically, S2DHand is built on certain stereo constraints, including pair-wise cross-view consensus and invariance of transformation between both views. These two stereo constraints are used in a complementary manner to generate pseudo-labels, allowing reliable adaptation. Evaluation results reveal that S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings, and outperforms existing adaptation methods with leading performance. Project page: https://github.com/MickeyLLG/S2DHand.
Comment: This paper is accepted by CVPR2024. Code will be released at https://github.com/ut-vision/S2DHand
Published: 2024

8. Simultaneous control of head pose and expressions in 3D facial keypoint-based GAN

Author: Hatakeyama, Tomoyuki, Furuta, Ryosuke, and Sato, Yoichi
Published: 2024
Full Text: View/download PDF

9. Matching Compound Prototypes for Few-Shot Action Recognition

Author: Huang, Yifei, Yang, Lijin, Chen, Guo, Zhang, Hongjie, Lu, Feng, and Sato, Yoichi
Published: 2024
Full Text: View/download PDF

10. FineBio: A Fine-Grained Video Dataset of Biological Experiments with Hierarchical Annotation

Author: Yagi, Takuma, Ohashi, Misaki, Huang, Yifei, Furuta, Ryosuke, Adachi, Shungo, Mitsuyama, Toutai, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In the development of science, accurate and reproducible documentation of the experimental process is crucial. Automatic recognition of the actions in experiments from videos would help experimenters by complementing the recording of experiments. Towards this goal, we propose FineBio, a new fine-grained video dataset of people performing biological experiments. The dataset consists of multi-view videos of 32 participants performing mock biological experiments with a total duration of 14.5 hours. One experiment forms a hierarchical structure, where a protocol consists of several steps, each further decomposed into a set of atomic operations. The uniqueness of biological experiments is that while they require strict adherence to steps described in each protocol, there is freedom in the order of atomic operations. We provide hierarchical annotation on protocols, steps, atomic operations, object locations, and their manipulation states, providing new challenges for structured activity understanding and hand-object interaction recognition. To identify challenges in activity understanding in biological experiments, we introduce baseline models and results on four different tasks, including (i) step segmentation, (ii) atomic operation detection, (iii) object detection, and (iv) manipulated/affected object detection. Dataset and code are available from https://github.com/aistairc/FineBio.
Published: 2024

11. Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

Author: Fan, Zicong, Ohkawa, Takehiko, Yang, Linlin, Lin, Nie, Zhou, Zhishan, Zhou, Shihao, Liang, Jiajun, Gao, Zhong, Zhang, Xuanyang, Zhang, Xue, Li, Fei, Liu, Zheng, Lu, Feng, Zeid, Karim Abou, Leibe, Bastian, On, Jeongwan, Baek, Seungryul, Prakash, Aditya, Gupta, Saurabh, He, Kun, Sato, Yoichi, Hilliges, Otmar, Chang, Hyung Jin, Yao, Angela, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

12. Masked Video and Body-Worn IMU Autoencoder for Egocentric Action Recognition

Author: Zhang, Mingfang, Huang, Yifei, Liu, Ruicong, Sato, Yoichi, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

13. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Author: Grauman, Kristen, Westbury, Andrew, Torresani, Lorenzo, Kitani, Kris, Malik, Jitendra, Afouras, Triantafyllos, Ashutosh, Kumar, Baiyya, Vijay, Bansal, Siddhant, Boote, Bikram, Byrne, Eugene, Chavis, Zach, Chen, Joya, Cheng, Feng, Chu, Fu-Jen, Crane, Sean, Dasgupta, Avijit, Dong, Jing, Escobar, Maria, Forigua, Cristhian, Gebreselasie, Abrham, Haresh, Sanjay, Huang, Jing, Islam, Md Mohaiminul, Jain, Suyog, Khirodkar, Rawal, Kukreja, Devansh, Liang, Kevin J, Liu, Jia-Wei, Majumder, Sagnik, Mao, Yongsen, Martin, Miguel, Mavroudi, Effrosyni, Nagarajan, Tushar, Ragusa, Francesco, Ramakrishnan, Santhosh Kumar, Seminara, Luigi, Somayazulu, Arjun, Song, Yale, Su, Shan, Xue, Zihui, Zhang, Edward, Zhang, Jinxu, Castillo, Angela, Chen, Changan, Fu, Xinzhu, Furuta, Ryosuke, Gonzalez, Cristina, Gupta, Prince, Hu, Jiabo, Huang, Yifei, Huang, Yiming, Khoo, Weslie, Kumar, Anush, Kuo, Robert, Lakhavani, Sach, Liu, Miao, Luo, Mi, Luo, Zhengyi, Meredith, Brighid, Miller, Austin, Oguntola, Oluwatumininu, Pan, Xiaqing, Peng, Penny, Pramanick, Shraman, Ramazanova, Merey, Ryan, Fiona, Shan, Wei, Somasundaram, Kiran, Song, Chenan, Southerland, Audrey, Tateno, Masatoshi, Wang, Huiyu, Wang, Yuchen, Yagi, Takuma, Yan, Mingfei, Yang, Xitong, Yu, Zecheng, Zha, Shengxin Cindy, Zhao, Chen, Zhao, Ziwei, Zhu, Zhifan, Zhuo, Jeff, Arbelaez, Pablo, Bertasius, Gedas, Crandall, David, Damen, Dima, Engel, Jakob, Farinella, Giovanni Maria, Furnari, Antonino, Ghanem, Bernard, Hoffman, Judy, Jawahar, C. V., Newcombe, Richard, Park, Hyun Soo, Rehg, James M., Sato, Yoichi, Savva, Manolis, Shi, Jianbo, Shou, Mike Zheng, and Wray, Michael
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/
Comment: Expanded manuscript (compared to arXiv v1 from Nov 2023 and CVPR 2024 paper from June 2024) for more comprehensive dataset and benchmark presentation, plus new results on v2 data release
Published: 2023

14. Generative Hierarchical Temporal Transformer for Hand Pose and Action Modeling

Author: Wen, Yilin, Pan, Hao, Ohkawa, Takehiko, Yang, Lei, Pan, Jia, Sato, Yoichi, Komura, Taku, and Wang, Wenping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a novel unified framework that concurrently tackles recognition and future prediction for human hand pose and action modeling. Previous works generally provide isolated solutions for either recognition or prediction, which not only increases the complexity of integration in practical applications, but more importantly, cannot exploit the synergy of both sides and suffer suboptimal performances in their respective domains. To address this problem, we propose a generative Transformer VAE architecture to model hand pose and action, where the encoder and decoder capture recognition and prediction respectively, and their connection through the VAE bottleneck mandates the learning of consistent hand motion from the past to the future and vice versa. Furthermore, to faithfully model the semantic dependency and different temporal granularity of hand pose and action, we decompose the framework into two cascaded VAE blocks: the first and latter blocks respectively model the short-span poses and long-span action, and are connected by a mid-level feature representing a sub-second series of hand poses. This decomposition into block cascades facilitates capturing both short-term and long-term temporal regularity in pose and action modeling, and enables training two blocks separately to fully utilize datasets with annotations of different temporal granularities. We train and evaluate our framework across multiple datasets; results show that our joint modeling of recognition and prediction improves over isolated solutions, and that our semantic and temporal hierarchy facilitates long-term pose and action modeling.
Comment: Accepted by ECCV HANDS Workshop 2024
Published: 2023

15. Seeking Flat Minima with Mean Teacher on Semi- and Weakly-Supervised Domain Generalization for Object Detection

Author: Furuta, Ryosuke and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Object detectors do not work well when domains largely differ between training and testing data. To overcome this domain gap in object detection without requiring expensive annotations, we consider two problem settings: semi-supervised domain generalizable object detection (SS-DGOD) and weakly-supervised DGOD (WS-DGOD). In contrast to the conventional domain generalization for object detection that requires labeled data from multiple domains, SS-DGOD and WS-DGOD require labeled data only from one domain and unlabeled or weakly-labeled data from multiple domains for training. In this paper, we show that object detectors can be effectively trained on the two settings with the same Mean Teacher learning framework, where a student network is trained with pseudo-labels output from a teacher on the unlabeled or weakly-labeled data. We provide novel interpretations of why the Mean Teacher learning framework works well on the two settings in terms of the relationships between the generalization gap and flat minima in parameter space. On the basis of the interpretations, we also show that incorporating a simple regularization method into the Mean Teacher learning framework leads to flatter minima. The experimental results demonstrate that the regularization leads to flatter minima and boosts the performance of the detectors trained with the Mean Teacher learning framework on the two settings.
Published: 2023

16. Image Cropping under Design Constraints

Author: Nishiyasu, Takumi, Shimoda, Wataru, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image cropping is essential in image editing for obtaining a compositionally enhanced image. In display media, image cropping is a prospective technique for automatically creating media content. However, image cropping for media content is often required to satisfy various constraints, such as an aspect ratio and blank regions for placing texts or objects. We call this problem image cropping under design constraints. To achieve image cropping under design constraints, we propose a score function-based approach, which computes scores indicating whether cropped results are aesthetically plausible and satisfy design constraints. We explore two derived approaches, a proposal-based approach and a heatmap-based approach, and we construct a dataset for evaluating the performance of the proposed approaches on image cropping under design constraints. In experiments, we demonstrate that the proposed approaches outperform a baseline, and we observe that the proposal-based approach is better than the heatmap-based approach under the same computation cost, but the heatmap-based approach leads to better scores by increasing computation cost. The experimental results indicate that balancing aesthetically plausible regions and satisfying design constraints is not a trivial problem and requires sensitive balance, and both proposed approaches are reasonable alternatives.
Comment: ACMMM Asia accepted
Published: 2023
Full Text: View/download PDF

17. Proposal-based Temporal Action Localization with Point-level Supervision

Author: Yin, Yuan, Huang, Yifei, Furuta, Ryosuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in training data. Without temporal annotations, most previous works adopt the multiple instance learning (MIL) framework, where the input video is segmented into non-overlapped short snippets, and action classification is performed independently on every short snippet. We argue that the MIL framework is suboptimal for PTAL because it operates on separated short snippets that contain limited temporal information. Therefore, the classifier only focuses on several easy-to-distinguish snippets instead of discovering the whole action instance without missing any relevant snippets. To alleviate this problem, we propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration that involve more comprehensive temporal information. Moreover, we introduce an efficient clustering algorithm to generate dense pseudo labels that provide stronger supervision, and a fine-grained contrastive loss to further refine the quality of pseudo labels. Experiments show that our proposed method achieves performance competitive with or superior to the state-of-the-art methods and some fully-supervised methods on four benchmarks: ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
Comment: BMVC 2023
Published: 2023

18. Correction: Effect of walkability on the physical activity of hemodialysis patients: a multicenter study

Author: Sato, Yoichi, Usui, Naoto, Abe, Yoshifumi, Okamura, Daisuke, Kuramochi, Yota, Kojima, Sho, Shinozaki, Nobuto, Shimano, Yu, Shirai, Nobuyuki, Mikami, Kenta, Yamada, Yoji, and Saitoh, Masakazu
Published: 2024
Full Text: View/download PDF

19. Effect of walkability on the physical activity of hemodialysis patients: a multicenter study

Author: Sato, Yoichi, Usui, Naoto, Abe, Yoshifumi, Okamura, Daisuke, Kuramochi, Yota, Kojima, Sho, Shinozaki, Nobuto, Shimano, Yu, Shirai, Nobuyuki, Mikami, Kenta, Yamada, Yoji, and Saitoh, Masakazu
Published: 2024
Full Text: View/download PDF

20. External validation of a deep learning model for predicting bone mineral density on chest radiographs

Author: Asamoto, Takamune, Takegami, Yasuhiko, Sato, Yoichi, Takahara, Shunsuke, Yamamoto, Norio, Inagaki, Naoya, Maki, Satoshi, Saito, Mitsuru, and Imagama, Shiro
Published: 2024
Full Text: View/download PDF

21. Surgical tool classification and localization: results and methods from the MICCAI 2022 SurgToolLoc challenge

Author: Zia, Aneeq, Bhattacharyya, Kiran, Liu, Xi, Berniker, Max, Wang, Ziheng, Nespolo, Rogerio, Kondo, Satoshi, Kasai, Satoshi, Hirasawa, Kousuke, Liu, Bo, Austin, David, Wang, Yiheng, Futrega, Michal, Puget, Jean-Francois, Li, Zhenqiang, Sato, Yoichi, Fujii, Ryo, Hachiuma, Ryo, Masuda, Mana, Saito, Hideo, Wang, An, Xu, Mengya, Islam, Mobarakol, Bai, Long, Pang, Winnie, Ren, Hongliang, Nwoye, Chinedu, Sestini, Luca, Padoy, Nicolas, Nielsen, Maximilian, Schüttler, Samuel, Sentker, Thilo, Husseini, Hümeyra, Baltruschat, Ivo, Schmitz, Rüdiger, Werner, René, Matsun, Aleksandr, Farooq, Mugariya, Saaed, Numan, Viera, Jose Renato Restom, Yaqub, Mohammad, Getty, Neil, Xia, Fangfang, Zhao, Zixuan, Duan, Xiaotian, Yao, Xing, Lou, Ange, Yang, Hao, Han, Jintong, Noble, Jack, Wu, Jie Ying, Alshirbaji, Tamer Abdulbaki, Jalal, Nour Aldeen, Arabian, Herag, Ding, Ning, Moeller, Knut, Chen, Weiliang, He, Quan, Bilal, Muhammad, Akinosho, Taofeek, Qayyum, Adnan, Caputo, Massimo, Vohra, Hunaid, Loizou, Michael, Ajayi, Anuoluwapo, Berrou, Ilhem, Niyi-Odumosu, Faatihah, Maier-Hein, Lena, Stoyanov, Danail, Speidel, Stefanie, and Jarc, Anthony
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The ability to automatically detect and track surgical instruments in endoscopic videos can enable transformational interventions. Assessing surgical performance and efficiency, identifying skilled tool use and choreography, and planning operational and logistical aspects of OR resources are just a few of the applications that could benefit. Unfortunately, obtaining the annotations needed to train machine learning models to identify and localize surgical tools is a difficult task. Annotating bounding boxes frame-by-frame is tedious and time-consuming, yet large amounts of data with a wide variety of surgical tools and surgeries must be captured for robust training. Moreover, ongoing annotator training is needed to stay up to date with surgical instrument innovation. In robotic-assisted surgery, however, potentially informative data like timestamps of instrument installation and removal can be programmatically harvested. The ability to rely on tool installation data alone would significantly reduce the workload to train robust tool-tracking models. With this motivation in mind we invited the surgical data science community to participate in the challenge, SurgToolLoc 2022. The goal was to leverage tool presence data as weak labels for machine learning models trained to detect tools and localize them in video frames with bounding boxes. We present the results of this challenge along with many of the teams' efforts. We conclude by discussing these results in the broader context of machine learning and surgical data science. The training data used for this challenge, consisting of 24,695 video clips with tool presence labels, is also being released publicly and can be accessed at https://console.cloud.google.com/storage/browser/isi-surgtoolloc-2022.
Published: 2023

22. Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction

Author: Zhang, Mingfang, Wang, Jinglu, Li, Xiao, Huang, Yifei, Sato, Yoichi, and Lu, Yan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs. Yet, its fixed structure limits the performance, especially for surfaces imaged at oblique angles. We introduce the Structural MPI (S-MPI), where the plane structure approximates 3D scenes concisely. Conveying RGBA contexts with geometrically-faithful structures, the S-MPI directly bridges view synthesis and 3D reconstruction. It not only overcomes the critical limitations of MPI, i.e., discretization artifacts from sloped surfaces and abuse of redundant layers, but also acquires planar 3D reconstruction. Despite the intuition and demand of applying S-MPI, great challenges are introduced, e.g., high-fidelity approximation for both RGBA layers and plane poses, multi-view consistency, non-planar regions modeling, and efficient rendering with intersected planes. Accordingly, we propose a transformer-based network based on a segmentation model. It predicts compact and expressive S-MPI layers with their corresponding masks, poses, and RGBA contexts. Non-planar regions are inclusively handled as a special case in our unified framework. Multi-view consistency is ensured by sharing global proxy embeddings, which encode plane-level features covering the complete 3D scenes with aligned coordinates. Intensive experiments show that our method outperforms both previous state-of-the-art MPI-based view synthesis methods and planar reconstruction methods.
Comment: Accepted to CVPR2023
Published: 2023

23. Fine-grained Affordance Annotation for Egocentric Hand-Object Interaction Videos

Author: Yu, Zecheng, Huang, Yifei, Furuta, Ryosuke, Yagi, Takuma, Goutsu, Yusuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Object affordance is an important concept in hand-object interaction, providing information on action possibilities based on human motor capacity and objects' physical properties, thus benefiting tasks such as action anticipation and robot imitation learning. However, the definitions of affordance in existing datasets often 1) mix up affordance with object functionality; 2) confuse affordance with goal-related action; and 3) ignore human motor capacity. This paper proposes an efficient annotation scheme to address these issues by combining goal-irrelevant motor actions and grasp types as affordance labels and introducing the concept of mechanical action to represent the action possibilities between two objects. We provide new annotations by applying this scheme to the EPIC-KITCHENS dataset and test our annotation with tasks such as affordance recognition, hand-object interaction hotspot prediction, and cross-domain evaluation of affordance. The results show that models trained with our annotation can distinguish affordance from other concepts, predict fine-grained interaction possibilities on objects, and generalize across different domains., Comment: WACV 2023. Refined version of Workshop article arXiv:2206.05424
Published: 2023

24. ClipCrop: Conditioned Cropping Driven by Vision-Language Model

Author: Zhong, Zhihang, Cheng, Mingxi, Wu, Zhirong, Yuan, Yuhui, Zheng, Yinqiang, Li, Ji, Hu, Han, Lin, Stephen, Sato, Yoichi, and Sato, Imari
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image cropping has progressed tremendously under the data-driven paradigm. However, current approaches do not account for the intentions of the user, which is an issue especially when the composition of the input image is complex. Moreover, labeling of cropping data is costly and hence the amount of data is limited, leading to poor generalization performance of current algorithms in the wild. In this work, we take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms. By adapting a transformer decoder with a pre-trained CLIP-based detection model, OWL-ViT, we develop a method to perform cropping with a text or image query that reflects the user's intention as guidance. In addition, our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small cropping dataset, while inheriting the open-vocabulary ability acquired from millions of text-image pairs. We validate our model through extensive experiments on existing datasets as well as a new cropping test set we compiled that is characterized by content ambiguity.
Published: 2022

25. Change in phase angle is associated with improvement in activities of daily living and muscle function in patients with acute stroke

Author: Sato, Yoichi, Yoshimura, Yoshihiro, Abe, Takafumi, Nagano, Fumihiko, Matsumoto, Ayaka, and Wakabayashi, Hidetaka
Published: 2023
Full Text: View/download PDF

26. Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey

Author: Ohkawa, Takehiko, Furuta, Ryosuke, and Sato, Yoichi
Published: 2023
Full Text: View/download PDF

27. Surgical Skill Assessment via Video Semantic Aggregation

Author: Li, Zhenqiang, Gu, Lin, Wang, Weimin, Nakamura, Ryosuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Automated video-based assessment of surgical skills is a promising task in assisting young surgical trainees, especially in resource-poor areas. Existing works often resort to a CNN-LSTM joint framework that models long-term relationships by LSTMs on spatially pooled short-term CNN features. However, this practice inevitably neglects the difference among semantic concepts such as tools, tissues, and background in the spatial dimension, impeding the subsequent temporal relationship modeling. In this paper, we propose a novel skill assessment framework, Video Semantic Aggregation (ViSA), which discovers different semantic parts and aggregates them across spatiotemporal dimensions. The explicit discovery of semantic parts provides an explanatory visualization that helps understand the neural network's decisions. It also enables us to further incorporate auxiliary information such as the kinematic data to improve representation learning and performance. The experiments on two datasets show the competitiveness of ViSA compared to state-of-the-art methods. Source code is available at: bit.ly/MICCAI2022ViSA., Comment: To appear in MICCAI 2022
Published: 2022

28. CompNVS: Novel View Synthesis with Scene Completion

Author: Li, Zuoyue, Fan, Tianxing, Li, Zhenqiang, Cui, Zhaopeng, Sato, Yoichi, Pollefeys, Marc, and Oswald, Martin R.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similar photorealistic results in combination with scene completion, where a spatial 3D scene understanding is essential. To this end, we propose a generative pipeline operating on a sparse grid-based neural scene representation to complete unobserved scene parts via a learned distribution of scenes in a 2.5D-3D-2.5D manner. We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to extrapolate the missing area. Photorealistic image sequences can finally be obtained via consistency-relevant differentiable rendering. Comprehensive experiments show that the graphical outputs of our method outperform the state of the art, especially within unobserved scene parts., Comment: ECCV 2022
Published: 2022

29. Compound Prototype Matching for Few-shot Action Recognition

Author: Huang, Yifei, Yang, Lijin, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Few-shot action recognition aims to recognize novel action classes using only a small number of labeled training samples. In this work, we propose a novel approach that first summarizes each video into compound prototypes consisting of a group of global prototypes and a group of focused prototypes, and then compares video similarity based on the prototypes. Each global prototype is encouraged to summarize a specific aspect from the entire video, for example, the start/evolution of the action. Since no clear annotation is provided for the global prototypes, we use a group of focused prototypes to focus on certain timestamps in the video. We compare video similarity by matching the compound prototypes between the support and query videos. The global prototypes are directly matched to compare videos from the same perspective, for example, to compare whether two actions start similarly. For the focused prototypes, since actions have various temporal variations in the videos, we apply bipartite matching to allow the comparison of actions with different temporal positions and shifts. Experiments demonstrate that our proposed method achieves state-of-the-art results on multiple benchmarks., Comment: ECCV 2022
Published: 2022

30. Precise Affordance Annotation for Egocentric Action Video Datasets

Author: Yu, Zecheng, Huang, Yifei, Furuta, Ryosuke, Yagi, Takuma, Goutsu, Yusuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Object affordance is an important concept in human-object interaction, providing information on action possibilities based on human motor capacity and objects' physical properties, thus benefiting tasks such as action anticipation and robot imitation learning. However, existing datasets often 1) mix up affordance with object functionality; 2) confuse affordance with goal-related action; and 3) ignore human motor capacity. This paper proposes an efficient annotation scheme to address these issues by combining goal-irrelevant motor actions and grasp types as affordance labels and introducing the concept of mechanical action to represent the action possibilities between two objects. We provide new annotations by applying this scheme to the EPIC-KITCHENS dataset and test our annotation with tasks such as affordance recognition. We qualitatively verify that models trained with our annotation can distinguish affordance and mechanical actions., Comment: Technical report for CVPR 2022 EPIC-Ego4D Workshop
Published: 2022

31. Object Instance Identification in Dynamic Environments

Author: Yagi, Takuma, Hasan, Md Tasnimul, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the problem of identifying object instances in a dynamic environment where people interact with the objects. In such an environment, objects' appearance changes dynamically through interaction with other entities, occlusion by hands, background change, etc. This leads to a larger intra-instance variation of appearance than in static environments. To discover the challenges in this setting, we built a new benchmark of more than 1,500 instances on the EPIC-KITCHENS dataset, which includes natural activities, and conducted an extensive analysis of it. Experimental results suggest that (i) robustness against instance-specific appearance change, (ii) integration of low-level (e.g., color, texture) and high-level (e.g., object category) features, and (iii) foreground feature selection on overlapping objects are required for further improvement., Comment: Joint 1st Ego4D and 10th EPIC Workshop (EPIC@CVPR2022) Extended Abstract
Published: 2022

32. Efficient Annotation and Learning for 3D Hand Pose Estimation: A Survey

Author: Ohkawa, Takehiko, Furuta, Ryosuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this survey, we present a systematic review of 3D hand pose estimation from the perspective of efficient annotation and learning. 3D hand pose estimation has been an important research area owing to its potential to enable various applications, such as video understanding, AR/VR, and robotics. However, the performance of models is tied to the quality and quantity of annotated 3D hand poses. Under the status quo, acquiring such annotated 3D hand poses is challenging, e.g., due to the difficulty of 3D annotation and the presence of occlusion. To reveal this problem, we review the pros and cons of existing annotation methods classified as manual, synthetic-model-based, hand-sensor-based, and computational approaches. Additionally, we examine methods for learning 3D hand poses when annotated data are scarce, including self-supervised pretraining, semi-supervised learning, and domain adaptation. Based on the study of efficient annotation and learning, we further discuss limitations and possible future directions in this field.
Published: 2022

33. Domain Adaptive Hand Keypoint and Pixel Localization in the Wild

Author: Ohkawa, Takehiko, Li, Yu-Jhe, Fu, Qichen, Furuta, Ryosuke, Kitani, Kris M., and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that the model trained for both tasks works under various imaging conditions. However, their variation covered by existing labeled hand datasets is limited. Thus, it is necessary to adapt the model trained on the labeled images (source) to unlabeled images (target) with unseen imaging conditions. While self-training domain adaptation methods (i.e., learning from the unlabeled target images in a self-supervised manner) have been developed for both tasks, their training may degrade performance when the predictions on the target images are noisy. To avoid this, it is crucial to assign a low importance (confidence) weight to the noisy predictions during self-training. In this paper, we propose to utilize the divergence of two predictions to estimate the confidence of the target image for both tasks. These predictions are given from two separate networks, and their divergence helps identify the noisy predictions. To integrate our proposed confidence estimation into self-training, we propose a teacher-student framework where the two networks (teachers) provide supervision to a network (student) for self-training, and the teachers are learned from the student by knowledge distillation. Our experiments show its superiority over state-of-the-art methods in adaptation settings with different lighting, grasping objects, backgrounds, and camera viewpoints. Our method improves by 4% the multi-task score on HO3D compared to the latest adversarial adaptation method. We also validate our method on Ego4D, egocentric videos with rapid changes in imaging conditions outdoors., Comment: Accepted to ECCV 2022
Published: 2022

34. Background Mixup Data Augmentation for Hand and Object-in-Contact Detection

Author: Tango, Koya, Ohkawa, Takehiko, Furuta, Ryosuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Detecting the positions of human hands and objects-in-contact (hand-object detection) in each video frame is vital for understanding human activities from videos. For training an object detector, a method called Mixup, which overlays two training images to mitigate data bias, has been empirically shown to be effective for data augmentation. However, in hand-object detection, mixing two hand-manipulation images produces unintended biases, e.g., the concentration of hands and objects in a specific region degrades the ability of the hand-object detector to identify object boundaries. We propose a data-augmentation method called Background Mixup that leverages data-mixing regularization while reducing the unintended effects in hand-object detection. Instead of mixing two images where a hand and an object in contact appear, we mix a target training image with background images without hands and objects-in-contact extracted from external image sources, and use the mixed images for training the detector. Our experiments demonstrated that the proposed method can effectively reduce false positives and improve the performance of hand-object detection in both supervised and semi-supervised learning settings., Comment: 5 pages, 4 figures
Published: 2022

35. Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips

Author: Yang, Lijin, Huang, Yifei, Sugano, Yusuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: First-person action recognition is a challenging task in video understanding. Because of strong ego-motion and a limited field of view, many background or noisy frames in a first-person video can distract an action recognition model during its learning process. To encode more discriminative features, the model needs to have the ability to focus on the most relevant part of the video for action recognition. Previous works attempted to address this problem by applying temporal attention but failed to consider the global context of the full video, which is critical for determining the relatively significant parts. In this work, we propose a simple yet effective Stacked Temporal Attention Module (STAM) to compute temporal attention based on the global knowledge across clips for emphasizing the most discriminative features. We achieve this by stacking multiple self-attention layers. Instead of naive stacking, which is experimentally proven to be ineffective, we carefully design the input to each self-attention layer so that both the local and global context of the video is considered when generating the temporal attention weights. Experiments demonstrate that our proposed STAM can be built on top of most existing backbones and boost the performance in various datasets., Comment: BMVC 2021
Published: 2021

36. Leveraging Human Selective Attention for Medical Image Analysis with Limited Training Data

Author: Huang, Yifei, Li, Xiaoxiao, Yang, Lijin, Gu, Lin, Zhu, Yingying, Seo, Hirofumi, Meng, Qiuming, Harada, Tatsuya, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Human gaze is a cost-efficient form of physiological data that reveals underlying human attentional patterns. The selective attention mechanism helps the cognition system focus on task-relevant visual clues by ignoring the presence of distractors. Thanks to this ability, human beings can efficiently learn from a very limited number of training samples. Inspired by this mechanism, we aim to leverage gaze for medical image analysis tasks with small training data. Our proposed framework includes a backbone encoder and a Selective Attention Network (SAN) that simulates the underlying attention. The SAN implicitly encodes information, such as suspicious regions, that is relevant to the medical diagnosis tasks by estimating the actual human gaze. Then we design a novel Auxiliary Attention Block (AAB) to allow information from SAN to be utilized by the backbone encoder to focus on selective areas. Specifically, this block uses a modified version of a multi-head attention layer to simulate the human visual search procedure. Note that the SAN and AAB can be plugged into different backbones, and the framework can be used for multiple medical image analysis tasks when equipped with task-specific heads. Our method is demonstrated to achieve superior performance on both 3D tumor segmentation and 2D chest X-ray classification tasks. We also show that the estimated gaze probability map of the SAN is consistent with an actual gaze fixation map obtained from board-certified doctors., Comment: BMVC 2021
Published: 2021

37. Hand-Object Contact Prediction via Motion-Based Pseudo-Labeling and Guided Progressive Label Correction

Author: Yagi, Takuma, Hasan, Md Tasnimul, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Every hand-object interaction begins with contact. Although predicting the contact state between hands and objects is useful in understanding hand-object interactions, prior methods on hand-object analysis have assumed that the interacting hands and objects are known, and contact states have not been studied in detail. In this study, we introduce a video-based method for predicting contact between a hand and an object. Specifically, given a video and a pair of hand and object tracks, we predict a binary contact state (contact or no-contact) for each frame. However, annotating a large number of hand-object tracks and contact labels is costly. To overcome the difficulty, we propose a semi-supervised framework consisting of (i) automatic collection of training data with motion-based pseudo-labels and (ii) guided progressive label correction (gPLC), which corrects noisy pseudo-labels with a small amount of trusted data. We validated our framework's effectiveness on a newly built benchmark dataset for hand-object contact prediction and showed superior performance against existing baseline methods. Code and data are available at https://github.com/takumayagi/hand_object_contact_prediction., Comment: BMVC 2021
Published: 2021

38. Ego4D: Around the World in 3,000 Hours of Egocentric Video

Author: Grauman, Kristen, Westbury, Andrew, Byrne, Eugene, Chavis, Zachary, Furnari, Antonino, Girdhar, Rohit, Hamburger, Jackson, Jiang, Hao, Liu, Miao, Liu, Xingyu, Martin, Miguel, Nagarajan, Tushar, Radosavovic, Ilija, Ramakrishnan, Santhosh Kumar, Ryan, Fiona, Sharma, Jayant, Wray, Michael, Xu, Mengmeng, Xu, Eric Zhongcong, Zhao, Chen, Bansal, Siddhant, Batra, Dhruv, Cartillier, Vincent, Crane, Sean, Do, Tien, Doulaty, Morrie, Erapalli, Akshay, Feichtenhofer, Christoph, Fragomeni, Adriano, Fu, Qichen, Gebreselasie, Abrham, Gonzalez, Cristina, Hillis, James, Huang, Xuhua, Huang, Yifei, Jia, Wenqi, Khoo, Weslie, Kolar, Jachym, Kottur, Satwik, Kumar, Anurag, Landini, Federico, Li, Chao, Li, Yanghao, Li, Zhenqiang, Mangalam, Karttikeya, Modhugu, Raghava, Munro, Jonathan, Murrell, Tullie, Nishiyasu, Takumi, Price, Will, Puentes, Paola Ruiz, Ramazanova, Merey, Sari, Leda, Somasundaram, Kiran, Southerland, Audrey, Sugano, Yusuke, Tao, Ruijie, Vo, Minh, Wang, Yuchen, Wu, Xindi, Yagi, Takuma, Zhao, Ziwei, Zhu, Yunyi, Arbelaez, Pablo, Crandall, David, Damen, Dima, Farinella, Giovanni Maria, Fuegen, Christian, Ghanem, Bernard, Ithapu, Vamsi Krishna, Jawahar, C. V., Joo, Hanbyul, Kitani, Kris, Li, Haizhou, Newcombe, Richard, Oliva, Aude, Park, Hyun Soo, Rehg, James M., Sato, Yoichi, Shi, Jianbo, Shou, Mike Zheng, Torralba, Antonio, Torresani, Lorenzo, Yan, Mingfei, and Malik, Jitendra
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/, Comment: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)
Published: 2021

39. Spatio-Temporal Perturbations for Video Attribution

Author: Li, Zhenqiang, Wang, Weimin, Li, Zuoyue, Huang, Yifei, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The attribution method provides a direction for interpreting opaque neural networks in a visual way by identifying and visualizing the input regions/pixels that dominate the output of a network. Applying attribution methods to visually explain video understanding networks is challenging because of the unique spatiotemporal dependencies existing in video inputs and the special 3D convolutional or recurrent structures of video understanding networks. However, most existing attribution methods focus on explaining networks taking a single image as input, and a few works specifically devised for video attribution fall short of dealing with diversified structures of video understanding networks. In this paper, we investigate a generic perturbation-based attribution method that is compatible with diversified video understanding networks. In addition, we propose a novel regularization term to enhance the method by constraining the smoothness of its attribution results in both spatial and temporal dimensions. In order to assess the effectiveness of different video attribution methods without relying on manual judgement, we introduce reliable objective metrics which are checked by a newly proposed reliability measurement. We verified the effectiveness of our method by both subjective and objective evaluation and comparison with multiple significant attribution methods.
Published: 2021
Full Text: View/download PDF

40. Demonstration test of gas contaminants adsorbent for space Stirling cooler

Author: Sato, Yoichi, Tanaka, Kosuke, and Shinozaki, Keisuke
Published: 2024
Full Text: View/download PDF

41. Foreground-Aware Stylization and Consensus Pseudo-Labeling for Domain Adaptation of First-Person Hand Segmentation

Author: Ohkawa, Takehiko, Yagi, Takuma, Hashimoto, Atsushi, Ushiku, Yoshitaka, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Hand segmentation is a crucial task in first-person vision. Since first-person images exhibit strong bias in appearance among different environments, adapting a pre-trained segmentation model to a new domain is required in hand segmentation. Here, we focus on appearance gaps for hand regions and backgrounds separately. We propose (i) foreground-aware image stylization and (ii) consensus pseudo-labeling for domain adaptation of hand segmentation. We stylize source images independently for the foreground and background using target images as style. To resolve the domain shift that the stylization has not addressed, we apply careful pseudo-labeling by taking a consensus between the models trained on the source and stylized source images. We validated our method on domain adaptation of hand segmentation from real and simulation images. Our method achieved state-of-the-art performance in both settings. We also demonstrated promising results in challenging multi-target domain adaptation and domain generalization settings. Code is available at https://github.com/ut-vision/FgSty-CPL., Comment: Accepted to IEEE Access 2021
Published: 2021
Full Text: View/download PDF

42. EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2021: Team M3EM Technical Report

Author: Yang, Lijin, Huang, Yifei, Sugano, Yusuke, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this report, we describe the technical details of our submission to the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition. Leveraging multiple modalities has been proved to benefit the Unsupervised Domain Adaptation (UDA) task. In this work, we present Multi-Modal Mutual Enhancement Module (M3EM), a deep module for jointly considering information from multiple modalities to find the most transferable representations across domains. We achieve this by implementing two sub-modules for enhancing each modality using the context of other modalities. The first sub-module exchanges information across modalities through the semantic space, while the second sub-module finds the most transferable spatial region based on the consensus of all modalities.
Published: 2021

43. GO-Finder: A Registration-Free Wearable System for Assisting Users in Finding Lost Objects via Hand-Held Object Discovery

Author: Yagi, Takuma, Nishiyasu, Takumi, Kawasaki, Kunimasa, Matsuki, Moe, and Sato, Yoichi
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Computer Vision and Pattern Recognition
Abstract: People spend an enormous amount of time and effort looking for lost objects. To help remind people of the location of lost objects, various computational systems that provide information on their locations have been developed. However, prior systems for assisting people in finding objects require users to register the target objects in advance. This requirement imposes a cumbersome burden on the users, and the system cannot help remind them of unexpectedly lost objects. We propose GO-Finder ("Generic Object Finder"), a registration-free wearable-camera-based system for assisting people in finding an arbitrary number of objects based on two key features: automatic discovery of hand-held objects and image-based candidate selection. Given a video taken from a wearable camera, GO-Finder automatically detects and groups hand-held objects to form a visual timeline of the objects. Users can retrieve the last appearance of the object by browsing the timeline through a smartphone app. We conducted a user study to investigate how users benefit from using GO-Finder and confirmed improved accuracy and reduced mental load regarding the object search task by providing clear visual cues on object locations., Comment: 13 pages, 13 figures, ACM IUI 2021
Published: 2021
Full Text: View/download PDF

44. Sustainable Development Goals from the Perspective of Photographic Archives: A Case Study on Photographs from Occupied Japan

Author: Sato, Yoichi, Urata, Shujiro, editor, Akao, Ken-Ichi, editor, and Washizu, Ayu, editor
Published: 2023
Full Text: View/download PDF

45. Relationship between nutritional status and clinical outcomes among older individuals using long-term care services: A systematic review and meta-analysis

Author: Ogawa, Masato, Okamura, Masatsugu, Inoue, Tatsuro, Sato, Yoichi, Momosaki, Ryo, and Maeda, Keisuke
Published: 2024
Full Text: View/download PDF

46. Towards Visually Explaining Video Understanding Networks with Perturbation

Author: Li, Zhenqiang, Wang, Weimin, Li, Zuoyue, Huang, Yifei, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: "Making black box models explainable" is a vital problem that accompanies the development of deep learning networks. For networks taking visual information as input, one basic but challenging explanation method is to identify and visualize the input pixels/regions that dominate the network's prediction. However, most existing works focus on explaining networks taking a single image as input and do not consider the temporal relationship that exists in videos. Providing an easy-to-use visual explanation method that is applicable to diversified structures of video understanding networks remains an open challenge. In this paper, we investigate a generic perturbation-based method for visually explaining video understanding networks. In addition, we propose a novel loss function to enhance the method by constraining the smoothness of its results in both spatial and temporal dimensions. The method makes it possible to compare explanation results between different network structures and can also avoid generating pathological adversarial explanations for video inputs. Experimental comparison results verified the effectiveness of our method., Comment: Accepted by WACV2021
Published: 2020

47. A Computer-Aided Diagnosis System Using Artificial Intelligence for Hip Fractures -Multi-Institutional Joint Development Research-

Author: Sato, Yoichi, Takegami, Yasuhiko, Asamoto, Takamune, Ono, Yutaro, Hidetoshi, Tsugeno, Goto, Ryosuke, Kitamura, Akira, and Honda, Seiwa
Subjects: Physics - Medical Physics, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing, Quantitative Biology - Tissues and Organs, 68-T01
Abstract: [Objective] To develop a computer-aided diagnosis (CAD) system for plain frontal hip X-rays with a deep learning model trained on a large dataset collected at multiple centers. [Materials and Methods] We enrolled 5295 cases with neck fracture or trochanteric fracture who visited each institution between April 2009 and March 2019 and were diagnosed and treated by orthopedic surgeons using plain X-rays, computed tomography (CT), or magnetic resonance imaging (MRI). Cases in which both hips were not included in the photographing range, femoral shaft fractures, and periprosthetic fractures were excluded, and 5242 plain frontal pelvic X-rays obtained from 4,851 cases were used for machine learning. These images were divided into 5242 images including the fracture side and 5242 images without the fracture side, and a total of 10484 images were used for machine learning. A deep convolutional neural network approach was used for machine learning. PyTorch 1.3 and Fast.ai 1.0 were used as frameworks, and EfficientNet-B4, a model pre-trained on ImageNet, was used. In the final evaluation, accuracy, sensitivity, specificity, F-value, and area under the curve (AUC) were evaluated. Gradient-weighted class activation mapping (Grad-CAM) was used to conceptualize the diagnostic basis of the CAD system. [Results] The learning model achieved an accuracy of 96.1%, sensitivity of 95.2%, specificity of 96.9%, F-value of 0.961, and AUC of 0.99. Cases that were correctly diagnosed showed a generally correct diagnostic basis with Grad-CAM. [Conclusions] The CAD system we developed using a deep learning model was able to diagnose hip fractures on plain X-rays with high accuracy, and it was able to present the reason for its decision., Comment: 9 pages, 4 tables, 7 figures. / author's homepage : https://www.fracture-ai.org
Published: 2020

48. Hospital-associated sarcopenia and the preventive effect of high energy intake along with intensive rehabilitation in patients with acute stroke

Author: Sato, Yoichi, Yoshimura, Yoshihiro, Abe, Takafumi, Nagano, Fumihiko, and Matsumoto, Ayaka
Published: 2023

49. Nutrient uptake characteristics of Cladosiphon okamuranus (Phaeophyceae) from the Ryukyu Islands of Japan

Author: Sato, Yoichi, Inomata, Eri, Nagoe, Hikari, Ito, Michihiro, Konishi, Teruko, Fujimura, Hiroyuki, Tanaka, Atsuko, and Nishihara, Gregory N.
Published: 2023

50. Manipulation-skill Assessment from Videos with Spatial Attention Network

Author: Li, Zhenqiang, Huang, Yifei, Cai, Minjie, and Sato, Yoichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in computer vision have made it possible to automatically assess, from videos, the manipulation skills of humans performing a task, which enables many important applications in domains such as health rehabilitation and manufacturing. Previous methods of video-based skill assessment did not consider the attention mechanism humans use in assessing videos, which limits their performance because only a small part of the video regions is informative for skill assessment. Our motivation here is to estimate attention in videos in a way that helps focus on critically important video regions for better skill assessment. In particular, we propose a novel RNN-based spatial attention model that considers the accumulated attention state from previous frames as well as high-level knowledge about the progress of an ongoing task. We evaluate our approach on a newly collected dataset of an infant grasping task and four existing datasets of hand manipulation tasks. Experimental results demonstrate that state-of-the-art performance can be achieved by considering attention in automatic skill assessment.
Published: 2019
