Author: "Otani, Mayu" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Otani, Mayu"' showing total 156 results

1. Harnessing the Latent Diffusion Model for Training-Free Image Style Transfer

Author: Masui, Kento, Otani, Mayu, Nomura, Masahiro, and Nakayama, Hideki
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Diffusion models have recently shown the ability to generate high-quality images. However, controlling their generation process still poses challenges. The image style transfer task is one of those challenges that transfers the visual attributes of a style image to another content image. A typical obstacle in this task is the requirement of additional training of a pre-trained model. We propose a training-free style transfer algorithm, Style Tracking Reverse Diffusion Process (STRDP) for a pretrained Latent Diffusion Model (LDM). Our algorithm employs the Adaptive Instance Normalization (AdaIN) function in a distinct manner during the reverse diffusion process of an LDM while tracking the encoding history of the style image. This algorithm enables style transfer in the latent space of LDM for reduced computational cost, and provides compatibility for various LDM models. Through a series of experiments and a user study, we show that our method can quickly transfer the style of an image without additional training. The speed, compatibility, and training-free aspect of our algorithm facilitate agile experiments with combinations of styles and LDMs for extensive application.
Published: 2024

2. Multimodal Markup Document Models for Graphic Design Completion

Author: Kikuchi, Kotaro, Inoue, Naoto, Otani, Mayu, Simo-Serra, Edgar, and Yamaguchi, Kota
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: This paper presents multimodal markup document models (MarkupDM) that can generate both markup language and images within interleaved multimodal documents. Unlike existing vision-and-language multimodal models, our MarkupDM tackles unique challenges critical to graphic design tasks: generating partial images that contribute to the overall appearance, often involving transparency and varying sizes, and understanding the syntax and semantics of markup languages, which play a fundamental role as a representational format of graphic designs. To address these challenges, we design an image quantizer to tokenize images of diverse sizes with transparency and modify a code language model to process markup languages and incorporate image modalities. We provide in-depth evaluations of our approach on three graphic design completion tasks: generating missing attribute values, images, and texts in graphic design templates. Results corroborate the effectiveness of our MarkupDM for graphic design tasks. We also discuss the strengths and weaknesses in detail, providing insights for future research on multimodal document generation.
Comment: Project page: https://cyberagentailab.github.io/MarkupDM/
Published: 2024

3. LTSim: Layout Transportation-based Similarity Measure for Evaluating Layout Generation

Author: Otani, Mayu, Inoue, Naoto, Kikuchi, Kotaro, and Togashi, Riku
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce a layout similarity measure designed to evaluate the results of layout generation. While several similarity measures have been proposed in prior research, there has been a lack of comprehensive discussion about their behaviors. Our research uncovers that the majority of these measures are unable to handle various layout differences, primarily due to their dependencies on strict element matching, that is, one-by-one matching of elements within the same category. To overcome this limitation, we propose a new similarity measure based on optimal transport, which facilitates a more flexible matching of elements. This approach allows us to quantify the similarity between any two layouts, even those sharing no element categories, making our measure highly applicable to a wide range of layout generation tasks. For tasks such as unconditional layout generation, where FID is commonly used, we also extend our measure to deal with collection-level similarities between groups of layouts. The empirical result suggests that our collection-level measure offers more reliable comparisons than existing ones like FID and Max.IoU.
Comment: 26 pages
Published: 2024

4. Would Deep Generative Models Amplify Bias in Future Models?

Author: Chen, Tianwei, Hirota, Yusuke, Otani, Mayu, Garcia, Noa, and Nakashima, Yuta
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We investigate the impact of deep generative models on potential social biases in upcoming computer vision models. As the internet witnesses an increasing influx of AI-generated images, concerns arise regarding inherent biases that may accompany them, potentially leading to the dissemination of harmful content. This paper explores whether a detrimental feedback loop, resulting in bias amplification, would occur if generated images were used as the training data for future models. We conduct simulations by progressively substituting original images in COCO and CC3M datasets with images generated through Stable Diffusion. The modified datasets are used to train OpenCLIP and image captioning models, which we evaluate in terms of quality and bias. Contrary to expectations, our findings indicate that introducing generated images during training does not uniformly amplify bias. Instead, instances of bias mitigation across specific tasks are observed. We further explore the factors that may influence these phenomena, such as artifacts in image generation (e.g., blurry faces) or pre-existing biases in the original datasets.
Comment: This paper has been accepted to CVPR 2024
Published: 2024

5. LayoutFlow: Flow Matching for Layout Generation

Author: Guerreiro, Julian Jorge Andrade, Inoue, Naoto, Masui, Kento, Otani, Mayu, and Nakayama, Hideki
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Finding a suitable layout represents a crucial task for diverse applications in graphic design. Motivated by simpler and smoother sampling trajectories, we explore the use of Flow Matching as an alternative to current diffusion-based layout generation models. Specifically, we propose LayoutFlow, an efficient flow-based model capable of generating high-quality layouts. Instead of progressively denoising the elements of a noisy layout, our method learns to gradually move, or flow, the elements of an initial sample until it reaches its final prediction. In addition, we employ a conditioning scheme that allows us to handle various generation tasks with varying degrees of conditioning with a single model. Empirically, LayoutFlow performs on par with state-of-the-art models while being significantly faster.
Comment: Accepted to ECCV 2024, Project Page: https://julianguerreiro.github.io/layoutflow/
Published: 2024

6. LayoutFlow: Flow Matching for Layout Generation

Author: Guerreiro, Julian Jorge Andrade, Inoue, Naoto, Masui, Kento, Otani, Mayu, and Nakayama, Hideki
Editor: Goos, Gerhard (Series Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Published: 2025

7. Multimodal Color Recommendation in Vector Graphic Documents

Author: Qiu, Qianru, Wang, Xueting, and Otani, Mayu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Color selection plays a critical role in graphic document design and requires sufficient consideration of various contexts. However, recommending appropriate colors which harmonize with the other colors and textual contexts in documents is a challenging task, even for experienced designers. In this study, we propose a multimodal masked color model that integrates both color and textual contexts to provide text-aware color recommendation for graphic documents. Our proposed model comprises self-attention networks to capture the relationships between colors in multiple palettes, and cross-attention networks that incorporate both color and CLIP-based text representations. Our proposed method primarily focuses on color palette completion, which recommends colors based on the given colors and text. Additionally, it is applicable for another color recommendation task, full palette generation, which generates a complete color palette corresponding to the given text. Experimental results demonstrate that our proposed approach surpasses previous color palette completion methods on accuracy, color distribution, and user experience, as well as full palette generation methods concerning color diversity and similarity to the ground truth palettes.
Comment: Accepted to ACM MM 2023
Published: 2023

8. Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

Author: Otani, Mayu, Togashi, Riku, Sawai, Yu, Ishigami, Ryosuke, Nakashima, Yuta, Rahtu, Esa, Heikkilä, Janne, and Satoh, Shin'ichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that the current automatic measures are incompatible with human perception in evaluating the performance of the text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations.
Comment: CVPR 2023
Published: 2023

9. Towards Flexible Multi-modal Document Models

Author: Inoue, Naoto, Kikuchi, Kotaro, Simo-Serra, Edgar, Otani, Mayu, and Yamaguchi, Kota
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Creative workflows for generating graphical documents involve complex inter-related tasks, such as aligning elements, choosing appropriate fonts, or employing aesthetically harmonious colors. In this work, we attempt to build a holistic model that can jointly solve many different design tasks. Our model, which we denote by FlexDM, treats vector graphic documents as a set of multi-modal elements, and learns to predict masked fields such as element type, position, styling attributes, image, or text, using a unified architecture. Through the use of explicit multi-task learning and in-domain pre-training, our model can better capture the multi-modal relationships among the different document fields. Experimental results corroborate that our single FlexDM is able to successfully solve a multitude of different design tasks, while achieving performance that is competitive with task-specific and costly baselines.
Comment: To be published in CVPR2023 (highlight), project page: https://cyberagentailab.github.io/flex-dm
Published: 2023

10. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation

Author: Inoue, Naoto, Kikuchi, Kotaro, Simo-Serra, Edgar, Otani, Mayu, and Yamaguchi, Kota
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Controllable layout generation aims at synthesizing plausible arrangements of element bounding boxes with optional constraints, such as type or position of a specific element. In this work, we try to solve a broad range of layout generation tasks in a single model that is based on discrete state-space diffusion models. Our model, named LayoutDM, naturally handles the structured layout data in the discrete representation and learns to progressively infer a noiseless layout from the initial input, where we model the layout corruption process by modality-wise discrete diffusion. For conditional generation, we propose to inject layout constraints in the form of masking or logit adjustment during inference. We show in the experiments that our LayoutDM successfully generates high-quality layouts and outperforms both task-specific and task-agnostic baselines on several layout tasks.
Comment: To be published in CVPR2023, project page: https://cyberagentailab.github.io/layout-dm/
Published: 2023

11. Generative Colorization of Structured Mobile Web Pages

Author: Kikuchi, Kotaro, Inoue, Naoto, Otani, Mayu, Simo-Serra, Edgar, and Yamaguchi, Kota
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Color is a critical design factor for web pages, affecting important factors such as viewer emotions and the overall trust and satisfaction of a website. Effective coloring requires design knowledge and expertise, but if this process could be automated through data-driven modeling, efficient exploration and alternative workflows would be possible. However, this direction remains underexplored due to the lack of a formalization of the web page colorization problem, datasets, and evaluation protocols. In this work, we propose a new dataset consisting of e-commerce mobile web pages in a tractable format, which are created by simplifying the pages and extracting canonical color styles with a common web browser. The web page colorization problem is then formalized as a task of estimating plausible color styles for a given web page content with a given hierarchical structure of the elements. We present several Transformer-based methods that are adapted to this task by prepending structural message passing to capture hierarchical relationships between elements. Experimental results, including a quantitative evaluation designed for this task, demonstrate the advantages of our methods over statistical and image colorization methods. The code is available at https://github.com/CyberAgentAILab/webcolor.
Comment: Accepted to WACV 2023
Published: 2022

12. Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization

Author: Pang, Zongshang, Nakashima, Yuta, Otani, Mayu, and Nagahara, Hajime
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Unsupervised methods usually rely on heuristic training objectives such as diversity and representativeness. However, such methods need to bootstrap the online-generated summaries to compute the objectives for importance score regression. We consider such a pipeline inefficient and seek to directly quantify the frame-level importance with the help of contrastive losses in the representation learning literature. Leveraging the contrastive losses, we propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness. With features pre-trained on the image classification task, the metrics can already yield high-quality importance scores, demonstrating competitive or better performance than past heavily-trained methods. We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved, and the model can also leverage a large number of random videos and generalize to test videos with decent performance. Code available at https://github.com/pangzss/pytorch-CTVSUM.
Comment: To appear in WACV2023
Published: 2022

13. Video Summarization Overview

Author: Otani, Mayu, Song, Yale, and Wang, Yang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: With the broad growth of video capturing devices and applications on the web, it is more demanding to provide desired video content for users efficiently. Video summarization facilitates quickly grasping video content by creating a compact summary of videos. Much effort has been devoted to automatic video summarization, and various problem settings and approaches have been proposed. Our goal is to provide an overview of this field. This survey covers early studies as well as recent approaches which take advantage of deep learning techniques. We describe video summarization approaches and their underlying concepts. We also discuss benchmarks and evaluations. We overview how prior work addressed evaluation and detail the pros and cons of the evaluation protocols. Last but not least, we discuss open challenges in this field.
Comment: 53 pages
Published: 2022

14. Color Recommendation for Vector Graphic Documents based on Multi-Palette Representation

Author: Qiu, Qianru, Wang, Xueting, Otani, Mayu, and Iwazaki, Yuki
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vector graphic documents present multiple visual elements, such as images, shapes, and texts. Choosing appropriate colors for multiple visual elements is a difficult but crucial task for both amateurs and professional designers. Instead of creating a single color palette for all elements, we extract multiple color palettes from each visual element in a graphic document, and then combine them into a color sequence. We propose a masked color model for color sequence completion and recommend the specified colors based on color context in multi-palette with high probability. We train the model and build a color recommendation system on a large-scale dataset of vector graphic documents. The proposed color recommendation method outperformed other state-of-the-art methods by both quantitative and qualitative evaluations on color prediction and our color recommendation system received positive feedback from professional designers in an interview study.
Comment: Accepted to WACV 2023
Published: 2022

15. Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks

Author: Chen, Tianwei, Garcia, Noa, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, and Nagahara, Hajime
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Is more data always better to train vision-and-language models? We study knowledge transferability in multi-modal tasks. The current tendency in machine learning is to assume that by joining multiple datasets from different tasks their overall performance will improve. However, we show that not all the knowledge transfers well or has a positive impact on related tasks, even when they share a common goal. We conduct an exhaustive analysis based on hundreds of cross-experiments on 12 vision-and-language tasks categorized in 4 groups. Whereas tasks in the same group are prone to improve each other, results show that this is not always the case. Other factors such as dataset size or pre-training stage also have a great impact on how well the knowledge is transferred.
Published: 2022

16. Does Robustness on ImageNet Transfer to Downstream Tasks?

Author: Yamada, Yutaro and Otani, Mayu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: As clean ImageNet accuracy nears its ceiling, the research community is increasingly concerned about robust accuracy under distributional shifts. While a variety of methods have been proposed to robustify neural networks, these techniques often target models trained on ImageNet classification. At the same time, it is a common practice to use ImageNet pretrained backbones for downstream tasks such as object detection, semantic segmentation, and image classification from different domains. This raises a question: Can these robust image classifiers transfer robustness to downstream tasks? For object detection and semantic segmentation, we find that a vanilla Swin Transformer, a variant of Vision Transformer tailored for dense prediction tasks, transfers robustness better than Convolutional Neural Networks that are trained to be robust to the corrupted version of ImageNet. For CIFAR10 classification, we find that models that are robustified for ImageNet do not retain robustness when fully fine-tuned. These findings suggest that current robustification techniques tend to emphasize ImageNet evaluations. Moreover, network architecture is a strong source of robustness when we consider transfer learning.
Comment: CVPR 2022
Published: 2022

17. AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval

Author: Togashi, Riku, Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, Heikkilä, Janne, and Sakai, Tetsuya
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval
Abstract: Evaluation measures have a crucial impact on the direction of research. Therefore, it is of utmost importance to develop appropriate and reliable evaluation measures for new applications where conventional measures are not well suited. Video Moment Retrieval (VMR) is one such application, and the current practice is to use R@$K,\theta$ for evaluating VMR systems. However, this measure has two disadvantages. First, it is rank-insensitive: It ignores the rank positions of successfully localised moments in the top-$K$ ranked list by treating the list as a set. Second, it binarizes the Intersection over Union (IoU) of each retrieved video moment using the threshold $\theta$ and thereby ignores the fine-grained localisation quality of ranked moments. We propose an alternative measure for evaluating VMR, called Average Max IoU (AxIoU), which is free from the above two problems. We show that AxIoU satisfies two important axioms for VMR evaluation, namely, Invariance against Redundant Moments and Monotonicity with respect to the Best Moment, and also that R@$K,\theta$ satisfies the first axiom only. We also empirically examine how AxIoU agrees with R@$K,\theta$, as well as its stability with respect to change in the test data and human-annotated temporal boundaries.
Comment: Accepted by CVPR2022
Published: 2022

18. Optimal Correction Cost for Object Detection Evaluation

Author: Otani, Mayu, Togashi, Riku, Nakashima, Yuta, Rahtu, Esa, Heikkilä, Janne, and Satoh, Shin'ichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Mean Average Precision (mAP) is the primary evaluation measure for object detection. Although object detection has a broad range of applications, mAP evaluates detectors in terms of the performance of ranked instance retrieval. Such an assumption for the evaluation task does not suit some downstream tasks. To alleviate the gap between downstream tasks and the evaluation scenario, we propose Optimal Correction Cost (OC-cost), which assesses detection accuracy at image level. OC-cost computes the cost of correcting detections to ground truths as a measure of accuracy. The cost is obtained by solving an optimal transportation problem between the detections and the ground truths. Unlike mAP, OC-cost is designed to penalize false positive and false negative detections properly, and every image in a dataset is treated equally. Our experimental result validates that OC-cost has better agreement with human preference than a ranking-based measure, i.e., mAP for a single image. We also show that detectors' rankings by OC-cost are more consistent on different data splits than mAP. Our goal is not to replace mAP with OC-cost but to provide an additional tool to evaluate detectors from another aspect. To help future researchers and developers choose a target measure, we provide a series of experiments to clarify how mAP and OC-cost differ.
Comment: CVPR 2022
Published: 2022

19. Transferring Domain-Agnostic Knowledge in Video Question Answering

Author: Wu, Tianran, Garcia, Noa, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, and Takemura, Haruo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Video question answering (VideoQA) is designed to answer a given question based on a relevant video clip. The currently available large-scale datasets have made it possible to formulate VideoQA as the joint understanding of visual and language information. However, this training procedure is costly and still less competitive with human performance. In this paper, we investigate a transfer learning method by the introduction of domain-agnostic knowledge and domain-specific knowledge. First, we develop a novel transfer learning framework, which finetunes the pre-trained model by applying domain-agnostic knowledge as the medium. Second, we construct a new VideoQA dataset with 21,412 human-generated question-answer samples for comparable transfer of knowledge. Our experiments show that: (i) domain-agnostic knowledge is transferable and (ii) our proposed transfer learning framework can boost VideoQA performance effectively.
Published: 2021

20. Constrained Graphic Layout Generation via Latent Optimization

Author: Kikuchi, Kotaro, Simo-Serra, Edgar, Otani, Mayu, and Yamaguchi, Kota
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: In graphic design, it is common for humans to visually arrange various elements according to their design intent and semantics. For example, a title text almost always appears on top of other elements in a document. In this work, we generate graphic layouts that can flexibly incorporate such design semantics, either specified implicitly or explicitly by a user. We optimize using the latent space of an off-the-shelf layout generation model, allowing our approach to be complementary to and used with existing layout generation models. Our approach builds on a generative layout model based on a Transformer architecture, and formulates the layout generation as a constrained optimization problem where design constraints are used for element alignment, overlap avoidance, or any other user-specified relationship. We show in the experiments that our approach is capable of generating realistic layouts in both constrained and unconstrained generation tasks with a single model. The code is available at https://github.com/ktrk115/const_layout .
Comment: Accepted by ACM Multimedia 2021
Published: 2021

21. A Picture May Be Worth a Hundred Words for Visual Question Answering

Author: Hirota, Yusuke, Garcia, Noa, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, Taniguchi, Ittetsu, and Onoye, Takao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. Deep visual features extracted by vision models, such as Faster R-CNN, are prevalently used in multiple tasks, and especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image as we humans do. Meanwhile, with recent language models' progress, descriptive text may be an alternative to this problem. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model, simplifying the process and the computational cost. We also experiment with data augmentation techniques to increase the diversity in the training set and avoid learning statistical bias. Extensive evaluations have shown that textual representations require only about a hundred words to compete with deep visual features on both VQA 2.0 and VQA-CP v2.
Published: 2021

22. Scalable Personalised Item Ranking through Parametric Density Estimation

Author: Togashi, Riku, Kato, Masahiro, Otani, Mayu, Sakai, Tetsuya, and Satoh, Shin'ichi
Subjects: Computer Science - Machine Learning, Computer Science - Information Retrieval
Abstract: Learning from implicit feedback is challenging because of the difficult nature of the one-class problem: we can observe only positive examples. Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem. However, such methods have two main drawbacks particularly in large-scale applications: (1) the pairwise approach is severely inefficient due to the quadratic computational cost; and (2) even recent model-based samplers (e.g. IRGAN) cannot achieve practical efficiency due to the training of an extra model. In this paper, we propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart while performing similarly to the pairwise counterpart in terms of ranking effectiveness. Our approach estimates the probability densities of positive items for each user within a rich class of distributions, viz. the exponential family. In our formulation, we derive a loss function and the appropriate negative sampling distribution based on maximum likelihood estimation. We also develop a practical technique for risk approximation and a regularisation scheme. We then discuss that our single-model approach is equivalent to an IRGAN variant under a certain condition. Through experiments on real-world datasets, our approach outperforms the pointwise and pairwise counterparts in terms of effectiveness and efficiency.
Comment: Accepted by SIGIR'21
Published: 2021

23. Density-Ratio Based Personalised Ranking from Implicit Feedback

Author: Togashi, Riku, Kato, Masahiro, Otani, Mayu, and Satoh, Shin'ichi
Subjects: Computer Science - Information Retrieval
Abstract: Learning from implicit user feedback is challenging as we can only observe positive samples but never access negative ones. Most conventional methods cope with this issue by adopting a pairwise ranking approach with negative sampling. However, the pairwise ranking approach has a severe disadvantage in the convergence time owing to the quadratically increasing computational cost with respect to the sample size; it is problematic, particularly for large-scale datasets and complex models such as neural networks. By contrast, a pointwise approach does not directly solve a ranking problem, and is therefore inferior to a pairwise counterpart in top-K ranking tasks; however, it is generally advantageous in regards to the convergence time. This study aims to establish an approach to learn personalised ranking from implicit feedback, which reconciles the training efficiency of the pointwise approach and ranking effectiveness of the pairwise counterpart. The key idea is to estimate the ranking of items in a pointwise manner; we first reformulate the conventional pointwise approach based on density ratio estimation and then incorporate the essence of ranking-oriented approaches (e.g. the pairwise approach) into our formulation. Through experiments on three real-world datasets, we demonstrate that our approach not only dramatically reduces the convergence time (one to two orders of magnitude faster) but also significantly improves the ranking performance.
Comment: Accepted by WWW 2021
Published: 2021

24. Alleviating Cold-Start Problems in Recommendation through Pseudo-Labelling over Knowledge Graph

Author: Togashi, Riku, Otani, Mayu, and Satoh, Shin'ichi
Subjects: Computer Science - Information Retrieval
Abstract: Solving cold-start problems is indispensable to provide meaningful recommendation results for new users and items. Under sparsely observed data, unobserved user-item pairs are also a vital source for distilling latent users' information needs. Most present works leverage unobserved samples for extracting negative signals. However, such an optimisation strategy can lead to biased results toward already popular items by frequently handling new items as negative instances. In this study, we tackle the cold-start problems for new users/items by appropriately leveraging unobserved samples. We propose a knowledge graph (KG)-aware recommender based on graph neural networks, which augments labelled samples through pseudo-labelling. Our approach aggressively employs unobserved samples as positive instances and brings new items into the spotlight. To avoid exhaustive label assignments to all possible pairs of users and items, we exploit a KG for selecting probably positive items for each user. We also utilise an improved negative sampling strategy and thereby suppress the exacerbation of popularity biases. Through experiments, we demonstrate that our approach achieves improvements over the state-of-the-art KG-aware recommenders in a variety of scenarios; in particular, our methodology successfully improves recommendation performance for cold-start users/items., Comment: WSDM 2021
Published: 2020

25. Uncovering Hidden Challenges in Query-Based Video Moment Retrieval

Author: Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, and Heikkilä, Janne
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Query-based moment retrieval is the problem of localising a specific clip from an untrimmed video according to a query sentence. This is a challenging task that requires interpretation of both the natural language query and the video content. As in many other areas of computer vision and machine learning, progress in query-based moment retrieval is heavily driven by the benchmark datasets and, therefore, their quality has a significant impact on the field. In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions to improve the temporal sentence grounding in the future. Our code for this paper is available at https://mayu-ot.github.io/hidden-challenges-MR ., Comment: British Machine Vision Conference (BMVC), 2020. (v2) added references
Published: 2020

26. A Dataset and Baselines for Visual Question Answering on Art

Author: Garcia, Noa, Ye, Chentao, Liu, Zihua, Hu, Qingtao, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, and Mitamura, Teruko
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Answering questions related to art pieces (paintings) is a difficult task, as it implies the understanding of not only the visual information that is shown in the picture, but also the contextual knowledge that is acquired through the study of the history of art. In this work, we introduce our first attempt towards building a new dataset, coined AQUA (Art QUestion Answering). The question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods based on paintings and comments provided in an existing art understanding dataset. The QA pairs are cleansed by crowdsourcing workers with respect to their grammatical correctness, answerability, and answers' correctness. Our dataset inherently consists of visual (painting-based) and knowledge (comment-based) questions. We also present a two-branch model as baseline, where the visual and knowledge questions are handled independently. We extensively compare our baseline model against the state-of-the-art models for question answering, and we provide a comprehensive study about the challenges and potential future directions for visual question answering on art.
Published: 2020

27. Knowledge-Based Visual Question Answering in Videos

Author: Garcia, Noa, Otani, Mayu, Chu, Chenhui, and Nakashima, Yuta
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which require experience obtained from viewing the series to be answered. Second, we propose a video understanding model by combining the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations., Comment: arXiv admin note: substantial text overlap with arXiv:1910.10706
Published: 2020

28. KnowIT VQA: Answering Knowledge-Based Questions about Videos

Author: Garcia, Noa, Otani, Mayu, Chu, Chenhui, and Nakashima, Yuta
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which require experience obtained from viewing the series to be answered. Second, we propose a video understanding model by combining the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.
Published: 2019

29. Rethinking the Evaluation of Video Summaries

Author: Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, and Heikkilä, Janne
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video summarization is a technique to create a short skim of the original video while preserving the main stories/content. There is substantial interest in automating this process due to the rapid growth of the available material. Recent progress has been facilitated by public benchmark datasets, which enable easy and fair comparison of methods. Currently, the established evaluation protocol is to compare the generated summary with a set of reference summaries provided by the dataset. In this paper, we provide an in-depth assessment of this pipeline using two popular benchmark datasets. Surprisingly, we observe that randomly generated summaries achieve performance comparable to or better than the state of the art. In some cases, the random summaries outperform even the human-generated summaries in leave-one-out experiments. Moreover, it turns out that the video segmentation, which is often considered a fixed pre-processing method, has the most significant impact on the performance measure. Based on our observations, we propose alternative approaches for assessing the importance scores as well as an intuitive visualization of correlation between the estimated scoring and human annotations., Comment: CVPR'19 poster
Published: 2019

30. iParaphrasing: Extracting Visually Grounded Paraphrases via an Image

Author: Chu, Chenhui, Otani, Mayu, and Nakashima, Yuta
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning, Computer Science - Multimedia
Abstract: A paraphrase is a restatement of the meaning of a text in other words. Paraphrases have been studied to enhance the performance of many natural language processing tasks. In this paper, we propose a novel task, iParaphrasing, to extract visually grounded paraphrases (VGPs), which are different phrasal expressions describing the same visual concept in an image. These extracted VGPs have the potential to improve language and image multimodal tasks such as visual question answering and image captioning. How to model the similarity between VGPs is the key to iParaphrasing. We apply various existing methods as well as propose a novel neural network-based method with image attention, and report the results of the first attempt toward iParaphrasing., Comment: COLING 2018
Published: 2018

31. A Picture May Be Worth a Hundred Words for Visual Question Answering

Author: Hirota, Yusuke, Garcia, Noa, Otani, Mayu, Chu, Chenhui, and Nakashima, Yuta
Subjects: RECOGNITION (Psychology), LANGUAGE models, DATA augmentation, DECISION making, STATISTICAL bias
Abstract: How far can textual representations go in understanding images? In image understanding, effective representations are essential. Deep visual features from object recognition models currently dominate various tasks, especially Visual Question Answering (VQA). However, these conventional features often struggle to capture image details in ways that match human understanding, and their decision processes lack interpretability. Meanwhile, the recent progress in language models suggests that descriptive text could offer a viable alternative. This paper investigated the use of descriptive text as an alternative to deep visual features in VQA. We propose to process description–question pairs rather than visual features, utilizing a language-only Transformer model. We also explored data augmentation strategies to enhance training set diversity and mitigate statistical bias. Extensive evaluation shows that textual representations using approximately a hundred words can effectively compete with deep visual features on both the VQA 2.0 and VQA-CP v2 datasets. Our qualitative experiments further reveal that these textual representations enable clearer investigation of VQA model decision processes, thereby improving interpretability. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

32. Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization

Author: Pang, Zongshang, Nakashima, Yuta, Otani, Mayu, and Nagahara, Hajime
Subjects: VIDEO summarization, AUTOMATIC summarization, TRAINING needs, HEURISTIC, VIDEOS
Abstract: Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Past efforts have invariably involved training summarization models with annotated summaries or heuristic objectives. In this work, we reveal that features pre-trained on image-level tasks contain rich semantic information that can be readily leveraged to quantify frame-level importance for zero-shot video summarization. Leveraging pre-trained features and contrastive learning, we propose three metrics featuring a desirable keyframe: local dissimilarity, global consistency, and uniqueness. We show that the metrics capture well the diversity and representativeness of frames commonly used for the unsupervised generation of video summaries, demonstrating competitive or better performance compared to past methods when no training is needed. We further propose a contrastive learning-based pre-training strategy on unlabeled videos to enhance the quality of the proposed metrics and, thus, improve the evaluated performance on the public benchmarks TVSum and SumMe. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

33. The semantic typology of visually grounded paraphrases

Author: Chu, Chenhui, Oliveira, Vinicius, Virgo, Felix Giovanni, Otani, Mayu, Garcia, Noa, and Nakashima, Yuta
Published: 2022
Full Text: View/download PDF

34. Video Summarization using Deep Semantic Features

Author: Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, Heikkilä, Janne, and Yokoya, Naokazu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents a video summarization technique for an Internet video to provide a quick way to overview its content. This is a challenging problem because finding important or informative parts of the original video requires understanding its content. Furthermore, the content of Internet videos is very diverse, ranging from home videos to documentaries, which makes video summarization much tougher as prior knowledge is hardly available. To tackle this problem, we propose to use deep video features that can encode various levels of content semantics, including objects, actions, and scenes, improving the efficiency of standard video summarization techniques. For this, we design a deep neural network that maps videos as well as descriptions to a common semantic space and jointly train it with associated pairs of videos and descriptions. To generate a video summary, we extract the deep features from each segment of the original video and apply a clustering-based summarization technique to them. We evaluate our video summaries using the SumMe dataset as well as baseline approaches. The results demonstrate the advantages of incorporating our deep semantic features in a video summarization technique., Comment: 16 pages, the 13th Asian Conference on Computer Vision (ACCV'16)
Published: 2016

35. Learning Joint Representations of Videos and Sentences with Web Image Search

Author: Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, Heikkilä, Janne, and Yokoya, Naokazu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate to distances. We also adopt the embedding approach, and make the following contributions: First, we utilize web image search in the sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state-of-the-art, although our embeddings were trained for the retrieval tasks., Comment: 16 pages, 4th Workshop on Web-scale Vision and Social Media (VSM), ECCV 2016
Published: 2016

36. A comparative study of language transformers for video question answering

Author: Yang, Zekun, Garcia, Noa, Chu, Chenhui, Otani, Mayu, Nakashima, Yuta, and Takemura, Haruo
Published: 2021
Full Text: View/download PDF

37. Visually grounded paraphrase identification via gating and phrase localization

Author: Otani, Mayu, Chu, Chenhui, and Nakashima, Yuta
Published: 2020
Full Text: View/download PDF

38. Revisiting Pixel-Level Contrastive Pre-Training on Scene Images

Author: Pang, Zongshang, primary, Nakashima, Yuta, additional, Otani, Mayu, additional, and Nagahara, Hajime, additional
Published: 2024
Full Text: View/download PDF

39. Multimodal Color Recommendation in Vector Graphic Documents

Author: Qiu, Qianru, primary, Wang, Xueting, additional, and Otani, Mayu, additional
Published: 2023
Full Text: View/download PDF

40. A Dataset and Baselines for Visual Question Answering on Art

Author: Garcia, Noa, primary, Ye, Chentao, additional, Liu, Zihua, additional, Hu, Qingtao, additional, Otani, Mayu, additional, Chu, Chenhui, additional, Nakashima, Yuta, additional, and Mitamura, Teruko, additional
Published: 2020
Full Text: View/download PDF

41. Video Summarization Using Deep Semantic Features

Author: Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, Heikkilä, Janne, Yokoya, Naokazu, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Lai, Shang-Hong, editor, Lepetit, Vincent, editor, Nishino, Ko, editor, and Sato, Yoichi, editor
Published: 2017
Full Text: View/download PDF

42. Towards Flexible Multi-modal Document Models

Author: Inoue, Naoto, primary, Kikuchi, Kotaro, additional, Simo-Serra, Edgar, additional, Otani, Mayu, additional, and Yamaguchi, Kota, additional
Published: 2023
Full Text: View/download PDF

43. LayoutDM: Discrete Diffusion Model for Controllable Layout Generation

Author: Inoue, Naoto, primary, Kikuchi, Kotaro, additional, Simo-Serra, Edgar, additional, Otani, Mayu, additional, and Yamaguchi, Kota, additional
Published: 2023
Full Text: View/download PDF

44. Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

Author: Otani, Mayu, primary, Togashi, Riku, additional, Sawai, Yu, additional, Ishigami, Ryosuke, additional, Nakashima, Yuta, additional, Rahtu, Esa, additional, Heikkilä, Janne, additional, and Satoh, Shin'ichi, additional
Published: 2023
Full Text: View/download PDF

45. Coarse-to-fine font recommendation for banner designs

Author: Haraguchi, Daichi, primary and Otani, Mayu, additional
Published: 2023
Full Text: View/download PDF

46. Video summarization using textual descriptions for authoring video blogs

Author: Otani, Mayu, Nakashima, Yuta, Sato, Tomokazu, and Yokoya, Naokazu
Abstract: Authoring video blogs requires a video editing process, which is cumbersome for ordinary users. Video summarization can automate this process by extracting important segments from original videos. Because bloggers typically have certain stories for their blog posts, video summaries of a blog post should take the author’s intentions into account. However, most prior works address video summarization by mining patterns from the original videos without considering the blog author’s intentions. To generate a video summary that reflects the blog author’s intention, we focus on supporting texts in video blog posts and present a text-based method, in which the supporting text serves as a prior to the video summary. Given video and text that describe scenes of interest, our method segments videos and assigns to each video segment its priority in the summary based on its relevance to the input text. Our method then selects a subset of segments with content that is similar to the input text. Accordingly, our method produces different video summaries from the same set of videos, depending on the input text. We evaluated summaries generated from both blog viewers’ and authors’ perspectives in a user study. Experimental results demonstrate the advantages of the proposed text-based method for video blog authoring.
Published: 2023

47. Textual description-based video summarization for video blogs

Author: Otani, Mayu, Nakashima, Yuta, Sato, Tomokazu, and Yokoya, Naokazu
Abstract: The recent popularization of camera devices, including action cams and smartphones, enables us to record videos in everyday life and share them through the Internet. The video blog is a recent approach for sharing videos, in which users enjoy expressing themselves in blog posts with attractive videos. Generating such videos, however, requires users to review a vast amount of raw video and edit it appropriately, which keeps users away from doing so. In this paper, we propose a novel video summarization method for helping users to create a video blog post. Unlike typical video summarization methods, the proposed method utilizes the text, which is written for a video blog post, and makes the video summary consistent with the content of the text. For this, we perform video summarization by solving an optimization problem, in which an objective function involves the content similarity between the summarized video and the text. Our user study with 20 participants has demonstrated that our proposed method is more suitable for creating video blog posts than conventional methods for video summarization., ICME 2015: IEEE International Conference on Multimedia and Expo, Jun 29-Jul 3, 2015, Torino, Italy
Published: 2023

48. Purification of the subcellular compartment in which exogenous antigens undergo endoplasmic reticulum-associated degradation from dendritic cells

Author: Imai, Jun, Otani, Mayu, Sakai, Takahiro, and Hatta, Shinichi
Published: 2016
Full Text: View/download PDF

49. Generative Colorization of Structured Mobile Web Pages

Author: Kikuchi, Kotaro, primary, Inoue, Naoto, additional, Otani, Mayu, additional, Simo-Serra, Edgar, additional, and Yamaguchi, Kota, additional
Published: 2023
Full Text: View/download PDF

50. Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization

Author: Pang, Zongshang, primary, Nakashima, Yuta, additional, Otani, Mayu, additional, and Nagahara, Hajime, additional
Published: 2023
Full Text: View/download PDF
