Author: "Nakashima, Yuta" / Database: OpenAIRE - Searchworks@Jio Institute Digital Library Search Results

1. Uncurated Image-Text Datasets: Shedding Light on Demographic Bias

Author: Garcia, Noa, Hirota, Yusuke, Wu, Yankun, and Nakashima, Yuta
Subjects: FOS: Computer and information sciences, Computer Science - Computers and Society, Computer Vision and Pattern Recognition (cs.CV), Computers and Society (cs.CY), Computer Science - Computer Vision and Pattern Recognition
Abstract: The increasing tendency to collect large and uncurated datasets to train vision-and-language models has raised concerns about fair representations. It is known that even small but manually annotated datasets, such as MSCOCO, are affected by societal bias. This problem, far from being solved, may be getting worse with data crawled from the Internet without much control. In addition, the lack of tools to analyze societal bias in big collections of images makes addressing the problem extremely challenging. Our first contribution is to annotate part of the Google Conceptual Captions dataset, widely used for training vision-and-language models, with four demographic and two contextual attributes. Our second contribution is to conduct a comprehensive analysis of the annotations, focusing on how different demographic groups are represented. Our last contribution lies in evaluating three prevailing vision-and-language tasks: image captioning, text-image CLIP embeddings, and text-to-image generation, showing that societal bias is a persistent problem in all of them., Comment: CVPR 2023
Published: 2023

2. Inference Time Evidences of Adversarial Attacks for Forensic on Transformers

Author: Lemarchant, Hugo, Li, Liangzi, Qian, Yiming, Nakashima, Yuta, and Nagahara, Hajime
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
Abstract: Vision Transformers (ViTs) are becoming a very popular paradigm for vision tasks as they achieve state-of-the-art performance on image classification. However, although early works implied that this network structure had increased robustness against adversarial attacks, some works argue ViTs are still vulnerable. This paper presents our first attempt toward detecting adversarial attacks during inference time using the network's input and outputs as well as latent features. We design four quantifications (or derivatives) of input, output, and latent vectors of ViT-based models that provide a signature of the inference, which could be beneficial for the attack detection, and empirically study their behavior over clean samples and adversarial samples. The results demonstrate that the quantifications from input (images) and output (posterior probabilities) are promising for distinguishing clean and adversarial samples, while latent vectors offer less discriminative power, though they give some insights on how adversarial perturbations work.
Published: 2023
Full Text: View/download PDF

3. Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

Author: Otani, Mayu, Togashi, Riku, Sawai, Yu, Ishigami, Ryosuke, Nakashima, Yuta, Rahtu, Esa, Heikkilä, Janne, and Satoh, Shin'ichi
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that the current automatic measures are incompatible with human perception in evaluating the performance of the text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations., Comment: CVPR 2023
Published: 2023
Full Text: View/download PDF

4. Learning Bottleneck Concepts in Image Classification

Author: Wang, Bowen, Li, Liangzhi, Nakashima, Yuta, and Nagahara, Hajime
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Interpreting and explaining the behavior of deep neural networks is critical for many tasks. Explainable AI provides a way to address this challenge, mostly by providing per-pixel relevance to the decision. Yet, interpreting such explanations may require expert knowledge. Some recent attempts toward interpretability adopt a concept-based framework, giving a higher-level relationship between some concepts and model decisions. This paper proposes Bottleneck Concept Learner (BotCL), which represents an image solely by the presence/absence of concepts learned through training over the target task without explicit supervision over the concepts. It uses self-supervision and tailored regularizers so that learned concepts can be human-understandable. Using some image classification tasks as our testbed, we demonstrate BotCL's potential to rebuild neural networks for better interpretability. Code is available at https://github.com/wbw520/BotCL and a simple demo is available at https://botcl.liangzhili.com/., Comment: Accepted in CVPR 2023
Published: 2023
Full Text: View/download PDF

5. Model-Agnostic Gender Debiased Image Captioning

Author: Hirota, Yusuke, Nakashima, Yuta, and Garcia, Noa
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Image captioning models are known to perpetuate and amplify harmful societal bias in the training set. In this work, we aim to mitigate such gender bias in image captioning models. While prior work has addressed this problem by forcing models to focus on people to reduce gender misclassification, it conversely generates gender-stereotypical words at the expense of predicting the correct gender. From this observation, we hypothesize that there are two types of gender bias affecting image captioning models: 1) bias that exploits context to predict gender, and 2) bias in the probability of generating certain (often stereotypical) words because of gender. To mitigate both types of gender biases, we propose a framework, called LIBRA, that learns from synthetically biased samples to decrease both types of biases, correcting gender misclassification and changing gender-stereotypical words to more neutral ones., Comment: CVPR 2023
Published: 2023
Full Text: View/download PDF

6. Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis

Author: Wu, Yankun, Nakashima, Yuta, and Garcia, Noa
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: The duality of content and style is inherent to the nature of art. For humans, these two elements are clearly different: content refers to the objects and concepts in the piece of art, and style to the way it is expressed. This duality poses an important challenge for computer vision. The visual appearance of objects and concepts is modulated by the style that may reflect the author's emotions, social trends, artistic movement, etc., and their deep comprehension undoubtedly requires handling both. A promising step towards a general paradigm for art analysis is to disentangle content and style, whereas relying on human annotations to cull a single aspect of artworks has limitations in learning semantic concepts and the visual appearance of paintings. We thus present GOYA, a method that distills the artistic knowledge captured in a recent generative model to disentangle content and style. Experiments show that synthetically generated images sufficiently serve as a proxy of the real distribution of artworks, allowing GOYA to separately represent the two elements of art while keeping more information than existing methods.
Published: 2023
Full Text: View/download PDF

7. Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization

Author: Pang, Zongshang, Nakashima, Yuta, Otani, Mayu, and Nagahara, Hajime
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Unsupervised methods usually rely on heuristic training objectives such as diversity and representativeness. However, such methods need to bootstrap the online-generated summaries to compute the objectives for importance score regression. We consider such a pipeline inefficient and seek to directly quantify the frame-level importance with the help of contrastive losses in the representation learning literature. Leveraging the contrastive losses, we propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness. With features pre-trained on the image classification task, the metrics can already yield high-quality importance scores, demonstrating competitive or better performance than past heavily-trained methods. We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved, and the model can also leverage a large number of random videos and generalize to test videos with decent performance. Code available at https://github.com/pangzss/pytorch-CTVSUM., To appear in WACV2023
Published: 2022

8. Deep Gesture Generation for Social Robots Using Type-Specific Libraries

Author: Teshima, Hitoshi, Wake, Naoki, Thomas, Diego, Nakashima, Yuta, Kawasaki, Hiroshi, and Ikeuchi, Katsushi
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Robotics (cs.RO), Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Body language such as conversational gesture is a powerful way to ease communication. Conversational gestures not only make a speech more lively but also contain semantic meaning that helps to stress important information in the discussion. In the field of robotics, giving conversational agents (humanoid robots or virtual avatars) the ability to properly use gestures is critical, yet remains a task of extraordinary difficulty. This is because, given only a text as input, there are many possibilities and ambiguities in generating an appropriate gesture. Unlike previous works, we propose a new method that explicitly takes into account the gesture types to reduce these ambiguities and generate human-like conversational gestures. Key to our proposed system is a new gesture database built on the TED dataset that allows us to map a word to one of three types of gestures: "Imagistic" gestures, which express the content of the speech, "Beat" gestures, which emphasize words, and "No gestures." We propose a system that first maps the words in the input text to their corresponding gesture type, generates type-specific gestures, and combines the generated gestures into one final smooth gesture. In our comparative experiments, the effectiveness of the proposed method was confirmed in user studies for both an avatar and a humanoid robot.
Published: 2022
Full Text: View/download PDF

9. Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks

Author: Chen, Tianwei, Garcia, Noa, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, and Nagahara, Hajime
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Is more data always better to train vision-and-language models? We study knowledge transferability in multi-modal tasks. The current tendency in machine learning is to assume that by joining multiple datasets from different tasks their overall performance will improve. However, we show that not all the knowledge transfers well or has a positive impact on related tasks, even when they share a common goal. We conduct an exhaustive analysis based on hundreds of cross-experiments on 12 vision-and-language tasks categorized in 4 groups. Whereas tasks in the same group are prone to improve each other, results show that this is not always the case. Other factors such as dataset size or pre-training stage also have a great impact on how well the knowledge is transferred.
Published: 2022

10. Quantifying Societal Bias Amplification in Image Captioning

Author: Hirota, Yusuke, Nakashima, Yuta, and Garcia, Noa
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: We study societal bias amplification in image captioning. Image captioning models have been shown to perpetuate gender and racial biases; however, metrics to measure, quantify, and evaluate the societal bias in captions are not yet standardized. We provide a comprehensive study on the strengths and limitations of each metric, and propose LIC, a metric to study captioning bias amplification. We argue that, for image captioning, it is not enough to focus on the correct prediction of the protected attribute, and the whole context should be taken into account. We conduct extensive evaluation on traditional and state-of-the-art image captioning models, and surprisingly find that, by only focusing on the protected attribute prediction, bias mitigation models are unexpectedly amplifying bias., Comment: CVPR 2022
Published: 2022
Full Text: View/download PDF

11. R&D of the KEK Linac Accelerator Tuning Using Machine Learning

Author: Hisano, Akihiro, Iwasaki, Masako, Nagahara, Hajime, Nakano, Takashi, Nakashima, Yuta, Satake, Itsuka, Satoh, Masanori, and Takemura, Noriko
Subjects: Feedback Control, Machine Tuning and Optimization, Accelerator Physics
Abstract: We have developed a machine-learning-based operation tuning scheme for the KEK e⁻/e⁺ injector linac (Linac) to improve the injection efficiency. The tuning scheme is based on the various accelerator operation data (control parameters, monitoring data, and environmental data) of Linac. For the studies, we use the accumulated Linac operation data from 2018 to 2021. Accelerator tuning poses two problems: 1. a large number of parameters (~1000) must be tuned, and these parameters are intricately correlated with each other; and 2. continuous environmental change, due to temperature change, ground motion, tidal force, etc., affects the operation tuning. To solve them, we have developed: 1. visualization of the trend/correlation distribution of the accelerator parameters (~1000), based on dimensionality reduction using a Variational Autoencoder (VAE), to see the long-term correlation between the accelerator operation parameters and the environmental data; and 2. an accelerator tuning method using a deep neural network, which is continuously updated with short-term accelerator data to adapt to environmental changes. In this presentation, we report the current status of the R&D., Proceedings of the 18th International Conference on Accelerator and Large Experimental Physics Control Systems, ICALEPCS2021, Shanghai, China
Published: 2022
Full Text: View/download PDF

12. A Picture May Be Worth a Hundred Words for Visual Question Answering

Author: Hirota, Yusuke, Garcia, Noa, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, Taniguchi, Ittetsu, and Onoye, Takao
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. Deep visual features extracted by vision models, such as Faster R-CNN, are prevalently used in multiple tasks, and especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image as we humans do. Meanwhile, with recent language models' progress, descriptive text may be an alternative to this problem. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model, simplifying the process and the computational cost. We also experiment with data augmentation techniques to increase the diversity in the training set and avoid learning statistical bias. Extensive evaluations have shown that textual representations require only about a hundred words to compete with deep visual features on both VQA 2.0 and VQA-CP v2.
Published: 2021

13. Understanding the Role of Scene Graphs in Visual Question Answering

Author: Damodaran, Vinay, Chakravarthy, Sharanya, Kumar, Akshay, Umapathy, Anjana, Mitamura, Teruko, Nakashima, Yuta, Garcia, Noa, and Chu, Chenhui
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Machine Learning (cs.LG)
Abstract: Visual Question Answering (VQA) is of tremendous interest to the research community with important applications such as aiding visually impaired users and image-based search. In this work, we explore the use of scene graphs for solving the VQA task. We conduct experiments on the GQA dataset which presents a challenging set of questions requiring counting, compositionality and advanced reasoning capability, and provides scene graphs for a large number of images. We adopt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, propose a training curriculum to leverage human-annotated and auto-generated scene graphs, and build late fusion architectures to learn from multiple image representations. We present a multi-faceted study into the use of scene graphs for VQA, making this work the first of its kind.
Published: 2021
Full Text: View/download PDF

14. Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation

Author: Bai, Zechen, Nakashima, Yuta, and Garcia, Noa
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: Have you ever looked at a painting and wondered what is the story behind it? This work presents a framework to bring art closer to people by generating comprehensive descriptions of fine-art paintings. Generating informative descriptions for artworks, however, is extremely challenging, as it requires to 1) describe multiple aspects of the image such as its style, content, or composition, and 2) provide background and contextual knowledge about the artist, their influences, or the historical period. To address these challenges, we introduce a multi-topic and knowledgeable art description framework, which modules the generated sentences according to three artistic topics and, additionally, enhances each description with external knowledge. The framework is validated through an exhaustive analysis, both quantitative and qualitative, as well as a comparative human evaluation, demonstrating outstanding results in terms of both topic diversity and information veracity., Comment: ICCV 2021
Published: 2021
Full Text: View/download PDF

15. Transferring Domain-Agnostic Knowledge in Video Question Answering

Author: Wu, Tianran, Garcia, Noa, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, and Takemura, Haruo
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Video question answering (VideoQA) is designed to answer a given question based on a relevant video clip. The currently available large-scale datasets have made it possible to formulate VideoQA as the joint understanding of visual and language information. However, this training procedure is costly and still falls short of human performance. In this paper, we investigate a transfer learning method by the introduction of domain-agnostic knowledge and domain-specific knowledge. First, we develop a novel transfer learning framework, which finetunes the pre-trained model by applying domain-agnostic knowledge as the medium. Second, we construct a new VideoQA dataset with 21,412 human-generated question-answer samples for comparable transfer of knowledge. Our experiments show that: (i) domain-agnostic knowledge is transferable and (ii) our proposed transfer learning framework can boost VideoQA performance effectively.
Published: 2021
Full Text: View/download PDF

16. Constructing a Visual Relationship Authenticity Dataset

Author: Chu, Chenhui, Takebayashi, Yuto, Vipul, Mishra, and Nakashima, Yuta
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: A visual relationship denotes a relationship between two objects in an image, which can be represented as a triplet of (subject; predicate; object). Visual relationship detection is crucial for scene understanding in images. Existing visual relationship detection datasets only contain true relationships that correctly describe the content in an image. However, distinguishing false visual relationships from true ones is also crucial for image understanding and grounded natural language processing. In this paper, we construct a visual relationship authenticity dataset, in which both true and false relationships among all objects that appear in the captions of the Flickr30k entities image caption dataset are annotated. The dataset is available at https://github.com/codecreator2053/VR_ClassifiedDataset. We hope that this dataset can promote the study of both vision and language understanding.
Published: 2020

17. Uncovering Hidden Challenges in Query-Based Video Moment Retrieval

Author: Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, and Heikkilä, Janne
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Query-based moment retrieval is the problem of localising a specific clip from an untrimmed video according to a query sentence. This is a challenging task that requires interpretation of both the natural language query and the video content. As in many other areas of computer vision and machine learning, the progress in query-based moment retrieval is heavily driven by the benchmark datasets and, therefore, their quality has a significant impact on the field. In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions to improve the temporal sentence grounding in the future. Our code for this paper is available at https://mayu-ot.github.io/hidden-challenges-MR ., British Machine Vision Conference (BMVC), 2020. (v2) added references
Published: 2020

18. Knowledge-Based Visual Question Answering in Videos

Author: Garcia, Noa, Otani, Mayu, Chu, Chenhui, and Nakashima, Yuta
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need the experience obtained from viewing the series to be answered. Second, we propose a video understanding model by combining the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations., arXiv admin note: substantial text overlap with arXiv:1910.10706
Published: 2020

19. Joint Learning of Vessel Segmentation and Artery/Vein Classification with Post-processing

Author: Li, Liangzhi, Verma, Manisha, Nakashima, Yuta, Kawasaki, Ryo, and Nagahara, Hajime
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Retinal imaging serves as a valuable tool for diagnosis of various diseases. However, reading retinal images is a difficult and time-consuming task even for experienced specialists. The fundamental step towards automated retinal image analysis is vessel segmentation and artery/vein classification, which provide various information on potential disorders. To improve the performance of the existing automated methods for retinal image analysis, we propose a two-step vessel classification. We adopt a UNet-based model, SeqNet, to accurately segment vessels from the background and make prediction on the vessel type. Our model does segmentation and classification sequentially, which alleviates the problem of label distribution bias and facilitates training. To further refine classification results, we post-process them considering the structural information among vessels to propagate highly confident prediction to surrounding vessels. Our experiments show that our method improves AUC to 0.98 for segmentation and the accuracy to 0.92 in classification over DRIVE dataset., Comment: Accepted in Medical Imaging with Deep Learning (MIDL) 2020
Published: 2020
Full Text: View/download PDF

20. Improving Topic Modeling through Homophily for Legal Documents

Author: CHEKIKH BRAHIM EL VAIGH, Chu, Chenhui, Renoust, Benjamin, Nakashima, Yuta, and Ashihara, Kazuki
Subjects: ComputingMilieux_LEGALASPECTSOFCOMPUTING, InformationSystems_MISCELLANEOUS
Abstract: Code for the paper Improving Topic Modeling through Homophily for Legal Documents
Published: 2020
Full Text: View/download PDF

21. Demographic Influences on Contemporary Art with Unsupervised Style Embeddings

Author: Huckle, Nikolai, Garcia, Noa, and Nakashima, Yuta
Subjects: Social and Information Networks (cs.SI), FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Social and Information Networks
Abstract: Computational art analysis has, through its reliance on classification tasks, prioritised historical datasets in which the artworks are already well sorted with the necessary annotations. Art produced today, on the other hand, is numerous and easily accessible, through the internet and social networks that are used by professional and amateur artists alike to display their work. Although this art, yet unsorted in terms of style and genre, is less suited for supervised analysis, the data sources come with novel information that may help frame the visual content in equally novel ways. As a first step in this direction, we present contempArt, a multi-modal dataset of exclusively contemporary artworks. contempArt is a collection of paintings and drawings, a detailed graph network based on social connections on Instagram and additional socio-demographic information; all attached to 442 artists at the beginning of their career. We evaluate three methods suited for generating unsupervised style embeddings of images and correlate them with the remaining data. We find no connections between visual style on the one hand and social proximity, gender, and nationality on the other., Comment: To be published in Proceedings of the European Conference in Computer Vision Workshops 2020
Published: 2020
Full Text: View/download PDF

22. A Dataset and Baselines for Visual Question Answering on Art

Author: Garcia, Noa, Ye, Chentao, Liu, Zihua, Hu, Qingtao, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, and Mitamura, Teruko
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: Answering questions related to art pieces (paintings) is a difficult task, as it implies the understanding of not only the visual information that is shown in the picture, but also the contextual knowledge that is acquired through the study of the history of art. In this work, we introduce our first attempt towards building a new dataset, coined AQUA (Art QUestion Answering). The question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods based on paintings and comments provided in an existing art understanding dataset. The QA pairs are cleansed by crowdsourcing workers with respect to their grammatical correctness, answerability, and answers' correctness. Our dataset inherently consists of visual (painting-based) and knowledge (comment-based) questions. We also present a two-branch model as baseline, where the visual and knowledge questions are handled independently. We extensively compare our baseline model against the state-of-the-art models for question answering, and we provide a comprehensive study about the challenges and potential future directions for visual question answering on art.
Published: 2020
Full Text: View/download PDF

23. Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions

Author: Garcia, Noa and Nakashima, Yuta
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each inspired-cognitive task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
Published: 2020
Full Text: View/download PDF

24. Retrospective analysis of the efficacy and safety of eribulin therapy for metastatic breast cancer in daily practice

Author: Tanaka, Toshihiro, Ueno, Miho, Nakashima, Yuta, Chinen, Shotaro, Sato, Eiichi, Masaki, Michio, Mogi, Ai, Sasaki, Hidenori, Tamura, Kazuo, and Takamatsu, Yasushi
Subjects: Adult, post‐treatment therapy, Antineoplastic Agents, Breast Neoplasms, Original Articles, Ketones, Middle Aged, Disease-Free Survival, Drug Administration Schedule, clinical practice, Treatment Outcome, Chemotherapy, Humans, Original Article, Female, metastatic breast cancer, Neoplasm Metastasis, Furans, eribulin, Aged, Retrospective Studies
Abstract: Background: Evidence of eribulin therapy for metastatic breast cancer (MBC) in clinical practice is not well documented. Methods: We retrospectively analyzed the safety and efficacy of eribulin in 29 MBC patients from 2011 to 2016 at Fukuoka University Hospital. Results: The median patient age, number of courses, total dose, and relative dose intensity were 65 years, five courses, 8.6 mg/m2, and 75%, respectively. One patient achieved a complete response (CR), six a partial response (PR), and eight stable disease (SD), while 14 patients exhibited progressive disease. The objective response rate (ORR: CR + PR) was 24.1%, and the clinical benefit rate (CBR: CR + PR + SD) was 51.7%. The median progression‐free survival was 90 days (95% confidence interval [CI] 67–126) and median overall survival was 264 days (95% CI 198–357). In patients who previously received 2–4 regimens, the ORR was 28.5% and the CBR was 57.1%. In patients who received 5–12 regimens, the ORR was 20% and the CBR was 45%. Chemotherapy was administered to 20 patients (69%) after eribulin administration, and the median overall survival rate of cases that achieved greater than a PR was 1088 days. The most frequent treatment‐related grade 3/4 adverse events were neutropenia (55.2%) and febrile neutropenia (20.1%). Grade 3 peripheral neuropathy occurred in 13.8% of patients, but was not exacerbated even if present before treatment. Conclusion: Eribulin is effective for MBC patients who have received multiple chemotherapies. Neutropenia and febrile neutropenia may develop after heavy prior therapy.
Published: 2017

25. Understanding Art through Multi-Modal Retrieval in Paintings

Author: Garcia, Noa, Renoust, Benjamin, and Nakashima, Yuta
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: In computer vision, visual arts are often studied from a purely aesthetics perspective, mostly by analysing the visual appearance of an artistic reproduction to infer its style, its author, or its representative features. In this work, however, we explore art from both a visual and a language perspective. Our aim is to bridge the gap between the visual appearance of an artwork and its underlying meaning, by jointly analysing its aesthetics and its semantics. We introduce the use of multi-modal techniques in the field of automatic art analysis by 1) collecting a multi-modal dataset with fine-art paintings and comments, and 2) exploring robust visual and textual representations in artistic images.
Published: 2019
Full Text: View/download PDF

26. BUDA.ART: A Multimodal Content-Based Analysis and Retrieval System for Buddha Statues

Author: Renoust, Benjamin, Franca, Matheus Oliveira, Chan, Jacob, Le, Van, Uesaka, Ayaka, Nakashima, Yuta, Nagahara, Hajime, Wang, Jueren, and Fujioka, Yutaka
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction, Computer Science - Multimedia, Information Retrieval (cs.IR), Computer Science - Information Retrieval, ComputingMethodologies_COMPUTERGRAPHICS, Human-Computer Interaction (cs.HC), Multimedia (cs.MM)
Abstract: We introduce BUDA.ART, a system designed to assist researchers in Art History in exploring and analyzing an archive of pictures of Buddha statues. The system combines different CBIR and classical retrieval techniques to assemble 2D pictures, 3D statue scans, and metadata, with a focus on Buddha facial characteristics. We build the system from an archive of 50,000 Buddhism pictures, identify unique Buddha statues, extract contextual information, and provide specific facial embedding to first index the archive. The system allows for mobile, on-site search and for exploring similarities of statues in the archive. In addition, we provide search visualization and 3D analysis of the statues. Comment: Demo video at: https://www.youtube.com/watch?v=3XJvLjSWieY
Published: 2019
Full Text: View/download PDF

27. iParaphrasing: Extracting Visually Grounded Paraphrases via an Image

Author: Chu, Chenhui, Otani, Mayu, and Nakashima, Yuta
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL), Computer Science - Multimedia, Machine Learning (cs.LG), Multimedia (cs.MM)
Abstract: A paraphrase is a restatement of the meaning of a text in other words. Paraphrases have been studied to enhance the performance of many natural language processing tasks. In this paper, we propose a novel task, iParaphrasing, to extract visually grounded paraphrases (VGPs), which are different phrasal expressions describing the same visual concept in an image. These extracted VGPs have the potential to improve language and image multimodal tasks such as visual question answering and image captioning. How to model the similarity between VGPs is the key to iParaphrasing. We apply various existing methods, propose a novel neural network-based method with image attention, and report the results of the first attempt toward iParaphrasing. Comment: COLING 2018
Published: 2018
Full Text: View/download PDF

28. The Cases of Two Patients Who Developed Neutropenic Enterocolitis During Induction Therapy for Acute Myelogenous Leukemia

Author: Naito, Yoshiko, Sato, Eiichi, Ikari, Yousuke, Nakashima, Yuta, Kunami, Naoko, Katsuya, Hiroo, Matsuoka, Nobuhide, Takamatsu, Yasushi, Takeshita, Morishige, and Tamura, Kazuo
Subjects: Chemotherapy, Operation, Guidelines, Neutropenic enterocolitis (NE), Febrile neutropenia, Guidelines for febrile neutropenia (FN)
Published: 2013

29. Myosin-II-Mediated Directional Migration of Dictyostelium Cells in Response to Cyclic Stretching of Substratum

Author: Sato, Katsuya, Nakashima, Yuta, and Tsujioka, Masatsune
Published: 2013

30. Learning Joint Representations of Videos and Sentences with Web Image Search

Author: Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, Heikkilä, Janne, and Yokoya, Naokazu
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate to distances. We also adopt the embedding approach, and make the following contributions: First, we utilize web image search in the sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state-of-the-art, although our embeddings were trained for the retrieval tasks. Comment: 16 pages, 4th Workshop on Web-scale Vision and Social Media (VSM), ECCV 2016
Published: 2016

31. Multimedia Signal Processing for Copyright and Privacy Protection

Author: Nakashima, Yuta
Subjects: Background estimation, Intentionally captured human object detection, Multimedia signal processing, privacy protection, Digital audio watermarking, Copyright protection

31 results on '"Nakashima, Yuta"'
