Publication Year Range: Last 10 years / Topic: computer science - computer vision and pattern recognition - Searchworks@Jio Institute Digital Library Search Results

Showing total 160,610 results

Start Over Topic computer science - computer vision and pattern recognition Publication Year Range Last 10 years

160,610 results

1. Position paper: Do not explain (vision models) without context

Author: Tomaszewska, Paulina and Biecek, Przemysław
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Does the stethoscope in the picture make the adjacent person a doctor or a patient? This, of course, depends on the contextual relationship of the two objects. If it is obvious, why don not explanation methods for vision models use contextual information? In this paper, we (1) review the most popular methods of explaining computer vision models by pointing out that they do not take into account context information, (2) provide examples of real-world use cases where spatial context plays a significant role, (3) propose new research directions that may lead to better use of context information in explaining computer vision models, (4) argue that a change in approach to explanations is needed from 'where' to 'how'., Comment: Accepted for ICML 2024
Published: 2024

2. NLLG Quarterly arXiv Report 09/23: What are the most influential current AI Papers?

Author: Zhang, Ran, Kostikova, Aida, Leiter, Christoph, Belouadi, Jonas, Larionov, Daniil, Chen, Yanran, Fresen, Vivian, and Eger, Steffen
Subjects: Computer Science - Digital Libraries, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society, Computer Science - Machine Learning
Abstract: Artificial Intelligence (AI) has witnessed rapid growth, especially in the subfields Natural Language Processing (NLP), Machine Learning (ML) and Computer Vision (CV). Keeping pace with this rapid progress poses a considerable challenge for researchers and professionals in the field. In this arXiv report, the second of its kind, which covers the period from January to September 2023, we aim to provide insights and analysis that help navigate these dynamic areas of AI. We accomplish this by 1) identifying the top-40 most cited papers from arXiv in the given period, comparing the current top-40 papers to the previous report, which covered the period January to June; 2) analyzing dataset characteristics and keyword popularity; 3) examining the global sectoral distribution of institutions to reveal differences in engagement across geographical areas. Our findings highlight the continued dominance of NLP: while only 16% of all submitted papers have NLP as primary category (more than 25% have CV and ML as primary category), 50% of the most cited papers have NLP as primary category, 90% of which target LLMs. Additionally, we show that i) the US dominates among both top-40 and top-9k papers, followed by China; ii) Europe clearly lags behind and is hardly represented in the top-40 most cited papers; iii) US industry is largely overrepresented in the top-40 most influential papers.
Published: 2023

3. Training CLIP models on Data from Scientific Papers

Author: Metzger, Calvin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship of images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores whether limited amounts higher quality data in a specific domain improve the general performance of CLIP models. To this purpose, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwile research direction., Comment: ICCV 2023 Workshop
Published: 2023

4. Tender Notice Extraction from E-papers Using Neural Network

Author: Bhattarai, Ashmin, Sedhai, Anuj, Neupane, Devraj, Khadka, Manish, and Bastola, Rama
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Tender notices are usually sought by most of the companies at regular intervals as a means for obtaining the contracts of various projects. These notices consist of all the required information like description of the work, period of construction, estimated amount of project, etc. In the context of Nepal, tender notices are usually published in national as well as local newspapers. The interested bidders should search all the related tender notices in newspapers. However, it is very tedious for these companies to manually search tender notices in every newspaper and figure out which bid is best suited for them. This project is built with the purpose of solving this tedious task of manually searching the tender notices. Initially, the newspapers are downloaded in PDF format using the selenium library of python. After downloading the newspapers, the e-papers are scanned and tender notices are automatically extracted using a neural network. For extraction purposes, different architectures of CNN namely ResNet, GoogleNet and Xception are used and a model with highest performance has been implemented. Finally, these extracted notices are then published on the website and are accessible to the users. This project is helpful for construction companies as well as contractors assuring quality and efficiency. This project has great application in the field of competitive bidding as well as managing them in a systematic manner.
Published: 2023

5. Discussion Paper: The Threat of Real Time Deepfakes

Author: Frankovits, Guy and Mirsky, Yisroel
Subjects: Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: Generative deep learning models are able to create realistic audio and video. This technology has been used to impersonate the faces and voices of individuals. These ``deepfakes'' are being used to spread misinformation, enable scams, perform fraud, and blackmail the innocent. The technology continues to advance and today attackers have the ability to generate deepfakes in real-time. This new capability poses a significant threat to society as attackers begin to exploit the technology in advances social engineering attacks. In this paper, we discuss the implications of this emerging threat, identify the challenges with preventing these attacks and suggest a better direction for researching stronger defences.
Published: 2023

6. Automatic Detection and Rectification of Paper Receipts on Smartphones

Author: Whittaker, Edward, Tanaka, Masashi, and Kitagishi, Ikuo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction
Abstract: We describe the development of a real-time smartphone app that allows the user to digitize paper receipts in a novel way by "waving" their phone over the receipts and letting the app automatically detect and rectify the receipts for subsequent text recognition. We show that traditional computer vision algorithms for edge and corner detection do not robustly detect the non-linear and discontinuous edges and corners of a typical paper receipt in real-world settings. This is particularly the case when the colors of the receipt and background are similar, or where other interfering rectangular objects are present. Inaccurate detection of a receipt's corner positions then results in distorted images when using an affine projective transformation to rectify the perspective. We propose an innovative solution to receipt corner detection by treating each of the four corners as a unique "object", and training a Single Shot Detection MobileNet object detection model. We use a small amount of real data and a large amount of automatically generated synthetic data that is designed to be similar to real-world imaging scenarios. We show that our proposed method robustly detects the four corners of a receipt, giving a receipt detection accuracy of 85.3% on real-world data, compared to only 36.9% with a traditional edge detection-based approach. Our method works even when the color of the receipt is virtually indistinguishable from the background. Moreover, our method is trained to detect only the corners of the central target receipt and implicitly learns to ignore other receipts, and other rectangular objects. Including synthetic data allows us to train an even better model. These factors are a major advantage over traditional edge detection-based approaches, allowing us to deliver a much better experience to the user.
Published: 2023

7. Hidden Knowledge: Mathematical Methods for the Extraction of the Fingerprint of Medieval Paper from Digital Images

Author: Grossmann, Tamara G., Schönlieb, Carola-Bibiane, and Da Rold, Orietta
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing, Mathematics - Numerical Analysis
Abstract: Medieval paper, a handmade product, is made with a mould which leaves an indelible imprint on the sheet of paper. This imprint includes chain lines, laid lines and watermarks which are often visible on the sheet. Extracting these features allows the identification of paper stock and gives information about chronology, localisation and movement of books and people. Most computational work for feature extraction of paper analysis has so far focused on radiography or transmitted light images. While these imaging methods provide clear visualisation for the features of interest, they are expensive and time consuming in their acquisition and not feasible for smaller institutions. However, reflected light images of medieval paper manuscripts are abundant and possibly cheaper in their acquisition. In this paper, we propose algorithms to detect and extract the laid and chain lines from reflected light images. We tackle the main drawback of reflected light images, that is, the low contrast attenuation of lines and intensity jumps due to noise and degradation, by employing the spectral total variation decomposition and develop methods for subsequent line extraction. Our results clearly demonstrate the feasibility of using reflected light images in paper analysis. This work enables the feature extraction for paper manuscripts that have otherwise not been analysed due to a lack of appropriate images. We also open the door for paper stock identification at scale.
Published: 2023

8. Representation Learning for Tablet and Paper Domain Adaptation in Favor of Online Handwriting Recognition

Author: Ott, Felix, Rügamer, David, Heublein, Lucas, Bischl, Bernd, and Mutschler, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition, 49Q22, 62M10, I.2.4
Abstract: The performance of a machine learning model degrades when it is applied to data from a similar but different domain than the data it has initially been trained on. The goal of domain adaptation (DA) is to mitigate this domain shift problem by searching for an optimal feature transformation to learn a domain-invariant representation. Such a domain shift can appear in handwriting recognition (HWR) applications where the motion pattern of the hand and with that the motion pattern of the pen is different for writing on paper and on tablet. This becomes visible in the sensor data for online handwriting (OnHW) from pens with integrated inertial measurement units. This paper proposes a supervised DA approach to enhance learning for OnHW recognition between tablet and paper data. Our method exploits loss functions such as maximum mean discrepancy and correlation alignment to learn a domain-invariant feature representation (i.e., similar covariances between tablet and paper features). We use a triplet loss that takes negative samples of the auxiliary domain (i.e., paper samples) to increase the amount of samples of the tablet dataset. We conduct an evaluation on novel sequence-based OnHW datasets (i.e., words) and show an improvement on the paper domain with an early fusion strategy by using pairwise learning., Comment: Accepted at IAPR Intl. Workshop on Multimodal Pattern Recognition of Social Signals in Human Computer Interaction (MPRSS), Montreal, Canada, August 2022
Published: 2023

9. Image Denoising: The Deep Learning Revolution and Beyond -- A Survey Paper --

Author: Elad, Michael, Kawar, Bahjat, and Vaksman, Gregory
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Image denoising (removal of additive white Gaussian noise from an image) is one of the oldest and most studied problems in image processing. An extensive work over several decades has led to thousands of papers on this subject, and to many well-performing algorithms for this task. Indeed, 10 years ago, these achievements have led some researchers to suspect that "Denoising is Dead", in the sense that all that can be achieved in this domain has already been obtained. However, this turned out to be far from the truth, with the penetration of deep learning (DL) into image processing. The era of DL brought a revolution to image denoising, both by taking the lead in today's ability for noise removal in images, and by broadening the scope of denoising problems being treated. Our paper starts by describing this evolution, highlighting in particular the tension and synergy that exist between classical approaches and modern DL-based alternatives in design of image denoisers. The recent transitions in the field of image denoising go far beyond the ability to design better denoisers. In the 2nd part of this paper we focus on recently discovered abilities and prospects of image denoisers. We expose the possibility of using denoisers to serve other problems, such as regularizing general inverse problems and serving as the prime engine in diffusion-based image synthesis. We also unveil the idea that denoising and other inverse problems might not have a unique solution as common algorithms would have us believe. Instead, we describe constructive ways to produce randomized and diverse high quality results for inverse problems, all fueled by the progress that DL brought to image denoising. This survey paper aims to provide a broad view of the history of image denoising and closely related topics. Our aim is to give a better context to recent discoveries, and to the influence of DL in our domain.
Published: 2023

10. From Malware Samples to Fractal Images: A New Paradigm for Classification. (Version 2.0, Previous version paper name: Have you ever seen malware?)

Author: Zelinka, Ivan, Szczypka, Miloslav, Plucar, Jan, and Kuznetsov, Nikolay
Subjects: Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition, 68M25, I.2.0, I.4.9, I.m, J.m
Abstract: To date, a large number of research papers have been written on the classification of malware, its identification, classification into different families and the distinction between malware and goodware. These works have been based on captured malware samples and have attempted to analyse malware and goodware using various techniques, including techniques from the field of artificial intelligence. For example, neural networks have played a significant role in these classification methods. Some of this work also deals with analysing malware using its visualisation. These works usually convert malware samples capturing the structure of malware into image structures, which are then the object of image processing. In this paper, we propose a very unconventional and novel approach to malware visualisation based on dynamic behaviour analysis, with the idea that the images, which are visually very interesting, are then used to classify malware concerning goodware. Our approach opens an extensive topic for future discussion and provides many new directions for research in malware analysis and classification, as discussed in conclusion. The results of the presented experiments are based on a database of 6 589 997 goodware, 827 853 potentially unwanted applications and 4 174 203 malware samples provided by ESET and selected experimental data (images, generating polynomial formulas and software generating images) are available on GitHub for interested readers. Thus, this paper is not a comprehensive compact study that reports the results obtained from comparative experiments but rather attempts to show a new direction in the field of visualisation with possible applications in malware analysis., Comment: This paper is under review; the section describing conversion from malware structure to fractal figure is temporarily erased here to protect our idea. It will be replaced by a full version when accepted
Published: 2022

11. Auto Lead Extraction and Digitization of ECG Paper Records using cGAN

Author: Patil, Rupali, Narkhede, Bhairav, Varma, Shubham, Suraliya, Shreyans, and Mehendale, Ninad
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Purpose: An Electrocardiogram (ECG) is the simplest and fastest bio-medical test that is used to detect any heart-related disease. ECG signals are generally stored in paper form, which makes it difficult to store and analyze the data. While capturing ECG leads from paper ECG records, a lot of background information is also captured, which results in incorrect data interpretation. Methods: We propose a deep learning-based model for individually extracting all 12 leads from 12-lead ECG images captured using a camera. To simplify the analysis of the ECG and the calculation of complex parameters, we also propose a method to convert the paper ECG format into a storable digital format. The You Only Look Once, Version 3 (YOLOv3) algorithm has been used to extract the leads present in the image. These leads are then passed on to another deep learning model which separates the ECG signal and background from the single-lead image. After that, vertical scanning is performed on the ECG signal to convert it into a 1-Dimensional (1D) digital form. To perform the task of digitalization, we used the pix-2-pix deep learning model and binarized the ECG signals. Results: Our proposed method was able to achieve an accuracy of 97.4 %. Conclusion: The information on the paper ECG fades away over time. Hence, the digitized ECG signals make it possible to store the records and access them anytime. This proves highly beneficial for heart patients who require frequent ECG reports. The stored data can also be useful for research purposes, as this data can be used to develop computer algorithms that are capable of analyzing the data.
Published: 2022

12. On the use of learning-based forecasting methods for ameliorating fashion business processes: A position paper

Author: Skenderi, Geri, Joppi, Christian, Denitto, Matteo, and Cristani, Marco
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: The fashion industry is one of the most active and competitive markets in the world, manufacturing millions of products and reaching large audiences every year. A plethora of business processes are involved in this large-scale industry, but due to the generally short life-cycle of clothing items, supply-chain management and retailing strategies are crucial for good market performance. Correctly understanding the wants and needs of clients, managing logistic issues and marketing the correct products are high-level problems with a lot of uncertainty associated to them given the number of influencing factors, but most importantly due to the unpredictability often associated with the future. It is therefore straightforward that forecasting methods, which generate predictions of the future, are indispensable in order to ameliorate all the various business processes that deal with the true purpose and meaning of fashion: having a lot of people wear a particular product or style, rendering these items, people and consequently brands fashionable. In this paper, we provide an overview of three concrete forecasting tasks that any fashion company can apply in order to improve their industrial and market impact. We underline advances and issues in all three tasks and argue about their importance and the impact they can have at an industrial level. Finally, we highlight issues and directions of future work, reflecting on how learning-based forecasting methods can further aid the fashion industry., Comment: 2nd International Workshop on Industrial Machine Learning @ ICPR 2022
Published: 2022

13. Attention is All They Need: Exploring the Media Archaeology of the Computer Vision Research Paper

Author: Goree, Samuel, Appleby, Gabriel, Crandall, David, and Su, Norman
Subjects: Computer Science - Computers and Society, Computer Science - Computer Vision and Pattern Recognition, K.7.m
Abstract: Research papers, in addition to textual documents, are a designed interface through which researchers communicate. Recently, rapid growth has transformed that interface in many fields of computing. In this work, we examine the effects of this growth from a media archaeology perspective, through the changes to figures and tables in research papers. Specifically, we study these changes in computer vision over the past decade, as the deep learning revolution has driven unprecedented growth in the discipline. We ground our investigation through interviews with veteran researchers spanning computer vision, graphics and visualization. Our analysis focuses on the research attention economy: how research paper elements contribute towards advertising, measuring and disseminating an increasingly commodified ``contribution.'' Through this work, we seek to motivate future discussion surrounding the design of both the research paper itself as well as the larger sociotechnical research publishing system, including tools for finding, reading and writing research papers.
Published: 2022

14. FPSRS: A Fusion Approach for Paper Submission Recommendation System

Author: Huynh, Son T., Dang, Nhi, Nguyen, Dac H., Huynh, Phong T., and Nguyen, Binh T.
Subjects: Computer Science - Information Retrieval, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recommender systems have been increasingly popular in entertainment and consumption and are evident in academics, especially for applications that suggest submitting scientific articles to scientists. However, because of the various acceptance rates, impact factors, and rankings in different publishers, searching for a proper venue or journal to submit a scientific work usually takes a lot of time and effort. In this paper, we aim to present two newer approaches extended from our paper [13] presented at the conference IAE/AIE 2021 by employing RNN structures besides using Conv1D. In addition, we also introduce a new method, namely DistilBertAims, using DistillBert for two cases of uppercase and lower-case words to vectorize features such as Title, Abstract, and Keywords, and then use Conv1d to perform feature extraction. Furthermore, we propose a new calculation method for similarity score for Aim & Scope with other features; this helps keep the weights of similarity score calculation continuously updated and then continue to fit more data. The experimental results show that the second approach could obtain a better performance, which is 62.46% and 12.44% higher than the best of the previous study [13] in terms of the Top 1 accuracy., Comment: 24 pages, 10 figures, 8 tables
Published: 2022

15. SimCPSR: Simple Contrastive Learning for Paper Submission Recommendation System

Author: Le, Duc H., Doan, Tram T., Huynh, Son T., and Nguyen, Binh T.
Subjects: Computer Science - Information Retrieval, Computer Science - Computer Vision and Pattern Recognition
Abstract: The recommendation system plays a vital role in many areas, especially academic fields, to support researchers in submitting and increasing the acceptance of their work through the conference or journal selection process. This study proposes a transformer-based model using transfer learning as an efficient approach for the paper submission recommendation system. By combining essential information (such as the title, the abstract, and the list of keywords) with the aims and scopes of journals, the model can recommend the Top K journals that maximize the acceptance of the paper. Our model had developed through two states: (i) Fine-tuning the pre-trained language model (LM) with a simple contrastive learning framework. We utilized a simple supervised contrastive objective to fine-tune all parameters, encouraging the LM to learn the document representation effectively. (ii) The fine-tuned LM was then trained on different combinations of the features for the downstream task. This study suggests a more advanced method for enhancing the efficiency of the paper submission recommendation system compared to previous approaches when we respectively achieve 0.5173, 0.8097, 0.8862, 0.9496 for Top 1, 3, 5, and 10 accuracies on the test set for combining the title, abstract, and keywords as input features. Incorporating the journals' aims and scopes, our model shows an exciting result by getting 0.5194, 0.8112, 0.8866, and 0.9496 respective to Top 1, 3, 5, and 10., Comment: 13 pages, 1 table, 4 figures
Published: 2022

16. Community-Driven Comprehensive Scientific Paper Summarization: Insight from cvpaper.challenge

Author: Yamamoto, Shintaro, Kataoka, Hirokatsu, Suzuki, Ryota, Shinagawa, Seitaro, and Morishima, Shigeo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: The present paper introduces a group activity involving writing summaries of conference proceedings by volunteer participants. The rapid increase in scientific papers is a heavy burden for researchers, especially non-native speakers, who need to survey scientific literature. To alleviate this problem, we organized a group of non-native English speakers to write summaries of papers presented at a computer vision conference to share the knowledge of the papers read by the group. We summarized a total of 2,000 papers presented at the Conference on Computer Vision and Pattern Recognition, a top-tier conference on computer vision, in 2019 and 2020. We quantitatively analyzed participants' selection regarding which papers they read among the many available papers. The experimental results suggest that we can summarize a wide range of papers without asking participants to read papers unrelated to their interests.
Published: 2022

17. Best Practices and Scoring System on Reviewing A.I. based Medical Imaging Papers: Part 1 Classification

Author: Kline, Timothy L., Kitamura, Felipe, Pan, Ian, Korchi, Amine M., Tenenholtz, Neil, Moy, Linda, Gichoya, Judy Wawira, Santos, Igor, Blumer, Steven, Hwang, Misha Ysabel, Git, Kim-Ann, Shroff, Abishek, Walach, Elad, Shih, George, and Langer, Steve
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: With the recent advances in A.I. methodologies and their application to medical imaging, there has been an explosion of related research programs utilizing these techniques to produce state-of-the-art classification performance. Ultimately, these research programs culminate in submission of their work for consideration in peer reviewed journals. To date, the criteria for acceptance vs. rejection is often subjective; however, reproducible science requires reproducible review. The Machine Learning Education Sub-Committee of SIIM has identified a knowledge gap and a serious need to establish guidelines for reviewing these studies. Although there have been several recent papers with this goal, this present work is written from the machine learning practitioners standpoint. In this series, the committee will address the best practices to be followed in an A.I.-based study and present the required sections in terms of examples and discussion of what should be included to make the studies cohesive, reproducible, accurate, and self-contained. This first entry in the series focuses on the task of image classification. Elements such as dataset curation, data pre-processing steps, defining an appropriate reference standard, data partitioning, model architecture and training are discussed. The sections are presented as they would be detailed in a typical manuscript, with content describing the necessary information that should be included to make sure the study is of sufficient quality to be considered for publication. The goal of this series is to provide resources to not only help improve the review process for A.I.-based medical imaging papers, but to facilitate a standard for the information that is presented within all components of the research study. We hope to provide quantitative metrics in what otherwise may be a qualitative review process.
Published: 2022

18. A new baseline for retinal vessel segmentation: Numerical identification and correction of methodological inconsistencies affecting 100+ papers

Author: Kovács, György and Fazekas, Attila
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: In the last 15 years, the segmentation of vessels in retinal images has become an intensively researched problem in medical imaging, with hundreds of algorithms published. One of the de facto benchmarking data sets of vessel segmentation techniques is the DRIVE data set. Since DRIVE contains a predefined split of training and test images, the published performance results of the various segmentation techniques should provide a reliable ranking of the algorithms. Including more than 100 papers in the study, we performed a detailed numerical analysis of the coherence of the published performance scores. We found inconsistencies in the reported scores related to the use of the field of view (FoV), which has a significant impact on the performance scores. We attempted to eliminate the biases using numerical techniques to provide a more realistic picture of the state of the art. Based on the results, we have formulated several findings, most notably: despite the well-defined test set of DRIVE, most rankings in published papers are based on non-comparable figures; in contrast to the near-perfect accuracy scores reported in the literature, the highest accuracy score achieved to date is 0.9582 in the FoV region, which is 1% higher than that of human annotators. The methods we have developed for identifying and eliminating the evaluation biases can be easily applied to other domains where similar problems may arise.
Published: 2021

19. Starkit: RoboCup Humanoid KidSize 2021 Worldwide Champion Team Paper

Author: Davydenko, Egor, Khokhlov, Ivan, Litvinenko, Vladimir, Ryakin, Ilya, Osokin, Ilya, and Babaev, Azer
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, 68T40, D.2.0
Abstract: This article is devoted to the features that were under development between RoboCup 2019 Sydney and RoboCup 2021 Worldwide. These features include vision-related matters, such as detection and localization, mechanical and algorithmic novelties. Since the competition was held virtually, the simulation-specific features are also considered in the article. We give an overview of the approaches that were tried out along with the analysis of their preconditions, perspectives and the evaluation of their performance., Comment: 15 pages, 10 figures
Published: 2021

20. White Paper Assistance: A Step Forward Beyond the Shortcut Learning

Author: Cheng, Xuan, Xie, Tianshu, Wang, Xiaomin, Deng, Jiali, Liu, Minghui, and Liu, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The promising performances of CNNs often overshadow the need to examine whether they are doing in the way we are actually interested. We show through experiments that even over-parameterized models would still solve a dataset by recklessly leveraging spurious correlations, or so-called 'shortcuts'. To combat with this unintended propensity, we borrow the idea of printer test page and propose a novel approach called White Paper Assistance. Our proposed method involves the white paper to detect the extent to which the model has preference for certain characterized patterns and alleviates it by forcing the model to make a random guess on the white paper. We show the consistent accuracy improvements that are manifest in various architectures, datasets and combinations with other techniques. Experiments have also demonstrated the versatility of our approach on fine-grained recognition, imbalanced classification and robustness to corruptions., Comment: 10 pages, 4 figures
Published: 2021

21. A White Paper on Neural Network Quantization

Author: Nagel, Markus, Fournarakis, Marios, Amjad, Rana Ali, Bondarenko, Yelysei, van Baalen, Mart, and Blankevoort, Tijmen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance while maintaining low-bit weights and activations. We start with a hardware motivated introduction to quantization and then consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware-Training (QAT). PTQ requires no re-training or labelled data and is thus a lightweight push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with close to floating-point accuracy. QAT requires fine-tuning and access to labeled training data but enables lower bit quantization with competitive results. For both solutions, we provide tested pipelines based on existing literature and extensive experimentation that lead to state-of-the-art performance for common deep learning models and tasks.
Published: 2021

22. A remark on a paper of Krotov and Hopfield [arXiv:2008.06996]

Author: Tang, Fei and Kopp, Michael
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: In their recent paper titled "Large Associative Memory Problem in Neurobiology and Machine Learning" [arXiv:2008.06996] the authors gave a biologically plausible microscopic theory from which one can recover many dense associative memory models discussed in the literature. We show that the layers of the recent "MLP-mixer" [arXiv:2105.01601] as well as the essentially equivalent model in [arXiv:2105.02723] are amongst them., Comment: 1 page, 8 formulae
Published: 2021

23. Quantum-Classical Hybrid Machine Learning for Image Classification (ICCAD Special Session Paper)

Author: Alam, Mahabubul, Kundu, Satwik, Topaloglu, Rasit Onur, and Ghosh, Swaroop
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Image classification is a major application domain for conventional deep learning (DL). Quantum machine learning (QML) has the potential to revolutionize image classification. In any typical DL-based image classification, we use convolutional neural network (CNN) to extract features from the image and multi-layer perceptron network (MLP) to create the actual decision boundaries. On one hand, QML models can be useful in both of these tasks. Convolution with parameterized quantum circuits (Quanvolution) can extract rich features from the images. On the other hand, quantum neural network (QNN) models can create complex decision boundaries. Therefore, Quanvolution and QNN can be used to create an end-to-end QML model for image classification. Alternatively, we can extract image features separately using classical dimension reduction techniques such as, Principal Components Analysis (PCA) or Convolutional Autoencoder (CAE) and use the extracted features to train a QNN. We review two proposals on quantum-classical hybrid ML models for image classification namely, Quanvolutional Neural Network and dimension reduction using a classical algorithm followed by QNN. Particularly, we make a case for trainable filters in Quanvolution and CAE-based feature extraction for image datasets (instead of dimension reduction using linear transformations such as, PCA). We discuss various design choices, potential opportunities, and drawbacks of these models. We also release a Python-based framework to create and explore these hybrid models with a variety of design choices.
Published: 2021

24. The MICCAI Hackathon on reproducibility, diversity, and selection of papers at the MICCAI conference

Author: Balsiger, Fabian, Jungo, Alain, J, Naren Akash R, Chen, Jianan, Ezhov, Ivan, Liu, Shengnan, Ma, Jun, Paetzold, Johannes C., R, Vishva Saravanan, Sekuboyina, Anjany, Shit, Suprosanna, Suter, Yannick, Yekini, Moshood, Zeng, Guodong, and Rempfler, Markus
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The MICCAI conference has encountered tremendous growth over the last years in terms of the size of the community, as well as the number of contributions and their technical success. With this growth, however, come new challenges for the community. Methods are more difficult to reproduce and the ever-increasing number of paper submissions to the MICCAI conference poses new questions regarding the selection process and the diversity of topics. To exchange, discuss, and find novel and creative solutions to these challenges, a new format of a hackathon was initiated as a satellite event at the MICCAI 2020 conference: The MICCAI Hackathon. The first edition of the MICCAI Hackathon covered the topics reproducibility, diversity, and selection of MICCAI papers. In the manner of a small think-tank, participants collaborated to find solutions to these challenges. In this report, we summarize the insights from the MICCAI Hackathon into immediate and long-term measures to address these challenges. The proposed measures can be seen as starting points and guidelines for discussions and actions to possibly improve the MICCAI conference with regards to reproducibility, diversity, and selection of papers., Comment: Revision of discussion; update e-mail address of one author
Published: 2021

25. White Paper: Challenges and Considerations for the Creation of a Large Labelled Repository of Online Videos with Questionable Content

Author: Solorio, Thamar, Shafaei, Mahsa, Smailis, Christos, Diab, Mona, Giannakopoulos, Theodore, Ji, Heng, Liu, Yang, Mihalcea, Rada, Muresan, Smaranda, and Kakadiaris, Ioannis
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society
Abstract: This white paper presents a summary of the discussions regarding critical considerations to develop an extensive repository of online videos annotated with labels indicating questionable content. The main discussion points include: 1) the type of appropriate labels that will result in a valuable repository for the larger AI community; 2) how to design the collection and annotation process, as well as the distribution of the corpus to maximize its potential impact; and, 3) what actions we can take to reduce risk of trauma to annotators.
Published: 2021

26. Automatic Test Suite Generation for Key-Points Detection DNNs using Many-Objective Search (Experience Paper)

Author: Haq, Fitash Ul, Shin, Donghwan, Briand, Lionel C., Stifter, Thomas, and Wang, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Software Engineering
Abstract: Automatically detecting the positions of key-points (e.g., facial key-points or finger key-points) in an image is an essential problem in many applications, such as driver's gaze detection and drowsiness detection in automated driving systems. With the recent advances of Deep Neural Networks (DNNs), Key-Points detection DNNs (KP-DNNs) have been increasingly employed for that purpose. Nevertheless, KP-DNN testing and validation have remained a challenging problem because KP-DNNs predict many independent key-points at the same time -- where each individual key-point may be critical in the targeted application -- and images can vary a great deal according to many factors. In this paper, we present an approach to automatically generate test data for KP-DNNs using many-objective search. In our experiments, focused on facial key-points detection DNNs developed for an industrial automotive application, we show that our approach can generate test suites to severely mispredict, on average, more than 93% of all key-points. In comparison, random search-based test data generation can only severely mispredict 41% of them. Many of these mispredictions, however, are not avoidable and should not therefore be considered failures. We also empirically compare state-of-the-art, many-objective search algorithms and their variants, tailored for test suite generation. Furthermore, we investigate and demonstrate how to learn specific conditions, based on image characteristics (e.g., head posture and skin color), that lead to severe mispredictions. Such conditions serve as a basis for risk analysis or DNN retraining., Comment: to appear in ISSTA 2021
Published: 2020
Full Text: View/download PDF

27. Recognizing Families In the Wild: White Paper for the 4th Edition Data Challenge

Author: Robinson, Joseph P., Yin, Yu, Khan, Zaid, Shao, Ming, Xia, Siyu, Stopa, Michael, Timoner, Samson, Turk, Matthew A., Chellappa, Rama, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recognizing Families In the Wild (RFIW): an annual large-scale, multi-track automatic kinship recognition evaluation that supports various visual kin-based problems on scales much higher than ever before. Organized in conjunction with the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG) as a Challenge, RFIW provides a platform for publishing original work and the gathering of experts for a discussion of the next steps. This paper summarizes the supported tasks (i.e., kinship verification, tri-subject verification, and search & retrieval of missing children) in the evaluation protocols, which include the practical motivation, technical background, data splits, metrics, and benchmark results. Furthermore, top submissions (i.e., leader-board stats) are listed and reviewed as a high-level analysis on the state of the problem. In the end, the purpose of this paper is to describe the 2020 RFIW challenge, end-to-end, along with forecasts in promising future directions., Comment: White Paper for challenge in conjunction with 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)
Published: 2020
Full Text: View/download PDF

28. Medical Imaging with Deep Learning: MIDL 2020 -- Short Paper Track

Author: Arbel, Tal, Ayed, Ismail Ben, de Bruijne, Marleen, Descoteaux, Maxime, Lombaert, Herve, and Pal, Chris
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This compendium gathers all the accepted extended abstracts from the Third International Conference on Medical Imaging with Deep Learning (MIDL 2020), held in Montreal, Canada, 6-9 July 2020. Note that only accepted extended abstracts are listed here, the Proceedings of the MIDL 2020 Full Paper Track are published in the Proceedings of Machine Learning Research (PMLR)., Comment: Accepted extended abstracts can also be found at https://openreview.net/group?id=MIDL.io/2020/Conference#abstract-accept-papers
Published: 2020

29. Modelling curvature of a bent paper leaf

Author: Goteti, Sasikanth Raghava
Subjects: Computer Science - Graphics, Computer Science - Computer Vision and Pattern Recognition
Abstract: In this article, we briefly describe various tools and approaches that algebraic geometry has to offer to straighten bent objects. Throughout this article we will consider a specific example of a bent or curved piece of paper which in our case acts very much like an elastica curve. We conclude this article with a suggestion to algebraic geometry as a viable and fast performance alternative of neural networks in vision and machine learning. The purpose of this article is not to build a full blown framework but to show possibility of using algebraic geometry as an alternative to neural networks for recognizing or extracting features on manifolds., Comment: 5 pages , 5 figures
Published: 2019
Full Text: View/download PDF

30. Image Matching across Wide Baselines: From Paper to Practice

Author: Jin, Yuhe, Mishkin, Dmytro, Mishchuk, Anastasiia, Matas, Jiri, Fua, Pascal, Yi, Kwang Moo, and Trulls, Eduard
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce a comprehensive benchmark for local features and robust estimation algorithms, focusing on the downstream task -- the accuracy of the reconstructed camera pose -- as our primary metric. Our pipeline's modular structure allows easy integration, configuration, and combination of different methods and heuristics. This is demonstrated by embedding dozens of popular algorithms and evaluating them, from seminal works to the cutting edge of machine learning research. We show that with proper settings, classical solutions may still outperform the perceived state of the art. Besides establishing the actual state of the art, the conducted experiments reveal unexpected properties of Structure from Motion (SfM) pipelines that can help improve their performance, for both algorithmic and learned methods. Data and code are online https://github.com/vcg-uvic/image-matching-benchmark, providing an easy-to-use and flexible framework for the benchmarking of local features and robust estimation methods, both alongside and against top-performing methods. This work provides a basis for the Image Matching Challenge https://vision.uvic.ca/image-matching-challenge., Comment: Added: KeyNet-SOSNet, AffNet-HardNet, TFeat, MKD from kornia
Published: 2020
Full Text: View/download PDF

31. Additional Baseline Metrics for the paper 'Extended YouTube Faces: a Dataset for Heterogeneous Open-Set Face Identification'

Author: Ferrari, Claudio, Berretti, Stefano, and Del Bimbo, Alberto
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this report, we provide additional and corrected results for the paper "Extended YouTube Faces: a Dataset for Heterogeneous Open-Set Face Identification". After further investigations, we discovered and corrected wrongly labeled images and incorrect identities. This forced us to re-generate the evaluation protocol for the new data; in doing so, we also reproduced and extended the experimental results with other standard metrics and measures used in the literature. The reader can refer to the original paper for additional details regarding the data collection procedure and recognition pipeline., Comment: 3 pages, 2 figures
Published: 2019

32. Deep Paper Gestalt

Author: Huang, Jia-Bin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent years have witnessed a significant increase in the number of paper submissions to computer vision conferences. The sheer volume of paper submissions and the insufficient number of competent reviewers cause a considerable burden for the current peer review system. In this paper, we learn a classifier to predict whether a paper should be accepted or rejected based solely on the visual appearance of the paper (i.e., the gestalt of a paper). Experimental results show that our classifier can safely reject 50% of the bad papers while wrongly reject only 0.4% of the good papers, and thus dramatically reduce the workload of the reviewers. We also provide tools for providing suggestions to authors so that they can improve the gestalt of their papers., Comment: Project page: https://github.com/vt-vl-lab/paper-gestalt
Published: 2018

33. Automatic Paper Summary Generation from Visual and Textual Information

Author: Yamamoto, Shintaro, Fukuhara, Yoshihiro, Suzuki, Ryota, Morishima, Shigeo, and Kataoka, Hirokatsu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Due to the recent boom in artificial intelligence (AI) research, including computer vision (CV), it has become impossible for researchers in these fields to keep up with the exponentially increasing number of manuscripts. In response to this situation, this paper proposes the paper summary generation (PSG) task using a simple but effective method to automatically generate an academic paper summary from raw PDF data. We realized PSG by combination of vision-based supervised components detector and language-based unsupervised important sentence extractor, which is applicable for a trained format of manuscripts. We show the quantitative evaluation of ability of simple vision-based components extraction, and the qualitative evaluation that our system can extract both visual item and sentence that are helpful for understanding. After processing via our PSG, the 979 manuscripts accepted by the Conference on Computer Vision and Pattern Recognition (CVPR) 2018 are available. It is believed that the proposed method will provide a better way for researchers to stay caught with important academic papers., Comment: International Conference on Machine Vision 2018, Munich, Germany
Published: 2018

34. Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper)

Author: Castro, Santiago, Hazarika, Devamanyu, Pérez-Rosas, Verónica, Zimmermann, Roger, Mihalcea, Rada, and Poria, Soujanya
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Sarcasm is often expressed through several verbal and non-verbal cues, e.g., a change of tone, overemphasis in a word, a drawn-out syllable, or a straight looking face. Most of the recent work in sarcasm detection has been carried out on textual data. In this paper, we argue that incorporating multimodal cues can improve the automatic classification of sarcasm. As a first step towards enabling the development of multimodal approaches for sarcasm detection, we propose a new sarcasm dataset, Multimodal Sarcasm Detection Dataset (MUStARD), compiled from popular TV shows. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context of historical utterances in the dialogue, which provides additional information on the scenario where the utterance occurs. Our initial results show that the use of multimodal information can reduce the relative error rate of sarcasm detection by up to 12.9% in F-score when compared to the use of individual modalities. The full dataset is publicly available for use at https://github.com/soujanyaporia/MUStARD, Comment: Accepted at ACL 2019
Published: 2019

35. Sketch2code: Generating a website from a paper mockup

Author: Robinson, Alex
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: An early stage of developing user-facing applications is creating a wireframe to layout the interface. Once a wireframe has been created it is given to a developer to implement in code. Developing boiler plate user interface code is time consuming work but still requires an experienced developer. In this dissertation we present two approaches which automates this process, one using classical computer vision techniques, and another using a novel application of deep semantic segmentation networks. We release a dataset of websites which can be used to train and evaluate these approaches. Further, we have designed a novel evaluation framework which allows empirical evaluation by creating synthetic sketches. Our evaluation illustrates that our deep learning approach outperforms our classical computer vision approach and we conclude that deep learning is the most promising direction for future research.
Published: 2019

36. Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer's Disease

Author: Samper-González, Jorge, Burgos, Ninon, Fontanella, Sabrina, Bertin, Hugo, Habert, Marie-Odile, Durrleman, Stanley, Evgeniou, Theodoros, and Colliot, Olivier
Subjects: Statistics - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Quantitative Biology - Neurons and Cognition, Quantitative Biology - Quantitative Methods
Abstract: In recent years, the number of papers on Alzheimer's disease classification has increased dramatically, generating interesting methodological ideas on the use machine learning and feature extraction methods. However, practical impact is much more limited and, eventually, one could not tell which of these approaches are the most efficient. While over 90\% of these works make use of ADNI an objective comparison between approaches is impossible due to variations in the subjects included, image pre-processing, performance metrics and cross-validation procedures. In this paper, we propose a framework for reproducible classification experiments using multimodal MRI and PET data from ADNI. The core components are: 1) code to automatically convert the full ADNI database into BIDS format; 2) a modular architecture based on Nipype in order to easily plug-in different classification and feature extraction tools; 3) feature extraction pipelines for MRI and PET data; 4) baseline classification approaches for unimodal and multimodal features. This provides a flexible framework for benchmarking different feature extraction and classification tools in a reproducible manner. We demonstrate its use on all (1519) baseline T1 MR images and all (1102) baseline FDG PET images from ADNI 1, GO and 2 with SPM-based feature extraction pipelines and three different classification techniques (linear SVM, anatomically regularized SVM and multiple kernel learning SVM). The highest accuracies achieved were: 91% for AD vs CN, 83% for MCIc vs CN, 75% for MCIc vs MCInc, 94% for AD-A$\beta$+ vs CN-A$\beta$- and 72% for MCIc-A$\beta$+ vs MCInc-A$\beta$+. The code is publicly available at https://gitlab.icm-institute.org/aramislab/AD-ML (depends on the Clinica software platform, publicly available at http://www.clinica.run).
Published: 2017
Full Text: View/download PDF

37. cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey

Author: Kataoka, Hirokatsu, Shirakabe, Soma, He, Yun, Ueta, Shunya, Suzuki, Teppei, Abe, Kaori, Kanezaki, Asako, Morita, Shin'ichiro, Yabe, Toshiyuki, Kanehara, Yoshihiro, Yatsuyanagi, Hiroya, Maruyama, Shinya, Takasawa, Ryosuke, Fuchida, Masataka, Miyashita, Yudai, Okayasu, Kazushige, and Matsuzaki, Yuta
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The paper gives futuristic challenges disscussed in the cvpaper.challenge. In 2015 and 2016, we thoroughly study 1,600+ papers in several conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Published: 2017

38. A comment on the paper Prediction of Kidney Function from Biopsy Images using Convolutional Neural Networks

Author: dos-Santos, Washington LC, Duarte, Angelo A, and de Freitas, Luiz AR
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This letter presente a comment on the paper Prediction of Kidney Function from Biopsy Images using Convolutional Neural Networks by Ledbetter et al. (2017), Comment: 2 pages, 1 figure
Published: 2017

39. Visual Recognition of Paper Analytical Device Images for Detection of Falsified Pharmaceuticals

Author: Banerjee, Sandipan, Sweet, James, Sweet, Christopher, and Lieberman, Marya
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Falsification of medicines is a big problem in many developing countries, where technological infrastructure is inadequate to detect these harmful products. We have developed a set of inexpensive paper cards, called Paper Analytical Devices (PADs), which can efficiently classify drugs based on their chemical composition, as a potential solution to the problem. These cards have different reagents embedded in them which produce a set of distinctive color descriptors upon reacting with the chemical compounds that constitute pharmaceutical dosage forms. If a falsified version of the medicine lacks the active ingredient or includes substitute fillers, the difference in color is perceivable by humans. However, reading the cards with accuracy takes training and practice, which may hamper their scaling and implementation in low resource settings. To deal with this, we have developed an automatic visual recognition system to read the results from the PAD images. At first, the optimal set of reagents was found by running singular value decomposition on the intensity values of the color tones in the card images. A dataset of cards embedded with these reagents is produced to generate the most distinctive results for a set of 26 different active pharmaceutical ingredients (APIs) and excipients. Then, we train two popular convolutional neural network (CNN) models, with the card images. We also extract some "hand-crafted" features from the images and train a nearest neighbor classifier and a non-linear support vector machine with them. On testing, higher-level features performed much better in accurately classifying the PAD images, with the CNN models reaching the highest average accuracy of over 94\%., Comment: in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2016
Published: 2017

40. Learning to Generate Posters of Scientific Papers by Probabilistic Graphical Models

Author: Qiang, Yu-ting, Fu, Yanwei, Yu, Xiao, Guo, Yanwen, Zhou, Zhi-Hua, and Sigal, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Human-Computer Interaction, Computer Science - Multimedia
Abstract: Researchers often summarize their work in the form of scientific posters. Posters provide a coherent and efficient way to convey core ideas expressed in scientific papers. Generating a good scientific poster, however, is a complex and time consuming cognitive task, since such posters need to be readable, informative, and visually aesthetic. In this paper, for the first time, we study the challenging problem of learning to generate posters from scientific papers. To this end, a data-driven framework, that utilizes graphical models, is proposed. Specifically, given content to display, the key elements of a good poster, including attributes of each panel and arrangements of graphical elements are learned and inferred from data. During the inference stage, an MAP inference framework is employed to incorporate some design principles. In order to bridge the gap between panel attributes and the composition within each panel, we also propose a recursive page splitting algorithm to generate the panel layout for a poster. To learn and validate our model, we collect and release a new benchmark dataset, called NJU-Fudan Paper-Poster dataset, which consists of scientific papers and corresponding posters with exhaustively labelled panels and attributes. Qualitative and quantitative results indicate the effectiveness of our approach., Comment: 10 pages, submission to IEEE TPAMI. arXiv admin note: text overlap with arXiv:1604.01219
Published: 2017

41. Fusion of Heterogeneous Data in Convolutional Networks for Urban Semantic Labeling (Invited Paper)

Author: Audebert, Nicolas, Saux, Bertrand Le, and Lefèvre, Sébastien
Subjects: Computer Science - Neural and Evolutionary Computing, Computer Science - Computer Vision and Pattern Recognition
Abstract: In this work, we present a novel module to perform fusion of heterogeneous data using fully convolutional networks for semantic labeling. We introduce residual correction as a way to learn how to fuse predictions coming out of a dual stream architecture. Especially, we perform fusion of DSM and IRRG optical data on the ISPRS Vaihingen dataset over a urban area and obtain new state-of-the-art results., Comment: Joint Urban Remote Sensing Event (JURSE), Mar 2017, Dubai, United Arab Emirates. Joint Urban Remote Sensing Event 2017
Published: 2017

42. Position paper: Towards an observer-oriented theory of shape comparison

Author: Frosini, Patrizio
Subjects: Computer Science - Computational Geometry, Computer Science - Computer Vision and Pattern Recognition, Mathematics - Algebraic Topology, 55N35 (Primary), 22F99, 47H09, 54H15, 57S10, 68U05, 65D18 (Secondary), I.4.7, I.5.1
Abstract: In this position paper we suggest a possible metric approach to shape comparison that is based on a mathematical formalization of the concept of observer, seen as a collection of suitable operators acting on a metric space of functions. These functions represent the set of data that are accessible to the observer, while the operators describe the way the observer elaborates the data and enclose the invariance that he/she associates with them. We expose this model and illustrate some theoretical reasons that justify its possible use for shape comparison., Comment: Preprint of the position paper submitted to the Eurographics Workshop on 3D Object Retrieval (2016)
Published: 2016

43. A Review Paper: Noise Models in Digital Image Processing

Author: Boyat, Ajay Kumar and Joshi, Brijendra Kumar
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Noise is always presents in digital images during image acquisition, coding, transmission, and processing steps. Noise is very difficult to remove it from the digital images without the prior knowledge of noise model. That is why, review of noise models are essential in the study of image denoising techniques. In this paper, we express a brief overview of various noise models. These noise models can be selected by analysis of their origin. In this way, we present a complete and quantitative analysis of noise models available in digital images.
Published: 2015

44. Spiking-Fer: Spiking Neural Network for Facial Expression Recognition With Event Cameras: Facial Expression Recognition (FER) is an active research domain that has shown great progress recently, notably thanks to the use of large deep learning models. However, such approaches are particularly energy intensive, which makes their deployment difficult for edge devices. To address this issue, Spiking Neural Networks (SNNs) coupled with event cameras are a promising alternative, capable of processing sparse and asynchronous events with lower energy consumption. In this paper, we establish the first use of event cameras for FER, named 'Event-based FER', and propose the first related benchmarks by converting popular video FER datasets to event streams. To deal with this new task, we propose 'Spiking-FER', a deep convolutional SNN model, and compare it against a similar Artificial Neural Network (ANN). Experiments show that the proposed approach achieves comparable performance to the ANN architecture, while consuming less energy by orders of magnitude (up to 65.39x). In addition, an experimental study of various event-based data augmentation techniques is performed to provide insights into the efficient transformations specific to event-based FER

Author: Barchid, Sami, Allaert, Benjamin, Aissaoui, Amel, Mennesson, José, Djéraba, Chaabane, Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Institut de Recherche sur les Composants logiciels et matériels pour l'Information et la Communication Avancé - USR 3380 (IRCICA), Université de Lille-Centre National de la Recherche Scientifique (CNRS), Centre for Digital Systems (CERI SN - IMT Nord Europe), Ecole nationale supérieure Mines-Télécom Lille Douai (IMT Nord Europe), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT), and Institut Mines-Télécom [Paris] (IMT)
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV]
Abstract: International audience; Facial Expression Recognition (FER) is an active research domain that has shown great progress recently, notably thanks to the use of large deep learning models. However, such approaches are particularly energy intensive, which makes their deployment difficult for edge devices. To address this issue, Spiking Neural Networks (SNNs) coupled with event cameras are a promising alternative, capable of processing sparse and asynchronous events with lower energy consumption. In this paper, we establish the first use of event cameras for FER, named "Event-based FER", and propose the first related benchmarks by converting popular video FER datasets to event streams. To deal with this new task, we propose "Spiking-FER", a deep convolutional SNN model, and compare it against a similar Artificial Neural Network (ANN). Experiments show that the proposed approach achieves comparable performance to the ANN architecture, while consuming less energy by orders of magnitude (up to 65.39x). In addition, an experimental study of various event-based data augmentation techniques is performed to provide insights into the efficient transformations specific to event-based FER.
Published: 2023

45. FPSRS: A Fusion Approach for Paper Submission Recommendation System

Author: Son T. Huynh, Nhi Dang, Dac H. Nguyen, Phong T. Huynh, and Binh T. Nguyen
Subjects: FOS: Computer and information sciences, Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Information Retrieval (cs.IR), Computer Science - Information Retrieval
Abstract: Recommender systems have been increasingly popular in entertainment and consumption and are evident in academics, especially for applications that suggest submitting scientific articles to scientists. However, because of the various acceptance rates, impact factors, and rankings in different publishers, searching for a proper venue or journal to submit a scientific work usually takes a lot of time and effort. In this paper, we aim to present two newer approaches extended from our paper [13] presented at the conference IAE/AIE 2021 by employing RNN structures besides using Conv1D. In addition, we also introduce a new method, namely DistilBertAims, using DistillBert for two cases of uppercase and lower-case words to vectorize features such as Title, Abstract, and Keywords, and then use Conv1d to perform feature extraction. Furthermore, we propose a new calculation method for similarity score for Aim & Scope with other features; this helps keep the weights of similarity score calculation continuously updated and then continue to fit more data. The experimental results show that the second approach could obtain a better performance, which is 62.46% and 12.44% higher than the best of the previous study [13] in terms of the Top 1 accuracy., 24 pages, 10 figures, 8 tables
Published: 2022

46. Image Matching Across Wide Baselines: From Paper to Practice

Author: Yuhe Jin, Kwang Moo Yi, Pascal Fua, Eduard Trulls, Jiri Matas, Dmytro Mishkin, and Anastasiia Mishchuk
Subjects: FOS: Computer and information sciences, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, computer.software_genre, benchmark, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, dataset, Structure from motion, local features, 3d reconstruction, structure from motion, stereo, Benchmarking, Pipeline (software), Pattern recognition (psychology), Metric (mathematics), Benchmark (computing), Embedding, 020201 artificial intelligence & image processing, Computer Vision and Pattern Recognition, Data mining, Heuristics, computer, performance, Software
Abstract: We introduce a comprehensive benchmark for local features and robust estimation algorithms, focusing on the downstream task -- the accuracy of the reconstructed camera pose -- as our primary metric. Our pipeline's modular structure allows easy integration, configuration, and combination of different methods and heuristics. This is demonstrated by embedding dozens of popular algorithms and evaluating them, from seminal works to the cutting edge of machine learning research. We show that with proper settings, classical solutions may still outperform the perceived state of the art. Besides establishing the actual state of the art, the conducted experiments reveal unexpected properties of Structure from Motion (SfM) pipelines that can help improve their performance, for both algorithmic and learned methods. Data and code are online https://github.com/vcg-uvic/image-matching-benchmark, providing an easy-to-use and flexible framework for the benchmarking of local features and robust estimation methods, both alongside and against top-performing methods. This work provides a basis for the Image Matching Challenge https://vision.uvic.ca/image-matching-challenge., Comment: Added: KeyNet-SOSNet, AffNet-HardNet, TFeat, MKD from kornia
Published: 2020

47. 3D Convolution Neural Network based Person Identification using Gait cycles

Author: Tiwari, Ravi Shekhar, P, Supraja, and Tom, Rijo Jackson
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, This paper tells us how human can be identified by their Gait cycle using any simple camera
Abstract: Human identification plays a prominent role in terms of security. In modern times security is becoming the key term for an individual or a country, especially for countries which are facing internal or external threats. Gait analysis is interpreted as the systematic study of the locomotive in humans. It can be used to extract the exact walking features of individuals. Walking features depends on biological as well as the physical feature of the object; hence, it is unique to every individual. In this work, gait features are used to identify an individual. The steps involve object detection, background subtraction, silhouettes extraction, skeletonization, and training 3D Convolution Neural Network on these gait features. The model is trained and evaluated on the dataset acquired by CASIA B Gait, which consists of 15000 videos of 124 subjects walking pattern captured from 11 different angles carrying objects such as bag and coat. The proposed method focuses more on the lower body part to extract features such as the angle between knee and thighs, hip angle, angle of contact, and many other features. The experimental results are compared with amongst accuracies of silhouettes as datasets for training and skeletonized image as training data. The results show that extracting the information from skeletonized data yields improved accuracy.
Published: 2021

48. Automatic test suite generation for key-points detection DNNs using many-objective search (experience paper)

Author: Donghwan Shin, Jun Wang, Fitash Ul Haq, Lionel C. Briand, and Thomas Stifter
Subjects: FOS: Computer and information sciences, Test data generation, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Key-point detection, Automotive industry, 02 engineering and technology, Machine learning, computer.software_genre, Image (mathematics), Computer Science - Software Engineering, Random search, Search algorithm, 0202 electrical engineering, electronic engineering, information engineering, Test suite, Computer science [C05] [Engineering, computing & technology], business.industry, deep neural network, software testing, 020207 software engineering, Sciences informatiques [C05] [Ingénierie, informatique & technologie], Software Engineering (cs.SE), Key (cryptography), many-objective search algorithm, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Test data
Abstract: Automatically detecting the positions of key-points (e.g., facial key-points or finger key-points) in an image is an essential problem in many applications, such as driver's gaze detection and drowsiness detection in automated driving systems. With the recent advances of Deep Neural Networks (DNNs), Key-Points detection DNNs (KP-DNNs) have been increasingly employed for that purpose. Nevertheless, KP-DNN testing and validation have remained a challenging problem because KP-DNNs predict many independent key-points at the same time -- where each individual key-point may be critical in the targeted application -- and images can vary a great deal according to many factors. In this paper, we present an approach to automatically generate test data for KP-DNNs using many-objective search. In our experiments, focused on facial key-points detection DNNs developed for an industrial automotive application, we show that our approach can generate test suites to severely mispredict, on average, more than 93% of all key-points. In comparison, random search-based test data generation can only severely mispredict 41% of them. Many of these mispredictions, however, are not avoidable and should not therefore be considered failures. We also empirically compare state-of-the-art, many-objective search algorithms and their variants, tailored for test suite generation. Furthermore, we investigate and demonstrate how to learn specific conditions, based on image characteristics (e.g., head posture and skin color), that lead to severe mispredictions. Such conditions serve as a basis for risk analysis or DNN retraining., to appear in ISSTA 2021
Published: 2021

49. Quantum-Classical Hybrid Machine Learning for Image Classification (ICCAD Special Session Paper)

Author: Mahabubul Alam, Satwik Kundu, Rasit Onur Topaloglu, and Swaroop Ghosh
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science::Computer Vision and Pattern Recognition, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology, Machine Learning (cs.LG), 020202 computer hardware & architecture
Abstract: Image classification is a major application domain for conventional deep learning (DL). Quantum machine learning (QML) has the potential to revolutionize image classification. In any typical DL-based image classification, we use convolutional neural network (CNN) to extract features from the image and multi-layer perceptron network (MLP) to create the actual decision boundaries. On one hand, QML models can be useful in both of these tasks. Convolution with parameterized quantum circuits (Quanvolution) can extract rich features from the images. On the other hand, quantum neural network (QNN) models can create complex decision boundaries. Therefore, Quanvolution and QNN can be used to create an end-to-end QML model for image classification. Alternatively, we can extract image features separately using classical dimension reduction techniques such as, Principal Components Analysis (PCA) or Convolutional Autoencoder (CAE) and use the extracted features to train a QNN. We review two proposals on quantum-classical hybrid ML models for image classification namely, Quanvolutional Neural Network and dimension reduction using a classical algorithm followed by QNN. Particularly, we make a case for trainable filters in Quanvolution and CAE-based feature extraction for image datasets (instead of dimension reduction using linear transformations such as, PCA). We discuss various design choices, potential opportunities, and drawbacks of these models. We also release a Python-based framework to create and explore these hybrid models with a variety of design choices.
Published: 2021

50. A new baseline for retinal vessel segmentation: Numerical identification and correction of methodological inconsistencies affecting 100+ papers

Author: Attila Fazekas and György Kovács
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Health Informatics, computer.software_genre, Machine Learning (cs.LG), Medical imaging, Image Processing, Computer-Assisted, FOS: Electrical engineering, electronic engineering, information engineering, Humans, Radiology, Nuclear Medicine and imaging, Segmentation, Radiological and Ultrasound Technology, Image and Video Processing (eess.IV), Contrast (statistics), Retinal Vessels, Benchmarking, Electrical Engineering and Systems Science - Image and Video Processing, Computer Graphics and Computer-Aided Design, Data set, Identification (information), Ranking, Test set, Computer Vision and Pattern Recognition, Data mining, computer, Algorithms
Abstract: In the last 15 years, the segmentation of vessels in retinal images has become an intensively researched problem in medical imaging, with hundreds of algorithms published. One of the de facto benchmarking data sets of vessel segmentation techniques is the DRIVE data set. Since DRIVE contains a predefined split of training and test images, the published performance results of the various segmentation techniques should provide a reliable ranking of the algorithms. Including more than 100 papers in the study, we performed a detailed numerical analysis of the coherence of the published performance scores. We found inconsistencies in the reported scores related to the use of the field of view (FoV), which has a significant impact on the performance scores. We attempted to eliminate the biases using numerical techniques to provide a more realistic picture of the state of the art. Based on the results, we have formulated several findings, most notably: despite the well-defined test set of DRIVE, most rankings in published papers are based on non-comparable figures; in contrast to the near-perfect accuracy scores reported in the literature, the highest accuracy score achieved to date is 0.9582 in the FoV region, which is 1% higher than that of human annotators. The methods we have developed for identifying and eliminating the evaluation biases can be easily applied to other domains where similar problems may arise.
Published: 2021
Full Text: View/download PDF

51. Conference paper

Author: Shangzhe Wu, Andrea Vedaldi, and Christian Rupprecht
Subjects: FOS: Computer and information sciences, Symmetric structure, Exploit, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Iterative reconstruction, 010501 environmental sciences, 01 natural sciences, Facial recognition system, Image (mathematics), Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Computer vision, Pose, 0105 earth and related environmental sciences, business.industry, Applied Mathematics, 020207 software engineering, Object (computer science), Autoencoder, Computational Theory and Mathematics, Unsupervised learning, 020201 artificial intelligence & image processing, Computer Vision and Pattern Recognition, Artificial intelligence, Symmetry (geometry), business, Software
Abstract: We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to disentangle these components without supervision, we use the fact that many object categories have, at least in principle, a symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this method can recover very accurately the 3D shape of human faces, cat faces and cars from single-view images, without any supervision or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the level of 2D image correspondences., CVPR 2020 Oral. Project page: https://elliottwu.com/projects/unsup3d/
Published: 2020

52. Conference paper

Author: Qizhu Li, Philip H. S. Torr, and Xiaojuan Qi
Subjects: FOS: Computer and information sciences, Pixel, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), 05 social sciences, Feature extraction, Computer Science - Computer Vision and Pattern Recognition, Inference, Image segmentation, 010501 environmental sciences, Object (computer science), 01 natural sciences, Pipeline (software), Object detection, 0502 economics and business, Segmentation, Artificial intelligence, 050207 economics, business, 0105 earth and related environmental sciences
Abstract: We present an end-to-end network to bridge the gap between training and inference pipeline for panoptic segmentation, a task that seeks to partition an image into semantic regions for "stuff" and object instances for "things". In contrast to recent works, our network exploits a parametrised, yet lightweight panoptic segmentation submodule, powered by an end-to-end learnt dense instance affinity, to capture the probability that any pair of pixels belong to the same instance. This panoptic submodule gives rise to a novel propagation mechanism for panoptic logits and enables the network to output a coherent panoptic segmentation map for both "stuff" and "thing" classes, without any post-processing. Reaping the benefits of end-to-end training, our full system sets new records on the popular street scene dataset, Cityscapes, achieving 61.4 PQ with a ResNet-50 backbone using only the fine annotations. On the challenging COCO dataset, our ResNet-50-based network also delivers state-of-the-art accuracy of 43.4 PQ. Moreover, our network flexibly works with and without object mask cues, performing competitively under both settings, which is of interest for applications with computation budgets., CVPR 2020
Published: 2020

53. Conference paper

Author: Bastian Leibe, Philip H. S. Torr, Paul Voigtlaender, and Jonathon Luiten
Subjects: FOS: Computer and information sciences, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), 010401 analytical chemistry, Feature extraction, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, 01 natural sciences, Object detection, 0104 chemical sciences, Robustness (computer science), Video tracking, 0202 electrical engineering, electronic engineering, information engineering, Eye tracking, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: We present Siam R-CNN, a Siamese re-detection architecture which unleashes the full power of two-stage object detection approaches for visual object tracking. We combine this with a novel tracklet-based dynamic programming algorithm, which takes advantage of re-detections of both the first-frame template and previous-frame predictions, to model the full history of both the object to be tracked and potential distractor objects. This enables our approach to make better tracking decisions, as well as to re-detect tracked objects after long occlusion. Finally, we propose a novel hard example mining strategy to improve Siam R-CNN's robustness to similar looking objects. Siam R-CNN achieves the current best performance on ten tracking benchmarks, with especially strong results for long-term tracking. We make our code and models available at www.vision.rwth-aachen.de/page/siamrcnn., Comment: CVPR 2020 camera-ready version
Published: 2020
Full Text: View/download PDF

54. Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper)

Author: Soujanya Poria, Verónica Pérez-Rosas, Roger Zimmermann, Devamanyu Hazarika, Rada Mihalcea, and Santiago Castro
Subjects: FOS: Computer and information sciences, Computer science, media_common.quotation_subject, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Face (sociological concept), Context (language use), 02 engineering and technology, computer.software_genre, 050105 experimental psychology, 0202 electrical engineering, electronic engineering, information engineering, 0501 psychology and cognitive sciences, media_common, Computer Science - Computation and Language, Sarcasm, business.industry, 05 social sciences, Tone (literature), 020201 artificial intelligence & image processing, Artificial intelligence, Syllable, business, computer, Computation and Language (cs.CL), Natural language processing, Utterance
Abstract: Sarcasm is often expressed through several verbal and non-verbal cues, e.g., a change of tone, overemphasis in a word, a drawn-out syllable, or a straight looking face. Most of the recent work in sarcasm detection has been carried out on textual data. In this paper, we argue that incorporating multimodal cues can improve the automatic classification of sarcasm. As a first step towards enabling the development of multimodal approaches for sarcasm detection, we propose a new sarcasm dataset, Multimodal Sarcasm Detection Dataset (MUStARD), compiled from popular TV shows. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context of historical utterances in the dialogue, which provides additional information on the scenario where the utterance occurs. Our initial results show that the use of multimodal information can reduce the relative error rate of sarcasm detection by up to 12.9% in F-score when compared to the use of individual modalities. The full dataset is publicly available for use at https://github.com/soujanyaporia/MUStARD, Comment: Accepted at ACL 2019
Published: 2019
Full Text: View/download PDF

55. Automatic Paper Summary Generation from Visual and Textual Information

Author: Yoshihiro Fukuhara, Shigeo Morishima, Hirokatsu Kataoka, Ryota Suzuki, and Shintaro Yamamoto
Subjects: FOS: Computer and information sciences, Computer Science - Artificial Intelligence, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, computer.software_genre, Task (project management), Extractor, 03 medical and health sciences, 0302 clinical medicine, 0202 electrical engineering, electronic engineering, information engineering, Effective method, Computer Science - Computation and Language, business.industry, Document analysis, Textual information, Artificial Intelligence (cs.AI), Pattern recognition (psychology), 030221 ophthalmology & optometry, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Computation and Language (cs.CL), Natural language processing, Sentence
Abstract: Due to the recent boom in artificial intelligence (AI) research, including computer vision (CV), it has become impossible for researchers in these fields to keep up with the exponentially increasing number of manuscripts. In response to this situation, this paper proposes the paper summary generation (PSG) task using a simple but effective method to automatically generate an academic paper summary from raw PDF data. We realized PSG by combination of vision-based supervised components detector and language-based unsupervised important sentence extractor, which is applicable for a trained format of manuscripts. We show the quantitative evaluation of ability of simple vision-based components extraction, and the qualitative evaluation that our system can extract both visual item and sentence that are helpful for understanding. After processing via our PSG, the 979 manuscripts accepted by the Conference on Computer Vision and Pattern Recognition (CVPR) 2018 are available. It is believed that the proposed method will provide a better way for researchers to stay caught with important academic papers., Comment: International Conference on Machine Vision 2018, Munich, Germany
Published: 2018
Full Text: View/download PDF

56. Yet Another ADNI Machine Learning Paper? Paving The Way Towards Fully-reproducible Research on Classification of Alzheimer's Disease

Author: Jorge Samper-Gonzalez, Ninon Burgos, Sabrina Fontanella, Hugo Bertin, Marie-Odile Habert, Stanley Durrleman, Theodoros Evgeniou, Olivier Colliot, Algorithms, models and methods for images and signals of the human brain (ARAMIS), Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut du Cerveau et de la Moëlle Epinière = Brain and Spine Institute (ICM), Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut National de la Santé et de la Recherche Médicale (INSERM)-CHU Pitié-Salpêtrière [AP-HP], Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut National de la Santé et de la Recherche Médicale (INSERM)-CHU Pitié-Salpêtrière [AP-HP], Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Laboratoire d'Imagerie Biomédicale [Paris] (LIB), Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS), Institut Européen d'administration des Affaires (INSEAD), Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS)-CHU Pitié-Salpêtrière [AP-HP], Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre National de la Recherche Scientifique (CNRS)-CHU Pitié-Salpêtrière [AP-HP], Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Inria de Paris, Laboratoire d'Imagerie Biomédicale (LIB), Centre National de la Recherche Scientifique (CNRS)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Université Pierre et Marie Curie - Paris 6 (UPMC), Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut National de la Santé et de la Recherche Médicale (INSERM)-CHU Pitié-Salpêtrière [AP-HP], Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut National de la Santé et de la Recherche Médicale (INSERM)-CHU Pitié-Salpêtrière [AP-HP], and Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Sorbonne Université (SU)-Assistance publique - Hôpitaux de Paris (AP-HP) (AP-HP)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris
Subjects: FOS: Computer and information sciences, Statistics - Machine Learning, Quantitative Biology - Neurons and Cognition, Computer Vision and Pattern Recognition (cs.CV), FOS: Biological sciences, Computer Science - Computer Vision and Pattern Recognition, [INFO.INFO-IM]Computer Science [cs]/Medical Imaging, Machine Learning (stat.ML), Neurons and Cognition (q-bio.NC), [INFO]Computer Science [cs], Quantitative Biology - Quantitative Methods, Quantitative Methods (q-bio.QM)
Abstract: International audience; In recent years, the number of papers on Alzheimer's disease classification has increased dramatically, generating interesting methodological ideas on the use machine learning and feature extraction methods. However, practical impact is much more limited and, eventually, one could not tell which of these approaches are the most efficient. While over 90% of these works make use of ADNI an objective comparison between approaches is impossible due to variations in the subjects included, image pre-processing, performance metrics and cross-validation procedures. In this paper, we propose a framework for reproducible classification experiments using multimodal MRI and PET data from ADNI. The core components are: 1) code to automatically convert the full ADNI database into BIDS format; 2) a modular architecture based on Nipype in order to easily plug-in different classification and feature extraction tools; 3) feature extraction pipelines for MRI and PET data; 4) baseline classification approaches for unimodal and multimodal features. This provides a flexible framework for benchmarking different feature extraction and classification tools in a reproducible manner. Data management tools for obtaining the lists of subjects in AD, MCI converter, MCI non-converters, CN classes are also provided. We demonstrate its use on all (1519) baseline T1 MR images and all (1102) baseline FDG PET images from ADNI 1, GO and 2 with SPM-based feature extraction pipelines and three different classification techniques (linear SVM, anatomically regularized SVM and multiple kernel learning SVM). The highest accuracies achieved were: 91% for AD vs CN, 83% for MCIc vs CN, 75% for MCIc vs MCInc, 94% for AD-ABeta+ vs CN-ABeta- and 72% for MCIc-ABeta+ vs MCInc-ABeta+. The code is publicly available at : https://github.com/aramis-lab/AD-ML.
Published: 2017

57. Learning to Generate Posters of Scientific Papers by Probabilistic Graphical Models

Author: Yanwen Guo, Leonid Sigal, Zhi-Hua Zhou, Yuting Qiang, Yanwei Fu, and Xiao Yu
Subjects: FOS: Computer and information sciences, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction, Inference, 02 engineering and technology, Machine learning, computer.software_genre, Bridge (nautical), Human-Computer Interaction (cs.HC), Theoretical Computer Science, Task (project management), Computer Science - Graphics, Benchmark (surveying), 0202 electrical engineering, electronic engineering, information engineering, Graphical model, Composition (language), Information retrieval, business.industry, 020207 software engineering, Graphics (cs.GR), Multimedia (cs.MM), Computer Science Applications, Computational Theory and Mathematics, Hardware and Architecture, Theory of computation, Key (cryptography), Artificial intelligence, business, computer, Software, Computer Science - Multimedia
Abstract: Researchers often summarize their work in the form of scientific posters. Posters provide a coherent and efficient way to convey core ideas expressed in scientific papers. Generating a good scientific poster, however, is a complex and time consuming cognitive task, since such posters need to be readable, informative, and visually aesthetic. In this paper, for the first time, we study the challenging problem of learning to generate posters from scientific papers. To this end, a data-driven framework, that utilizes graphical models, is proposed. Specifically, given content to display, the key elements of a good poster, including attributes of each panel and arrangements of graphical elements are learned and inferred from data. During the inference stage, an MAP inference framework is employed to incorporate some design principles. In order to bridge the gap between panel attributes and the composition within each panel, we also propose a recursive page splitting algorithm to generate the panel layout for a poster. To learn and validate our model, we collect and release a new benchmark dataset, called NJU-Fudan Paper-Poster dataset, which consists of scientific papers and corresponding posters with exhaustively labelled panels and attributes. Qualitative and quantitative results indicate the effectiveness of our approach., 10 pages, submission to IEEE TPAMI. arXiv admin note: text overlap with arXiv:1604.01219
Published: 2017

58. VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap

Author: Ghosh, Sreyan, Evuru, Chandra Kiran Reddy, Kumar, Sonal, Tyagi, Utkarsh, Nieto, Oriol, Jin, Zeyu, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Recent interest in Large Vision-Language Models (LVLMs) for practical applications is moderated by the significant challenge of hallucination or the inconsistency between the factual information and the generated text. In this paper, we first perform an in-depth analysis of hallucinations and discover several novel insights about how and when LVLMs hallucinate. From our analysis, we show that: (1) The community's efforts have been primarily targeted towards reducing hallucinations related to visual recognition (VR) prompts (e.g., prompts that only require describing the image), thereby ignoring hallucinations for cognitive prompts (e.g., prompts that require additional skills like reasoning on contents of the image). (2) LVLMs lack visual perception, i.e., they can see but not necessarily understand or perceive the input image. We analyze responses to cognitive prompts and show that LVLMs hallucinate due to a perception gap: although LVLMs accurately recognize visual elements in the input image and possess sufficient cognitive skills, they struggle to respond accurately and hallucinate. To overcome this shortcoming, we propose Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method for alleviating hallucinations. Specifically, we first describe the image and add it as a prefix to the instruction. Next, during auto-regressive decoding, we sample from the plausible candidates according to their KL-Divergence (KLD) to the description, where lower KLD is given higher preference. Experimental results on several benchmarks and LVLMs show that VDGD improves significantly over other baselines in reducing hallucinations. We also propose VaLLu, a benchmark for the comprehensive evaluation of the cognitive capabilities of LVLMs., Comment: Preprint. Under review. Code will be released on paper acceptance
Published: 2024

59. Erase to Enhance: Data-Efficient Machine Unlearning in MRI Reconstruction

Author: Xue, Yuyang, Liu, Jingshuai, McDonagh, Steven, and Tsaftaris, Sotirios A.
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Machine unlearning is a promising paradigm for removing unwanted data samples from a trained model, towards ensuring compliance with privacy regulations and limiting harmful biases. Although unlearning has been shown in, e.g., classification and recommendation systems, its potential in medical image-to-image translation, specifically in image recon-struction, has not been thoroughly investigated. This paper shows that machine unlearning is possible in MRI tasks and has the potential to benefit for bias removal. We set up a protocol to study how much shared knowledge exists between datasets of different organs, allowing us to effectively quantify the effect of unlearning. Our study reveals that combining training data can lead to hallucinations and reduced image quality in the reconstructed data. We use unlearning to remove hallucinations as a proxy exemplar of undesired data removal. Indeed, we show that machine unlearning is possible without full retraining. Furthermore, our observations indicate that maintaining high performance is feasible even when using only a subset of retain data. We have made our code publicly accessible., Comment: The paper is accpeted by MIDL 2024
Published: 2024

60. V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM

Author: Rahman, Abdur, Chawla, Rajat, Kumar, Muskaan, Datta, Arkajit, Jha, Adarsh, NS, Mukunda, and Bhola, Ishaan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: In the rapidly evolving landscape of AI research and application, Multimodal Large Language Models (MLLMs) have emerged as a transformative force, adept at interpreting and integrating information from diverse modalities such as text, images, and Graphical User Interfaces (GUIs). Despite these advancements, the nuanced interaction and understanding of GUIs pose a significant challenge, limiting the potential of existing models to enhance automation levels. To bridge this gap, this paper presents V-Zen, an innovative Multimodal Large Language Model (MLLM) meticulously crafted to revolutionise the domain of GUI understanding and grounding. Equipped with dual-resolution image encoders, V-Zen establishes new benchmarks in efficient grounding and next-action prediction, thereby laying the groundwork for self-operating computer systems. Complementing V-Zen is the GUIDE dataset, an extensive collection of real-world GUI elements and task-based sequences, serving as a catalyst for specialised fine-tuning. The successful integration of V-Zen and GUIDE marks the dawn of a new era in multimodal AI research, opening the door to intelligent, autonomous computing experiences. This paper extends an invitation to the research community to join this exciting journey, shaping the future of GUI automation. In the spirit of open science, our code, data, and model will be made publicly available, paving the way for multimodal dialogue scenarios with intricate and precise interactions.
Published: 2024

61. Survey on Visual Signal Coding and Processing with Generative Models: Technologies, Standards and Optimization

Author: Chen, Zhibo, Sun, Heming, Zhang, Li, and Zhang, Fan
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper provides a survey of the latest developments in visual signal coding and processing with generative models. Specifically, our focus is on presenting the advancement of generative models and their influence on research in the domain of visual signal coding and processing. This survey study begins with a brief introduction of well-established generative models, including the Variational Autoencoder (VAE) models, Generative Adversarial Network (GAN) models, Autoregressive (AR) models, Normalizing Flows and Diffusion models. The subsequent section of the paper explores the advancements in visual signal coding based on generative models, as well as the ongoing international standardization activities. In the realm of visual signal processing, our focus lies on the application and development of various generative models in the research of visual signal restoration. We also present the latest developments in generative visual signal synthesis and editing, along with visual signal quality assessment using generative models and quality assessment for generative models. The practical implementation of these studies is closely linked to the investigation of fast optimization. This paper additionally presents the latest advancements in fast optimization on visual signal coding and processing with generative models. We hope to advance this field by providing researchers and practitioners a comprehensive literature review on the topic of visual signal coding and processing with generative models.
Published: 2024
Full Text: View/download PDF

62. Tele-Aloha: A Low-budget and High-authenticity Telepresence System Using Sparse RGB Cameras

Author: Tu, Hanzhang, Shao, Ruizhi, Dong, Xue, Zheng, Shunyuan, Zhang, Hao, Chen, Lili, Wang, Meili, Li, Wenyu, Ma, Siyan, Zhang, Shengping, Zhou, Boyao, and Liu, Yebin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we present a low-budget and high-authenticity bidirectional telepresence system, Tele-Aloha, targeting peer-to-peer communication scenarios. Compared to previous systems, Tele-Aloha utilizes only four sparse RGB cameras, one consumer-grade GPU, and one autostereoscopic screen to achieve high-resolution (2048x2048), real-time (30 fps), low-latency (less than 150ms) and robust distant communication. As the core of Tele-Aloha, we propose an efficient novel view synthesis algorithm for upper-body. Firstly, we design a cascaded disparity estimator for obtaining a robust geometry cue. Additionally a neural rasterizer via Gaussian Splatting is introduced to project latent features onto target view and to decode them into a reduced resolution. Further, given the high-quality captured data, we leverage weighted blending mechanism to refine the decoded image into the final resolution of 2K. Exploiting world-leading autostereoscopic display and low-latency iris tracking, users are able to experience a strong three-dimensional sense even without any wearable head-mounted display device. Altogether, our telepresence system demonstrates the sense of co-presence in real-life experiments, inspiring the next generation of communication., Comment: Paper accepted by SIGGRAPH 2024. Project page: http://118.178.32.38/c/Tele-Aloha/
Published: 2024

63. Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation

Author: Kienzle, Daniel, Kantonis, Marco, Schön, Robin, and Lienhart, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Utilizing transformer architectures for semantic segmentation of high-resolution images is hindered by the attention's quadratic computational complexity in the number of tokens. A solution to this challenge involves decreasing the number of tokens through token merging, which has exhibited remarkable enhancements in inference speed, training efficiency, and memory utilization for image classification tasks. In this paper, we explore various token merging strategies within the framework of the Segformer architecture and perform experiments on multiple semantic segmentation and human pose estimation datasets. Notably, without model re-training, we, for example, achieve an inference acceleration of 61% on the Cityscapes dataset while maintaining the mIoU performance. Consequently, this paper facilitates the deployment of transformer-based architectures on resource-constrained devices and in real-time applications., Comment: 7 pages, to be published in IEEE International Conference on Multimedia Information Processing and Retrieval (MIPR) 2024
Published: 2024

64. Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

Author: Kothandaraman, Divya, Sohn, Kihyuk, Villegas, Ruben, Voigtlaender, Paul, Manocha, Dinesh, and Babaeizadeh, Mohammad
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions, sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, action and background) such as a teddy bear running towards a brown teapot, a dog playing violin and a teddy bear swimming in the ocean. We quantitatively evaluate our method using videoCLIP and DINO scores, in addition to human evaluation. Videos for results presented in this paper can be found at https://github.com/divyakraman/MultiConceptVideo2024., Comment: Paper accepted to AI4CC Workshop at CVPR 2024
Published: 2024

65. A Survey of Artificial Intelligence in Gait-Based Neurodegenerative Disease Diagnosis

Author: Rao, Haocong, Zeng, Minlin, Zhao, Xuejiao, and Miao, Chunyan
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent years have witnessed an increasing global population affected by neurodegenerative diseases (NDs), which traditionally require extensive healthcare resources and human effort for medical diagnosis and monitoring. As a crucial disease-related motor symptom, human gait can be exploited to characterize different NDs. The current advances in artificial intelligence (AI) models enable automatic gait analysis for NDs identification and classification, opening a new avenue to facilitate faster and more cost-effective diagnosis of NDs. In this paper, we provide a comprehensive survey on recent progress of machine learning and deep learning based AI techniques applied to diagnosis of five typical NDs through gait. We provide an overview of the process of AI-assisted NDs diagnosis, and present a systematic taxonomy of existing gait data and AI models. Through an extensive review and analysis of 164 studies, we identify and discuss the challenges, potential solutions, and future directions in this field. Finally, we envision the prospective utilization of 3D skeleton data for human gait representation and the development of more efficient AI models for NDs diagnosis. We provide a public resource repository to track and facilitate developments in this emerging field: https://github.com/Kali-Hac/AI4NDD-Survey., Comment: 35 pages, 9 figures, 5 tables, citing 272 papers, under review at ACM Computing Survey (CSUR) journal. A up-to-date resource (papers, data, etc.) of this survey (AI4NDD) is provided at https://github.com/Kali-Hac/AI4NDD-Survey
Published: 2024

66. PoseGravity: Pose Estimation from Points and Lines with Axis Prior

Author: Chandrasekhar, Akshay
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents a new algorithm to estimate absolute camera pose given an axis of the camera's rotation matrix. Current algorithms solve the problem via algebraic solutions on limited input domains. This paper shows that the problem can be solved efficiently by finding the intersection points of a hyperbola and the unit circle. The solution can flexibly accommodate combinations of point and line features in minimal and overconstrained configurations. In addition, the two special cases of planar and minimal configurations are identified to yield simpler closed-form solutions. Extensive experiments validate the approach., Comment: 10 pages
Published: 2024

67. Enhancing Active Learning for Sentinel 2 Imagery through Contrastive Learning and Uncertainty Estimation

Author: Pogorzelski, David and Arlinghaus, Peter
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: In this paper, we introduce a novel method designed to enhance label efficiency in satellite imagery analysis by integrating semi-supervised learning (SSL) with active learning strategies. Our approach utilizes contrastive learning together with uncertainty estimations via Monte Carlo Dropout (MC Dropout), with a particular focus on Sentinel-2 imagery analyzed using the Eurosat dataset. We explore the effectiveness of our method in scenarios featuring both balanced and unbalanced class distributions. Our results show that for unbalanced classes, our method is superior to the random approach, enabling significant savings in labeling effort while maintaining high classification accuracy. These findings highlight the potential of our approach to facilitate scalable and cost-effective satellite image analysis, particularly advantageous for extensive environmental monitoring and land use classification tasks. Note on preliminary results: This paper presents a new method for active learning and includes results from an initial experiment comparing random selection with our proposed method. We acknowledge that these results are preliminary. We are currently conducting further experiments and will update this paper with additional findings, including comparisons with other methods, in the coming weeks.
Published: 2024

68. Rethinking Overlooked Aspects in Vision-Language Models

Author: Liu, Yuan, Tian, Le, Zhou, Xiao, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in large vision-language models (LVLMs), such as GPT4-V and LLaVA, have been substantial. LLaVA's modular architecture, in particular, offers a blend of simplicity and efficiency. Recent works mainly focus on introducing more pre-training and instruction tuning data to improve model's performance. This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets. Our research indicates that merely increasing the size of pre-training data does not guarantee improved performance and may, in fact, lead to its degradation. Furthermore, we have established a pipeline to pinpoint the most efficient instruction tuning (SFT) dataset, implying that not all SFT data utilized in existing studies are necessary. The primary objective of this paper is not to introduce a state-of-the-art model, but rather to serve as a roadmap for future research, aiming to optimize data usage during pre-training and fine-tuning processes to enhance the performance of vision-language models.
Published: 2024

69. Improving the Explain-Any-Concept by Introducing Nonlinearity to the Trainable Surrogate Model

Author: Zaval, Mounes and Ozer, Sedat
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In the evolving field of Explainable AI (XAI), interpreting the decisions of deep neural networks (DNNs) in computer vision tasks is an important process. While pixel-based XAI methods focus on identifying significant pixels, existing concept-based XAI methods use pre-defined or human-annotated concepts. The recently proposed Segment Anything Model (SAM) achieved a significant step forward to prepare automatic concept sets via comprehensive instance segmentation. Building upon this, the Explain Any Concept (EAC) model emerged as a flexible method for explaining DNN decisions. EAC model is based on using a surrogate model which has one trainable linear layer to simulate the target model. In this paper, by introducing an additional nonlinear layer to the original surrogate model, we show that we can improve the performance of the EAC model. We compare our proposed approach to the original EAC model and report improvements obtained on both ImageNet and MS COCO datasets., Comment: This paper is accepted for publication at IEEE SIU conference, 2024
Published: 2024

70. Perturbing the Gradient for Alleviating Meta Overfitting

Author: Gogoi, Manas, Tiwari, Sambhavi, and Verma, Shekhar
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: The reason for Meta Overfitting can be attributed to two factors: Mutual Non-exclusivity and the Lack of diversity, consequent to which a single global function can fit the support set data of all the meta-training tasks and fail to generalize to new unseen tasks. This issue is evidenced by low error rates on the meta-training tasks, but high error rates on new tasks. However, there can be a number of novel solutions to this problem keeping in mind any of the two objectives to be attained, i.e. to increase diversity in the tasks and to reduce the confidence of the model for some of the tasks. In light of the above, this paper proposes a number of solutions to tackle meta-overfitting on few-shot learning settings, such as few-shot sinusoid regression and few shot classification. Our proposed approaches demonstrate improved generalization performance compared to state-of-the-art baselines for learning in a non-mutually exclusive task setting. Overall, this paper aims to provide insights into tackling overfitting in meta-learning and to advance the field towards more robust and generalizable models.
Published: 2024

71. Advancing 6-DoF Instrument Pose Estimation in Variable X-Ray Imaging Geometries

Author: Viviers, Christiaan G. A., Filatova, Lena, Termeer, Maurice, de With, Peter H. N., and van der Sommen, Fons
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Accurate 6-DoF pose estimation of surgical instruments during minimally invasive surgeries can substantially improve treatment strategies and eventual surgical outcome. Existing deep learning methods have achieved accurate results, but they require custom approaches for each object and laborious setup and training environments often stretching to extensive simulations, whilst lacking real-time computation. We propose a general-purpose approach of data acquisition for 6-DoF pose estimation tasks in X-ray systems, a novel and general purpose YOLOv5-6D pose architecture for accurate and fast object pose estimation and a complete method for surgical screw pose estimation under acquisition geometry consideration from a monocular cone-beam X-ray image. The proposed YOLOv5-6D pose model achieves competitive results on public benchmarks whilst being considerably faster at 42 FPS on GPU. In addition, the method generalizes across varying X-ray acquisition geometry and semantic image complexity to enable accurate pose estimation over different domains. Finally, the proposed approach is tested for bone-screw pose estimation for computer-aided guidance during spine surgeries. The model achieves a 92.41% by the 0.1 ADD-S metric, demonstrating a promising approach for enhancing surgical precision and patient outcomes. The code for YOLOv5-6D is publicly available at https://github.com/cviviers/YOLOv5-6D-Pose, Comment: Early author version of paper. Refer to the full paper at https://ieeexplore.ieee.org/document/10478293
Published: 2024
Full Text: View/download PDF

72. Searching Realistic-Looking Adversarial Objects For Autonomous Driving Systems

Author: Sun, Shengxiang and Zhu, Shenzhe
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Numerous studies on adversarial attacks targeting self-driving policies fail to incorporate realistic-looking adversarial objects, limiting real-world applicability. Building upon prior research that facilitated the transition of adversarial objects from simulations to practical applications, this paper discusses a modified gradient-based texture optimization method to discover realistic-looking adversarial objects. While retaining the core architecture and techniques of the prior research, the proposed addition involves an entity termed the 'Judge'. This agent assesses the texture of a rendered object, assigning a probability score reflecting its realism. This score is integrated into the loss function to encourage the NeRF object renderer to concurrently learn realistic and adversarial textures. The paper analyzes four strategies for developing a robust 'Judge': 1) Leveraging cutting-edge vision-language models. 2) Fine-tuning open-sourced vision-language models. 3) Pretraining neurosymbolic systems. 4) Utilizing traditional image processing techniques. Our findings indicate that strategies 1) and 4) yield less reliable outcomes, pointing towards strategies 2) or 3) as more promising directions for future research.
Published: 2024

73. Reproducibility Study of CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification

Author: Shah, Manan and Bhalgat, Yash
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: This report is a reproducibility study of the paper "CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification" (Abdelfattah et al, ICCV 2023). Our report makes the following contributions: (1) We provide a reproducible, well commented and open-sourced code implementation for the entire method specified in the original paper. (2) We try to verify the effectiveness of the novel aggregation strategy which uses the CLIP model to initialize the pseudo labels for the subsequent unsupervised multi-label image classification task. (3) We try to verify the effectiveness of the gradient-alignment training method specified in the original paper, which is used to update the network parameters and pseudo labels. The code can be found at https://github.com/cs-mshah/CDUL, Comment: Reproducibility study
Published: 2024

74. Hierarchical Selective Classification

Author: Goren, Shani, Galil, Ido, and El-Yaniv, Ran
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Deploying deep neural networks for risk-sensitive tasks necessitates an uncertainty estimation mechanism. This paper introduces hierarchical selective classification, extending selective classification to a hierarchical setting. Our approach leverages the inherent structure of class relationships, enabling models to reduce the specificity of their predictions when faced with uncertainty. In this paper, we first formalize hierarchical risk and coverage, and introduce hierarchical risk-coverage curves. Next, we develop algorithms for hierarchical selective classification (which we refer to as "inference rules"), and propose an efficient algorithm that guarantees a target accuracy constraint with high probability. Lastly, we conduct extensive empirical studies on over a thousand ImageNet classifiers, revealing that training regimes such as CLIP, pretraining on ImageNet21k and knowledge distillation boost hierarchical selective performance.
Published: 2024

75. Generative Artificial Intelligence: A Systematic Review and Applications

Author: Sengar, Sandeep Singh, Hasan, Affan Bin, Kumar, Sanjay, and Carroll, Fiona
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: In recent years, the study of artificial intelligence (AI) has undergone a paradigm shift. This has been propelled by the groundbreaking capabilities of generative models both in supervised and unsupervised learning scenarios. Generative AI has shown state-of-the-art performance in solving perplexing real-world conundrums in fields such as image translation, medical diagnostics, textual imagery fusion, natural language processing, and beyond. This paper documents the systematic review and analysis of recent advancements and techniques in Generative AI with a detailed discussion of their applications including application-specific models. Indeed, the major impact that generative AI has made to date, has been in language generation with the development of large language models, in the field of image translation and several other interdisciplinary applications of generative AI. Moreover, the primary contribution of this paper lies in its coherent synthesis of the latest advancements in these areas, seamlessly weaving together contemporary breakthroughs in the field. Particularly, how it shares an exploration of the future trajectory for generative AI. In conclusion, the paper ends with a discussion of Responsible AI principles, and the necessary ethical considerations for the sustainability and growth of these generative models.
Published: 2024

76. Air Signing and Privacy-Preserving Signature Verification for Digital Documents

Author: Sarveswarasarma, P., Sathulakjan, T., Godfrey, V. J. V., and Ambegoda, Thanuja D.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction
Abstract: This paper presents a novel approach to the digital signing of electronic documents through the use of a camera-based interaction system, single-finger tracking for sign recognition, and multi commands executing hand gestures. The proposed solution, referred to as "Air Signature," involves writing the signature in front of the camera, rather than relying on traditional methods such as mouse drawing or physically signing on paper and showing it to a web camera. The goal is to develop a state-of-the-art method for detecting and tracking gestures and objects in real-time. The proposed methods include applying existing gesture recognition and object tracking systems, improving accuracy through smoothing and line drawing, and maintaining continuity during fast finger movements. An evaluation of the fingertip detection, sketching, and overall signing process is performed to assess the effectiveness of the proposed solution. The secondary objective of this research is to develop a model that can effectively recognize the unique signature of a user. This type of signature can be verified by neural cores that analyze the movement, speed, and stroke pixels of the signing in real time. The neural cores use machine learning algorithms to match air signatures to the individual's stored signatures, providing a secure and efficient method of verification. Our proposed System does not require sensors or any hardware other than the camera.
Published: 2024

77. Two-Phase Dynamics of Interactions Explains the Starting Point of a DNN Learning Over-Fitted Features

Author: Zhang, Junpeng, Li, Qing, Lin, Liang, and Zhang, Quanshi
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper investigates the dynamics of a deep neural network (DNN) learning interactions. Previous studies have discovered and mathematically proven that given each input sample, a well-trained DNN usually only encodes a small number of interactions (non-linear relationships) between input variables in the sample. A series of theorems have been derived to prove that we can consider the DNN's inference equivalent to using these interactions as primitive patterns for inference. In this paper, we discover the DNN learns interactions in two phases. The first phase mainly penalizes interactions of medium and high orders, and the second phase mainly learns interactions of gradually increasing orders. We can consider the two-phase phenomenon as the starting point of a DNN learning over-fitted features. Such a phenomenon has been widely shared by DNNs with various architectures trained for different tasks. Therefore, the discovery of the two-phase dynamics provides a detailed mechanism for how a DNN gradually learns different inference patterns (interactions). In particular, we have also verified the claim that high-order interactions have weaker generalization power than low-order interactions. Thus, the discovered two-phase dynamics also explains how the generalization power of a DNN changes during the training process.
Published: 2024

78. When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

Author: Ma, Xianzheng, Bhalgat, Yash, Smart, Brandon, Chen, Shuai, Li, Xinghui, Ding, Jian, Gu, Jindong, Chen, Dave Zhenyu, Peng, Songyou, Bian, Jia-Wang, Torr, Philip H, Pollefeys, Marc, Nießner, Matthias, Reid, Ian D, Chang, Angel X., Laina, Iro, and Prisacariu, Victor Adrian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.
Published: 2024

79. Size-invariance Matters: Rethinking Metrics and Losses for Imbalanced Multi-object Salient Object Detection

Author: Li, Feiran, Xu, Qianqian, Bao, Shilong, Yang, Zhiyong, Cong, Runmin, Cao, Xiaochun, and Huang, Qingming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper explores the size-invariance of evaluation metrics in Salient Object Detection (SOD), especially when multiple targets of diverse sizes co-exist in the same image. We observe that current metrics are size-sensitive, where larger objects are focused, and smaller ones tend to be ignored. We argue that the evaluation should be size-invariant because bias based on size is unjustified without additional semantic information. In pursuit of this, we propose a generic approach that evaluates each salient object separately and then combines the results, effectively alleviating the imbalance. We further develop an optimization framework tailored to this goal, achieving considerable improvements in detecting objects of different sizes. Theoretically, we provide evidence supporting the validity of our new metrics and present the generalization analysis of SOD. Extensive experiments demonstrate the effectiveness of our method. The code is available at https://github.com/Ferry-Li/SI-SOD., Comment: This paper has been accepted by ICML2024
Published: 2024

80. Rethinking Barely-Supervised Segmentation from an Unsupervised Domain Adaptation Perspective

Author: Shen, Zhiqiang, Cao, Peng, Su, Junming, Yang, Jinzhu, and Zaiane, Osmar R.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper investigates an extremely challenging problem, barely-supervised medical image segmentation (BSS), where the training dataset comprises limited labeled data with only single-slice annotations and numerous unlabeled images. Currently, state-of-the-art (SOTA) BSS methods utilize a registration-based paradigm, depending on image registration to propagate single-slice annotations into volumetric pseudo labels for constructing a complete labeled set. However, this paradigm has a critical limitation: the pseudo labels generated by image registration are unreliable and noisy. Motivated by this, we propose a new perspective: training a model using only single-annotated slices as the labeled set without relying on image registration. To this end, we formulate BSS as an unsupervised domain adaptation (UDA) problem. Specifically, we first design a novel noise-free labeled data construction algorithm (NFC) for slice-to-volume labeled data synthesis, which may result in a side effect: domain shifts between the synthesized images and the original images. Then, a frequency and spatial mix-up strategy (FSX) is further introduced to mitigate the domain shifts for UDA. Extensive experiments demonstrate that our method provides a promising alternative for BSS. Remarkably, the proposed method with only one labeled slice achieves an 80.77% dice score on left atrial segmentation, outperforming the SOTA by 61.28%. The code will be released upon the publication of this paper.
Published: 2024

81. Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey

Author: Sahoo, Pranab, Meharia, Prabhash, Ghosh, Akash, Saha, Sriparna, Jain, Vinija, and Chadha, Aman
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: the potential to generate hallucinated outputs, particularly in high-stakes applications. The tendency of foundation models to produce hallucinated content arguably represents the biggest hindrance to their widespread adoption in real-world scenarios, especially in domains where reliability and accuracy are paramount. This survey paper presents a comprehensive overview of recent developments that aim to identify and mitigate the problem of hallucination in FMs, spanning text, image, video, and audio modalities. By synthesizing recent advancements in detecting and mitigating hallucination across various modalities, the paper aims to provide valuable insights for researchers, developers, and practitioners. Essentially, it establishes a clear framework encompassing definition, taxonomy, and detection strategies for addressing hallucination in multimodal foundation models, laying the foundation for future research in this pivotal area.
Published: 2024

82. ReconBoost: Boosting Can Achieve Modality Reconcilement

Author: Hua, Cong, Xu, Qianqian, Bao, Shilong, Yang, Zhiyong, and Huang, Qingming
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: This paper explores a novel multi-modal alternating learning paradigm pursuing a reconciliation between the exploitation of uni-modal features and the exploration of cross-modal interactions. This is motivated by the fact that current paradigms of multi-modal learning tend to explore multi-modal features simultaneously. The resulting gradient prohibits further exploitation of the features in the weak modality, leading to modality competition, where the dominant modality overpowers the learning process. To address this issue, we study the modality-alternating learning paradigm to achieve reconcilement. Specifically, we propose a new method called ReconBoost to update a fixed modality each time. Herein, the learning objective is dynamically adjusted with a reconcilement regularization against competition with the historical models. By choosing a KL-based reconcilement, we show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others and help enhance the overall performance. The major difference with the classic GB is that we only preserve the newest model for each modality to avoid overfitting caused by ensembling strong learners. Furthermore, we propose a memory consolidation scheme and a global rectification scheme to make this strategy more effective. Experiments over six multi-modal benchmarks speak to the efficacy of the method. We release the code at https://github.com/huacong/ReconBoost., Comment: This paper has been accepted by ICML2024
Published: 2024

83. Shape-aware synthesis of pathological lung CT scans using CycleGAN for enhanced semi-supervised lung segmentation

Author: Khiati, Rezkellah Noureddine, Brillet, Pierre-Yves, Justet, Aurélien, Ispa, Radu, and Fetita, Catalin
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This paper addresses the problem of pathological lung segmentation, a significant challenge in medical image analysis, particularly pronounced in cases of peripheral opacities (severe fibrosis and consolidation) because of the textural similarity between lung tissue and surrounding areas. To overcome these challenges, this paper emphasizes the use of CycleGAN for unpaired image-to-image translation, in order to provide an augmentation method able to generate fake pathological images matching an existing ground truth. Although previous studies have employed CycleGAN, they often neglect the challenge of shape deformation, which is crucial for accurate medical image segmentation. Our work introduces an innovative strategy that incorporates additional loss functions. Specifically, it proposes an L1 loss based on the lung surrounding which shape is constrained to remain unchanged at the transition from the healthy to pathological domains. The lung surrounding is derived based on ground truth lung masks available in the healthy domain. Furthermore, preprocessing steps, such as cropping based on ribs/vertebra locations, are applied to refine the input for the CycleGAN, ensuring that the network focus on the lung region. This is essential to avoid extraneous biases, such as the zoom effect bias, which can divert attention from the main task. The method is applied to enhance in semi-supervised manner the lung segmentation process by employing a U-Net model trained with on-the-fly data augmentation incorporating synthetic pathological tissues generated by the CycleGAN model. Preliminary results from this research demonstrate significant qualitative and quantitative improvements, setting a new benchmark in the field of pathological lung segmentation. Our code is available at https://github.com/noureddinekhiati/Semi-supervised-lung-segmentation, Comment: 14 pages, 7 figures
Published: 2024

84. StraightPCF: Straight Point Cloud Filtering

Author: Edirimuni, Dasith de Silva, Lu, Xuequan, Li, Gang, Wei, Lei, Robles-Kelly, Antonio, and Li, Hongdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Point cloud filtering is a fundamental 3D vision task, which aims to remove noise while recovering the underlying clean surfaces. State-of-the-art methods remove noise by moving noisy points along stochastic trajectories to the clean surfaces. These methods often require regularization within the training objective and/or during post-processing, to ensure fidelity. In this paper, we introduce StraightPCF, a new deep learning based method for point cloud filtering. It works by moving noisy points along straight paths, thus reducing discretization errors while ensuring faster convergence to the clean surfaces. We model noisy patches as intermediate states between high noise patch variants and their clean counterparts, and design the VelocityModule to infer a constant flow velocity from the former to the latter. This constant flow leads to straight filtering trajectories. In addition, we introduce a DistanceModule that scales the straight trajectory using an estimated distance scalar to attain convergence near the clean surface. Our network is lightweight and only has $\sim530K$ parameters, being 17% of IterativePFN (a most recent point cloud filtering network). Extensive experiments on both synthetic and real-world data show our method achieves state-of-the-art results. Our method also demonstrates nice distributions of filtered points without the need for regularization. The implementation code can be found at: https://github.com/ddsediri/StraightPCF., Comment: This paper has been accepted to the IEEE/CVF CVPR Conference, 2024
Published: 2024

85. Exploring the Low-Pass Filtering Behavior in Image Super-Resolution

Author: Deng, Haoyu, Xu, Zijing, Duan, Yule, Wu, Xiao, Shu, Wenjie, and Deng, Liang-Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep neural networks for image super-resolution (ISR) have shown significant advantages over traditional approaches like the interpolation. However, they are often criticized as 'black boxes' compared to traditional approaches with solid mathematical foundations. In this paper, we attempt to interpret the behavior of deep neural networks in ISR using theories from the field of signal processing. First, we report an intriguing phenomenon, referred to as `the sinc phenomenon.' It occurs when an impulse input is fed to a neural network. Then, building on this observation, we propose a method named Hybrid Response Analysis (HyRA) to analyze the behavior of neural networks in ISR tasks. Specifically, HyRA decomposes a neural network into a parallel connection of a linear system and a non-linear system and demonstrates that the linear system functions as a low-pass filter while the non-linear system injects high-frequency information. Finally, to quantify the injected high-frequency information, we introduce a metric for image-to-image tasks called Frequency Spectrum Distribution Similarity (FSDS). FSDS reflects the distribution similarity of different frequency components and can capture nuances that traditional metrics may overlook. Code, videos and raw experimental results for this paper can be found in: https://github.com/RisingEntropy/LPFInISR., Comment: Accepted by ICML 2024
Published: 2024

86. Improving Breast Cancer Grade Prediction with Multiparametric MRI Created Using Optimized Synthetic Correlated Diffusion Imaging

Author: Tai, Chi-en Amy and Wong, Alexander
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Breast cancer was diagnosed for over 7.8 million women between 2015 to 2020. Grading plays a vital role in breast cancer treatment planning. However, the current tumor grading method involves extracting tissue from patients, leading to stress, discomfort, and high medical costs. A recent paper leveraging volumetric deep radiomic features from synthetic correlated diffusion imaging (CDI$^s$) for breast cancer grade prediction showed immense promise for noninvasive methods for grading. Motivated by the impact of CDI$^s$ optimization for prostate cancer delineation, this paper examines using optimized CDI$^s$ to improve breast cancer grade prediction. We fuse the optimized CDI$^s$ signal with diffusion-weighted imaging (DWI) to create a multiparametric MRI for each patient. Using a larger patient cohort and training across all the layers of a pretrained MONAI model, we achieve a leave-one-out cross-validation accuracy of 95.79%, over 8% higher compared to that previously reported.
Published: 2024

87. Dehazing Remote Sensing and UAV Imagery: A Review of Deep Learning, Prior-based, and Hybrid Approaches

Author: Lee, Gao Yu, Chen, Jinkuan, Dam, Tanmoy, Ferdaus, Md Meftahul, Poenar, Daniel Puiu, and Duong, Vu N
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: High-quality images are crucial in remote sensing and UAV applications, but atmospheric haze can severely degrade image quality, making image dehazing a critical research area. Since the introduction of deep convolutional neural networks, numerous approaches have been proposed, and even more have emerged with the development of vision transformers and contrastive/few-shot learning. Simultaneously, papers describing dehazing architectures applicable to various Remote Sensing (RS) domains are also being published. This review goes beyond the traditional focus on benchmarked haze datasets, as we also explore the application of dehazing techniques to remote sensing and UAV datasets, providing a comprehensive overview of both deep learning and prior-based approaches in these domains. We identify key challenges, including the lack of large-scale RS datasets and the need for more robust evaluation metrics, and outline potential solutions and future research directions to address them. This review is the first, to our knowledge, to provide comprehensive discussions on both existing and very recent dehazing approaches (as of 2024) on benchmarked and RS datasets, including UAV-based imagery., Comment: Submitted to journal and under review, once the paper is accepted, the copyright will be transferred to the corresponding journal
Published: 2024

88. Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp

Author: Hong, Rachel, Agnew, William, Kohno, Tadayoshi, and Morgenstern, Jamie
Subjects: Computer Science - Computers and Society, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically-chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices., Comment: Content warning: This paper discusses societal stereotypes and sexually-explicit material that may be disturbing, distressing, and/or offensive to the reader
Published: 2024

89. Understanding and Evaluating Human Preferences for AI Generated Images with Instruction Tuning

Author: Wang, Jiarui, Duan, Huiyu, Zhai, Guangtao, and Min, Xiongkuo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Artificial Intelligence Generated Content (AIGC) has grown rapidly in recent years, among which AI-based image generation has gained widespread attention due to its efficient and imaginative image creation ability. However, AI-generated Images (AIGIs) may not satisfy human preferences due to their unique distortions, which highlights the necessity to understand and evaluate human preferences for AIGIs. To this end, in this paper, we first establish a novel Image Quality Assessment (IQA) database for AIGIs, termed AIGCIQA2023+, which provides human visual preference scores and detailed preference explanations from three perspectives including quality, authenticity, and correspondence. Then, based on the constructed AIGCIQA2023+ database, this paper presents a MINT-IQA model to evaluate and explain human preferences for AIGIs from Multi-perspectives with INstruction Tuning. Specifically, the MINT-IQA model first learn and evaluate human preferences for AI-generated Images from multi-perspectives, then via the vision-language instruction tuning strategy, MINT-IQA attains powerful understanding and explanation ability for human visual preference on AIGIs, which can be used for feedback to further improve the assessment capabilities. Extensive experimental results demonstrate that the proposed MINT-IQA model achieves state-of-the-art performance in understanding and evaluating human visual preferences for AIGIs, and the proposed model also achieves competing results on traditional IQA tasks compared with state-of-the-art IQA models. The AIGCIQA2023+ database and MINT-IQA model will be released to facilitate future research.
Published: 2024

90. Replication Study and Benchmarking of Real-Time Object Detection Models

Author: Asselin, Pierre-Luc, Coulombe, Vincent, Guimont-Martin, William, and Larrivée-Hardy, William
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work examines the reproducibility and benchmarking of state-of-the-art real-time object detection models. As object detection models are often used in real-world contexts, such as robotics, where inference time is paramount, simply measuring models' accuracy is not enough to compare them. We thus compare a large variety of object detection models' accuracy and inference speed on multiple graphics cards. In addition to this large benchmarking attempt, we also reproduce the following models from scratch using PyTorch on the MS COCO 2017 dataset: DETR, RTMDet, ViTDet and YOLOv7. More importantly, we propose a unified training and evaluation pipeline, based on MMDetection's features, to better compare models. Our implementation of DETR and ViTDet could not achieve accuracy or speed performances comparable to what is declared in the original papers. On the other hand, reproduced RTMDet and YOLOv7 could match such performances. Studied papers are also found to be generally lacking for reproducibility purposes. As for MMDetection pretrained models, speed performances are severely reduced with limited computing resources (larger, more accurate models even more so). Moreover, results exhibit a strong trade-off between accuracy and speed, prevailed by anchor-free models - notably RTMDet or YOLOx models. The code used is this paper and all the experiments is available in the repository at https://github.com/Don767/segdet_mlcr2024., Comment: Authors are presented in alphabetical order, each having equal contribution to the work. Copyright may be transferred without notice, after which this version may no longer be accessible
Published: 2024

91. Rectified Gaussian kernel multi-view k-means clustering

Author: Sinaga, Kristina P.
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we show two new variants of multi-view k-means (MVKM) algorithms to address multi-view data. The general idea is to outline the distance between $h$-th view data points $x_i^h$ and $h$-th view cluster centers $a_k^h$ in a different manner of centroid-based approach. Unlike other methods, our proposed methods learn the multi-view data by calculating the similarity using Euclidean norm in the space of Gaussian-kernel, namely as multi-view k-means with exponent distance (MVKM-ED). By simultaneously aligning the stabilizer parameter $p$ and kernel coefficients $\beta^h$, the compression of Gaussian-kernel based weighted distance in Euclidean norm reduce the sensitivity of MVKM-ED. To this end, this paper designated as Gaussian-kernel multi-view k-means (GKMVKM) clustering algorithm. Numerical evaluation of five real-world multi-view data demonstrates the robustness and efficiency of our proposed MVKM-ED and GKMVKM approaches., Comment: 13 pages, 1 figure, 7 Tables
Published: 2024

92. A Survey on Backbones for Deep Video Action Recognition

Author: Tang, Zixuan, Zhao, Youjun, Wen, Yuhang, and Liu, Mengyuan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, methods in action recognition have also achieved great advancement. Researchers design and implement the backbones referring to multiple standpoints, which leads to the diversity of methods and encountering new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-Streams networks and their variants, which, specifically in this paper, use RGB video frame and optical flow modality as input; 2) 3D convolutional networks, which make efforts in taking advantage of RGB modality directly while extracting different motion information is no longer necessary; 3) Transformer-based methods, which introduce the model from natural language processing into computer vision and video understanding. We offer objective sights in this review and hopefully provide a reference for future research., Comment: This paper has been accepted by ICME workshop
Published: 2024

93. Towards Robust Physical-world Backdoor Attacks on Lane Detection

Author: Zhang, Xinwei, Liu, Aishan, Zhang, Tianyuan, Liang, Siyuan, and Liu, Xianglong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Deep learning-based lane detection (LD) plays a critical role in autonomous driving systems, such as adaptive cruise control. However, it is vulnerable to backdoor attacks. Existing backdoor attack methods on LD exhibit limited effectiveness in dynamic real-world scenarios, primarily because they fail to consider dynamic scene factors, including changes in driving perspectives (e.g., viewpoint transformations) and environmental conditions (e.g., weather or lighting changes). To tackle this issue, this paper introduces BadLANE, a dynamic scene adaptation backdoor attack for LD designed to withstand changes in real-world dynamic scene factors. To address the challenges posed by changing driving perspectives, we propose an amorphous trigger pattern composed of shapeless pixels. This trigger design allows the backdoor to be activated by various forms or shapes of mud spots or pollution on the road or lens, enabling adaptation to changes in vehicle observation viewpoints during driving. To mitigate the effects of environmental changes, we design a meta-learning framework to train meta-generators tailored to different environmental conditions. These generators produce meta-triggers that incorporate diverse environmental information, such as weather or lighting conditions, as the initialization of the trigger patterns for backdoor implantation, thus enabling adaptation to dynamic environments. Extensive experiments on various commonly used LD models in both digital and physical domains validate the effectiveness of our attacks, outperforming other baselines significantly (+25.15\% on average in Attack Success Rate). Our codes will be available upon paper publication.
Published: 2024

94. DisBeaNet: A Deep Neural Network to augment Unmanned Surface Vessels for maritime situational awareness

Author: Vemula, Srikanth, Franco, Eulises, and Frye, Michael
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Intelligent detection and tracking of the vessels on the sea play a significant role in conducting traffic avoidance in unmanned surface vessels(USV). Current traffic avoidance software relies mainly on Automated Identification System (AIS) and radar to track other vessels to avoid collisions and acts as a typical perception system to detect targets. However, in a contested environment, emitting radar energy also presents the vulnerability to detection by adversaries. Deactivating these Radiofrequency transmitting sources will increase the threat of detection and degrade the USV's ability to monitor shipping traffic in the vicinity. Therefore, an intelligent visual perception system based on an onboard camera with passive sensing capabilities that aims to assist USV in addressing this problem is presented in this paper. This paper will present a novel low-cost vision perception system for detecting and tracking vessels in the maritime environment. This novel low-cost vision perception system is introduced using the deep learning framework. A neural network, DisBeaNet, can detect vessels, track, and estimate the vessel's distance and bearing from the monocular camera. The outputs obtained from this neural network are used to determine the latitude and longitude of the identified vessel.
Published: 2024

95. A Comprehensive Survey of Masked Faces: Recognition, Detection, and Unmasking

Author: Mahmoud, Mohamed, Kasem, Mahmoud SalahEldin, and Kang, Hyun-Soo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Masked face recognition (MFR) has emerged as a critical domain in biometric identification, especially by the global COVID-19 pandemic, which introduced widespread face masks. This survey paper presents a comprehensive analysis of the challenges and advancements in recognising and detecting individuals with masked faces, which has seen innovative shifts due to the necessity of adapting to new societal norms. Advanced through deep learning techniques, MFR, along with Face Mask Recognition (FMR) and Face Unmasking (FU), represent significant areas of focus. These methods address unique challenges posed by obscured facial features, from fully to partially covered faces. Our comprehensive review delves into the various deep learning-based methodologies developed for MFR, FMR, and FU, highlighting their distinctive challenges and the solutions proposed to overcome them. Additionally, we explore benchmark datasets and evaluation metrics specifically tailored for assessing performance in MFR research. The survey also discusses the substantial obstacles still facing researchers in this field and proposes future directions for the ongoing development of more robust and effective masked face recognition systems. This paper serves as an invaluable resource for researchers and practitioners, offering insights into the evolving landscape of face recognition technologies in the face of global health crises and beyond.
Published: 2024

96. General Place Recognition Survey: Towards Real-World Autonomy

Author: Yin, Peng, Jiao, Jianhao, Zhao, Shiqi, Xu, Lingyun, Huang, Guoquan, Choset, Howie, Scherer, Sebastian, and Han, Jianda
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: In the realm of robotics, the quest for achieving real-world autonomy, capable of executing large-scale and long-term operations, has positioned place recognition (PR) as a cornerstone technology. Despite the PR community's remarkable strides over the past two decades, garnering attention from fields like computer vision and robotics, the development of PR methods that sufficiently support real-world robotic systems remains a challenge. This paper aims to bridge this gap by highlighting the crucial role of PR within the framework of Simultaneous Localization and Mapping (SLAM) 2.0. This new phase in robotic navigation calls for scalable, adaptable, and efficient PR solutions by integrating advanced artificial intelligence (AI) technologies. For this goal, we provide a comprehensive review of the current state-of-the-art (SOTA) advancements in PR, alongside the remaining challenges, and underscore its broad applications in robotics. This paper begins with an exploration of PR's formulation and key research challenges. We extensively review literature, focusing on related methods on place representation and solutions to various PR challenges. Applications showcasing PR's potential in robotics, key PR datasets, and open-source libraries are discussed. We also emphasizes our open-source package, aimed at new development and benchmark for general PR. We conclude with a discussion on PR's future directions, accompanied by a summary of the literature covered and access to our open-source library, available to the robotics community at: https://github.com/MetaSLAM/GPRS., Comment: 20 pages, 12 figures, under review
Published: 2024

97. Continuous max-flow augmentation of self-supervised few-shot learning on SPECT left ventricles

Author: Szűcs, Ádám István, Kári, Béla, and Pártos, Oszkár
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Single-Photon Emission Computed Tomography (SPECT) left ventricular assessment protocols are important for detecting ischemia in high-risk patients. To quantitatively measure myocardial function, clinicians depend on commercially available solutions to segment and reorient the left ventricle (LV) for evaluation. Based on large normal datasets, the segmentation performance and the high price of these solutions can hinder the availability of reliable and precise localization of the LV delineation. To overcome the aforementioned shortcomings this paper aims to give a recipe for diagnostic centers as well as for clinics to automatically segment the myocardium based on small and low-quality labels on reconstructed SPECT, complete field-of-view (FOV) volumes. A combination of Continuous Max-Flow (CMF) with prior shape information is developed to augment the 3D U-Net self-supervised learning (SSL) approach on various geometries of SPECT apparatus. Experimental results on the acquired dataset have shown a 5-10\% increase in quantitative metrics based on the previous State-of-the-Art (SOTA) solutions, suggesting a good plausible way to tackle the few-shot SSL problem on high-noise SPECT cardiac datasets., Comment: ISBI 2024 Accepted paper for presentation
Published: 2024

98. Interpretability Needs a New Paradigm

Author: Madsen, Andreas, Lakkaraju, Himabindu, Reddy, Siva, and Chandar, Sarath
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Interpretability is the study of explaining models in understandable terms to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which believes that only models designed to be explained can be explained, and the post-hoc paradigm, which believes that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This is important, as false but convincing explanations lead to unsupported confidence in artificial intelligence (AI), which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs, the value they bring, and their limitations. Finally, this paper presents 3 emerging paradigms for interpretability. The first paradigm designs models such that faithfulness can be easily measured. Another optimizes models such that explanations become faithful. The last paradigm proposes to develop models that produce both a prediction and an explanation.
Published: 2024

99. Relevant Irrelevance: Generating Alterfactual Explanations for Image Classifiers

Author: Mertes, Silvan, Huber, Tobias, Karle, Christina, Weitz, Katharina, Schlagowski, Ruben, Conati, Cristina, and André, Elisabeth
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: In this paper, we demonstrate the feasibility of alterfactual explanations for black box image classifiers. Traditional explanation mechanisms from the field of Counterfactual Thinking are a widely-used paradigm for Explainable Artificial Intelligence (XAI), as they follow a natural way of reasoning that humans are familiar with. However, most common approaches from this field are based on communicating information about features or characteristics that are especially important for an AI's decision. However, to fully understand a decision, not only knowledge about relevant features is needed, but the awareness of irrelevant information also highly contributes to the creation of a user's mental model of an AI system. To this end, a novel approach for explaining AI systems called alterfactual explanations was recently proposed on a conceptual level. It is based on showing an alternative reality where irrelevant features of an AI's input are altered. By doing so, the user directly sees which input data characteristics can change arbitrarily without influencing the AI's decision. In this paper, we show for the first time that it is possible to apply this idea to black box models based on neural networks. To this end, we present a GAN-based approach to generate these alterfactual explanations for binary image classifiers. Further, we present a user study that gives interesting insights on how alterfactual explanations can complement counterfactual explanations., Comment: Accepted at IJCAI 2024. arXiv admin note: text overlap with arXiv:2207.09374
Published: 2024

100. Picking watermarks from noise (PWFN): an improved robust watermarking model against intensive distortions

Author: Xie, Sijing, Zhao, Chengxin, Sun, Nan, Li, Wei, and Ling, Hefei
Subjects: Computer Science - Multimedia, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Digital watermarking is the process of embedding secret information by altering images in an undetectable way to the human eye. To increase the robustness of the model, many deep learning-based watermarking methods use the encoder-noise-decoder architecture by adding different noises to the noise layer. The decoder then extracts the watermarked information from the distorted image. However, this method can only resist weak noise attacks. To improve the robustness of the decoder against stronger noise, this paper proposes to introduce a denoise module between the noise layer and the decoder. The module aims to reduce noise and recover some of the information lost caused by distortion. Additionally, the paper introduces the SE module to fuse the watermarking information pixel-wise and channel dimensions-wise, improving the encoder's efficiency. Experimental results show that our proposed method is comparable to existing models and outperforms state-of-the-art under different noise intensities. In addition, ablation experiments show the superiority of our proposed module.
Published: 2024

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

160,610 results

Search Results

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources