302 results for "Vasant Honavar"
Search Results
2. Min-redundancy and max-relevance multi-view feature selection for predicting ovarian cancer survival using multi-omics data
- Author
-
Yasser El-Manzalawy, Tsung-Yu Hsieh, Manu Shivakumar, Dokyoon Kim, and Vasant Honavar
- Subjects
Multi-omics data integration, Multi-view feature selection, Cancer survival prediction, Machine learning, Internal medicine, RC31-1245, Genetics, QH426-470 - Abstract
Background: Large-scale collaborative precision medicine initiatives (e.g., The Cancer Genome Atlas (TCGA)) are yielding rich multi-omics data. Integrative analyses of the resulting multi-omics data, such as somatic mutation, copy number alteration (CNA), DNA methylation, miRNA, gene expression, and protein expression, offer tantalizing possibilities for realizing the promise and potential of precision medicine in cancer prevention, diagnosis, and treatment by substantially improving our understanding of underlying mechanisms and enabling the discovery of novel biomarkers for different types of cancers. However, such analyses present a number of challenges, including the heterogeneity and high dimensionality of omics data. Methods: We propose a novel framework for multi-omics data integration using multi-view feature selection. We introduce MRMR-mv, a novel multi-view feature selection algorithm that adapts the well-known Min-Redundancy and Maximum-Relevance (MRMR) single-view feature selection algorithm to the multi-view setting. Results: We report results of experiments on the task of predicting ovarian cancer survival using a multi-omics dataset derived from the TCGA database. Our results suggest that multi-view models outperform both view-specific models (i.e., models trained and tested using a single type of omics data) and models based on two baseline data fusion methods. Conclusions: Our results demonstrate the potential of multi-view feature selection in integrative analyses and predictive modeling from multi-omics data.
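The MRMR criterion at the core of MRMR-mv trades feature relevance against redundancy, both measured via mutual information. As a rough single-view illustration only (this is not the paper's multi-view MRMR-mv), a greedy MRMR selector might look like the sketch below; the histogram-based mutual-information estimator and all function names are assumptions:

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Crude histogram-based mutual information between two 1-D arrays."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mrmr_select(X, y, k):
    """Greedy MRMR: at each step pick the unselected feature maximizing
    relevance(feature, target) - mean redundancy(feature, selected)."""
    n_features = X.shape[1]
    relevance = [mutual_info(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

A duplicated feature is heavily penalized by the redundancy term even when it is highly relevant, which is the behavior MRMR is designed to produce.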
- Published
- 2018
- Full Text
- View/download PDF
3. iScore: An MPI supported software for ranking protein–protein docking models based on a random walk graph kernel and support vector machines
- Author
-
Nicolas Renaud, Yong Jung, Vasant Honavar, Cunliang Geng, Alexandre M.J.J. Bonvin, and Li C. Xue
- Subjects
Protein–protein docking, Scoring, Graph kernel functions, Support vector machines, MPI, Position-specific scoring matrix (PSSM), Computer software, QA76.75-76.765 - Abstract
Computational docking is a promising tool for modeling the three-dimensional (3D) structures of protein–protein complexes, which provide fundamental insights into protein function in cellular life. Singling out near-native models from the huge pool of generated docking models (referred to as the scoring problem) remains a major challenge in computational docking. We recently published iScore, a novel graph kernel-based scoring function. iScore ranks docking models based on the similarity of their interface graphs to a training set of interface graphs, using a support vector machine approach with random-walk graph kernels to classify and rank protein–protein interfaces. Here, we present the software for iScore. The software provides executable scripts that fully automate the computational workflow. In addition, the creation and analysis of the interface graphs can be distributed across different processes using the Message Passing Interface (MPI) and can be offloaded to GPUs thanks to dedicated CUDA kernels.
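iScore's ranking rests on random-walk graph kernels, which compare two graphs by counting the walks they share in their direct-product graph, down-weighting longer walks. The following is a minimal, label-free sketch of that idea only; iScore's actual kernel operates on labeled interface graphs with PSSM-derived node features, which this omits:

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.1, max_len=5):
    """Unlabeled random-walk kernel between two graphs given as
    adjacency matrices: sum of geometrically down-weighted counts
    of common walks, computed on the direct-product graph."""
    Ax = np.kron(A1, A2)            # adjacency of the direct-product graph
    walk = np.ones(Ax.shape[0])     # one walk of length 0 from each node
    total = 0.0
    for step in range(1, max_len + 1):
        walk = Ax @ walk            # extend all walks by one edge
        total += (lam ** step) * walk.sum()
    return total
```

Because the direct product commutes with walk counting, the kernel is symmetric in its arguments, and denser graphs (more shared walks) score higher against themselves.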
- Published
- 2020
- Full Text
- View/download PDF
4. Biomarker discovery in inflammatory bowel diseases using network-based feature selection.
- Author
-
Mostafa Abbas, John Matta, Thanh Le, Halima Bensmail, Tayo Obafemi-Ajayi, Vasant Honavar, and Yasser El-Manzalawy
- Subjects
Medicine, Science - Abstract
Reliable identification of inflammatory biomarkers from metagenomics data is a promising direction for developing non-invasive, cost-effective, and rapid clinical tests for early diagnosis of inflammatory bowel disease (IBD). We present an integrative approach to Network-Based Biomarker Discovery (NBBD) that combines network analysis methods for prioritizing potential biomarkers with machine learning techniques for assessing the discriminative power of the prioritized biomarkers. Using a large dataset of new-onset pediatric IBD metagenomics biopsy samples, we compare the performance of Random Forest (RF) classifiers trained on features selected using a representative set of traditional feature selection methods against the NBBD framework, configured using five different tools for inferring networks from metagenomics data and nine different methods for prioritizing biomarkers, as well as a hybrid approach combining the best traditional and NBBD-based feature selection. We also examine how the performance of the predictive models for IBD diagnosis varies as a function of the size of the data used for biomarker identification. Our results show that (i) NBBD is competitive with some state-of-the-art feature selection methods, including Random Forest Feature Importance (RFFI) scores; and (ii) NBBD is especially effective at reliably identifying IBD biomarkers when the number of data samples available for biomarker discovery is small.
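NBBD's first stage prioritizes features by their position in a network inferred from the data. As an illustration of just one of the many network-inference and prioritization combinations the paper evaluates, the sketch below builds a correlation network over features and ranks them by degree centrality; the function name and the correlation threshold are assumptions:

```python
import numpy as np

def network_prioritize(X, top_k, threshold=0.5):
    """Rank features by degree centrality in a correlation network:
    nodes are features, edges connect feature pairs whose absolute
    Pearson correlation meets the threshold."""
    corr = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(corr, 0.0)               # ignore self-correlation
    adjacency = (np.abs(corr) >= threshold).astype(int)
    degree = adjacency.sum(axis=1)            # degree centrality per feature
    return list(np.argsort(-degree)[:top_k])
```

In the full NBBD pipeline, the prioritized features would then be handed to a classifier (e.g., Random Forest) to assess their discriminative power.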
- Published
- 2019
- Full Text
- View/download PDF
5. BioNetwork Bench: Database and Software for Storage, Query, and Analysis of Gene and Protein Networks
- Author
-
Oksana Kohutyuk, Fadi Towfic, M. Heather West Greenlee, and Vasant Honavar
- Subjects
Biology (General), QH301-705.5 - Published
- 2012
6. FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues.
- Author
-
Yasser El-Manzalawy, Mostafa Abbas, Qutaibah Malluhi, and Vasant Honavar
- Subjects
Medicine, Science - Abstract
A wide range of biological processes, including regulation of gene expression, protein synthesis, and the replication and assembly of many viruses, are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational effort needed to generate PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from the more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles, in terms of the number of hits used to generate the profile, the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data, and the accuracy of the machine learning classifier trained using these profiles. Based on these results, we developed FastRNABindR, an improved version of RNABindR that predicts protein-RNA interface residues using PSSM profiles generated from 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only online protein-RNA interface residue prediction server that requires generation of PSSM profiles for query sequences yet accepts hundreds of protein sequences per submission.
Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential to substantially speed up, and hence increase the practical utility of, other amino acid sequence-based predictors of protein-protein and protein-DNA interfaces.
- Published
- 2016
- Full Text
- View/download PDF
7. RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins.
- Author
-
Rasna R Walia, Li C Xue, Katherine Wilkins, Yasser El-Manzalawy, Drena Dobbs, and Vasant Honavar
- Subjects
Medicine, Science - Abstract
Protein-RNA interactions are central to essential cellular processes such as protein synthesis and regulation of gene expression and play roles in human infectious and genetic diseases. Reliable identification of protein-RNA interfaces is critical for understanding the structural bases and functional implications of such interactions and for developing effective approaches to rational drug design. Sequence-based computational methods offer a viable, cost-effective way to identify putative RNA-binding residues in RNA-binding proteins. Here we report two novel approaches: (i) HomPRIP, a sequence homology-based method for predicting RNA-binding sites in proteins; (ii) RNABindRPlus, a new method that combines predictions from HomPRIP with those from an optimized Support Vector Machine (SVM) classifier trained on a benchmark dataset of 198 RNA-binding proteins. Although highly reliable, HomPRIP cannot make predictions for the unaligned parts of query proteins and its coverage is limited by the availability of close sequence homologs of the query protein with experimentally determined RNA-binding sites. RNABindRPlus overcomes these limitations. We compared the performance of HomPRIP and RNABindRPlus with that of several state-of-the-art predictors on two test sets, RB44 and RB111. On a subset of proteins for which homologs with experimentally determined interfaces could be reliably identified, HomPRIP outperformed all other methods achieving an MCC of 0.63 on RB44 and 0.83 on RB111. RNABindRPlus was able to predict RNA-binding residues of all proteins in both test sets, achieving an MCC of 0.55 and 0.37, respectively, and outperforming all other methods, including those that make use of structure-derived features of proteins. More importantly, RNABindRPlus outperforms all other methods for any choice of tradeoff between precision and recall. 
An important advantage of both HomPRIP and RNABindRPlus is that they rely on readily available sequence and sequence-derived features of RNA-binding proteins. A webserver implementation of both methods is freely available at http://einstein.cs.iastate.edu/RNABindRPlus/.
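The MCC values reported above (e.g., 0.63 on RB44 and 0.83 on RB111 for HomPRIP) summarize binary residue predictions with the Matthews correlation coefficient, which is computed directly from the four confusion-matrix counts:

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

Unlike plain accuracy, MCC stays informative on the heavily imbalanced datasets typical of interface-residue prediction, where non-binding residues dominate.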
- Published
- 2014
- Full Text
- View/download PDF
8. Predicting the binding patterns of hub proteins: a study using yeast protein interaction networks.
- Author
-
Carson M Andorf, Vasant Honavar, and Taner Z Sen
- Subjects
Medicine, Science - Abstract
Protein-protein interactions are critical to elucidating the role played by individual proteins in important biological pathways. Of particular interest are hub proteins that can interact with large numbers of partners and often play essential roles in cellular control. Depending on the number of binding sites, protein hubs can be classified at a structural level as singlish-interface hubs (SIH), with one or two binding sites, or multiple-interface hubs (MIH), with three or more binding sites. In terms of kinetics, hub proteins can be classified as date hubs (i.e., interacting with different partners at different times or locations) or party hubs (i.e., simultaneously interacting with multiple partners). Our approach works in three phases: Phase I classifies whether a protein is likely to bind with another protein. Phase II determines whether a protein-binding (PB) protein is a hub. Phase III classifies PB proteins as singlish-interface versus multiple-interface hubs and date versus party hubs. At each stage, we use sequence-based predictors trained using several standard machine learning techniques. Our method is able to predict whether a protein is a protein-binding protein with an accuracy of 94% and a correlation coefficient of 0.87; identify hubs from non-hubs with 100% accuracy for 30% of the data; distinguish date hubs from party hubs with 69% accuracy and an area under the ROC curve of 0.68; and distinguish SIH from MIH with 89% accuracy and an area under the ROC curve of 0.84. Because our method is based on sequence information alone, it can be used even in settings where reliable protein-protein interaction data or structures of protein-protein complexes are unavailable, to obtain useful insights into the functional and evolutionary characteristics of proteins and their interactions. We provide a web server for our three-phase approach: http://hybsvm.gdcb.iastate.edu.
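The three-phase design described above is a cascade: each phase only sees proteins that passed the previous one. A minimal structural sketch follows; the actual SVM-based, sequence-feature classifiers are replaced here by arbitrary callables, and the class and field names are assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ThreePhaseHubPredictor:
    """Cascade mirroring the staged design: each phase is any callable
    that maps a protein feature vector to a bool."""
    is_binding: Callable       # Phase I: protein-binding or not
    is_hub: Callable           # Phase II: hub or non-hub
    is_multi_iface: Callable   # Phase III: MIH vs SIH (date vs party is analogous)

    def predict(self, x):
        if not self.is_binding(x):
            return "non-binding"
        if not self.is_hub(x):
            return "binding, non-hub"
        return "multi-interface hub" if self.is_multi_iface(x) else "singlish-interface hub"
```

The cascade structure means the later, harder classification problems are only attempted on the subset of proteins where they are meaningful.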
- Published
- 2013
- Full Text
- View/download PDF
9. On evaluating MHC-II binding peptide prediction methods.
- Author
-
Yasser El-Manzalawy, Drena Dobbs, and Vasant Honavar
- Subjects
Medicine, Science - Abstract
Choice of one method over another for MHC-II binding peptide prediction is typically based on published reports of their estimated performance on standard benchmark datasets. We show that several standard benchmark datasets of unique peptides used in such studies contain a substantial number of peptides that share a high degree of sequence identity with one or more other peptide sequences in the same dataset. Thus, in a standard cross-validation setup, the test set and the training set are likely to contain sequences that share a high degree of sequence identity with each other, leading to overly optimistic estimates of performance. Hence, to more rigorously assess the relative performance of different prediction methods, we explore the use of similarity-reduced datasets. We introduce three similarity-reduced MHC-II benchmark datasets derived from the MHCPEP, MHCBN, and IEDB databases. Comparing the performance of three MHC-II binding peptide prediction methods estimated using datasets of unique peptides with that obtained using their similarity-reduced counterparts shows that the former estimates can be rather optimistic. Furthermore, our results demonstrate that conclusions regarding the superiority of one method over another, drawn on the basis of performance estimates obtained using commonly used datasets of unique peptides, are often contradicted by the observed performance of the methods on the similarity-reduced versions of the same datasets. These results underscore the importance of using similarity-reduced datasets in rigorously comparing the performance of alternative MHC-II peptide prediction methods.
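Similarity reduction of a peptide dataset can be done greedily: keep a peptide only if it is sufficiently dissimilar from everything already kept. The sketch below shows one simple way to do this; the ungapped identity measure and the 80% cutoff are assumptions, not the exact procedure used to build the benchmark datasets:

```python
def identity(a, b):
    """Fraction of matching positions over the shorter peptide,
    using ungapped end-to-end comparison (a crude stand-in for alignment)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def similarity_reduce(peptides, max_identity=0.8):
    """Greedily keep a peptide only if its identity to every
    already-kept peptide is below max_identity."""
    kept = []
    for p in peptides:
        if all(identity(p, q) < max_identity for q in kept):
            kept.append(p)
    return kept
```

Applying such a filter before splitting data for cross-validation removes near-duplicate train/test pairs, which is exactly the leakage the abstract warns about.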
- Published
- 2008
- Full Text
- View/download PDF
10. Representing and Reasoning with Qualitative Preferences: Tools and Applications
- Author
-
Ganesh Ram Santhanam, Samik Basu, and Vasant Honavar
- Published
- 2022
11. Machine Learning Prediction of Seizures after Ischemic Strokes. (S30.007)
- Author
-
Alain Lekoubou Looti, Justin Petucci, Avnish Katoch, and Vasant Honavar
- Published
- 2023
- Full Text
- View/download PDF
12. Detecting and Interpreting Changes in Scanning Behavior in Large Network Telescopes
- Author
-
Michalis Kallitsis, Rupesh Prajapati, Vasant Honavar, Dinghao Wu, and John Yen
- Subjects
Computer Networks and Communications, Safety, Risk, Reliability and Quality - Published
- 2022
- Full Text
- View/download PDF
13. The Virtual Data Collaboratory: A Regional Cyberinfrastructure for Collaborative Data-Driven Research
- Author
-
Anthony Simonet, Ivan Rodero, Manish Parashar, Grace Agnew, Forough Ghahramani, Ronald C. Jantz, and Vasant Honavar
- Subjects
Collaborative software, General Computer Science, Computer science, Big data, General Engineering, Cloud computing, Virtual reality, Collaboratory, Data science, Data-driven, Cyberinfrastructure - Abstract
The Virtual Data Collaboratory is a federated data cyberinfrastructure designed to drive data-intensive, interdisciplinary, and collaborative research that will impact researchers, educators, and entrepreneurs across a broad range of disciplines and domains as well as institutional and geographic boundaries.
- Published
- 2020
- Full Text
- View/download PDF
14. A Computational Model of Rodent Spatial Learning and Some Behavioral Experiments
- Author
-
Karthik Balakrishnan, Rushi Bhatt, and Vasant Honavar
- Published
- 2022
- Full Text
- View/download PDF
15. Forecasting User Interests Through Topic Tag Predictions in Online Health Communities
- Author
-
Amogh Subbakrishna Adishesha, Lily Jakielaszek, Fariha Azhar, Peixuan Zhang, Vasant Honavar, Fenglong Ma, Chandra Belani, Prasenjit Mitra, and Sharon Xiaolei Huang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Health Information Management, Health Informatics, Electrical and Electronic Engineering, Computer Science Applications, Machine Learning (cs.LG) - Abstract
The increasing reliance on online communities for healthcare information by patients and caregivers has led to an increase in the spread of misinformation: subjective, anecdotal, inaccurate, or non-specific recommendations which, if acted on, could cause serious harm to patients. Hence, there is an urgent need to connect users with accurate and tailored health information in a timely manner to prevent such harm. This paper proposes an innovative approach to suggesting reliable information to participants in online communities as they move through different stages of their disease or treatment. We hypothesize that patients with similar histories of disease progression or course of treatment will have similar information needs at comparable stages. Specifically, we pose the problem of predicting topic tags or keywords that describe the future information needs of users based on their profiles, traces of their online interactions within the community (past posts, replies), and the profiles and interaction traces of other users with similar profiles and similar histories of interaction with the target users. The result is a variant of collaborative information filtering or recommendation tailored to the needs of users of online health communities. We report results of experiments on an expert-curated dataset which demonstrate the superiority of the proposed approach over state-of-the-art baselines with respect to accurate and timely prediction of topic tags (and hence information sources of interest).
- Published
- 2022
- Full Text
- View/download PDF
16. MetaScore: A novel machine-learning based approach to improve traditional scoring functions for scoring protein-protein docking conformations
- Author
-
Vasant Honavar, Li C. Xue, Cunliang Geng, Yong Jung, and Alexandre M. J. J. Bonvin
- Subjects
Computer science, Machine learning, Random forest, Docking (molecular), Hit rate, Artificial intelligence - Abstract
Protein-protein interactions play a ubiquitous role in biological function. Knowledge of the three-dimensional (3D) structures of the complexes they form is essential for understanding the structural basis of those interactions and how they orchestrate key cellular processes. Computational docking has become an indispensable alternative to the expensive and time-consuming experimental approaches for determining the 3D structures of protein complexes. Despite recent progress, identifying near-native models from a large set of conformations sampled by docking (the so-called scoring problem) still has considerable room for improvement. We present here MetaScore, a new machine-learning-based approach to improve the scoring of docked conformations. MetaScore utilizes a random forest (RF) classifier trained to distinguish near-native from non-native conformations using a rich set of features extracted from the respective protein-protein interfaces. These include physico-chemical properties, energy terms, interaction-propensity-based features, geometric properties, interface topology features, evolutionary conservation, and scores produced by traditional scoring functions (SFs). MetaScore scores docked conformations by simply averaging the score produced by the RF classifier with that produced by any traditional SF. We demonstrate that (i) MetaScore consistently outperforms each of the nine traditional SFs included in this work in terms of success rate and hit rate evaluated over the top 10 predicted conformations; and (ii) an ensemble method, MetaScore-Ensemble, that combines the 10 variants of MetaScore obtained by combining the RF score with each of the traditional SFs, outperforms each of the individual MetaScore variants. We conclude that the performance of traditional SFs can be improved upon by judiciously leveraging machine learning.
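The combination step described above, averaging the RF classifier's score with a traditional SF's score, can be sketched as follows. Since raw SF scores live on arbitrary scales and are often better when lower, the min-max normalization and sign flip here are assumptions for the sketch, not the paper's exact procedure:

```python
import numpy as np

def metascore(rf_probs, sf_scores, sf_lower_is_better=True):
    """Combine an RF near-native probability (already in [0, 1]) with a
    traditional scoring function by simple averaging after bringing the
    SF scores onto a comparable [0, 1], higher-is-better scale."""
    sf = np.asarray(sf_scores, dtype=float)
    if sf_lower_is_better:
        sf = -sf                              # flip so higher = better
    span = sf.max() - sf.min()
    sf = (sf - sf.min()) / (span if span else 1.0)
    return (np.asarray(rf_probs, dtype=float) + sf) / 2.0
```

Ranking conformations by the combined score lets the RF's interface features and the physics-based SF correct each other's failure modes.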
- Published
- 2021
- Full Text
- View/download PDF
17. Personalized Sleep Parameters Estimation from Actigraphy: A Machine Learning Approach
- Author
-
Yasser El-Manzalawy, Lindsay Master, Vasant Honavar, Aria Khademi, and Orfeu M. Buxton
- Subjects
Actigraphy, Polysomnography, Machine learning, Logistic regression, Random forest, Boosting (machine learning), Sleep onset, Behavioral Neuroscience, Medicine, Applied Psychology - Abstract
Background: The current gold standard for measuring sleep is polysomnography (PSG), but it can be obtrusive and costly. Actigraphy is a relatively low-cost and unobtrusive alternative to PSG. Of particular interest in measuring sleep from actigraphy is the prediction of sleep-wake states. The current literature on predicting sleep-wake states from actigraphy consists of methods that use population data, which we call generalized models. However, accounting for the variability of sleep patterns across individuals calls for personalized models of sleep-wake state prediction that could be better suited to individual-level data and yield more accurate estimates of sleep. Purpose: To investigate the validity of developing personalized machine learning models, trained and tested on individual-level actigraphy data, for improved prediction of sleep-wake states and reliable estimation of nightly sleep parameters. Participants and methods: We used a dataset including 54 participants and systematically trained and tested 5 different personalized machine learning models as well as their generalized counterparts. We evaluated model performance against concurrent PSG through extensive machine learning experiments and statistical analyses. Results: Our experiments show the superiority of personalized models over their generalized counterparts in estimating PSG-derived sleep parameters. Personalized models of regularized logistic regression, random forest, adaptive boosting, and extreme gradient boosting achieve estimates of total sleep time, wake after sleep onset, sleep efficiency, and number of awakenings that are closer, in absolute difference, to those obtained by PSG than the same estimates from their generalized counterparts. We further show that the difference between estimates of sleep parameters obtained by personalized models and those of PSG is statistically non-significant.
Conclusion: Personalized machine learning models of sleep-wake states outperform their generalized counterparts in estimating sleep parameters and are indistinguishable from PSG-labeled sleep-wake states. Personalized machine learning models can be used in actigraphy studies of sleep health and, potentially, in screening for some sleep disorders.
- Published
- 2019
- Full Text
- View/download PDF
18. Minimum Intervention Cover of a Causal Graph
- Author
-
Vasant Honavar, Arnab Bhattacharyya, and Saravanan Kandasamy
- Subjects
Causal graph, Causal model, Counterfactual thinking, Theoretical computer science, Computer science, Generalization, Inference, Completeness (statistics), General Medicine - Abstract
Eliciting causal effects from interventions and observations is one of the central concerns of science, and increasingly, artificial intelligence. We provide an algorithm that, given a causal graph G, determines MIC(G), a minimum intervention cover of G, i.e., a minimum set of interventions that suffices for identifying every causal effect that is identifiable in a causal model characterized by G. We establish the completeness of do-calculus for computing MIC(G). MIC(G) effectively offers an efficient compilation of all of the information obtainable from all possible interventions in a causal model characterized by G. Minimum intervention cover finds applications in a variety of contexts including counterfactual inference, and generalizing causal effects across experimental settings. We analyze the computational complexity of minimum intervention cover and identify some special cases of practical interest in which MIC(G) can be computed in time that is polynomial in the size of G.
- Published
- 2019
- Full Text
- View/download PDF
19. SrVARM: State Regularized Vector Autoregressive Model for Joint Learning of Hidden State Transitions and State-Dependent Inter-Variable Dependencies from Multi-variate Time Series
- Author
-
Suhang Wang, Vasant Honavar, Yiwei Sun, Xianfeng Tang, and Tsung-Yu Hsieh
- Subjects
State-space representation, Computer science, Bayesian network, Directed acyclic graph, Recurrent neural network, Autoregressive model, Time series, Algorithm - Abstract
Many applications, e.g., in healthcare and education, call for effective methods for constructing predictive models from high-dimensional time series data in which the relationships between variables can be complex and vary over time. In such settings, the underlying system undergoes a sequence of unobserved transitions among a finite set of hidden states. Furthermore, the relationships between the observed variables and their temporal dynamics may depend on the hidden state of the system. To further complicate matters, the hidden state sequences underlying the observed data from different individuals may not be aligned relative to a common frame of reference. Against this background, we consider the novel problem of jointly learning the state-dependent inter-variable relationships and the pattern of transitions between hidden states from multi-variate time series data. To solve this problem, we introduce the State-Regularized Vector Autoregressive Model (SrVARM), which combines a state-regularized recurrent neural network, to learn the dynamics of transitions between discrete hidden states, with an augmented autoregressive model that captures the inter-variable dependencies in each state using a state-dependent directed acyclic graph (DAG). We propose an efficient algorithm for training SrVARM by leveraging a recently introduced reformulation of the combinatorial problem of optimizing the DAG structure with respect to a scoring function into a continuous optimization problem. We report results of extensive experiments with simulated data as well as a real-world benchmark that show that SrVARM outperforms state-of-the-art baselines in recovering the unobserved state transitions and discovering the state-dependent relationships among variables.
- Published
- 2021
- Full Text
- View/download PDF
20. Explainable Multivariate Time Series Classification
- Author
-
Tsung-Yu Hsieh, Yiwei Sun, Suhang Wang, and Vasant Honavar
- Subjects
Multivariate statistics, Artificial neural network, Computer science, Feature extraction, Machine learning, Convolution, Domain knowledge, Artificial intelligence, Time series - Abstract
Many real-world applications, e.g., in healthcare, present multi-variate time series prediction problems. In such settings, in addition to the predictive accuracy of the models, model transparency and explainability are paramount. We consider the problem of building explainable classifiers from multi-variate time series data. A key criterion for understanding such predictive models involves elucidating and quantifying the contribution of time-varying input variables to the classification. Hence, we introduce a novel, modular, convolution-based feature extraction and attention mechanism that simultaneously identifies the variables as well as the time intervals which determine the classifier output. We present results of extensive experiments with several benchmark data sets that show that the proposed method outperforms the state-of-the-art baseline methods on the multi-variate time series classification task. The results of our case studies demonstrate that the variables and time intervals identified by the proposed method make sense relative to available domain knowledge.
- Published
- 2021
- Full Text
- View/download PDF
21. Feeding the machine: challenges to reproducible predictive modeling in resting-state connectomics
- Author
-
Andrew Cwiek, Sarah Rajtmajer, Brad Wyble, Vasant Honavar, and Frank Hillary
- Abstract
In this critical review, we examine the current application of predictive models (e.g., classifiers) trained using machine learning (ML) to assist in the interpretation of functional neuroimaging data. Our primary goal is to summarize how ML is being applied and to critically assess the most common practices. Our review covers 250 studies using ML and resting-state functional MRI (fMRI) to infer various dimensions of the human functional connectome. Lockbox performance was, on average, ~13% less accurate than performance measured through cross-validation alone, highlighting the importance of held-out ("lockbox") data, which was used in only 16% of the studies surveyed. There was also a concerning lack of transparency across the key steps in training and evaluating predictive models. This summary of the literature underscores the importance of using a lockbox and highlights several methodological pitfalls that can be addressed by the imaging community. We argue that, ideally, studies should be motivated both by the reproducibility and generalizability of findings and by the potential clinical significance of the insights. We offer recommendations for the principled integration of machine learning into the clinical neurosciences, with the goal of advancing imaging biomarkers of brain disorders, understanding causative determinants of health risks, and parsing heterogeneous patient outcomes.
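A lockbox is simply a held-out partition that is never touched during model development and is evaluated exactly once at the end. A minimal sketch of such a split follows; the function name and the 20% fraction are illustrative choices, not prescriptions from the review:

```python
import numpy as np

def lockbox_split(n_samples, lockbox_frac=0.2, seed=0):
    """Partition sample indices into a development set (for training and
    cross-validation) and a lockbox set opened only once, at the very end."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_lock = int(round(n_samples * lockbox_frac))
    return idx[n_lock:], idx[:n_lock]   # (development indices, lockbox indices)
```

All model selection, hyperparameter tuning, and cross-validation happen inside the development set; the lockbox estimate then reflects performance on genuinely unseen data, which is why it tends to be lower than cross-validation accuracy.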
- Published
- 2021
- Full Text
- View/download PDF
22. Functional Autoencoders for Functional Data Representation Learning
- Author
-
Tsung-Yu Hsieh, Yiwei Sun, Suhang Wang, and Vasant Honavar
- Published
- 2021
- Full Text
- View/download PDF
23. FARE: Enabling Fine-grained Attack Categorization under Low-quality Labeled Data
- Author
-
Gang Wang, Vasant Honavar, Tongbo Luo, Xinyu Xing, Junjie Liang, and Wenbo Guo
- Subjects
Categorization, Computer science, Labeled data, Artificial intelligence, Machine learning - Published
- 2021
- Full Text
- View/download PDF
24. Dynamical Gaussian Process Latent Variable Model for Representation Learning from Longitudinal Data
- Author
-
Vasant Honavar and Thanh Le
- Subjects
Computer science, Dimensionality reduction, Cluster analysis, Latent variable model, Representation (mathematics), Gaussian process, Algorithm, Feature learning, Synthetic data, Convolution
- Abstract
Many real-world applications involve longitudinal data, consisting of observations of several variables where different subsets of the variables are sampled at irregularly spaced time points. We introduce the Longitudinal Gaussian Process Latent Variable Model (L-GPLVM), a variant of the Gaussian Process Latent Variable Model, for learning compact representations of such data. L-GPLVM overcomes a key limitation of the Dynamic Gaussian Process Latent Variable Model and its variants, which rely on the assumption that the data are fully observed over all of the sampled time points. We describe an effective approach to learning the parameters of L-GPLVM from sparse observations, by coupling the dynamical model with a Multitask Gaussian Process model that samples the missing observations at each step of the gradient-based optimization of the variational lower bound. We further show the advantage of the Sparse Process Convolution framework for learning the latent representation of sparsely and irregularly sampled longitudinal data with minimal computational overhead relative to a standard Latent Variable Model. Our experiments with synthetic data, as well as variants of MOCAP data with varying degrees of observation sparsity, show that L-GPLVM substantially and consistently outperforms the state-of-the-art alternatives in recovering missing observations, even when the available data exhibit a high degree of sparsity. The resulting compact representations of irregularly sampled and sparse longitudinal data can be used to perform a variety of machine learning tasks, including clustering, classification, and regression.
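A minimal sketch of the basic ingredient involved: using a Gaussian Process posterior to impute missing values at arbitrary time points. This is generic GP regression with an assumed RBF covariance, not the authors' L-GPLVM:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of time points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior_mean(t_obs, y_obs, t_query, length_scale=1.0, noise=1e-2):
    """GP posterior mean at t_query, given irregularly spaced observations
    (t_obs, y_obs); no evenly spaced grid is required."""
    K = rbf_kernel(t_obs, t_obs, length_scale) + noise * np.eye(len(t_obs))
    K_star = rbf_kernel(t_query, t_obs, length_scale)
    return K_star @ np.linalg.solve(K, y_obs)
```

Because the covariance is defined for any pair of time points, the same machinery fills in values at time points where a variable was never observed.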
- Published
- 2020
- Full Text
- View/download PDF
25. Adversarial Attacks on Graph Neural Networks via Node Injections: A Hierarchical Reinforcement Learning Approach
- Author
-
Xianfeng Tang, Yiwei Sun, Vasant Honavar, Tsung-Yu Hsieh, and Suhang Wang
- Subjects
Computer science, Complex network, Graph, Task (computing), Node (computer science), Key (cryptography), Benchmark (computing), Reinforcement learning, Markov decision process, Function (engineering), Computer network
- Abstract
Graph Neural Networks (GNNs) offer a powerful approach to node classification in complex networks across many domains, including social media, E-commerce, and FinTech. However, recent studies show that GNNs are vulnerable to attacks aimed at adversely impacting their node classification performance. Existing studies of adversarial attacks on GNNs focus primarily on manipulating the connectivity between existing nodes, a task that requires greater effort on the part of the attacker in real-world applications. In contrast, it is much more expedient for the attacker to inject adversarial nodes, e.g., fake profiles with forged links, into existing graphs so as to reduce the performance of the GNN in classifying existing nodes. Hence, we consider a novel form of node injection poisoning attacks on graph data. We model the key steps of a node injection attack, e.g., establishing links between the injected adversarial nodes and other nodes, choosing the label of an injected node, etc., by a Markov Decision Process. We propose a novel reinforcement learning method for Node Injection Poisoning Attacks (NIPA) that sequentially modifies the labels and links of the injected nodes, without changing the connectivity between existing nodes. Specifically, we introduce a hierarchical Q-learning network to manipulate the labels of the adversarial nodes and their links with other nodes in the graph, and design an appropriate reward function to guide the reinforcement learning agent to reduce the node classification performance of the GNN. The results of the experiments show that NIPA is consistently more effective than the baseline node injection attack methods for poisoning graph data on three benchmark datasets.
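The reinforcement-learning core of such an approach can be illustrated with plain tabular Q-learning on a toy chain MDP; this is a generic sketch of the update rule, not NIPA's hierarchical Q-network, and all names and parameters are illustrative:

```python
import random

def q_learning(n_states=4, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a toy chain MDP: from each state the agent
    moves left (action 0) or right (action 1); reaching the last state
    ends the episode with reward 1, all other transitions give reward 0."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy action selection.
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Standard one-step Q-learning update.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

In the attack setting described above, the "state" would encode the partially constructed injection, and the "actions" would choose links and labels for the injected nodes; the learning loop itself has the same shape.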
- Published
- 2020
- Full Text
- View/download PDF
26. 0102 Performance Evaluation of a 24-hour Sleep-Wake State Classifier Derived from Research-Grade Actigraphy
- Author
-
Daniel M Roberts, Margeaux M Schade, Anne-Marie Chang, Vasant Honavar, Daniel Gartenberg, and Orfeu M Buxton
- Subjects
Physiology (medical), Neurology (clinical)
- Abstract
Introduction Wrist-worn research-grade actigraphy devices are commonly used to identify sleep and wakefulness in freely-living people. However, common existing algorithms were developed primarily to classify sleep-wake within a defined in-bed period with PSG, and exhibit relatively high sensitivity (accuracy on sleep epochs) but relatively low specificity (accuracy on wake epochs). This classification imbalance causes the algorithms to perform poorly on data without a predefined sleep period, such as a 24-hour interval. Here, we develop a 24-hour actigraphy classifier to overcome the limitations in specificity that typically afflict in-bed-focused algorithms. Methods Four datasets scored via either PSG or direct observation of daytime wakefulness were combined (n=52 participants of mean age 49.8yrs, age range 19 - 86; 52% male; 221 total days/nights). Actigraphy (counts) and PSG (RPSGT-staged epochs) were temporally aligned. A model was trained to transform a time series of actigraphy counts into a time series of sleep-wake classifications, using the TensorFlow library for Python. 5-fold cross-validation was used to train and evaluate the model. Classification performance was compared to the output of the Spectrum device (Philips-Respironics) using the Oakley algorithm with a wake threshold of ‘medium’. Results The developed classifier was compared to the Spectrum classifications across the 24-hour intervals. The developed classifier had higher accuracy (95.4% vs. 76.8%), higher specificity (95.9% vs. 68.9%), and higher balanced accuracy (95.2% vs. 81.6%) relative to the Spectrum classifications, each assessed via paired-sample t-test. Sensitivity did not statistically differ (94.5% vs. 94.4%).
Conclusion The model trained and evaluated on 24-hour data outperformed the existing algorithm output in terms of overall accuracy, specificity, and balanced accuracy, while sensitivity did not significantly differ. A model trained on 24-hour data may be more appropriate for analyses of freely living people, or older populations where napping is more common. Developing an accurate 24-hour sleep/wake classifier fosters new opportunities to evaluate sleep patterns in the absence of self-reports or assumptions about time in bed. Support (If Any) UL1TR002014, NSF#1622766, R43/44-AG056250
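The sensitivity, specificity, and balanced-accuracy figures reported above can be computed directly from epoch-level labels; the sketch below is a generic illustration (function name and toy labels are hypothetical, not the study's data):

```python
def sleep_wake_metrics(y_true, y_pred, sleep=1, wake=0):
    """Sensitivity (accuracy on sleep epochs), specificity (accuracy on
    wake epochs), and their mean, the balanced accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == sleep and p == sleep)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == wake and p == wake)
    n_sleep = sum(1 for t in y_true if t == sleep)
    n_wake = len(y_true) - n_sleep
    sens = tp / n_sleep
    spec = tn / n_wake
    return sens, spec, (sens + spec) / 2
```

Balanced accuracy is the natural summary here because 24-hour recordings are dominated by one class, so raw accuracy can look good even when wake epochs are systematically misclassified.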
- Published
- 2022
- Full Text
- View/download PDF
27. Partner‐specific prediction of RNA‐binding residues in proteins: A critical assessment
- Author
-
Yong Jung, Vasant Honavar, Drena Dobbs, and Yasser EL-Manzalawy
- Subjects
Models, Molecular; protein‐RNA interactions; Protein Conformation; Protein Data Bank (RCSB PDB); Computational biology; Biochemistry; protein‐RNA interface prediction; Sequence Analysis, Protein; Structural Biology; Gene expression; Amino Acid Sequence; RNA‐specificity metric; Binding Sites; Structural motif; Molecular Biology; Base Sequence; Proteins; RNA-Binding Proteins; RNA; performance evaluation; RNA Sequence; Metric (mathematics); RNA-Binding Motifs; Critical assessment; partner‐specific protein‐RNA binding; Software; Protein Binding
- Abstract
RNA‐protein interactions play essential roles in regulating gene expression. While some RNA‐protein interactions are “specific”, that is, the RNA‐binding proteins preferentially bind to particular RNA sequence or structural motifs, others are “non‐RNA specific.” Deciphering the protein‐RNA recognition code is essential for comprehending the functional implications of these interactions and for developing new therapies for many diseases. Because of the high cost of experimental determination of protein‐RNA interfaces, there is a need for computational methods to identify RNA‐binding residues in proteins. While most of the existing computational methods for predicting RNA‐binding residues in RNA‐binding proteins are oblivious to the characteristics of the partner RNA, there is growing interest in methods for partner‐specific prediction of RNA binding sites in proteins. In this work, we assess the performance of two recently published partner‐specific protein‐RNA interface prediction tools, PS‐PRIP, and PRIdictor, along with our own new tools. Specifically, we introduce a novel metric, RNA‐specificity metric (RSM), for quantifying the RNA‐specificity of the RNA binding residues predicted by such tools. Our results show that the RNA‐binding residues predicted by previously published methods are oblivious to the characteristics of the putative RNA binding partner. Moreover, when evaluated using partner‐agnostic metrics, RNA partner‐specific methods are outperformed by the state‐of‐the‐art partner‐agnostic methods. We conjecture that either (a) the protein‐RNA complexes in PDB are not representative of the protein‐RNA interactions in nature, or (b) the current methods for partner‐specific prediction of RNA‐binding residues in proteins fail to account for the differences in RNA partner‐specific versus partner‐agnostic protein‐RNA interactions, or both.
- Published
- 2018
- Full Text
- View/download PDF
28. A user similarity-based Top-N recommendation approach for mobile in-application advertising
- Author
-
Vasant Honavar, Junjie Liang, Jinlong Hu, and Yuezhen Kuang
- Subjects
Information retrieval, Computer science, Mobile advertising, General Engineering, Recommender system, Computer Science Applications, Artificial Intelligence, Factor (programming language), Similarity (psychology), Scalability, Quality (business)
- Abstract
Ensuring the scalability of recommender systems without sacrificing the quality of the recommendations produced presents significant challenges, especially in the large-scale, real-world setting of mobile ad targeting. In this paper, we propose MobRec, a novel two-stage user-similarity-based approach to recommendation, which combines information provided by slowly-changing features of the mobile context with implicit user feedback indicative of user preferences. MobRec uses the contextual features to cluster, during an offline stage, users that share similar patterns of mobile behavior. In the online stage, MobRec focuses on the cluster consisting of users that are most similar to the target user in terms of their contextual features as well as implicit feedback. MobRec also employs a novel strategy for robust estimation of user preferences from noisy clicks. Results of experiments using a large-scale real-world mobile advertising dataset demonstrate that MobRec outperforms state-of-the-art neighborhood-based as well as latent-factor-based recommender systems, in terms of both scalability and the quality of the recommendations.
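The two-stage idea can be illustrated with a deliberately simplified sketch: plain k-means stands in for the offline clustering stage, and cosine-similarity-weighted feedback for the online scoring; all function names and data here are hypothetical, not MobRec's actual algorithm:

```python
import numpy as np

def offline_cluster(contexts, k=2, iters=20, seed=0):
    """Offline stage: group users by slowly-changing context features
    (toy k-means; the paper's clustering stage is more elaborate)."""
    rng = np.random.default_rng(seed)
    centers = contexts[rng.choice(len(contexts), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((contexts[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([contexts[labels == j].mean(axis=0) for j in range(k)])
    return labels

def recommend_top_n(target, labels, feedback, n=3):
    """Online stage: score items by the similarity-weighted feedback of
    users in the target's cluster; never re-recommend clicked items."""
    peers = np.where(labels == labels[target])[0]
    F = feedback[peers].astype(float)
    t = feedback[target].astype(float)
    sims = F @ t / (np.linalg.norm(F, axis=1) * np.linalg.norm(t) + 1e-9)
    scores = sims @ F
    scores[feedback[target] > 0] = -np.inf  # mask already-seen items
    return np.argsort(-scores)[:n]
```

Restricting the similarity search to the target user's cluster is what makes the online stage scale: similarities are computed over one cluster rather than the full user base.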
- Published
- 2018
- Full Text
- View/download PDF
29. Machine-learning guided biophysical model development: application to ribosome catalysis
- Author
-
Yang Jiang, Justin Petucci, Nishant Soni, Vasant Honavar, and Edward O'Brien
- Subjects
Biophysics
- Published
- 2022
- Full Text
- View/download PDF
30. Algorithmic Bias in Recidivism Prediction: A Causal Perspective (Student Abstract)
- Author
-
Vasant Honavar and Aria Khademi
- Subjects
Recidivism, Computer science, Causal inference, Perspective (graphical), Econometrics, Sanctions, Profiling (information science), Observational study, Racial bias, General Medicine, Unmeasured confounding, Causality
- Abstract
ProPublica's analysis of recidivism predictions produced by the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software tool has shown that the predictions were racially biased against African American defendants. We analyze the COMPAS data using a causal reformulation of the underlying algorithmic fairness problem. Specifically, we assess whether COMPAS exhibits racial bias against African American defendants using FACT, a recently introduced causality-grounded measure of algorithmic fairness. We use the Neyman-Rubin potential outcomes framework for causal inference from observational data to estimate FACT from the COMPAS data. Our analysis offers strong evidence that COMPAS exhibits racial bias against African American defendants. We further show that the FACT estimates from the COMPAS data are robust in the presence of unmeasured confounding.
- Published
- 2020
- Full Text
- View/download PDF
31. LIGHTNING TALK - CIF21 DIBBs: EI: Element: The Virtual Data Collaboratory: a Regional Cyberinfrastructure for Collaborative Data Intense Science
- Author
-
Rodero, Ivan, Vasant Honavar, Evans, Jenni, Agnew, Grace, and Oehsen, James Von
- Abstract
This project develops a virtual data collaboratory that can be accessed by researchers, educators, and entrepreneurs across institutional and geographic boundaries, fostering community engagement and accelerating interdisciplinary research. A federated data system is created, using existing components and building upon existing cyberinfrastructure and resources in New Jersey and Pennsylvania. Seven universities are directly involved (the three Rutgers University campuses, Pennsylvania State University, the University of Pennsylvania, the University of Pittsburgh, Drexel University, Temple University, and the City University of New York); indirectly, other regional schools served by the New Jersey and Pennsylvania high-speed networks also participate. The system has applicability to several science and engineering domains, such as protein-DNA interaction and smart cities, and is likely to be extensible to other domains. The cyberinfrastructure is to be integrated into both graduate and undergraduate programs across several institutions. The end product is a fully-developed system for collaborative use by the research and education community. A data management and sharing system is constructed, based largely on commercial off-the-shelf technology. The storage system is based on the Hadoop Distributed File System (HDFS), a Java-based file system providing scalable and reliable data storage, designed to span large clusters of commodity servers. The Fedora and VIVO object-based storage systems are used, enabling linked data approaches. The system will be integrated with existing research data repositories, such as the Ocean Observatories Initiative and Protein Data Bank repositories.
Regional high-performance computing and network infrastructure is leveraged, including New Jersey's Regional Education and Research Network (NJEdge), Pennsylvania's Keystone Initiative for Network Based Education and Research (KINBER), the Extreme Science and Engineering Discovery Environment (XSEDE) computing capabilities, Open Science Grid, and other NSF Campus Cyberinfrastructure investments. The project also develops a custom site federation and data services layer. The data services layer provides services for data linking, search, and sharing; coupling to computation, analytics, and visualization; mechanisms to attach unique Digital Object Identifiers (DOIs), to archive data, and to publish broadly to internal and wider audiences; and management of the long-term data lifecycle, ensuring immutable and authentic data and reproducible research.
- Published
- 2020
- Full Text
- View/download PDF
33. Adaptive Structural Co-regularization for Unsupervised Multi-view Feature Selection
- Author
-
Tsung-Yu Hsieh, Suhang Wang, Vasant Honavar, and Yiwei Sun
- Subjects
Computational complexity theory, Computer science, Big data, Feature selection, Machine learning, Regularization (mathematics), Synthetic data sets, Embedding, Unsupervised learning, Artificial intelligence
- Abstract
With the advent of big data, there is an urgent need for methods and tools for integrative analyses of multi-modal or multi-view data. Of particular interest are unsupervised methods for parsimonious selection of non-redundant, complementary, and information-rich features from multi-view data. We introduce Adaptive Structural Co-Regularization Algorithm (ASCRA) for unsupervised multi-view feature selection. ASCRA jointly optimizes the embeddings of the different views so as to maximize their agreement with a consensus embedding which aims to simultaneously recover the latent cluster structure in the multi-view data while accounting for correlations between views. ASCRA uses the consensus embedding to guide efficient selection of features that preserve the latent cluster structure of the multi-view data. We establish ASCRA's convergence properties and analyze its computational complexity. The results of our experiments using several real-world and synthetic data sets suggest that ASCRA outperforms or is competitive with state-of-the-art unsupervised multi-view feature selection methods.
- Published
- 2019
- Full Text
- View/download PDF
34. Biomarker discovery in inflammatory bowel diseases using network-based feature selection
- Author
-
John Matta, Halima Bensmail, Thanh Le, Mostafa M. Abbas, Vasant Honavar, Yasser EL-Manzalawy, and Tayo Obafemi-Ajayi
- Subjects
Biomarker identification, Computer science, Biopsy, Biochemistry, Machine Learning, Discriminative model, Medicine and Health Sciences, Centrality, Biomarker discovery, Ecology, Inflammatory Bowel Diseases, Genomics, Random forest, Identification (information), Feature (computer vision), Biomarker (medicine), Algorithms, Network Analysis, Computer and Information Sciences, Feature selection, Surgical and Invasive Medical Procedures, Computational biology, Gastroenterology and Hepatology, Network Resilience, Microbiology, Microbial Ecology, Genetics, Humans, Ecology and Environmental Sciences, Biology and Life Sciences, Theoretical models, Metagenomics, Potential biomarkers, Biomarkers
- Abstract
Reliable identification of inflammatory biomarkers from metagenomics data is a promising direction for developing non-invasive, cost-effective, and rapid clinical tests for early diagnosis of IBD. We present Network-Based Biomarker Discovery (NBBD), an integrative approach that combines network analysis methods for prioritizing potential biomarkers with machine learning techniques for assessing the discriminative power of the prioritized biomarkers. Using a large dataset of new-onset pediatric IBD metagenomics biopsy samples, we compare the performance of Random Forest (RF) classifiers trained on features selected using a representative set of traditional feature selection methods against the NBBD framework, configured using five different tools for inferring networks from metagenomics data and nine different methods for prioritizing biomarkers, as well as a hybrid approach combining the best traditional and NBBD-based feature selection. We also examine how the performance of the predictive models for IBD diagnosis varies as a function of the size of the data used for biomarker identification. Our results show that (i) NBBD is competitive with some of the state-of-the-art feature selection methods, including Random Forest Feature Importance (RFFI) scores; and (ii) NBBD is especially effective in reliably identifying IBD biomarkers when the number of data samples available for biomarker discovery is small.
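A minimal, hypothetical instance of the prioritize-then-classify idea: a thresholded correlation network with degree centrality stands in for the five network-inference tools and nine prioritization methods the paper actually evaluates, and the function name is invented for illustration:

```python
import numpy as np

def network_rank_features(X, threshold=0.5):
    """Toy network-based prioritization: connect feature pairs whose
    absolute correlation exceeds `threshold`, then rank features by
    degree centrality in the resulting co-abundance network."""
    C = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(C, 0.0)          # ignore self-correlation
    degree = (C >= threshold).sum(axis=1)
    return np.argsort(-degree), degree
```

The ranked features would then feed a downstream classifier (an RF in the paper) so that the discriminative power of each prioritization can be compared.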
- Published
- 2019
35. PlasmoSEP: Predicting surface-exposed proteins on the malaria parasite using semisupervised self-training and expert-annotated data
- Author
-
Vasant Honavar, Scott E. Lindner, Elyse E. Munoz, and Yasser EL-Manzalawy
- Subjects
Proteomics, Plasmodium, Bioinformatics, Plasmodium vivax, Plasmodium falciparum, Protozoan Proteins, Semi-supervised learning, Computational biology, Biochemistry, Article, Salivary Glands, Parasitic diseases, False positives, Humans, Molecular Biology, Predicting surface-exposed proteins, Membrane Proteins, Plasmodium yoelii, Theoretical models, Malaria, Surface-exposed proteomics, High-Throughput Screening Assays, Proteome, Data mining, Self-training, Algorithms
- Abstract
Accurate and comprehensive identification of surface-exposed proteins (SEPs) in parasites is a key step in developing novel subunit vaccines. However, the reliability of MS-based high-throughput methods for proteome-wide mapping of SEPs continues to be limited due to high rates of false positives (i.e., proteins mistakenly identified as surface exposed) as well as false negatives (i.e., SEPs not detected due to low expression or other technical limitations). We propose a framework called PlasmoSEP for the reliable identification of SEPs using a novel semisupervised learning algorithm that combines SEPs identified by high-throughput experiments and expert annotation of high-throughput data to augment labeled data for training a predictive model. Our experiments using high-throughput data from the Plasmodium falciparum surface-exposed proteome provide several novel high-confidence predictions of SEPs in P. falciparum and also confirm expert annotations for several others. Furthermore, PlasmoSEP predicts that 25 of 37 experimentally identified SEPs in Plasmodium yoelii salivary gland sporozoites are likely to be SEPs. Finally, PlasmoSEP predicts several novel SEPs in P. yoelii and Plasmodium vivax malaria parasites that can be validated for further vaccine studies. Our computational framework can be easily adapted to improve the interpretation of data from high-throughput studies.
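Self-training of the general kind described above can be sketched with a nearest-centroid base learner; this toy version (the names, margin rule, and data are illustrative, not PlasmoSEP's actual algorithm) augments the labeled set with confident pseudo-labels each round:

```python
import numpy as np

def centroid_predict(X, centroids):
    """Squared distances to each class centroid and the nearest class."""
    d = ((X[:, None] - centroids[None]) ** 2).sum(-1)
    return np.argmin(d, axis=1), d

def self_train(X_lab, y_lab, X_unl, rounds=5, margin=1.0):
    """Semisupervised self-training: each round, pseudo-label unlabeled
    points the model is confident about (large gap between the two
    nearest centroids, in squared distance) and retrain on the union.
    Assumes class labels are 0..k-1."""
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unl.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        centroids = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
        pred, d = centroid_predict(pool, centroids)
        ds = np.sort(d, axis=1)
        confident = (ds[:, 1] - ds[:, 0]) >= margin
        if not confident.any():
            break
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, pred[confident]])
        pool = pool[~confident]
    return X, y
```

In the paper's setting, the initial labeled set would come from expert-annotated high-throughput data, and the base learner would be the framework's actual classifier rather than a centroid rule.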
- Published
- 2016
36. Two-dimensional hybrid organic–inorganic perovskites as emergent ferroelectric materials
- Author
-
Vasant Honavar, Dong Yang, Shashank Priya, Yuchen Hou, Congcong Wu, Kai Wang, Adri C. T. van Duin, and Tao Ye
- Subjects
Materials science, General Physics and Astronomy, Nanotechnology, Material Design, Photodetection, Ferroelectricity, Crystal, Photovoltaics, Curie temperature, Light emission, Perovskite (structure)
- Abstract
Hybrid organic–inorganic perovskite (HOIP) materials have attracted significant attention in photovoltaics, light emission, photodetection, etc. Based on the prototype metal halide perovskite crystal, there is a huge space for tuning the composition and crystal structure of this material, which would provide great potential to render multiple physical properties beyond the ongoing emphasis on the optoelectronic property. Recently, the two-dimensional (2D) HOIPs have emerged as a potential candidate for a new class of ferroelectrics with high Curie temperature and spontaneous polarization. Room-temperature solution-processability further makes HOIP a promising alternative to traditional oxide ferroelectrics such as BaTiO3 and PbTiO3. In this perspective, we focus on the molecular aspects of 2D HOIPs, their correlation with macroscopic properties, as well as the material design rules assisted by advanced simulation tools (e.g., machine learning and atomistic modeling techniques). The perspective provides a comprehensive discussion on the structural origin of ferroelectricity, current progress in the design of new materials, and potential opportunities and challenges with emerging materials. We expect that this perspective will provide inspiration for innovation in 2D HOIP ferroelectrics.
- Published
- 2020
- Full Text
- View/download PDF
37. MEGAN: A Generative Adversarial Network for Multi-View Network Embedding
- Author
-
Tsung-Yu Hsieh, Suhang Wang, Vasant Honavar, Xianfeng Tang, and Yiwei Sun
- Subjects
Social and Information Networks (cs.SI), FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG), Computer Science - Social and Information Networks, Theoretical computer science, Computer science, Node (networking), Network embedding, Network structure, Link (geometry), Visualization, Generative adversarial network
- Abstract
Data from many real-world applications can be naturally represented by multi-view networks, where the different views encode different types of relationships (e.g., friendship, shared interests in music, etc.) between real-world individuals or entities. There is an urgent need for methods to obtain low-dimensional, information-preserving, and typically nonlinear embeddings of such multi-view networks. However, most of the work on multi-view learning focuses on data that lack a network structure, and most of the work on network embeddings has focused primarily on single-view networks. Against this background, we consider the multi-view network representation learning problem, i.e., the problem of constructing low-dimensional, information-preserving embeddings of multi-view networks. Specifically, we investigate a novel Generative Adversarial Network (GAN) framework for Multi-View Network Embedding, namely MEGAN, aimed at preserving the information from the individual network views, while accounting for connectivity across (and hence complementarity of and correlations between) different views. The results of our experiments on two real-world multi-view data sets show that the embeddings obtained using MEGAN outperform the state-of-the-art methods on node classification, link prediction, and visualization tasks. Comment: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19).
- Published
- 2019
- Full Text
- View/download PDF
38. Improving Image Captioning by Leveraging Knowledge Graphs
- Author
-
Yimin Zhou, Yiwei Sun, and Vasant Honavar
- Subjects
FOS: Computer and information sciences, Computer Science - Computer Vision and Pattern Recognition (cs.CV), Closed captioning, Information retrieval, Commonsense knowledge, Computer science, Feature extraction, Cognitive neuroscience of visual object recognition, Semantics, Image (mathematics), Visualization, Recurrent neural network, Artificial intelligence
- Abstract
We explore the use of knowledge graphs, which capture general or commonsense knowledge, to augment the information extracted from images by state-of-the-art methods for image captioning. The results of our experiments on several benchmark data sets, such as MS COCO, as measured by CIDEr-D, a performance metric for image captioning, show that variants of the state-of-the-art methods for image captioning that make use of the information extracted from knowledge graphs can substantially outperform those that rely solely on the information extracted from images. Comment: Accepted by WACV'19.
- Published
- 2019
- Full Text
- View/download PDF
39. Towards Robust Relational Causal Discovery
- Author
-
Lee, S. and Vasant Honavar
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
We consider the problem of learning causal relationships from relational data. Existing approaches rely on queries to a relational conditional independence (RCI) oracle to establish and orient causal relations in such a setting. In practice, queries to an RCI oracle have to be replaced by reliable tests for RCI against available data. Relational data present several unique challenges in testing for RCI. We study the conditions under which traditional iid-based conditional independence (CI) tests yield reliable answers to RCI queries against relational data. We show how to conduct CI tests against relational data to robustly recover the underlying relational causal structure. Results of our experiments demonstrate the effectiveness of our proposed approach., Comment: 14 pages
- Published
- 2019
- Full Text
- View/download PDF
40. iScore: A novel graph kernel-based function for scoring protein-protein docking models
- Author
-
Yong Jung, Li C. Xue, Nicolas Renaud, Cunliang Geng, Alexandre M. J. J. Bonvin, Vasant Honavar, Sub NMR Spectroscopy, and NMR Spectroscopy
- Subjects
0303 health sciences ,03 medical and health sciences ,Graph kernel ,010304 chemical physics ,Docking (molecular) ,Computer science ,Protein protein ,0103 physical sciences ,Computational biology ,01 natural sciences ,Function (biology) ,030304 developmental biology ,Conserved sequence - Abstract
Protein complexes play a central role in many aspects of biological function. Knowledge of the three-dimensional (3D) structures of protein complexes is critical for gaining insights into the structural basis of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determination of 3D structures of protein complexes, computational docking has evolved as a valuable tool to predict the 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge. Here we present iScore, a novel approach to scoring docked conformations that combines HADDOCK energy terms with a score obtained using a graph representation of the protein-protein interfaces and a measure of evolutionary conservation. It achieves a scoring performance competitive with, or superior to, that of the state-of-the-art scoring functions on independent data sets consisting of docking-software-specific data sets and the CAPRI score set built from a wide variety of docking approaches. iScore ranks among the top scoring approaches on the CAPRI score set (13 targets) when compared with the 37 scoring groups in CAPRI. The results demonstrate the utility of combining evolutionary, topological, and physicochemical information for scoring docked conformations. This work represents the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes. It paves the way for the further development of computational methods for predicting the structure of protein complexes.
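Graph kernels compare graphs through inner products of graph-derived features. As a hedged illustration only (iScore's actual graph kernel and its combination with HADDOCK energy terms are far more involved), a toy vertex-label histogram kernel over the residue labels of two interface graphs can be written as:

```python
from collections import Counter

def label_histogram_kernel(labels_g1, labels_g2):
    """Toy vertex-label histogram kernel: the inner product of the
    label-count vectors of two graphs. A much simpler stand-in for the
    graph kernels used in scoring functions such as iScore."""
    h1, h2 = Counter(labels_g1), Counter(labels_g2)
    return sum(h1[lbl] * h2[lbl] for lbl in h1.keys() & h2.keys())
```

Two interfaces with residue labels `["A", "A", "B"]` and `["A", "B", "B"]` share labels A and B, giving a kernel value of 2*1 + 1*2 = 4.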
- Published
- 2018
- Full Text
- View/download PDF
41. Multi-View Network Embedding Via Graph Factorization Clustering and Co-Regularized Multi-View Agreement
- Author
-
Vasant Honavar, Tsung-Yu Hsieh, Ngot Bui, and Yiwei Sun
- Subjects
Social and Information Networks (cs.SI) ,FOS: Computer and information sciences ,Computer Science - Machine Learning ,Theoretical computer science ,Social network ,Linear programming ,Computer science ,business.industry ,Network embedding ,Computer Science - Social and Information Networks ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Graph ,Machine Learning (cs.LG) ,0202 electrical engineering, electronic engineering, information engineering ,Bipartite graph ,Graph (abstract data type) ,Embedding ,020201 artificial intelligence & image processing ,Cluster analysis ,business ,Graph factorization ,0105 earth and related environmental sciences - Abstract
Real-world social networks and digital platforms are comprised of individuals (nodes) that are linked to other individuals or entities through multiple types of relationships (links). Sub-networks of such a network based on each type of link correspond to distinct views of the underlying network. In real-world applications, each node is typically linked to only a small subset of other nodes. Hence, practical approaches to problems such as node labeling have to cope with the resulting sparse networks. While low-dimensional network embeddings offer a promising approach to this problem, most of the current network embedding methods focus primarily on single-view networks. We introduce a novel multi-view network embedding (MVNE) algorithm for constructing low-dimensional node embeddings from multi-view networks. MVNE adapts and extends an approach to single-view network embedding (SVNE) using graph factorization clustering (GFC) to the multi-view setting using an objective function that maximizes the agreement between views based on both the local and global structure of the underlying multi-view graph. Our experiments with several benchmark real-world single-view networks show that GFC-based SVNE yields network embeddings that are competitive with or superior to those produced by the state-of-the-art single-view network embedding methods when the embeddings are used for labeling unlabeled nodes in the networks. Our experiments with several multi-view networks show that MVNE substantially outperforms the single-view methods on the integrated view and the state-of-the-art multi-view methods. We further show that even when the goal is to predict labels of nodes within a single target view, MVNE outperforms its single-view counterpart, suggesting that MVNE is able to extract information that is useful for labeling nodes in the target view from all of the views., ICDMW2018 -- IEEE International Conference on Data Mining workshop on Graph Analytics
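As a minimal sketch of the multi-view setting itself (not of the MVNE algorithm), the assumed helper below builds one adjacency structure per view from edge lists, plus the naive "integrated" edge-union view of the kind such methods are compared against:

```python
def view_adjacency(edges_by_view):
    """edges_by_view: dict mapping a view name to a list of (u, v)
    undirected edges. Returns per-view adjacency dicts plus a naive
    'integrated' view formed by taking the union of all edges."""
    views = {}
    for name, edges in edges_by_view.items():
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        views[name] = adj
    integrated = {}
    for adj in views.values():
        for u, nbrs in adj.items():
            integrated.setdefault(u, set()).update(nbrs)
    return views, integrated
```

With a "friendship" view containing edge (a, b) and a "work" view containing (b, c), node b's integrated neighbourhood is {a, c}, even though each single view sees only one of its neighbours.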
- Published
- 2018
42. Min-redundancy and max-relevance multi-view feature selection for predicting ovarian cancer survival using multi-omics data
- Author
-
Vasant Honavar, Dokyoon Kim, Tsung-Yu Hsieh, Yasser EL-Manzalawy, and Manu Shivakumar
- Subjects
0301 basic medicine ,Computer science ,02 engineering and technology ,computer.software_genre ,Gene expression ,0202 electrical engineering, electronic engineering, information engineering ,Redundancy (engineering) ,Genetics (clinical) ,Ovarian Neoplasms ,0303 health sciences ,High-Throughput Nucleotide Sequencing ,Cancer survival ,Prognosis ,3. Good health ,Survival Rate ,DNA methylation ,Biomarker (medicine) ,020201 artificial intelligence & image processing ,Female ,DNA microarray ,Algorithms ,Data integration ,lcsh:Internal medicine ,lcsh:QH426-470 ,DNA Copy Number Variations ,Feature selection ,Computational biology ,Multi-view feature selection ,03 medical and health sciences ,Machine learning ,Genetics ,medicine ,Biomarkers, Tumor ,Humans ,Relevance (information retrieval) ,lcsh:RC31-1245 ,030304 developmental biology ,Cancer survival prediction ,Cancer prevention ,Gene Expression Profiling ,Research ,Computational Biology ,DNA Methylation ,Precision medicine ,medicine.disease ,Human genetics ,lcsh:Genetics ,Multi-omics data integration ,030104 developmental biology ,Ovarian cancer ,Transcriptome ,computer - Abstract
Background: Large-scale collaborative precision medicine initiatives (e.g., The Cancer Genome Atlas (TCGA)) are yielding rich multi-omics data. Integrative analyses of the resulting multi-omics data, such as somatic mutation, copy number alteration (CNA), DNA methylation, miRNA, gene expression, and protein expression, offer tantalizing possibilities for realizing the promise and potential of precision medicine in cancer prevention, diagnosis, and treatment by substantially improving our understanding of underlying mechanisms as well as the discovery of novel biomarkers for different types of cancers. However, such analyses present a number of challenges, including the heterogeneity of data types and the extremely high dimensionality of omics data. Methods: In this study, we propose a novel framework for integrating multi-omics data based on multi-view feature selection, an emerging problem in machine learning. We also present a novel multi-view feature selection algorithm, MRMR-mv, which adapts the well-known Min-Redundancy and Maximum-Relevance (MRMR) single-view feature selection algorithm to the multi-view setting. Results: We report results of experiments on the task of building a predictive model of cancer survival from an ovarian cancer multi-omics dataset derived from the TCGA database. Our results suggest that multi-view models for predicting ovarian cancer survival outperform both view-specific models (i.e., models trained and tested using one multi-omics data source) and models based on two baseline data fusion methods. Conclusions: Our results demonstrate the potential of multi-view feature selection in integrative analyses and predictive modeling from multi-omics data.
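The single-view MRMR criterion that MRMR-mv builds on can be sketched in a few lines: greedily pick the feature with the highest relevance (mutual information with the labels) minus mean redundancy (mutual information with already-selected features). The following toy implementation over discrete features is an illustration only; the multi-view extension described in the paper is not reproduced here, and all names are assumed:

```python
from collections import Counter
from math import log2

def mutual_info(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_select(features, labels, k):
    """Greedy MRMR: at each step pick the candidate maximizing
    relevance to the labels minus mean redundancy with the
    already-selected features."""
    selected, candidates = [], list(features)
    while len(selected) < k and candidates:
        def score(f):
            rel = mutual_info(features[f], labels)
            if not selected:
                return rel
            red = sum(mutual_info(features[f], features[s])
                      for s in selected) / len(selected)
            return rel - red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

On a toy dataset where feature "a" exactly tracks the labels, "a" is selected first regardless of the other candidates.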
- Published
- 2018
43. Microbiomarkers Discovery in Inflammatory Bowel Diseases using Network-Based Feature Selection
- Author
-
Mostafa M. Abbas, Thanh Le, Yasser EL-Manzalawy, Vasant Honavar, and Halima Bensmail
- Subjects
0301 basic medicine ,03 medical and health sciences ,030104 developmental biology ,Metagenomics ,Node (networking) ,Inference ,Feature selection ,Genomics ,Identification (biology) ,Computational biology ,Biomarker discovery ,Biology ,Random forest - Abstract
Discovery of disease biomarkers is a key step in translating advances in genomics into clinical practice. There is growing evidence that changes in gut microbial composition are associated with the onset and progression of Type 2 Diabetes (T2D), Obesity, and Inflammatory Bowel Disease (IBD). Reliable identification of the most informative features (i.e., microbes) for discriminating metagenomics samples from two or more groups (i.e., phenotypes) is a major challenge in computational metagenomics. We propose a Network-Based Biomarker Discovery (NBBD) framework for detecting disease biomarkers from metagenomics data. NBBD has two major customizable modules: i) a network inference module for inferring ecological networks from the abundances of microbial operational taxonomic units (OTUs); ii) a node importance scoring module for comparing the constructed networks for the chosen phenotypes and assigning a score to each node based on the degree to which the topological properties of the node differ across the two networks. We empirically evaluated the proposed NBBD framework, using five network inference methods for inferring gut microbial networks combined with six node topological properties, on the identification of IBD biomarkers using a large dataset of metagenomic biopsy samples from 657 IBD patients and 316 healthy controls. Our results show that NBBD is very competitive with some of the state-of-the-art feature selection methods, including the widely used method based on random forest variable importance scores.
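The node importance scoring module can be sketched as follows, assuming the two phenotype-specific networks have already been inferred and using degree centrality as the (swappable) topological property. This is a minimal illustration under those assumptions, not the authors' implementation:

```python
def degree_centrality(adj):
    """adj: dict mapping a node to its set of neighbours.
    Returns each node's degree normalized by the maximum possible."""
    n = len(adj)
    return {v: (len(nbrs) / (n - 1) if n > 1 else 0.0)
            for v, nbrs in adj.items()}

def nbbd_scores(net_a, net_b, prop=degree_centrality):
    """Score each node (OTU) by how much its topological property
    differs between the two phenotype-specific networks; high-scoring
    nodes are candidate biomarkers."""
    pa, pb = prop(net_a), prop(net_b)
    nodes = set(pa) | set(pb)
    return {v: abs(pa.get(v, 0.0) - pb.get(v, 0.0)) for v in nodes}
```

For example, an OTU that is a hub in the healthy-control network but peripheral in the IBD network receives a high score, while an OTU with identical connectivity in both networks scores zero.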
- Published
- 2018
- Full Text
- View/download PDF
44. Representing and reasoning with modular ontologies
- Author
-
Jie Bao and Vasant Honavar
- Subjects
Theoretical computer science ,Knowledge representation and reasoning ,Commonsense knowledge ,business.industry ,Computer science ,Ontology-based data integration ,Process ontology ,Web Ontology Language ,Ontology (information science) ,Reuse ,Ontology language ,Semantics ,Knowledge sharing ,Knowledge-based systems ,Description logic ,Knowledge base ,Ontology components ,Ontology ,Upper ontology ,IDEF5 ,Software engineering ,business ,Semantic Web ,computer ,computer.programming_language - Abstract
Realizing the full potential of the semantic web requires the large-scale adoption and use of ontology based approaches to sharing of information and resources. In such a setting, instead of a single, centralized ontology, it is much more natural to have multiple distributed ontologies that cover different, perhaps partially overlapping, domains. Such ontologies represent the local knowledge of the ontology designers, that is, knowledge that is applicable within a specific context. Hence, many application scenarios, such as collaborative construction and management of complex ontologies, distributed databases, and large knowledge base applications, present an urgent need for ontology languages that support localized and contextualized semantics, partial and selective reuse of ontology modules, flexible ways to limit the scope and visibility of knowledge (as needed for selective knowledge sharing), federated approaches to reasoning with distributed ontologies, and structured approaches to collaborative construction of large ontologies. Against this background, this dissertation develops a family of description logics based modular ontology languages, namely Package-based Description Logics (P-DL), to address the needs of such applications. 
The main contributions of this dissertation include: (1) The identification and theoretical characterization of the desiderata of modular ontology languages that can support selective sharing and reuse of knowledge across independently developed knowledge bases; (2) The development of a family of ontology languages called P-DL, which extend the classical description logics (DL) to support selective knowledge sharing through a novel semantic importing mechanism and the establishment of a minimal set of restrictions on the use of imported concepts and roles to support localized semantics, transitive propagation of imported knowledge, and different interpretations from the point of view of different ontology modules; (3) The development of a family of sound and complete tableau-based federated reasoning algorithms for distributed, autonomous P-DL ontologies, including ALCP and SHIQP, i.e., P-DL ontologies whose individual modules are expressed in the P-DL counterparts of the DLs ALC and SHIQ respectively, that can be used to efficiently reason over a set of distributed, autonomous ontology modules from the point of view of any specific module, avoiding the need to integrate the ontologies, using message exchanges between modules as needed; (4) The formulation of criteria for answering queries against a knowledge base using hidden or private knowledge, whenever it is feasible to do so without compromising hidden knowledge, and the development of privacy-preserving reasoning strategies for the case of the commonly used hierarchical ontologies and SHIQ ontologies, along with a theoretical characterization of the conditions under which they are guaranteed to be privacy-preserving; (5) The development of prototype tools for collaborative development of large ontologies, including support for concurrent editing and partial loading of ontologies into memory.
- Published
- 2018
- Full Text
- View/download PDF
45. Abstraction, aggregation and recursion for generating accurate and simple classifiers
- Author
-
Vasant Honavar and Dae-Ki Kang
- Subjects
Computer science ,business.industry ,Decision tree ,Pattern recognition ,Machine learning ,computer.software_genre ,Support vector machine ,Generative model ,Naive Bayes classifier ,ComputingMethodologies_PATTERNRECOGNITION ,Knowledge extraction ,Anomaly detection ,Artificial intelligence ,Tuple ,Cluster analysis ,business ,computer - Abstract
An important goal of inductive learning is to generate accurate and compact classifiers from data. In a typical inductive learning scenario, instances in a data set are simply represented as ordered tuples of attribute values. In our research, we explore three methodologies to improve the accuracy and compactness of the classifiers: abstraction, aggregation, and recursion. Firstly, abstraction is aimed at the design and analysis of algorithms that generate and deal with taxonomies for the construction of compact and robust classifiers. In many applications of the data-driven knowledge discovery process, taxonomies have been shown to be useful in constructing compact, robust, and comprehensible classifiers. However, in many application domains, human-designed taxonomies are unavailable. We introduce algorithms for automated construction of taxonomies inductively from both structured (such as UCI Repository) and unstructured (such as text and biological sequences) data. We introduce AVT-Learner, an algorithm for automated construction of attribute value taxonomies (AVT) from data, and Word Taxonomy Learner (WTL), an algorithm for automated construction of word taxonomies from text and sequence data. We describe experiments on the UCI data sets and compare the performance of AVT-NBL (an AVT-guided Naive Bayes Learner) with that of the standard Naive Bayes Learner (NBL). Our results show that the AVTs generated by AVT-Learner are competitive with human-generated AVTs (in cases where such AVTs are available). AVT-NBL using AVTs generated by AVT-Learner achieves classification accuracies that are comparable to or higher than those obtained by NBL; and the resulting classifiers are significantly more compact than those generated by NBL. 
Similarly, our experimental results of WTL and WTNBL on protein localization sequences and Reuters newswire text categorization data sets show that the proposed algorithms can generate Naive Bayes classifiers that are more compact and often more accurate than those produced by standard Naive Bayes learner for the Multinomial Model. Secondly, we apply aggregation to construct features as a multiset of values for the intrusion detection task. For this task, we propose a bag of system calls representation for system call traces and describe misuse and anomaly detection results on the University of New Mexico (UNM) and MIT Lincoln Lab (MIT LL) system call sequences with the proposed representation. With the feature representation as input, we compare the performance of several machine learning techniques for misuse detection and show experimental results on anomaly detection. The results show that standard machine learning and clustering techniques using the simple bag of system calls representation based on the system call traces generated by the operating system's kernel is effective and often performs better than approaches that use foreign contiguous sequences in detecting intrusive behaviors of compromised processes. Finally, we construct a set of classifiers by recursive application of the Naive Bayes learning algorithms. Naive Bayes (NB) classifier relies on the assumption that the instances in each class can be described by a single generative model. This assumption can be restrictive in many real world classification tasks. We describe recursive Naive Bayes learner (RNBL), which relaxes this assumption by constructing a tree of Naive Bayes classifiers for sequence classification, where each individual NB classifier in the tree is based on an event model (one model for each class at each node in the tree). 
In our experiments on protein sequences, Reuters newswire documents and UC-Irvine benchmark data sets, we observe that RNBL substantially outperforms NB classifier. Furthermore, our experiments on the protein sequences and the text documents show that RNBL outperforms C4.5 decision tree learner (using tests on sequence composition statistics as the splitting criterion) and yields accuracies that are comparable to those of support vector machines (SVM) using similar information.
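The bag-of-system-calls representation described above is simple to sketch: a trace is mapped to a fixed-length count vector over a vocabulary of system calls, discarding ordering. A minimal illustrative version (names are assumptions, not the dissertation's code):

```python
from collections import Counter

def bag_of_system_calls(trace, vocabulary):
    """Map an ordered system-call trace to a fixed-length count vector
    over the given vocabulary, discarding ordering (the 'bag'
    representation used as input to standard learners)."""
    counts = Counter(trace)
    return [counts.get(call, 0) for call in vocabulary]
```

With vocabulary `["open", "read", "write", "close"]`, the trace `["open", "read", "read", "close"]` becomes the vector `[1, 2, 0, 1]`, which any standard classifier or clustering method can consume directly.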
- Published
- 2018
- Full Text
- View/download PDF
46. Learning classifiers from distributed, semantically heterogeneous, autonomous data sources
- Author
-
Doina Caragea and Vasant Honavar
- Subjects
Set (abstract data type) ,Interoperation ,Information retrieval ,Theoretical computer science ,Data acquisition ,Process (engineering) ,Computer science ,Throughput ,Ontology (information science) ,Communication complexity ,Knowledge acquisition - Abstract
Recent advances in computing, communications, and digital storage technologies, together with development of high throughput data acquisition technologies, have made it possible to gather and store large volumes of data in digital form. These developments have resulted in unprecedented opportunities for large-scale data-driven knowledge acquisition with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed, semantically heterogeneous and autonomously owned and operated, which makes it impossible to use traditional machine learning algorithms for knowledge acquisition. However, we observe that most learning algorithms use only certain statistics computed from the data in the process of generating the hypothesis that they output, and we use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact in that the classifiers produced by them are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data is available in a central location), and they compare favorably to their centralized counterparts in terms of time and communication complexity. To deal with the semantic heterogeneity problem, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between data source ontologies and the user ontology. We show how these constraints can be used to define mappings and conversion functions needed to answer statistical queries from semantically heterogeneous data viewed from a certain user perspective. 
These results are further used to extend our approach for learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data. The work described above contributed to the design and implementation of AirlDM, a collection of data-source-independent machine learning algorithms implemented by means of sufficient statistics and data source wrappers, and to the design of INDUS, a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.
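The key observation, that a learner needing only certain statistics can gather them at each source and combine them, can be illustrated with Naive Bayes counts. The helper names below are assumptions, and a real system would add query planning, ontologies, and wrappers; the point is only that merged per-source sufficient statistics equal the centralized ones, so the learned classifier is provably identical:

```python
from collections import Counter

def nb_sufficient_stats(rows):
    """rows: list of (label, feature_tuple) examples held at one source.
    Returns the counts a Naive Bayes learner needs: class counts and
    per-class feature-value counts."""
    class_counts, feature_counts = Counter(), Counter()
    for label, feats in rows:
        class_counts[label] += 1
        for i, v in enumerate(feats):
            feature_counts[(label, i, v)] += 1
    return class_counts, feature_counts

def merge_stats(stats_list):
    """Combine the sufficient statistics gathered at each source."""
    cc, fc = Counter(), Counter()
    for c, f in stats_list:
        cc.update(c)
        fc.update(f)
    return cc, fc
```

Merging the statistics from two partitions of a dataset yields exactly the statistics of the whole dataset, so a Naive Bayes model built from the merged counts matches the centralized one.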
- Published
- 2018
- Full Text
- View/download PDF
47. Biologically inspired computational structures and processes for autonomous agents and robots
- Author
-
Vasant Honavar and Karthik Balakrishnan
- Subjects
Human–computer interaction ,Computer science ,Autonomous agent ,Robot - Published
- 2018
- Full Text
- View/download PDF
48. Interactive and verifiable web services composition, specification reformulation and substitution
- Author
-
Jyotishman Pathak and Vasant Honavar
- Subjects
Service (systems architecture) ,Database ,Computer science ,Event (computing) ,business.industry ,Services computing ,computer.software_genre ,Data structure ,Component (UML) ,Component-based software engineering ,Web service ,WS-Policy ,Software engineering ,business ,computer - Abstract
Recent advances in networks, information and computation grids, and the WWW have resulted in the proliferation of physically distributed and autonomously developed software components and services. These developments allow us to rapidly build new value-added applications from existing ones in various domains such as e-Science, e-Business, and e-Government. Towards this end, this dissertation develops solutions for the following problems related to Web services and Service-Oriented Architectures: (1) Web service composition. The ability to compose complex Web services from a multitude of available component services is one of the most important problems in the service-oriented computing paradigm. In this dissertation, we propose a new framework for modeling complex Web services based on the techniques of abstraction, composition and reformulation. The approach allows service developers to specify an abstract and possibly incomplete specification of the composite (goal) service. This specification is used to select a set of suitable component services such that their composition realizes the desired goal. In the event that such a composition is unrealizable, the cause for the failure of composition is determined and is communicated to the developer, thereby enabling further reformulation of the goal specification. This process can be iterated until a feasible composition is identified or the developer decides to abort. (2) Web service specification reformulation. In practice, the composite service specifications provided by service developers often result in the failure of composition. Typically, handling such failure requires the developer to analyze the cause(s) of the failure and obtain an alternate composition specification that can be realized from the available services. 
To assist developers in such situations, we describe a technique that, given the specification of a desired composite service with a certain functional behavior, automatically identifies alternate specifications with the same functional behavior. At its core, our technique relies on analyzing data and control dependencies of the composite service and generating alternate specifications on-the-fly without violating the dependencies. We present a novel data structure to record these dependencies, and devise algorithms for populating the data structure and for obtaining the alternatives. (3) Web service substitution. The assembly of a composite service that satisfies a desired set of requirements is only the first step. Ensuring that the composite service, once assembled, can be successfully deployed presents additional challenges that need to be addressed. In particular, it is possible that one or more of the component services participating in a composition might become unavailable during deployment. Such circumstances warrant the unavailable service to be substituted by another without violating the functional and behavioral properties of the composition. To address this requirement, we introduce the notion of context-specific substitutability in Web services, where context refers to the overall functionality of the composition that is required to be maintained after replacement of its constituents. Using the context information, we investigate two variants of the substitution problem, namely environment-independent and environment-dependent, where environment refers to the constituents of a composition, and show how the substitutability criteria can be relaxed within this model. The work described above contributed to the design and implementation of MoSCoE—an open-source platform for modeling and executing complex Web services (http://www.moscoe.org).
- Published
- 2018
- Full Text
- View/download PDF
49. Structural induction: towards automatic ontology elicitation
- Author
-
Vasant Honavar and Adrian Silvescu
- Subjects
Structure (mathematical logic) ,Class (set theory) ,Theoretical computer science ,Formalism (philosophy) ,Computer science ,business.industry ,Abstraction (mathematics) ,Ontology ,Structural induction ,Artificial intelligence ,Problem of induction ,business ,Turing ,computer ,computer.programming_language - Abstract
Induction is the process by which we obtain laws (and, more encompassing, theories) about the world. This process can be thought of as aiming to derive two aspects of a theory: a Structural aspect and a Numerical aspect respectively. The Structural aspect is concerned with the entities modeled and their interrelationship, also known as ontology. The Numerical aspect is concerned with the quantities involved in the relationships among the above-mentioned entities, along with uncertainties either postulated to exist in the world or inherent to the nature of the induction process. In this thesis we focus more on the structural aspect, hence the name: Structural Induction: towards Automatic Ontology Elicitation. In order to deal with the problem of Structural Induction we need to solve two main problems: (1) we have to say what we mean by Structure (What?); and (2) we have to say how to get it (How?). In this thesis we give one very definite answer to the first question (What?) and we also explore how to answer the second question (How?) in some particular cases. A comprehensive answer to the second question (How?) in the most general setup will involve dealing very carefully with the interplay between the Structural and Numerical aspects of Induction and will represent a full solution to the Induction problem. This is a vast enterprise and we will only be able to touch on some aspects of this issue. The main thesis presented in this work is that the fundamental structural elements based on which theories are constructed are Abstraction (grouping similar entities under one overarching category) and Super-Structuring (grouping into a bigger unit topologically close entities - in particular spatio-temporally close ones). This thesis is supported by showing that each member of the Turing-equivalent class of General Generative Grammars can be decomposed in terms of these operators and their duals (Reverse Abstraction and Reverse SuperStructuring, respectively). 
Thus, if we are to believe the Computationalistic Assumption (that the most general way to present a finite theory is by means of an entity expressed in a Turing-equivalent formalism), we have proved that our thesis is correct. We call this thesis the Abstraction + SuperStructuring thesis. The rest of the thesis is concerned with the issues opened up by the second question presented above (How?): given that we have established what we mean by Structure, how do we get it?
- Published
- 2018
- Full Text
- View/download PDF
50. Learning predictive models from massive, semantically disparate data
- Author
-
Vasant Honavar and Neeraj Koul
- Subjects
Data stream mining ,business.industry ,Computer science ,Decision tree ,Machine learning ,computer.software_genre ,Missing data ,Set (abstract data type) ,Naive Bayes classifier ,Software ,Disparate system ,Noise (video) ,Artificial intelligence ,business ,computer - Abstract
Machine learning approaches offer some of the most successful techniques for constructing predictive models from data. However, applying such techniques in practice requires overcoming several challenges: infeasibility of centralized access to the data because of the massive size of some of the data sets that often exceeds the size of memory available to the learner, distributed nature of data, access restrictions, data fragmentation, semantic disparities between the data sources, and data sources that evolve spatially or temporally (e.g., data streams and genomic data sources in which new data is being submitted continuously). Learning using statistical queries and semantic correspondences that present a unified view of disparate data sources to the learner offers a powerful general framework for addressing some of these challenges. Against this background, this thesis describes (1) approaches to deal with missing values in the statistical-query-based algorithms for building predictors (Naive Bayes and decision trees) and techniques to minimize the number of required queries in such a setting; (2) sufficient-statistics-based algorithms for constructing and updating sequence classifiers; (3) reduction of several aspects of learning from semantically disparate data sources (such as (a) how errors in mappings affect the accuracy of the learned model and (b) how to choose an optimal mapping from among a set of alternative expert-supplied or automatically generated mappings) to the well-studied problems of domain adaptation and learning in the presence of noise; and (4) software for learning predictive models from semantically disparate data.
- Published
- 2018
- Full Text
- View/download PDF