48 results on '"Hoffmann, Achim"'
Search Results
2. Query-Topic Focused Web Pages Summarization.
- Author
-
Qiang Yang, Webb, Geoff, Yoo, Seung Yeol, and Hoffmann, Achim
- Abstract
We present a novel Web page summarizer, ContextSummarizer, that groups the given Web pages into 'sense-clusters' respecting a user's topical interests. ContextSummarizer then constructs an extractive summary for each sense-cluster. A user's topical interest is described by the user, who selects and refines some of the word senses disambiguated within the content contexts of the given Web pages. Semantic similarity measures between the contents of Web pages/segments/sentences and the user-selected word senses are used to choose the most topically relevant sentences as the extractive summary for the user's topical interest. ContextSummarizer addresses the semantic-alignment problem between the content of a Web page, the user's topical interest, and the extractive summary of the Web page. Our case studies and experimental results showed that our query-topic focused extractive summaries return more topically relevant sentences than those produced by existing summarization systems. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
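The sentence-selection step described above can be approximated with a plain bag-of-words cosine similarity. This is only a minimal sketch: the actual ContextSummarizer works over disambiguated word senses, which are not modeled here, and the function names are illustrative.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Dot product over shared terms, normalized by vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(sentences, topic_terms, k=1):
    # Rank sentences by similarity to the user's topical interest, keep the top k.
    topic = Counter(topic_terms)
    scored = sorted(sentences,
                    key=lambda s: cosine(Counter(s.lower().split()), topic),
                    reverse=True)
    return scored[:k]
```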
3. Clustering-Based Relevance Feedback for Web Pages.
- Author
-
Qiang Yang, Webb, Geoff, Seung Yeol Yoo, and Hoffmann, Achim
- Abstract
Most traditional relevance feedback systems simply choose the top-ranked Web pages as the source of weights for candidate query expansion terms. However, the contents of such top-ranked Web pages are often composed of heterogeneous sub-topics, which can and should be recognized and distinguished; current approaches treat retrieved Web pages as one unit and often fail to extract good-quality candidate query expansion terms. In this paper, our basic idea is that Web pages properly clustered into a sub-topic cluster are a better source than the whole set of given Web pages for providing topically coherent relevance feedback on that specific sub-topic. Thus, we propose Clustering-Based Relevance Feedback for Web Pages, which utilizes three methods to cluster retrieved Web pages into several sub-topic clusters. These three methods cooperate to construct good-quality clusters by respectively supporting Web page segmentation, term selection, and k-seed centroid selection. The automatically selected terms serve as relevance feedback to construct all sub-topic clusters and assign the given Web pages to the proper clusters. Each subset of the selected terms that occurs in the Web pages assigned to a sub-topic cluster serves as relevance feedback to expand a query over that sub-topic cluster. Our experimental results showed that clustering performance based on two traditional term-weighting methods (one unsupervised and one supervised) can be significantly improved with our methods. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
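The idea of drawing expansion terms from a sub-topic cluster rather than from the whole result set can be sketched as below. Scoring terms by local-versus-global document frequency is an illustrative stand-in for the paper's term-selection method, not the method itself.

```python
from collections import Counter

def expansion_terms(clusters, top_n=2):
    # clusters: list of sub-topic clusters; each cluster is a list of
    # documents, each document a list of terms.
    global_df = Counter(t for cluster in clusters for doc in cluster for t in set(doc))
    result = []
    for cluster in clusters:
        local_df = Counter(t for doc in cluster for t in set(doc))
        # Prefer terms frequent in this cluster but rare elsewhere.
        scored = sorted(local_df, key=lambda t: local_df[t] / global_df[t], reverse=True)
        result.append(scored[:top_n])
    return result
```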
4. A Bare Bones Approach to Literature-Based Discovery: An Analysis of the Raynaud's/Fish-Oil and Migraine-Magnesium Discoveries in Semantic Space.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Cole, R. J., and Bruza, P. D.
- Abstract
Literature discovery can be characterized as a goal directed search for previously unknown implicit knowledge captured within a collection of scientific articles. Swanson's serendipitous discovery of a treatment for Raynaud's disease by dietary fish-oil while browsing Medline, an online collection of biomedical literature, exemplifies such a discovery. By means of a series of experiments, the impact of stop words, various weighting schemes, discovery mechanisms, and contextual reduction are studied in relation to replicating the Raynaud/fish-oil and migraine-magnesium discoveries by operational means. Two aspects of discovery were brought under focus: (i) the discovery of intermediate, or B -terms, and (ii) the discovery of indirect A - C connections via the B-terms. A semantic space representation of the underlying corpus is computed and discoveries automated by computing associations between words in both higher and contextually reduced spaces. It was found that the discovery of B-terms and A - C connections can be achieved to an encouraging degree with a standard stop word list. In addition, no single weighting scheme seems to suffice. Log-likelihood appears to be potentially effective for leading to the discovery of B-terms, whereas both odds ratio and simple co-occurrence frequencies both facilitate the discovery of A - C connections. With regard to discovery mechanism, both semantic similarity (via cosine) and information flow computation seem promising for computing A - C connections, but more research is needed to understand their relative strengths and weaknesses. Discovery in a contextually reduced semantic space revealed mixed results. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
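Swanson-style A-B-C discovery can be sketched in a few lines: B-terms are terms that co-occur with both the A-term and the C-term in a collection in which A and C themselves never co-occur directly. This ignores the weighting schemes and semantic-space projection studied in the paper.

```python
from collections import defaultdict

def b_terms(docs, a_term, c_term):
    # docs: list of sets of terms (one set per article title/abstract).
    cooc = defaultdict(set)
    for doc in docs:
        for t in doc:
            cooc[t] |= doc - {t}
    # A discovery requires that A and C never co-occur directly.
    if c_term in cooc[a_term]:
        return set()
    # B-terms bridge A and C.
    return cooc[a_term] & cooc[c_term]
```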
5. The Robot Scientist Project.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, King, Ross D., Young, Michael, Clare, Amanda J., Whelan, Kenneth E., and Rowland, Jem
- Abstract
We are interested in the automation of science for both philosophical and technological reasons. To this end we have built the first automated system that is capable of automatically originating hypotheses to explain data, devising experiments to test these hypotheses, physically running these experiments using a laboratory robot, interpreting the results, and then repeating the cycle. We call such automated systems "Robot Scientists". We applied our first Robot Scientist to predicting the function of genes in a well-understood part of the metabolism of the yeast S. cerevisiae. For background knowledge, we built a logical model of metabolism in Prolog. The experiments consisted of growing mutant yeast strains, with known genes knocked out, on specified growth media. The results of these experiments allowed the Robot Scientist to test hypotheses it had abductively inferred from the logical model. In empirical tests, the Robot Scientist's experiment selection methodology outperformed both randomly selecting experiments and a greedy strategy of always choosing the experiment of lowest cost; it was also as good as the best humans tested at the task. To extend this proof-of-principle result to the discovery of novel knowledge we require new hardware that is fully automated, a model of all of the known metabolism of yeast, and an efficient way of inferring probable hypotheses. We have made progress in all of these areas, and we are currently building a new Robot Scientist that we hope will be able to automatically discover new biological knowledge. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
6. Effective Classifier Pruning with Rule Information.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Zhang, Xiaolong, Luo, Mingjian, and Pi, Daoying
- Abstract
This paper presents an algorithm to prune a tree classifier with a set of rules converted from a C4.5 classifier, where rule information is used as the pruning criterion. Rule information measures the goodness of a rule when discriminating labeled instances. Empirical results demonstrate that the proposed pruning algorithm achieves high predictive accuracy. (This work is supported by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, and Project No. 2004D006 from the Hubei Provincial Department of Education, P. R. China.) [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
7. Detecting and Revising Misclassifications Using ILP.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Yokoyama, Masaki, Matsui, Tohgoroh, and Ohwada, Hayato
- Abstract
This paper proposes a method for detecting misclassifications of a classification rule and then revising them. Given a rule and a set of examples, the method divides misclassifications by the rule into miscovered examples and uncovered examples, and then, separately, learns to detect them using Inductive Logic Programming (ILP). The method then combines the acquired rules with the initial rule and revises the labels of misclassified examples. The paper shows the effectiveness of the proposed method by theoretical analysis. In addition, it presents experimental results, using the Brill tagger for Part-Of-Speech (POS) tagging. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
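The split of misclassifications into miscovered and uncovered examples, which the method then learns to detect separately with ILP, amounts to the following partition (the ILP learning step itself is not shown):

```python
def split_misclassifications(rule, examples):
    # rule: predicate returning True if the rule covers an example.
    # examples: (x, label) pairs with boolean labels.
    miscovered = [x for x, y in examples if rule(x) and not y]   # covered but negative
    uncovered  = [x for x, y in examples if not rule(x) and y]   # positive but not covered
    return miscovered, uncovered
```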
8. CLASSIC'CL: An Integrated ILP System.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Stolle, Christian, Karwath, Andreas, and Raedt, Luc
- Abstract
A novel inductive logic programming system, called Classic'cl, is presented. Classic'cl integrates several settings for learning, in particular learning from interpretations and learning from satisfiability. Within these settings, it addresses descriptive and probabilistic modeling tasks. As such, Classic'cl integrates several well-known inductive logic programming systems: Claudien, Warmr (and its extension C-armr), ICL, ICL-SAT, and LLPAD. We report on the implementation and the integration issues, as well as on some experiments that compare Classic'cl with some of its predecessors. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
9. Finding Significant Web Pages with Lower Ranks by Pseudo-Clique Search.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Okubo, Yoshiaki, Haraguchi, Makoto, and Shi, Bin
- Abstract
In this paper, we discuss a method of finding useful clusters of web pages that are significant in the sense that their contents are similar or closely related to those of higher-ranked pages. Since lower-ranked pages usually receive little attention, they are unconditionally discarded even if their contents are similar to those of pages with high ranks. We try to extract such hidden pages, together with significant higher-ranked pages, as a cluster. In order to obtain such clusters, we first extract semantic correlations among terms by applying Singular Value Decomposition (SVD) to the term-document matrix generated from a corpus on a specific topic. Based on these correlations, we evaluate potential similarities among the web pages from which we try to obtain clusters. The set of web pages is represented as a weighted graph G based on the similarities and the page ranks. Our clusters are found as pseudo-cliques in G. We present an algorithm for finding the Top-N weighted pseudo-cliques. Our experimental results show that quite valuable clusters can indeed be extracted with our method. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
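The first step, deriving page similarities from term correlations via SVD, can be sketched with NumPy. The rank k and the cosine normalization are illustrative choices, and the pseudo-clique search over the resulting weighted graph is omitted.

```python
import numpy as np

def latent_similarity(td_matrix, k=2):
    # td_matrix: terms x documents count matrix.
    U, s, Vt = np.linalg.svd(td_matrix, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k]).T        # documents in the k-dim latent space
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    docs = docs / np.where(norms == 0, 1, norms)
    return docs @ docs.T                      # pairwise cosine similarities
```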
10. Unit Volume Based Distributed Clustering Using Probabilistic Mixture Model.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Lee, Keunjoon, Joo, Jinu, Yang, Jihoon, and Park, Sungyong
- Abstract
Extracting useful knowledge from numerous distributed data repositories can be a very hard task when such data cannot be directly centralized or unified as a single file or database. This paper suggests practical distributed clustering algorithms that do not access the raw data, to overcome the inefficiency of centralized data clustering methods. The aim of this research is to generate a unit-volume-based probabilistic mixture model from local clustering results without moving the original data. It has been shown that our method is appropriate for distributed clustering when real data cannot be accessed or centralized. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
11. A Data Analysis Approach for Evaluating the Behavior of Interestingness Measures.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Huynh, Xuan-Hiep, Guillet, Fabrice, and Briand, Henri
- Abstract
In recent years, the problem of finding the different aspects existing in a dataset has attracted many authors in the domain of knowledge quality in KDD. The discovery of knowledge in the form of association rules has become an important research area. One of the most difficult issues is that an enormous number of association rules are discovered, so it is not easy to choose the best association rules or knowledge for a given dataset. Some methods have been proposed for choosing the best rules with an interestingness measure, or for matching properties of interestingness measures for a given set of measures. In this paper, we propose a new approach to discover the clusters of interestingness measures existing in a dataset. Our approach is based on the evaluation of the distance computed between interestingness measures. We use two techniques, agglomerative hierarchical clustering (AHC) and partitioning around medoids (PAM), to help the user graphically evaluate the behavior of interestingness measures. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
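The core quantity, a distance between two interestingness measures evaluated over the same set of rules, can be sketched as a mean absolute difference after normalization. The min-max normalization is an illustrative assumption, and the AHC/PAM clustering over the resulting distance matrix is not shown.

```python
def measure_distance(values_m1, values_m2):
    # values_m1, values_m2: the two measures evaluated on the same rule set.
    def norm(v):
        # Min-max normalize so that measures on different scales are comparable.
        lo, hi = min(v), max(v)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in v]
    a, b = norm(values_m1), norm(values_m2)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```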
12. Automatic Extraction of Proteins and Their Interactions from Biological Text.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Hong, Kiho, Park, Junhyung, Yang, Jihoon, and Paek, Eunok
- Abstract
Text mining techniques have been proposed for extracting protein names and their interactions from biological text. First, we have made improvements on existing methods for handling single-word protein names consisting of characters, special symbols, and numbers. Second, compound-word protein names are also extracted, using conditional probabilities of the occurrences of neighboring words. Third, interactions are extracted based on Bayes' theorem over discriminating verbs that represent the interactions of proteins. Experimental results demonstrate the feasibility of our approach, with improved performance in terms of accuracy and F-measure while requiring significantly less computational time. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
13. Learning Ontology-Aware Classifiers.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Zhang, Jun, Caragea, Doina, and Honavar, Vasant
- Abstract
Many practical applications of machine learning in data-driven scientific discovery commonly call for the exploration of data from multiple points of view that correspond to explicitly specified ontologies. This paper formalizes a class of problems of learning from ontology and data, and explores the design space of learning classifiers from attribute value taxonomies (AVTs) and data. We introduce the notion of AVT-extended data sources and partially specified data. We propose a general framework for learning classifiers from such data sources. Two instantiations of this framework, AVT-based Decision Tree classifier and AVT-based Naïve Bayes classifier are presented. Experimental results show that the resulting algorithms are able to learn robust high accuracy classifiers with substantially more compact representations than those obtained by standard learners. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
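The key ingredient of AVT-based learners, propagating instance counts from leaf values up an attribute value taxonomy so that a classifier can be learned at any level of abstraction, can be sketched as follows (the taxonomy encoding as a child-to-parent map is an illustrative choice):

```python
def aggregate_counts(counts, parent):
    # counts: leaf value -> instance count; parent: child -> parent in the taxonomy.
    total = dict(counts)
    for leaf, c in counts.items():
        node = parent.get(leaf)
        while node is not None:
            # Each ancestor accumulates the counts of all its descendants.
            total[node] = total.get(node, 0) + c
            node = parent.get(node)
    return total
```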
14. Active Constrained Clustering by Examining Spectral Eigenvectors.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Xu, Qianjun, desJardins, Marie, and Wagstaff, Kiri L.
- Abstract
This work focuses on the active selection of pairwise constraints for spectral clustering. We develop and analyze a technique for Active Constrained Clustering by Examining Spectral eigenvectorS (ACCESS) derived from a similarity matrix. The ACCESS method uses an analysis based on the theoretical properties of spectral decomposition to identify data items that are likely to be located on the boundaries of clusters, and for which providing constraints can resolve ambiguity in the cluster descriptions. Empirical results on three synthetic and five real data sets show that ACCESS significantly outperforms constrained spectral clustering using randomly selected constraints. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
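The spectral part of ACCESS can be sketched with NumPy: items whose entries in the second eigenvector of the normalized graph Laplacian lie near zero sit between the spectral clusters, making them good targets for constraint queries. Using exactly this eigenvector, and raw similarities, are simplifying assumptions relative to the paper's analysis.

```python
import numpy as np

def boundary_candidates(S, m=2):
    # S: symmetric similarity matrix over the data items.
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt   # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
    v = vecs[:, 1]                                     # cluster-separating eigenvector
    # Entries near zero correspond to items on the boundary between clusters.
    return np.argsort(np.abs(v))[:m]
```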
15. Massive Biomedical Term Discovery.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Wermter, Joachim, and Hahn, Udo
- Abstract
Most technical and scientific terms are composed of complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. The distinction of specific terminology from common non-specific noun phrases can be based on the observation that terms reveal a much lesser degree of distributional variation than non-specific noun phrases. We formalize the limited paradigmatic modifiability of terms and, subsequently, test the corresponding algorithm on bigram, trigram and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-wide curated biomedical terminology as an evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
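Limited paradigmatic modifiability can be operationalized roughly as follows: for each token slot of an n-gram, count how many distinct fillers occur in the corpus with the remaining slots held fixed; genuine terms admit few substitutions. This simplified score is an assumption for illustration, not the paper's exact measure.

```python
def modifiability_score(ngram, corpus_ngrams):
    # Higher score = fewer paradigmatic alternatives = more term-like.
    score = 1.0
    for i in range(len(ngram)):
        fillers = {g[i] for g in corpus_ngrams
                   if len(g) == len(ngram)
                   and all(g[j] == ngram[j] for j in range(len(ngram)) if j != i)}
        score *= 1.0 / len(fillers)    # ngram itself is in the corpus, so len >= 1
    return score
```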
16. Exploring Predicate-Argument Relations for Named Entity Recognition in the Molecular Biology Domain.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Wattarujeekrit, Tuangthong, and Collier, Nigel
- Abstract
In this paper, the semantic relationships between a predicate and its arguments, in terms of semantic roles, are employed to improve lexical-based named entity recognition (NER) in the molecular biology domain. The semantic roles were realized in various sets of syntactic features used by a machine learning model, to explore the most efficient way of allowing this knowledge to benefit NER. The empirical results show that the best feature set consists of the predicate's surface form, the predicate's lemma, voice, and the combined feature of the subject/object head's lemma and transitive-intransitive sense. The performance improvement from these features indicates the advantage of predicate-argument semantic knowledge for NER. There is still room to enhance NER with this semantic knowledge (e.g., employing other semantic roles besides agent and theme, and extending the rules for efficient identification of an argument's boundary). [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
17. SCALETRACK: A System to Discover Dynamic Law Equations Containing Hidden States and Chaos.
- Author
-
Hoffmann, Achim, Scheffer, Tobias, Washio, Takashi, Adachi, Fuminori, and Motoda, Hiroshi
- Abstract
This paper proposes a novel system to discover simultaneous time-differential law equations reflecting the first principles underlying objective processes. The system has the power to discover equations containing hidden state variables and/or representing chaotic dynamics, without using any detailed domain knowledge. These tasks have not previously been addressed in mathematical or engineering domains, in spite of their essential importance. The system's promising performance is demonstrated through applications to both mathematical and engineering examples. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
18. Pattern Classification via Single Spheres.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Wang, Jigang, Neskovic, Predrag, and Cooper, Leon N.
- Abstract
Previous sphere-based classification algorithms usually need a number of spheres in order to achieve good classification performance. In this paper, inspired by the support vector machines for classification and the support vector data description method, we present a new method for constructing single spheres that separate data with the maximum separation ratio. In contrast to previous methods that construct spheres in the input space, the new method constructs separating spheres in the feature space induced by the kernel. As a consequence, the new method is able to construct a single sphere in the feature space to separate patterns that would otherwise be inseparable when using a sphere in the input space. In addition, by adjusting the ratio of the radius of the sphere to the separation margin, it can provide a series of solutions ranging from spherical to linear decision boundaries, effectively encompassing both the support vector machines for classification and the support vector data description method. Experimental results show that the new method performs well on both artificial and real-world datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
19. An Algorithm for Mining Implicit Itemset Pairs Based on Differences of Correlations.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Taniguchi, Tsuyoshi, and Haraguchi, Makoto
- Abstract
Given a transaction database as a global set of transactions and a local database obtained by some conditioning on the global one, we consider pairs of itemsets whose degrees of correlation are higher in the local database than in the global one. The problem of finding paired itemsets with high correlation in one database is known as Discovery of Correlation, and algorithms to search for such characteristic paired itemsets have already been proposed. However, even non-characteristic paired itemsets in the local database are meaningful, provided the degree of correlation is much higher in the local database than in the global one. They can be implicit, hidden evidence that something particular to the local database occurs, even though they are not yet realized as characteristic itemsets there. From this viewpoint, we have already proposed measuring the significance of paired itemsets by the difference of the two correlations before and after the conditioning to the local database, and defined the notion of DC pairs, whose differences of correlation are high. As DC pairs are compound itemsets consisting of two component itemsets, there are two basic strategies for finding them. One strategy first examines the compound itemsets and then the components, while the other examines the component itemsets and then the compound ones. Under the former strategy, which we have already proposed and tested for its effectiveness, we have to enumerate a large number of candidate compound itemsets that cannot be decomposed into components. For this reason, this paper presents a new algorithm following the second strategy. It first enumerates possible component itemsets, based on a new pruning rule for cutting off useless components. Second, it forms the compound itemsets by combining the components thus detected, while also making use of a constraint that prevents the algorithm from checking meaningless combinations.
[ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
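The significance measure behind DC pairs, the increase in correlation after conditioning to the local database, can be sketched with lift as the correlation measure. The paper's exact correlation measure may differ, so treat this as illustrative.

```python
def lift(transactions, x, y):
    # Correlation of itemsets x and y (given as sets) in a list of transaction sets.
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)
    n_y = sum(1 for t in transactions if y <= t)
    n_xy = sum(1 for t in transactions if x <= t and y <= t)
    return (n_xy * n) / (n_x * n_y) if n_x and n_y else 0.0

def dc_score(global_db, local_db, x, y):
    # Difference of correlations before and after conditioning to the local database.
    return lift(local_db, x, y) - lift(global_db, x, y)
```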
20. Learning On-Line Classification via Decorrelated LMS Algorithm: Application to Brain-Computer Interfaces.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Sun, Shiliang, and Zhang, Changshui
- Abstract
The classification of time-varying neurophysiological signals, e.g., electroencephalogram (EEG) signals, advances the requirement of adaptability for classifiers. In this paper we address the challenge of neurophysiological signal classification arising from brain-computer interface (BCI) applications and propose an on-line classifier designed via the decorrelated least mean square (LMS) algorithm. Based on a Bayesian classifier with Gaussian mixture models, we derive the general formulation of gradient descent algorithms under the criterion of LMS. Further, to accelerate convergence, the decorrelated gradient instead of the instantaneous gradient is adopted for updating the parameters of the classifier adaptively. Utilizing the presented classifier for the off-line analysis of practical classification tasks in brain-computer interface applications shows its effectiveness and robustness compared to the stochastic gradient descent classifier which uses the instantaneous gradient directly. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
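The update rule can be sketched as follows: a plain LMS step follows the instantaneous gradient err·x, while the decorrelated variant first removes the component of the current input along the previous input. The paper's Bayesian classifier with Gaussian mixtures is replaced here by a simple linear model for illustration.

```python
import numpy as np

def decorrelated_lms(X, d, mu=0.1):
    # X: input vectors (n_samples x n_features); d: desired outputs.
    w = np.zeros(X.shape[1])
    x_prev = np.zeros(X.shape[1])
    for x, target in zip(X, d):
        err = target - w @ x
        denom = x_prev @ x_prev
        a = (x @ x_prev) / denom if denom > 1e-12 else 0.0
        grad = x - a * x_prev          # decorrelated gradient direction
        w += mu * err * grad
        x_prev = x
    return w
```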
21. Monotone Classification by Function Decomposition.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Popova, Viara, and Bioch, Jan C.
- Abstract
The paper focuses on the problem of classification by function decomposition within the frame of monotone classification. We propose a decomposition method for discrete functions which can be applied to monotone problems in order to generate a monotone classifier based on the extracted concept hierarchy. We formulate and prove a criterion for the existence of a positive extension of the scheme f=g(S0,h(S1)) in the context of discrete functions. We also propose a method for finding an assignment for the intermediate concept with a minimal number of values. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
22. The q-Gram Distance for Ordered Unlabeled Trees.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Ohkura, Nobuhito, Hirata, Kouichi, Kuboyama, Tetsuji, and Harao, Masateru
- Abstract
In this paper, we investigate the q-gram distance for ordered unlabeled trees (trees, for short). First, we formulate a q-gram simply as a tree with q nodes isomorphic to a line graph, and the q-gram distance between two trees analogously to that between two strings. Then, by using the depth sequence based on postorder, we design the algorithm EnumGram, which enumerates all q-grams in a tree T with n nodes in O(n^2) time and O(q) space. Furthermore, we improve it to the algorithm LinearEnumGram, which runs in O(qn) time and O(qd) space, where d is the depth of T. Hence, we can evaluate the q-gram distance Dq(T1,T2) between T1 and T2 in O(q(n1+n2)) time and O(q(d1+d2)) space, where ni and di are the number of nodes in Ti and the depth of Ti, respectively. Finally, we show the following relationship between the q-gram distance Dq(T1,T2) and the edit distance E(T1,T2): Dq(T1,T2) ≤ (gl+1)E(T1,T2), where g = max{g1, g2}, l = max{l1, l2}, gi is the degree of Ti, and li is the number of leaves in Ti. In particular, for the top-down tree edit distance F(T1,T2), this result implies a corresponding bound. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
23. Measuring Over-Generalization in the Minimal Multiple Generalizations of Biosequences.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Ng, Yen Kaow, Ono, Hirotaka, and Shinohara, Takeshi
- Abstract
We consider the problem of finding a set of patterns that best characterizes a set of strings. To this end, Arimura et al. [3] considered the use of minimal multiple generalizations (mmg) for such characterizations. Given any sample set, the mmgs are, roughly speaking, the most (syntactically) specific set of languages containing the sample within a given class of languages. Takae et al. [17] found the mmgs of the class of pattern languages [1], which includes so-called sort symbols, to be fairly accurate predictors for signal peptides. We first reproduce their results using updated data. Then, by using a measure for estimating the level of over-generalization made by the mmgs, we show results that explain the high accuracies resulting from the use of sort symbols, and discuss how better results can be obtained. The measure we suggest here can also be applied to other types of patterns, e.g. the PROSITE patterns [4]. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
24. Support Vector Inductive Logic Programming.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Muggleton, Stephen, Lodhi, Huma, Amini, Ata, and Sternberg, Michael J. E.
- Abstract
In this paper we explore a topic which is at the intersection of two areas of Machine Learning: namely Support Vector Machines (SVMs) and Inductive Logic Programming (ILP). We propose a general method for constructing kernels for Support Vector Inductive Logic Programming (SVILP). The kernel not only captures the semantic and syntactic relational information contained in the data but also provides the flexibility of using arbitrary forms of structured and non-structured data coded in a relational way. While specialised kernels have been developed for strings, trees and graphs our approach uses declarative background knowledge to provide the learning bias. The use of explicitly encoded background knowledge distinguishes SVILP from existing relational kernels which in ILP-terms work purely at the atomic generalisation level. The SVILP approach is a form of generalisation relative to background knowledge, though the final combining function for the ILP-learned clauses is an SVM rather than a logical conjunction. We evaluate SVILP empirically against related approaches, including an industry-standard toxin predictor called TOPKAT. Evaluation is conducted on a new broad-ranging toxicity dataset (DSSTox). The experimental results demonstrate that our approach significantly outperforms all other approaches in the study. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
25. Movement Analysis of Medaka (Oryzias Latipes) for an Insecticide Using Decision Tree.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Lee, Sengtai, Kim, Jeehoon, Baek, Jae-Yeon, Han, Man-Wi, Ji, Chang Woo, and Chon, Tae-Soo
- Abstract
Behavioral sequences of the medaka (Oryzias latipes) were continuously investigated through an automatic image recognition system in response to medaka treated with the insecticide and medaka not treated with the insecticide, diazinon (0.1 mg/l) during a 1 hour period. The observation of behavior through the movement tracking program showed many patterns of the medaka. After much observation, behavioral patterns were divided into four basic patterns: active-smooth, active-shaking, inactive-smooth, and inactive-shaking. The "smooth" and "shaking" patterns were shown as normal movement behavior. However, the "shaking" pattern was more frequently observed than the "smooth" pattern in medaka specimens that were treated with insecticide. Each pattern was classified using a devised decision tree after the feature choice. It provides a natural way to incorporate prior knowledge from human experts in fish behavior and contains the information in a logical expression tree. The main focus of this study was to determine whether the decision tree could be useful for interpreting and classifying behavior patterns of the medaka. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
26. An Experiment with Association Rules and Classification: Post-Bagging and Conviction.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Jorge, Alípio M., and Azevedo, Paulo J.
- Abstract
In this paper we study a new technique we call post-bagging, which consists in resampling parts of a classification model rather than the data. We do this with a particular kind of model: large sets of classification association rules, in combination with ordinary best-rule and weighted voting approaches. We empirically evaluate the effects of the technique in terms of classification accuracy. We also discuss the predictive power of different metrics used for association rule mining, such as confidence, lift, conviction and χ2. We conclude that, under the described experimental conditions, post-bagging improves classification results and that the best metric is conviction. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
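Conviction, the metric the experiments favor, can be computed from rule counts as conviction(A → C) = (1 − supp(C)) / (1 − conf(A → C)); taking raw counts plus the database size as the interface is an illustrative choice.

```python
def conviction(sup_antecedent, sup_consequent, sup_both, n):
    # sup_*: transaction counts; n: total number of transactions.
    conf = sup_both / sup_antecedent            # confidence of A -> C
    if conf == 1.0:
        return float("inf")                     # rule never fails
    return (1.0 - sup_consequent / n) / (1.0 - conf)
```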
27. Mining Frequent δ-Free Patterns in Large Databases.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Hébert, Céline, and Crémilleux, Bruno
- Abstract
Mining patterns under constraints in large data (also called fat data) is an important task for benefiting from the multiple uses of the patterns embedded in these data sets. It is a difficult task due to the exponential growth of the search space with the number of attributes. In such contexts, closed patterns can be extracted by using the properties of the Galois connection. But, to the best of our knowledge, there is no approach to extract interesting patterns like δ-free patterns, which are at the core of a lot of relevant rules. In this paper, we propose a new method, based on an efficient way to compute the extension of a pattern and a pruning criterion, to mine frequent δ-free patterns in large databases. We give an algorithm (FTminer) for the practical use of this method. We show the efficiency of this approach by means of experiments on benchmarks and on gene expression data. Keywords: Large databases, δ-free patterns, extensions, rules, condensed representations. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
28. Cross-Language Mining for Acronyms and Their Completions from the Web.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Hahn, Udo, Daumke, Philipp, Schulz, Stefan, and Markó, Kornél
- Abstract
We propose a method that aligns biomedical acronyms and their long-form definitions across different languages. We use a freely available search and extraction tool by which abbreviations, together with their fully expanded forms, are massively mined from the Web. In a subsequent step, language-specific variants, synonyms, and translations of the extracted acronym definitions are normalized by referring to a language-independent, shared semantic interlingua. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
29. Assisting Scientific Discovery with an Adaptive Problem Solver.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Dartnell, Christopher, and Sallantin, Jean
- Abstract
This paper is an attempt to design an interaction protocol for a multi-agent learning platform to assist a human community in its task of scientific discovery. Designing tools to assist Scientific Discovery poses a challenging problem, since the problems studied by scientists are not yet solved, and valid models are not yet available. It is therefore impossible to create a problem solver to simulate a given phenomenon and explain or predict facts. We propose to assist scientists with learning machines considered as adaptive problem solvers, to interactively build a consistent model suited for reasoning, simulating, predicting, and explaining facts. The interaction protocol presented in this paper is based on Angluin's "Learning from Different Teachers" [1], and we extend the original protocol to make it operational for assisting scientists in solving open problems. The main problem we deal with is that this learning model presupposes the existence of teachers who have previously solved the problem. These teachers are able to answer the learner's queries, whereas this is not the case in the context of Scientific Discovery, in which it is only possible to refute a model by finding experimental processes revealing contradictions. Our first contribution is to directly use Angluin's interaction protocol to let a machine learn a program that approximates the theory of a scientist, and to help the scientist improve this theory. Our second contribution is to attenuate Angluin's protocol to take into account a social cognition level during which multiple scientists interact with each other by means of publications and refutations of rival theories. The program learned by the machine can be included in a publication to avoid false refutations coming from a wrong interpretation of the theory. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
30. Bias Management of Bayesian Network Classifiers.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Castillo, Gladys, and Gama, João
- Abstract
The purpose of this paper is to describe an adaptive algorithm for improving the performance of Bayesian Network Classifiers (BNCs) in an on-line learning framework. Instead of choosing a priori a particular model class of BNCs, our adaptive algorithm scales up the model's complexity by gradually increasing the number of allowable dependencies among features. Starting with the simple Naïve Bayes structure, it uses simple decision rules based on qualitative information about the performance dynamics to decide when it makes sense to make the next move in the spectrum of feature dependencies and to start searching for a more complex classifier. Results of experiments conducted using the class of Dependence Bayesian Classifiers on three large datasets show that our algorithm is able to select a model with the appropriate complexity for the current amount of training data, thus balancing the computational cost of updating a model with the benefits of increased accuracy. Keywords: Bias Management, Bayesian Classifiers, Machine Learning. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
31. Named Entity Recognition for the Indonesian Language: Combining Contextual, Morphological and Part-of-Speech Features into a Knowledge Engineering Approach.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Budi, Indra, Bressan, Stéphane, Wahyudi, Gatot, Hasibuan, Zainal A., and Nazief, Bobby A.A.
- Abstract
We present a novel named entity recognition approach for the Indonesian language. We call the new method InNER, for Indonesian Named Entity Recognition. InNER is based on a set of rules capturing the contextual, morphological, and part-of-speech knowledge necessary in the process of recognizing named entities in Indonesian texts. The InNER strategy is one of knowledge engineering: the domain- and language-specific rules are designed by expert knowledge engineers. After showing in our previous work that mined association rules can effectively recognize named entities and outperform maximum entropy methods, we needed to evaluate the potential for improvement of the rule-based approach when expert-crafted knowledge is used. The results are conclusive: the InNER method yields recall and precision of up to 63.43% and 71.84%, respectively. Thus, it significantly outperforms not only maximum entropy methods but also the association-rule-based method we had previously designed. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
32. Practical Algorithms for Pattern Based Linear Regression.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Bannai, Hideo, Hatano, Kohei, Inenaga, Shunsuke, and Takeda, Masayuki
- Abstract
We consider the problem of discovering the optimal pattern from a set of strings and associated numeric attribute values. The goodness of a pattern is measured by the correlation between the number of occurrences of the pattern in each string and the numeric attribute value assigned to the string. We present two algorithms based on suffix trees that can find the optimal substring pattern in O(Nn) and O(N²) time, respectively, where n is the number of strings and N is their total length. We further present a general branch-and-bound strategy that can be used when considering more complex pattern classes. We also show that combining the O(N²) algorithm and the branch-and-bound heuristic increases the efficiency of the algorithm considerably. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
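The objective described in this abstract, the correlation between a pattern's occurrence counts and the numeric attribute values, can be illustrated with a naive brute-force search over all substrings. The suffix-tree machinery that makes the search O(Nn) is not reproduced here, and the data are invented:

```python
import math

def count_occurrences(pattern, s):
    """Count (possibly overlapping) occurrences of pattern in s."""
    count = start = 0
    while True:
        i = s.find(pattern, start)
        if i == -1:
            return count
        count += 1
        start = i + 1

def pearson(xs, ys):
    """Pearson correlation; 0.0 when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0

def best_substring_pattern(strings, values):
    """Brute force: score every substring by |correlation| with the values."""
    candidates = {s[i:j] for s in strings
                  for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return max(candidates,
               key=lambda p: abs(pearson(
                   [count_occurrences(p, s) for s in strings], values)))

strings = ["abcab", "abab", "ccc", "bca"]
values = [2.0, 2.1, 0.1, 1.0]
print(best_substring_pattern(strings, values))
```

The brute-force version enumerates O(N²) candidate substrings and scores each in O(N) time; the paper's contribution is doing the same optimization over suffix-tree nodes instead.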
33. The Arrowsmith Project: 2005 Status Report.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, and Smalheiser, Neil R.
- Abstract
In the 1980s, Don Swanson proposed the concept of "undiscovered public knowledge," and published several examples in which two disparate literatures (i.e., sets of articles having no papers in common, no authors in common, and few cross-citations) nevertheless held complementary pieces of knowledge that, when brought together, made compelling and testable predictions about potential therapies for human disorders. In the 1990s, Don and I published more predictions together and created a computer-assisted search strategy ("Arrowsmith"). At first, the so-called one-node search was emphasized, in which one begins with a single literature (e.g., that dealing with a disease) and searches for a second unknown literature having complementary knowledge (e.g. that dealing with potential therapies). However, we soon realized that the two-node search is better aligned to the information practices of most biomedical investigators: in this case, the user chooses two literatures and then seeks to identify meaningful links between them. Could typical biomedical investigators learn to carry out Arrowsmith analyses? Would they find routine occasions for using such a sophisticated tool? Would they uncover significant links that affect their experiments? Four years ago, we initiated a project to answer these questions, working with several neuroscience field testers. Initially we expected that investigators would spend several days learning how to carry out searches, and would spend several days analyzing each search. Instead, we completely re-designed the user interface, the back-end databases, and the methods of processing linking terms, so that investigators could use Arrowsmith without any tutorial at all, and requiring only minutes to carry out a search. The Arrowsmith Project now hosts a suite of free, public tools. 
It has launched new research spanning medical informatics, genomics and social informatics, and has, indeed, assisted investigators in formulating new experiments, with direct impact on basic science and neurological diseases. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
34. Invention and Artificial Intelligence.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, and Bradshaw, Gary
- Abstract
Invention, like scientific discovery, sometimes occurs through a heuristic search process in which an inventor seeks a successful invention by searching through a space of inventions. For complex inventions, such as the airplane or model rockets, the process of invention can be expedited by an appropriate strategy of invention. Two case studies will be used to illustrate these general principles: the invention of the airplane (1799-1909) and the invention of a model rocket by a group of high school students in rural West Virginia in the late 1950s. Especially during the invention of the airplane, inventors were forced to make scientific discoveries to complete the invention. We then consider the enterprise of artificial intelligence and argue that general principles of invention may be applied to expedite the development of AI systems. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
35. Unsupervised Bilingual Word Sense Disambiguation Using Web Statistics.
- Author
-
Zhang, Shichao, Jarvis, Ray, Wang, Yuanyong, and Hoffmann, Achim
- Abstract
Word sense disambiguation has sense division and sense selection as its two sub-problems. An appropriate solution to the sense division problem is usually dependent on the application being pursued. In the context of machine translation, picking the correct translation for a word among multiple candidates is known as target word selection. The work in this paper uses the Web as the main knowledge source to address the difficulty of making a target word selection based on statistics, which are normally drawn from rather limited corpora. The proposed approach uses simple and easily accessible web statistics, namely search engine hits (the number of documents returned for a particular query), to demonstrate the great potential of the Web as a knowledge source for word sense disambiguation. Our experimental results so far are very encouraging. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
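The target word selection idea can be sketched as follows: score each candidate translation by how strongly it co-occurs with the context words of the source sentence, measured in search engine hits. The hit counts below are invented stand-ins for live queries, and `get_hits` is a hypothetical stub for a search engine API call:

```python
# Hypothetical document counts for quoted co-occurrence queries,
# e.g. hits for the query "bank" "river". Invented for illustration.
HITS = {
    ("bank", "river"): 1_200_000,
    ("bank", "money"): 9_800_000,
    ("shore", "river"): 2_500_000,
    ("shore", "money"): 90_000,
}

def get_hits(candidate, context_word):
    """Stub standing in for a live search engine hit-count query."""
    return HITS.get((candidate, context_word), 0)

def select_translation(candidates, context_words):
    """Pick the candidate whose co-occurrence with the context is strongest."""
    def score(c):
        return sum(get_hits(c, w) for w in context_words)
    return max(candidates, key=score)

# Choosing between two possible translations of an ambiguous source word,
# given context words drawn from the source sentence:
print(select_translation(["bank", "shore"], ["river", "money"]))
```

A real system would normalize raw hits (e.g. by each candidate's standalone frequency) so that globally common words do not dominate; this sketch keeps only the core selection step.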
36. Using Messaging Structure to Evolve Agents Roles in Electronic Markets.
- Author
-
Barley, Michael Wayne, Kasabov, Nik, Beydoun, Ghassan, Debenham, John, and Hoffmann, Achim
- Abstract
Exogenous dynamics play a central role in the survival and evolution of institutions. In this paper, we develop an approach to automate part of this evolution process for electronic market places which bring together many online buyers and suppliers. In particular, for a given market place, we focus on other market places doing similar business as a form of exogenous evolutionary factor. Automatically tracking and analyzing how other market places do their business has a number of difficulties; for example, different electronic markets with a similar purpose might use different names for similar agent roles and tasks. In this paper, we argue that low-level analysis of the sequences of messages exchanged between agents within e-markets is an effective mechanism for integrating similar role specifications, independent of what names these roles, or even the messages themselves, may take. We focus on the structure of messages (message schemas), on sequences of message schemas, and on sets of such sequences to compare and integrate roles. Using statistical analysis over such structures, we bypass the difficult problem of identifying the semantics of roles and exchanged messages through their human-readable names (syntactic forms). To allow such low-level analysis, different e-market specifications are expressed using the same language. Our language of choice is a recently developed multi-agent system specification language, Islander 2.0. We illustrate our approach with example specifications and institution simulation traces. [ABSTRACT FROM AUTHOR]
- Published
- 2005
37. Knowledge-Level Management of Web Information.
- Author
-
Yanchun Zhang, Tanaka, Katsumi, Jeffrey Xu Yu, Shan Wang, Minglu Li, Seung Yeol Yoo, and Hoffmann, Achim
- Abstract
We present a knowledge-rich software agent, ContextExplicator, which mediates between the Web and the user's information or knowledge needs. It provides a method for incremental knowledge-level management (i.e., knowledge discovery, acquisition and representation) for heterogeneous information on the Web. In ContextExplicator, incremental knowledge management works through iterative negotiations with the human user: (1) Automatic Word-Sense Disambiguation and Induction (in this paper, "term(s)" and "word(s)" are used interchangeably): general knowledge (e.g., from a lexicon) and previously discovered knowledge support the sense-disambiguation and sense-induction of a word in the given documents, resulting in an improved and refined organization of previously discovered knowledge. (2) Interactive Specialization of Query Criteria: at a given moment, the user can reduce certain semantic ambiguities of previously discovered knowledge by selecting one of the context-words suggested by ContextExplicator to discriminate between sets of retrieved documents. The selected context-word is also used to direct the discovery of new knowledge in the given documents. (3) Visualization of the Discovered Knowledge: the discovered knowledge is represented in a conceptual lattice. Each lattice-node represents a single word-sense or a conjunction of senses of multiple words. To each node the respectively identified documents are associated. Each web-document is multi-classified into relevant word-sense clusters (lattice nodes), according to the occurrences of specific word-senses in the respective web-document. As a conceptual lattice allows the user to navigate the word-sense clusters and the classified web-documents with multi-level abstractions (i.e., super-/sub-lattice nodes), it provides a flexible scheme for managing knowledge and web-documents in a scalable way. [ABSTRACT FROM AUTHOR]
- Published
- 2005
38. Incremental Knowledge Acquisition for Building Sophisticated Information Extraction Systems with KAFTIE.
- Author
-
Karagiannis, Dimitris, Reimer, Ulrich, Pham, Son Bao, and Hoffmann, Achim
- Abstract
The aim of our work is to develop a flexible and powerful Knowledge Acquisition framework that allows users to rapidly develop Natural Language Processing systems, including information extraction systems. In this paper we present our knowledge acquisition framework, KAFTIE, which strongly supports the rapid development of complex knowledge bases for information extraction. We specifically target scientific papers, which involve rather complex sentence structures from which different types of information are automatically extracted. The tasks on which we experimented with our framework were to identify concepts/terms whose positive or negative aspects are mentioned in scientific papers. These tasks are challenging as they require the analysis of the relationship between the concept/term and its sentiment expression. Furthermore, the context of the expression needs to be inspected. The results so far are very promising, as we managed to build systems with relative ease that achieve F-measures of around 84% on a corpus of scientific papers in the area of artificial intelligence. Keywords: Incremental Knowledge Acquisition, Knowledge-based systems, Natural language processing. [ABSTRACT FROM AUTHOR]
- Published
- 2004
39. Text Mining for Clinical Chinese Herbal Medical Knowledge Discovery.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Zhou, Xuezhong, Liu, Baoyan, and Wu, Zhaohui
- Abstract
Chinese herbal medicine has been an effective therapy for healthcare and disease treatment. A large amount of TCM literature data has been curated over the last ten years, most of which concerns TCM clinical research with herbal medicine. This paper develops a text mining system named MeDisco/3T to extract clinical Chinese medical formula data from the literature and to discover combination knowledge of herbal medicine by frequent itemset analysis. Over 18,000 clinical Chinese medical formulae are acquired; furthermore, significant frequent herbal medicine pairs and the family combination rules of herbal medicine have been studied in a preliminary fashion. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
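The frequent-pair analysis mentioned in this abstract can be illustrated with a minimal support-counting sketch: treat each formula as a transaction of herbs and keep the herb pairs whose support clears a threshold. The formulae below are invented examples, not data from the MeDisco/3T corpus:

```python
from collections import Counter
from itertools import combinations

# Each transaction is the set of herbs in one formula (invented examples).
formulae = [
    {"ginseng", "licorice", "ginger"},
    {"ginseng", "licorice"},
    {"licorice", "ginger"},
    {"ginseng", "licorice", "astragalus"},
]

def frequent_pairs(transactions, min_support=0.5):
    """Return {(herb_a, herb_b): support} for pairs meeting min_support."""
    n = len(transactions)
    counts = Counter(
        pair
        for t in transactions
        for pair in combinations(sorted(t), 2)  # sorted -> canonical pair order
    )
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

print(frequent_pairs(formulae))
```

Scaling this to 18,000 formulae would call for an Apriori- or FP-growth-style algorithm rather than exhaustive pair counting, but the support measure being computed is the same.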
40. Rule-Based FCM: A Relational Mapping Model.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Yang, Ying, Li, Tao-shen, and Le, Jia-jin
- Abstract
Rule-Based Fuzzy Cognitive Map (RBFCM) is proposed as an evolution of Fuzzy Causal Maps (FCM) to allow a more complete and complex representation of cognition, so that relations other than monotonic causality are made possible. This paper shows how RBFCM can be viewed in the context of relation algebra, and proposes a novel model for representing and reasoning about causal knowledge relations. The mapping model and rules are introduced to infer three kinds of causal relations that FCM cannot support. Capability analysis shows that our model is much better than FCM at emulating the real world. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
41. Network Boosting for BCI Applications.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Wang, Shijun, Lin, Zhonglin, and Zhang, Changshui
- Abstract
Network Boosting is an ensemble learning method which combines learners based on a network and can learn the target hypothesis asymptotically. We apply the approach to analyze data from the P300 speller paradigm. The result on Data set II of the BCI (Brain-Computer Interface) Competition III shows that Network Boosting achieves higher classification accuracy than logistic regression, SVM, Bagging and AdaBoost. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
42. Discovering User Preferences by Using Time Entries in Click-Through Data to Improve Search Engine Results.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, and Ramachandran, Parthasarathy
- Abstract
Search engine log files have been used to gather direct user feedback on the relevancy of the documents presented in the results page. Typically, the relative position of the clicks gathered from the log files is used as a proxy for direct user feedback. In this paper we identify reasons for the incompleteness of the relative position of clicks for deciphering user preferences. Hence, we propose using the time spent by the user in reading through a document as indicative of user preference for that document with respect to a query. We also identify the issues involved in using the time measure and propose means to address them. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
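The time-based preference signal proposed in this abstract can be sketched by differencing consecutive click timestamps within a session: the gap until the next click approximates how long the user spent reading a result. The log format and the session-end sentinel below are assumptions for illustration, not the paper's actual data model:

```python
# (query, clicked_url, click_timestamp_in_seconds), sorted by time.
# Invented example log for one search session.
clicks = [
    ("ml tutorial", "site-a.example", 100),
    ("ml tutorial", "site-b.example", 115),
    ("ml tutorial", "site-c.example", 290),
]
SESSION_END = 300  # assumed timestamp at which the session ended

def dwell_times(clicks, session_end):
    """Time between consecutive clicks approximates reading time per result."""
    times = {}
    sentinel = (None, None, session_end)  # closes the interval of the last click
    for (query, url, t), nxt in zip(clicks, clicks[1:] + [sentinel]):
        times[(query, url)] = nxt[2] - t
    return times

print(dwell_times(clicks, SESSION_END))
```

Here the long gap before the third click suggests the user preferred the second result, even though it was not ranked first; the issues the paper raises (e.g. the user leaving the desk, or the last click's interval being unbounded) are exactly what this naive differencing ignores.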
43. A Tabu Clustering Method with DHB Operation and Mergence and Partition Operation.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Liu, Yongguo, Zheng, Dong, Li, Shiqun, Wang, Libin, and Chen, Kefei
- Abstract
A new tabu clustering method called ITCA is developed for the minimum sum-of-squares clustering problem, where the DHB operation and the mergence and partition operation are introduced to fine-tune the current solution and to create the neighborhood, respectively. Compared with some known clustering methods, ITCA obtains better performance, as extensively demonstrated by experimental simulations. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
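For reference, the minimum sum-of-squares objective that ITCA optimizes can be written down directly. This sketch only shows the cost function for 1-D points; the tabu search, DHB, and mergence-and-partition operations themselves are not reproduced, and the data are invented:

```python
def sse(clusters):
    """Sum over clusters of squared distances to each cluster's centroid."""
    total = 0.0
    for pts in clusters:
        centroid = sum(pts) / len(pts)
        total += sum((p - centroid) ** 2 for p in pts)
    return total

# A well-separated partition versus a scrambled one of the same points:
good = [[1.0, 1.2, 0.8], [10.0, 10.4]]
bad = [[1.0, 10.0], [1.2, 0.8, 10.4]]
print(sse(good), sse(bad))
```

A neighborhood move (merging two clusters, or partitioning one) is accepted or rejected by comparing exactly this cost before and after the move.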
44. Knowledge Discovery Through Composited Visualization, Navigation and Retrieval.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Lim, Wei-Ching, and Lee, Chien-Sing
- Abstract
Many problems occur when visualizing, navigating and retrieving information from large databases. Ontologies help in adding semantics and context to the resources in databases. Hence, this paper presents OntoVis, an ontological authoring and visualization tool which emphasizes the clustering of concepts in Formal Concept Analysis (FCA). The composited visualization, navigation and retrieval of resources are presented in this paper. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
45. A Semantic Enrichment of Data Tables Applied to Food Risk Assessment.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Gagliardi, Hélène, Haemmerlé, Ollivier, Pernelle, Nathalie, and Saïs, Fatiha
- Abstract
Our work deals with the automatic construction of domain-specific data warehouses. Our application domain concerns microbiological risks in food products. The MIEL++ system [2], implemented during the Sym'Previus project, is a tool based on a database containing experimental and industrial results about the behavior of pathogenic germs in food products. This database is incomplete by nature since the number of possible experiments is potentially infinite. Our work, developed within the e.dot project (a cooperation between INRIA, Paris South University, INRA and Xyleme), presents a way of palliating that incompleteness by complementing the database with data automatically extracted from the Web. We propose to query these data through a mediated architecture based on a domain ontology, so we need to make them compatible with the ontology. In the e.dot project [5], we exclusively focus on documents in Html or Pdf format which contain data tables. Data tables are a very common presentation scheme for describing synthetic data in scientific articles. These tables are semantically enriched, and we want this enrichment to be as automatic and flexible as possible. Thus, we have defined a Document Type Definition named SML (Semantic Markup Language) which can deal with additional or incomplete information in a semantic relation, ambiguities or possible interpretation errors. In this paper, we present this semantic enrichment step. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
46. Self-generation of Control Rules Using Hierarchical and Nonhierarchical Clustering for Coagulant Control of Water Treatment Plants.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Bae, Hyeon, Kim, Sungshin, Kim, Yejin, and Kim, Chang-Won
- Abstract
Rule extraction, one of the categories of data mining, was performed for coagulant control of a water treatment plant. Clustering methods were applied to extract control rules from data. These control rules can be used for full automation of water treatment plants in place of the operator's knowledge of plant control. In this study, statistical indices were used to determine cluster numbers and seed points from hierarchical clustering. These statistical approaches give information about the features of clusters, reducing computing cost and increasing the accuracy of clustering. The proposed algorithm can play an important role in data mining and knowledge discovery. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
47. Training Support Vector Machines via SMO-Type Decomposition Methods.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Chen, Pai-Hsuen, Fan, Rong-En, and Lin, Chih-Jen
- Abstract
This article gives a comprehensive study on SMO-type (Sequential Minimal Optimization) decomposition methods for training support vector machines. We propose a general and flexible selection of the two-element working set. Main theoretical results include 1) a simple asymptotic convergence proof, 2) a useful explanation of the shrinking and caching techniques, and 3) the linear convergence of this method. This analysis applies to any SMO-type implementation whose selection falls into the proposed framework. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF
48. Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed Information Sources.
- Author
-
Hoffmann, Achim, Motoda, Hiroshi, Scheffer, Tobias, Caragea, Doina, Zhang, Jun, Bao, Jie, Pathak, Jyotishman, and Honavar, Vasant
- Abstract
Development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in an explosive growth in the number, size, and diversity of potentially useful information sources. This has resulted in unprecedented opportunities in data-driven knowledge acquisition and decision-making in a number of emerging, increasingly data-rich application domains such as bioinformatics, environmental informatics, enterprise informatics, and social informatics (among others). However, the massive size, semantic heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles in acquiring useful knowledge from the available data. This paper introduces some of the algorithmic and statistical problems that arise in such a setting and describes algorithms for learning classifiers from distributed data that offer rigorous performance guarantees (relative to their centralized or batch counterparts). It also describes how this approach can be extended to work with autonomous, and hence inevitably semantically heterogeneous, data sources, by making explicit the ontologies (attributes and relationships between attributes) associated with the data sources and reconciling the semantic differences among the data sources from a user's point of view. This allows user- or context-dependent exploration of semantically heterogeneous data sources. The resulting algorithms have been implemented in INDUS, an open source software package for collaborative discovery from autonomous, semantically heterogeneous, distributed data sources. [ABSTRACT FROM AUTHOR]
- Published
- 2005
- Full Text
- View/download PDF