Author: "Katti Faceli" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Katti Faceli" returned 46 results

1. Evaluation of gene selection metrics for tumor cell classification

Author: Katti Faceli, André C.P.L.F. de Carvalho, and Wilson A. Silva Jr
Subjects: gene selection, machine learning, gene expression, sage, Genetics, QH426-470
Abstract: Gene expression profiles contain the expression level of thousands of genes. Depending on the issue under investigation, this large amount of data makes analysis impractical. Thus, it is important to select subsets of relevant genes to work with. This paper investigates different metrics for gene selection. The metrics are evaluated based on their ability to select genes whose expression profile provides information to distinguish between tumor and normal tissues. This evaluation is made by constructing classifiers using the genes selected by each metric and then comparing the performance of these classifiers. The performance of the classifiers is evaluated using the error rate in the classification of new tissues. As the dataset has few tissue samples, the leave-one-out methodology was employed to guarantee more reliable results. The classifiers are generated using different machine learning algorithms. Support Vector Machines (SVMs) and the C4.5 algorithm are employed. The experiments are conducted employing SAGE data obtained from the NCBI web site. There are few analyses involving SAGE data in the literature. It was found that the best metric for the data and algorithms employed is the logistic metric.
Published: 2004

2. HSS: Compact Set of Partitions via Hybrid Selection.

Author: Vanessa Antunes, Katti Faceli, and Tiemi Christine Sakata
Published: 2017

3. CVis - Towards a novel visualization tool to explore the relationship between input and output partitions in multi-objective clustering ensembles.

Author: Katti Faceli, Tiemi C. Sakata, and Julia Handl
Published: 2017

4. ASAClu: Selecting Diverse and Relevant Clusters.

Author: Joao Luis Baptista de Almeida, Tiemi Christine Sakata, and Katti Faceli
Published: 2016

5. Impact of Base Partitions on Multi-objective and Traditional Ensemble Clustering Algorithms.

Author: Jane Piantoni, Katti Faceli, Tiemi C. Sakata, Julio C. Pereira, and Marcílio Carlos Pereira de Souto
Published: 2015

6. PVis - Partitions' visualizer: Extracting knowledge by visualizing a collection of partitions.

Author: Katti Faceli, Tiemi C. Sakata, André Carlos Ponce de Leon Ferreira de Carvalho, and Marcílio Carlos Pereira de Souto
Published: 2014

7. A Comparison of External Clustering Evaluation Indices in the Context of Imbalanced Data Sets.

Author: Marcilio C. P. de Souto, André L. V. Coelho, Katti Faceli, Tiemi C. Sakata, Viviane Bonadia, and Ivan G. Costa
Published: 2012

8. The Assessment of the Quality of Sugar Using Electronic Tongue and Machine Learning Algorithms.

Author: Tiemi C. Sakata, Katti Faceli, Tiago A. Almeida, Antonio Riul Jr., and Wanessa M. D. M. F. Steluti
Published: 2012

9. Improvements in the Partitions Selection Strategy for Set of Clustering Solutions.

Author: Tiemi C. Sakata, Katti Faceli, Marcílio Carlos Pereira de Souto, and André Carlos Ponce de Leon Ferreira de Carvalho
Published: 2010

10. A Strategy for the Selection of Solutions of the Pareto Front Approximation in Multi-objective Clustering Approaches.

Author: Katti Faceli, Marcílio Carlos Pereira de Souto, and André Carlos Ponce de Leon Ferreira de Carvalho
Published: 2008

11. Data clustering based on complex network community detection.

Author: Tatyana B. S. de Oliveira, Liang Zhao, Katti Faceli, and André Carlos Ponce de Leon Ferreira de Carvalho
Published: 2008

12. Multi-Objective Clustering Ensemble with Prior Knowledge.

Author: Katti Faceli, André Carlos Ponce de Leon Ferreira de Carvalho, and Marcílio Carlos Pereira de Souto
Published: 2007

13. Evaluation of the Contents of Partitions Obtained with Clustering Gene Expression Data.

Author: Katti Faceli, André Carlos Ponce de Leon Ferreira de Carvalho, and Marcílio Carlos Pereira de Souto
Published: 2005

14. AcheSeuEcoponto

Author: Angelina Vitorino de Souza Melaré, Sahudy Montenegro González, and Katti Faceli
Subjects: education.field_of_study, Municipal solid waste, Knowledge management, Web system, business.industry, Computer science, Population, Brazilian population, education, Map visualization, business, Dissemination
Abstract: AcheSeuEcoponto is a web system to assist the Brazilian population in allocating its solid residues appropriately. It was developed using geo and web technologies. The system offers functionalities to citizens (e.g., querying the nearest drop-off centre by waste type, map visualization, and directions) and managers (e.g., reports of underserved areas of disposal centres and users' profile). Besides a discussion on how the system assists citizens and managers, the authors provide a characterization of users' profile and show how the use of web technologies helped in the dissemination of information. Both analyses can help managers by giving essential guidelines on how to motivate population engagement.
Published: 2020

15. Hybrid Strategy for Selecting Compact Set of Clustering Partitions

Author: Katti Faceli, Vanessa Antunes, Tiemi C. Sakata, and Marcilio C. P. de Souto
Affiliations: Universidade Federal de São Carlos, Sorocaba (UFSCar/Sorocaba); Laboratoire d'Informatique Fondamentale d'Orléans (LIFO); Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL); Université d'Orléans (UO)
Subjects: 0209 industrial biotechnology, Multiobjective optimisation, Computer science, Rand index, Clustering Algorithm, 02 engineering and technology, Multi-objective optimization, Partition (database), Evolutionary computation, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 020901 industrial engineering & automation, Compact space, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Cluster analysis, Selection algorithm, Algorithm, Software
Abstract: The selection of the most appropriate clustering algorithm is not a straightforward task, given that there is no clustering algorithm capable of determining the actual groups present in any dataset. A potential solution is to use different clustering algorithms to produce a set of partitions (solutions) and then select the best partition produced according to a specified validation measure; these measures are generally biased toward one or more clustering algorithms. Nevertheless, in several real cases, it is important to have more than one solution as the output. To address these problems, we present a hybrid partition selection algorithm, HSS, which accepts as input a set of base partitions potentially generated from clustering algorithms with different biases and aims, to return a reduced and yet diverse set of partitions (solutions). HSS comprises three steps: (i) the application of a multiobjective algorithm to a set of base partitions to generate a Pareto Front (PF) approximation; (ii) the division of the solutions from the PF approximation into a certain number of regions; and (iii) the selection of a solution per region by applying the Adjusted Rand Index. We compare the results of our algorithm with those of another selection strategy, ASA. Furthermore, we test HSS as a post-processing tool for two clustering algorithms based on multiobjective evolutionary computing: MOCK and MOCLE. The experiments revealed the effectiveness of HSS in selecting a reduced number of partitions while maintaining their quality.
Published: 2020
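
Note: the abstract above selects solutions by applying the Adjusted Rand Index (ARI), which several other records in this list call the corrected Rand index. As a rough illustration of what that index measures, the sketch below computes the ARI between two partitions given as label lists. It is a minimal, dependency-free sketch under stated assumptions, not the HSS implementation: the function name, the label-list representation, and the choice to return 1.0 in degenerate cases are all assumptions made here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same objects.

    Each argument is a sequence of cluster labels, one per object.
    Returns 1.0 for identical partitions; values near 0 indicate
    chance-level agreement, and negative values are possible.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumption: trivially identical when no pairs exist

    # Pairs of objects placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # assumption: degenerate case, e.g. a single cluster in both
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same grouping with swapped label names: perfect agreement.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # Groupings that cut across each other: below chance level.
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index compares how pairs of objects are grouped rather than the label names themselves, it can be used to judge how similar or how different two candidate partitions are.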

16. Technologies and decision support systems to aid solid-waste management: a systematic review

Author: Katti Faceli, Vitor Casadei, Angelina Vitorino de Souza Melaré, and Sahudy Montenegro González
Subjects: Technology, Decision support system, Engineering, Process management, Municipal solid waste, Relation (database), 020209 energy, Decision Making, Control (management), Expert Systems, 02 engineering and technology, 010501 environmental sciences, Solid Waste, 01 natural sciences, Decision Support Techniques, Waste Management, 0202 electrical engineering, electronic engineering, information engineering, Population growth, Population Growth, Waste Management and Disposal, 0105 earth and related environmental sciences, Geography, business.industry, Management science, Refuse Disposal, Sustainable management, Information and Communications Technology, Geographic Information Systems, business, Software, Environmental Monitoring, Economic problem
Abstract: Population growth associated with population migration to urban areas and industrial development have led to a consumption relation that results in environmental, social, and economic problems. With respect to the environment, a critical concern is the lack of control and the inadequate management of the solid waste generated in urban centers. Among the challenges are proper waste-collection management, treatment, and disposal, with an emphasis on sustainable management. This paper presents a systematic review of scientific publications concerning decision support systems applied to Solid Waste Management (SWM) using ICTs and OR in the period of 2010-2013. A statistical analysis of the eighty-seven most relevant publications is presented, encompassing the ICTs and OR methods adopted in SWM, the processes of solid-waste management where they were adopted, and which countries are investigating solutions for the management of solid waste. A detailed discussion on how the ICTs and OR methods have been combined in the solutions is also presented. The analysis and discussion provided aim to help researchers and managers gather insights on technologies/methods suitable for the SWM challenges they have at hand, and on gaps that can be explored regarding technologies/methods that could be useful as well as the processes in SWM that currently do not benefit from using ICTs and OR methods.
Published: 2017

17. Multi-Objective Clustering Ensemble.

Author: Katti Faceli, André Carlos Ponce de Leon Ferreira de Carvalho, and Marcílio Carlos Pereira de Souto
Published: 2006

18. HSS: Compact Set of Partitions via Hybrid Selection

Author: Tiemi C. Sakata, Katti Faceli, and Vanessa Antunes
Subjects: Compact space, Linear programming, Computer science, 020204 information systems, Rand index, 0202 electrical engineering, electronic engineering, information engineering, Partition (number theory), 020201 artificial intelligence & image processing, Algorithm design, 02 engineering and technology, Cluster analysis, Algorithm, Multi-objective optimization
Abstract: Inability to identify partitions of different sizes and shapes is a fundamental limitation of any clustering algorithm, especially when different regions of the search space contain clusters with varied characteristics. It is possible to apply diverse clustering algorithms, with different parameters, but then, it is necessary to deal with a large number of partitions. Techniques such as ensemble and multiobjective clustering treat this problem using distinct criteria, but they have high computational cost. Moreover, the ensemble technique generates a single solution, which may not represent every real partition present in the data. On the other hand, multiobjective clustering may generate a large number of partitions difficult to be analysed manually. In this paper, we propose a hybrid multiobjective algorithm, HSS, that aims to return a reduced and yet diverse set of solutions. It can be divided into three steps: (i) the application of a multiobjective algorithm to a set of base partitions for the generation of a Pareto Front (PF), (ii) the division of the solutions from the PF into a certain number of regions and (iii) the selection of a solution per region, through the application of the Adjusted Rand Index. Experiments show the effectiveness of HSS in selecting a reduced number of partitions.
Published: 2017

19. CVis — Towards a novel visualization tool to explore the relationship between input and output partitions in multi-objective clustering ensembles

Author: Julia Handl, Tiemi C. Sakata, and Katti Faceli
Subjects: 0301 basic medicine, 02 engineering and technology, Decision maker, computer.software_genre, Partition (database), Ensemble learning, Visualization, 03 medical and health sciences, 030104 developmental biology, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data mining, Cluster analysis, computer, Intuition, Mathematics
Abstract: Ensemble methods for clustering take a collection of input partitions, produced for the same data set, and generate an ensemble partition that tries to preserve the information carried in this collective. Acceptance of the resulting partition(s) by decision makers can be a problem, due to the inherent complexity of ensemble techniques, and the associated lack of intuition on how a consensus has been derived from the original set of input partitions. This problem is exacerbated in multi-objective ensemble techniques, which generate a set of non-dominated consensus partitions. In this context, the selection of a final candidate clustering may require additional insight into the relationships between non-dominated output partitions. In this manuscript, we describe the first prototype of a novel visualization tool, CVis, which has been developed as a general tool to provide insight into the relationship between any set of partitions of a given data set. We proceed to demonstrate the specific use of this tool in understanding the relationship between the sets of input, the sets of outputs, and the input-output relationships for the multi-objective ensemble technique MOCLE. We discuss how the interlinked analysis of such sets of partitions can shed light onto the functioning, and the strengths and limitations of a particular ensemble technique. In particular, the tool facilitates the visual analysis of the level of support identified for individual consensus clusters, which is helpful in explaining final solutions to a decision maker.
Published: 2017

20. Experimentos em Aprendizado de Máquina para Fusão de Sensores

Author: André C. P. L. F. de Carvalho, Katti Faceli, and Solange Oliveira Rezende
Published: 2016

21. Multi-objective design of hierarchical consensus functions for clustering ensembles via genetic programming

Author: Katti Faceli, André L. V. Coelho, and Everlandio Fernandes
Subjects: Clustering high-dimensional data, Information Systems and Management, Fuzzy clustering, Computer science, Correlation clustering, Single-linkage clustering, Population, Conceptual clustering, Genetic programming, computer.software_genre, Machine learning, Management Information Systems, Biclustering, Arts and Humanities (miscellaneous), CURE data clustering algorithm, Consensus clustering, Developmental and Educational Psychology, Cluster analysis, education, education.field_of_study, Brown clustering, business.industry, Dendrogram, Hierarchical clustering, Canopy clustering algorithm, FLAME clustering, Data mining, Artificial intelligence, Hierarchical clustering of networks, business, computer, Information Systems
Abstract: This paper investigates a genetic programming (GP) approach aimed at the multi-objective design of hierarchical consensus functions for clustering ensembles. By this means, data partitions obtained via different clustering techniques can be continuously refined (via selection and merging) by a population of fusion hierarchies having complementary validation indices as objective functions. To assess the potential of the novel framework in terms of efficiency and effectiveness, a series of systematic experiments, involving eleven variants of the proposed GP-based algorithm and a comparison with basic as well as advanced clustering methods (of which some are clustering ensembles and/or multi-objective in nature), have been conducted on a number of artificial, benchmark and bioinformatics datasets. Overall, the results corroborate the perspective that having fusion hierarchies operating on well-chosen subsets of data partitions is a fine strategy that may yield significant gains in terms of clustering robustness.
Published: 2011

22. Inducing multi-objective clustering ensembles with genetic programming

Author: André L. V. Coelho, Everlandio Fernandes, and Katti Faceli
Subjects: Clustering high-dimensional data, DBSCAN, Fuzzy clustering, Computer science, Cognitive Neuroscience, Single-linkage clustering, Rand index, Correlation clustering, Conceptual clustering, computer.software_genre, Machine learning, Biclustering, Artificial Intelligence, CURE data clustering algorithm, Consensus clustering, Cluster analysis, k-medians clustering, Brown clustering, business.industry, Constrained clustering, Ensemble learning, Computer Science Applications, Hierarchical clustering, ComputingMethodologies_PATTERNRECOGNITION, Data stream clustering, Canopy clustering algorithm, FLAME clustering, Artificial intelligence, Data mining, Hierarchical clustering of networks, business, computer
Abstract: Recent years have witnessed a growing interest in two advanced strategies to cope with the data clustering problem, namely, clustering ensembles and multi-objective clustering. In this paper, we present a genetic programming based approach that can be considered as a hybrid of these strategies, thereby allowing different hierarchical clustering ensembles to be simultaneously evolved taking into account complementary validity indices. Results of computational experiments conducted with artificial and real datasets indicate that, in most of the cases, at least one of the Pareto optimal partitions returned by the proposed approach compares favorably or is on par with the consensual partitions yielded by two well-known clustering ensemble methods in terms of clustering quality, as gauged by the corrected Rand index.
Published: 2010

23. Partitions selection strategy for set of clustering solutions

Author: André C. P. L. F. de Carvalho, Marcilio C. P. de Souto, Tiemi C. Sakata, and Katti Faceli
Subjects: Clustering high-dimensional data, Fuzzy clustering, Cognitive Neuroscience, Model selection, Correlation clustering, Single-linkage clustering, Constrained clustering, computer.software_genre, Computer Science Applications, Artificial Intelligence, Data mining, Cluster analysis, computer, k-medians clustering, Mathematics
Abstract: Clustering is a difficult task: there is no single cluster definition and the data can have more than one underlying structure. Pareto-based multi-objective genetic algorithms (e.g., MOCK, Multi-Objective Clustering with automatic K-determination, and MOCLE, Multi-Objective Clustering Ensemble) were proposed to tackle these problems. However, the output of such algorithms can often contain a high number of partitions, making it difficult for an expert to manually analyze all of them. In order to deal with this problem, we present two selection strategies, which are based on the corrected Rand index, to choose a subset of solutions. To test them, they are applied to the set of solutions produced by MOCK and MOCLE in the context of several datasets. The study was also extended to select a reduced set of partitions from the initial population of MOCLE. These analyses show that both versions of the proposed selection strategy are very effective. They can significantly reduce the number of solutions and, at the same time, keep the quality and the diversity of the partitions in the original set of solutions.
Published: 2010

24. Uma Revisão Sobre Combinação de Agrupamentos

Author: Murilo Coelho Naldi, Katti Faceli, and André C. P. L. F. de Carvalho
Subjects: Computer Science, Clustering Combination, General Computer Science
Abstract: Several clustering algorithms have been proposed in the literature. Different clustering algorithms, or even a single algorithm, can produce different results when applied to the same dataset. Combining results, whether obtained from a single classification technique or from distinct techniques, has been used successfully to improve the stability or performance of those techniques. For this reason, interest in combining data clusterings has grown in recent years. This work reviews the main clustering combination methods found in the literature. The review begins with a description of the combination problem and an analysis of the objectives commonly adopted by combination methods. It then discusses the need for diversity among the clusterings to be combined and methods for measuring it. A criterion for measuring the mutual information between clusterings is also defined, and examples of its use are presented. The performance of the methods has been compared by several authors in the literature, and an analysis of those comparisons is carried out in this work.
Published: 2010

25. Multi-objective clustering ensemble for gene expression data analysis

Author: André C. P. L. F. de Carvalho, Daniel S. A. de Araujo, Katti Faceli, and Marcilio C. P. de Souto
Subjects: Clustering high-dimensional data, Fuzzy clustering, Cognitive Neuroscience, Correlation clustering, Single-linkage clustering, k-means clustering, computer.software_genre, Computer Science Applications, ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence, CURE data clustering algorithm, Canopy clustering algorithm, Data mining, Cluster analysis, computer, Mathematics
Abstract: In this paper, we present an algorithm for cluster analysis that integrates aspects from cluster ensemble and multi-objective clustering. The algorithm is based on a Pareto-based multi-objective genetic algorithm, with a special crossover operator, which uses clustering validation measures as objective functions. The algorithm proposed can deal with data sets presenting different types of clusters, without the need of expertise in cluster analysis. Its result is a concise set of partitions representing alternative trade-offs among the objective functions. We compare the results obtained with our algorithm, in the context of gene expression data sets, to those achieved with multi-objective clustering with automatic K-determination (MOCK), the algorithm most closely related to ours.
Published: 2009

26. Multi-objective clustering ensemble

Author: Katti Faceli, André C. P. L. F. de Carvalho, and Marcílio C.P. de Souto
Published: 2007

27. Evaluation of gene selection metrics for tumor cell classification

Author: André C. P. L. F. de Carvalho, Wilson A. Silva, and Katti Faceli
Subjects: lcsh:QH426-470, business.industry, Normal tissue, Word error rate, Tumor cells, Pattern recognition, Biology, Bioinformatics, Expression (mathematics), Support vector machine, lcsh:Genetics, machine learning, sage, ComputingMethodologies_PATTERNRECOGNITION, Gene selection, Metric (mathematics), gene expression, Genetics, Artificial intelligence, business, Molecular Biology, gene selection, Web site
Abstract: Gene expression profiles contain the expression level of thousands of genes. Depending on the issue under investigation, this large amount of data makes analysis impractical. Thus, it is important to select subsets of relevant genes to work with. This paper investigates different metrics for gene selection. The metrics are evaluated based on their ability to select genes whose expression profile provides information to distinguish between tumor and normal tissues. This evaluation is made by constructing classifiers using the genes selected by each metric and then comparing the performance of these classifiers. The performance of the classifiers is evaluated using the error rate in the classification of new tissues. As the dataset has few tissue samples, the leave-one-out methodology was employed to guarantee more reliable results. The classifiers are generated using different machine learning algorithms. Support Vector Machines (SVMs) and the C4.5 algorithm are employed. The experiments are conducted employing SAGE data obtained from the NCBI web site. There are few analyses involving SAGE data in the literature. It was found that the best metric for the data and algorithms employed is the logistic metric.
Published: 2004
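
Note: the leave-one-out evaluation described in the abstract above can be sketched as follows. Each sample is held out once, a classifier is built from the remaining samples, and the error rate is the fraction of held-out samples that are misclassified. This is only an illustrative sketch, not the authors' code: the paper uses SVMs and C4.5, whereas a simple nearest-centroid classifier is swapped in here to keep the example dependency-free, and all function names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample.

    Stand-in classifier (an assumption of this sketch); any function with
    the same signature could be passed to leave_one_out_error instead.
    """
    sums, counts = {}, {}
    for row, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], sample))

    return min(sums, key=squared_distance)


def leave_one_out_error(samples, labels, predict=nearest_centroid_predict):
    """Error rate when each sample is classified by a model built without it."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


if __name__ == "__main__":
    # Toy one-feature "expression levels" for six tissues (made-up values).
    tissues = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
    classes = ["normal"] * 3 + ["tumor"] * 3
    print(leave_one_out_error(tissues, classes))
```

Comparing gene-selection metrics in this style would amount to calling `leave_one_out_error` once per metric, each time restricting the feature columns to the genes that metric selected.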

28. PVis — Partitions' visualizer: Extracting knowledge by visualizing a collection of partitions

Author: Marcilio C. P. de Souto, Tiemi C. Sakata, Katti Faceli, and André C. P. L. F. de Carvalho
Subjects: Set (abstract data type), Point (typography), Computer science, Data mining, computer.software_genre, Cluster analysis, computer, Partition (database), Domain (software engineering), Visualization
Abstract: Recent advances in cluster analysis highlight the importance of finding multiple meaningful partitions and point to the need for approaches to evaluate them. They also suggest that the evaluation should consider the knowledge of a domain expert. In this paper, we present a visualization method, called PVis (Partitions' Visualizer), that allows the integrated visualization of a collection of partitions. PVis makes it possible to compare the content of a set of partitions. The comparison can be done with respect to prior knowledge provided by an expert. PVis can be useful in the discovery of relevant information for the domain experts performing cluster analysis. To illustrate our approach, we give an example of how to perform an exploratory analysis of collections of partitions, using a well-known dataset from the Bioinformatics domain regarding molecular classification of cancer.
Published: 2014

29. The Assessment of the Quality of Sugar Using Electronic Tongue and Machine Learning Algorithms

Author: Antonio Riul Junior, Wanessa M. D. M. F. Steluti, Tiemi C. Sakata, Katti Faceli, and Tiago A. Almeida
Subjects: Computer science, business.industry, media_common.quotation_subject, Electronic tongue, Sugar industry, Supervised learning, Machine learning, computer.software_genre, Quality (business), Artificial intelligence, Sugar, business, Baseline (configuration management), computer, media_common
Abstract: The correct classification of sugar according to its physico-chemical characteristics directly influences the value of the product and its acceptance by the market. This study shows that using an electronic tongue system along with established techniques of supervised learning leads to the correct classification of sugar samples according to their qualities. In this paper, we offer two new real, public and non-encoded sugar datasets whose attributes were automatically collected using an electronic tongue, with and without pH control. Moreover, we compare the performance achieved by several established machine learning methods. Our experiments were diligently designed to ensure statistically sound results, and they indicate that the k-nearest neighbors method outperforms the other evaluated classifiers and, hence, can be used as a good baseline for further comparison.
Published: 2012

30. Improvements in the Partitions Selection Strategy for Set of Clustering Solutions

Author: André C. P. L. F. de Carvalho, Katti Faceli, Tiemi C. Sakata, and Marcilio C. P. de Souto
Subjects: Fuzzy clustering, Data stream clustering, Computer science, Model selection, Rand index, Correlation clustering, Canopy clustering algorithm, Constrained clustering, Data mining, computer.software_genre, Cluster analysis, computer
Abstract: No clustering algorithm is guaranteed to find actual groups in any dataset. Thus, the selection of the most suitable clustering algorithm to be applied to a given dataset is not easy. To deal with this problem, one can apply various clustering algorithms to the dataset, generating a set of partitions (solutions). Next, one can choose the best partition generated, according to a given validation measure; such measures are usually biased towards one or more clustering algorithms. However, in many cases, it is interesting to have more than one solution. In a previous work, we proposed a selection strategy able to reduce the number of solutions obtained from Pareto-based multi-objective genetic algorithms. This selection strategy uses the corrected Rand index to select a subset of the most different partitions. The size of the solution set is controlled by a threshold on the value of this index, given as an external parameter. Reducing the threshold value decreases the number of solutions. Since the choice of such a threshold value is not intuitive, this paper describes a modification of the original selection algorithm that automatically adjusts this threshold and guarantees the selection of the most evident partitions, which were simultaneously obtained with distinct clustering criteria. The new version does not require any user settings, presents a better number of solutions and maintains the diversity of the partitions in the reduced set.
Published: 2010

31. Evaluation of Clustering Results: The Trade-off Bias-Variability

Author: Katti Faceli, André C. P. L. F. de Carvalho, and Margarida G. M. S. Cardoso
Subjects: ComputingMethodologies_PATTERNRECOGNITION, Computer science, CURE data clustering algorithm, Rand index, Consensus clustering, Correlation clustering, Stability (learning theory), Unsupervised learning, Data mining, Cluster analysis, computer.software_genre, Measure (mathematics), computer
Abstract: Clustering evaluation generally relies on some desirable properties of clustering solutions (partitions, in particular): the properties of clusters’ compactness and separation, as well as the property of stability, are often considered as indicators of clustering quality. In fact, since the real clustering is unknown (clustering being originated by an unsupervised process), one should focus on obtaining good enough partitions. Clustering quality is, however, a difficult concept to put into practice. Furthermore, when aiming for cluster compactness and separation one does not necessarily meet the real clusters (e.g. Brun et al. 2007). Similarly, when focusing on the property of stability, one may find solutions that are more stable but do not necessarily fit the real solution better (e.g. Cardoso et al. 2008). In the present paper we consider a clustering solution’s reproducibility in other data sets drawn from the same source as an indicator of stability. We use a new cross-validation procedure and measure the agreement between the clustering solutions obtained and the real partitions (real data sets from the UCI repository, Asuncion and Newman 2007, are used). Next, we study the association between indicators of stability and agreement with the real partition. We conclude with a discussion of the trade-off bias-variability, which we believe is a relevant issue to investigate within unsupervised learning, clustering in particular.
Published: 2010

32. A Strategy for the Selection of Solutions of the Pareto Front Approximation in Multi-objective Clustering Approaches

Author: Katti Faceli, A.C.P.L.F. de Carvalho, and M.C.P. de Souto
Subjects: Mathematical optimization, Pareto interpolation, Genetic algorithm, Rand index, Pareto principle, Cluster analysis, Multi-objective optimization, Pareto analysis, Selection (genetic algorithm), Mathematics
Abstract: One of the advantages of Pareto-based multi-objective genetic algorithms for clustering, when compared to classical clustering algorithms, is that, instead of a single solution (partition), they give as an output a set of solutions (approximation of the Pareto front or Pareto front, for short). However, such a set could be very large (e.g., hundreds of partitions) and, consequently, difficult to analyze manually. We present a selection strategy, based on the corrected Rand index, that aims at recommending, as the final solution for Pareto-based multi-objective genetic algorithm approaches, a subset of partitions from the Pareto front. This subset should be much smaller than the Pareto front and, at the same time, keep the quality and the diversity of its partitions. In order to test our strategy, we develop a case study in which we apply the strategy to the sets of solutions obtained with the multi-objective clustering ensemble algorithm (MOCLE) in the context of several data sets.
Published: 2008

33. Data clustering based on complex network community detection

Author: Liang Zhao, Katti Faceli, T.B.S. de Oliveira, and A.A. de Carvalho
Subjects: Fuzzy clustering, Computer science, business.industry, Single-linkage clustering, Correlation clustering, Pattern recognition, computer.software_genre, Complete-linkage clustering, Determining the number of clusters in a data set, CURE data clustering algorithm, Canopy clustering algorithm, Artificial intelligence, Data mining, business, Cluster analysis, computer
Abstract: Data clustering is an important technique to extract and understand relevant information in large data sets. In this paper, a clustering algorithm based on graph theoretic models and community detection in complex networks is proposed. Two steps are involved in this processing: the first step is to represent the input data as a network, and the second is to partition the network into subnetworks, producing data clusters. In the network partition stage, each node has a randomly assigned initial angle, which is gradually updated according to its agreement with its neighbors' angles. Finally, a stable state is reached in which nodes belonging to the same cluster have similar angles. This process is repeated, with a cluster chosen each time, resulting in a hierarchical divisive clustering. Simulation results show two main advantages of the algorithm: the ability to detect clusters of different shapes, densities and sizes, and the ability to generate clusters with different degrees of refinement. Besides these, the proposed algorithm presents high robustness and efficiency in clustering.
Published: 2008

34. Cluster Ensemble and Multi-Objective Clustering Methods

Author: Marcilio C. P. de Souto, André C. P. L. F. de Carvalho, and Katti Faceli
Subjects: Biclustering, ComputingMethodologies_PATTERNRECOGNITION, Data exploration, Knowledge extraction, Computer science, Data stream mining, Consensus clustering, Cluster (physics), Data mining, Cluster analysis, computer.software_genre, computer, Data mining algorithm
Abstract: Clustering is an important tool for data exploration. Several clustering algorithms exist, and new algorithms are frequently proposed in the literature. These algorithms have been very successful in a large number of real-world problems. However, there is no clustering algorithm, optimizing only a single criterion, able to reveal all types of structures (homogeneous or heterogeneous) present in a dataset. In order to deal with this problem, several multi-objective clustering and cluster ensemble methods have been proposed in the literature, including our multi-objective clustering ensemble algorithm. In this chapter, we present an overview of these methods, which, to a great extent, are based on the combination of various aspects of traditional clustering algorithms.
Published: 2008

35. A framework for cluster analysis based in the multi-objective combination of clustering algorithms

Author: Katti Faceli, André Carlos Ponce de Leon Ferreira de Carvalho, Heloisa de Arruda Camargo, Francisco de Assis Tenório de Carvalho, Luiz Antonio Nogueira Lorena, and Maria Carolina Monard
Subjects: Set (abstract data type), Structure (mathematical logic), Exploratory data analysis, Computer science, Independent set, Data domain, Cluster (physics), Data mining, Cluster analysis, computer.software_genre, computer, Visualization
Abstract: This thesis presents a framework for exploratory data analysis via clustering techniques. The goal is to facilitate the work of experts in the data domain. The core of the framework is a multi-objective clustering ensemble algorithm, the MOCLE algorithm, complemented by a method for the integrated visualization of a set of partitions. By jointly applying the ideas of clustering ensembles and multi-objective clustering, MOCLE automatically performs important steps of cluster analysis: it runs several conceptually different clustering algorithms with various parameter configurations, combines the partitions resulting from these algorithms, and selects the partitions with the best trade-offs among different validation measures. MOCLE is a robust approach for dealing with the different types of structure that can be present in a dataset. It results in a concise and stable set of high-quality alternative structures, without the need for previous knowledge about the data or deep knowledge of cluster analysis. Furthermore, in order to facilitate the discovery of more complex structures, MOCLE allows the automatic integration of previous knowledge of a simple structure via its objective functions. Finally, the proposed visualization method allows the simultaneous observation of a set of partitions, which helps in the analysis of MOCLE results.
Published: 2006

36. Evaluation of the Contents of Partitions Obtained with Clustering Gene Expression Data

Author: Marcilio C. P. de Souto, Katti Faceli, and André C. P. L. F. de Carvalho
Subjects: Identification (information), Proximity measure, Fuzzy clustering, Computer science, Correlation clustering, Data mining, computer.software_genre, Cluster analysis, computer, Complete-linkage clustering, Cluster algorithm
Abstract: This work investigates the behavior of two different clustering algorithms, with two proximity measures, in terms of the contents of the partitions obtained with them. An analysis of how the classes are separated by these algorithms, as different numbers of clusters are generated, is also presented. The use of this information to identify special cases for further analysis by biologists is also discussed.
Published: 2005

37. Combining intelligent techniques for sensor fusion

Author: Katti Faceli, Solange Oliveira Rezende, and André C. P. L. F. de Carvalho
Subjects: Computer science, business.industry, Reliability (computer networking), Estimator, Mobile robot, Robotics, computer.software_genre, Sensor fusion, Robot learning, Artificial intelligence, situated approach, Artificial Intelligence, Fuse (electrical), Robot, Computer vision, Data mining, Artificial intelligence, Hyper-heuristic, Representation (mathematics), business, computer, Reliability (statistics)
Abstract: Mobile robots rely on sensor data to build a representation of their environment. However, sensors usually provide incomplete, inconsistent or inaccurate information. Sensor fusion has been successfully employed to enhance the accuracy of sensor measures. This work proposes and investigates the use of Artificial Intelligence techniques for sensor fusion. Its main goal is to improve the accuracy and reliability of the distance measure between a robot and an object in its work environment, based on measures obtained from different sensors. Several Machine Learning algorithms are investigated to fuse the sensor data. The best model generated by each algorithm is called an estimator. It is shown that the use of estimators based on Artificial Intelligence can significantly improve the performance achieved by each sensor alone. The Machine Learning algorithms employed have different characteristics, causing the estimators to behave differently in different situations. Aiming to achieve even more accurate and reliable behavior, the estimators are combined in committees. The results obtained suggest that this combination can further improve the reliability and accuracy of the distances measured by the individual sensors and estimators used for sensor fusion.
Published: 2003

38. Experiments on machine learning techniques for sensor fusion

Author: Katti Faceli, A.C.P.L.F. de Carvalho, and Solange Oliveira Rezende
Subjects: Computer science, business.industry, Mobile robot, Machine learning, computer.software_genre, Sensor fusion, Object (computer science), Robot control, Robot, Computer vision, Artificial intelligence, business, Representation (mathematics), computer, Reliability (statistics)
Abstract: Mobile robots rely on sensor data to build a representation of their environment. However, sensors usually provide incomplete, inconsistent or inaccurate information. Sensor fusion has been successfully employed to enhance the accuracy of sensor measures. This article proposes and investigates the use of artificial intelligence techniques for sensor fusion to improve the accuracy and reliability of the distance measured between a robot and an object in its work environment.
Published: 2002

39. Combination of artificial intelligence methods for sensor fusion

Author: Katti Faceli, André Carlos Ponce de Leon Ferreira de Carvalho, Aluizio Fausto Ribeiro Araujo, and Carlos Henrique Costa Ribeiro
Abstract: Mobile robots rely on sensor data to build a representation of their environment. However, sensors usually provide incomplete, inconsistent or inaccurate information. Sensor fusion has been successfully employed to enhance the accuracy of sensor measures. This work proposes and investigates the use of artificial intelligence techniques for sensor fusion. Its main goal is to improve the accuracy and reliability of the distance measured between a robot and an object in its work environment, using measures obtained from different sensors. Several machine learning algorithms are investigated to fuse the sensor data. The best model generated with each algorithm is called an estimator. It is shown that the use of estimators based on artificial intelligence can significantly improve the performance achieved by each sensor alone. The machine learning algorithms employed have different characteristics, causing the estimators to behave differently in different situations. Aiming to achieve more accurate and reliable behavior, the estimators are combined in committees. The results obtained suggest that this combination can improve the reliability and accuracy of the distance measures of the individual sensors and estimators used for sensor fusion.
Published: 2001

40. Combining Intelligent Techniques for Sensor Fusion.

Author: Katti Faceli, André C.P.L.F. de Carvalho, and Solange O. Rezende
Subjects: MOBILE robots, ARTIFICIAL intelligence, WORK environment, ALGORITHMS
Abstract: Mobile robots rely on sensor data to build a representation of their environment. However, sensors usually provide incomplete, inconsistent or inaccurate information. Sensor fusion has been successfully employed to enhance the accuracy of sensor measures. This work proposes and investigates the use of Artificial Intelligence techniques for sensor fusion. Its main goal is to improve the accuracy and reliability of the distance measure between a robot and an object in its work environment, based on measures obtained from different sensors. Several Machine Learning algorithms are investigated to fuse the sensor data. The best model generated by each algorithm is called an estimator. It is shown that the use of estimators based on Artificial Intelligence can significantly improve the performance achieved by each sensor alone. The Machine Learning algorithms employed have different characteristics, causing the estimators to behave differently in different situations. Aiming to achieve even more accurate and reliable behavior, the estimators are combined in committees. The results obtained suggest that this combination can further improve the reliability and accuracy of the distances measured by the individual sensors and estimators used for sensor fusion.
Published: 2004
Full Text: View/download PDF

41. Análise retórica com base em grande quantidade de dados

Author: Erick Galani Maziero, Thiago Alexandre Salgueiro Pardo, Katti Faceli, Valéria Delisandra Feltrim, Estevam Rafael Hruschka Júnior, and Maria das Graças Volpe Nunes
Abstract: Given the almost uncountable amount of textual information available on the web, the automation of several tasks related to automatic text processing is an undeniable need. In superficial approaches to NLP (Natural Language Processing), important properties of the text are lost, such as the position, order, adjacency and context of textual segments. A deeper analysis, as carried out at the discourse level, deals with the identification of the rhetorical organization of the text, generating a hierarchical structure in which the intentions of the author are made explicit and related to one another. To automate this task, most works have used machine learning techniques, mainly from the supervised paradigm. In this paradigm, manually labeled data is required to obtain classification models, especially to identify the rhetorical relations. As manual annotation is a costly process, the results obtained in the task are unsatisfactory, because they are well below human performance. In this thesis, the massive use of unlabeled data was applied in semi-supervised never-ending learning to identify rhetorical relations. A framework was proposed that uses texts continuously obtained from the web. In the framework, a variation of traditional semi-supervised algorithms was employed, together with a concept-drift monitoring strategy. In addition, state-of-the-art techniques for English were adapted to Portuguese. Without human intervention, the F-measure increased, so far, by 0.144 (from 0.543 to 0.621). This result constitutes the state of the art for automatic discourse analysis in Portuguese.
Published: 2018

42. A systematic comparative evaluation of biclustering techniques

Author: Victor Alexandre Padilha, Ricardo José Gabrielli Barreto Campello, Katti Faceli, David Corrêa Martins Junior, and Dilvan de Abreu Moreira
Abstract: Data clustering is a fundamental problem in the unsupervised machine learning field, whose objective is to find categories that describe a dataset according to the similarities between its objects. In its traditional formulation, we search for partitions or hierarchies of partitions containing clusters such that the objects contained in the same cluster are similar to each other and dissimilar to objects from other clusters, according to a similarity or dissimilarity measure that uses all the data attributes in its calculation. It is thus assumed that all clusters are characterized in the same feature space. However, there are several applications where the clusters are characterized only in a subset of the attributes, which could differ from one cluster to another. Unlike traditional data clustering algorithms, biclustering algorithms are able to cluster the rows and columns of a data matrix simultaneously, producing biclusters formed by strongly related subsets of objects and subsets of attributes. These algorithms started to draw the scientific community's attention only after some studies showed their importance for gene expression data analysis. To a lesser degree, biclustering techniques have also been used in other application domains, such as text mining and collaborative filtering in recommendation systems. The problem is that a variety of biclustering algorithms have been proposed in recent years, based on different principles and assumptions, which could lead to different outcomes on the same dataset. It therefore becomes important to perform comparative studies that contrast the behavior and performance of these algorithms. In this thesis, a comparative study is presented involving 17 biclustering algorithms (representative of the main categories of algorithms in the literature), which were tested on synthetic and real data collections, with particular emphasis on gene expression data analysis. Several methodological aspects and experimental evaluation procedures were taken into account in order to overcome the limitations of previous comparative studies in the literature. Beyond the comparison itself, the comparative methodology developed can be reused to compare other algorithms in the future.
Published: 2016

43. Sentiment analysis in short texts from social networks

Author: Nádia Félix Felipe da Silva, Eduardo Raul Hruschka, Romis Ribeiro de Faissol Attux, Katti Faceli, Thiago Alexandre Salgueiro Pardo, and Ivan Luiz Marques Ricarte
Subjects: Sociology
Abstract: Sentiment analysis is a field of study that has recently become popular due to the growth of the Internet and of the content generated by its users. More recently, social networks have emerged, where people post their opinions in colloquial and compact language. This is what happens on Twitter, a communication tool that can easily be used as a source of information for various automatic sentiment inference tools. Research efforts have been directed at treating sentiment analysis in social networks as a classification problem, with little consensus about which classifier is best and which configuration provided by the feature engineering process best represents the texts. Another problem is that, in a supervised setting, the training stage of the classification model requires labeled examples, which are hard to obtain in most applications. The objective of this thesis is to investigate the use of classifier ensembles, exploring the diversity and the potential of various supervised approaches when these work together, as well as to provide a detailed study of the phase that precedes the choice of the classifier, which is known as feature engineering. In addition to these aspects, a study is carried out showing that unsupervised learning techniques can provide useful additional constraints that improve the generalization ability of sentiment classifiers. Based on the promising results obtained in supervised learning settings, an existing algorithm called C3E (Consensus between Classification and Clustering Ensembles) was adapted and extended for the semi-supervised setting. This algorithm refines the sentiment classification using additional information provided by clusters of data, in a self-training procedure. This approach shows promising results when compared with state-of-the-art algorithms.
Published: 2016

44. Abordagens evolutivas para agrupamento relacional de dados

Author: Danilo Horta, Ricardo José Gabrielli Barreto Campello, Katti Faceli, and Solange Oliveira Rezende
Subjects: Earthquake engineering, Web mining, Secrecy, Evolutionary algorithm, Context (language use), Data mining, Cluster analysis, computer.software_genre, Focus (optics), computer
Abstract: Data clustering is a fundamental technique for applications in several fields of science and business, such as commerce, biology, psychiatry, astronomy, and Web mining. However, in a subset of these fields, such as industrial engineering, social sciences, earthquake engineering, and document retrieval, datasets are usually described only by the proximities between their objects (called relational datasets). Even in applications where the data are not naturally relational, the use of relational datasets preserves the data's secrecy, which can be of great value to banks or brokers, for instance. This dissertation presents a review of data clustering algorithms that deal with relational datasets, with a focus on algorithms that produce hard or crisp partitions of data. Particular emphasis is given to evolutionary algorithms, which have proved able to solve data clustering problems accurately and efficiently. In this context, we propose a new evolutionary clustering algorithm able to operate on relational datasets and also able to automatically estimate the number of clusters (which is usually unknown in practical applications). It is empirically shown that this new algorithm can outperform traditional methods described in the literature in terms of computational efficiency and accuracy.
Published: 2015

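45. Agrupamento hierárquico semissupervisionado ativo baseado em confiança e sua aplicação para extração de hierarquias de tópicos a partir de coleções de documentos [Confidence-based active semi-supervised hierarchical clustering and its application to extracting topic hierarchies from document collections]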

Author: Bruno Magalhães Nogueira, Solange Oliveira Rezende, Alípio Mário Guedes Jorge, Inês de Castro Dutra, Katti Faceli, and Ronaldo Cristiano Prati
Subjects: Computer science, Active learning (machine learning), Artificial intelligence, Document clustering, Machine learning, Cluster analysis, Semi-supervised clustering
Abstract: Topic hierarchies are an efficient way of organizing document collections. These structures help users manage the knowledge contained in textual data. Such hierarchies are usually obtained through unsupervised hierarchical clustering algorithms. Because they do not consider the user's context when forming the hierarchical groups, unsupervised topic hierarchies may not meet the user's expectations in some cases. One possible solution to this problem is to employ semi-supervised clustering algorithms, which incorporate the user's domain knowledge into the clustering process through constraints. However, in the context of semi-supervised hierarchical clustering, the works in the literature explore efficiently neither the selection of cases (instances or clusters) to which constraints are added nor the interaction of the user with the clustering process. In this work we therefore introduce two semi-supervised hierarchical clustering algorithms: HCAC (Hierarchical Confidence-based Active Clustering) and HCAC-LC (Hierarchical Confidence-based Active Clustering with Limited Constraints). These algorithms employ an active learning approach based on the confidence of cluster merges. When a low-confidence merge is detected, the user is invited to choose, from a pool of candidate pairs of clusters, the best cluster merge at that point. We employ HCAC and HCAC-LC to extract topic hierarchies through the SMITH framework, which is also proposed in this thesis. This framework provides a series of well-defined activities that allow user interaction in the generation of topic hierarchies. The active learning approach used in the HCAC-based algorithms, the kind of queries employed in these algorithms, and the SMITH framework for generating semi-supervised topic hierarchies are the contributions of this thesis to the state of the art. Our experimental results indicate that HCAC and HCAC-LC outperform other semi-supervised hierarchical clustering algorithms in diverse scenarios. The results also indicate that semi-supervised topic hierarchies obtained through the SMITH framework are more intuitive and easier to navigate than unsupervised topic hierarchies.
Published: 2013

46. Unsupervised learning of topic hierarchies from dynamic text collections

Author: Ricardo Marcondes Marcacini, Solange Oliveira Rezende, Heloisa de Arruda Camargo, and Katti Faceli
Subjects: Information retrieval, Computer science, Process (engineering), Knowledge organization, Unsupervised learning, Quality (business), Exploratory search, Representation (mathematics), Cluster analysis, Hierarchical clustering
Abstract: The need to extract new and useful knowledge from large textual collections has increasingly motivated research on Text Mining methods. Among the existing methods, initiatives for organizing knowledge by means of topic hierarchies stand out. In a topic hierarchy, the knowledge implicit in the texts is represented by topics and subtopics, and each topic contains documents related to the same theme. Topic hierarchies play an important role in information retrieval, especially in exploratory search tasks, because they allow the knowledge of interest to be analyzed at various levels of granularity and large document collections to be explored interactively. Hierarchical clustering methods have been used to support the construction of topic hierarchies, since they organize textual collections into clusters and subclusters, in an unsupervised manner, using the similarities among documents. However, most existing hierarchical clustering methods are not suitable for scenarios with dynamic text collections, since frequent clustering updates are required. Clustering methods that meet the requirements of dynamic scenarios must process new documents as soon as they are added to the collection, performing the clustering incrementally. Thus, this work explores the use of incremental clustering methods for unsupervised learning of topic hierarchies in dynamic text collections. Incremental clustering is used to build and update a condensed representation of the texts, which maintains a summary of the main features of the data. Hierarchical clustering algorithms can then be applied to these condensed representations, organizing the textual collection more efficiently. We experimentally evaluated three incremental clustering strategies from the literature and proposed an alternative strategy more appropriate for topic hierarchies. The results indicated that topic hierarchies built with incremental clustering have quality close to that of topic hierarchies built by non-incremental methods, with a significant reduction in computational cost.
Published: 2011
