Author: "José A. Riquelme" / Topic: data mining - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"José A. Riquelme"' showing total 78 results

1. Enhancing Object Detection for Autonomous Driving by Optimizing Anchor Generation and Addressing Class Imbalance

Author: José C. Riquelme, Manuel Carranza-García, Jorge García-Gutiérrez, Pedro Lara-Benítez, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Ministerio de Ciencia, Innovación y Universidades (MICINN). España, and Junta de Andalucía
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, 0209 industrial biotechnology, Class imbalance, Object detection, Computer Science - Artificial Intelligence, Computer science, Cognitive Neuroscience, Computer Vision and Pattern Recognition (cs.CV), Autonomous vehicles, Computer Science - Computer Vision and Pattern Recognition, Evolutionary algorithm, Convolutional Neural Network, Context (language use), 02 engineering and technology, computer.software_genre, Machine Learning (cs.LG), 020901 industrial engineering & automation, Artificial Intelligence, Header, 0202 electrical engineering, electronic engineering, information engineering, Cluster analysis, Ensemble forecasting, Deep learning, Anchor optimization, Computer Science Applications, Artificial Intelligence (cs.AI), Benchmark (computing), Key (cryptography), Convolutional neural networks, 020201 artificial intelligence & image processing, Data mining, computer
Abstract: Object detection has been one of the most active topics in computer vision in recent years. Recent works have mainly focused on pushing the state-of-the-art in the general-purpose COCO benchmark. However, the use of such detection frameworks in specific applications such as autonomous driving is still an area to be addressed. This study presents an enhanced 2D object detector based on Faster R-CNN that is better suited for the context of autonomous vehicles. Two main aspects are improved: the anchor generation procedure and the performance drop in minority classes. The default uniform anchor configuration is not suitable in this scenario due to the perspective projection of the vehicle cameras. Therefore, we propose a perspective-aware methodology that divides the image into key regions via clustering and uses evolutionary algorithms to optimize the base anchors for each of them. Furthermore, we add a module that enhances the precision of the second-stage header network by including the spatial information of the candidate regions proposed in the first stage. We also explore different reweighting strategies to address the foreground-foreground class imbalance, showing that the use of a reduced version of focal loss can significantly improve the detection of difficult and underrepresented objects in two-stage detectors. Finally, we design an ensemble model to combine the strengths of the different learning strategies. Our proposal is evaluated with the Waymo Open Dataset, which is the most extensive and diverse to date. The results demonstrate an average accuracy improvement of 6.13% mAP when using the best single model, and of 9.69% mAP with the ensemble. The proposed modifications over the Faster R-CNN do not increase computational cost and can easily be extended to optimize other anchor-based detection frameworks. Ministerio de Ciencia, Innovación y Universidades TIN2017-88209-C2; Junta de Andalucía US-1263341; Junta de Andalucía P18-RT-2778
Published: 2021
Full Text: View/download PDF
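
Note: the abstract above mentions a "reduced version of focal loss" for class imbalance but does not define it. As a point of reference only, the following is a minimal sketch of the standard focal loss for a single example, which down-weights well-classified examples by the factor (1 - p_t)^gamma. The function names, the default `gamma`, and the optional per-class `alpha` list are illustrative assumptions, not the authors' implementation.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Standard focal loss for one example (illustrative, not the paper's variant).

    probs  : list of predicted class probabilities
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 gives plain cross-entropy
    alpha  : optional list of per-class weights (assumed, hypothetical)
    """
    # Clamp to avoid log(0).
    p_t = min(max(probs[target], 1e-12), 1.0)
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(probs, target):
    """Cross-entropy expressed as focal loss with gamma = 0."""
    return focal_loss(probs, target, gamma=0.0)
```

With `gamma > 0`, an example predicted with high confidence contributes far less to the loss than under cross-entropy, so hard and under-represented examples dominate training.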

2. MRQAR: A generic MapReduce framework to discover quantitative association rules in big data problems

Author: José C. Riquelme-Santos, Jesús Alcalá-Fdez, D. Martín, Francisco Herrera, María Martínez-Ballesteros, and Diego García-Gil
Subjects: Information Systems and Management, Association rule learning, business.industry, Computer science, media_common.quotation_subject, Big data, Evolutionary algorithm, 02 engineering and technology, computer.software_genre, Field (computer science), Management Information Systems, Task (project management), Set (abstract data type), Artificial Intelligence, 020204 information systems, Spark (mathematics), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Quality (business), Data mining, business, computer, Software, media_common
Abstract: Many algorithms have emerged in recent years to address the discovery of quantitative association rules from datasets. However, this task is becoming a challenge because the processing power of most existing techniques is not enough to handle the large amount of data generated nowadays. These vast amounts of data are known as Big Data. A number of previous studies have focused on mining boolean or nominal association rules from Big Data problems. Nevertheless, the data in real-world applications usually consist of quantitative values, and designing data mining algorithms able to extract quantitative association rules presents a challenge to workers in this research field. Although classical methods to discover boolean or nominal association rules can be found in the most well-known repositories of Big Data algorithms, such repositories do not provide methods to discover quantitative association rules. Indeed, no methodologies have been proposed in the literature without prior discretization in Big Data. Hence, this work proposes MRQAR, a new generic parallel framework to discover quantitative association rules in large amounts of data, designed following the MapReduce paradigm using Apache Spark. MRQAR performs an incremental learning able to run any sequential quantitative association rule algorithm in Big Data problems without needing to redesign such algorithms. As a case study, we have integrated the multiobjective evolutionary algorithm MOPNAR into MRQAR to validate the generic MapReduce framework proposed in this work. The results obtained in the experimental study performed on five Big Data problems prove the capability of MRQAR to obtain a reduced set of high-quality rules in reasonable time.
Published: 2018
Full Text: View/download PDF

3. On the evolutionary weighting of neighbours and features in the k-nearest neighbour rule

Author: Jorge García-Gutiérrez, José C. Riquelme-Santos, Daniel Mateos-García, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, and Universidad de Sevilla. TIC-254: Data Science and Big Data Lab
Subjects: 0209 industrial biotechnology, business.industry, Computer science, Cognitive Neuroscience, Neighbours weighting, Feature weighting, Pattern recognition, 02 engineering and technology, computer.software_genre, Evolutionary computation, Computer Science Applications, Weighting, ComputingMethodologies_PATTERNRECOGNITION, 020901 industrial engineering & automation, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Data mining, K nearest neighbour, business, Evolutionary Computation, Classifier (UML), computer
Abstract: This paper presents an evolutionary method for modifying the behaviour of the k-Nearest-Neighbour classifier (kNN) called Simultaneous Weighting of Attributes and Neighbours (SWAN). Unlike other weighting methods, SWAN is able to adjust both the contribution of the neighbours and the significance of the features of the data. The optimization process focuses on the search of two real-valued vectors. One of them represents the votes of neighbours, and the other one represents the weight of each feature. The synergy between the two sets of weights found in the optimization process helps to significantly improve the classification accuracy. The results on 35 datasets from the UCI repository suggest that SWAN statistically outperforms the other weighted kNN methods.
Published: 2019
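
Note: the abstract above describes two real-valued vectors, one holding the vote of each neighbour and one holding the weight of each feature. The sketch below shows how such vectors could be applied at prediction time in a kNN classifier. It assumes a weighted Euclidean distance and rank-based votes; the function name and both assumptions are illustrative, and the evolutionary search that SWAN uses to find the vectors is not shown.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, feature_w, vote_w):
    """Classify `query` using feature-weighted distances and rank-weighted votes.

    feature_w[j] scales the contribution of feature j to the distance.
    vote_w[i] is the vote cast by the i-th nearest neighbour; k = len(vote_w).
    Both vectors are supplied by the caller (hypothetical values here).
    """
    def dist(a, b):
        # Weighted Euclidean distance (an assumption for this sketch).
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(feature_w, a, b)))

    # Rank training instances by distance to the query.
    ranked = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))

    # Accumulate the vote of each of the k nearest neighbours per class.
    votes = defaultdict(float)
    for rank, idx in enumerate(ranked[:len(vote_w)]):
        votes[train_y[idx]] += vote_w[rank]
    return max(votes, key=votes.get)
```

Setting all entries of `feature_w` and `vote_w` to 1 reduces this to classical majority-vote kNN.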

4. Data Science and Big Data in Energy Forecasting

Author: José C. Riquelme, Francisco Martínez-Álvarez, and Alicia Troncoso
Subjects: Control and Optimization, Wind power, Renewable Energy, Sustainability and the Environment, Computer science, business.industry, lcsh:T, 020209 energy, Big data, Energy Engineering and Power Technology, Energy forecasting, forecasting, 02 engineering and technology, data mining, Data science, lcsh:Technology, big data, 0202 electrical engineering, electronic engineering, information engineering, Relevance (information retrieval), Electrical and Electronic Engineering, time series, business, Engineering (miscellaneous), Energy (signal processing), Energy (miscellaneous), energy
Abstract: This editorial summarizes the performance of the special issue entitled Data Science and Big Data in Energy Forecasting, which was published at MDPI’s Energies journal. The special issue took place in 2017 and accepted a total of 13 papers from 7 different countries. Electrical, solar and wind energy forecasting were the most analyzed topics, introducing new methods with applications of utmost relevance.
Published: 2018

5. Merging subsets of attributes to improve a hybrid consistency-based filter: a case of study in product unit neural networks

Author: Antonio J. Tallón-Ballesteros, Roberto Ruiz, and José C. Riquelme
Subjects: Artificial neural network, Computer science, business.industry, Evolutionary algorithm, Pattern recognition, Feature selection, 02 engineering and technology, computer.software_genre, Human-Computer Interaction, Artificial Intelligence, Filter (video), Consistency (statistics), 020204 information systems, Metric (mathematics), 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), 020201 artificial intelligence & image processing, Data mining, Artificial intelligence, business, computer, Software, Evolutionary programming
Abstract: This paper presents a quality enhancement of the selected features by a hybrid filter-based jointly on feature ranking and feature subset selection FR-FSS using a consistency-based measure via merging new features which are obtained applying other FR-FSS evaluated with a correlation metric. The goal is to overcome the accuracy of a neural network classifier containing product units as hidden nodes combined with a feature selection pre-processing step by means of a single consistency-based FR-FSS filter. Neural models are trained with a refined evolutionary programming approach called two-stage evolutionary algorithm. The experimentation has been carried out in eight complex classification problems, seven out of them from UCI University of California at Irvine repository and one real-world problem, with high test error rates around 20% with powerful classifiers such as 1-nearest neighbour or C4.5. Non-parametric statistical tests revealed that the new proposal significantly improves the accuracy of the neural models.
Published: 2016
Full Text: View/download PDF

6. An evolutionary voting for k-nearest neighbours

Author: José C. Riquelme-Santos, Jorge García-Gutiérrez, Daniel Mateos-García, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: 0209 industrial biotechnology, media_common.quotation_subject, Closeness, Weighted voting, Value (computer science), Evolutionary computation, 02 engineering and technology, computer.software_genre, Nearest-neigbour, 020901 industrial engineering & automation, Artificial Intelligence, Voting, 0202 electrical engineering, electronic engineering, information engineering, K nearest neighbour, Mathematics, media_common, General Engineering, Nearest neighbour, Computer Science Applications, Constraint (information theory), ComputingMethodologies_PATTERNRECOGNITION, 020201 artificial intelligence & image processing, Data mining, computer, Algorithm
Abstract: We optimize the voting system for the k nearest neighbours. We use evolutionary computation. We study the influence of the closeness of neighbours on the search process. The results are statistically validated. This work presents an evolutionary approach to modify the voting system of the k-nearest neighbours (kNN) rule, which we call EvoNN. Our approach results in a real-valued vector which provides the optimal relative contribution of the k-nearest neighbours. We compare two possible versions of our algorithm. One of them (EvoNN1) introduces a constraint on the resulting real-valued vector where the greater value is assigned to the nearest neighbour. The second version (EvoNN2) does not include any particular constraint on the order of the weights. We compare both versions with classical kNN and 4 other weighted variants of the kNN on 48 datasets of the UCI repository. Results show that EvoNN1 outperforms EvoNN2 and statistically obtains better results than the rest of the compared methods.
Published: 2016
Full Text: View/download PDF

7. Improving a multi-objective evolutionary algorithm to discover quantitative association rules

Author: José C. Riquelme, Alicia Troncoso, Francisco Martínez-Álvarez, and María Martínez-Ballesteros
Subjects: Association rule learning, business.industry, Evolutionary algorithm, Sorting, 02 engineering and technology, Machine learning, computer.software_genre, Multi-objective optimization, Distance measures, Evolutionary computation, Human-Computer Interaction, Set (abstract data type), Artificial Intelligence, Hardware and Architecture, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, sort, 020201 artificial intelligence & image processing, Artificial intelligence, Data mining, business, computer, Software, Information Systems, Mathematics
Abstract: This work aims at correcting flaws existing in multi-objective evolutionary schemes to discover quantitative association rules, specifically those based on the well-known non-dominated sorting genetic algorithm-II (NSGA-II). In particular, a methodology is proposed to find the most suitable configurations based on the set of objectives to optimize and distance measures to rank the non-dominated solutions. First, several quality measures are analyzed to select the best set of them to be optimized. Furthermore, different strategies are applied to replace the crowding distance used by NSGA-II to sort the solutions for each Pareto-front since such distance is not suitable for handling many-objective problems. The proposed enhancements have been integrated into the multi-objective algorithm called MOQAR. Several experiments have been carried out to assess the algorithm's performance by using different configuration settings, and the best ones have been compared to other existing algorithms. The results obtained show a remarkable performance of MOQAR in terms of quality measures.
Published: 2015
Full Text: View/download PDF
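
Note: the abstract above refers to the crowding distance that NSGA-II uses to sort solutions within a Pareto front, which the paper proposes to replace. For reference, this is a minimal sketch of that baseline measure as it is commonly defined; the function name and the tuple-based representation of objective vectors are assumptions for illustration, and the replacement strategies studied in the paper are not shown.

```python
def crowding_distance(front):
    """Crowding distance of each solution in one Pareto front.

    front : list of objective vectors (tuples of equal length).
    Returns a list of distances in the same order as `front`.
    """
    n = len(front)
    if n == 0:
        return []
    dist = [0.0] * n
    n_objectives = len(front[0])
    for k in range(n_objectives):
        # Sort solution indices by the k-th objective.
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = front[order[0]][k], front[order[-1]][k]
        # Boundary solutions are always kept: infinite distance.
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue  # no spread on this objective
        # Interior solutions: normalized gap between their two neighbours.
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][k] - front[order[pos - 1]][k]
            dist[order[pos]] += gap / (hi - lo)
    return dist
```

Because each objective adds its own normalized gap, the measure becomes less discriminating as the number of objectives grows, which is consistent with the abstract's remark about many-objective problems.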

8. A comparison of machine learning regression techniques for LiDAR-derived estimation of forest variables

Author: Jorge García-Gutiérrez, Francisco Martínez-Álvarez, Alicia Troncoso, and José C. Riquelme
Subjects: Artificial neural network, business.industry, Computer science, Cognitive Neuroscience, Gaussian, Feature selection, Machine learning, computer.software_genre, Regression, Computer Science Applications, Random forest, Support vector machine, symbols.namesake, Lidar, Artificial Intelligence, Linear regression, symbols, Artificial intelligence, Data mining, business, computer
Abstract: Light Detection and Ranging (LiDAR) is a remote sensor able to extract three-dimensional information. Environmental models in forest areas have been benefited by the use of LiDAR-derived information in the last years. A multiple linear regression (MLR) with previous stepwise feature selection is the most common method in the literature to develop those models. MLR defines the relation between the set of field measurements and the statistics extracted from a LiDAR flight. Machine learning has emerged as a suitable tool to improve classic stepwise MLR results on LiDAR. Unfortunately, few studies have been proposed to compare the quality of the multiple machine learning approaches. This paper presents a comparison between the classic MLR-based methodology and regression techniques in machine learning (neural networks, support vector machines, nearest neighbour, ensembles such as random forests) with special emphasis on regression trees. The selected techniques are applied to real LiDAR data from two areas in the province of Lugo (Galizia, Spain). The results confirm that classic MLR is outperformed by machine learning techniques and concretely, our experiments suggest that Support Vector Regression with Gaussian kernels statistically outperforms the rest of the techniques.
Published: 2015
Full Text: View/download PDF

9. Local models-based regression trees for very short-term wind speed prediction

Author: José C. Riquelme, Sancho Salcedo-Sanz, C. Casanova-Mateo, Alicia Troncoso, Luis Prieto, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: Engineering, Artificial neural network, Renewable Energy, Sustainability and the Environment, business.industry, Astrophysics::High Energy Astrophysical Phenomena, Computation, regression trees, computer.software_genre, Wind speed prediction, Wind speed, Regression, Term (time), Support vector machine, Very short-term forecasting horizon, Data mining, business, computer, Simulation
Abstract: This paper evaluates the performance of different types of Regression Trees (RTs) in a real problem of very short-term wind speed prediction from measuring data in wind farms. RT is a solidly established methodology that, contrary to other soft-computing approaches, has been under-explored in problems of wind speed prediction in wind farms. In this paper we comparatively evaluate eight different types of RT algorithms, and we show that they are able to obtain excellent results in real problems of very short-term wind speed prediction, improving existing classical and soft-computing approaches such as multi-linear regression approaches, different types of neural networks and support vector regression algorithms in this problem. We also show that RTs have a very small computation time, which allows the retraining of the algorithms whenever new wind speed data are collected from the measuring towers. Ministerio de Ciencia y Tecnología ECO2010-22065-C03-02; Ministerio de Ciencia y Tecnología TIN2011-28956-C02; Junta de Andalucía P12-TIC-1728; Universidad Pablo de Olavide APPB813097
Published: 2015
Full Text: View/download PDF

10. An approach to validity indices for clustering techniques in Big Data

Author: José Cristóbal Riquelme Santos, María Martínez-Ballesteros, Jorge García-Gutiérrez, José María Luna-Romera, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla. TIC-254: Data Science and Big Data Lab, and Ministerio de Economía y Competitividad (MINECO). España
Subjects: Big Data, Computer science, business.industry, Correlation clustering, Big data, Computational intelligence, 02 engineering and technology, computer.software_genre, Clustering, Clustering validity indices, Artificial Intelligence, 020204 information systems, Spark (mathematics), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data mining, Intra-cluster distance, Cluster analysis, business, computer, Data objects
Abstract: Clustering analysis is one of the most used Machine Learning techniques to discover groups among data objects. Some clustering methods require the number of clusters into which the data is going to be partitioned. There exist several cluster validity indices that help us to approximate the optimal number of clusters of the dataset. However, such indices are not suitable to deal with Big Data due to its size limitation and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low computational time. Our indices are based on redefinitions of traditional indices by simplifying the intra-cluster distance calculation. Two types of tests have been carried out over 28 synthetic datasets to analyze the performance of the proposed indices. First, we test the indices with small and medium size datasets to verify that our indices have a similar effectiveness to the traditional ones. Subsequently, tests on datasets of up to 11 million records and 20 features have been executed to check their efficiency. The results show that both indices can handle Big Data in a very low computational time with an effectiveness similar to the traditional indices using the Apache Spark framework. Ministerio de Economía y Competitividad TIN2014-55894-C2-1-R
Published: 2018

11. SmartFD: A Real Big Data Application for Electrical Fraud Detection

Author: Francisco Martínez-Álvarez, J. Tejedor, J. A. Fábregas, Alicia Troncoso, José C. Riquelme, A. Arcos, David Gutiérrez-Avilés, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla. Departamento de Organización Industrial y Gestión de Empresas I, and Ministerio de Economía y Competitividad (MINECO). España
Subjects: Consumption (economics), Big Data, Computer science, business.industry, Sensors, 020209 energy, Big data, 02 engineering and technology, computer.software_genre, Missing data, Classification, Field (computer science), Production planning, Fraud detection, 0202 electrical engineering, electronic engineering, information engineering, Data deduplication, Data mining, Cluster analysis, Raw data, business, computer
Abstract: The main objective of this paper is the application of big data analytics to a real case in the field of smart electric networks. Smart meters are not only elements to measure consumption, but they also constitute a network of millions of sensors in the electricity network. These sensors provide a huge amount of data that, once analyzed, can lead to significant advances for society. In this way, tools are being developed in order to reach certain goals, such as obtaining a better consumption estimation (which would imply a better production planning), finding better rates based on the time discrimination or the contracted power, or minimizing the non-technical losses in the network, whose actual costs are eventually paid by end-consumers, among others. In this work, real data from Spanish consumers have been analyzed to detect fraud in consumption. First, 1 TB of raw data was preprocessed in an HDFS-Spark infrastructure. Second, data duplication and outliers were removed, and missing values handled with specific big data algorithms. Third, customers were characterized by means of clustering techniques in different scenarios. Finally, several key factors in fraud consumption were found. Very promising results were achieved, verging on 80% accuracy. Ministerio de Economía y Competitividad TIN2014-55894-C2-R; Ministerio de Economía y Competitividad TIN2017-88209-C2-R
Published: 2018

12. Enhancing the scalability of a genetic algorithm to discover quantitative association rules in large-scale datasets

Author: Jaume Bacardit, José C. Riquelme, Alicia Troncoso, and María Martínez-Ballesteros
Subjects: Association rule learning, business.industry, Computer science, Quality control and genetic algorithms, Evolutionary algorithm, Machine learning, computer.software_genre, Multi-objective optimization, Computer Science Applications, Theoretical Computer Science, Task (project management), Computational Theory and Mathematics, Artificial Intelligence, Scalability, Genetic algorithm, Artificial intelligence, Data mining, business, Representation (mathematics), computer, Software
Abstract: Association rule mining is a well-known methodology to discover significant and apparently hidden relations among attributes in a subspace of instances from datasets. Genetic algorithms have been extensively used to find interesting association rules. However, the rule-matching task of such techniques usually requires high computational and memory requirements. The use of efficient computational techniques has become a task of the utmost importance due to the high volume of generated data nowadays. Hence, this paper aims at improving the scalability of quantitative association rule mining techniques based on genetic algorithms to handle large-scale datasets without quality loss in the results obtained. For this purpose, a new representation of the individuals, new genetic operators and a windowing-based learning scheme are proposed to achieve successfully such challenging task. Specifically, the proposed techniques are integrated into the multi-objective evolutionary algorithm named QARGA-M to assess their performances. Both the standard version and the enhanced one of QARGA-M have been tested in several datasets that present different number of attributes and instances. Furthermore, the proposed methodologies have been integrated into other existing techniques based in genetic algorithms to discover quantitative association rules. The comparative analysis performed shows significant improvements of QARGA-M and other existing genetic algorithms in terms of computational costs without losing quality in the results when the proposed techniques are applied.
Published: 2015
Full Text: View/download PDF

13. Application of the Weighted Nearest Neighbor Method to Power System Forecasting Problems

Author: Antonio Gómez-Expósito, Alicia Troncoso, Jesús M. Riquelme-Santos, Catalina Gómez-Quiles, José L. Martínez-Ramos, and José C. Riquelme
Subjects: Engineering, Energy demand, Artificial neural network, business.industry, Nearest neighbour algorithm, computer.software_genre, Price prediction, Electric power system, Pattern recognition (psychology), Electricity market, Data mining, business, computer, Energy (signal processing)
Abstract: This chapter describes a forecasting methodology based on the weighted nearest neighbors (WNN) technique. This technique provides a very simple approach to forecast power system variables characterized by daily and weekly repetitive patterns, such as energy demand and prices. Three case studies are used in the chapter to illustrate the potential of the WNN method: the hourly energy demand in the Spanish power system; the hourly marginal prices of the day-ahead Spanish electricity market; and the hourly demand of a particular customer. Recently, data mining techniques based on the k-nearest neighbors (kNN) method have been applied to the next-day load forecasting problem. In the last few years, machine learning techniques, such as artificial neural networks (ANNs), have been applied to energy price prediction owing to their relatively good performance in load forecasting and load pattern recognition. The chapter computes some of the prediction errors to assess the performance of the WNN and competing forecasting methodologies.
Published: 2017
Full Text: View/download PDF
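
Note: the abstract above describes forecasting repetitive series with weighted nearest neighbors but does not give the weighting scheme. The following is a minimal sketch of the general idea for a one-step forecast: find the k past windows most similar to the latest one and average the values that followed them. The inverse-distance weights, the function name, and the parameters `window` and `k` are assumptions for illustration, not the chapter's exact method.

```python
import math

def wnn_forecast(series, window, k):
    """One-step forecast from the k past windows most similar to the latest one.

    series : list of historical values, oldest first
    window : number of most recent values used as the matching pattern
    k      : number of nearest past windows to combine
    """
    recent = series[-window:]
    candidates = []
    # Every candidate window must be followed by a known value.
    for start in range(len(series) - window):
        past = series[start:start + window]
        d = math.dist(past, recent)          # Euclidean distance (Python 3.8+)
        candidates.append((d, series[start + window]))

    # Keep the k closest windows.
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]

    # Inverse-distance weights (assumed); epsilon guards against division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
```

For hourly demand or price series, `window` would typically span one or more full days so that the daily pattern mentioned in the abstract is captured.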

14. Selecting the best measures to discover quantitative association rules

Author: Francisco Martínez-Álvarez, José C. Riquelme, María Martínez-Ballesteros, and Alicia Troncoso
Subjects: Fitness function, Association rule learning, business.industry, Cognitive Neuroscience, media_common.quotation_subject, Evolutionary algorithm, Function (mathematics), Machine learning, computer.software_genre, Computer Science Applications, Variety (cybernetics), Set (abstract data type), Artificial Intelligence, Principal component analysis, Quality (business), Artificial intelligence, Data mining, business, computer, media_common, Mathematics
Abstract: The majority of the existing techniques to mine association rules typically use the support and the confidence to evaluate the quality of the rules obtained. However, these two measures may not be sufficient to properly assess their quality due to some inherent drawbacks they present. A review of the literature reveals that there exist many measures to evaluate the quality of the rules, but that the simultaneous optimization of all measures is complex and might lead to poor results. In this work, a principal components analysis is applied to a set of measures that evaluate quantitative association rules' quality. From this analysis, a reduced subset of measures has been selected to be included in the fitness function in order to obtain better values for the whole set of quality measures, and not only for those included in the fitness function. This is a general-purpose methodology and can, therefore, be applied to the fitness function of any algorithm. To validate if better results are obtained when using the function fitness composed of the subset of measures proposed here, the existing QARGA algorithm has been applied to a wide variety of datasets. Finally, a comparative analysis of the results obtained by means of the application of QARGA with the original fitness function is provided, showing a remarkable improvement when the new one is used.
Published: 2014
Full Text: View/download PDF
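
Note: the abstract above names support and confidence as the two measures most often used to evaluate association rules. The sketch below computes both for a quantitative rule whose antecedent and consequent are sets of numeric intervals. The dictionary-based rule representation and the function names are illustrative assumptions, not the representation used by QARGA.

```python
def covers(row, intervals):
    """True if every attribute in `intervals` falls inside its [low, high] range."""
    return all(lo <= row[attr] <= hi for attr, (lo, hi) in intervals.items())

def support_confidence(rows, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent.

    rows       : list of dicts mapping attribute name to numeric value
    antecedent : dict mapping attribute name to (low, high) interval
    consequent : dict mapping attribute name to (low, high) interval
    """
    n_antecedent = sum(covers(r, antecedent) for r in rows)
    n_both = sum(covers(r, antecedent) and covers(r, consequent) for r in rows)
    support = n_both / len(rows)
    # Confidence is undefined when the antecedent covers nothing; report 0.0.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence
```

A rule such as "age in [20, 35] implies income in [40, 60]" would be passed as `{"age": (20, 35)}` and `{"income": (40, 60)}`; the attribute names here are hypothetical.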

15. Discovering gene association networks by multi-objective evolutionary quantitative association rules

Author: María Martínez-Ballesteros, José C. Riquelme, Isabel A. Nepomuceno-Chamorro, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Ministerio de Ciencia y Tecnología (MCYT). España, and Junta de Andalucía
Subjects: Association rule learning, Computer Networks and Communications, Computer science, Microarray analysis techniques, Process (engineering), gene networks, Applied Mathematics, Evolutionary algorithm, Gene regulatory network, Microarray analysis, computer.software_genre, Expression (mathematics), Theoretical Computer Science, Computational Theory and Mathematics, Benchmark (computing), Gene chip analysis, quantitative association rules, Data mining, Multi-objective evolutionary algorithms, computer
Abstract: In the last decade, the interest in microarray technology has exponentially increased due to its ability to monitor the expression of thousands of genes simultaneously. The reconstruction of gene association networks from gene expression profiles is a relevant task and several statistical techniques have been proposed to build them. The problem lies in the process to discover which genes are more relevant and to identify the direct regulatory relationships among them. We developed a multi-objective evolutionary algorithm for mining quantitative association rules to deal with this problem. We applied our methodology, named GarNet, to a well-known microarray dataset of the yeast cell cycle. The performance analysis of GarNet was organized in three steps, similarly to the study performed by Gallo et al. GarNet outperformed the benchmark methods in most cases in terms of quality metrics of the networks, such as accuracy and precision, which were measured using the YeastNet database as the true network. Furthermore, the results were consistent with previous biological knowledge. Ministerio de Ciencia y Tecnología TIN2011-28956-C02-02; Junta de Andalucía P11-TIC-7528
Published: 2014
Full Text: View/download PDF

16. Low Dimensionality or Same Subsets as a Result of Feature Selection: An In-Depth Roadmap

Author: José C. Riquelme and Antonio J. Tallón-Ballesteros
Subjects: Selection (relational algebra), Computer science, business.industry, Dimensionality reduction, Class (philosophy), Feature selection, Pattern recognition, 02 engineering and technology, computer.software_genre, Correlation, Consistency (database systems), Feature (computer vision), 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data mining, Artificial intelligence, business, computer, Curse of dimensionality
Abstract: This paper addresses the situation that may arise after feature subset selection when only a small number of features is selected, or when different algorithms return the same solution. The data mining community has long worked under the assumption that meaningful attributes are either highly correlated with the class or represent a consistent subset, that is, one with no inconsistencies. We have analysed around a hundred widely varied data sets with fewer than one hundred attributes, no more than fifty thousand instances and fewer than fifty classes. In the first round we applied two different feature subset selection methods to collect figures on the reduced dimensionality. After that, we divided the data sets into groups according to the number of selected attributes. Next, we deepened the analysis in every category and added a new feature selection procedure. Finally, we assessed the performance of the original problem and the reduced subsets with four classifiers, providing some prospective directions.
Published: 2017
Full Text: View/download PDF

17. A study of the suitability of autoencoders for preprocessing data in breast cancer experimentation

Author: María Martínez-Ballesteros, Ricardo González-Cámpora, Jorge García-Gutiérrez, José María Luna-Romera, Laura Macías-García, José C. Riquelme-Santos, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla. TIC-254: Data Science and Big Data Lab, and Ministerio de Economía y Competitividad (MINECO). España
Subjects: 0301 basic medicine, Computer science, Health Informatics, Breast Neoplasms, computer.software_genre, Machine learning, Machine Learning, 03 medical and health sciences, Breast cancer, 0302 clinical medicine, medicine, Preprocessor, Humans, Preprocessing, Observer Variation, Artificial neural network, business.industry, Deep learning, Experimental data, Autoencoder, medicine.disease, Prognosis, Biomedical data, Regression, Computer Science Applications, 030104 developmental biology, 030220 oncology & carcinogenesis, Female, Data mining, Artificial intelligence, Neural Networks, Computer, business, Breast carcinoma, computer
Abstract: Breast cancer is the most common cause of cancer death in women. Today, post-transcriptional protein products of the genes involved in breast cancer can be identified by immunohistochemistry. However, this method has problems arising from the intra-observer and inter-observer variability in the assessment of pathologic variables, which may result in misleading conclusions. Using an optimal selection of preprocessing techniques may help to reduce observer variability. Deep learning has emerged as a powerful technique for any tasks related to machine learning such as classification and regression. The aim of this work is to use autoencoders (neural networks commonly used to feed deep learning architectures) to improve the quality of the data for developing immunohistochemistry signatures with prognostic value in breast cancer. Our testing on data from 222 patients with invasive non-special type breast carcinoma shows that an automatic binarization of experimental data after autoencoding could outperform other classical preprocessing techniques (such as human-dependent or automatic binarization only) when applied to the prognosis of breast cancer by immunohistochemical signatures. Ministerio de Economía y Competitividad TIN2014-55894-C2-1-R
Published: 2016

18. Discovery of motifs to forecast outlier occurrence in time series

Author: José C. Riquelme, Jesús S. Aguilar-Ruiz, Francisco Martínez-Álvarez, Alicia Troncoso, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Ministerio de Ciencia y Tecnología (MCYT). España, and Junta de Andalucía
Subjects: Training set, Motifs, Series (mathematics), Computer science, computer.software_genre, Identification (information), ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence, Pattern recognition, Time series forecasting, Signal Processing, Outlier, Pattern recognition (psychology), Outliers, Computer Vision and Pattern Recognition, Data mining, Time series, Metaheuristic, computer, Software
Abstract: The forecasting process of real-world time series has to deal with especially unexpected values, commonly known as outliers. Outliers in time series can lead to unreliable modeling and poor forecasts. Therefore, the identification of future outlier occurrence is an essential task in time series analysis to reduce the average forecasting error. The main goal of this work is to predict the occurrence of outliers in time series, based on the discovery of motifs. In this sense, motifs will be those pattern sequences preceding certain data marked as anomalous by the proposed metaheuristic in a training set. Once the motifs are discovered, if data to be predicted are preceded by any of them, such data are identified as outliers, and treated separately from the rest of regular data. The forecasting of outlier occurrence has been added as an additional step in an existing time series forecasting algorithm (PSF), which was based on pattern sequence similarities. Robust statistical methods have been used to evaluate the accuracy of the proposed approach regarding the forecasting of both occurrence of outliers and their corresponding values. Finally, the methodology has been tested on six electricity-related time series, in which most of the outliers were properly found and forecasted. Ministerio de Ciencia y Tecnología TIN2007-68084-C-00 Junta de Andalucía P07-TIC-02611
Published: 2011
Full Text: View/download PDF

19. Mining quantitative association rules based on evolutionary computation and its application to atmospheric pollution

Author: Alicia Troncoso, José C. Riquelme, Francisco Martínez-Álvarez, and María Martínez-Ballesteros
Subjects: Discretization, Series (mathematics), Association rule learning, Computer science, Evolutionary algorithm, computer.software_genre, Wind speed, Evolutionary computation, Computer Science Applications, Theoretical Computer Science, Noise, Computational Theory and Mathematics, Artificial Intelligence, Genetic algorithm, Data mining, computer, Software
Abstract: This research presents the mining of quantitative association rules based on evolutionary computation techniques. First, a real-coded genetic algorithm that extends the well-known binary-coded CHC algorithm has been projected to determine the intervals that define the rules without needing to discretize the attributes. The proposed algorithm is evaluated in synthetic datasets under different levels of noise in order to test its performance and the reported results are then compared to that of a multi-objective differential evolution algorithm, recently published. Furthermore, rules from real-world time series such as temperature, humidity, wind speed and direction of the wind, ozone, nitrogen monoxide and sulfur dioxide have been discovered with the objective of finding all existing relations between atmospheric pollution and climatological conditions.
Published: 2010
Full Text: View/download PDF

20. Finding Defective Software Modules by Means of Data Mining Techniques

Author: Jesús S. Aguilar-Ruiz, Daniel Rodriguez, José C. Riquelme, and Roberto Ruiz
Subjects: General Computer Science, Artificial neural network, business.industry, Computer science, Feature selection, computer.software_genre, Machine learning, Software modules, Software, Robustness (computer science), Software fault tolerance, Genetic algorithm, Artificial intelligence, Data mining, Electrical and Electronic Engineering, business, computer
Abstract: The characterization of defective modules in software engineering remains a challenge. In this work, we use data mining techniques to search for rules that indicate modules with a high probability of being defective. Using datasets from the PROMISE repository, we first applied feature selection to work only with those attributes from the datasets capable of predicting defective modules. Then, a genetic algorithm searches for rules characterising subgroups with a high probability of being defective. This algorithm overcomes the problem of unbalanced datasets where the number of non-defective samples in the dataset highly outnumbers the defective ones.
Published: 2009
Full Text: View/download PDF

21. An Approach to Silhouette and Dunn Clustering Indices Applied to Big Data in Spark

Author: José C. Riquelme-Santos, María Martínez-Ballesteros, José María Luna-Romera, Jorge García-Gutiérrez, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla. TIC-254: Data Science and Big Data Lab, and Ministerio de Economía y Competitividad (MINECO). España
Subjects: Big Data, Spark, Engineering, business.industry, 030503 health policy & services, Big data, Clustering index, Dunn, computer.software_genre, Silhouette, 03 medical and health sciences, 0302 clinical medicine, Spark (mathematics), 030212 general & internal medicine, Data mining, 0305 other medical science, business, Cluster analysis, computer
Abstract: K-Means and Bisecting K-Means clustering algorithms need the optimal number into which the dataset may be divided. Spark implementations of these algorithms include a method that is used to calculate this number. Unfortunately, this measurement presents a lack of precision because it only takes into account a sum of intra-cluster distances misleading the results. Moreover, this measurement has not been well-contrasted in previous researches about clustering indices. Therefore, we introduce a new Spark implementation of Silhouette and Dunn indices. These clustering indices have been tested in previous works. The results obtained show the potential of Silhouette and Dunn to deal with Big Data. Ministerio de Economía y Competitividad TIN2014-55894-C2-1-R
Published: 2016
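
Illustrative note (not part of the catalog record): the abstract above contrasts a sum of intra-cluster distances with the Silhouette index. The following is a minimal single-machine sketch of the standard silhouette coefficient in plain Python, assuming Euclidean distance; it is not the authors' Spark implementation, and the names `euclidean` and `silhouette_score` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # usual convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

Unlike a plain sum of intra-cluster distances, the score also accounts for separation from the nearest other cluster, so it lies in [-1, 1] and drops when points are assigned to the wrong group.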

22. A Preliminary Study of the Suitability of Deep Learning to Improve LiDAR-Derived Biomass Estimation

Author: Eduardo González-Ferreiro, José C. Riquelme-Santos, Jorge García-Gutiérrez, Daniel Mateos-García, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, and Universidad de Sevilla. TIC-254: Data Science and Big Data Lab
Subjects: LiDAR, 010504 meteorology & atmospheric sciences, Computer science, 0211 other engineering and technologies, Feature selection, 02 engineering and technology, Overfitting, computer.software_genre, 01 natural sciences, Linear regression, Preprocessor, 021101 geological & geomatics engineering, 0105 earth and related environmental sciences, Soft computing, Artificial neural network, business.industry, Deep learning, Remote sensing, Regression, Lidar, Data mining, Artificial intelligence, business, computer
Abstract: Light Detection and Ranging (LiDAR) is a remote sensor able to extract three-dimensional information about forest structure. Biophysical models have taken advantage of the use of LiDAR-derived information to improve their accuracy. Multiple Linear Regression (MLR) is the most common method in the literature regarding biomass estimation to define the relation between the set of field measurements and the statistics extracted from a LiDAR flight. Unfortunately, there exist open issues regarding the generalization of models from one area to another due to the lack of knowledge about noise distribution, relationship between statistical features and risk of overfitting. Autoencoders (a type of deep neural network) have been applied to improve the results of machine learning techniques in recent times by undoing possible data corruption processes and improving feature selection. This paper presents a preliminary comparison between the use of MLR with and without preprocessing by autoencoders on real LiDAR data from two areas in the province of Lugo (Galizia, Spain). The results show that autoencoders statistically increased the quality of MLR estimations by around 15–30%.
Published: 2016

23. Application of fuzzy logic and data mining techniques as tools for qualitative interpretation of acid mine drainage processes

Author: J. Aroba, M. L. de la Torre, José Manuel Andújar, Jose Antonio Grande, and José C. Riquelme
Subjects: Interpretation (logic), Computer tools, Computer science, General Engineering, Contrast (statistics), computer.software_genre, Acid mine drainage, Fuzzy logic, Set (abstract data type), Qualitative analysis, Earth and Planetary Sciences (miscellaneous), General Earth and Planetary Sciences, Environmental Chemistry, Data mining, Cluster analysis, computer, General Environmental Science, Water Science and Technology
Abstract: In this article, a set of clustering algorithms based on Fuzzy Logic and Data Mining are applied, yielding data in the form of linguistic rules and charts about the behaviour of the Tinto and Odiel river estuary (SW Spain) affected by Acid Mine Drainage (AMD). In order to provide researchers with no skills in data mining techniques with an easy and intuitive interpretation, we have developed a computer tool based on fuzzy logic that allows immediate qualitative analysis of the data from the estuary water chemical analyses, and serves as a contrast to functioning models previously proposed with classical statistics.
Published: 2007
Full Text: View/download PDF

24. A Survey on Data Mining Techniques Applied to Electricity-Related Time Series Forecasting

Author: Alicia Troncoso, José C. Riquelme, Francisco Martínez-Álvarez, and Gualberto Asencio-Cortés
Subjects: Engineering, Control and Optimization, lcsh:T, Renewable Energy, Sustainability and the Environment, business.industry, Electricity price, Energy Engineering and Power Technology, forecasting, data mining, computer.software_genre, lcsh:Technology, Variety (cybernetics), Work (electrical), Electricity, Data mining, Electrical and Electronic Engineering, Time series, time series, business, Engineering (miscellaneous), computer, energy, Energy (miscellaneous)
Abstract: Data mining has become an essential tool during the last decade to analyze large sets of data. The variety of techniques it includes and the successful results obtained in many application fields make this family of approaches powerful and widely used. In particular, this work explores the application of these techniques to time series forecasting. Although classical statistical-based methods provide reasonably good results, the results of applying data mining outperform those of classical ones. Hence, this work faces two main challenges: (i) to provide a compact mathematical formulation of the most commonly used techniques, (ii) to review the latest works of time series forecasting and, as case study, those related to electricity price and demand markets.
Published: 2015
Full Text: View/download PDF

25. Incremental wrapper-based gene selection from microarray data for cancer classification

Author: Jesús S. Aguilar-Ruiz, Roberto Ruiz, José C. Riquelme, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, and Comisión Interministerial de Ciencia y Tecnología (CICYT). España
Subjects: Microarray, Microarray analysis techniques, Computer science, Feature selection, Computational biology, computer.software_genre, Gene expression profiling, ComputingMethodologies_PATTERNRECOGNITION, classification, Artificial Intelligence, Gene selection, Signal Processing, Gene chip analysis, Microarray databases, Computer Vision and Pattern Recognition, Data mining, DNA microarray, Heuristics, microarrays, Gene, computer, Software
Abstract: Gene expression microarray is a rapidly maturing technology that provides the opportunity to assay the expression levels of thousands or tens of thousands of genes in a single experiment. We present a new heuristic to select relevant gene subsets in order to further use them for the classification task. Our method is based on the statistical significance of adding a gene from a ranked-list to the final subset. The efficiency and effectiveness of our technique are demonstrated through extensive comparisons with other representative heuristics. Our approach shows an excellent performance, not only at identifying relevant genes, but also with respect to the computational cost. CICYT TIN2004-00159 CICYT TIN2004-06689-C03-03
Published: 2006
Full Text: View/download PDF

26. Knowledge-Based Fast Evaluation for Evolutionary Learning

Author: Raúl Giráldez, Jesús S. Aguilar-Ruiz, and José C. Riquelme
Subjects: Computer science, business.industry, Supervised learning, Evolutionary algorithm, Interactive evolutionary computation, Decision rule, computer.software_genre, Machine learning, Evolutionary computation, Computer Science Applications, Human-Computer Interaction, Knowledge extraction, Human-based evolutionary computation, Control and Systems Engineering, Genetic algorithm, Data mining, Artificial intelligence, Electrical and Electronic Engineering, business, computer, Software, Information Systems
Abstract: The increasing amount of information available is encouraging the search for efficient techniques to improve the data mining methods, especially those which consume great computational resources, such as evolutionary computation. Efficacy and efficiency are two critical aspects for knowledge-based techniques. The incorporation of knowledge into evolutionary algorithms (EAs) should provide either better solutions (efficacy) or the equivalent solutions in shorter time (efficiency), regarding the same evolutionary algorithm without incorporating such knowledge. In this paper, we categorize and summarize some of the incorporation of knowledge techniques for evolutionary algorithms and present a novel data structure, called efficient evaluation structure (EES), which helps the evolutionary algorithm to provide decision rules using less computational resources. The EES-based EA is tested and compared to another EA system and the experimental results show the quality of our approach, reducing the computational cost about 50%, maintaining the global accuracy of the final set of decision rules.
Published: 2005
Full Text: View/download PDF

27. A multi-scale smoothing kernel for measuring time-series similarity

Author: José C. Riquelme, Alicia Troncoso, Marta Arias, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Universitat Politècnica de Catalunya. Departament de Ciències de la Computació, and Universitat Politècnica de Catalunya. LARCA - Laboratori d'Algorísmia Relacional, Complexitat i Aprenentatge
Subjects: Cognitive Neuroscience, computer.software_genre, Kernel principal component analysis, Similarity, Artificial Intelligence, String kernel, Polynomial kernel, Machine learning, Aprenentatge automàtic, Data mining, Mathematics, Support vector machines, Distance, Computer Science Applications, Kernel, Kernel method, Kernel embedding of distributions, Variable kernel density estimation, Kernel (statistics), Radial basis function kernel, Informàtica::Intel·ligència artificial [Àrees temàtiques de la UPC], Mineria de dades, Algorithm, computer, Time-series classification
Abstract: In this paper a kernel for time-series data is introduced so that it can be used for any data mining task that relies on a similarity or distance metric. The main idea of our kernel is that it should recognize as highly similar time-series that are essentially the same but may be slightly perturbed from each other: for example, if one series is shifted with respect to the other or if it is slightly misaligned. Namely, our kernel tries to focus on the shape of the time-series and ignores small perturbations such as misalignments or shifts. First, a recursive formulation of the kernel directly based on its definition is proposed. Then it is shown how to efficiently compute the kernel using an equivalent matrix-based formulation. To validate the proposed kernel three experiments have been carried out. As an initial step, several synthetic datasets have been generated from UCR time-series repository and the KDD challenge of 2007 with the purpose of validating the kernel-derived distance over shifted time-series. Also, the kernel has been applied to the original UCR time-series to analyze its potential in time-series classification in conjunction with Support Vector Machines. Finally, two real-world applications related to ozone concentration in atmosphere and electricity demand have been considered. Ministerio de Ciencia y Tecnología TIN2011-27479-C04-03 Ministerio de Ciencia y Tecnología TIN2011-28956-C02 Generalitat de Catalunya 2009-SGR-1428 Junta de Andalucía P12-TIC-1728 Universidad Pablo de Olavide APPB813097 Unión Europea Pascal2 Network of Excellence FP7-ICT-216886 Generalitat de Catalunya BE-DGR2011
Published: 2015

28. Data set Editing by Ordered Projection

Author: José C. Riquelme, Miguel Toro, and Jesús S. Aguilar
Subjects: Reduction (complexity), Data set, Artificial Intelligence, Computer science, Preprocessor, Point (geometry), Computer Vision and Pattern Recognition, Data mining, Projection (set theory), Binary logarithm, computer.software_genre, computer, Theoretical Computer Science
Abstract: This paper presents a new approach to data set editing. The algorithm (EOP: Editing by Ordered Projection) has some interesting characteristics: important reduction of the number of examples from the database; lower computational cost (O(mn log n)) with respect to other typical algorithms due to the absence of distance calculations; conservation of the decision boundaries, especially from the point of view of the application of axis-parallel classifiers. The performance of EOP is analysed in two ways: percentage of reduction and classification. EOP has been compared to IB2, ENN and SHRINK concerning the percentage of reduction and the computational cost. In addition, we have analysed the accuracy of k-NN and C4.5 after applying the reduction techniques. An extensive empirical study using databases with continuous attributes from the UCI repository shows that EOP is a valuable preprocessing method for the later application of any axis-parallel learning algorithm.
Published: 2001
Full Text: View/download PDF

29. TriGen: A genetic algorithm to mine triclusters in temporal gene expression data

Author: David Gutiérrez-Avilés, José C. Riquelme, Cristina Rubio-Escudero, Francisco Martínez-Álvarez, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: Computer science, Microarray analysis techniques, Cognitive Neuroscience, Computational biology, computer.software_genre, Synthetic data, Computer Science Applications, genetic algorithms, Biclustering, Artificial Intelligence, Genetic algorithm, Gene expression, Data mining, time series, Cluster analysis, Tricluster, computer, Gene, Microarray data
Abstract: Analyzing microarray data represents a computational challenge due to the characteristics of these data. Clustering techniques are widely applied to create groups of genes that exhibit a similar behavior under the conditions tested. Biclustering emerges as an improvement of classical clustering since it relaxes the constraints for grouping genes to be evaluated only under a subset of the conditions and not under all of them. However, this technique is not appropriate for the analysis of longitudinal experiments in which the genes are evaluated under certain conditions at several time points. We present the TriGen algorithm, a genetic algorithm that finds triclusters of gene expression that take into account the experimental conditions and the time points simultaneously. We have used TriGen to mine datasets related to synthetic data, yeast (Saccharomyces cerevisiae) cell cycle and human inflammation and host response to injury experiments. TriGen has proved to be capable of extracting groups of genes with similar patterns in subsets of conditions and times, and these groups have shown to be related in terms of their functional annotations extracted from the Gene Ontology. Ministerio de Ciencia y Tecnología TIN2011-28956-C00 Ministerio de Ciencia y Tecnología TIN2009-13950 Junta de Andalucía TIC-7528
Published: 2014

30. A Comparative Study of Machine Learning Regression Methods on LiDAR Data: A Case Study

Author: José C. Riquelme, Jorge García-Gutiérrez, Francisco Martínez-Álvarez, and Alicia Troncoso
Subjects: Soft computing, Artificial neural network, Computer science, business.industry, Feature selection, computer.software_genre, Machine learning, Regression, Random forest, Support vector machine, Lidar, Linear regression, Data mining, Artificial intelligence, business, computer
Abstract: Light Detection and Ranging (LiDAR) is a remote sensor able to extract vertical information from sensed objects. LiDAR-derived information is nowadays used to develop environmental models for describing fire behaviour or quantifying biomass stocks in forest areas. A multiple linear regression (MLR) with previous stepwise feature selection is the most common method in the literature to develop LiDAR-derived models. MLR defines the relation between the set of field measurements and the statistics extracted from a LiDAR flight. Machine learning has recently received increasing attention as a way to improve classic MLR results. Unfortunately, few studies have compared the quality of the multiple machine learning approaches. This paper presents a comparison between the classic MLR-based methodology and common regression techniques in machine learning (neural networks, regression trees, support vector machines, nearest neighbour, and ensembles such as random forests). The selected techniques are applied to real LiDAR data from two areas in the province of Lugo (Galizia, Spain). The results show that support vector regression statistically outperforms the rest of the techniques when feature selection is applied. However, its performance is not statistically different from that of Random Forests when previous feature selection is skipped.
Published: 2014
Full Text: View/download PDF

31. Preliminary Comparison of Techniques for Dealing with Imbalance in Software Defect Prediction

Author: Israel Herraiz, José C. Riquelme, Rachel Harrison, Daniel Rodriguez, Javier Dolado, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: Computer science, business.industry, Defect Prediction, Sampling (statistics), Imbalanced data, Machine learning, computer.software_genre, Class (biology), Replication (computing), Data mining algorithm, Software bug, Data quality, Preprocessor, Data mining, Artificial intelligence, business, Data Quality, computer
Abstract: Imbalanced data is a common problem in data mining when dealing with classification problems, where samples of a class vastly outnumber other classes. In this situation, many data mining algorithms generate poor models as they try to optimize the overall accuracy and perform badly in classes with very few samples. Software Engineering data in general and defect prediction datasets are not an exception and in this paper, we compare different approaches, namely sampling, cost-sensitive, ensemble and hybrid approaches to the problem of defect prediction with different datasets preprocessed differently. We have used the well-known NASA datasets curated by Shepperd et al. There are differences in the results depending on the characteristics of the dataset and the evaluation metrics, especially if duplicates and inconsistencies are removed as a preprocessing step. Unión Europea ICEBERG 324356 MICYT TIN2007-68084-C02-02 MICYT TIN2013-46928-C3-2-R
Published: 2014

32. Improving the k-Nearest Neighbour Rule by an Evolutionary Voting Approach

Author: José C. Riquelme-Santos, Daniel Mateos-García, Jorge García-Gutiérrez, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: business.industry, Computer science, media_common.quotation_subject, Process (computing), Novelty, computer.software_genre, Machine learning, fuzzy kNN, Evolutionary computation, evolutionary computation, Voting, Artificial intelligence, Data mining, kNN voting, business, K nearest neighbour, Fuzzy knn, computer, media_common
Abstract: This work presents an evolutionary approach to modify the voting system of the k-Nearest Neighbours (kNN). The main novelty of this article lies in the optimization process of voting regardless of the distance of every neighbour. The real-valued vector calculated through the evolutionary process can be seen as the relative contribution of every neighbour to select the label of an unclassified example. We have tested our approach on 30 datasets of the UCI repository and results have been compared with those obtained from six other variants of the kNN predictor, resulting in a realistic improvement statistically supported.
Published: 2014
Full Text: View/download PDF
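
Illustrative note (not part of the catalog record): the abstract above describes a real-valued vector giving the relative contribution of every neighbour to the vote. A minimal sketch of how such a vector might be used at prediction time is shown below, assuming Euclidean distance for ranking and one weight per neighbour rank. The evolutionary search that produces the weights is not shown, and `weighted_knn_predict` and the example weights are hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, weights):
    """Predict a label for `query` by weighted voting among nearest neighbours.

    train   -- list of (features, label) pairs
    weights -- weights[i] is the vote of the i-th nearest neighbour;
               k is taken to be len(weights)
    """
    k = len(weights)
    # Rank training examples by distance to the query and keep the k nearest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for weight, (_, label) in zip(weights, neighbours):
        votes[label] += weight
    # The label with the largest accumulated vote wins.
    return max(votes, key=votes.get)
```

With uniform weights this reduces to ordinary majority-vote kNN; a non-uniform vector lets, for example, the closest neighbour outvote two farther ones.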

33. Data Mining Methods Applied to a Digital Forensics Task for Supervised Machine Learning

Author: José C. Riquelme and Antonio J. Tallón-Ballesteros
Subjects: Artificial neural network, Computer science, business.industry, Digital forensics, Decision tree, Machine learning, computer.software_genre, Cross-validation, Task (project management), Bayes' theorem, Artificial intelligence, Data mining, Focus (optics), business, computer, Kappa
Abstract: Digital forensics research includes several stages. Once we have collected the data, the final goal is to obtain a model in order to predict the output with unseen data. We focus on supervised machine learning techniques. This chapter performs an experimental study on a forensics data task for multi-class classification including several types of methods such as decision trees, Bayes classifiers, rule-based classifiers, artificial neural networks and nearest-neighbour methods. The classifiers have been evaluated with two performance measures: accuracy and Cohen’s kappa. The experimental design followed has been a 4-fold cross validation with thirty repetitions for non-deterministic algorithms in order to obtain reliable results, averaging the results from 120 runs. A statistical analysis has been conducted in order to compare each pair of algorithms by means of t-tests using both the accuracy and Cohen’s kappa metrics.
Published: 2014
Full Text: View/download PDF
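
Illustrative note (not part of the catalog record): the 120 runs mentioned in the abstract above follow from 4 folds times 30 repetitions. A minimal sketch of such a repeated k-fold split is given below in plain Python. The reshuffling per repetition, the seed, and the name `repeated_kfold` are assumptions; the chapter's actual splitting procedure is not described in the abstract.

```python
import random


def repeated_kfold(n_samples, k=4, repetitions=30, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(repetitions):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for each repetition
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
            yield train, test
```

Averaging a metric over all yielded splits gives one score per algorithm based on k × repetitions runs, which is 120 with the defaults above.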

34. Tackling Ant Colony Optimization Meta-Heuristic as Search Method in Feature Subset Selection Based on Correlation or Consistency Measures

Author: José C. Riquelme and Antonio J. Tallón-Ballesteros
Subjects: Heuristic (computer science), business.industry, Ant colony optimization algorithms, Feature selection, Best-first search, Machine learning, computer.software_genre, Consistency (database systems), Feature (computer vision), Artificial intelligence, Data mining, business, computer, Selection (genetic algorithm), Mathematics, Statistical hypothesis testing
Abstract: This paper introduces the use of an ant colony optimization (ACO) algorithm, called Ant System, as a search method in two well-known feature subset selection methods based on correlation or consistency measures such as CFS (Correlation-based Feature Selection) and CNS (Consistency-based Feature Selection). ACO guides the search using a heuristic evaluator. Empirical results on twelve real-world classification problems are reported. Statistical tests have revealed that InfoGain is a very suitable heuristic for CFS or CNS feature subset selection methods with ACO acting as search method. The use of InfoGain is shown to be the significantly better heuristic over a range of classifiers. The results achieved by means of ACO-based feature subset selection with the suitable heuristic evaluator are better for most of the problems comparing with those obtained with CFS or CNS combined with Best First search.
Published: 2014
Full Text: View/download PDF

35. Search and Linguistic Description of Connected Regions In Quantitative Data

Author: José C. Riquelme and Miguel Toro
Subjects: business.industry, Volume (computing), Parameter space, computer.software_genre, Transformation (function), Qualitative analysis, Pattern recognition (psychology), Genetic algorithm, Data mining, Linguistic description, Artificial intelligence, business, computer, Natural language processing, Mathematics
Abstract: The aim of this paper is to summarize a great volume of quantitative knowledge in a qualitative model formed by linguistic rules. The initial information consists of the numeric and quantitative data that a real system supplies about its performance. First, connected regions that have a similar behaviour will be found, and later every region will be described by means of linguistic terms. A transformation of the parameter space is proposed in order to reduce the number of regions and, therefore, the number of rules.
Published: 1997
Full Text: View/download PDF

36. A Sensitivity Analysis for Quality Measures of Quantitative Association Rules

Author: Francisco Martínez-Álvarez, José C. Riquelme, María Martínez-Ballesteros, and Alicia Troncoso
Subjects: Fitness function, Association rule learning, Computer science, media_common.quotation_subject, Quality (business), Data mining, Sensitivity (control systems), computer.software_genre, computer, media_common
Abstract: There exist several fitness function proposals based on a combination of weighted objectives to optimize the discovery of association rules. Nevertheless, the measures used to assess the quality of association rules may differ depending on the values of such weights. Therefore, in such proposals the user’s decision when specifying the weights or coefficients of the optimized objectives is very important. Thus, this work presents an analysis of the sensitivity of several quality measures when the weights included in the fitness function of the existing QARGA algorithm are modified. Finally, a comparative analysis of the results obtained according to the weights setup is provided.
Published: 2013
Full Text: View/download PDF

37. A study of subgroup discovery approaches for defect prediction

Author: Daniel Rodriguez, Rachel Harrison, José C. Riquelme, Roberto Ruiz, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, and Ministerio de Ciencia e Innovación (MICIN). España
Subjects: Computer science, business.industry, Defect Prediction, Imbalanced datasets, media_common.quotation_subject, Rules, Software development, Context (language use), computer.software_genre, Machine learning, Computer Science Applications, Statistical classification, Software bug, Simple (abstract algebra), Subgroup discovery, Preprocessor, Quality (business), Artificial intelligence, Data mining, business, computer, Software, Information Systems, Eclipse, media_common
Abstract: Context: Although many papers have been published on software defect prediction techniques, machine learning approaches have yet to be fully explored. Objective: In this paper we suggest using a descriptive approach for defect prediction rather than the precise classification techniques that are usually adopted. This allows us to characterise defective modules with simple rules that can easily be applied by practitioners and deliver a practical (or engineering) approach rather than a highly accurate result. Method: We describe two well-known subgroup discovery algorithms, the SD algorithm and the CN2-SD algorithm, to obtain rules that identify defect prone modules. The empirical work is performed with publicly available datasets from the Promise repository and object-oriented metrics from an Eclipse repository related to defect prediction. Subgroup discovery algorithms mitigate against characteristics of datasets that hinder the applicability of classification algorithms and so remove the need for preprocessing techniques. Results: The results show that the generated rules can be used to guide testing effort in order to improve the quality of software development projects. Such rules can indicate metrics, their threshold values and relationships between metrics of defective modules. Conclusions: The induced rules are simple to use and easy to understand as they provide a description rather than a complete classification of the whole dataset. Thus this paper represents an engineering approach to defect prediction, i.e., an approach which is useful in practice, easily understandable and can be applied by practitioners. ICEBERG IAPP-2012-324356 MICINN TIN2011-28956-C02-00
Published: 2013

38. EVOR-STACK: A label-dependent evolutive stacking on remote sensing data fusion

Author: José C. Riquelme-Santos, Daniel Mateos-García, Jorge García-Gutiérrez, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: Computer science, Cognitive Neuroscience, Feature vector, Evolutionary algorithm, Feature weighting, computer.software_genre, Evolutionary computation, remote sensing, Artificial Intelligence, Label dependence, Ensembles, Riparian zone, Remote sensing, geography, geography.geographical_feature_category, Land use, business.industry, Hybrid artificial intelligence systems, Pattern recognition, Data fusion, Sensor fusion, Computer Science Applications, Weighting, Support vector machine, Lidar, Feature (computer vision), evolutionary computation, Data mining, Artificial intelligence, business, computer
Abstract: Land use and land cover (LULC) maps are remote sensing products that are used to classify areas into different landscapes. Data fusion for remote sensing is becoming an important tool to improve classical approaches. In addition, artificial intelligence techniques such as machine learning or evolutionary computation are often applied to improve the final LULC classification. In this paper, a hybrid artificial intelligence method based on an ensemble of multiple classifiers to improve LULC map accuracy is shown. The method works in two processing levels: first, an evolutionary algorithm (EA) for label-dependent feature weighting transforms the feature space by assigning different weights to every attribute depending on the class. Then a statistical raster from LIDAR and image data fusion is built following a pixel-oriented and feature-based strategy that uses a support vector machine (SVM) and a weighted k-NN restricted stacking. A classical SVM, the original restricted stacking (R-STACK) and the current improved method (EVOR-STACK) are compared. The results show that the evolutionary approach obtains the best results in the context of the real data from a riparian area in southern Spain.
Published: 2012

39. Searching for rules to detect defective modules: A subgroup discovery approach

Author: Jesús S. Aguilar-Ruiz, Daniel Rodriguez, José C. Riquelme, Roberto Ruiz, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, and Ministerio de Educación y Ciencia (MEC). España
Subjects: Information Systems and Management, Computer science, Imbalanced datasets, Rules, Machine learning, computer.software_genre, Evolutionary computation, Theoretical Computer Science, Task (project management), Artificial Intelligence, Redundancy (engineering), Software system, business.industry, Defect Prediction, Software development, Decision rule, Software metric, Computer Science Applications, Statistical classification, Control and Systems Engineering, Subgroup discovery, Artificial intelligence, Data mining, business, computer, Software
Abstract: Data mining methods in software engineering are becoming increasingly important as they can support several aspects of the software development life-cycle such as quality. In this work, we present a data mining approach to induce rules extracted from static software metrics characterising fault-prone modules. Due to the special characteristics of the defect prediction data (imbalance, inconsistency, redundancy) not all classification algorithms are capable of dealing with this task conveniently. To deal with these problems, Subgroup Discovery (SD) algorithms can be used to find groups of statistically different data given a property of interest. We propose EDER-SD (Evolutionary Decision Rules for Subgroup Discovery), an SD algorithm based on evolutionary computation that induces rules describing only fault-prone modules. The rules are a well-known model representation that can be easily understood and applied by project managers and quality engineers. Thus, rules can help them to develop software systems that can be justifiably trusted. Contrary to other approaches in SD, our algorithm has the advantage of working with continuous variables as the conditions of the rules are defined using intervals. We describe the rules obtained by applying our algorithm to seven publicly available datasets from the PROMISE repository, showing that they are capable of characterising subgroups of fault-prone modules. We also compare our results with three other well-known SD algorithms, and the EDER-SD algorithm performs well in most cases. Ministerio de Educación y Ciencia TIN2007-68084-C02-00 Ministerio de Educación y Ciencia TIN2010-21715-C02-01
Published: 2012

40. Fast feature selection aimed at high-dimensional data via hybrid-sequential-ranked searches

Author: José C. Riquelme, Miguel García-Torres, Jesús S. Aguilar-Ruiz, Roberto Ruiz, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: Clustering high-dimensional data, Computer science, business.industry, General Engineering, Feature selection, Pattern recognition, Filter (signal processing), computer.software_genre, Computer Science Applications, Ranking, Artificial Intelligence, Feature (machine learning), Data mining, Artificial intelligence, business, computer, Selection (genetic algorithm)
Abstract: We address the feature subset selection problem for classification tasks. We examine the performance of two hybrid strategies that directly search on a ranked list of features and compare them with two widely used algorithms, the fast correlation based filter (FCBF) and sequential forward selection (SFS). The proposed hybrid approaches provide the possibility of efficiently applying any subset evaluator, with a wrapper model included, to large and high-dimensional domains. The experiments performed show that our two strategies are competitive and can select a small subset of features without degrading the classification error or the advantages of the strategies under study.
Published: 2012

41. Inferring gene-gene associations from Quantitative Association Rules

Author: Isabel A. Nepomuceno-Chamorro, María Martínez-Ballesteros, and José C. Riquelme
Subjects: Relation (database), Association rule learning, Exponential growth, Computer science, Gene regulatory network, Evolutionary algorithm, Unsupervised learning, Data mining, computer.software_genre, computer, Gene, Evolutionary computation
Abstract: The microarray technique is able to monitor the change in concentration of RNA in thousands of genes simultaneously. Interest in this technique has grown exponentially in recent years, as have the difficulties in analyzing data from such experiments, which are characterized by a high number of genes to be analyzed in relation to the low number of experiments or samples available. Microarray experiments are generating datasets that can help in reconstructing gene networks. One of the most important problems in network reconstruction is finding, for each gene in the network, which genes can affect it and how. Association Rules are an unsupervised learning approach for relating attributes to each other. In this work we use Quantitative Association Rules in order to define interrelations between genes. These rules work with intervals on the attributes, without discretizing the data beforehand, and they are generated by a multi-objective evolutionary algorithm. In most cases the extracted rules confirm the existing knowledge about cell-cycle gene expression, while hitherto unknown relationships can be treated as new hypotheses.
Published: 2011
Full Text: View/download PDF

42. Improving the Accuracy of a Two-Stage Algorithm in Evolutionary Product Unit Neural Networks for Classification by Means of Feature Selection

Author: Roberto Ruiz, Antonio J. Tallón-Ballesteros, José C. Riquelme, and César Hervás-Martínez
Subjects: Artificial neural network, Empirical comparison, business.industry, Computer science, Nonparametric statistics, Feature selection, Machine learning, computer.software_genre, Reduction (complexity), Two stage algorithm, Product (mathematics), Artificial intelligence, Data mining, business, computer, Unit (ring theory)
Abstract: This paper introduces a methodology that improves the accuracy of a two-stage algorithm in evolutionary product unit neural networks for classification tasks by means of feature selection. A couple of filters have been taken into consideration to try out the proposal. The experimentation has been carried out on seven data sets from the UCI repository that report mean test error rates of about twenty percent or above with reference classifiers such as C4.5 or 1-NN. The study includes an overall empirical comparison between the models obtained with and without feature selection. Several classifiers have also been tested in order to illustrate the performance of the different filters considered. The results have been contrasted with nonparametric statistical tests and show that our proposal significantly improves the test accuracy of the previous models for the considered data sets. Moreover, the current proposal is much more efficient than a previous methodology developed by us; lastly, the reduction in the number of inputs is above fifty-five percent, on average.
Published: 2011
Full Text: View/download PDF

43. Analysis of Measures of Quantitative Association Rules

Author: María Martínez-Ballesteros and José C. Riquelme
Subjects: Association rule learning, business.industry, Computer science, media_common.quotation_subject, Evolutionary algorithm, computer.software_genre, Machine learning, Genetic algorithm, Quality (business), Artificial intelligence, Data mining, business, computer, media_common
Abstract: This paper presents an analysis of the relationships among different interestingness measures of the quality of association rules, as a first step to select the best objectives in order to develop a multi-objective algorithm. For this purpose, the discovery of association rules is based on evolutionary techniques. Specifically, a genetic algorithm has been used in order to mine quantitative association rules and determine the intervals on the attributes without discretizing the data beforehand. The algorithm has been applied to real-world climatological datasets based on Ozone and Earthquake data.
Published: 2011
Full Text: View/download PDF

44. Revisiting the Yeast Cell Cycle Problem with the Improved TriGen Algorithm

Author: David Gutiérrez-Avilés, José C. Riquelme, Cristina Rubio-Escudero, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: biology, Computer science, Microarray analysis techniques, Saccharomyces cerevisiae, temporary data, yeast, computer.software_genre, biology.organism_classification, Expression (mathematics), genetic algorithms, Biclustering, Genetic algorithm, Algorithm design, Data mining, DNA microarray, Cluster analysis, Algorithm, computer, microarrays
Abstract: Analyzing microarray data represents a computational challenge due to the characteristics of these data. Clustering techniques are widely applied to create groups of genes that exhibit a similar behavior under the conditions tested. Biclustering emerges as an improvement of classical clustering since it relaxes the constraints for grouping, allowing genes to be evaluated only under a subset of the conditions and not under all of them. However, this technique is not appropriate for the analysis of temporal microarray data in which the genes are evaluated under certain conditions at several time points. In previous work we presented the TriGen algorithm, a genetic algorithm that finds triclusters of gene expression that take into account the experimental conditions and the time points simultaneously, and applied it to the yeast (Saccharomyces cerevisiae) cell cycle problem. In this article we present some improvements on the genetic algorithm and we also present the results of applying the improved TriGen algorithm to the yeast cell cycle problem, where the goal is to identify all genes whose expression levels are regulated by the cell cycle.
Published: 2011

45. Computational Intelligence Techniques for Predicting Earthquakes

Author: José C. Riquelme, Alicia Troncoso, Antonio Morales-Esteban, and Francisco Martínez-Álvarez
Subjects: Association rule learning, business.industry, Computer science, Earthquake prediction, Computational intelligence, Machine learning, computer.software_genre, Regression, Temporal database, Tree (data structure), Artificial intelligence, Data mining, Natural disaster, business, computer, Piecewise linear model
Abstract: Nowadays, much effort is being devoted to developing techniques that forecast natural disasters in order to take precautionary measures. In this paper, the extraction of quantitative association rules and regression techniques are used to discover patterns which model the behavior of seismic temporal data to help in earthquake prediction. Thus, a simple method based on the k-smallest and k-greatest values is introduced for mining rules that attempt to explain the conditions under which an earthquake may happen. In addition, patterns are discovered by using a tree-based piecewise linear model. Results from seismic temporal data provided by the Spanish Geographical Institute are presented and discussed, showing a remarkable performance and the significance of the obtained results.
Published: 2011
Full Text: View/download PDF

46. An evolutionary algorithm to discover quantitative association rules in multidimensional time series

Author: Alicia Troncoso, María Martínez-Ballesteros, Francisco Martínez-Álvarez, José C. Riquelme, Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos, Ministerio de Ciencia y Tecnología (MCYT). España, and Junta de Andalucía
Subjects: Time series, Discretization, Series (mathematics), Association rule learning, Antecedent (logic), business.industry, Evolutionary algorithm, Computational intelligence, computer.software_genre, Machine learning, Evolutionary algorithms, Theoretical Computer Science, Set (abstract data type), Quantitative association rules, Genetic algorithm, Geometry and Topology, Data mining, Artificial intelligence, business, computer, Software, Mathematics
Abstract: An evolutionary approach for finding existing relationships among several variables of a multidimensional time series is presented in this work. The proposed model to discover these relationships is based on quantitative association rules. This algorithm, called QARGA (Quantitative Association Rules by Genetic Algorithm), uses a particular codification of the individuals that allows solving two basic problems. First, it does not perform a previous attribute discretization and, second, it is not necessary to set which variables belong to the antecedent or consequent. Therefore, it may discover all underlying dependencies among different variables. To evaluate the proposed algorithm three experiments have been carried out. As an initial step, several public datasets have been analyzed with the purpose of comparing with other existing evolutionary approaches. Also, the algorithm has been applied to synthetic time series (where the relationships are known) to analyze its potential for discovering rules in time series. Finally, a real-world multidimensional time series composed of several climatological variables has been considered. All the results show a remarkable performance of QARGA. Ministerio de Ciencia y Tecnología TIN2007-68084-C02-02 Junta de Andalucía P07-TIC-02611
Published: 2011
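
To make the notion of a quantitative association rule concrete, the sketch below evaluates one rule whose antecedent and consequent are intervals on numeric attributes, using the standard support and confidence measures. It is not QARGA itself: the abstract does not give the algorithm's encoding or fitness function, so the helper names, the rule, and the four toy records are assumptions invented for illustration.

```python
def covers(row, intervals):
    """True if every attribute named in `intervals` lies inside its [low, high] range."""
    return all(low <= row[attr] <= high for attr, (low, high) in intervals.items())

def support_confidence(rows, antecedent, consequent):
    """Support and confidence of the interval rule antecedent => consequent."""
    n_ante = sum(covers(r, antecedent) for r in rows)
    n_both = sum(covers(r, antecedent) and covers(r, consequent) for r in rows)
    support = n_both / len(rows)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Made-up records; the attribute names are hypothetical.
rows = [
    {"temp": 30.0, "ozone": 80.0},
    {"temp": 32.0, "ozone": 95.0},
    {"temp": 18.0, "ozone": 40.0},
    {"temp": 29.0, "ozone": 50.0},
]
antecedent = {"temp": (28.0, 35.0)}     # temp in [28, 35]
consequent = {"ozone": (75.0, 100.0)}   # ozone in [75, 100]

support, confidence = support_confidence(rows, antecedent, consequent)
print(support, confidence)
```

Because the rule is expressed directly as intervals, no prior discretization of the attributes is needed; an evolutionary search would adjust the interval bounds and the choice of attributes on each side of the rule.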

47. Evolutionary q-Gaussian Radial Basis Functions for Improving Prediction Accuracy of Gene Classification Using Feature Selection

Author: César Hervás-Martínez, Francisco Fernández-Navarro, Pedro Antonio Gutiérrez, Roberto Ruiz, José C. Riquelme, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: Radial basis function network, Mean squared error, business.industry, Computer science, Pattern recognition, Feature selection, computer.software_genre, q-Gaussian, Artificial Intelligence, Image processing and computer vision, Radial basis function, Artificial intelligence, Data mining, Computation by abstract devices, business, Classifier (UML), computer
Abstract: This paper proposes a Radial Basis Function Neural Network (RBFNN) which reproduces different Radial Basis Functions (RBFs) by means of a real parameter q, named q-Gaussian RBFNN. The architecture, weights and node topology are learnt through a Hybrid Algorithm (HA) with the iRprop+ algorithm as the local improvement procedure. In order to test its overall performance, an experimental study with four gene microarray datasets with two classes taken from bioinformatic and biomedical domains is presented. The Fast Correlation–Based Filter (FCBF) was applied in order to identify salient expression genes from thousands of genes in microarray data that can directly contribute to determining the class membership of each pattern. After different gene subsets were obtained, the proposed methodology was performed using the selected gene subsets as the new input variables. The results confirm that the q-Gaussian RBFNN classifier leads to promising improvement on accuracy.
Published: 2010

48. Label Dependent Evolutionary Feature Weighting for Remote Sensing Data

Author: Jorge García-Gutiérrez, José C. Riquelme-Santos, Daniel Mateos-García, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: Artificial neural network, Computer science, business.industry, Computation, Feature vector, Orthophoto, Feature weighting, Pattern recognition, Evolutionary computation, Land cover, Remote sensing, computer.software_genre, Weighting, Label dependence, Data mining, Artificial intelligence, business, Classifier (UML), computer
Abstract: Nearest neighbour (NN) is a very common classifier used to develop important remote sensing products like land use and land cover (LULC) maps. Evolutionary computation has often been used to obtain feature weights in order to improve the results of the NN. In this paper, a new algorithm based on evolutionary computation, called Label Dependent Feature Weighting (LDFW), is proposed. The LDFW method transforms the feature space by assigning different weights to every feature depending on each class. This multilevel feature weighting algorithm is tested on remote sensing data from fusion of sensors (LIDAR and orthophotography). The results show an improvement on the NN and resemble the results obtained with a neural network, which is the best classifier for the study area.
Published: 2010
Full Text: View/download PDF
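
The idea of label-dependent feature weighting for a nearest-neighbour classifier can be illustrated with the sketch below. It rests on assumptions the abstract does not confirm: that the distance to a training instance is a weighted Euclidean distance using the weight vector of that instance's class, and that the weights are given. Here they are set by hand, whereas the paper obtains them with an evolutionary algorithm. All names and values are hypothetical.

```python
import math

def ld_distance(x, train_x, label_weights):
    """Weighted Euclidean distance using the weights of the training instance's class."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(label_weights, x, train_x)))

def predict_1nn(x, train, weights):
    """train: list of (features, label); weights: dict mapping label -> feature weights."""
    best_label, best_dist = None, math.inf
    for feats, label in train:
        d = ld_distance(x, feats, weights[label])
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Toy two-feature data, invented for illustration.
train = [((0.0, 0.0), "water"), ((1.0, 1.0), "forest")]
query = (0.2, 0.9)

uniform = {"water": [1.0, 1.0], "forest": [1.0, 1.0]}
label_dep = {"water": [1.0, 0.0], "forest": [1.0, 1.0]}  # second feature ignored for "water"

print(predict_1nn(query, train, uniform))    # plain 1-NN
print(predict_1nn(query, train, label_dep))  # label-dependent weights
```

With uniform weights the query is closest to the "forest" instance; once the second feature is down-weighted for the "water" class only, the same query is assigned to "water", which shows how per-class weights reshape the feature space.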

49. A SVM and k-NN Restricted Stacking to Improve Land Use and Land Cover Classification

Author: Jorge García-Gutiérrez, Daniel Mateos-García, José C. Riquelme-Santos, and Universidad de Sevilla. Departamento de Lenguajes y Sistemas Informáticos
Subjects: geography, Image fusion, geography.geographical_feature_category, Land use, Pixel, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Context (language use), Land cover, computer.software_genre, Support vector machine, Database management, ComputingMethodologies_PATTERNRECOGNITION, Lidar, Artificial Intelligence, Computation by abstract devices, Data mining, Spatial analysis, computer, Riparian zone
Abstract: Land use and land cover (LULC) maps are remote sensing products that are used to classify areas into different landscapes. The newest techniques have been applied to improve the final LULC classification and most of them are based on SVM classifiers. In this paper, a new method based on a multiple classifiers ensemble to improve LULC map accuracy is shown. The method builds a statistical raster from LIDAR and image fusion data following a pixel-oriented strategy. Then, the pixels from a training area are used to build a SVM and k-NN restricted stacking taking into account the special characteristics of spatial data. A comparison between a SVM and the restricted stacking is carried out. The results of the tests show that our approach improves the results in the context of the real data from a riparian area of Huelva (Spain).
Published: 2010
Full Text: View/download PDF

50. Improving Time Series Forecasting by Discovering Frequent Episodes in Sequences

Author: José C. Riquelme, Alicia Troncoso, and Francisco Martínez-Álvarez
Subjects: Series (mathematics), Computer science, Outlier, Data mining, Time series, computer.software_genre, computer
Abstract: This work aims to improve an existing time series forecasting algorithm (LBF) by applying frequent episode techniques as a complementary step to the model. When real-world time series are forecasted, there exist many samples whose values may be especially unexpected. By combining frequent episodes with the LBF algorithm, the new procedure does not make better predictions for these outliers but, rather, is able to predict the appearance of such atypical samples with great accuracy. In short, this work shows how to detect the occurrence of anomalous samples in time series, thus improving the general forecasting scheme. Moreover, this hybrid approach has been successfully tested on electricity-related time series.
Published: 2009
Full Text: View/download PDF
