736 results
Search Results
2. Predicting the Decomposition Level of Forest Trees Through Ensembling Methods
- Author
-
Jeyabharathy, S., Arumugam, Padmapriya, Bhattacharya, Mahua, editor, Kharb, Latika, editor, and Chahal, Deepak, editor
- Published
- 2021
- Full Text
- View/download PDF
3. Event Detection in Twitter Big Data by Virtual Backbone Deep Learning
- Author
-
Rezaei, Zahra, Komleh, Hossein Ebrahimpour, Eslami, Behnaz, Grandinetti, Lucio, editor, Mirtaheri, Seyedeh Leili, editor, and Shahbazian, Reza, editor
- Published
- 2019
- Full Text
- View/download PDF
4. Advanced Machine Learning Models for Large Scale Gene Expression Analysis in Cancer Classification: Deep Learning Versus Classical Models
- Author
-
Zenbout, Imene, Meshoul, Souham, Tabii, Youness, editor, Lazaar, Mohamed, editor, Al Achhab, Mohammed, editor, and Enneya, Nourddine, editor
- Published
- 2018
- Full Text
- View/download PDF
5. Classification of Summarized Sensor Data Using Sampling and Clustering: A Performance Analysis
- Author
-
P.G., Lavanya, Mallappa, Suresha, Santosh, K.C., editor, Hangarge, Mallikarjun, editor, Bevilacqua, Vitoantonio, editor, and Negi, Atul, editor
- Published
- 2017
- Full Text
- View/download PDF
6. An Apache Spark Implementation for Sentiment Analysis on Twitter Data
- Author
-
Baltas, Alexandros, Kanavos, Andreas, Tsakalidis, Athanasios K., Sellis, Timos, editor, and Oikonomou, Konstantinos, editor
- Published
- 2017
- Full Text
- View/download PDF
7. Solar Radio Astronomical Big Data Classification
- Author
-
Xu, Long, Weng, Ying, Chen, Zhuo, Xie, Jiang, editor, Chen, Zhangxin, editor, Douglas, Craig C., editor, Zhang, Wu, editor, and Chen, Yan, editor
- Published
- 2016
- Full Text
- View/download PDF
8. Machine learning and big data analytics in bipolar disorder: A position paper from the International Society for Bipolar Disorders Big Data Task Force
- Author
-
Passos, Ives C., Ballester, Pedro L., Barros, Rodrigo C., Librenza-Garcia, Diego, Mwangi, Benson, Birmaher, Boris, Brietzke, Elisa, Hajek, Tomas, Lopez Jaramillo, Carlos, Mansur, Rodrigo B., Alda, Martin, Haarman, Bartholomeus C. M., Isometsa, Erkki, Lam, Raymond W., McIntyre, Roger S., Minuzzi, Luciano, Kessing, Lars V., Yatham, Lakshmi N., Duffy, Anne, and Kapczinski, Flavio
- Subjects
bipolar disorder, PREDICTING SUICIDALITY, RISK, MOOD DISORDERS, SYMPTOMS, predictive psychiatry, education, 3112 Neurosciences, deep learning, data mining, ASSOCIATION, personalized psychiatry, DEPRESSION, CLASSIFICATION, 3124 Neurology and psychiatry, risk prediction, machine learning, big data, LITHIUM RESPONSE, SCHIZOPHRENIA, NEUROPROGRESSION
- Abstract
Objectives The International Society for Bipolar Disorders Big Data Task Force assembled leading researchers with extensive experience in bipolar disorder (BD), machine learning, and big data to evaluate the rationale of machine learning and big data analytics strategies for BD. Method A task force was convened to examine and integrate findings from the scientific literature related to machine learning and big data-based studies, to clarify terminology, and to describe challenges and potential applications in the field of BD. We also systematically searched PubMed, Embase, and Web of Science for articles published up to January 2019 that used machine learning in BD. Results The results suggested that big data analytics has the potential to provide risk calculators to aid in treatment decisions and predict clinical prognosis, including suicidality, for individual patients. This approach can advance diagnosis by enabling discovery of more relevant data-driven phenotypes, as well as by predicting transition to the disorder in high-risk unaffected subjects. We also discuss the most frequent challenges that big data analytics applications can face, such as heterogeneity, lack of external validation and replication of some studies, cost and non-stationary distribution of the data, and lack of appropriate funding. Conclusion Machine learning-based studies, including atheoretical data-driven big data approaches, provide an opportunity to more accurately detect those who are at risk, parse relevant phenotypes, and inform treatment selection and prognosis. However, several methodological challenges need to be addressed in order to translate research findings to clinical settings.
- Published
- 2019
9. The algorithm at work? Explanation and repair in the enactment of similarity in art data.
- Author
-
Sachs, S. E.
- Subjects
ALGORITHMS, PAPER arts, INTERNET marketing, IMAGE databases, EMERGING markets, ELECTRIC breakdown
- Abstract
This paper examines the work practices involved in making data legible to machines and machine output legible to humans. The study is based on ethnographic research of a team of art experts at DNArt – a data classification system that features a growing database of art images, a classification scheme, a similarity matching algorithm, and a website that together serve as a consumer judgment device in an emerging online market for art. I analyze interactions from meeting observations, interviews, documentation, and online interaction data to show how non-technical art experts explain and repair sociotechnical breakdowns – when their expectations for similarity between art images and artists differ from the similarity relations produced by the algorithm. By repairing breakdowns, the art experts construct the algorithm anew, as a legitimate revealer of similarity in art. In doing so, the team's repair work is folded back into the black box of the algorithm, rendering it invisible and unacknowledged, sometimes even by the experts themselves. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
10. Chicken swarm foraging algorithm for big data classification using the deep belief network classifier
- Author
-
Sathyaraj, R., Ramanathan, L., Lavanya, K., Balasubramanian, V., and Saira Banu, J.
- Published
- 2021
- Full Text
- View/download PDF
11. Building a training dataset for classification under a cost limitation
- Author
-
Chen, Yen-Liang, Cheng, Li-Chen, and Zhang, Yi-Jun
- Published
- 2021
- Full Text
- View/download PDF
12. Prediction of new prescription requirements for diabetes patients using big data technologies
- Author
-
Bakırarar, Batuhan, Yüksel, Cemil, and Yavuz, Yasemin
- Published
- 2022
- Full Text
- View/download PDF
13. Working Papers Presented in Arcada Workshop on Analytics in May 25, 2015
- Author
-
Pulkkis (Ed.), Göran and Yrkeshögskolan Arcada
- Subjects
Data mining--Analysis, Big data, Internet, Cluster analysis, Electronic commerce, Customer relations, Retail trade, Fuzzy numbers, Classification, Financial risk, Content analysis, Business intelligence
- Abstract
The Department of Business Management and Analytics at Arcada University of Applied Sciences arranged a Workshop on Analytics on May 25, 2015. The four working papers presented in this workshop are published in this report.
- Published
- 2015
14. Identification and classification of urban employment centers based on big data: A case study of Beijing.
- Author
-
Wang, Liang and Cui, He
- Subjects
BIG data, DIVERSITY in the workplace, URBAN transportation, K-means clustering, CLASSIFICATION
- Abstract
The layout, scale, and spatial form of urban employment centers are important guides for the rational layout of public service facilities such as urban transportation, medical care, and education. In this paper, we use Internet cell-phone positioning data to identify the workplaces and residences of users in the Beijing city area and derive commuting data for the employed in order to measure the employment center system in Beijing. First, an employment density distribution is generated from the workplace data, and the employment centers are identified from this density surface. Then, business registration data are used to measure the industrial diversity within each employment center via the ecological Shannon-Wiener diversity index, and commuting links between employment centers and places of residence are combined to measure the energy level of each center and to analyze its hinterland and sphere of influence. Finally, the industrial diversity index of the employment centers and the average commuting time of employed persons are combined with the K-means clustering algorithm to classify the employment centers in Beijing. The big-data-based identification and classification method constructed in this study helps overcome limitations of previous research on employment center systems, namely coarse spatial units and the lack of commuting data for center identification and commuting-linkage measurement. The study provides a reference for the regular understanding and technical analysis of employment centers, and supports the multi-center employment system in Beijing by quantifying the employment spatial structure, guiding the construction of the multi-center system, and adjusting land use rules. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
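The two measures this abstract combines (a Shannon-Wiener diversity index over industries within a center, and K-means clustering over diversity and commuting time) can be sketched on toy data; the centers and values below are hypothetical, not the paper's Beijing data:

```python
import math

def shannon_wiener(counts):
    # H' = -sum(p_i * ln p_i) over industry categories with nonzero counts
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def kmeans(points, k, iters=100):
    # Plain Lloyd's algorithm; the first k points seed the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
        for j in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical employment centers: (industry diversity H', mean commute minutes)
centers = [(0.4, 25), (0.5, 28), (2.1, 55), (2.3, 60)]
groups = kmeans(centers, 2)
```

Centers with similar diversity and commuting profiles end up in the same cluster, which is the classification idea the abstract describes.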
15. A model of the relationship between the variations of effectiveness and fairness in information retrieval
- Author
-
Melucci, Massimo
- Published
- 2024
- Full Text
- View/download PDF
16. A novel ensemble-based paradigm to process large-scale data.
- Author
-
Trinh, Thanh, Le, HoangAnh, VuongThi, Nhung, HoangDuc, Hai, and VuThi, KieuAnh
- Abstract
Big data analytics is an emerging topic in academic and industrial engineering fields, where the large-scale data issue is the most attractive challenge. It is crucial to design an effective large-scale data processing model to handle big data. In this paper, we aim to improve the accuracy of the classification task and reduce the execution time for large-scale data within a small cluster. To overcome these challenges, this paper presents a novel ensemble-based paradigm that consists of a procedure for splitting large-scale data files and for developing ensemble models. Two different splitting methods are first developed to partition large-scale data into small, non-overlapping data blocks. We then propose two ensemble-based methods, bagging-based and boosting-based, with high accuracy and low execution time. Finally, the proposed paradigm can be implemented by four predictive models, which are the combinations of the two splitting methods and the two ensemble-based methods. A series of experiments was conducted to evaluate the effectiveness of the proposed paradigm with the four combinations. Overall, the boosting-based paradigm is the best in terms of accuracy compared with existing methods; the boosting-based methods achieve 91.6% accuracy, against 52% for the baseline model, on a big data file with 10 million samples. However, the bagging-based paradigm takes the least execution time to yield results. This paper also demonstrates the effectiveness of a Spark computing cluster for large-scale data and points out a weakness of RDDs (Resilient Distributed Datasets). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
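The split-then-ensemble idea described above can be sketched in a few lines: partition the data into non-overlapping blocks, train one simple base learner per block, and aggregate by majority vote (bagging-style). The nearest-centroid base learner and the synthetic data below are stand-ins, not the paper's models:

```python
import random

def split_blocks(data, n_blocks):
    # Non-overlapping partition of the dataset into n_blocks chunks.
    size = (len(data) + n_blocks - 1) // n_blocks
    return [data[i:i + size] for i in range(0, len(data), size)]

def train_centroid_model(block):
    # Minimal base learner: per-class mean (nearest-centroid classifier).
    sums, counts = {}, {}
    for x, y in block:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0] * len(x)), x)]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(models, x):
    # Bagging-style aggregation: each block model votes, majority wins.
    votes = {}
    for m in models:
        label = min(m, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, m[y])))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

random.seed(0)
data = [([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(90)]
data += [([random.gauss(4, 1), random.gauss(4, 1)], 1) for _ in range(90)]
random.shuffle(data)
models = [train_centroid_model(b) for b in split_blocks(data, 3)]
```

A boosting-based variant would instead weight blocks or samples by the errors of earlier models rather than voting uniformly.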
17. A binary hybrid sine cosine white shark optimizer for feature selection.
- Author
-
Hammouri, Abdelaziz I., Braik, Malik Sh., Al-hiary, Heba H., and Abdeen, Rawan A.
- Subjects
WHITE shark, METAHEURISTIC algorithms, COSINE function, ELECTRONIC data processing, BIG data, FEATURE selection
- Abstract
Feature Selection (FS), a pre-processing step used in the majority of big data processing applications, aims to eliminate irrelevant and redundant features from the data. Its purpose is to select a final set of data characteristics that best represent the data as a whole. To achieve this, it explores every potential solution in order to identify the optimal subset. Meta-heuristic algorithms have been found to be particularly effective in solving FS problems, especially for high-dimensional datasets. This work adopts a recently developed meta-heuristic called the White Shark Optimizer (WSO) due to its simplicity and low computational overhead. However, WSO faces challenges in effectively balancing exploration and exploitation, particularly in complex multi-peak search problems. It tends to converge prematurely and get stuck in local optima, which can lead to poor search performance when dealing with FS problems. To overcome these issues, this paper presents three enhanced binary variants of WSO for well-known FS problems: (1) Binary Distribution-based WSO (BDWSO), in which the algorithm refines the positions of white sharks by considering the mean and standard deviation of the current shark, the local best shark, and the global best shark, a strategy designed to alleviate premature convergence and stagnation during iterations; (2) Binary Sine Cosine WSO (BSCWSO), which uses sine and cosine adaptive functions for the social and cognitive components of the position update rule; and (3) Binary Hybrid Sine Cosine WSO (BHSCWSO), which employs sine and cosine acceleration factors to regulate local search and achieve convergence to the global optimal solution. Additionally, the population was initialized using the Opposition-Based Learning (OBL) mechanism, and the sine map was used to modify the inertia weight of WSO.
These variants were designed to strike a better balance between exploration and exploitation. The proposed methods were extensively compared with the fundamental binary WSO and other well-known algorithms in the field. The experimental findings and comparisons demonstrate that the proposed methods outperform the conventional algorithm and most of the evaluated similar algorithms in terms of robustness and solution quality. In terms of classification accuracy, number of selected features, specificity, sensitivity, and fitness values, the proposed BHSCWSO optimizer performed better than the other proposed peer optimizers (BSWO, BDWSO, and BSCWSO) in 11, 8, 13, 18, and 10 datasets, respectively. The proposed BHSCWSO optimizer achieved performance levels above 90% in accuracy, sensitivity, and specificity on 15, 14, and 11 of the 24 datasets considered. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
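A common building block behind binary sine/cosine variants like these is a transfer function that maps a continuous search position to a probability of selecting each feature. The sketch below shows only that binarization step, paired with a toy relevance-based fitness and random candidate positions; it is a generic illustration, not the authors' WSO update rules, and the relevance scores are invented:

```python
import math
import random

def sine_transfer(x):
    # V-shaped transfer: map a continuous position to a selection probability.
    return abs(math.sin((math.pi / 2) * x))

def binarize(position, rng):
    # A feature is selected when a random draw falls under its transfer value.
    return [1 if rng.random() < sine_transfer(x) else 0 for x in position]

def fitness(mask, relevance, alpha=0.9):
    # Toy objective: reward relevance of chosen features, penalise their count.
    if not any(mask):
        return 0.0
    chosen = [r for m, r in zip(mask, relevance) if m]
    return alpha * (sum(chosen) / len(chosen)) - (1 - alpha) * (len(chosen) / len(mask))

rng = random.Random(1)
relevance = [0.9, 0.1, 0.8, 0.05, 0.7]   # hypothetical per-feature scores
best_mask, best_fit = None, -1.0
for _ in range(200):                      # random search stands in for the WSO loop
    position = [rng.uniform(-1, 1) for _ in relevance]
    mask = binarize(position, rng)
    f = fitness(mask, relevance)
    if f > best_fit:
        best_mask, best_fit = mask, f
```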
18. Big data in transportation: a systematic literature analysis and topic classification.
- Author
-
Tzika-Kostopoulou, Danai, Nathanail, Eftihia, and Kokkinos, Konstantinos
- Subjects
BIG data, CONVOLUTIONAL neural networks, SMART cities, URBAN planning, CLASSIFICATION
- Abstract
This paper identifies trends in the application of big data in the transport sector and categorizes research work across scientific subfields. The systematic analysis considered literature published between 2012 and 2022. A total of 2671 studies were evaluated from a dataset of 3532 collected papers, and bibliometric techniques were applied to capture the evolution of research interest over the years and identify the most influential studies. The proposed unsupervised classification model defined categories and classified the relevant articles based on their particular scientific interest using representative keywords from the title, abstract, and keywords (referred to as top words). The model's performance was verified with an accuracy of 91% using a Naïve Bayes and Convolutional Neural Network approach. The analysis identified eight research topics, with urban transport planning and smart city applications being the dominant categories. This paper contributes to the literature by proposing a methodology for literature analysis, identifying emerging scientific areas, and highlighting potential directions for future research. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
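The top-word pipeline itself is specific to the paper, but the Naïve Bayes verification step it mentions can be illustrated with a minimal multinomial Naïve Bayes over bag-of-words titles; the categories and documents below are invented for the sketch:

```python
import math
from collections import Counter

def train_nb(docs):
    # Multinomial Naive Bayes with Laplace smoothing over bag-of-words counts.
    classes, vocab = {}, set()
    for words, label in docs:
        classes.setdefault(label, Counter()).update(words)
        vocab.update(words)
    priors = {c: 1 / len(classes) for c in classes}  # uniform prior for the sketch
    return classes, vocab, priors

def classify(model, words):
    classes, vocab, priors = model
    def score(c):
        total = sum(classes[c].values())
        return math.log(priors[c]) + sum(
            math.log((classes[c][w] + 1) / (total + len(vocab))) for w in words)
    return max(classes, key=score)

docs = [
    ("urban transport planning city".split(), "planning"),
    ("smart city sensors urban".split(), "planning"),
    ("freight logistics supply chain".split(), "freight"),
    ("truck freight routing logistics".split(), "freight"),
]
model = train_nb(docs)
```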
19. Optimizing Attribute Reduction in Rough Set Theory using Re-heat Simulated Annealing for Classification and Data Mining.
- Author
-
Bamhdi, Alwi M., Barros, Ana Luiza, Makda, Tahira Jehan, Fernandez, Marcial, Patel, Ahmed, and Golafshan, Laleh
- Subjects
ROUGH sets, METAHEURISTIC algorithms, CLASSIFICATION algorithms, SIMULATED annealing, DATA mining
- Abstract
Data classification is a crucial aspect of knowledge discovery, using machine-learning algorithms in a supervised learning approach where the goal is to predict the categorical labels of new instances based on past observations. This research presents an innovative classification technique that utilizes rough set attribute reduction and introduces the Re-heat Simulated Annealing (Re-heat SA) algorithm as a meta-heuristic approach. Rough set theory, a mathematical tool for dealing with uncertainty and fuzziness in data, is employed to uncover hidden patterns in big data through feature selection. Re-heat SA effectively optimizes the problem by controlling the dependency degree to identify the minimal reducts required for classification prediction, using the Rosetta software. Experimental results demonstrate that Re-heat SA outperforms comparable classification algorithms in discovering classification rules: three datasets achieved 100% accuracy, four datasets achieved accuracy rates from 60% to 99%, and six datasets achieved accuracy rates from 30% to 59%. Additionally, this paper discusses the need for standardization of machine learning pipeline processes as big data and its handling grow exponentially. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
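The core loop of a re-heat simulated annealing search for a minimal reduct can be sketched as follows. The `dependency` function here is a toy stand-in for the rough-set dependency degree (in the paper it is computed from the data via Rosetta), and the re-heat consists of restarting the temperature schedule:

```python
import math
import random

ATTRS = 6
CORE = {0, 2}   # hypothetical: only these attributes determine the decision classes

def dependency(subset):
    # Toy stand-in for the rough-set dependency degree gamma(subset).
    return 1.0 if CORE <= subset else 0.25 * len(CORE & subset)

def cost(subset):
    # Prefer full dependency first, then fewer attributes (a minimal reduct).
    return 10.0 * (1.0 - dependency(subset)) + len(subset) / ATTRS

def reheat_sa(rng, t0=1.0, cooling=0.95, reheats=3, steps=200):
    current = set(range(ATTRS))
    best = set(current)
    for _ in range(reheats):              # re-heat: restart the temperature schedule
        t = t0
        for _ in range(steps):
            neighbour = set(current)
            neighbour.symmetric_difference_update({rng.randrange(ATTRS)})
            delta = cost(neighbour) - cost(current)
            if delta < 0 or rng.random() < math.exp(-delta / t):
                current = neighbour
            if cost(current) < cost(best):
                best = set(current)
            t *= cooling
    return best

reduct = reheat_sa(random.Random(7))
```

Re-heating gives the search repeated chances to escape local optima that a single monotone cooling schedule would get stuck in.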
20. Smart Data Driven Decision Trees Ensemble Methodology for Imbalanced Big Data.
- Author
-
García-Gil, Diego, García, Salvador, Xiong, Ning, and Herrera, Francisco
- Abstract
Differences in data size per class, also known as imbalanced data distribution, have become a common problem affecting data quality. Big Data scenarios pose a new challenge to traditional imbalanced classification algorithms, since they are not prepared to work with such amounts of data. Split data strategies and the lack of data in the minority class due to the use of the MapReduce paradigm have posed new challenges for tackling the imbalance between classes in Big Data scenarios. Ensembles have been shown to be able to successfully address imbalanced data problems. Smart Data refers to data of enough quality to achieve high-performance models. The combination of ensembles and Smart Data, achieved through Big Data preprocessing, should be a great synergy. In this paper, we propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains, namely the SD_DeTE methodology. This methodology is based on the learning of different decision trees using distributed quality data for the ensemble process. This quality data is achieved by fusing random discretization, principal components analysis, and clustering-based random oversampling to obtain different Smart Data versions of the original data. Experiments carried out on 21 binary adapted datasets have shown that our methodology outperforms random forest. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
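Of the preprocessing steps fused in SD_DeTE, the oversampling component is the easiest to illustrate. The sketch below shows plain random oversampling to balance class counts; the paper's version is clustering-based and also applies random discretization and PCA, which are omitted here:

```python
import random
from collections import Counter

def random_oversample(data, rng):
    # Replicate minority-class samples until all class counts match the largest.
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced += [(x, y) for x in xs]
        balanced += [(rng.choice(xs), y) for _ in range(target - len(xs))]
    return balanced

rng = random.Random(3)
# Hypothetical 95/5 imbalanced dataset
data = [([i, i + 1], "maj") for i in range(95)] + [([i, -i], "min") for i in range(5)]
balanced = random_oversample(data, rng)
counts = Counter(y for _, y in balanced)
```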
21. Diabetic prediction and classification of risk level using ODDTADC method in big data analytics.
- Author
-
Jenefer, G. Geo, Deepa, A. J., and Linda, M. Mary
- Abstract
Diabetes is regarded as one of the deadliest chronic illnesses that increase blood sugar. But there is no reliable method for predicting diabetic severity that shows how the disease will affect various body organs in the future. Therefore, this paper introduces the Optimized Dual Directional Temporal convolution and Attention based Density Clustering (ODDTADC) method for predicting and classifying risk level in diabetic patients. In the diabetic prediction stage, the prediction is done by using an Integrated Dual Directional Temporal Convolution and an Enriched Remora Optimization Algorithm. Here, dual directional temporal convolution is used to extract temporal features by integrating dilated convolution and causal convolution in the feature extraction layer. Then, the attention module is used instead of max-pooling to emphasize the importance of the various features in the feature aggregation layer. The Enriched Remora Optimization Algorithm is used to find optimal hyperparameters for the Integrated Dual Directional Temporal Convolution. In the classification of stages based on risk level, the values from stage I are fed into the Attention based Density Spatial Clustering of Applications with Noise, which allocates weights based on their density values in the core points. Based on the results, a Nested Long Short-Term Memory is utilized to classify the risk levels of diabetic patients over a period of two or three years. Experimental evaluations were performed on five datasets, including the PIMA Indian Diabetes Database, the UCI Machine Learning Repository Diabetes Dataset, the Heart Diseases Dataset, the Chronic Disease Dataset and the Diabetic Retinopathy Debrecen Dataset. The proposed ODDTADC method demonstrates superior performance compared to existing methods, achieving remarkable results in accuracy (98.21%), recall (94.46%), kappa coefficient (98.95%), precision (98.74%), F1-score (99.01%) and Matthew's correlation coefficient (MCC) (0.87). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
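The dual directional temporal convolution builds on dilated causal convolution, which can be shown in a few lines: each output depends only on present and past samples, with taps spaced `dilation` steps apart. The kernel and signal below are a toy example, not the paper's learned filters:

```python
def causal_dilated_conv(x, kernel, dilation):
    # y[t] = sum_k kernel[k] * x[t - k*dilation], using only past/present samples.
    out = []
    for t in range(len(x)):
        s = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:            # causal: ignore taps before the series starts
                s += w * x[idx]
        out.append(s)
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = causal_dilated_conv(signal, kernel=[0.5, 0.5], dilation=2)
```

Stacking such layers with growing dilations (1, 2, 4, ...) is what gives temporal convolution networks their long receptive field.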
22. THE USE OF ROUGH CLASSIFICATION AND TWO THRESHOLD TWO DIVISORS FOR DEDUPLICATION.
- Author
-
Jehlol, Hashem B. and George, Loay E.
- Subjects
MULTICORE processors, BIG data, DATA warehousing, CLASSIFICATION, PARALLEL processing
- Abstract
The data deduplication technique efficiently reduces and removes redundant data in big data storage systems. The main issue is that data deduplication requires expensive computational effort to remove duplicate data due to the vast size of big data; the chunking and hashing stage in particular often requires extensive calculation and time. This paper attempts to reduce the time and computation required in the deduplication stages, proposing an efficient new method that exploits the parallel processing of deduplication systems and is designed to use multicore computing efficiently. First, the proposed method removes redundant data by roughly classifying the input into several classes using histogram similarity and the k-means algorithm. Next, a new method for calculating the divisor list for each class is introduced to improve the chunking method and increase the deduplication ratio. Finally, the performance of the proposed method was evaluated using three datasets as test examples. The evaluation shows that class-based deduplication on a multicore processor is much faster than on a single-core processor. Moreover, the experimental results show that the proposed method significantly improves the performance of the Two Threshold Two Divisors (TTTD) and Basic Sliding Window (BSW) algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
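The class-then-hash structure described above can be sketched as follows. For brevity the sketch uses fixed-size chunking where the paper uses TTTD-style content-defined chunking, and a mean-byte histogram signature as the rough class:

```python
import hashlib

def rough_class(chunk, n_classes=4):
    # Coarse histogram-style bucket: mean byte value quantised into n_classes.
    mean = sum(chunk) / len(chunk)
    return min(int(mean * n_classes // 256), n_classes - 1)

def deduplicate(blob, chunk_size=8):
    # Fixed-size chunking for the sketch; TTTD would pick content-defined cuts.
    seen = {c: set() for c in range(4)}
    unique = []
    for i in range(0, len(blob), chunk_size):
        chunk = blob[i:i + chunk_size]
        digest = hashlib.sha1(chunk).hexdigest()
        cls = rough_class(chunk)
        if digest not in seen[cls]:     # only compare hashes within the same class
            seen[cls].add(digest)
            unique.append(chunk)
    return unique

blob = b"ABCDEFGH" * 3 + b"12345678"
unique = deduplicate(blob)
```

Partitioning the hash index by class is also what makes the method parallel-friendly: each class's lookups can run on a separate core.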
24. Support vector machine and classification, kernel trick for separating of data points.
- Author
-
Negi, Harendra Singh, Dimri, Sushil Chandra, Kumar, Bhawnesh, and Ram, Mangey
- Subjects
STATISTICAL learning, SUPPORT vector machines, BIG data, CLASSIFICATION algorithms, CLASSIFICATION
- Abstract
A great deal of study has been done recently on support vector machines (SVMs) and how they are used in various scientific domains. The mathematical foundation of statistical learning theory allows SVMs to offer principled solutions to problems, and a portion of the training input serves as the SVM's solution. SVMs are frequently employed in applications involving feature reduction, regression, and novelty detection. Additionally, studies in fields where SVMs perform badly have prompted the creation of alternative SVM variants, including big-data-set SVMs, multiclassification SVMs, and imbalanced-data-set SVMs. Furthermore, SVMs are combined with other, more sophisticated techniques, such as evolutionary algorithms, to enhance classification performance and optimize parameters. SVM algorithms are now widely used in research and applications across a number of scientific and engineering domains, and the SVM might be one of the primary options for big data compatibility and big data classification because of its advantages. To achieve this, data preparation methods must be created to map data into the appropriate format for learning. This paper provides a quick overview of SVMs, lists several applications, draws attention to current issues and trends, and points out some of SVM's drawbacks. The paper can be used as a guide for categorizing vast volumes of data and recommends research areas for further investigation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
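The kernel trick named in the title can be made concrete with a kernel perceptron: the decision function is a kernel expansion over training points, so the RBF feature space is never constructed explicitly. The XOR-style data below is a standard illustration of a set no linear boundary separates, not data from the paper:

```python
import math

def rbf(a, b, gamma=1.0):
    # Kernel trick: K(a,b) = exp(-gamma * ||a-b||^2) is an inner product in an
    # implicit feature space that is never built explicitly.
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_perceptron(X, y, epochs=20, gamma=1.0):
    # Dual form: the decision function is sum_j alpha_j * y_j * K(x_j, x).
    alpha = [0.0] * len(X)
    for _ in range(epochs):
        for i, xi in enumerate(X):
            f = sum(a * yj * rbf(xj, xi, gamma) for a, xj, yj in zip(alpha, X, y))
            if y[i] * f <= 0:          # misclassified: strengthen this point's term
                alpha[i] += 1.0
    def predict(x):
        f = sum(a * yj * rbf(xj, x, gamma) for a, xj, yj in zip(alpha, X, y))
        return 1 if f >= 0 else -1
    return predict

# XOR-style data: not separable by any linear boundary in the input space.
X = [(0, 0), (1, 1), (0, 1), (1, 0)]
y = [1, 1, -1, -1]
predict = kernel_perceptron(X, y)
```

An SVM differs by choosing the maximum-margin expansion, but the separating-via-kernel mechanism is the same.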
25. Parallel and streaming wavelet neural networks for classification and regression under apache spark.
- Author
-
Eduru, Harindra Venkatesh, Vivek, Yelleti, Ravi, Vadlamani, and Shankar, Orsu Shiva
- Subjects
BIG data, CLASSIFICATION, GAS detectors, GAUSSIAN function
- Abstract
Wavelet neural networks (WNN) have been applied in many fields to solve regression as well as classification problems. After the advent of big data, data is generated at a brisk pace, and it is imperative to analyze it as soon as it is generated, since the nature of the data may change dramatically over short time intervals. Big data is all-pervasive and poses computational challenges for data scientists. Therefore, in this paper, we build an efficient Scalable, Parallelized Wavelet Neural Network (SPWNN) which employs the parallel stochastic gradient descent (SGD) algorithm. SPWNN is designed and developed under both static and streaming environments in the horizontal parallelization framework, and is implemented using the Morlet and Gaussian functions as activation functions. This study is conducted on big datasets, such as gas sensor data with more than 4 million samples and medical research data with more than 10,000 features, which are high dimensional in nature. The experimental analysis indicates that in the static environment, SPWNN with the Morlet activation function outperformed SPWNN with the Gaussian on the classification datasets, whereas no clear trend was observed for regression. In contrast, in the streaming environment, Gaussian outperformed Morlet on classification and Morlet outperformed Gaussian on the regression datasets. Overall, the proposed SPWNN architecture achieved a speedup of 1.22 to 1.78. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
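A single-layer wavelet network with the Morlet activation, trained by SGD on its output weights, can be sketched as follows. This is a toy 1D regression with fixed wavelet centres and a target the bank can represent exactly; the paper's SPWNN parallelizes SGD under Spark and also learns in a streaming setting:

```python
import math
import random

def morlet(x):
    # Morlet wavelet: cos(5x) * exp(-x^2 / 2), used here as the hidden activation.
    return math.cos(5 * x) * math.exp(-x * x / 2)

def features(x, centers):
    # One wavelet unit per centre; scales are fixed to 1 for the sketch.
    return [morlet(x - c) for c in centers]

def sgd_fit(xs, ys, centers, lr=0.05, epochs=300, seed=0):
    # Stochastic gradient descent over the linear output weights of the network.
    rng = random.Random(seed)
    w = [0.0] * len(centers)
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            phi = features(xs[i], centers)
            err = sum(wi * p for wi, p in zip(w, phi)) - ys[i]
            w = [wi - lr * err * p for wi, p in zip(w, phi)]
    return w

centers = [-2.0, -1.0, 0.0, 1.0, 2.0]
xs = [i / 10 for i in range(-20, 21)]
ys = [0.7 * morlet(x + 1) - 0.3 * morlet(x - 1) for x in xs]  # representable target
w = sgd_fit(xs, ys, centers)
pred = [sum(wi * p for wi, p in zip(w, features(x, centers))) for x in xs]
mse = sum((p - t) ** 2 for p, t in zip(pred, ys)) / len(xs)
```

Parallelizing this amounts to sharding `xs` across workers and averaging the per-shard gradients or weights, which is the horizontal scheme the abstract describes.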
26. Ecosystem Integrity Remote Sensing—Modelling and Service Tool—ESIS/Imalys.
- Author
-
Selsam, Peter, Bumberger, Jan, Wellmann, Thilo, Pause, Marion, Gey, Ronny, Borg, Erik, and Lausch, Angela
- Subjects
ECOLOGICAL integrity ,REMOTE sensing ,LANDSCAPE assessment ,FRAGMENTED landscapes ,ZONE melting ,ENVIRONMENTAL indicators ,ECOSYSTEM services ,ECOSYSTEMS - Abstract
One of the greatest challenges of our time is monitoring the rapid environmental changes taking place worldwide at both local and global scales. This requires easy-to-use and ready-to-implement tools and services to monitor and quantify aspects of bio- and geodiversity change and the impact of land use intensification using freely available and global remotely sensed data, and to derive remotely sensed indicators. Currently, there are no services for quantifying both raster- and vector-based indicators in a "compact tool". Therefore, the main innovation of ESIS/Imalys is having a remote sensing (RS) tool that allows for RS data processing, data management, and continuous and discrete quantification and derivation of RS indicators in one tool. With the ESIS/Imalys project (Ecosystem Integrity Remote Sensing—Modelling and Service Tool), we try to present environmental indicators on a clearly defined and reproducible basis. The Imalys software library generates the RS indicators and remote sensing products defined for ESIS. This paper provides an overview of the functionality of the Imalys software library. An overview of the technical background of the implementation of the Imalys library, data formats and the user interfaces is given. Examples of RS-based indicators derived using the Imalys tool at pixel level and at zone level (vector level) are presented. Furthermore, the advantages and disadvantages of the Imalys tool are discussed in detail in order to better assess the value of Imalys for users and developers. The applicability of the indicators will be demonstrated through three ecological applications, namely: (1) monitoring landscape diversity, (2) monitoring landscape structure and landscape fragmentation, and (3) monitoring land use intensity and its impact on ecosystem functions. Despite the integration of large amounts of data, Imalys can run on any PC, as the processing and derivation of indicators has been greatly optimised. 
The Imalys source code is freely available and is hosted and maintained under an open source license. Complete documentation of all methods, functions and derived indicators can be found in the freely available Imalys manual. The user-friendliness of Imalys, despite the integration of a large amount of RS data, makes it another important tool for ecological research, modelling and application for the monitoring and derivation of ecosystem indicators from local to global scale. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. A Needle in a Cosmic Haystack: A Review of FRB Search Techniques.
- Author
-
Rajwade, Kaustubh M. and van Leeuwen, Joeri
- Subjects
PULSAR detection ,RADIO interference ,OBSERVATORIES ,BIG data ,RADIO telescopes ,SOLAR radio bursts - Abstract
Ephemeral Fast Radio Bursts (FRBs) must be powered by some of the most energetic processes in the Universe. That makes them highly interesting in their own right, and as precise probes for estimating cosmological parameters. This field thus poses a unique challenge: FRBs must be detected promptly and immediately localised and studied based only on that single millisecond-duration flash. The problem is that the burst occurrence is highly unpredictable and that their distance strongly suppresses their brightness. Since the discovery of FRBs in single-dish archival data in 2007, detection software has evolved tremendously. Pipelines now detect bursts in real time within a matter of seconds, operate on interferometers, buffer high-time and frequency resolution data, and issue real-time alerts to other observatories for rapid multi-wavelength follow-up. In this paper, we review the components that comprise an FRB search software pipeline, we discuss the proven techniques that were adopted from pulsar searches, we highlight newer, more efficient techniques for detecting FRBs, and we conclude by discussing the proposed novel future methodologies that may power the search for FRBs in the era of big data astronomy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
28. MapReduce Solutions Classification by Their Implementation.
- Author
-
Orynbekova, Kamila, Bogdanchikov, Andrey, Cankurt, Selcuk, Adamov, Abzatdin, and Kadyrov, Shirali
- Subjects
EDUCATION associations ,STATISTICAL significance ,CLASSIFICATION ,STATISTICS - Abstract
Distributed systems are widely used in industrial projects and scientific research. The Apache Hadoop environment, which works on the MapReduce paradigm, lost popularity as newer, more modern tools were developed. For example, Apache Spark is preferred in some cases since it holds intermediate calculations in RAM; it therefore works faster and is easier to use. However, to take full advantage of it, users must still think in terms of the MapReduce concept. In this paper, conventional and MapReduce solutions to ten problems are compared by their pseudocodes and categorized into five groups. From these groups' descriptions and pseudocodes, readers can grasp the MapReduce concept without taking specific courses. This paper proposes a five-category classification methodology to help distributed-system users learn the MapReduce paradigm quickly. The proposed methodology is illustrated with ten tasks. Furthermore, statistical analysis is carried out to test whether the proposed classification methodology affects learner performance. The results of this study indicate that the proposed model outperforms the traditional approach with statistical significance, as evidenced by a p-value of less than 0.05. The policy implication is that educational institutions and organizations could adopt the proposed classification methodology to help learners and employees acquire the necessary knowledge and skills to use distributed systems effectively. [ABSTRACT FROM AUTHOR]
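To make the MapReduce concept concrete without a Hadoop cluster, the canonical word-count task can be mimicked with plain functions standing in for the map, shuffle, and reduce phases (an illustrative sketch, not one of the paper's ten tasks):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # mapper: emit a (word, 1) pair for every word in one document
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reducer: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big compute", "data pipelines"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

The key mental shift the paper's categories teach is expressing a problem purely as key-value emission plus per-key aggregation, so the framework can parallelize both phases.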
- Published
- 2023
- Full Text
- View/download PDF
29. Classification of Texts Using LSTM and LDA.
- Author
-
Seong-Yeon Park, Hyun-Kyung Noh, Seung-Yeon Hwang, Jae-Kon Oh, and Jeong-Joon Kim
- Subjects
INDUSTRY 4.0 ,PROGRAMMING languages ,CLASSIFICATION ,INFORMATION technology ,INFORMATION processing ,BIG data - Abstract
With the approach of the Fourth Industrial Revolution, information has become a powerful resource for society and the economy to operate and develop. In particular, as customized algorithms have become an essential service in most systems, the importance of big data-based information processing technology is also deepening. However, human languages have extreme variability compared to programming languages, so interpretation and processing are difficult, and efficient measures are needed to extract the desired information from such unstructured data. Therefore, in this paper, we develop a more effective analytical system by training LSTM and LDA techniques to classify papers subdivided into five categories within the topic of 'technique'. [ABSTRACT FROM AUTHOR]
- Published
- 2021
30. Analysis of Health Insurance Big Data for Early Detection of Disabilities: Algorithm Development and Validation
- Author
-
Mun-Taek Choi, Jung Bae Kang, Tae Rim Lee, and Seung-Hyun Jeong
- Subjects
Visual impairment ,Big data ,Medical informatics ,Early detection ,Health Informatics ,Sample (statistics) ,Early detection of disabilities ,Feature selection ,Health Information Management ,Health insurance ,Medical diagnosis ,Multiple classification ,Classification ,Cohort ,Algorithm - Abstract
Background Early detection of childhood developmental delays is very important for the treatment of disabilities. Objective To investigate the possibility of detecting childhood developmental delays leading to disabilities before clinical registration by analyzing big data from a health insurance database. Methods In this study, the data from children, individuals aged up to 13 years (n=2412), from the Sample Cohort 2.0 DB of the Korea National Health Insurance Service were organized by age range. Using 6 categories (having no disability, having a physical disability, having a brain lesion, having a visual impairment, having a hearing impairment, and having other conditions), features were selected in the order of importance with a tree-based model. We used multiple classification algorithms to find the best model for each age range. The earliest age range with clinically significant performance showed the age at which conditions can be detected early. Results The disability detection model showed that it was possible to detect disabilities with significant accuracy even at the age of 4 years, about a year earlier than the mean diagnostic age of 4.99 years. Conclusions Using big data analysis, we discovered the possibility of detecting disabilities earlier than clinical diagnoses, which would allow us to take appropriate action to prevent disabilities.
- Published
- 2020
31. Urban Complexity and the Dynamic Evolution of Urban Land Functions in Yiwu City: A Micro-Analysis with Multi-Source Big Data.
- Author
-
Zhou, Liangliang, Shi, Yishao, and Xie, Mengqiu
- Subjects
URBAN land use ,CITIES & towns ,MACHINE learning ,BIG data ,URBAN planning ,CLASSIFICATION ,DATA warehousing - Abstract
The diversification of business forms leads to functional and spatial complexity in cities. The efficient determination of the complexity of an urban system is the basis for the scientific monitoring of the multi-functional aggregation within cities. Previous studies on the urban spatial structure were limited by the difficulty of collecting micro-data and the high time cost, and they focused on the macro-spatial structure, lacking fine-grained investigations of the micro-spatial structure. Additionally, high-resolution remote sensing images, which mainly rely on the textural characteristics of the spectrum of ground objects, cannot detect the social and economic functions of ground objects. Thus, it is difficult to meet the actual needs of urban planning and management. The purpose of this paper is to automatically identify the spatial heterogeneity and temporal variation of urban land use functions in the context of complex urban systems. The TF-IDF (term frequency–inverse document frequency) algorithm, a machine learning classification algorithm, and other methods are applied to identify the urban functions and distribution characteristics of the main urban area based on the POI (point of interest) data and urban form data. The results show the following: (1) From 2012 to 2022, all types of land use in Yiwu city grew at different rates, with logistics and warehousing space growing the fastest, which is in line with Yiwu's goal of building a national logistics center for trade and services. (2) The residential area has a spatial structure with a dense central circle and a scattered periphery extending from northeast to southwest and from east to west. (3) The commercial service sector shows clear spatial differentiation between the core and the periphery. 
The commercial functional areas of Niansanli, Houzhai, and Chengxi, where the number of commercial POIs is relatively small, are located at the intersection of the administrative subdistricts near the city center, indicating that the commercial economic activities of the downtown subdistrict have a certain spillover effect on adjacent subdistricts. (4) The public facilities of each subdistrict are generally located in the core of each subdistrict, which ensures better convenience and accessibility. (5) Industrial land with a large total area that is scattered and mixed with urban residential land gradually tends to be centralized, forming an industrial belt around the city. This study comprehensively considers the aggregation relationship between urban buildings and land use and improves the accuracy of land identification and functional zoning. [ABSTRACT FROM AUTHOR]
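The TF-IDF weighting the study applies to POI data can be sketched in a few lines; the toy documents below are invented for illustration, and the idf variant (natural log, no smoothing) is an assumption:

```python
import math

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# hypothetical POI category streams for three city zones
docs = [["market", "retail", "market"], ["park", "school"], ["market", "park"]]
w = tf_idf(docs)
```

A category present in every zone scores zero everywhere, so the weighting surfaces the functions that distinguish one zone from the rest, which is exactly why it suits functional-zone identification.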
- Published
- 2024
- Full Text
- View/download PDF
32. Compact Data Learning for Machine Learning Classifications.
- Author
-
Kim, Song-Kyoo
- Subjects
MACHINE learning ,ARTIFICIAL intelligence ,STATISTICAL accuracy ,BIG data ,CLASSIFICATION ,ARRHYTHMIA - Abstract
This paper targets the area of optimizing machine learning (ML) training data by constructing compact data. The methods of optimizing ML training have improved and become a part of artificial intelligence (AI) system development. Compact data learning (CDL) is an alternative practical framework to optimize a classification system by reducing the size of the training dataset. CDL originated from compact data design, which provides the best assets without handling complex big data. CDL is a dedicated framework for improving the speed of the machine learning training phase without affecting the accuracy of the system. The performance of an ML-based arrhythmia detection system and its variants with CDL maintained the same statistical accuracy. ML training with CDL could be maximized by applying an 85% reduced input dataset, which indicated that a trained ML system could have the same statistical accuracy by only using 15% of the original training dataset. [ABSTRACT FROM AUTHOR]
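The abstract does not specify how CDL selects its reduced training set; a hypothetical stratified random subsample shows only the interface of training on, say, 15% of the data (the selection criterion here is not the paper's):

```python
import random

def compact_subset(X, y, fraction=0.15, seed=0):
    # stratified random subset: keep `fraction` of each class,
    # so the reduced set preserves the class balance
    rng = random.Random(seed)
    keep = []
    for label in set(y):
        idx = [i for i, l in enumerate(y) if l == label]
        rng.shuffle(idx)
        keep.extend(idx[: max(1, int(len(idx) * fraction))])
    return [X[i] for i in keep], [y[i] for i in keep]

X_small, y_small = compact_subset(list(range(100)), [0] * 50 + [1] * 50)
```

Whatever criterion CDL actually uses, the downstream contract is the same: the classifier trains on the reduced pair instead of the full dataset.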
- Published
- 2024
- Full Text
- View/download PDF
33. Wetland Classification, Attribute Accuracy, and Scale.
- Author
-
Carlson, Kate, Buttenfield, Barbara P., and Qiang, Yi
- Subjects
WETLANDS ,BIG data ,CLASSIFICATION ,DATA visualization ,SPATIAL resolution ,MULTISCALE modeling - Abstract
Quantification of all types of uncertainty helps to establish reliability in any analysis. This research focuses on uncertainty in two attribute levels of wetland classification and creates visualization tools to guide analysis of spatial uncertainty patterns over several scales. A novel variant of confusion matrix analysis compares the Cowardin and Hydrogeomorphic wetland classification systems, identifying areas and types of misclassification for binary and multivariate categories. The specific focus on uncertainty in the paper refers to categorical consistency, that is, agreement between the two classification systems, rather than comparing observed data to ground truth. Consistency is quantified using confusion matrix analysis. Aggregation across progressive focal windows transforms the confusion matrix into a multiscale data pyramid for quick determination of where attribute uncertainty is highly variant, and at what spatial resolutions classification inconsistencies emerge. The focal pyramids summarize precision, recall, and F1 scores to visualize classification differences across spatial scales. Findings show that the F1 scores appear most informative on agreement about wetlands misclassification at both coarse and fine attribute scales. The pyramid organizes multi-scale uncertainty in a single unified framework and can be "sliced" to view individual focal levels of attribute consistency. Results demonstrate how the confusion matrix can be used to quantify the percentage of a study area in which inconsistencies occur reflecting wetland presence and type. The research provides confusion metrics and display tools to focus attention on specific areas of large data sets where attribute uncertainty patterns may be complex, thus reducing land managers' workloads by highlighting areas of uncertainty where field checking might be appropriate, and improving analytics by providing visualization tools to quickly see where such areas occur. [ABSTRACT FROM AUTHOR]
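The precision, recall, and F1 scores summarized in the focal pyramids derive from confusion-matrix counts in the standard way; a minimal sketch:

```python
def prf1(tp, fp, fn):
    # precision, recall and F1 from binary confusion-matrix counts
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Computing these per focal window, rather than once for the whole map, is what lets the pyramid show where and at what scale the two wetland classifications disagree.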
- Published
- 2024
- Full Text
- View/download PDF
34. An Emergency Event Detection Ensemble Model Based on Big Data.
- Author
-
Alfalqi, Khalid and Bellaiche, Martine
- Subjects
BIG data ,DATABASES ,SOCIAL networks ,MACHINE learning ,DECISION trees ,SOCIAL systems - Abstract
Emergency events arise when a serious, unexpected, and often dangerous threat affects normal life. Hence, knowing what is occurring during and after emergency events is critical to mitigate the effect of the incident on human life, the environment, and our infrastructure, as well as the inherent financial consequences. Social network utilization in emergency event detection models can play an important role, as information is shared and users' statuses are updated once an emergency event occurs. Besides, big data has proved its significance as a tool to assist with and alleviate emergency events by processing an enormous amount of data over a short time interval. This paper shows that it is necessary to have an appropriate emergency event detection ensemble model (EEDEM) to respond quickly once such unfortunate events occur. Furthermore, it integrates Snapchat maps to propose a novel method to pinpoint the exact location of an emergency event. Moreover, merging social networks and big data can accelerate the emergency event detection system: social network data, such as those from Twitter and Snapchat, allow us to manage, monitor, analyze and detect emergency events. The main objective of this paper is to propose a novel and efficient big data-based EEDEM to pinpoint the exact location of emergency events by employing the collected data from social networks, such as "Twitter" and "Snapchat", while integrating big data (BD) and machine learning (ML). Furthermore, this paper evaluates the performance of five ML base models and the proposed ensemble approach to detect emergency events. Results show that the proposed ensemble approach achieved a very high accuracy of 99.87%, which outperforms the other base models. Moreover, the base models yield a high level of accuracy: 99.72% and 99.70% for LSTM and decision tree, respectively, with an acceptable training time. [ABSTRACT FROM AUTHOR]
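The simplest form of the ensemble idea, majority voting over base-model predictions, can be sketched as follows (the model names and labels are invented; the paper's actual ensemble may combine its five members differently):

```python
from collections import Counter

def ensemble_vote(predictions):
    """predictions: {model_name: predicted_label} for one sample.
    The label predicted by the most base models wins."""
    return Counter(predictions.values()).most_common(1)[0][0]
```

Even this plain vote often beats its strongest member, because base models tend to make partially uncorrelated errors; weighted or stacked variants refine the same principle.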
- Published
- 2022
- Full Text
- View/download PDF
35. Classification and Visual Design Analysis of Network Expression Based on Big Data Multimodal Intelligence Technology.
- Author
-
Ping, Zou and Liu, Yueyan
- Subjects
DATABASES ,MULTIMODAL user interfaces ,ELECTRONIC data processing ,MODERN society ,CLASSIFICATION ,DATA analysis ,BIG data ,ACQUISITION of data - Abstract
The rapid development of the Internet in modern society has promoted the development of many different network platforms. In the context of big data, many types of multimodal data such as pictures, videos, and texts are generated on these platforms. Through the analysis of multimodal data, we can provide better services for users. The traditional big data analysis platform cannot achieve a completely stable state for the analysis of multimodal data, whereas the construction of a multimodal intelligent platform can achieve efficient analysis of the relevant data, so as to create greater economic benefits for society. This paper mainly studies the historical development trend of big data multimodal intelligence technology and the data processing methods of multimodal intelligence technology applied to network expression classification, including data acquisition, storage, and analysis. Finally, it studies the fusion algorithm between multimodal data and visual design, as well as the classification of network expressions and an analysis of the application results of visual design in big data multimodal intelligence technology. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
36. Data Mining tools and techniques in construction by Knowledge Areas: State of the Art situation.
- Author
-
Arba, Danilo
- Subjects
DATA mining ,PROJECT managers ,CONSTRUCTION ,INFORMATION technology ,BIG data - Abstract
Managing project controls, both from an owner's perspective and from a contractor organization's, is commonly divided, as per the Compendium of the Guild of Project Controls, into 12 modules that make extensive use of information technologies. These generate an incredible amount of data, making its analysis and organization crucial for effective initiating, planning, controlling, and closing in the construction sector. This paper aims to establish which data mining tools and techniques are best suited for use in construction to help maximize opportunities and reduce risks across the 12 modules of the Compendium of the Guild of Project Controls. Project managers and project controls professionals can leverage the validated data to make better and faster decisions with the greatest benefit to their projects. [ABSTRACT FROM AUTHOR]
- Published
- 2020
37. HTwitt: a hadoop-based platform for analysis and visualization of streaming Twitter data
- Author
-
Demirbaga, Umit
- Published
- 2023
- Full Text
- View/download PDF
38. Intelligent Processing and Classification of Multisource Health Big Data from the Perspective of Physical and Medical Integration.
- Author
-
Tang, Haiou
- Subjects
FEATURE selection ,BIG data ,COMPUTER science ,TECHNOLOGY assessment ,DATA scrubbing ,ELECTRONIC data processing ,INFORMATION technology ,CLASSIFICATION - Abstract
With the development of computer science and information technology, human society is gradually stepping into the era of the Internet and big data. With the support of big data technology, the medical and health industry can integrate and readjust existing resources, improve operational efficiency, and tap the industry's huge potential. However, medical data in the new era are massive, high-dimensional, and complex in structure and information content, which is not conducive to the direct classification of health data. Preprocessing health data can improve the quality of the dataset, reduce its size, and improve the efficiency and accuracy of classification. Based on this, and according to the characteristics of health datasets and existing preprocessing techniques, this paper analyzes and improves algorithms for abnormal-data detection and data reduction in the data-cleaning stage of preprocessing. It analyzes and studies feature selection algorithms based on Bayesian inference and focuses on feature selection based on random forests. To address the problem that the original algorithm ignored the relationship between the importance degrees of each feature within a single tree, a feature importance calculation method based on local importance is proposed. Experimental analysis and comparison show that the improved algorithm selects a better feature subset and improves the performance of the classification model. Then, TAN, BAN, and MBN classifiers were constructed based on preprocessed hypothyroidism data, and their performances were compared through experiments. The final results show that the BAN classifier has the best average classification effect. [ABSTRACT FROM AUTHOR]
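The abstract's "local importance" aggregation is not specified in detail; one plausible reading, normalizing importances within each tree before averaging across the forest, can be sketched as follows (an assumption for illustration, not the paper's exact formula):

```python
def aggregate_importance(per_tree_scores):
    """per_tree_scores: one {feature: importance} dict per tree.
    Normalize within each tree first (the 'local' view), then average,
    so no single tree dominates the forest-level ranking."""
    features = set().union(*per_tree_scores)
    totals = {f: 0.0 for f in features}
    for scores in per_tree_scores:
        norm = sum(scores.values()) or 1.0
        for f, s in scores.items():
            totals[f] += s / norm
    n = len(per_tree_scores)
    return {f: t / n for f, t in totals.items()}

def select_top_k(importance, k):
    # keep the k highest-scoring features as the selected subset
    return sorted(importance, key=importance.get, reverse=True)[:k]
```

The per-tree normalization is the point of departure from plain forest importances, which simply sum raw scores across trees.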
- Published
- 2022
- Full Text
- View/download PDF
39. Classification Model on Big Data in Medical Diagnosis Based on Semi-Supervised Learning.
- Author
-
Wang, Lei, Qian, Qing, Zhang, Qiang, Wang, Jishuai, Cheng, Wenbo, and Yan, Wei
- Subjects
DIAGNOSIS ,SUPERVISED learning ,MACHINE learning ,KNOWLEDGE base ,BIG data ,DATA modeling ,CLASSIFICATION ,MEDICAL laboratories - Abstract
Big data in medical diagnosis can provide abundant value for clinical diagnosis, decision support and many other applications, but obtaining a large number of labeled medical data takes a lot of time and manpower. In this paper, a classification model based on a semi-supervised learning algorithm using both labeled and unlabeled data is proposed to process big data in medical diagnosis, which includes structured, semi-structured and unstructured data. For medical laboratory data, this paper proposes a self-training algorithm based on a repeated-labeling strategy to solve the problem that mislabeled samples weaken the performance of classifiers. For medical record data, this paper first extracts features with high correlation to the classification results based on a domain expert knowledge base, then chooses the unlabeled medical record data with the highest confidence to expand the training set, and optimizes the performance of the classifiers of the tri-training algorithm, which uses a supervised learning algorithm to train three base classifiers. The experimental results show that the proposed medical diagnosis data classification model based on semi-supervised learning has good performance. [ABSTRACT FROM AUTHOR]
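A minimal self-training loop in the spirit of the abstract, using a deliberately simple nearest-centroid base learner and a confidence threshold for pseudo-labeling (the base learner, confidence measure, and threshold are illustrative assumptions; the paper's repeated-labeling strategy is more involved):

```python
def centroid_fit(X, y):
    # base learner: per-class mean (nearest centroid)
    cents = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        cents[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return cents

def centroid_predict(cents, x):
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    scores = sorted((dist(c, x), label) for label, c in cents.items())
    best, worst = scores[0], scores[-1]
    # crude confidence: how much closer the best class is than the worst
    confidence = 1.0 - best[0] / (best[0] + worst[0] + 1e-12)
    return best[1], confidence

def self_train(X_l, y_l, X_u, threshold=0.8, rounds=3):
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    for _ in range(rounds):
        cents = centroid_fit(X_l, y_l)
        keep = []
        for x in X_u:
            label, conf = centroid_predict(cents, x)
            if conf >= threshold:   # pseudo-label only confident points
                X_l.append(x); y_l.append(label)
            else:
                keep.append(x)
        X_u = keep
    return centroid_fit(X_l, y_l)

model = self_train([[0.0, 0.0], [10.0, 10.0]], [0, 1],
                   [[1.0, 1.0], [9.0, 9.0]])
```

Each round moves confidently labeled points from the unlabeled pool into the training set, so later rounds fit on progressively more data; the threshold guards against the mislabeling problem the paper targets.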
- Published
- 2022
- Full Text
- View/download PDF
40. A Survey of Data Normalization Methods (数据归一化方法综述).
- Author
-
杨寒雨, 赵晓永, and 王磊
- Subjects
ARTIFICIAL intelligence ,BIG data ,DEEP learning ,DATA mining ,CLASSIFICATION - Abstract
Copyright of Journal of Computer Engineering & Applications is the property of Beijing Journal of Computer Engineering & Applications Journal Co Ltd. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
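Two normalization methods that any survey of the topic covers, min-max scaling and z-score standardization, sketched in plain Python:

```python
import statistics

def min_max(values):
    # rescale to [0, 1]; constant inputs map to 0.0
    mn, mx = min(values), max(values)
    return [(v - mn) / (mx - mn) if mx > mn else 0.0 for v in values]

def z_score(values):
    # center on the mean, scale by the population standard deviation
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0
    return [(v - mu) / sigma for v in values]
```

Min-max bounds the range but is sensitive to outliers; z-score is unbounded but robust to a shifted scale, which is the usual trade-off such surveys discuss.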
- Published
- 2023
- Full Text
- View/download PDF
41. Big data and machine learning framework for clouds and its usage for text classification.
- Author
-
Pintye, István, Kail, Eszter, Kacsuk, Péter, and Lovas, Róbert
- Subjects
MACHINE learning ,CLASSIFICATION ,SCALABILITY ,UNIVERSITY research ,BIG data - Abstract
Reference architectures for big data and machine learning include not only interconnected building blocks but important considerations (among others) for scalability, manageability and usability issues as well. Leveraging on such reference architectures, the automated deployment of distributed toolsets and frameworks on various clouds is still challenging due to the diversity of technologies and protocols. The paper focuses particularly on the widespread Apache Spark cluster with Jupyter as the particularly addressed framework, and the Occopus cloud‐agnostic orchestrator tool for automating its deployment and maintenance stages. The presented approach has been demonstrated and validated with a new, promising text classification application on the Hungarian academic research infrastructure, the OpenStack‐based MTA Cloud. The paper explains the concept, the applied components, and illustrates their usage with real use‐case measurements. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
42. Parallelly Running and Privacy-Preserving k -Nearest Neighbor Classification in Outsourced Cloud Computing Environments.
- Author
-
Park, Jeongsu and Lee, Dong Hoon
- Subjects
K-nearest neighbor classification ,CLOUD computing ,DISCLOSURE ,BIG data ,CLASSIFICATION - Abstract
Classification is used in various areas, where k-nearest neighbor classification is the most popular as it produces efficient results. Cloud computing with powerful resources is one reliable option for handling large-scale data efficiently, but many companies are reluctant to outsource data due to privacy concerns. This paper aims to implement a privacy-preserving k-nearest neighbor classification (PkNC) in an outsourced environment. Existing work proposed a secure protocol (SkLE/SkSE) to privately compute the k data items with the largest/smallest values, but that work discloses information. Moreover, SkLE/SkSE requires a secure comparison protocol, and the existing comparison protocols also suffer from information disclosure. In this paper, we propose new secure comparison and SkLE/SkSE protocols to solve these information disclosure problems and implement PkNC with these novel protocols. Our proposed protocols disclose no information, and we prove their security formally. Then, through extensive experiments, we demonstrate that the PkNC applying the proposed protocols is also efficient. In particular, the PkNC is suitable for big data analysis handling large amounts of data, since our SkLE/SkSE is executed for each dataset in parallel. Although the proposed protocols do require efficiency sacrifices to improve security, the running time of our PkNC is still significantly more efficient compared with previously proposed PkNCs. [ABSTRACT FROM AUTHOR]
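Plaintext k-nearest neighbor classification, the operation that PkNC performs under encryption, reduces to a distance sort and a majority vote; a minimal (non-private) sketch for reference:

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """train: list of (point, label) pairs.
    Majority vote among the k training points nearest to x."""
    neighbours = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b")]
```

The privacy-preserving version must perform exactly this selection of the k smallest distances without the cloud learning the distances, the labels, or the query, which is what the SkLE/SkSE sub-protocols provide.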
- Published
- 2022
- Full Text
- View/download PDF
43. E3W—A Combined Model Based on GreedySoup Weighting Strategy for Chinese Agricultural News Classification.
- Author
-
Xiao, Zeyan, Yang, Senqi, Duan, Xuliang, Tang, Dezhao, Guo, Yan, and Li, Zhiyong
- Subjects
AGRICULTURAL technology ,NATURAL language processing ,BIG data ,CLASSIFICATION - Abstract
With the continuous development of the internet and big data, modernization and informatization are rapidly being realized in the agricultural field, and the volume of agricultural news is increasing accordingly. This explosion of agricultural news has made accurate access to it difficult and slowed the spread of news about agricultural technologies, hindering the development of agriculture. To address this problem, we apply NLP to agricultural news texts to classify the agricultural news, in order to ultimately improve the efficiency of agricultural news dissemination. We propose a classification model for Chinese short agricultural texts (E3W) based on ERNIE + DPCNN, ERNIE, EGC, and Word2Vec + TextCNN as sub-models, utilizing the GreedySoup weighting strategy and multi-model combination; specifically, E3W consists of four sub-models whose outputs are processed using the GreedySoup weighting strategy. In the E3W model, we divide the classification process into two steps: in the first step, the text is passed through the four independent sub-models to obtain an initial classification result from each sub-model; in the second step, the model considers the relationship between these initial results and the sub-models and assigns weights to them. The category with the highest weight is used as the output of E3W. To fully evaluate the effectiveness of the E3W model, accuracy, precision, recall, and F1-score are used as evaluation metrics in this paper. We conduct multiple sets of comparative experiments on a self-constructed agricultural dataset, comparing E3W with its sub-models, and perform ablation experiments. The results demonstrate that the E3W model improves average accuracy by 1.02%, average precision by 1.62%, average recall by 1.21%, and average F1-score by 1.02%.
Overall, E3W can achieve state-of-the-art performance in Chinese agricultural news classification. [ABSTRACT FROM AUTHOR]
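The abstract's second step amounts to summing weights behind each sub-model's predicted label and emitting the heaviest label; a minimal sketch of that weighted vote (the sub-model names and weights below are invented, and the actual GreedySoup weight-assignment procedure is not reproduced here):

```python
def weighted_vote(outputs, weights):
    """outputs: {sub_model: predicted_label}; weights: {sub_model: weight}.
    Sum the weight behind each label and return the heaviest label."""
    totals = {}
    for model, label in outputs.items():
        totals[label] = totals.get(label, 0.0) + weights.get(model, 0.0)
    return max(totals, key=totals.get)
```

With all weights equal this degenerates to majority voting; the learned weights let stronger sub-models such as the ERNIE variants overrule the weaker ones.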
- Published
- 2022
- Full Text
- View/download PDF
44. Two-step learning for crowdsourcing data classification.
- Author
-
Yu, Hao, Li, Jiaye, Wu, Zhaojiang, Xu, Hang, and Zhu, Lei
- Subjects
CROWDSOURCING ,CLASSIFICATION algorithms ,MACHINE learning ,BIG data ,KNOWLEDGE workers - Abstract
Crowdsourcing learning (Bonald and Combes 2016; Dawid and Skene, J R Stat Soc: Series C (Appl Stat), 28(1):20–28 1979; Karger et al. 2011; Li et al, IEEE Trans Knowl Data Eng, 28(9):2296–2319 2016; Liu et al. 2012; Schlagwein and Bjorn-Andersen, J Assoc Inform Syst, 15(11):3 2014; Zhang et al. 2014) plays an increasingly important role in the era of big data (Liu et al., IEEE Trans Syst Man Cybern: Syst, 48(12): 451–2461, 2017; Zhang et al. 2014) due to its ability to easily solve large-scale data annotation (Musen et al., J Amer Med Inform Assoc, 22(6):1148–1152 2015). However, in the process of crowdsourcing learning, the uneven knowledge level of workers often leads to low label accuracy after marking, which brings difficulties to the subsequent processing (Edwards and Teddy 2013) and analysis of crowdsourced data. To solve this problem, this paper proposes a two-step learning crowdsourced data classification algorithm, which optimizes the original label data by simultaneously considering two issues, the differing abilities of workers and the similarity between crowdsourced data samples (Kasikci et al. 2013), so as to obtain more accurate label data. The algorithm comprises two steps. First, each worker's ability to label different samples is obtained by constructing and training a worker-ability model; then, the similarity between samples is calculated by the cosine measure (Muflikhah and Baharudin 2009), and finally the original label data are optimized by combining these two results. The experimental results show that the proposed two-step learning classification algorithm achieves better results than the comparison algorithms. [ABSTRACT FROM AUTHOR]
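The cosine measure used in the second step for sample similarity is standard; a minimal sketch:

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two feature vectors; 0.0 for a zero vector
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because it depends only on direction, not magnitude, cosine similarity lets samples of different scale still count as similar, which suits comparing crowdsourced items whose feature magnitudes vary by worker.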
- Published
- 2022
- Full Text
- View/download PDF
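The two-step procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's actual method: worker ability is approximated here as agreement with a per-sample majority vote, and all function names (`worker_ability`, `two_step_labels`) are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    # cosine similarity between two feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def worker_ability(labels):
    # labels[i][w] = label that worker w assigned to sample i;
    # ability is approximated as agreement with the per-sample majority vote
    majority = [max(set(row), key=row.count) for row in labels]
    n_workers = len(labels[0])
    return [sum(row[w] == majority[i] for i, row in enumerate(labels)) / len(labels)
            for w in range(n_workers)]

def two_step_labels(features, labels):
    # step 1: ability-weighted vote over each sample's worker labels
    ability = worker_ability(labels)
    def weighted_vote(row, weights):
        scores = {}
        for lab, w in zip(row, weights):
            scores[lab] = scores.get(lab, 0.0) + w
        return max(scores, key=scores.get)
    first_pass = [weighted_vote(row, ability) for row in labels]
    # step 2: refine each label using cosine-similar neighbours
    refined = []
    for i, x in enumerate(features):
        scores = {}
        for j, y in enumerate(features):
            sim = 1.0 if i == j else max(cosine(x, y), 0.0)
            scores[first_pass[j]] = scores.get(first_pass[j], 0.0) + sim
        refined.append(max(scores, key=scores.get))
    return refined
```

On a toy set where a weak worker mislabels two samples, the ability weighting and neighbour smoothing together recover the majority-consistent labels.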
45. Hybrid approaches to optimization and machine learning methods: a systematic literature review.
- Author
-
Azevedo, Beatriz Flamia, Rocha, Ana Maria A. C., and Pereira, Ana I.
- Subjects
LITERATURE reviews ,BIG data ,MACHINE learning ,DATABASES ,SWOT analysis ,DOCUMENT clustering - Abstract
Notably, real problems are increasingly complex and require sophisticated models and algorithms capable of quickly dealing with large data sets and finding optimal solutions. However, no method or algorithm is perfect; all have limitations, which can be mitigated or eliminated by combining the strengths of different methodologies. The aim is therefore to develop hybrid algorithms that exploit the potential and particularities of each method (optimization and machine learning), integrating the methodologies to make them more efficient. This paper presents an extensive systematic and bibliometric literature review of hybrid methods that combine optimization and machine learning techniques for clustering and classification. It aims to identify the potential of methods and algorithms to overcome the difficulties of one or both methodologies when combined. After describing the optimization and machine learning methods, a numerical overview of the works published since 1970 is presented, followed by an in-depth state-of-the-art review of the last three years. Furthermore, a SWOT analysis of the ten most cited algorithms in the collected database is performed, investigating the strengths and weaknesses of the pure algorithms and highlighting the opportunities and threats that have been explored with hybrid methods. This investigation highlights the most notable works and discoveries involving hybrid methods for clustering and classification, and also points out the difficulties of the pure methods and algorithms that can be alleviated by drawing on other methodologies, that is, through hybrid methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Data to the people: a review of public and proprietary data for transport models.
- Author
-
Mahajan, Vishal, Kuehnel, Nico, Intzevidou, Aikaterini, Cantelmo, Guido, Moeckel, Rolf, and Antoniou, Constantinos
- Subjects
DATA modeling ,ROUTE choice ,CLASSIFICATION ,SMART cards ,CELL phones ,DATA release - Abstract
Data play an indispensable role in transport modelling. The availability of data from non-conventional sources, such as mobile phones, social media, and public transport smart cards, changes the way we conduct mobility analyses and travel forecasting. Existing studies have demonstrated the many and varied applications of these emerging data in transport modelling. The transferability of current research and of further endeavours depends mostly on the availability of these data. Therefore, the openness or public availability of the prominent data for transport modelling needs to be adequately investigated. Such a discussion should also cover the application aspects of these data to provide a holistic overview. This paper defines a typology for data classification based on a set of availability or openness attributes from the existing literature. Subsequently, we use the developed typology to classify the prominent transport data into four categories: (i) Commercial data, (ii) Inaccessible data, (iii) Gratis and accessible data with restricted use, and (iv) Open data. Using this typology, we conclude that public data, which refer to data that are accessible and free of cost, are a superset of open data. Further, we discuss the applications and limitations of the selected data in transport modelling and highlight the task(s) in which certain data excel. Lastly, we synthesise our review using a Strengths, Weaknesses, Opportunities and Threats (SWOT) analysis to bring out the aspects relevant to data owners and data consumers. Public availability of data can help in various modelling steps such as trip generation, accessibility, destination choice, route choice, and network modelling. Complementary datasets such as General Transit Feed Specification (GTFS) and Volunteered Geographic Information (VGI) increase the usability of other data; thus, modellers can gain from the positive cascade effect by prioritising these data. There is also potential for data owners to release proprietary data, such as mobile phone data, under restricted-use licenses after addressing privacy risks. Our study contributes by dealing with two problems at the same time: on the one hand, the paper analyses existing data based on their potential for mobility studies; on the other hand, we classify them based on how open they are. Hence, we identify the most promising public data for developing the next generation of transport models. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
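The four-category typology and the "public data is a superset of open data" claim above reduce to a small decision rule. A minimal sketch, assuming three boolean openness attributes (`costs_money`, `accessible`, `restricted_use`) as stand-ins for the paper's actual attribute set:

```python
def classify_dataset(costs_money: bool, accessible: bool, restricted_use: bool) -> str:
    # map openness attributes onto the review's four categories
    if costs_money:
        return "Commercial data"
    if not accessible:
        return "Inaccessible data"
    if restricted_use:
        return "Gratis and accessible data with restricted use"
    return "Open data"

def is_public(costs_money: bool, accessible: bool) -> bool:
    # "public data" = accessible and free of cost (a superset of open data)
    return accessible and not costs_money
```

Under this rule, every dataset classified as "Open data" also satisfies `is_public`, matching the superset relation stated in the abstract.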
47. Students' Orientation Using Machine Learning and Big Data.
- Author
-
Ouatik, Farouk, Erritali, Mohammed, Ouatik, Fahd, and Jourhmane, Mostafa
- Subjects
MACHINE learning ,RANDOM forest algorithms ,BIG data ,INFORMATION storage & retrieval systems ,STUDENTS ,PUBLIC institutions - Abstract
Orientation of students in public institutions, i.e., choosing their academic paths or an appropriate specialization, is important for students to continue their studies smoothly throughout their school careers. We therefore set out to make the orientation process automatic and individualized, relying on an information system built on Big Data technology that enables us to process the information collected for each student (marks, number of absences in each subject, and personal inclinations). We then applied machine learning algorithms to assign the appropriate specialization to each student. In this paper, we compare the accuracy and execution time of the following algorithms: Naïve Bayes, SVM, Random Forest and Neural Network, and find that Naïve Bayes is the best fit for this system. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
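The comparison methodology above (accuracy plus execution time per classifier) can be sketched without the paper's Big Data stack. This toy harness is an assumption-laden stand-in: it implements only Gaussian Naive Bayes and a 1-nearest-neighbour baseline in pure Python, not the SVM, Random Forest, or Neural Network the paper evaluated, and all function names are hypothetical.

```python
import time
from math import log, pi

def gnb_fit(X, y):
    # per-class feature means, variances, and priors for Gaussian Naive Bayes
    stats = {}
    for c in set(y):
        rows = [x for x, yc in zip(X, y) if yc == c]
        cols = list(zip(*rows))
        means = [sum(col) / len(rows) for col in cols]
        var = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
               for col, m in zip(cols, means)]
        stats[c] = (means, var, len(rows) / len(X))
    return stats

def gnb_predict(stats, x):
    def loglike(c):
        means, var, prior = stats[c]
        return log(prior) + sum(-0.5 * log(2 * pi * v) - (xi - m) ** 2 / (2 * v)
                                for xi, m, v in zip(x, means, var))
    return max(stats, key=loglike)

def nn_predict(X, y, x):
    # 1-nearest-neighbour by squared Euclidean distance
    return min((sum((a - b) ** 2 for a, b in zip(row, x)), yc)
               for row, yc in zip(X, y))[1]

def compare(X_tr, y_tr, X_te, y_te):
    # record (accuracy, wall-clock seconds) for each classifier
    results = {}
    t0 = time.perf_counter()
    stats = gnb_fit(X_tr, y_tr)
    acc = sum(gnb_predict(stats, x) == t for x, t in zip(X_te, y_te)) / len(y_te)
    results["naive_bayes"] = (acc, time.perf_counter() - t0)
    t0 = time.perf_counter()
    acc = sum(nn_predict(X_tr, y_tr, x) == t for x, t in zip(X_te, y_te)) / len(y_te)
    results["nearest_neighbour"] = (acc, time.perf_counter() - t0)
    return results
```

On well-separated toy clusters both classifiers reach full accuracy, so the comparison then comes down to execution time, as in the paper's evaluation.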
48. An efficient big data classification using elastic collision seeker optimization based faster R-CNN.
- Author
-
Chidambaram, S., Cyril, C. Pretty Diana, and Ganesh, S. Sankar
- Subjects
ELASTIC scattering ,BIG data ,MACHINE learning ,CLASSIFICATION - Abstract
Big data refers to very large data sets, drawn from myriad sources, whose analysis requires substantial computation, and big data systems must cope with various challenges in processing such volumes. To handle issues arising from large-scale databases, a MapReduce framework is employed, which provides a robust and simple infrastructure for huge datasets. This paper proposes a novel Elastic Collision Seeker Optimization based Faster R-CNN (ECSO-FRCNN) classifier for efficient big data classification. The proposed ECSO-FRCNN classifier is capable of handling missing attributes and incremental learning, and improves training performance effectively. As the proposed technique deals with large data samples, it necessitates the inclusion of the MapReduce framework. The adoption of the MapReduce design in big data classification protects the classification results against uncertainties such as data redundancy, misclassification, and storage issues. The proposed method is examined on three standard datasets, namely the skin segmentation, mushroom, and localization datasets, collected from the University of California, Irvine (UCI) machine learning repository. Finally, extensive experimental analysis is carried out over various parameters to demonstrate the efficiency of the system. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
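The ECSO-FRCNN classifier itself cannot be reconstructed from the abstract, but the MapReduce pattern it relies on is simple to illustrate. A minimal sketch, with a trivial threshold rule standing in for the real classifier and all names (`chunk`, `map_phase`, `reduce_phase`) being hypothetical:

```python
from collections import Counter
from functools import reduce

def chunk(data, size):
    # split the dataset into mapper-sized chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_phase(chunks, classify):
    # each mapper classifies its chunk independently and emits class counts
    return [Counter(classify(x) for x in c) for c in chunks]

def reduce_phase(partials):
    # the reducer merges the partial class counts into one tally
    return reduce(lambda a, b: a + b, partials, Counter())

def mapreduce_classify(data, classify, chunk_size=2):
    return reduce_phase(map_phase(chunk(data, chunk_size), classify))
```

Because each mapper touches only its own chunk, the map phase parallelises across nodes; the reduce phase merges bounded-size count dictionaries rather than raw records, which is what makes the design scale to large datasets.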
49. Integration and classification approach based on probabilistic semantic association for big data.
- Author
-
VandanaKolisetty, Vishnu and Rajput, Dharmendra Singh
- Subjects
BIG data ,DATA integration ,CLASSIFICATION ,DATA mapping ,DATA analysis ,DATA modeling - Abstract
The process of integration through classification provides a unified representation of diverse data sources in big data. The main challenges of big data analysis stem from varying granularities, irreconcilable data models, and multipart interdependencies between data content. Previously designed models faced problems in integrating and analyzing big data due to highly complex and dynamic multi-source, heterogeneous information variation, and in processing and classifying the associations among the attributes in a schema. In this paper, we propose an integration and classification approach based on a Probabilistic Semantic Association (PSA) method that generates feature patterns for big data sources. The PSA approach is trained to understand the association and dependency patterns between the data classes and incoming data so as to map data objects accurately. It first builds a data integration mechanism by transforming the data into a structured form and learns to utilize the trained knowledge to classify the probabilistic associations among the data and knowledge patterns. It then builds a data analysis mechanism that analyzes the data mapped through PSA to evaluate the integration efficiency. An experimental evaluation is performed on a real-time crime dataset generated from multiple locations with various event classes. The analysis of results confirms that utilizing knowledge patterns of accurate classification to enhance the integration of multiple-source data is appropriate. The measures of precision, recall, fall-out rate, and F-measure confirm the efficiency of the proposed PSA method. Comparisons with a state-of-the-art classification method and with the SC-LDA algorithm also show an improvement in prediction accuracy and enhanced data integration. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
50. Improving classification and clustering techniques using GPUs.
- Author
-
Jararweh, Yaser, Shehab, Mohammed A., Yaseen, Qussai, and Al‐Ayyoub, Mahmoud
- Subjects
CLASSIFICATION algorithms ,SOCIAL network analysis ,GRAPHICS processing units ,CLASSIFICATION ,BIG data - Abstract
Summary: Classification and clustering techniques are used in many applications. Large‐scale big data applications, such as social network analysis, need to process large data chunks in a short time, and classification and clustering tasks in such applications consume a lot of processing time. Improving the performance of classification and clustering algorithms therefore enhances the performance of the applications that use them. This paper introduces an approach for exploiting the graphics processing unit (GPU) platform to improve the performance of classification and clustering algorithms. The proposed approach uses two GPU implementations: a pure (GPU‐only) implementation and a GPU‐CPU hybrid implementation. The results show that the hybrid implementation, which optimizes subtask scheduling across both the CPU and the GPU processing elements, outperforms the approach that uses only the GPU. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
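The hybrid implementation's key idea, splitting subtasks between the CPU and GPU in proportion to their throughput, can be simulated without GPU code. This is a heavily hedged stdlib sketch: the two "devices" are plain Python callables run on threads, the speed ratio is an invented parameter, and the real system would dispatch CUDA/OpenCL kernels instead.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_split(tasks, gpu_speed, cpu_speed):
    # assign subtasks in proportion to each device's relative throughput
    k = round(len(tasks) * gpu_speed / (gpu_speed + cpu_speed))
    return tasks[:k], tasks[k:]

def run_hybrid(tasks, gpu_fn, cpu_fn, gpu_speed=4.0, cpu_speed=1.0):
    # run the GPU-bound and CPU-bound partitions concurrently and merge results
    gpu_part, cpu_part = hybrid_split(tasks, gpu_speed, cpu_speed)
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_future = pool.submit(lambda: [gpu_fn(t) for t in gpu_part])
        cpu_future = pool.submit(lambda: [cpu_fn(t) for t in cpu_part])
        return gpu_future.result() + cpu_future.result()
```

With a 4:1 speed ratio, 80% of the subtasks go to the faster device, so neither processing element sits idle while the other finishes, which is the scheduling effect the paper credits for the hybrid version's speedup.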