236 results on '"Big data processing"'
Search Results
2. A Spark-based high utility itemset mining with multiple external utilities
- Author
-
Dharavath Ramesh, Krishan Kumar Sethi, and Munesh Chandra Trivedi
- Subjects
Big data processing ,Profit (accounting) ,Computer Networks and Communications ,Computer science ,business.industry ,Big data ,Load distribution ,Space (commercial competition) ,computer.software_genre ,Data mining algorithm ,Spark (mathematics) ,Data mining ,business ,computer ,Computer communication networks ,Software - Abstract
High utility itemset (HUI) mining is a powerful data mining technique for discovering profitable patterns. The utility of an item is computed using two measures: quantity and per-unit profit. All existing HUI mining algorithms assume a single value of external utility (per-unit profit) for the entire database. However, the per-unit profit of items may fluctuate over time in many applications. This research introduces three novel strategies for incorporating multiple external utilities of items as input to the HUI mining algorithm. Traditional HUI mining algorithms were developed for standalone systems and are not suited to big data processing due to limited computing resources (CPU, memory). Big data are efficiently processed on distributed frameworks such as Apache Hadoop and Spark. This paper introduces a distributed HUI mining algorithm named the Spark-based Top-k high utility itemset (k-SHUI) miner. We also propose a fair load distribution strategy to divide the search space equally among the cluster nodes. The k-SHUI miner produces top-k HUIs without requiring a minimum utility threshold. We conducted extensive experiments on six real-life datasets to compare the proposed algorithm's performance with existing algorithms. The experimental results demonstrate that the proposed algorithm outperforms the existing algorithms.
- Published
- 2021
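The core utility computation the abstract describes (quantity times per-unit profit, summed over supporting transactions, with the k highest-utility itemsets retained) can be sketched in a few lines. This is an illustrative toy, not the paper's k-SHUI miner: the data and item names are invented, and the brute-force enumeration here stands in for the distributed, pruned search the real algorithm runs on Spark.

```python
import heapq
from itertools import combinations

transactions = [         # item -> quantity purchased (invented data)
    {"a": 2, "b": 1},
    {"a": 1, "c": 3},
    {"b": 2, "c": 1, "a": 1},
]
profit = {"a": 5, "b": 3, "c": 1}  # external utility (per-unit profit)

def itemset_utility(itemset, transactions, profit):
    """Utility = sum over supporting transactions of quantity * profit."""
    total = 0
    for t in transactions:
        if all(i in t for i in itemset):
            total += sum(t[i] * profit[i] for i in itemset)
    return total

def top_k_huis(transactions, profit, k):
    """Brute-force top-k high utility itemsets (exponential; toy only)."""
    items = sorted(profit)
    scored = [(itemset_utility(s, transactions, profit), s)
              for r in range(1, len(items) + 1)
              for s in combinations(items, r)]
    return heapq.nlargest(k, scored)

print(top_k_huis(transactions, profit, 2))  # → [(24, ('a', 'b')), (20, ('a',))]
```

Note that, as the abstract emphasizes, no minimum utility threshold is needed: the top-k selection replaces it.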
3. ANN-inspired Straggler Map Reduce Detection in Big Data Processing
- Author
-
Ajay Kumar Bansal, Manmohan Sharma, and Ashu Gupta
- Subjects
Big data processing ,Computer science ,General Chemical Engineering ,Map reduce ,General Materials Science ,Data mining ,computer.software_genre ,computer ,Industrial and Manufacturing Engineering - Abstract
One of the most challenging aspects of using MapReduce to parallelize and distribute large-scale data processing is detecting straggler tasks, i.e., identifying ongoing tasks on weak nodes. The total computation time is the sum of the execution times of the two stages in the Map phase (copy, combine) and the three stages in the Reduce phase (shuffle, sort, and reduce). The main aim of this paper is to estimate the accurate execution time at each location. The proposed approach uses a backpropagation neural network on Hadoop to detect straggler tasks and calculate the remaining task execution time, which is crucial in straggler task identification. A comparative analysis is done against efficient models in this domain, such as LATE and ESAMR, and against the real remaining time for the WordCount and Sort benchmarks. The proposed model was found capable of detecting straggler tasks by accurately estimating execution time, and it also helps reduce the time taken to complete a task.
- Published
- 2021
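The remaining-time estimate that drives straggler detection can be sketched with the simple progress-rate heuristic used by LATE-style schedulers (which the paper compares against). The task data and the 2x-mean flagging rule below are invented for illustration; the paper's model replaces this heuristic with a backpropagation neural network.

```python
def remaining_time(progress, elapsed):
    """LATE-style heuristic: progress rate = progress / elapsed,
    estimated remaining time = (1 - progress) / rate."""
    rate = progress / elapsed
    return (1.0 - progress) / rate

def stragglers(tasks, factor=2.0):
    """Flag tasks whose estimated remaining time exceeds `factor`
    times the mean estimate across all running tasks (invented rule)."""
    est = {tid: remaining_time(p, e) for tid, (p, e) in tasks.items()}
    mean = sum(est.values()) / len(est)
    return sorted(tid for tid, r in est.items() if r > factor * mean)

tasks = {  # task id -> (fraction complete, seconds elapsed)
    "t1": (0.9, 9.0),   # ~1 s left
    "t2": (0.8, 8.0),   # ~2 s left
    "t3": (0.1, 10.0),  # ~90 s left: the straggler
}
print(stragglers(tasks))  # → ['t3']
```

The heuristic's weakness, a task's past rate may not predict its future rate, is exactly what motivates learning-based estimators like the one proposed.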
4. Big data processing technology in a distributed information system
- Author
-
M.A. Sambetbayeva, S. K. Serikbayeva, and J.A. Tussupov
- Subjects
Big data processing ,Database ,Computer science ,Information system ,computer.software_genre ,computer - Published
- 2021
5. Development of technology for controlling access to digital portals and platforms based on estimates of user reaction time built into the interface
- Author
-
S. G. Magomedov, P. V. Kolyasnikov, and E. V. Nikulchev
- Subjects
Information theory ,information security ,Computer science ,Interface (computing) ,Big data ,0211 other engineering and technologies ,security event management ,Access control ,digital platforms ,02 engineering and technology ,computer.software_genre ,Security information and event management ,Information security management ,0202 electrical engineering, electronic engineering, information engineering ,Information system ,Q350-390 ,General Environmental Science ,021110 strategic, defence & security studies ,user behavior analysis ,Multimedia ,business.industry ,access control ,Information security ,big data processing ,General Earth and Planetary Sciences ,020201 artificial intelligence & image processing ,business ,Psychomotor reaction time ,computer ,computing systems - Abstract
The paper addresses the development of technology for controlling access to digital portals and platforms based on assessments of personal characteristics of user behavior built into the interface. Distributed digital platforms and portals that use personal data collect and process big data with specialized applications over computer networks. In accordance with the law, the data are stored on internal corporate servers and in data centers. Special attention is paid to the tasks of differentiating and controlling access in modern information systems: wide availability and mass-scale services should be accompanied by more careful control and user verification. Access control for such systems cannot be ensured through information security technologies and tools alone; efficiency can be increased through software and hardware architectural solutions. The paper proposes to extend SIEM (security information and event management) technology, which combines security event management and information security management, with blocks for user behavior analysis. As a characteristic that can be measured without overloading communication channels and is independent of the type of device used, psychomotor reaction time is proposed, measured as the time to perform actions with the interface. A technological solution has been developed for implementation in a wide range of digital platforms: banking, medical, educational, etc. Results of experimental research using a digital platform for mass psychological research are presented. For the research, data from a mass survey were used, in which respondents answered questions about their level of education by choosing from the available options. Analysis of the reaction time data showed the possibility of standardization and consistent indicators for specific users when answering different questions.
- Published
- 2020
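A minimal sketch of the behavioral check described above: compare a measured reaction time against the user's stored baseline with a z-score rule. The baseline data and the 3-sigma threshold are invented for this sketch; the paper's SIEM integration is far richer.

```python
import statistics

def reaction_time_anomalous(baseline_ms, new_ms, z_threshold=3.0):
    """Flag an interaction if the measured psychomotor reaction time is
    more than z_threshold standard deviations from the user's baseline.
    Baseline data and threshold are invented for this sketch."""
    mean = statistics.mean(baseline_ms)
    sd = statistics.stdev(baseline_ms)
    return abs(new_ms - mean) / sd > z_threshold

baseline = [310, 295, 305, 300, 290]  # user's historical reaction times (ms)
print(reaction_time_anomalous(baseline, 302))  # → False (matches baseline)
print(reaction_time_anomalous(baseline, 700))  # → True (possible impostor/bot)
```

Because only timestamps of interface actions are needed, such a check adds essentially no load to communication channels, which is the property the paper highlights.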
6. Protecting Machine Learning Integrity in Distributed Big Data Networking
- Author
-
Yan Zhang, Mingyue Xiao, Yunkai Wei, Sabita Maharjan, and Yijin Chen
- Subjects
Scheme (programming language) ,Big data processing ,Network control ,Computer Networks and Communications ,System integrity ,business.industry ,Computer science ,Big data ,020206 networking & telecommunications ,02 engineering and technology ,Machine learning ,computer.software_genre ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,Architecture ,business ,computer ,Software ,Information Systems ,computer.programming_language - Abstract
A distributed big data network is the integration of big data and the underlying distributed network. This emerging paradigm brings the potential to divide big data processing tasks into smaller ones so that they can be intelligently processed in parallel with machine learning based on distributed network resources. Such a pattern requires strict system integrity, especially machine learning integrity against data tampering or network control by malicious nodes. In this article, we propose a secure architecture consisting of one HaSi scheme and two data tampering detection schemes for protecting machine learning integrity in distributed big data networking. Illustrative results demonstrate the effectiveness of our proposed schemes and show that they can ensure learning accuracy even when 30-40 percent of processing nodes are maliciously controlled. When the figure rises to 40-50 percent, the accuracy of our proposed schemes begins to fall visibly, but it still outperforms the unprotected scenario by up to 70-80 percent.
- Published
- 2020
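One basic way to detect tampering in this setting, replicating a sub-task across several nodes and trusting the majority answer, can be sketched as below. This is a toy analogue under invented data, not the paper's HaSi or detection schemes, which are considerably more elaborate.

```python
from collections import Counter

def majority_result(node_results):
    """Accept the value reported by the majority of redundant workers;
    nodes disagreeing with the majority are flagged as suspect."""
    winner, _ = Counter(node_results.values()).most_common(1)[0]
    suspects = sorted(n for n, v in node_results.items() if v != winner)
    return winner, suspects

# The same sub-task replicated on five nodes; two report a tampered value.
results = {"n1": 0.82, "n2": 0.82, "n3": 0.13, "n4": 0.82, "n5": 0.13}
print(majority_result(results))  # → (0.82, ['n3', 'n5'])
```

This also illustrates why accuracy degrades once 40-50 percent of nodes are malicious: as the tampered fraction approaches half, majority-style defenses lose their margin.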
7. Cloud computing model for big data processing and performance optimization of multimedia communication
- Author
-
Liang Zhao and Zhicheng Zhou
- Subjects
Big data processing ,Artificial neural network ,Multimedia ,Computer Networks and Communications ,Computer science ,business.industry ,020206 networking & telecommunications ,Cloud computing ,02 engineering and technology ,computer.software_genre ,0202 electrical engineering, electronic engineering, information engineering ,Cluster (physics) ,020201 artificial intelligence & image processing ,business ,computer - Abstract
In this paper, the technical aspects of big data processing for multimedia communication are studied in depth. First, we propose a cloud implementation of a radial basis function neural network based on MapReduce on a cloud computing cluster. Second, to meet the needs of big data processing in multimedia communication, a MapReduce-based error back-propagation algorithm is trained, providing an effective mapping mechanism for multi-layer neural networks. For the cloud algorithm on a cloud computing cluster and a serial algorithm on a single processor, the time required to execute the algorithm is theoretically derived, and the cloud algorithm's performance parameters (acceleration ratio, and the optimal and minimum numbers of data nodes) are evaluated. Finally, the experimental results show that, compared with existing algorithms, the proposed cloud algorithm achieves better acceleration, faster convergence, and fewer iterations.
- Published
- 2020
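The acceleration ratio and the optimal/minimum number of data nodes mentioned above follow from an Amdahl-style model: only the parallelisable portion of the work scales with the cluster. A minimal sketch, with an assumed 10% serial fraction (the paper derives its own, more detailed timing model):

```python
def speedup(serial_fraction, nodes):
    """Amdahl-style acceleration ratio: only the parallelisable part
    (1 - serial_fraction) of the job scales with the number of data nodes."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

def smallest_cluster_for(target, serial_fraction, max_nodes=1024):
    """Minimum number of data nodes achieving a target speedup, if any."""
    for n in range(1, max_nodes + 1):
        if speedup(serial_fraction, n) >= target:
            return n
    return None

# Assumed 10% serial fraction, purely for illustration.
print(speedup(0.1, 8))                 # ~4.7x on 8 nodes
print(smallest_cluster_for(4.5, 0.1))  # → 8
```

The diminishing returns of `speedup` as `nodes` grows is what makes an "optimal number of data nodes" a meaningful quantity to evaluate.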
8. Adapting Market-Oriented Policies for Scheduling Divisible Loads on Clouds
- Author
-
Mimi Liza Abdul Majid and Suriayati Chuprat
- Subjects
Big data processing ,Divisible load theory ,Computer Networks and Communications ,Computer science ,business.industry ,Distributed computing ,Cloud computing ,02 engineering and technology ,Cloud service provider ,computer.software_genre ,020202 computer hardware & architecture ,Scheduling (computing) ,Hardware and Architecture ,Virtual machine ,Market oriented ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,business ,computer - Abstract
Cloud computing has become an important alternative for solving big data processing. Nowadays, cloud service providers usually offer users a virtual machine with various combinations of prices. As each user has different circumstances, the problem of choosing the cost-minimized combination under a deadline constraint as well as user's preference is becoming more complex. This article is concerned with the investigation of adapting a user's preference policies for scheduling real-time divisible loads in a cloud computing environment. The workload allocation approach used in this research is using Divisible Load Theory. The proposed algorithm aggregates resources into groups and optimally distributes the fractions of load to the available resources according to user's preference. The proposed algorithm was evaluated by simulation experiments and compared with the baseline approach. The result obtained from the proposed algorithm reveals that a significant reduction in computation cost can be attained when the user's preferences are low priority.
- Published
- 2020
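The Divisible Load Theory allocation the abstract relies on can be sketched as a proportional split: each resource receives a fraction of the load matching its processing speed, so all resources finish together. The VM names and numbers below are invented; the paper's algorithm additionally groups resources and weighs cost against the user's deadline and preference.

```python
def distribute_load(total_load, speeds):
    """Divisible-load split: each resource gets a fraction of the load
    proportional to its speed, so all finish at (roughly) the same time."""
    total_speed = sum(speeds.values())
    return {node: total_load * s / total_speed for node, s in speeds.items()}

speeds = {"vm_small": 1.0, "vm_medium": 2.0, "vm_large": 5.0}  # invented
shares = distribute_load(800.0, speeds)
print(shares)  # → {'vm_small': 100.0, 'vm_medium': 200.0, 'vm_large': 500.0}
print({n: shares[n] / speeds[n] for n in shares})  # equal finish times
```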
9. The Experimental Study of Performance Impairment of Big Data Processing in Dynamic and Opportunistic Environments
- Author
-
Wei Li and William W. Guo
- Subjects
Big data processing ,Computer science ,business.industry ,Performance impairment ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,Machine learning ,computer.software_genre ,computer - Abstract
In contrast to HPC clusters, when big data is processed in a distributed, and particularly a dynamic and opportunistic, environment, the overall performance can be impaired and even bottlenecked by the dynamics of the overlay and the opportunism of the computing nodes. The dynamics and opportunism are caused by churn and the unreliability of a generic distributed environment, and they cannot be ignored or avoided. Understanding the impact factors, their impact strength, and the relevance between these impacts is the foundation of potential optimization. This paper derives its research background, methodology, and results by reasoning about the necessity of distributed environments for big data processing, scrutinizing the dynamics and opportunism of distributed environments, classifying impact factors, proposing evaluation metrics, and carrying out a series of intensive experiments. The result analysis provides important insights into the impact strength of the factors and the relevance of impact across the factors. These results aim to pave the way toward future optimization or the avoidance of potential bottlenecks for big data processing in distributed environments.
- Published
- 2020
10. Big Data Management System Security Threat Model
- Author
-
Maxim O. Kalinin, Maria A. Poltavtseva, and Dmitry P. Zegzhda
- Subjects
Big data processing ,021110 strategic, defence & security studies ,business.industry ,Computer science ,Big data management ,Data management ,0211 other engineering and technologies ,02 engineering and technology ,Information security ,Computer security ,computer.software_genre ,Control and Systems Engineering ,Signal Processing ,Computer data storage ,Threat model ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,business ,computer ,Software - Abstract
The article considers the concept and features of Big Data management systems and their differences from traditional DBMS’s. The authors describe changes in the intruder model and new vulnerabilities in data management systems. A new threat model is developed. The article presents new problems of information security in a distributed Big Data processing and storage system.
- Published
- 2019
11. A Consistent Approach to Building Secure Big Data Processing and Storage Systems
- Author
-
Maria A. Poltavtseva
- Subjects
Big data processing ,021110 strategic, defence & security studies ,Database ,Computer science ,Big data management ,0211 other engineering and technologies ,02 engineering and technology ,computer.software_genre ,Control and Systems Engineering ,Signal Processing ,Management system ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Architecture ,computer ,Software - Abstract
This article considers the solution to the problem of building secure Big Data management systems using a consistent approach. The concept and features of Big Data management systems and their differences from traditional DBMS’s are presented. The principles of a new, consistent approach to building secure Big Data management systems are given and substantiated. The security subsystem architecture is proposed.
- Published
- 2019
12. An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster
- Author
-
M.A. Rashid, Teo Susnjak, Andre L. C. Barczak, and Nasim Ahmed
- Subjects
Technology ,System deployment ,Apache Spark ,Computer science ,business.industry ,Node (networking) ,Big data ,k-means clustering ,computer.software_genre ,performance prediction ,Computer Science Applications ,Management Information Systems ,Scheduling (computing) ,execution time prediction ,modelling ,big data processing ,Artificial Intelligence ,Spark (mathematics) ,Performance prediction ,Graph (abstract data type) ,Data mining ,business ,computer ,Information Systems - Abstract
Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and configuring so many parameters is a challenging task when determining the suitable parameters for the system. In this paper, we propose two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allow us to predict the runtime when the size of the job and the number of executors are known. The proposed models were evaluated on five HiBench workloads: Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. Each workload's empirical data were fitted with whichever of the two models met the accuracy requirements. Finally, the experimental findings show that the model can be a handy and helpful tool for scheduling and planning system deployment.
- Published
- 2021
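The kind of two-parameter runtime model this abstract describes, a fixed non-parallelisable cost plus a term scaling with job size per executor, can be sketched with an ordinary least-squares fit. The model form and the sample data are assumptions for illustration, not the paper's actual equations.

```python
def fit_runtime_model(samples):
    """Ordinary least-squares fit of t = a + b * (job_size / executors):
    `a` approximates the non-parallelisable runtime, `b` the cost per unit
    of parallel work. Model form and data are invented for this sketch."""
    xs = [size / execs for size, execs, _ in samples]
    ts = [t for _, _, t in samples]
    n = len(xs)
    mx, mt = sum(xs) / n, sum(ts) / n
    num = sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
    den = sum((x - mx) ** 2 for x in xs)
    b = num / den
    a = mt - b * mx
    return a, b

# (job size in GB, executors, measured runtime in seconds) -- invented
samples = [(10, 1, 110.0), (10, 2, 60.0), (20, 2, 110.0), (20, 4, 60.0)]
a, b = fit_runtime_model(samples)
print(a, b)                # ~10 s fixed cost, ~10 s per GB-per-executor
print(a + b * (40 / 8))    # → 60.0, predicted runtime for 40 GB on 8 executors
```

Once fitted, such a model lets an operator predict runtime for unseen job-size/executor combinations, which is exactly the scheduling use the paper targets.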
13. Beamer
- Author
-
Nifei Bi, Xiansen Chen, Aoying Zhou, and Chen Xu
- Subjects
Big data processing ,Batch training ,business.industry ,Computer science ,Deep learning ,Training (meteorology) ,Inference ,Machine learning ,computer.software_genre ,Stream processing ,End-to-end principle ,Spark (mathematics) ,Artificial intelligence ,business ,computer - Abstract
Deep learning has made extraordinary progress in the last few years, focusing on improving the accuracy and speed of standard deep learning benchmarks. Nevertheless, datasets in production environments are often messy, which makes data cleaning crucial for DNN model training and inference. Existing solutions combine big data processing systems and deep learning systems to accomplish data cleaning, DNN model training, and inference, but they are internally tied to either Spark or Flink. However, Spark and Flink usually show different performance under batch and stream processing workloads. In order to employ Spark in batch training and Flink in streaming inference, existing solutions incur the burden of maintaining two data cleaning programs. In this demonstration, we showcase Beamer: an end-to-end deep learning framework that unifies the data cleaning program when employing Spark in training and Flink in inference, respectively.
- Published
- 2021
14. Fog computing framework for Big Data processing using cluster management in a resource-constraint environment
- Author
-
Srinivasa Raju Rudraraju, Atul Negi, and Nagender Kumar Suryadevara
- Subjects
Big data processing ,Database ,business.industry ,Computer science ,Big data ,computer.software_genre ,Fog computing ,Loan ,Distributed data store ,Spark (mathematics) ,Cluster (physics) ,business ,computer ,Credit risk - Abstract
This article presents the implementation details of the distributed storage and processing of big datasets in a fog computing cluster environment. The implementation details of a fog computing framework using Apache Spark for big data applications in a resource-constrained environment are given. Results related to Big Data processing, modeling, and prediction in a resource-constrained fog computing framework are presented through case studies using an e-commerce customer dataset and bank loan credit risk datasets.
- Published
- 2021
15. Machine Learning Models in Smart Cities – Data-Driven Perspective
- Author
-
Federica Foiadelli, Seyed Mahdi Miraftabzadeh, and Michela Longo
- Subjects
Big data processing ,Computer science ,business.industry ,Perspective (graphical) ,Machine learning ,computer.software_genre ,Regression ,Data-driven ,ComputingMethodologies_PATTERNRECOGNITION ,Artificial intelligence ,Regression algorithm ,business ,computer ,Supervised training - Abstract
Machine learning models can detect hidden trends and patterns in datasets, leading to valuable insights and helping data-driven decision-makers improve overall performance. It is also possible to scale up machine learning algorithms by parallelizing them and using big data processing concurrently. The validation dataset is usually 20-25% of the original dataset, although for a massive dataset it may be as little as 1% of the original size. The measures and criteria for calculating error in classification and regression algorithms are discussed in the following. In some literature, the validation dataset and testing dataset are considered the same, and the original dataset is not divided into three parts to build a machine learning model. In this chapter, the most common supervised learning algorithms, for classification and regression, are introduced.
- Published
- 2021
16. Fraud Detection on Streaming Customer Behavior Data with Unsupervised Learning Methods
- Author
-
Mehmet S. Aktas, Efsa Cakir, Alperen Mollaoglu, and Gursel Baltaoglu
- Subjects
Big data processing ,business.industry ,Computer science ,Usability ,Machine learning ,computer.software_genre ,Usage data ,Data modeling ,Set (abstract data type) ,Scalability ,Unsupervised learning ,Artificial intelligence ,business ,computer ,Consumer behaviour - Abstract
In today's telecom industry, fraud detection is a significant research problem. As part of this research, we propose a methodology for detecting fraud cases using machine learning models on big data processing and analysis platforms, based on customer data in telecommunications. A prototype implementation of the proposed methodology has been designed, developed, and applied to the usage dataset of a telecommunications company's subscribers. We perform performance tests on the developed prototype application to understand how successful the proposed methodology is and how well it scales. The obtained results demonstrate the usability of the proposed method in the telecommunications sector.
- Published
- 2021
17. Utilization of Data Mining Methods in Manufacturing Industry
- Author
-
Eva Tyleckova and Darja Noskievičová
- Subjects
Big data processing ,Current age ,Computer science ,business.industry ,Process (engineering) ,media_common.quotation_subject ,computer.software_genre ,Manufacturing ,Process control ,Production (economics) ,Quality (business) ,Data mining ,business ,computer ,media_common - Abstract
The paper presents data mining as a suitable tool for analyzing data from industrial processes. Data mining methods offer a wide range of uses in the current age of digitalization and of big data processing and analysis. Apart from discovering patterns, detecting relationships between individual characteristics, assuring product quality, and predicting and optimizing process performance, data mining techniques also contribute to the transition from a reactive to a predictive approach to problem solving. The first part of the paper presents the possibilities of utilizing data mining methods and techniques to analyze data from industrial processes. The second part deals with the selection of a proper data mining method and its practical application to data from the manufacturing industry.
- Published
- 2021
18. Expert System for the Water System Diagnostics
- Author
-
Ferdenant A. Mkrtchyan, Vladimir F. Krapivin, Costica Nitu, and Anda Sabena Dobrescu
- Subjects
Big data processing ,business.industry ,Computer science ,Big data ,Data mining ,business ,computer.software_genre ,computer ,Expert system - Abstract
A new information-modeling instrumental approach is proposed to realize the diagnostics of hydrological and hydrochemical systems and processes. The decision-making procedure is based on big data processing algorithms. Real environmental problems are considered as examples of using this approach.
- Published
- 2021
19. Efficient Distributed Database Clustering Algorithm for Big Data Processing
- Author
-
Liantian Li
- Subjects
Big data processing ,Distributed database ,Mean squared error ,Computer science ,business.industry ,Big data ,computer.software_genre ,Automation ,Consistency (database systems) ,ComputingMethodologies_PATTERNRECOGNITION ,Data mining ,business ,Cluster analysis ,computer ,Computer Science::Databases ,Eigenvalues and eigenvectors - Abstract
When clustering an efficient distributed database, conventional algorithms have a long time cost and low clustering accuracy. To solve these problems, an efficient distributed database clustering algorithm for big data processing is designed. The eigenvalues of the database are calculated, and efficient distributed databases with similar characteristics are linked. A cross-correlation matrix is used to ensure the consistency of cluster labels. To improve the performance of the K-means algorithm, the database to be clustered is taken as input, k clustering centers are output, and the clustering groups are divided. The database is mapped to the clustering centers, clustering low-dimensional big data. Experimental results show that the proposed algorithm can reduce the running time and mean square error of data clustering and improve the efficiency and accuracy of clustering.
- Published
- 2021
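The K-means core that the proposed algorithm builds on can be sketched with plain Lloyd iterations. The eigenvalue linking and label-consistency steps described above are omitted, and the 2-D data and naive initialisation are invented for this sketch.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means with naive deterministic initialisation:
    assign each point to its nearest center, then recompute centers
    as cluster means, and repeat."""
    centers = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(points, 2))  # two centers, near (0.33, 0.33) and (10.33, 10.33)
```

In a distributed setting, the assignment step parallelises naturally across partitions, while recomputing centers requires an aggregation step, which is where cluster-label consistency becomes an issue.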
20. A Novel Hybrid Sampling Algorithm for Solving Class Imbalance Problem in Big Data
- Author
-
Anuradha Chug, Amit Prakash Singh, and Khyati Ahlawat
- Subjects
Big data processing ,Computer science ,business.industry ,Big data ,Sampling (statistics) ,Machine learning ,computer.software_genre ,Class (biology) ,Majority class ,Class imbalance ,General Earth and Planetary Sciences ,Artificial intelligence ,business ,Cluster analysis ,computer ,Classifier (UML) ,General Environmental Science - Abstract
The uneven distribution of classes in any dataset introduces a bias toward the majority class when the data are analyzed with any standard classifier. The instances of the significant class, being deficient in number, are generally ignored, and their correct classification, which is of paramount interest, is often overlooked when calculating overall accuracy. Therefore, conventional machine learning approaches are rigorously refined to address this class imbalance problem. The challenge of imbalanced classes is more prevalent in big data scenarios due to their high volume. This study presents a sampling solution based on cluster computing for handling class imbalance problems in big data. The newly proposed hybrid sampling algorithm (HSA) is assessed using three popular classification algorithms, namely support vector machine, decision tree, and k-nearest neighbor, based on balanced accuracy and elapsed time. The results obtained from the experiment are promising, with an efficiency gain of 42% in comparison to the traditional sampling solution, the synthetic minority oversampling technique (SMOTE). This work proves the effectiveness of the distribution and clustering principle in imbalanced big data scenarios.
- Published
- 2021
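The general idea behind hybrid sampling, undersampling the majority class while oversampling the minority class toward a common size, can be sketched as follows. The midpoint target size and plain random resampling are invented simplifications: the actual HSA operates over clusters, and SMOTE would synthesise new minority points rather than repeat existing ones.

```python
import random

def hybrid_sample(majority, minority, seed=0):
    """Balance two classes by undersampling the majority and oversampling
    the minority (with replacement) toward a common midpoint size.
    The midpoint rule is an invented simplification."""
    rng = random.Random(seed)
    target = (len(majority) + len(minority)) // 2
    maj = rng.sample(majority, target)
    mino = minority + [rng.choice(minority) for _ in range(target - len(minority))]
    return maj, mino

majority = list(range(100))       # 100 majority-class instances
minority = list(range(100, 110))  # 10 minority-class instances
maj, mino = hybrid_sample(majority, minority)
print(len(maj), len(mino))        # → 55 55
```

Balancing both classes at once is what lets metrics such as balanced accuracy, used in the evaluation above, reflect minority-class performance fairly.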
21. Big Data Processing Platform on Intelligent Transportation Systems
- Author
-
Saida El Mendili
- Subjects
Big data processing ,Database ,Computer science ,Computer Science (miscellaneous) ,Electrical and Electronic Engineering ,computer.software_genre ,computer ,Intelligent transportation system - Published
- 2019
22. An Efficient Scheme of Big Data Processing by Hierarchically Distributed Data Matrix
- Author
-
Ch. Mallikarjuna Rao and G. Sirichandana Reddy
- Subjects
Big data processing ,Scheme (programming language) ,Computer science ,computer ,Data matrix (multivariate statistics) ,Computational science ,computer.programming_language - Published
- 2019
23. LADRA: Log-based abnormal task detection and root-cause analysis in big data processing with Spark
- Author
-
Long Wang, Byung Chul Tak, Liqiang Wang, Xiang Wei, Bingbing Rao, and Siyang Lu
- Subjects
Big data processing ,Computer Networks and Communications ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,computer.software_genre ,Hardware and Architecture ,Spark (mathematics) ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,020201 artificial intelligence & image processing ,Data mining ,Root cause analysis ,computer ,Software - Abstract
As big data processing is widely adopted across many domains, massive amounts of generated data have become reliant on parallel computing platforms for analysis, with Spark being one of the most widely used frameworks. Spark's abnormal tasks may cause significant performance degradation, and it is extremely challenging to detect and diagnose the root causes. To that end, we propose an innovative tool, named LADRA, for log-based abnormal task detection and root-cause analysis using Spark logs. In LADRA, a log parser first converts raw log files into structured data and extracts features. Then, a detection method is proposed to detect where and when abnormal tasks happen. In order to analyze root causes, we further extract pre-defined factors based on these features. Finally, we leverage a General Regression Neural Network (GRNN) to identify root causes of abnormal tasks. The likelihoods of the reported root causes are presented to users according to the factors weighted by the GRNN. LADRA is an off-line tool that can accurately analyze abnormality without extra monitoring overhead. Four potential root causes, i.e., CPU, memory, network, and disk I/O, are considered. We have tested LADRA on top of three Spark benchmarks by injecting the aforementioned root causes. Experimental results show that our proposed approach is more accurate in root cause analysis than other existing methods.
- Published
- 2019
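The detection step, parsing task durations out of raw logs and flagging outliers, can be sketched as below. The log-line format and the 3x-median rule are invented for illustration; LADRA's real pipeline extracts many more features and ranks root causes with a GRNN.

```python
import re
import statistics

# Invented log-line format for the sketch; real Spark logs differ.
LOG_LINE = re.compile(r"Finished task (\S+) in (\d+) ms")

def abnormal_tasks(log, factor=3.0):
    """Flag tasks whose duration exceeds `factor` times the median
    duration across all parsed tasks (invented rule)."""
    durations = {m.group(1): int(m.group(2)) for m in LOG_LINE.finditer(log)}
    median = statistics.median(durations.values())
    return sorted(t for t, d in durations.items() if d > factor * median)

log = """\
INFO Executor: Finished task 0.0 in 812 ms
INFO Executor: Finished task 1.0 in 790 ms
INFO Executor: Finished task 2.0 in 9321 ms
INFO Executor: Finished task 3.0 in 805 ms
"""
print(abnormal_tasks(log))  # → ['2.0']
```

Because everything is computed from logs already written, this style of detection adds no monitoring overhead, the property the abstract emphasizes for LADRA.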
24. Gene Sequences Parallel Alignment Model Based on Multiple Inputs and Outputs
- Author
-
Xiaolong Feng and Jing Gao
- Subjects
Big data processing ,Data processing ,Computer Networks and Communications ,Computer science ,business.industry ,Big data ,computer.software_genre ,Computer Science Applications ,Workflow ,Computational Theory and Mathematics ,Data mining ,Gene sequence ,business ,computer - Abstract
Bioinformatics computing is a kind of big data processing problem, usually characterized by large data scale, heavy computational load, and long computation time. Therefore, the use of big data technology in bioinformatics computing has gradually become a research hotspot, and using Hadoop for gene sequence alignment is one example. In the field of biocomputing, it is common to combine various tools to complete a job. In most studies of parallel alignment of gene sequences using Hadoop, third-party tools are also needed; there are few methods that use Hadoop alone to complete gene sequence alignment. Adding data processing with other tools to a Hadoop workflow not only limits the improvement of computing performance but also complicates the application. In this paper, a parallel alignment model of gene sequences based on multiple inputs and outputs is proposed, which can independently complete parallel alignment of gene sequences on the Hadoop platform without other tools. This model not only simplifies the process flow of gene sequence alignment but also improves performance compared with other methods. The paper describes in detail the method of manipulating gene sequences with multiple input and output modes on the Hadoop platform and the design of a computing model based on this method, and it demonstrates the superiority of this model through experiments.
- Published
- 2019
25. Big data Processing Comparison using Pig and Hive
- Author
-
J. Santosh Kumar, S. Raghavendra, and B. K. Raghavendra
- Subjects
Big data processing ,Database ,Computer science ,computer.software_genre ,computer - Published
- 2019
26. Predictive Modeling of Pavement Damage Using Machine Learning and Big Data Processing
- Author
-
Dowan Kim, Jinho Jeon, and Damryung Kim
- Subjects
Big data processing ,Computer science ,business.industry ,Artificial intelligence ,Machine learning ,computer.software_genre ,business ,computer - Published
- 2019
27. Improving the Performance of Manufacturing Technologies for Advanced Material Processing Using a Big Data and Machine Learning Framework
- Author
-
Igor Kotenko, Alexander Branitskiy, and Igor Saenko
- Subjects
010302 applied physics ,Big data processing ,Production line ,Majority rule ,Materials processing ,business.industry ,Computer science ,Knowledge processing ,Big data ,02 engineering and technology ,Thread (computing) ,021001 nanoscience & nanotechnology ,Machine learning ,computer.software_genre ,01 natural sciences ,0103 physical sciences ,Artificial intelligence ,0210 nano-technology ,business ,computer ,Classifier (UML) - Abstract
The paper offers a new approach to improving the performance of materials knowledge analysis based on Big Data processing and machine learning. We consider a framework in which five machine learning mechanisms for solving the classification problem run as parallel threads. The classifiers' outputs are combined by majority voting. The performance and accuracy of the framework are experimentally assessed on a data set containing technological data from a production line. The assessment showed that the proposed framework speeds up materials knowledge processing by a factor of 7.4.
- Published
- 2019
28. Big Data Processing Technologies in Distributed Information Systems
- Author
-
Nataliya Boyko, Eleonora Benova, Yevgen Zasoba, and Nataliya Shakhovska
- Subjects
Big data processing ,Class (computer programming) ,Database ,business.industry ,Computer science ,Big data ,020206 networking & telecommunications ,Unstructured data ,02 engineering and technology ,computer.software_genre ,Upload ,0202 electrical engineering, electronic engineering, information engineering ,Information system ,General Earth and Planetary Sciences ,Table (database) ,020201 artificial intelligence & image processing ,Line (text file) ,business ,computer ,General Environmental Science - Abstract
An analysis of Big Data technologies is provided, with an example of applying the MapReduce paradigm: uploading large volumes of data, processing and analyzing unstructured information, and distributing it into a clustered database. The article summarizes the concept of "big data" and gives examples of methods for working with arrays of unstructured data. A parallel system based on Resilient Distributed Datasets (RDD) is organized, and a class implementing the basic database operations is realized: database connection, table creation, getting a row by id, returning all elements of the database, and updating, deleting, and creating a row.
- Published
- 2019
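The basic database operations listed in the abstract (connect, create a table, get a row by id, return all rows, update, delete, create) map onto a small CRUD interface. A minimal in-memory Python sketch, purely illustrative and unrelated to the authors' actual clustered implementation, might look like:

```python
class SimpleTable:
    """In-memory sketch of the basic operations the paper lists; names are illustrative."""
    def __init__(self):
        self.rows = {}      # id -> row dict
        self.next_id = 1

    def create(self, **fields):
        """Insert a row and return its generated id."""
        row_id = self.next_id
        self.rows[row_id] = dict(fields)
        self.next_id += 1
        return row_id

    def get(self, row_id):
        """Return the row with the given id, or None."""
        return self.rows.get(row_id)

    def all(self):
        """Return all rows in the table."""
        return list(self.rows.values())

    def update(self, row_id, **fields):
        """Merge new field values into an existing row."""
        if row_id in self.rows:
            self.rows[row_id].update(fields)
            return True
        return False

    def delete(self, row_id):
        """Remove a row; report whether it existed."""
        return self.rows.pop(row_id, None) is not None

t = SimpleTable()
rid = t.create(name="sensor-1", value=42)
t.update(rid, value=43)
assert t.get(rid)["value"] == 43
```

In the distributed setting described in the abstract, each of these operations would be backed by RDD transformations or database driver calls rather than a local dict.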
29. A Framework for Improving the Location-Based Service Using Casandra Technology
- Author
-
Euiin Choi, B. Temuujin, and Jaewon Park
- Subjects
Big data processing ,Process (engineering) ,business.industry ,Computer science ,Distributed computing ,Big data ,NoSQL ,computer.software_genre ,Positioning technology ,Streaming data ,Location-based service ,business ,computer ,Wearable technology - Abstract
Recently, with the development of wearable devices, much research has been conducted on positioning technology for LBS (Location-Based Services). The data produced by these devices makes it possible to perform LBS using Big Data technology, but the existing method of finding a specific location is not suitable for collecting and processing all of this data [1] [2] [3] [4]. Therefore, in order to process all streaming data in real time, Big Data processing technology is required. In this paper, we use NoSQL technology to solve this problem and propose a framework for improving the performance of LBS.
- Published
- 2019
30. Resource Prediction for Big Data Processing in a Cloud Data Center : A Machine Learning Approach
- Author
-
Alanazi Rayan and Nah Yun Mook
- Subjects
Big data processing ,Resource (project management) ,Database ,Computer science ,Signal Processing ,Electrical and Electronic Engineering ,Time series ,computer.software_genre ,computer ,Cloud data center - Published
- 2018
31. Development of Privacy Preserving Clustering Process with Cost Minimization for Big Data Processing
- Author
-
R. Bharanidharan and S. Chitra
- Subjects
Privacy preserving ,Big data processing ,Development (topology) ,Computer science ,Process (computing) ,Data mining ,Minification ,Cluster analysis ,computer.software_genre ,computer - Published
- 2018
32. Deploying Apache Spark virtual clusters in cloud environments using orchestration technologies
- Author
-
O. . Borisenko, R. . Pastukhov, and S. . Kuznetsov
- Subjects
Big data processing ,Swift ,Service (systems architecture) ,Engineering ,amazon ec2 ,Big data ,Cloud computing ,computer.software_genre ,lcsh:QA75.5-76.95 ,hdfs ,apache ignite ,big data ,Spark (mathematics) ,Orchestration (computing) ,virtual clusters ,General Environmental Science ,computer.programming_language ,Database ,business.industry ,Service model ,openstack ,cloud computing ,map-reduce ,Operating system ,General Earth and Planetary Sciences ,apache spark ,lcsh:Electronic computers. Computer science ,business ,computer
Apache Spark is a framework providing fast computations on Big Data using the MapReduce model. With cloud environments, Big Data processing becomes more flexible, since they allow virtual clusters to be created on demand. One of the most powerful open-source cloud environments is Openstack. The main goal of this project is to provide the ability to create virtual clusters with Apache Spark and other Big Data tools in Openstack. There are three approaches to doing this. The first is to use the Openstack REST APIs to create instances and then deploy the environment. This approach is used by the Apache Spark core team to create clusters in the proprietary Amazon EC2 cloud, and almost the same method has been implemented for Openstack environments. However, since the Openstack API changes frequently, this solution has been deprecated since the Kilo release. The second approach is to integrate virtual cluster creation as a built-in Openstack service. ISP RAS has provided several patches implementing a universal Spark job engine for Openstack Sahara, along with Openstack Swift integration for Apache Spark as a drop-in replacement for Apache Hadoop. This approach allows Spark clusters to be used as a service in the PaaS service model. Since Openstack releases are less frequent than Apache Spark releases, this approach may be inconvenient for developers who use the latest releases. The third solution implemented uses Ansible for orchestration. We implement the solution in a loosely coupled way and provide the ability to add any auxiliary tool or even to use another cloud environment. It also allows any Apache Spark and Apache Hadoop versions to be chosen for deployment in virtual clusters. All the listed approaches are available under the Apache 2.0 license.
- Published
- 2018
33. WITHDRAWN: Advanced machine learning based approach for prediction of skin cancer
- Author
-
Vyshnavi Pogaku, Goutham Raju k, Ravisankar Malladi, and Prashanthi Vempaty
- Subjects
010302 applied physics ,Big data processing ,Data collection ,Disease detection ,Computer science ,business.industry ,Decision tree ,Volume (computing) ,02 engineering and technology ,021001 nanoscience & nanotechnology ,Machine learning ,computer.software_genre ,01 natural sciences ,Random forest ,Data set ,Naive Bayes classifier ,0103 physical sciences ,Artificial intelligence ,0210 nano-technology ,business ,computer - Abstract
At present, the healthcare sector is one of the leading areas in which technologies and data are advancing rapidly. The vast volume of medical data is impossible to manage manually; big data processing makes managing these data feasible. There are numerous medical methods worldwide for many diseases, and machine learning is an emerging solution for disease detection and prediction. This document explains the diagnosis of disease through machine learning based on symptoms. Algorithms such as Naive Bayes, Decision Tree, and Random Forest are applied to a given data set to predict disease. The implementation uses the Python programming language. The analysis determines the accuracy of each algorithm on the data set and identifies the most suitable one.
- Published
- 2021
34. Design and Implementation a Secured and Distributed System using CBC, Socket, and RMI Technologies
- Author
-
Ban Quy Tran, Thang Duc Phung, and Thai Van Nguyen
- Subjects
Big data processing ,business.industry ,Computer science ,Computation ,Advanced Encryption Standard ,Word count ,computer.software_genre ,Node (computer science) ,Operating system ,Table (database) ,business ,computer ,Word (computer architecture) ,Block (data storage) - Abstract
This project aims to implement a tool that supports applications following the map-reduce principle together with the AES encryption algorithm. The map-reduce model is used in big data processing to speed up the treatment of large data sets (large files), while AES is used to secure the data transmitted over the network. The principle is to split a large file into blocks that are distributed over several nodes. An application (which we call Map) is executed on each of these blocks in parallel. All the resulting files are then gathered on the original node, where another application (which we call Reduce) is executed to compute the final result (one file). The typical application we use is word count, which counts the occurrences of each word in a big file: each Map counts the words in its block, producing a table of (word, count) pairs, and the Reduce aggregates these tables to obtain the final result.
- Published
- 2021
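The block-wise word count described above can be sketched in a few lines of plain Python. This shows only the Map/Reduce split, with block distribution simulated by a list of word slices; the AES encryption of blocks in transit, which the project adds, is omitted here:

```python
from collections import Counter

def split_blocks(text, n_blocks):
    """Split the input into word-aligned blocks, one per (simulated) node."""
    words = text.split()
    size = max(1, len(words) // n_blocks)
    return [words[i:i + size] for i in range(0, len(words), size)]

def map_count(block):
    """Map: count word occurrences inside one block."""
    return Counter(block)

def reduce_counts(partials):
    """Reduce: aggregate the per-block tables into the final (word, count) table."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

text = "to be or not to be"
result = reduce_counts(map_count(b) for b in split_blocks(text, 3))
print(result["to"])  # 2
```

In the distributed version, `split_blocks` corresponds to file splitting across nodes and `reduce_counts` runs on the original node after the partial tables are sent back.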
35. Enhancing Random Forest Classification with NLP in DAMEH: A system for DAta Management in EHealth Domain
- Author
-
Luigi Coppolino, Giovanni Mazzeo, Flora Amato, Roberto Nardone, Giovanni Cozzolino, and Francesco Moscato
- Subjects
0209 industrial biotechnology ,Computer science ,Big data processing ,Cognitive Neuroscience ,Data management ,Wearable computer ,02 engineering and technology ,computer.software_genre ,Field (computer science) ,020901 industrial engineering & automation ,Artificial Intelligence ,Machine learning ,0202 electrical engineering, electronic engineering, information engineering ,eHealth ,E-health ,Multi-classification schema ,Random forests ,Data collection ,business.industry ,Computer Science Applications ,Random forest ,Statistical classification ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Natural language ,Natural language processing - Abstract
The use of pervasive IoT devices in Smart Cities has increased the volume of data produced in many fields. Interesting and very useful applications are growing in number in the e-health domain, where smart devices manage huge amounts of data in highly distributed environments in order to provide smart services that collect data to fill patients' medical records. The problem is to gather the data, produce records, and analyze medical records depending on their contents. Since data gathering involves very different devices (not only wearable medical sensors, but also environmental smart devices such as weather, pollution, and other sensors), it is very difficult to classify data by content in order to enable better management of patients. Data from smart devices is coupled with medical records written in natural language. We describe here an architecture that determines the best features for classification from the existing medical records. The architecture is based on a pre-filtering phase using Natural Language Processing, which enhances machine-learning classification based on Random Forests. We carried out experiments on about 5000 medical records from real (anonymized) case studies from various health-care organizations in Italy, and we show the accuracy of the presented approach in terms of Accuracy-Rejection curves.
- Published
- 2021
36. A Methodological Approach to the Real-Time Data Analysis from the ViaTOLL System
- Author
-
Hanna Vasiutina, Andrzej Szarata, and Vitalii Naumov
- Subjects
Big data processing ,Data processing ,biology ,Computer science ,business.industry ,Big data ,Monitoring system ,computer.software_genre ,Toll ,Key (cryptography) ,biology.protein ,Real-time data ,Data mining ,business ,computer - Abstract
Measuring traffic parameters may be considered one of the key operations in traffic management. With contemporary monitoring systems, this issue has transformed from a problem of gathering data into one of big data processing. In this chapter, we propose a methodological approach to processing and analysing the real-time flow of information from the traffic toll system. A core class library developed in the Python programming language implements the data-processing methodology. The results of analysing data from the viaTOLL system (Poland) illustrate the use of the proposed approach.
- Published
- 2021
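The kind of core class the chapter describes, consuming a real-time stream of toll records and accumulating traffic parameters, can be sketched as follows. The record fields (`gantry`, `speed`) are assumptions for illustration, not the viaTOLL schema:

```python
from collections import defaultdict

class FlowAggregator:
    """Illustrative sketch of a streaming aggregator for toll records;
    field names and units are assumptions, not the authors' library."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.speed_sums = defaultdict(float)

    def consume(self, record):
        """record: dict with a 'gantry' id and a measured 'speed' in km/h."""
        g = record["gantry"]
        self.counts[g] += 1
        self.speed_sums[g] += record["speed"]

    def mean_speed(self, gantry):
        """Running mean speed at a gantry, or None if nothing was observed."""
        n = self.counts[gantry]
        return self.speed_sums[gantry] / n if n else None

agg = FlowAggregator()
for rec in [{"gantry": "A2-17", "speed": 92.0}, {"gantry": "A2-17", "speed": 88.0}]:
    agg.consume(rec)
print(agg.mean_speed("A2-17"))  # 90.0
```

A real pipeline would feed `consume` from the live toll stream and expose rolling windows rather than global means, but the incremental-update structure is the same.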
37. Triboelectric nanogenerator based self-powered sensor for artificial intelligence
- Author
-
Lijie Li, Yan Zhang, Maoliang Shen, Xin Cui, Yicheng Shao, and Yuankai Zhou
- Subjects
Big data processing ,Flexibility (engineering) ,Materials science ,Renewable Energy, Sustainability and the Environment ,business.industry ,Nanogenerator ,Cloud computing ,02 engineering and technology ,Document management system ,010402 general chemistry ,021001 nanoscience & nanotechnology ,computer.software_genre ,01 natural sciences ,Critical infrastructure ,0104 chemical sciences ,General Materials Science ,Artificial intelligence ,Electrical and Electronic Engineering ,0210 nano-technology ,business ,Sensing system ,computer ,Triboelectric effect - Abstract
Triboelectric nanogenerator based sensors have excellent material compatibility, low cost, and flexibility, making them a unique candidate technology for artificial intelligence. Triboelectric nanogenerators effectively provide critical infrastructure for a new generation of sensing systems that collect information through large numbers of self-powered sensors. This review mainly discusses the capability and prospects of triboelectric nanogenerators applied to intelligent sports, security, touch control, and document management systems. These fields have paid increasing attention to artificial intelligence technologies such as machine learning, big data processing, and cloud computing, which demand huge numbers of sensors and complicated sensor networks.
- Published
- 2021
38. Multi-objective evolutionary based feature selection supported by distributed multi-label classification and deep learning on image/video data
- Author
-
Gizem Nur Karagoz
- Subjects
Feature engineering ,Multi-label classification ,distributed machine learning ,Computer science ,business.industry ,Dimensionality reduction ,Deep learning ,feature extraction ,Big data ,Feature extraction ,Feature selection ,Machine learning ,computer.software_genre ,Statistical classification ,big data processing ,feature engineering ,Artificial intelligence ,business ,computer ,dimensionality reduction - Abstract
We live in an era in which a myriad of computer systems produce immense amounts of (raw) data every day. This big data must be processed efficiently to extract valuable and hidden knowledge. Complex processing pipelines need to be designed to filter out irrelevant data, and efficient data mining and machine learning methods must be used to discover useful correlations in the big data. The purpose of this PhD research is the implementation of multi-objective evolutionary-based dimensionality reduction on high volumes of image/video data, with the support of distributed multi-label classification algorithms.
- Published
- 2021
39. Standards and EO Data Platforms
- Author
-
Karel Charvát and Ingo Simonis
- Subjects
Flexibility (engineering) ,Big data processing ,Earth observation ,Geospatial analysis ,Emergency management ,Computer science ,business.industry ,Cloud processing ,Big data ,computer.software_genre ,Data science ,Domain (software engineering) ,13. Climate action ,8. Economic growth ,business ,computer - Abstract
In the digital bio-economy, as in many other sectors, standards play an important role. By "standards" we refer here to the protocols that define data and data exchange so as to enable the digital exchange of data between devices. This chapter evaluates how Big Data, cloud processing, and app stores together form a new market that allows exploiting the full potential of geospatial data. It focuses on the essential cornerstones that help make Big Data processing a more seamless experience for bioeconomy data. The described approach is domain-independent and can thus be applied to agriculture, fisheries, and forestry as well as to earth observation sciences, climate change research, or disaster management. This flexibility is essential when addressing real-world complexities in any domain, as no single domain has sufficient data available within its own limits to tackle the major research challenges our world is facing.
- Published
- 2021
40. Innovative astronomical applications with a new-generation relational database
- Author
-
Yuki Okura, Hisanori Furusawa, Makoto Onizuka, Shohei Aoyama, Takafumi Ootsubo, Tadafumi Takata, Junko Furusawa, and Yoshihiko Yamada
- Subjects
Big data processing ,Test bench ,Exploit ,Database ,business.industry ,Relational database ,Computer science ,Big data ,InformationSystems_DATABASEMANAGEMENT ,computer.software_genre ,Astronomical catalog ,Relational database management system ,business ,computer ,Database transaction - Abstract
Database technology has been developing to exploit next-generation hardware in the era of big data processing. At the same time, astronomical data sizes have been steadily increasing, and astronomical source catalogs obtained from large-scale surveys with a wide-field camera, such as Subaru/Hyper Suprime-Cam (HSC), are a good test bench for evaluating new database technology with a large data set. Such archive systems often employ a highly versatile relational database management system (RDBMS), but reducing the time required for data transactions and complex analysis has become an important challenge. To tackle this difficulty, we aim to develop astronomical applications with a new catalog database using a next-generation RDBMS technology, whose query engine is designed to use computing infrastructures efficiently for processing big data. Demonstrations with science applications are essential to evaluate the new database, so we verify query performance with the current HSC source catalog. For application to huge astronomical catalog databases, we are pursuing and verifying the capabilities of new database technologies, which will in turn enable fast ad hoc search and efficient detection of a wide range of variable events. Our pilot tests using typical astronomical queries on a cluster system show significant improvements in response times with the aid of distributed query engines. We report the performance of the test database for typical astronomical queries and discuss optimizing the schema based on query workloads.
- Published
- 2020
41. A Parallel Computing Approach to Spatial Neighboring Analysis of Large Amounts of Terrain Data Using Spark
- Author
-
Kai Zheng, Jianbo Zhang, and Zhuangzhuang Ye
- Subjects
010504 meteorology & atmospheric sciences ,Computer science ,Big data ,0211 other engineering and technologies ,Terrain ,02 engineering and technology ,Parallel computing ,lcsh:Chemical technology ,01 natural sciences ,Biochemistry ,Article ,Analytical Chemistry ,Raster data ,spatial neighboring analysis ,Spark (mathematics) ,lcsh:TP1-1185 ,Electrical and Electronic Engineering ,Distributed File System ,Instrumentation ,Spatial analysis ,021101 geological & geomatics engineering ,0105 earth and related environmental sciences ,Spark ,business.industry ,parallel computing ,computer.file_format ,Atomic and Molecular Physics, and Optics ,big data processing ,Scalability ,Raster graphics ,business ,computer - Abstract
Spatial neighboring analysis is an indispensable part of geo-raster spatial analysis. In the big data era, high-resolution raster data offer us abundant and valuable information, and also bring enormous computational challenges to the existing focal statistics algorithms. Simply employing the in-memory computing framework Spark to serve such applications might incur performance issues due to its lack of native support for spatial data. In this article, we present a Spark-based parallel computing approach for the focal algorithms of neighboring analysis. This approach implements efficient manipulation of large amounts of terrain data through three steps: (1) partitioning a raster digital elevation model (DEM) file into multiple square tile files by adopting a tile-based multifile storing strategy suitable for the Hadoop Distributed File System (HDFS), (2) performing the quintessential slope algorithm on these tile files using a dynamic calculation window (DCW) computing strategy, and (3) writing back and merging the calculation results into a whole raster file. Experiments with the digital elevation data of Australia show that the proposed computing approach can effectively improve the parallel performance of focal statistics algorithms. The results also show that the approach has almost the same calculation accuracy as that of ArcGIS. The proposed approach also exhibits good scalability when the number of Spark executors in clusters is increased.
- Published
- 2020
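The per-tile slope computation at the heart of the approach above can be illustrated with a simplified central-difference kernel. This is a sketch rather than the exact 3x3 Horn kernel that ArcGIS uses, and in the real pipeline each tile would carry overlap rows from its neighbours so that edge cells can also be computed:

```python
import math

def slope_degrees(dem, cellsize):
    """Central-difference slope (in degrees) for the interior cells of a DEM tile.
    Border cells are left as None; a production pipeline fills them using
    overlap rows exchanged between neighbouring tiles."""
    rows, cols = len(dem), len(dem[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dzdx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cellsize)
            dzdy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cellsize)
            out[r][c] = math.degrees(math.atan(math.hypot(dzdx, dzdy)))
    return out

# A ramp tile rising 1 m per 10 m cell in x: slope = atan(0.1) ~ 5.71 degrees.
tile = [[float(c) for c in range(4)] for _ in range(4)]
print(round(slope_degrees(tile, 10.0)[1][1], 2))  # 5.71
```

In the Spark version, each HDFS tile file becomes one task input, this kernel runs inside a `map`, and the per-tile results are merged back into the whole raster.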
42. The real-time big data processing method based on LSTM or GRU for the smart job shop production process
- Author
-
Chuang Wang, Wenbo Du, Zhixiang Zhu, and Zhifeng Yue
- Subjects
Big data processing ,0209 industrial biotechnology ,Database ,business.industry ,Computer science ,Job shop ,Deep learning ,lcsh:T57-57.97 ,lcsh:Mathematics ,020208 electrical & electronic engineering ,Big data ,02 engineering and technology ,computer.software_genre ,lcsh:QA1-939 ,020901 industrial engineering & automation ,Intelligent sensor ,lcsh:Applied mathematics. Quantitative methods ,0202 electrical engineering, electronic engineering, information engineering ,Production (economics) ,Artificial intelligence ,business ,Internet of Things ,computer - Abstract
With the wide application of intelligent sensors and the Internet of Things (IoT) in the smart job shop, a large amount of real-time production data is collected. Accurate analysis of the collected data can help producers make effective decisions. Compared with traditional data processing methods, artificial intelligence, as the main big data analysis method, is increasingly applied in the manufacturing industry. However, AI models differ in their ability to process real-time data from smart job shop production. Based on this, a real-time big data processing method for the job shop production process based on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks is proposed. The method takes the historical production data extracted from the IoT job shop as the original data set and, after preprocessing, uses LSTM and GRU models to train on and predict the job shop's real-time data. The models are described, implemented, and compared with K-nearest neighbors (KNN), decision tree (DT), and a traditional neural network. The results show that in real-time big data processing of the production process, the LSTM and GRU models outperform the traditional neural network, KNN, and DT. While its performance is similar to that of LSTM, the GRU model requires much less training time.
- Published
- 2020
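Before an LSTM or GRU can train on job shop measurements, the raw time series must be preprocessed into supervised (window, target) samples. A minimal sketch of that step (the window and horizon values below are arbitrary, not the paper's configuration):

```python
def make_windows(series, window, horizon=1):
    """Turn a measurement series into (input, target) pairs for a recurrent
    model: each sample is `window` past values; the target is the value
    `horizon` steps after the window ends."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return X, y

series = [10, 12, 13, 15, 16, 18]
X, y = make_windows(series, window=3)
print(X[0], y[0])  # [10, 12, 13] 15
```

Each `X` row would then be fed as a sequence into the LSTM/GRU layer, with `y` as the regression target for the next measurement.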
43. Big Data Processing for Intrusion Detection System Context: A Review
- Author
-
Ouajdi Korbaa, Basel Solaiman, Marwa Elayni, and Farah Jemili
- Subjects
Big data processing ,Computer science ,business.industry ,Wireless network ,Big data ,Context (language use) ,Intrusion detection system ,Information security ,Computer security ,computer.software_genre ,The Internet ,business ,computer ,Security system - Abstract
The rapid growth of data, the increasing number of network-based applications, and the omnipresence of the internet and connected devices have raised the importance of information security. Hence, a security system such as an Intrusion Detection System (IDS) becomes a fundamental requirement. However, the complexity and huge size of the generated data, together with the variety of cyber-attacks on network traffic, wireless network traffic, worldwide network traffic, connected devices, and 5G communication media, hinder the IDS's efficiency. Dealing with this huge amount of traffic is challenging and requires deploying new big data security solutions. This paper provides an overview of intrusion detection, reviews IDSs that deploy big data technologies, and offers recommendations for further study.
- Published
- 2020
44. Research on an improved algorithm of Apriori based on Hadoop
- Author
-
Hongqin Wang, Lina Yuan, Huiyong Jiang, and Hongxia Wang
- Subjects
Big data processing ,Apriori algorithm ,Association rule learning ,Mobile internet ,Computer science ,business.industry ,Improved algorithm ,Big data ,InformationSystems_DATABASEMANAGEMENT ,computer.software_genre ,Data mining algorithm ,A priori and a posteriori ,Data mining ,business ,computer - Abstract
With the development of the mobile Internet, association rule data mining remains a research hotspot. In this paper, the traditional Apriori algorithm for mining association rules is analyzed: it has low efficiency and poor scalability when dealing with big data, because it scans the database many times and produces a large number of redundant frequent itemsets. Therefore, the Apriori algorithm is improved by parallelizing it with the MapReduce model on the Hadoop platform. Experimental results show that the improved Apriori algorithm has high efficiency and good stability in big data processing. Finally, the algorithm is applied to the mining of student score data, which verifies its effectiveness and can provide services for education management.
- Published
- 2020
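The Apriori structure being parallelized can be sketched in plain Python. In the Hadoop version, the per-transaction candidate counting below is what each map task performs over its input split, with reducers summing the counts per itemset; this single-machine sketch is illustrative, not the paper's implementation:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support):
    """Plain-Python Apriori: level-wise candidate generation and counting."""
    # Frequent 1-itemsets.
    counts = Counter(frozenset([i]) for t in transactions for i in t)
    current = {s for s, c in counts.items() if c >= min_support}
    result = dict((s, counts[s]) for s in current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        counts = Counter()
        for t in transactions:          # on Hadoop: one map task per split
            ts = set(t)
            for cand in candidates:
                if cand <= ts:
                    counts[cand] += 1   # on Hadoop: reducers sum per itemset
        current = {s for s, c in counts.items() if c >= min_support}
        result.update((s, counts[s]) for s in current)
        k += 1
    return result

tx = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
freq = frequent_itemsets(tx, min_support=2)
print(freq[frozenset({"a", "b"})])  # 2
```

The repeated full pass over `transactions` at each level is exactly the cost the MapReduce formulation distributes across the cluster.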
45. Cyber Security Meets Big Knowledge: Towards a Secure HACE Theorem
- Author
-
Bhavani Thuraisingham
- Subjects
Big data processing ,Knowledge graph ,business.industry ,Big data ,business ,Computer security ,computer.software_genre ,computer - Abstract
The HACE theorem has emerged as a way to characterize big data. Over the years it has become as fundamental to big data characterization as Newton's laws are to physics. Associated with the HACE theorem is the Big Data Processing Framework for storing, managing, analyzing, and sharing massive amounts of heterogeneous, autonomous, and distributed data with complex and evolving relationships. This paper examines the security and privacy aspects of the HACE theorem. It argues that what is needed is a Policy-Aware Big Data Processing Framework for the collection, storage, management, mining, and sharing of these massive amounts of data. It also examines knowledge graphs for representing big data and determines ways to reason about the graphs while maintaining security and privacy.
- Published
- 2020
46. Attributes Reduction in Big Data
- Author
-
Khalil Khan, Rehan Ullah Khan, and Waleed Albattah
- Subjects
Big data processing ,Computer science ,Big data ,02 engineering and technology ,Machine learning ,computer.software_genre ,01 natural sciences ,lcsh:Technology ,Reduction (complexity) ,lcsh:Chemistry ,010104 statistics & probability ,Support Vector Machines ,0202 electrical engineering, electronic engineering, information engineering ,content-based filtering ,General Materials Science ,0101 mathematics ,Instrumentation ,lcsh:QH301-705.5 ,Fluid Flow and Transfer Processes ,Data processing ,attributes sampling ,business.industry ,lcsh:T ,Process Chemistry and Technology ,Perspective (graphical) ,General Engineering ,lcsh:QC1-999 ,Computer Science Applications ,Support vector machine ,machine learning ,lcsh:Biology (General) ,lcsh:QD1-999 ,lcsh:TA1-2040 ,Model learning ,Data analysis ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,lcsh:Engineering (General). Civil engineering (General) ,computer ,lcsh:Physics - Abstract
Processing big data requires serious computing resources, so big data processing is a challenge not only for algorithms but also for computing resources. This article analyzes a large amount of data from several points of view, one of which is the processing of reduced collections of big data with fewer computing resources. The study analyzed 40 GB of data to test various strategies for reducing data processing, aiming to reduce the data without compromising detection quality and model learning in machine learning. Several alternatives were analyzed, and it was found that in many cases and settings the data can be reduced to some extent without compromising detection efficiency. Tests with 200 attributes showed that, with a performance loss of only 4%, more than 80% of the data could be ignored. The results of the study thus provide useful insights into big data analytics.
- Published
- 2020
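A sketch of the attribute-reduction idea: randomly keeping a fraction of the attribute columns before training, in the spirit of the study's sampling experiments (the exact sampling scheme used in the article is an assumption here):

```python
import random

def sample_attributes(rows, keep_ratio, seed=0):
    """Randomly keep a fraction of attribute columns.
    Returns the reduced rows and the indices of the kept columns."""
    n_attrs = len(rows[0])
    k = max(1, int(n_attrs * keep_ratio))
    keep = sorted(random.Random(seed).sample(range(n_attrs), k))
    return [[row[i] for i in keep] for row in rows], keep

data = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
reduced, kept = sample_attributes(data, keep_ratio=0.4)
print(len(reduced[0]))  # 2
```

In an experiment like the article's, one would train the same classifier on `data` and on `reduced` and compare accuracies across several `keep_ratio` values.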
47. Bio-inspired technique for improving machine learning speed and big data processing
- Author
-
Andronicus Ayobami Akinyelu
- Subjects
Big data processing ,business.industry ,Computer science ,Ant colony optimization algorithms ,Big data ,Machine learning ,computer.software_genre ,Statistical classification ,Enhanced Data Rates for GSM Evolution ,Artificial intelligence ,Instance selection ,business ,computer ,Selection (genetic algorithm) - Abstract
Big data analytics (BDA) is progressively becoming a popular practice in many organizations because of its potential to discover valuable insights for improved decision-making. Machine learning (ML) algorithms are among the effective tools used for BDA; however, their computational complexity increases with data size. Therefore, this paper introduces a boundary detection and instance selection technique for improving the speed of ML-based big data classification models. The proposed technique (called ACOISA_ML) is inspired by edge selection in ant colony optimization. ACOISA_ML is evaluated with five ML algorithms on ten large- or medium-scale datasets, and the results show that it can reduce the training time of ML algorithms by over 94% without significantly affecting their prediction accuracy. Moreover, it reduced the storage size of big datasets by over 55% in most cases, thus improving the speed of big data processing.
- Published
- 2020
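ACOISA_ML's details are not given in the abstract, but the underlying idea of boundary-based instance selection, keeping instances near the class boundary and discarding interior ones, can be sketched with a simple nearest-enemy test (a plain stand-in, not the paper's ACO-guided method):

```python
import math

def boundary_instances(points, labels):
    """Keep only instances whose nearest neighbour carries a different label,
    i.e. instances that sit on the class boundary."""
    keep = []
    for i, p in enumerate(points):
        nearest, best = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < best:
                best, nearest = d, j
        if labels[nearest] != labels[i]:
            keep.append(i)
    return keep

pts = [(0, 0), (1, 0), (1.9, 0), (3, 0)]
lab = [0, 0, 1, 1]
print(boundary_instances(pts, lab))  # [1, 2]
```

Training a classifier only on the kept indices is what yields the speed-up; the interior instances contribute little to the decision boundary.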
48. Distribution of Indivisible Resources During Big Data Processing
- Author
-
Andriy Kovalenko and Heorhii Kuchuk
- Subjects
Big data processing ,Distribution (number theory) ,Computer science ,Data mining ,computer.software_genre ,computer - Abstract
The method of load balancing for indivisible network resources when parallelizing Big Data processing is considered. A method for finding the optimal partitioning of a processing service into parallel processes is proposed.
- Published
- 2020
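The indivisible-resource distribution problem in entry 48 is essentially multiprocessor scheduling. The authors' optimal-partitioning method is not given in the abstract, so the sketch below shows the classic Longest-Processing-Time-first (LPT) greedy heuristic, a common baseline for balancing indivisible tasks across parallel workers; the name `lpt_schedule` is mine, not the paper's.

```python
import heapq

def lpt_schedule(task_sizes, n_workers):
    """LPT greedy: assign each indivisible task, largest first,
    to the currently least-loaded worker. Returns a list of
    (load, worker_id, assigned_tasks) tuples ordered by worker."""
    # (load, worker_id, task_list); worker_id breaks heap ties
    loads = [(0, w, []) for w in range(n_workers)]
    heapq.heapify(loads)
    for t in sorted(task_sizes, reverse=True):
        load, w, tasks = heapq.heappop(loads)   # least-loaded worker
        tasks.append(t)
        heapq.heappush(loads, (load + t, w, tasks))
    return sorted(loads, key=lambda entry: entry[1])
```

For tasks of sizes 7, 5, 4, 3, 2, 1 on two workers, LPT yields two perfectly balanced loads of 11, illustrating how indivisible units can still be spread evenly when their sizes cooperate.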
49. An Implementation of Genetic Algorithms in Big Data Processing for Medical Data
- Author
-
K. Selvakumar, S. Venkatakrishna, S. Tamilarasan, G. Renukadevi, and Blue Eyes Intelligence Engineering & Sciences Publication (BEIESP)
- Subjects
Big data processing ,Environmental Engineering ,Computer science ,Genetic algorithm ,Sql server ,General Engineering ,Data mining ,computer.software_genre ,computer ,Computer Science Applications ,Big data Processing, Genetic Algorithm, Medical data, REST API, SQL Server, Thermistor, Digital Sphygmomanometer, Node MCU, HTTP Gateway - Abstract
The large volume of real-time medical measurement parameters stored in the SQL server needs to be processed with a suitable algorithm, and the genetic algorithm is one big data processing technique available for medical data. The acquired medical parameters are combined to predict or diagnose disease using the genetic algorithm. In this paper, the genetic algorithm is used to process the medical measurement data. The medical parameters are posted temporarily to a Representational State Transfer (REST) Application Program Interface (API) via the MQTT gateway protocol. The genetic algorithm can then diagnose the disease from the stored parameters. The patient's medical parameters, such as ECG, blood pressure, and skin temperature, are posted frequently to the cloud server for continuous monitoring, and the resulting large volume of data is also processed using the proposed method.
- Published
- 2020
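Entry 49 applies a genetic algorithm to combined medical parameters but does not publish its encoding or fitness function. The following is therefore a generic GA sketch over bit-strings, which could, for instance, encode a subset of parameters such as ECG, blood pressure, and skin temperature to feed a diagnostic model; `genetic_search` and its fitness interface are assumptions for illustration, not the paper's design.

```python
import random

def genetic_search(fitness, n_bits, pop_size=20, generations=50,
                   crossover_p=0.8, mutation_p=0.02, seed=0):
    """Minimal elitist genetic algorithm over bit tuples.
    `fitness` maps a tuple of 0/1 bits to a score to maximise."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_bits))
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = scored[:2]                       # elitism: keep best two
        while len(nxt) < pop_size:
            a, b = rng.sample(scored[:10], 2)  # parents from top half
            if rng.random() < crossover_p:
                cut = rng.randrange(1, n_bits) # single-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = a
            # per-bit mutation with probability mutation_p
            child = tuple(bit ^ (rng.random() < mutation_p)
                          for bit in child)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```

With a one-max fitness (count of set bits) the search converges toward the all-ones string; a real deployment would replace that lambda with a scoring of diagnostic accuracy over the stored measurements.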
50. A Case Study of Using Big Data Processing in Education: Method of Matching Members by Optimizing Collaborative Learning Environment
- Author
-
Keiko Tsujioka
- Subjects
Big data processing ,Matching (statistics) ,Multimedia ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Collaborative learning ,computer.software_genre ,GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries) ,computer - Published
- 2020