Language: chinese / Publication Year Range: Last 50 years / Topic: big data and electronic data processing - Searchworks@Jio Institute Digital Library Search Results

1. 结合增益率与堆叠自编码器的并行随机森林算法 [Parallel random forest algorithm combining gain ratio and stacked autoencoders].

Author: 刘卫明, 陈伟达, 毛伊敏, and 陈志刚
Subjects: *RANDOM forest algorithms, *LATIN hypercube sampling, *ACTIVE learning, *BIG data, *ELECTRONIC data processing, *ALGORITHMS
Abstract: In the big data environment, the random forest algorithm suffers from excessive redundant and irrelevant features, insufficient spatial information content in the feature subspaces, and low parallelization efficiency. To resolve these issues, this paper presented the PRFGRSAE algorithm. First, the algorithm proposed DRNGRSAE, a strategy that filtered redundant and irrelevant features from the feature set and extracted features with stacked auto-encoders, effectively reducing the number of redundant and irrelevant features. Second, it proposed SSLF, a strategy combining Latin hypercube sampling and normalized correlation degree, which formed feature subspaces with high spatial expressiveness by performing multi-layer division sampling on the feature set and ensured the information content of the feature subspaces. Finally, it proposed DSVLA, a reducer allocation strategy based on variable action learning automata, which allocated clusters evenly to reducers for processing and effectively improved parallelization efficiency. Experimental results show that, compared with the IMRF, KSMRF, and GAPRF algorithms, the PRFGRSAE algorithm significantly improves the speedup ratio and accuracy. The algorithm can therefore obtain higher accuracy and parallel efficiency when processing large data, especially data sets with many features. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF
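
As a rough illustration of the gain ratio named in the title of this record, the sketch below scores one categorical feature against class labels. It assumes the standard C4.5 definition (information gain divided by split information); the abstract does not say how DRNGRSAE applies the measure, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of a categorical feature, using the C4.5 definition (assumed)."""
    n = len(labels)
    # Group the class labels by feature value.
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information penalises features with many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    # A feature that separates the classes perfectly, and a constant feature.
    print(gain_ratio(["a", "a", "b", "b"], [0, 0, 1, 1]))
    print(gain_ratio(["a", "a", "a", "a"], [0, 0, 1, 1]))
```

A feature with a gain ratio near zero carries little class information and would be a candidate for filtering under this assumed criterion.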

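2. 测绘大数据时代数据处理理论面临的挑战与发展 [Challenges and development of data processing theory in the era of surveying and mapping big data].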

Author: 朱建军, 宋迎春, 胡俊, 邹滨, and 吴立新
Subjects: *ARTIFICIAL intelligence, *ELECTRONIC data processing, *BIG data, *DATA mining, *DEEP learning, *ALGORITHMS
Abstract: With the development of information technology and the rise of surveying and mapping big data and artificial intelligence, a lack of data is no longer a problem. However, existing surveying and mapping data processing technology has pursued the accuracy of data (micro), whereas big data research allows the data to be mixed and uncertain (macro). Therefore, although traditional surveying and mapping data processing theory has accumulated many technical advantages in micro data processing, the scale and complexity of big data have become increasingly prominent, and traditional calculation models and analysis algorithms cannot effectively support the efficient analysis and processing of big data. Data processing theory and methods are key to the intelligent era, and how they should adapt to the challenges and opportunities of new technology deserves deep thought. Driven by big data, new ideas and methods such as large-scale data mining, machine learning and deep learning are booming; they greatly promote the fusion of multi-source heterogeneous big data inside and outside the scene, effectively extract surface feature information from a variety of sensor data, and constantly improve the ability to acquire and analyze surveying and mapping information. We think that the theory of surveying and mapping data processing also needs to keep pace, and that existing data processing methods need to become intelligent. Drawing on the frontier hot spots, development trends and existing challenges of intelligent surveying and mapping, this paper explores directions for expanding data processing theory. One aim is to promote the further development of surveying data processing theory, and the other is to provide a reference for graduate students who are interested in entering the field of surveying and mapping big data. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

3. 基于PMVS算法的大规模数据细粒度并行优化方法 [Fine-grained parallel optimization method for large-scale data based on the PMVS algorithm].

Author: 刘金硕, 李扬眉, 江庄毅, 邓娟, 眭海刚, and Pan Jeff
Subjects: *MULTICORE processors, *FEATURE extraction, *BIG data, *ELECTRONIC data processing, *COMMUNICATION models, *IMAGE processing, *PARALLEL programming
Abstract: We address the problem of fine-grained parallel optimization of large-scale data. The patch-based multi-view stereo (PMVS) algorithm has been widely applied to digital cities and other fields because of its good three-dimensional reconstruction quality; however, its execution efficiency is low on large-scale computations. To address this limitation, this paper proposes a fine-grained parallel optimization method that includes task allocation and load balancing, strategies for main system memory and GPU memory, and the optimization of communication. We perform CPU multi-threading using the pthreads function library to take full advantage of the computing power of multicore CPUs. For GPUs, we use the CUDA framework while optimizing thread organization and memory access. In addition, we propose adapting a memory pool model and a pipelining model to improve the bandwidth availability ratio. The memory pool model reduces the impact of transferring data resources on the bus between CPUs and GPUs while waiting for resources; the pipelining model hides the communication time for the CPU to read data from memory. This paper uses the Harris-DOG feature extraction of the PMVS algorithm on image sequences as the example to verify our optimization strategies. The experiments demonstrate that the multi-threaded CPU-based strategy achieves a 4-times speed-up, that the highest speed-up of the parallel CUDA-based strategy is 34 times, and that our strategy improves performance by 30% on top of the parallel CUDA-based strategy. In the future, our optimization strategy can be applied to fast computing resource scheduling in big data processing in other domains. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

4. 面向大数据的图模式挖掘概率算法 [Probabilistic algorithm for graph pattern mining on big data].

Author: 姜丽丽, 李叶飞, 豆龙龙, 陈智麒, and 钱柱中
Subjects: *ALGORITHMS, *BIG data, *ELECTRONIC data processing, *APPROXIMATION algorithms, *PROBABILITY theory
Abstract: In today's big data era, big data processing frameworks such as MapReduce often appear slow and inefficient when processing data, especially graph-related data. It is therefore necessary to explore an efficient algorithm for clique counting problems of this type. Since the previous literature has thoroughly explored 3-clique counting, the extended version of the problem (the 4-clique counting problem) is gradually gaining importance. Guided by a heuristic idea, this paper proposed a probability sampling algorithm based on neighboring edge sampling to solve the extended problem. Using the Chernoff inequality, the algorithm needed only a certain number of samplers to guarantee the relative error under the approximation condition. The experimental evaluation and comparison show that the probability sampling algorithm loses a small amount of precision compared with the traditional exact algorithm, but has great advantages in running time and space usage. Finally, the paper concludes that the algorithm has great practical value in real applications. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF
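
To make the idea of estimating a 4-clique count by sampling concrete, here is a minimal sketch. It is not the neighboring edge sampling scheme of the paper, whose details the abstract does not give; it swaps in a simpler estimator that samples edges uniformly and relies on each 4-clique containing exactly 6 edges. All function names are hypothetical, and the edge list is assumed to hold each undirected edge once.

```python
import random
from itertools import combinations


def build_adjacency(edges):
    """Adjacency sets for an undirected graph given as a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def cliques4_containing(adj, u, v):
    """Number of 4-cliques that contain the edge (u, v)."""
    common = adj[u] & adj[v]
    # Any adjacent pair of common neighbours completes a 4-clique with u and v.
    return sum(1 for a, b in combinations(common, 2) if b in adj[a])


def estimate_4cliques(edges, samples, seed=0):
    """Unbiased estimate of the 4-clique count from uniformly sampled edges."""
    rng = random.Random(seed)
    adj = build_adjacency(edges)
    total = sum(cliques4_containing(adj, *rng.choice(edges)) for _ in range(samples))
    # Each 4-clique is counted once for each of its 6 edges.
    return len(edges) * total / (6 * samples)


if __name__ == "__main__":
    k5 = list(combinations(range(5), 2))  # complete graph on 5 vertices
    print(estimate_4cliques(k5, samples=20))
```

On a complete graph every edge lies in the same number of 4-cliques, so the estimate is exact; on real graphs the variance depends on the number of samples, which is where a concentration bound such as the Chernoff inequality mentioned in the abstract comes in.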

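5. 基于改进人工蜂群算法与MapReduce的大数据聚类算法... [Big data clustering algorithm based on an improved artificial bee colony algorithm and MapReduce...]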

Author: 孙倩, 陈昊, and 李超
Subjects: *BIG data, *BEES algorithm, *LEARNING strategies, *DATA quality, *ELECTRONIC data processing
Abstract: To address the low computational efficiency and low clustering performance of clustering algorithms for big data, this paper proposed a big data clustering algorithm based on an improved ABC algorithm and MapReduce. The algorithm combined the grey wolf optimizer with the ABC algorithm to improve the exploration and exploitation of the ABC algorithm simultaneously, which helped to improve the clustering performance effectively. It used a chaotic map and backward learning as the initialization strategy of the ABC colony to improve the solution quality of the search procedure. It implemented the clustering algorithm on the MapReduce programming model, and clustered big data by minimizing the sum of squared intra-class distances. Experimental results demonstrate that the proposed algorithm improves the clustering quality of big data and speeds up the clustering procedure. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF
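
The objective named in this abstract, the sum of squared intra-class distances, can be phrased as a map step followed by a reduce step. The sketch below is a single-process illustration of that shape only: it does not use Hadoop, it leaves out the ABC and grey wolf search entirely, and the function names are hypothetical.

```python
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_point(point, centroids):
    """Map step: emit (nearest centroid index, squared distance to it)."""
    sqd = [sq_dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=sqd.__getitem__)
    return idx, sqd[idx]


def reduce_sse(pairs):
    """Reduce step: sum the squared distances per centroid index."""
    totals = defaultdict(float)
    for idx, d in pairs:
        totals[idx] += d
    return dict(totals)


def clustering_objective(points, centroids):
    """Total within-cluster sum of squared distances for a set of centroids."""
    per_cluster = reduce_sse(map_point(p, centroids) for p in points)
    return sum(per_cluster.values())


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 0)]
    centroids = [(0, 1), (10, 0)]
    print(clustering_objective(points, centroids))  # 2.0
```

A search procedure such as ABC would call an objective like this as its fitness function, with each candidate solution encoding one set of centroids.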

6. 一种基于遗传算法优化的大数据特征选择方法 [A big data feature selection method optimized by a genetic algorithm].

Author: 张文杰 and 蒋烈辉
Subjects: *ELECTRONIC data processing, *SEARCH algorithms, *GENETIC algorithms, *BIG data, *FEATURE selection, *CLASSIFICATION
Abstract: This paper proposed a novel feature selection method based on a genetic algorithm for big data processing. First, the method evaluated the features of each dimension, adjusted each feature's weight according to its difference on the nearest neighbor of the same class and the nearest neighbor of a different class, and guided the search of the genetic algorithm with the feature weights, thus improving the search ability of the algorithm and the accuracy of feature acquisition. It then combined the feature weights to calculate the fitness of the features, took fitness as the evaluation index, and ran the genetic algorithm to obtain the optimal feature subset, achieving efficient and accurate big data feature selection. The experimental results show that this method can effectively reduce the number of classification features and improve the accuracy of feature classification. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

7. 大数据下的分布式精确模糊 KNN 分类算法 [Distributed exact fuzzy KNN classification algorithm for big data].

Author: 邹劲松 and 李芳
Subjects: *BIG data, *K-nearest neighbor classification, *CLASSIFICATION algorithms, *FUZZY sets, *ELECTRONIC data processing, *DISTRIBUTED algorithms
Abstract: To improve the efficiency of processing large data sets with the K-nearest neighbor (KNN) method, this paper proposed a distributed exact fuzzy-KNN classification algorithm based on the Spark framework. The method combined the distributed map and reduce processes of the Spark framework with fuzzy-KNN. First, it processed the category information of the training samples in different partitions to obtain class membership degrees, converting the training set into a fuzzy training set with membership degrees added. It then used the KNN algorithm to find the k nearest neighbors of each test sample in the fuzzy training set. Finally, it classified the test samples by distance weight. Experiments on mega-scale datasets and comparison experiments with other algorithms show that this method is feasible and effective. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
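
The two stages this abstract describes (class membership degrees for the training set, then distance-weighted classification) can be sketched on a single machine as follows. The membership formula is the commonly used one from the fuzzy KNN literature (0.51 for the sample's own class plus 0.49 times the neighbor fraction), which is an assumption here because the abstract gives no formulas. The Spark partitioning is omitted, the names are hypothetical, and `k` is assumed to be smaller than the number of training samples.

```python
import math
from collections import Counter


def class_memberships(X, y, k):
    """Stage 1: fuzzy class membership degrees for every training sample."""
    classes = sorted(set(y))
    memberships = []
    for i, xi in enumerate(X):
        # k nearest training neighbours of sample i, excluding itself.
        others = (j for j in range(len(X)) if j != i)
        neighbors = sorted(others, key=lambda j: math.dist(xi, X[j]))[:k]
        counts = Counter(y[j] for j in neighbors)
        memberships.append({
            c: 0.49 * counts[c] / k + (0.51 if c == y[i] else 0.0)
            for c in classes
        })
    return memberships


def fuzzy_knn_predict(X, memberships, query, k, m=2.0):
    """Stage 2: distance-weighted class memberships of a query point (m > 1)."""
    neighbors = sorted(range(len(X)), key=lambda j: math.dist(query, X[j]))[:k]
    scores = Counter()
    total = 0.0
    for j in neighbors:
        # Inverse-distance weight; the floor avoids division by zero.
        w = 1.0 / max(math.dist(query, X[j]), 1e-12) ** (2.0 / (m - 1.0))
        total += w
        for c, u in memberships[j].items():
            scores[c] += w * u
    return {c: s / total for c, s in scores.items()}


if __name__ == "__main__":
    X = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
    y = ["a", "a", "a", "b", "b", "b"]
    u = class_memberships(X, y, k=2)
    result = fuzzy_knn_predict(X, u, (0.05,), k=2)
    print(max(result, key=result.get))  # a
```

In a distributed setting the first stage is the natural candidate for per-partition processing, since each training sample's membership row can be computed independently once its neighbors are known.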

8. 基于 HDFS 的大数据文件传输实验设计 [Experimental design of big data file transfer based on HDFS].

Author: 刘文杰
Subjects: *DISTRIBUTED computing, *COMPUTING platforms, *BIG data, *COLLEGE campuses, *ELECTRONIC data processing
Abstract: With the development of cloud computing application technology and related research, the cloud programming model has also seen new technological innovation. In the experimental teaching system of college campus networks, cloud platform experiments have become the main content for big data analysis. Using the HDFS structure to build a stable, practical experiment platform that meets the needs of the experimental course system has therefore become a new topic in the experimental study of campus networks in colleges and universities. In this paper, we use the open source cloud computing platform Hadoop as the basic platform for big data analysis experiments. The basic experimental platform is used for data processing optimization. HDFS provides the underlying application support for distributed computing and storage, and realizes communication between the NameNode and DataNodes. A user file is stored on the nodes as data blocks, so that the read and write requests of the client can be processed in a timely manner, and data blocks can be created, deleted, replicated and mapped under the unified scheduling of the NameNode. At the same time, the experiment process can be targeted according to the experiment methods. [ABSTRACT FROM AUTHOR]
Published: 2019

9. 基于CART的高校教师亚健康决策模型构建 [Construction of a CART-based subhealth decision model for university teachers].

Author: 易俗, 张一川, and 殷慧文
Subjects: *COLLEGE teachers, *CART algorithms, *DECISION trees, *ELECTRONIC data processing, *BIG data, *CONCEPTUAL structures
Abstract: Traditional evaluation of the subhealth of college teachers lacks timeliness, objectivity and efficiency. In the big data environment, a subhealth assessment model can be established more effectively with information technology, so as to support the evaluation and prediction of the subhealth state of university teachers. First, based on an analysis of the multidimensional subhealth impact factors of university teachers, the paper constructs a conceptual decision model of multidimensional subhealth impact. It then analyzes the characteristics of the sample data, processes the data, and details the process of applying the CART algorithm. Finally, the parallel implementation process and experimental verification based on Spark are given. The subhealth conceptual model of university teachers objectively reflects the evaluation factors of subhealth. The decision tree model can support the prediction and analysis of the subhealth of university teachers, and experiments verify the validity, immediacy and accuracy of the model. [ABSTRACT FROM AUTHOR]
Published: 2019
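
For readers unfamiliar with CART, the core step when growing a classification tree is choosing the threshold that minimises the weighted Gini impurity of the two child nodes. The sketch below shows that step for one numeric feature. The feature values and the "ok"/"sub" labels are invented purely for illustration and are not taken from the paper; the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        # Only split between two distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


if __name__ == "__main__":
    # Hypothetical data: one numeric indicator and a two-class label.
    indicator = [1, 2, 3, 4]
    status = ["ok", "ok", "sub", "sub"]
    print(best_split(indicator, status))  # (2.5, 0.0)
```

A full CART implementation repeats this search over every feature at every node and recurses on the two resulting subsets.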

10. 机器学习在湍流模型构建中的应用进展 [Progress in the application of machine learning to turbulence model construction].

Author: 张伟伟, 朱林阳, 刘溢浪, and 寇家庆
Subjects: *COMPUTER performance, *ARTIFICIAL intelligence, *MACHINE learning, *ELECTRONIC data processing, *BIG data, *FEATURE selection
Abstract: With the development of high-performance computers and data sharing platforms, a large amount of high-fidelity turbulence data can be obtained. Recently, owing to the evolution of artificial intelligence techniques such as deep neural networks, data-driven machine learning methods have been adopted to quantify model uncertainty and to improve and construct turbulence models. The combination of big turbulence data and artificial intelligence has become a new area of turbulence research. Although some encouraging results have been achieved, many difficulties and challenges remain, such as the generalization ability and robustness of the models. The modeling process involves various aspects, including data processing, feature selection, and the selection and optimization of the model framework. This paper analyzes and summarizes the main research progress from two aspects: the implementation methods of machine learning in turbulence modeling and the different model targets. The challenges and future work in this area are also discussed. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

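11. 基于事务映射区间求交的高效频繁模式挖掘算法 [Efficient frequent pattern mining algorithm based on transaction mapping and interval intersection].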

Author: 吴磊, 程良伦, and 王涛
Subjects: *ASSOCIATION rule mining, *APRIORI algorithm, *DATA mining, *TRANSACTION systems (Computer systems), *BIG data, *ELECTRONIC data processing
Abstract: Association rule mining is an important research topic in data mining. Big data processing places higher demands on the efficiency of association rule mining algorithms, in which the most time-consuming step is frequent pattern mining. To address the inefficiency of state-of-the-art frequent pattern mining algorithms, this paper proposed a frequent pattern mining algorithm based on interval interaction and transaction mapping (IITM), which combined the Apriori algorithm and the FP-growth algorithm. The algorithm needed to scan the dataset only twice to generate the FP tree, and then scanned the FP tree to map the ID of each transaction to an interval. It grew frequent patterns by interval interaction, which solved two problems that reduced the efficiency of frequent pattern mining: the Apriori algorithm needed to scan the dataset multiple times, and the FP-growth algorithm needed to generate conditional FP trees iteratively. Experiments on real datasets show that the IITM algorithm is superior to the Apriori, FP-growth, and PIETM algorithms at different support levels. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
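
Once each item's transaction IDs are stored as a sorted list of disjoint closed intervals, the support of an itemset can be obtained by intersecting interval lists instead of rescanning the dataset. The sketch below shows only that intersection step. How IITM assigns IDs and builds the intervals from the FP tree is not described in the abstract, so the interval lists are simply taken as given, and the function names are hypothetical.

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of disjoint closed integer intervals."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def support(intervals):
    """Number of transaction IDs covered by a list of closed intervals."""
    return sum(hi - lo + 1 for lo, hi in intervals)


if __name__ == "__main__":
    # Hypothetical interval lists for two items.
    item_x = [(1, 5), (8, 10)]
    item_y = [(4, 9)]
    both = intersect_intervals(item_x, item_y)
    print(both, support(both))  # [(4, 5), (8, 9)] 4
```

The two-pointer walk is linear in the number of intervals, which is what makes the representation attractive when many transactions share long runs of consecutive IDs.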

11 results
