390 results on '"Derong Shen"'
Search Results
202. Web Table Column Type Detection Using Deep Learning and Probability Graph Model
- Author
-
Derong Shen, Tong Guo, Yue Kou, and Tiezheng Nie
- Subjects
Dirty data ,Matching (graph theory) ,business.industry ,Computer science ,Deep learning ,010401 analytical chemistry ,02 engineering and technology ,Type (model theory) ,computer.software_genre ,01 natural sciences ,Column (database) ,0104 chemical sciences ,Knowledge base ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Table (database) ,Regular expression ,Artificial intelligence ,Data mining ,business ,computer - Abstract
The rich knowledge contained on the web plays an important role in research and practical applications, including web search, question answering, and knowledge base construction. Correctly detecting the semantic types of all data columns is critical to understanding a web table. Traditional methods have the following limitations: (1) most rely on dictionary lookup and regular expression matching, and are generally not robust to dirty data; (2) they consider only character data and ignore numeric data, which accounts for a large proportion; (3) some models use the characteristics of a single column and do not consider the special organizational structure of the table. In this paper, a column type detection method combining deep learning and a probability graph model is proposed, which takes both the semantic features of a single column and the interactions between multiple columns into account to improve prediction accuracy. Experimental results show that our method achieves higher accuracy than state-of-the-art approaches.
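The dictionary/regex baseline that this abstract critiques can be sketched in a few lines (an illustrative sketch, not the authors' deep-learning model; the type patterns here are assumptions):

```python
import re

# Illustrative regex patterns for a handful of column types (assumptions,
# not taken from the paper).
PATTERNS = {
    "date":    re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "integer": re.compile(r"^-?\d+$"),
    "decimal": re.compile(r"^-?\d+\.\d+$"),
}

def detect_column_type(values):
    """Return the first type whose pattern matches every non-empty cell."""
    cells = [v for v in values if v]
    for name, pat in PATTERNS.items():
        if cells and all(pat.match(c) for c in cells):
            return name
    # Fallback: a single dirty cell defeats the whole column, which is
    # exactly the brittleness the abstract points out.
    return "string"
```

One mistyped cell is enough to push a column to the fallback type, which illustrates why the paper argues for combining per-column semantics with inter-column structure.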
- Published
- 2020
203. Link Prediction Based on Smooth Evolution of Network Embedding
- Author
-
Derong Shen, Tiezheng Nie, Hao Dong, and Yue Kou
- Subjects
law ,Computer science ,Network embedding ,Network structure ,Heterogeneous information ,Data mining ,Transformer ,computer.software_genre ,computer ,law.invention - Abstract
The problem of link prediction in dynamic heterogeneous information networks has been widely studied in recent years, and network embedding has proved extremely useful for it. However, existing methods lack a close combination of the deep-level and temporal features of networks, which hurts prediction accuracy and makes it difficult to adapt to dynamic networks. In this paper, a Smooth Evolution model for Network Embedding (SENE) is proposed, which considers both deep-level and temporal features to obtain embedded representations of the network structure, and uses the transformer mechanism to achieve a smooth evolution of the network embedding. An SENE-based link prediction algorithm is also proposed, which effectively guarantees the accuracy of link prediction. The feasibility and effectiveness of the proposed techniques are verified by experiments.
- Published
- 2020
204. An Approach for Progressive Set Similarity Join with GPU Accelerating
- Author
-
Yue Kou, Derong Shen, Tiezheng Nie, and Lining Yu
- Subjects
020203 distributed computing ,Matching (graph theory) ,Computer science ,Computation ,02 engineering and technology ,Parallel computing ,Bloom filter ,computer.software_genre ,Inverted index ,Set (abstract data type) ,Similarity (network science) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Graphics ,computer ,Data integration - Abstract
Set similarity join (SSJoin) is an important operation for finding similar set pairs in a given database and plays a core role in data integration, data cleaning, and data mining. In contrast to traditional SSJoin methods, progressive SSJoin targets large datasets, improving the efficiency of finding similar pairs within a limited running time: it can deliver possible partial matching pairs as early as possible during processing. Moreover, recent research has shown that GPUs (Graphics Processing Units) can accelerate similarity operations. This paper focuses on progressive SSJoin algorithms and on accelerating them with GPUs. We propose two progressive SSJoin methods, PSSJM and PBM: PSSJM uses an inverted index, while PBM relies on a counting Bloom filter and prefix filtering. In addition, we propose a GPU-based algorithm built on our progressive methods to accelerate the computation. Comprehensive experiments on real-world datasets show that our methods generate better-quality results than the traditional method under limited time, and that the GPU implementation achieves high speedups over the CPU-based method.
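The inverted-index-plus-prefix-filtering idea underlying methods like PSSJM can be sketched as follows (a CPU-only illustration of the generic technique, not the paper's PSSJM/PBM implementation or its GPU variant):

```python
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def ssjoin_prefix(records, t):
    """Candidate generation with prefix filtering: two sets can reach
    Jaccard threshold t only if their prefixes, under a global
    rare-token-first order, share a token. Verified pairs are emitted
    as they are found, so partial results appear early (the
    'progressive' flavor)."""
    freq = {}
    for r in records:
        for tok in set(r):
            freq[tok] = freq.get(tok, 0) + 1
    key = lambda tok: (freq[tok], tok)   # rare tokens first
    index = {}                            # inverted index over prefixes
    pairs = set()
    for rid, r in enumerate(records):
        toks = sorted(set(r), key=key)
        # Prefix length: if no prefix token matches, Jaccard < t is certain.
        plen = len(toks) - math.ceil(t * len(toks)) + 1
        for tok in toks[:plen]:
            for other in index.get(tok, []):
                if jaccard(records[other], records[rid]) >= t:
                    pairs.add((other, rid))
            index.setdefault(tok, []).append(rid)
    return pairs
```

Only records sharing a prefix token are ever compared, which is what makes the join scale; the GPU version in the paper parallelizes this candidate verification.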
- Published
- 2020
205. Data integration strategy for database grids based on P2P framework
- Author
-
Guangqi Wang, Derong Shen, Ge Yu, Wensheng Zhou, and Meifang Li
- Published
- 2006
- Full Text
- View/download PDF
206. HPPQ: A parallel package queries processing approach for large-scale data
- Author
-
Yue Kou, Tiezheng Nie, Meihui Shi, Ge Yu, and Derong Shen
- Subjects
Computer Networks and Communications ,Computer science ,Volume (computing) ,Centroid ,computer.software_genre ,Computer Science Applications ,Rate of convergence ,Parallel processing (DSP implementation) ,Artificial Intelligence ,Preprocessor ,Data mining ,Data pre-processing ,Focus (optics) ,Greedy algorithm ,computer ,Information Systems - Abstract
Many scholars have focused on developing effective techniques for package queries, and many excellent approaches have been proposed. Unfortunately, most existing methods target a small volume of data, and the rapid increase in data volume makes it difficult for traditional package query methods to meet growing requirements. To solve this problem, a novel optimization method for package queries (HPPQ) is proposed in this paper. First, the data is preprocessed into regions: preprocessing segments the dataset into multiple subsets whose centroids are used for package queries, which effectively reduces the volume of candidate results. Furthermore, an efficient heuristic algorithm (IPOL-HS) is proposed based on the preprocessing results; it improves the quality of the candidate results in the iterative stage and the convergence rate of the heuristic search. Finally, a strategy called HPR is proposed, which relies on a greedy algorithm and parallel processing to accelerate the query. Experimental results show that our method significantly reduces time consumption compared with existing methods.
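For readers unfamiliar with package queries, a minimal greedy baseline conveys the problem shape (maximize total value under a total-cost budget). This is a generic sketch under assumed inputs, not HPPQ's IPOL-HS or HPR strategies:

```python
def greedy_package_query(items, budget):
    """Greedy baseline for a package query: pick items in decreasing
    value-per-cost order while the cost budget allows.
    items: list of (name, value, cost) tuples (hypothetical schema)."""
    chosen, spent = [], 0.0
    for name, value, cost in sorted(items, key=lambda i: i[1] / i[2],
                                    reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen
```

A greedy pass like this is cheap but can miss the optimal package; HPPQ's contribution is making higher-quality search feasible at scale via centroid-based preprocessing and parallelism.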
- Published
- 2018
207. Inferring Anchor Links Based on Social Network Structure
- Author
-
Yue Kou, Ge Yu, Shuo Feng, Jingrui He, Derong Shen, and Tiezheng Nie
- Subjects
Structure (mathematical logic) ,Theoretical computer science ,Similarity (geometry) ,General Computer Science ,Social network ,business.industry ,Computer science ,Node (networking) ,similarity metric ,Anchor link prediction ,General Engineering ,02 engineering and technology ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,aligned networks ,020201 artificial intelligence & image processing ,General Materials Science ,social network ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,business ,Focus (optics) ,Precision and recall ,Link (knot theory) ,lcsh:TK1-9971 - Abstract
Nowadays, people usually participate in multiple social networks simultaneously, e.g., Facebook and Twitter. Formally, the correspondences between accounts that belong to the same user are defined as anchor links, and networks aligned by anchor links are called aligned networks. In this paper, we study the problem of anchor link prediction (ALP) across a pair of aligned networks based on social network structure. First, three similarity metrics (CPS, CCS, and CPS+) are proposed. Unlike previous works, we focus on the theoretical guarantees of our metrics: we prove mathematically that the node pair with the maximum CPS or CPS+ is an anchor link with high probability, and that a correctly predicted anchor link must have a high CCS value. Second, using CPS+ and CCS, we present a two-stage iterative algorithm, CPCC, to solve the ALP problem; in particular, we present an early termination strategy to trade off precision and recall. Finally, a series of experiments on both synthetic and real-world social networks demonstrates the effectiveness of CPCC.
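The general flavor of structure-based anchor scoring can be sketched with a common-anchor-neighbor score (a generic illustration; the paper's CPS, CCS, and CPS+ have their own precise definitions and guarantees):

```python
def anchor_similarity(u, v, adj1, adj2, anchors):
    """Score a candidate anchor pair (u in network 1, v in network 2):
    count u's neighbors whose known anchor counterpart is also a
    neighbor of v, normalized by the larger degree. `anchors` maps
    network-1 nodes to their network-2 counterparts."""
    mapped = {anchors[n] for n in adj1.get(u, set()) if n in anchors}
    common = mapped & adj2.get(v, set())
    denom = max(len(adj1.get(u, set())), len(adj2.get(v, set())), 1)
    return len(common) / denom
```

Iterating such a score, confirming the best pair, adding it to `anchors`, and rescoring, mirrors the two-stage iterative structure of algorithms like CPCC.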
- Published
- 2018
208. Accelerating Progressive Set Similarity Join with the CPU-GPU Architecture
- Author
-
Yue Kou, Tiezheng Nie, Derong Shen, and Lining Yu
- Subjects
Information Systems and Management ,Matching (graph theory) ,Computer science ,Search engine indexing ,Parallel computing ,Bloom filter ,computer.software_genre ,Computer Science Applications ,Management Information Systems ,Set (abstract data type) ,Similarity (network science) ,Central processing unit ,Graphics ,computer ,Information Systems ,Data integration - Abstract
Set similarity join (SSJoin) is an important operation for finding similar set pairs in a given database and plays a core role in data integration, data cleaning, and data mining. Unlike traditional SSJoin methods, progressive SSJoin targets large datasets, improving the efficiency of finding similar pairs within a limited running time: it can deliver possible partial matching pairs as early as possible during processing. Moreover, much recent research has shown that GPUs (Graphics Processing Units) can accelerate similarity join operations. This paper focuses on progressive SSJoin algorithms and on accelerating them with a CPU-GPU architecture. We propose two progressive SSJoin methods, PSSJM and PBM: PSSJM utilizes an inverted index, while PBM relies on a counting Bloom filter and prefix filtering. In addition, we propose a GPU-based algorithm built on our progressive SSJoin methods to accelerate the processing. Comprehensive experiments on real-world datasets show that our methods generate better-quality results than the traditional method under limited time, and that the CPU-GPU implementation achieves high speedups over the CPU-based method.
- Published
- 2021
209. Private Blocking Technique for Multi-party Privacy-Preserving Record Linkage
- Author
-
Yue Kou, Ge Yu, Derong Shen, Tiezheng Nie, and Shumin Han
- Subjects
Matching (statistics) ,Private blocking ,Computer science ,Paillier cryptosystem ,Computational Mechanics ,k-anonymity ,Privacy-preserving record linkage ,02 engineering and technology ,Similarity measure ,computer.software_genre ,Blocking (statistics) ,lcsh:QA75.5-76.95 ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,lcsh:T58.5-58.64 ,lcsh:Information technology ,Scalability ,Process (computing) ,Computer Science Applications ,Information sensitivity ,020201 artificial intelligence & image processing ,lcsh:Electronic computers. Computer science ,Data mining ,computer ,Record linkage - Abstract
The process of matching and integrating records that refer to the same entity across one or more datasets is known as record linkage, and it has become increasingly important in many application areas, including business, government, and healthcare. Data in these areas often contain sensitive information. To prevent privacy breaches, records should ideally be linked in a private way, such that no information other than the matching result is leaked in the process; this technique is called privacy-preserving record linkage (PPRL). With growing data volumes, scalability has become the main challenge of PPRL, and many private blocking techniques have been developed for it. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious non-matching pairs without compromising privacy. However, most are designed for only two databases, and they vary widely in how well they balance the competing goals of accuracy, efficiency, and security. In this paper, we propose a novel private blocking approach for PPRL, based on dynamic k-anonymous blocking and the Paillier cryptosystem, that can be applied to two or more databases. In dynamic k-anonymous blocking, our approach dynamically generates blocks satisfying k-anonymity and chooses more accurate representative values for the blocks as k varies. We also propose a novel similarity measure over numerical attributes that, combined with the Paillier cryptosystem, securely measures the similarity of two or more blocks, providing strong privacy guarantees: no information is revealed even under collusion. Experiments on a public dataset of voter registration records validate that our approach scales to large databases while maintaining high blocking quality. We compare our method with other techniques and demonstrate gains in security and accuracy.
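The k-anonymity constraint on blocks can be illustrated with a minimal grouping sketch (the cryptographic side, i.e., Paillier-based secure comparison, and the paper's dynamic choice of k are out of scope here):

```python
from collections import defaultdict

def k_anonymous_blocks(records, key, k):
    """Minimal sketch of k-anonymous blocking for PPRL: group records by
    a blocking key, then merge undersized groups forward so that every
    released block holds at least k records and no block identifies
    fewer than k individuals."""
    buckets = defaultdict(list)
    for r in records:
        buckets[key(r)].append(r)
    blocks, pending = [], []
    for _, members in sorted(buckets.items()):
        pending.extend(members)
        if len(pending) >= k:      # block is large enough to release
            blocks.append(pending)
            pending = []
    if pending:                    # leftovers join the last block
        if blocks:
            blocks[-1].extend(pending)
        else:
            blocks.append(pending)
    return blocks
```

Only blocks, never individual keys, are exchanged between parties; candidate pairs are then generated within matching blocks.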
- Published
- 2017
210. A Common Application-Centric QoS Model for Selecting Optimal Grid Services
- Author
-
Derong Shen, Ge Yu, Tiezheng Nie, and Zhibin Zhao
- Published
- 2005
- Full Text
- View/download PDF
211. PS-GIS: personalized and semantics-based grid information services.
- Author
-
Yue Kou, Ge Yu 0001, Derong Shen, Dong Li 0023, and Tiezheng Nie
- Published
- 2007
- Full Text
- View/download PDF
212. A Top-K-based cache model for deep web query.
- Author
-
Yue Kou, Derong Shen, Ge Yu 0001, Tiezheng Nie, and Dong Li 0023
- Published
- 2007
- Full Text
- View/download PDF
213. Discovering context-aware conditional functional dependencies
- Author
-
Tiezheng Nie, Ge Yu, Derong Shen, Yuefeng Du, and Yue Kou
- Subjects
Data consistency ,General Computer Science ,Process (engineering) ,Computer science ,business.industry ,Heuristic ,Context (language use) ,0102 computer and information sciences ,02 engineering and technology ,computer.software_genre ,Machine learning ,01 natural sciences ,Measure (mathematics) ,Theoretical Computer Science ,Consistency (database systems) ,Theory of relativity ,010201 computation theory & mathematics ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Data mining ,Artificial intelligence ,Focus (optics) ,business ,computer - Abstract
Conditional functional dependencies (CFDs) are an important technique for data consistency. However, CFDs are limited in their ability to (1) provide reasonable values for consistency repairing and (2) detect potential errors. This paper presents context-aware conditional functional dependencies (CCFDs), which help provide reasonable values and detect potential errors. In particular, we focus on automatically discovering minimal CCFDs. We introduce context relativity to measure the relationship between CFDs: the overlap of related CFDs can provide reasonable values that lead to more accurate consistency repairing, and related CFDs can be combined into CCFDs. Moreover, we prove that discovering minimal CCFDs is NP-complete, and we design both a precise method and a heuristic method, together with the notion of a dominating value to facilitate both. Additionally, since the context relativity of CFDs affects the cleaning results, we suggest an approximate threshold of context relativity derived from the data distribution. Our empirical evaluation confirms that the repairing results are more accurate.
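A plain CFD, the building block that CCFDs extend, can be checked with a short scan (an illustrative violation detector for a single CFD, not the paper's CCFD-discovery algorithm; the relation schema is hypothetical):

```python
def cfd_violations(tuples, lhs, rhs, pattern):
    """Detect violations of a CFD ([lhs] -> rhs, pattern): among tuples
    matching the constant pattern, equal LHS values must imply equal
    RHS values. Returns the offending tuples."""
    seen, bad = {}, []
    for t in tuples:
        if all(t.get(a) == v for a, v in pattern.items()):
            key = tuple(t[a] for a in lhs)
            if key in seen and seen[key] != t[rhs]:
                bad.append(t)           # same LHS, conflicting RHS
            else:
                seen.setdefault(key, t[rhs])
    return bad
```

CCFDs go further by relating several such rules through shared context, which is what lets them both suggest repair values and surface errors a single CFD would miss.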
- Published
- 2016
214. Exploiting Unlabeled Ties for Link Prediction in Incomplete Signed Networks
- Author
-
Derong Shen, Rui Mao, Dong Li, Yichuan Shao, Yue Kou, and Tiezheng Nie
- Subjects
Prediction algorithms ,Training set ,Computer science ,business.industry ,Feature extraction ,Training (meteorology) ,Artificial intelligence ,Link (knot theory) ,Machine learning ,computer.software_genre ,business ,computer - Published
- 2019
215. Link Prediction Based on Node Embedding and Personalized Time Interval in Temporal Multi-relational Network
- Author
-
Liu Yuxin, Derong Shen, Yue Kou, and Tiezheng Nie
- Subjects
Sequence ,Property (programming) ,Computer science ,Node (networking) ,Process (computing) ,Embedding ,Construct (python library) ,Interval (mathematics) ,Data mining ,Link (knot theory) ,computer.software_genre ,computer - Abstract
Link prediction on temporal networks has a wide range of applications, such as individual relationship mining, user recommendation, and user behavior analysis. Traditional link prediction methods on temporal networks consider only the structure of single-relational networks, ignoring the diversity of link types and the influence between different link types. This paper proposes a link prediction method for temporal multi-relational networks that combines personalized time intervals with node embedding. First, node embeddings are generated from the structure of the target network and an auxiliary network, overcoming the information sparseness of a single network. Then, considering the diversity of link types, we construct for each link a relationship-formation sequence based on personalized time intervals and the influence between different relationships. Next, the relationship-formation sequence is modeled with a Hawkes process, which takes the product of the node embeddings as its initial value to calculate the probability of link formation. The method captures the dynamic characteristics and multi-relational nature of the network, which helps improve the accuracy of link prediction. Experimental results show that the proposed method performs better and can be applied to large-scale networks.
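The Hawkes process named in the abstract has a standard conditional intensity that is easy to state (the textbook univariate form with an exponential kernel; the paper's parameterization via node embeddings is more involved):

```python
import math

def hawkes_intensity(t, base_rate, events, alpha, beta):
    """Conditional intensity of a univariate Hawkes process with an
    exponential kernel:
        lambda(t) = mu + sum over t_i < t of alpha * exp(-beta * (t - t_i)).
    Each past event excites the process, and the excitation decays
    with rate beta. All parameters here are illustrative."""
    return base_rate + sum(
        alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t
    )
```

Self-excitation is what makes the model fit relationship-formation sequences: a recent interaction raises the short-term probability of the next one.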
- Published
- 2019
216. A Cross-Network User Identification Model Based on Two-Phase Expansion
- Author
-
Tiezheng Nie, Shuo Feng, Yue Kou, Derong Shen, and Li Xiang
- Subjects
Computer science ,business.industry ,Phase (waves) ,02 engineering and technology ,computer.software_genre ,Time cost ,Small set ,Set (abstract data type) ,Range (mathematics) ,Identification (information) ,Order (exchange) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Local search (optimization) ,Data mining ,business ,computer - Abstract
Cross-network user identification is a technique for inferring the potential links among shared user entities across multiple networks. However, existing methods mainly rely on a small set of seed users, which may not provide enough evidence for identification. In this paper, we propose a cross-network user identification model based on two-phase expansion. On one hand, to effectively solve the cold-start problem, we propose a global seed expansion method to enlarge the seed set. On the other hand, we propose a local search-range expansion method that ensures higher accuracy at a lower time cost. Experiments demonstrate the effectiveness of the proposed model.
- Published
- 2019
217. Attentional Memory Network with Correlation-based Embedding for time-aware POI recommendation
- Author
-
Yue Kou, Derong Shen, Tiezheng Nie, Ge Yu, and Meihui Shi
- Subjects
Information Systems and Management ,Mechanism (biology) ,Computer science ,02 engineering and technology ,computer.software_genre ,Management Information Systems ,Correlation ,Artificial Intelligence ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,020201 artificial intelligence & image processing ,Data mining ,computer ,Software - Abstract
As considerable amounts of point-of-interest (POI) check-in data have accumulated, POI recommendation has received much attention recently. It is well recognized that spatial-temporal information plays an important role in a user's decision to visit a POI. However, in time-aware POI recommendation, exploring temporal patterns in user preferences and incorporating multi-view factors for choosing preferred POIs remain challenging. To this end, we propose a novel Attentional Memory Network with Correlation-based Embedding (AMN-CE) for time-aware POI recommendation. Specifically, we first propose a correlation-based POI embedding method to capture geographical influence and the interactive correlation between POIs. Next, we design an attentional memory network that captures the micro-level relationship between pairs of time slots. Furthermore, we propose a temporal-level attention mechanism to distinguish and dynamically adjust the influence of different time slots on user preferences at the target time slot. Experimental results on four real-life datasets demonstrate significant improvements of our method over state-of-the-art models.
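The temporal-level attention the abstract describes boils down to scoring each time slot against the target slot and normalizing with a softmax (generic dot-product attention as a sketch, not AMN-CE's exact formulation; embeddings here are made-up toy vectors):

```python
import math

def temporal_attention(query, keys):
    """Given a target-slot embedding `query` and one embedding per time
    slot in `keys`, return softmax-normalized influence weights: slots
    whose embedding aligns with the target get more weight."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the weights are recomputed per target slot, the influence of each historical slot is adjusted dynamically, which is the behavior the abstract highlights.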
- Published
- 2021
218. Topological Features Based Entity Disambiguation
- Author
-
Tiezheng Nie, Yue Kou, Ge Yu, Chenchen Sun, and Derong Shen
- Subjects
Computer science ,02 engineering and technology ,Random walk ,Topology ,Graph ,Computer Science Applications ,Theoretical Computer Science ,Computational Theory and Mathematics ,Hardware and Architecture ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,020201 artificial intelligence & image processing ,Cluster analysis ,Software - Abstract
This work proposes an unsupervised entity disambiguation solution based on topological features. Most existing studies leverage semantic information to resolve ambiguous references; however, semantic information is not always accessible, because of privacy, or is too expensive to obtain. We consider the problem in a setting where only the relationships between references are available. A structural similarity algorithm based on random walk with restarts is proposed to measure the similarity of references. Disambiguation is treated as a clustering problem, and a family of graph-walk-based clustering algorithms is used to group ambiguous references. We evaluate our solution extensively on two real datasets and show its advantage in accuracy over two state-of-the-art approaches.
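Random walk with restarts, the similarity primitive this paper builds on, can be computed by simple power iteration (a generic sketch over an adjacency dict; the paper's algorithm and clustering layer are not reproduced here):

```python
def rwr_scores(adj, seed, restart=0.15, iters=100):
    """Random walk with restart from `seed`: at each step the walker
    either teleports back to the seed (prob. `restart`) or moves to a
    uniformly random neighbor. The stationary visit probabilities serve
    as structural similarity scores relative to the seed."""
    nodes = list(adj)
    p = {n: 0.0 for n in nodes}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {n: restart if n == seed else 0.0 for n in nodes}
        for n in nodes:
            share = (1 - restart) * p[n] / max(len(adj[n]), 1)
            for m in adj[n]:
                nxt[m] += share
        p = nxt
    return p
```

References whose RWR score from a seed reference is high are structurally close to it, which is the signal used to cluster co-referent mentions without any semantics.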
- Published
- 2016
219. Content-Related Repairing of Inconsistencies in Distributed Data
- Author
-
Ge Yu, Yuefeng Du, Tiezheng Nie, Yue Kou, and Derong Shen
- Subjects
Computer science ,Process (engineering) ,020207 software engineering ,02 engineering and technology ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Computational Theory and Mathematics ,Hardware and Architecture ,020204 information systems ,Data quality ,0202 electrical engineering, electronic engineering, information engineering ,Data mining ,computer ,Software - Abstract
Conditional functional dependencies (CFDs) are a critical technique for detecting inconsistencies, but they may miss some potential inconsistencies because they do not consider the content relationships in the data. Content-related conditional functional dependencies (CCFDs) are a special type of CFD that combine content-related CFDs and detect potential inconsistencies by examining content-related data together. In the process of cleaning inconsistencies, detection and repairing interact: (1) detection catches inconsistencies, and (2) repairing corrects the caught inconsistencies but may introduce new ones. Moreover, data are often fragmented and distributed across multiple sites, so cleaning inconsistencies can require expensive data shipment. In this paper, our aim is to repair inconsistencies in distributed content-related data. We propose a framework consisting of an inconsistency detection method and an inconsistency repairing method that work iteratively. The detection method marks the violated CCFDs to determine which inconsistencies should be repaired first. Based on the repairing-cost model presented in this paper, we prove that minimum-cost repairing using CCFDs is NP-complete; therefore, the repairing method heuristically repairs inconsistencies at minimum cost. To improve the efficiency and accuracy of repairing, we propose distinct values and rule sequences: distinct values require less data shipment than real data for communication, and rule sequences determine appropriate repairing orders that avoid some incorrect repairs. Empirical evaluation on two real-life datasets shows that our solution is more effective than CFDs.
- Published
- 2016
220. Searching overlapping communities for group query
- Author
-
Derong Shen, Tiezheng Nie, Jing Shan, Yue Kou, and Ge Yu
- Subjects
Theoretical computer science ,Computer Networks and Communications ,Computer science ,media_common.quotation_subject ,02 engineering and technology ,computer.software_genre ,Set (abstract data type) ,Order (exchange) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Quality (business) ,media_common ,Degree (graph theory) ,Social network ,business.industry ,Heuristic ,Node (networking) ,Online community ,Graph ,Evolving networks ,Hardware and Architecture ,020201 artificial intelligence & image processing ,Data mining ,business ,computer ,Software - Abstract
In most real-life networks, such as social networks and biological networks, a node is often involved in multiple overlapping communities. Overlapping community discovery has therefore drawn a great deal of attention, and there is a large body of research on it. However, most work has focused on community detection, which takes the whole network as input and derives all communities at once. Community detection can only be used for offline analysis of networks; it is quite costly, inflexible, and cannot support dynamically evolving networks. Online community search, which finds only the overlapping communities containing a given node, is a flexible and lightweight solution that also supports dynamic graphs well. In some scenarios, however, overlapping community search is required for group queries, where the input is a set of nodes instead of a single node. To solve this problem, we propose an overlapping community search framework for group queries, including both exact and heuristic solutions. The heuristic solution has four strategies, some of which are adjustable and self-adaptive. We propose two parameters, node degree and discovery power, to trade off the efficiency and quality of the heuristic strategies so that they satisfy different application requirements. Comprehensive experiments demonstrate the efficiency and quality of both the exact and heuristic solutions.
- Published
- 2015
221. Uncertain top-k query processing in distributed environments
- Author
-
Ge Yu, Derong Shen, and Xite Wang
- Subjects
Distributed Computing Environment ,Information Systems and Management ,Uncertain data ,Computer science ,Probabilistic logic ,Sorting ,02 engineering and technology ,Filter (signal processing) ,Data structure ,computer.software_genre ,Query optimization ,Set (abstract data type) ,Hardware and Architecture ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Data mining ,computer ,Software ,Information Systems - Abstract
Top-k queries on uncertain data have been a very active topic in recent years, and there have been many studies on them. Unfortunately, most existing algorithms consider only centralized processing environments and are not suitable for large-scale data. In this paper, we make the first attempt to process probabilistic threshold top-k queries (an important class of uncertain top-k queries, PT-k for short) in a distributed environment. We propose three efficient algorithms. The serial distributed approach adopts a new method, requiring only a small number of calculations, to serially process PT-k queries in distributed environments. The global sorting first algorithm (GSP) is designed to improve computation speed: a distributed sorting operation is performed, the candidates for PT-k queries are computed in parallel, and the query results are obtained using a novel incremental method that reduces the number of calculations. The local filtering first algorithm is designed to reduce network overhead: several filtering strategies filter out redundant data locally, and the incremental method from GSP is then used to process the PT-k queries. Finally, the effectiveness of the proposed algorithms is verified through a series of experiments.
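The quantity at the heart of a PT-k query, the probability that a tuple ranks in the top-k, can be computed with the standard Poisson-binomial recurrence (a centralized sketch under the common independent-tuple model, not the paper's distributed algorithms):

```python
def topk_probability(prob, higher_probs, k):
    """Probability that a tuple with existence probability `prob` is in
    the top-k: the tuple must exist, and fewer than k of the
    higher-scored tuples (with the given existence probabilities) may
    exist. dp[j] holds P(exactly j higher tuples exist)."""
    dp = [1.0]
    for q in higher_probs:
        dp = [
            (dp[j] * (1 - q) if j < len(dp) else 0.0)
            + (dp[j - 1] * q if j > 0 else 0.0)
            for j in range(len(dp) + 1)
        ]
    return prob * sum(dp[:k])
```

A PT-k query then returns every tuple whose value exceeds the probability threshold; the distributed algorithms in the paper avoid evaluating this recurrence for tuples that filtering already rules out.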
- Published
- 2015
222. An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets
- Author
-
Yue Kou, Derong Shen, Xi-Te Wang, Ge Yu, Mei Bai, and Tiezheng Nie
- Subjects
Computer science ,computer.software_genre ,Computer Science Applications ,Theoretical Computer Science ,Tree (data structure) ,Computational Theory and Mathematics ,Hardware and Architecture ,Distributed algorithm ,Search algorithm ,Outlier ,Anomaly detection ,Data mining ,Cluster analysis ,Greedy algorithm ,Time complexity ,computer ,Software - Abstract
The distance-based outlier is a widely used definition of outlier: a point is distinguished as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computation in distributed environments, we propose DBOZ, a distributed algorithm for distance-based outlier detection using a Z-curve hierarchical tree (ZH-tree). First, we propose a new index, the ZH-tree, to effectively manage data in a distributed environment. The ZH-tree has two desirable properties: a clustering property that helps find the neighbors of a point, and a hierarchical structure that supports space pruning. We also design a bottom-up approach to build the ZH-tree in parallel, with time complexity linear in the number of dimensions and the size of the dataset. Second, DBOZ computes outliers in distributed environments in two stages. (1) To avoid calculating the exact nearest neighbors of all points, we design a greedy method and a ZH-tree-based k-nearest-neighbor search algorithm (ZHkNN for short) to obtain a threshold LW. (2) We propose a filter-and-refine approach, which first filters out unpromising points using LW and then outputs the final outliers by refining the remaining points. Finally, the efficiency and effectiveness of the ZH-tree and DBOZ are verified through a series of experiments.
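The underlying definition is simple to state in code (a brute-force centralized version of one common distance-based variant; DBOZ's ZH-tree indexing and distribution are what make it scale, and are omitted here):

```python
def distance_outliers(points, k, threshold):
    """Brute-force distance-based outlier detection: a point is an
    outlier if the Euclidean distance to its k-th nearest neighbor
    exceeds `threshold`. Returns the indices of outliers."""
    outliers = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        if dists[k - 1] > threshold:
            outliers.append(i)
    return outliers
```

This is O(n^2) per call; the filter-and-refine stage of DBOZ exists precisely to avoid computing exact neighbor distances for most points.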
- Published
- 2015
223. A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data
- Author
-
Derong Shen, Yue Kou, Tiezheng Nie, and Zhaozhao Xu
- Subjects
0303 health sciences ,Computer science ,Sampling (statistics) ,Health Informatics ,Matthews correlation coefficient ,Class (biology) ,Computer Science Applications ,Random forest ,03 medical and health sciences ,Identification (information) ,Statistical classification ,0302 clinical medicine ,Research Design ,Humans ,030212 general & internal medicine ,Noise (video) ,Medical diagnosis ,Algorithm ,Algorithms ,030304 developmental biology - Abstract
The problem of imbalanced data classification often arises in medical diagnosis. Traditional classification algorithms usually assume that the number of samples in each class is similar and that their misclassification costs during training are equal. However, the cost of misclassifying patient samples is higher than that of misclassifying healthy ones, so how to improve the identification of patients without hurting the classification of healthy individuals is an urgent problem. To solve the problem of imbalanced data classification in medical diagnosis, we propose a hybrid sampling algorithm called RFMSE, which combines a Misclassification-oriented Synthetic Minority Over-sampling Technique (M-SMOTE) and Edited Nearest Neighbor (ENN) based on Random Forest (RF). The algorithm has three main parts. First, M-SMOTE increases the number of samples in the minority class, with its over-sampling rate set to the misclassification rate of the RF. Then, ENN removes noisy samples from the majority class. Finally, RF performs classification on the samples after hybrid sampling, and the stopping criterion for the iterations is determined by the change in the Matthews Correlation Coefficient (MCC): when the MCC drops continuously, the iterations stop. Extensive experiments on ten UCI datasets demonstrate that RFMSE effectively solves the problem of imbalanced data classification. Compared with traditional algorithms, our method improves the F-value and MCC more effectively.
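The SMOTE step at the core of such hybrid samplers interpolates between minority points and their minority-class neighbors. A dependency-free sketch of plain SMOTE (the paper's M-SMOTE additionally drives the over-sampling rate from the random forest's misclassification rate, which is not modeled here):

```python
import random

def smote_oversample(minority, n_new, k=3, rng=None):
    """Minimal SMOTE-style oversampling: each synthetic sample is a
    random interpolation between a minority point and one of its k
    nearest minority-class neighbors (Euclidean distance)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neigh = sorted(
            (q for q in minority if q is not p),
            key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)),
        )[:k]
        q = rng.choice(neigh)
        lam = rng.random()          # interpolation coefficient in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(p, q)))
    return synthetic
```

Because samples are interpolated, they stay inside the convex hull of the minority class; the ENN step then prunes majority points whose neighborhoods disagree with their label.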
- Published
- 2020
224. Erratum to: A Framework for Supporting Tree-Like Indexes on the Chord Overlay
- Author
-
Tiezheng Nie, Mingdong Zhu, Derong Shen, Ge Yu, and Yue Kou
- Subjects
Theoretical computer science ,Computational Theory and Mathematics ,Hardware and Architecture ,Computer science ,Theory of computation ,Chord (music) ,Overlay ,Chord (peer-to-peer) ,Software ,Computer Science Applications ,Theoretical Computer Science - Abstract
Erratum: Ming-Dong Zhu, De-Rong Shen, Yue Kou, Tie-Zheng Nie, Ge Yu. A Framework for Supporting Tree-Like Indexes on the Chord Overlay. Journal of Computer Science and Technology 2013, 28(6): 962-972. DOI: 10.1007/s11390-013-1391-8.
- Published
- 2020
225. Sequence Translating Model Using Deep Neural Block Cascade Network: Taking Protein Secondary Structure Prediction as an Example
- Author
-
Tiezheng Nie, Yu Hu, Ge Yu, and Derong Shen
- Subjects
0301 basic medicine ,Sequence ,Dependency (UML) ,Computer science ,business.industry ,Deep learning ,Feature extraction ,computer.software_genre ,03 medical and health sciences ,030104 developmental biology ,Feature (machine learning) ,Data mining ,State (computer science) ,Artificial intelligence ,Layer (object-oriented design) ,business ,computer ,Block (data storage) - Abstract
Sequence data is one of the most common forms in which objects are described, and mining hidden information from sequence data is an important research subject that has attracted much attention. Recently, rapidly developing deep learning techniques have achieved great success in biomedical information mining. In this paper, we propose Deep Neural Block Cascade Networks (DeepNBCN), a general framework for sequence translation, and take protein secondary structure prediction as an important test target. DeepNBCN has a fully free adjustment mechanism for deep neural networks, in which every layer is structured as a block composed of one or more feature extractors (called modules) and delivers features to the upper layer. DeepNBCN can model not only the internal relations between an amino acid sequence and its Protein Secondary Structure (PSS), but also complex sequence relationships and dependencies. We evaluate our model on two protein datasets. With a well-designed architecture for PSS, the experimental results show that our model achieves better performance than state-of-the-art algorithms.
- Published
- 2018
226. A Progressive Method for Detecting Duplication Entities Based on Bloom Filters
- Author
-
Tiezheng Nie, Derong Shen, Ge Yu, Yebing Luo, and Yue Kou
- Subjects
Index (publishing) ,Computer science ,Matched filter ,Sorting ,Volume (computing) ,Data mining ,Bloom filter ,Finite time ,computer.software_genre ,computer ,Duplicate detection ,Block (data storage) - Abstract
As the volume of data grows rapidly, the cost of detecting duplicate entities in data cleaning has increased significantly. However, some real-time applications only need to identify as many duplicate entities as possible within a limited time, rather than all of them. Existing works adopt sorting methods to divide similar records into blocks and arrange the processing order of blocks so as to detect duplicate entities progressively. However, such methods only work well when the attributes of records are suitable for sorting. Therefore, this paper proposes a novel progressive de-duplication method for records that cannot be sorted by their attributes. The method distributes records into different blocks based on their features and generates a modified Bloom filter index for each block. It then uses the Bloom filter to predict the probability that a block contains duplicate entities, which determines the processing order of blocks so that duplicates are detected more quickly. Comprehensive experiments show that, within a fixed time budget, this algorithm detects far more duplicates than the algorithms in related works.
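The per-block index relies on the standard Bloom filter membership test, which can be sketched as follows; the paper's modified index and its block-ordering heuristic are not reproduced here:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    Only the basic membership structure is sketched; the paper builds
    a modified per-block index on top of it."""
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # no false negatives; false positives at a small, tunable rate
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
for record in ("alice|1980", "bob|1975", "carol|1990"):
    bf.add(record)
```

A block whose filter answers `might_contain` for many incoming records is a promising candidate to process early.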
- Published
- 2017
227. User Identification across Social Networks Based on Global View Features
- Author
-
Qian Wang, Yue Kou, Ge Yu, Shuo Feng, Tiezheng Nie, and Derong Shen
- Subjects
Matching (statistics) ,Exploit ,business.industry ,Computer science ,Filter (signal processing) ,Machine learning ,computer.software_genre ,Task (project management) ,Core (game theory) ,Identification (information) ,Task analysis ,Artificial intelligence ,Precision and recall ,business ,computer - Abstract
Nowadays, people prefer to take part in multiple social networks to enjoy different kinds of services. Consequently, a significant task is to identify users across networks. Most state-of-the-art works on this issue exploit users' local structure features (e.g., friend, follow and followed relations). In this paper, we first propose the notion of user global view features, which represent the location of users in the network. Then, we present an iterative two-stage algorithm (GAUI) that combines Global view features with user Attribute features to solve User Identification. In GAUI, we iteratively update pairwise similarities and predict new matching users. In addition, we present a community-based core anchor link filter strategy to reduce the computation cost, and a stable-matching-based mapping strategy to improve accuracy. Finally, experiments conducted on two real-world aligned networks demonstrate that our method achieves better precision and recall.
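The stable-matching step can be illustrated with a minimal Gale-Shapley sketch over hypothetical similarity scores; this shows the idea of a stable one-to-one mapping, not GAUI's exact strategy:

```python
def stable_match(sim):
    """Gale-Shapley stable matching over a similarity matrix sim[u][v]
    between users of network A (rows) and network B (columns)."""
    n = len(sim)
    prefs = [sorted(range(n), key=lambda v: -sim[u][v]) for u in range(n)]
    next_pick = [0] * n          # next preference each row user proposes to
    engaged = {}                 # column user -> row user
    free = list(range(n))
    while free:
        u = free.pop()
        v = prefs[u][next_pick[u]]
        next_pick[u] += 1
        if v not in engaged:
            engaged[v] = u
        elif sim[engaged[v]][v] < sim[u][v]:
            free.append(engaged[v])   # v prefers u; old partner is free again
            engaged[v] = u
        else:
            free.append(u)
    return {u: v for v, u in engaged.items()}

# hypothetical cross-network similarity scores
mapping = stable_match([[0.9, 0.1, 0.2],
                        [0.3, 0.8, 0.1],
                        [0.2, 0.4, 0.7]])
```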
- Published
- 2017
228. SAMES: deadline-constraint scheduling in MapReduce
- Author
-
Mei Bai, Xite Wang, Yue Kou, Tiezheng Nie, Derong Shen, and Ge Yu
- Subjects
General Computer Science ,Kernel (image processing) ,Computer science ,Distributed computing ,Data_MISCELLANEOUS ,Exception handling ,Parallel computing ,Theoretical Computer Science ,Scheduling (computing) - Abstract
MapReduce is a popular parallel data-processing system, and task scheduling is one of its kernel techniques. In many applications, users require that their MapReduce jobs be completed before specific deadlines. Hence, in this paper, a novel scheduling algorithm based on the most effective sequence (SAMES) is proposed for deadline-constrained jobs in MapReduce. First, according to the characteristics of MapReduce, we propose a novel sequence-based execution strategy for MapReduce jobs and a new concept, the effective sequence (ES). Then, we design efficient approaches for finding ESes and choose the most effective sequence (MES) for job execution. We also propose methods for MES updates and exception handling. Finally, we verify the effectiveness of SAMES through experiments. The experimental results show that SAMES is an efficient scheduling algorithm for deadline-constrained jobs in MapReduce.
- Published
- 2014
229. Determining Repairing Sequence of Inconsistencies in Content-Related Data
- Author
-
Yuefeng Du, Tiezheng Nie, Ge Yu, Derong Shen, and Yue Kou
- Subjects
Sequence ,Data consistency ,Computer science ,020207 software engineering ,02 engineering and technology ,computer.software_genre ,Conditional functional dependencies ,Sequence graph ,Consistency (database systems) ,020204 information systems ,Data quality ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,Data mining ,Semaphore ,computer - Abstract
Data consistency is one of the central issues of data quality management. Content-related conditional functional dependencies (CCFDs) are practical techniques for data consistency; they catch inconsistencies by putting content-related data together. In particular, the repairing sequence plays a key role in consistency repairing: some repairing sequences may bring unexpected results (e.g., incorrect repairs or results with extra repairing cost). Hence, reasonable repairing sequences are advocated and readily supported by commercial systems for better performance. To meet this need, this paper presents a method for determining the repairing sequence of inconsistencies in content-related data. (1) We present a repairing sequence graph over CCFDs to select the inconsistencies that should be repaired preferentially. (2) We analyze repairing mutexes and discuss the interaction between the repairing sequence and repairing mutexes. (3) We prove that the problem of determining the repairing sequence with minimum repairing cost is NP-complete, so our method heuristically finds an appropriate repairing sequence. Empirical evaluation on three datasets shows our solution to be effective.
- Published
- 2017
230. A Distributed Algorithm for the Cluster-Based Outlier Detection Using Unsupervised Extreme Learning Machines
- Author
-
Mei Bai, Yue Kou, Derong Shen, Tiezheng Nie, Xite Wang, and Ge Yu
- Subjects
Speedup ,Article Subject ,Computer science ,General Mathematics ,Node (networking) ,lcsh:Mathematics ,Credit card fraud ,General Engineering ,02 engineering and technology ,computer.software_genre ,lcsh:QA1-939 ,ComputingMethodologies_PATTERNRECOGNITION ,Ranking ,Distributed algorithm ,lcsh:TA1-2040 ,020204 information systems ,Outlier ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Anomaly detection ,Data mining ,lcsh:Engineering (General). Civil engineering (General) ,computer - Abstract
Outlier detection is an important data mining task, whose target is to find the abnormal or atypical objects in a given dataset. The techniques for detecting outliers have many applications, such as credit card fraud detection and environment monitoring. Our previous work proposed the Cluster-Based (CB) outlier and gave a centralized method using unsupervised extreme learning machines to compute CB outliers. In this paper, we propose a new distributed algorithm for CB outlier detection (DACB). On the master node, we collect a small number of points from the slave nodes to obtain a threshold. On each slave node, we design a new filtering method that uses the threshold to efficiently speed up the computation. Furthermore, we propose a ranking method to optimize the order of cluster scanning. Finally, the effectiveness and efficiency of the proposed approaches are verified through extensive simulation experiments.
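The master/slave threshold filtering can be sketched with a generic distance-based outlier score; the CB score itself is cluster-based and not reproduced here, and all names and parameters below are illustrative:

```python
import numpy as np

def knn_score(X, p, k=5):
    """Outlier score: distance to the k-th nearest neighbour. A generic
    distance-based stand-in for the paper's cluster-based score."""
    return np.sort(np.linalg.norm(X - p, axis=1))[k]

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (200, 2))
X[0] = [10.0, 10.0]                      # planted outlier

# "master": estimate a score threshold from a small sample of points
sample = rng.choice(len(X), 30, replace=False)
thr = np.quantile([knn_score(X, X[i]) for i in sample], 0.95)

# "slave": report only candidates whose score beats the threshold,
# so most inlier points never reach the master
candidates = [i for i in range(len(X)) if knn_score(X, X[i]) >= thr]
```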
- Published
- 2017
231. Research on Related Entity Identification Model and Incremental Verification Algorithm for Heterogeneous Networks
- Author
-
Tai-Ming Wang, Yue Kou, Derong Shen, Ge Yu, Tiezheng Nie, and Heng Liu
- Subjects
Computer Networks and Communications ,Computer science ,business.industry ,Machine learning ,computer.software_genre ,Computer Graphics and Computer-Aided Design ,Identification (information) ,Hardware and Architecture ,Artificial intelligence ,Data mining ,business ,computer ,Software ,Heterogeneous network - Published
- 2014
232. Survey on NoSQL for Management of Big Data
- Author
-
Derong Shen, Ge Yu, Tie-Zheng Nie, Xi-Te Wang, and Yue Kou
- Subjects
Database ,business.industry ,Computer science ,Big data ,NoSQL ,computer.software_genre ,business ,computer ,Software - Published
- 2014
233. A Throughput Driven Task Scheduler for Batch Jobs in Shared MapReduce Environments
- Author
-
Xite Wang, Tiezheng Nie, Ge Yu, Yue Kou, and Derong Shen
- Subjects
Computer science ,Operating system ,Batch processing ,computer.software_genre ,Throughput (business) ,computer ,Task (project management) - Published
- 2014
234. A Framework for Supporting Tree-Like Indexes on the Chord Overlay
- Author
-
Kou Yue, Ge Yu, Mingdong Zhu, Derong Shen, and Tiezheng Nie
- Subjects
Speedup ,business.industry ,Computer science ,Distributed computing ,Data management ,Overlay ,Computer Science Applications ,Theoretical Computer Science ,Computational Theory and Mathematics ,Hardware and Architecture ,Distributed algorithm ,Scalability ,Chord (music) ,business ,Chord (peer-to-peer) ,Software - Abstract
With the explosive growth of data, to support efficient data management including queries and updates, the database system is expected to provide tree-like indexes, such as the R-tree, M-tree and B+-tree, for different types of data. In a distributed environment, the indexes have to be scattered across the compute nodes to improve reliability and scalability. Indexes can speed up queries, but they incur maintenance cost when updates occur. Since each compute node maintains a subset of an index tree, keeping the communication cost small is crucial; otherwise it occupies substantial network bandwidth and affects the scalability and availability of the database system. Further, to achieve reliability and scalability for queries, several replicas of the index are needed, but keeping the replicas consistent is not straightforward. In this paper, we propose a framework supporting tree-like indexes based on the Chord overlay, a popular P2P structure. The framework dynamically tunes the number of index replicas to balance the query cost against the update cost, and several techniques are designed to improve the efficiency of updates without degrading query performance. We implement the M-tree and R-tree in our framework, and extensive experiments on real-life and synthetic datasets verify its efficiency and scalability.
- Published
- 2013
235. Query Intent Disambiguation of Keyword-Based Semantic Entity Search in Dataspaces
- Author
-
Ge Yu, Dan Yang, Yue Kou, Tiezheng Nie, and Derong Shen
- Subjects
Computer science ,Relational database ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Query language ,Query optimization ,computer.software_genre ,Theoretical Computer Science ,Query expansion ,Dataspaces ,Web query classification ,Query by Example ,computer.programming_language ,Semantic query ,Concept search ,Web search query ,Information retrieval ,InformationSystems_DATABASEMANAGEMENT ,Computer Science Applications ,XML database ,Computational Theory and Mathematics ,Object Query Language ,Hardware and Architecture ,Sargable ,computer ,Software ,RDF query language - Abstract
Keyword query has attracted much research attention due to its simplicity and wide applications, but its inherent ambiguity is prone to producing unsatisfactory query results. Moreover, existing techniques for keyword query on the Web, in relational databases and in XML databases cannot be completely applied to keyword query in dataspaces. We therefore propose KeymanticES, a novel keyword-based semantic entity search mechanism for dataspaces that combines features of both keyword query and semantic query. We focus on the query intent disambiguation problem and propose a novel three-step approach to resolve it. Extensive experimental results show the effectiveness and correctness of our proposed approach.
- Published
- 2013
236. Combining Influence and Sensitivity to Factorize Matrix for Multi-Context Recommendation
- Author
-
Yue Kou, Tiezheng Nie, Ge Yu, Qingna Zhao, and Derong Shen
- Subjects
Matrix (mathematics) ,Factorization ,Computer science ,Context (language use) ,Resource management ,Algorithm design ,Data mining ,Sensitivity (control systems) ,Recommender system ,computer.software_genre ,computer ,Matrix decomposition - Abstract
With the growing amount of information available online, context-aware recommender systems have emerged to improve the precision of recommendation. Matrix factorization models are the state of the art in these systems, especially for multi-context recommendation. However, existing models ignore either context influence or entity sensitivity: they assume that one entity (a user or an item) shares the same factors across different contexts, or that one context exerts the same influence across different entities. In fact, the influence of a context (or the sensitivity of an entity) may vary as entities (or contexts) change. In this paper, we present a matrix factorization model for multi-context recommendation (ISMF). Unlike traditional models, ISMF considers both context influence and entity sensitivity. Instead of enforcing the same factors for each entity, we split the factors into entity-intrinsic factors and entity-specific factors, representing the entity itself and the context influence respectively, and use parameters acting on these factors to represent entity sensitivity. We also propose a matrix factorization algorithm for ISMF that iteratively determines the factors and the relevant parameters to maintain recommendation precision. The experiments demonstrate the feasibility and effectiveness of our method.
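The underlying matrix factorization can be sketched with plain SGD over observed entries; ISMF's context-specific factors and sensitivity parameters are omitted, so this is a baseline illustration only:

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.02, reg=0.02, epochs=300, seed=0):
    """Plain matrix factorization by SGD: approximate R with U @ V.T
    on observed entries (mask == True). ISMF extends this with
    entity-specific factors per context and sensitivity weights."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.normal(0, 0.1, (n, k))
    V = rng.normal(0, 0.1, (m, k))
    for _ in range(epochs):
        for i in range(n):
            for j in range(m):
                if mask[i, j]:
                    e = R[i, j] - U[i] @ V[j]
                    u_old = U[i].copy()
                    U[i] += lr * (e * V[j] - reg * U[i])
                    V[j] += lr * (e * u_old - reg * V[j])
    return U, V

R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
mask = R > 0                                  # observed ratings only
U, V = factorize(R, mask)
rmse = np.sqrt(np.mean((R - U @ V.T)[mask] ** 2))
```

The zero entries are treated as unobserved; predictions for them come from `U @ V.T`.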
- Published
- 2016
237. Cluster-Based Outlier Detection Using Unsupervised Extreme Learning Machines
- Author
-
Yue Kou, Tiezheng Nie, Xi-Te Wang, Ge Yu, Mei Bai, and Derong Shen
- Subjects
Computer science ,business.industry ,Credit card fraud ,InformationSystems_DATABASEMANAGEMENT ,Pattern recognition ,Data set ,Task (computing) ,ComputingMethodologies_PATTERNRECOGNITION ,Outlier ,Cluster (physics) ,Unsupervised learning ,Anomaly detection ,Pruning (decision trees) ,Artificial intelligence ,business - Abstract
Outlier detection is an important data mining task, whose target is to find the abnormal or atypical objects in a given data set. The techniques for detecting outliers have many applications, such as credit card fraud detection and environment monitoring. In this paper, we propose a new definition of outlier, called the cluster-based outlier. Compared with existing definitions, the cluster-based outlier is more suitable for complicated data sets that consist of many clusters with different densities. To detect cluster-based outliers, we first split the given data set into a number of clusters using unsupervised extreme learning machines. Then, we design a pruning technique to efficiently compute the outliers in each cluster. Finally, the effectiveness and efficiency of the proposed approaches are verified through extensive simulation experiments.
- Published
- 2016
238. Scalable Private Blocking Technique for Privacy-Preserving Record Linkage
- Author
-
Yue Kou, Shumin Han, Ge Yu, Tiezheng Nie, and Derong Shen
- Subjects
Matching (statistics) ,Computer science ,Process (computing) ,02 engineering and technology ,k-anonymity ,Similarity measure ,computer.software_genre ,Blocking (statistics) ,Paillier cryptosystem ,020204 information systems ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data mining ,computer ,Record linkage - Abstract
Record linkage is the process of matching records from multiple databases that refer to the same entities, and it has become an increasingly important subject in many application areas, including business, government, and health. When data about people is collected from these areas, integrating it across organizations can raise privacy concerns. To prevent privacy breaches, records should ideally be linked in a private way such that no information other than the matching result is leaked in the process; this technique is called privacy-preserving record linkage (PPRL). Scalability is one of the main challenges in PPRL, so many private blocking techniques have been developed for it. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious non-matching pairs without compromising privacy, but they vary widely in their ability to balance the competing goals of accuracy, efficiency and security. In this paper, we propose a novel private blocking approach for PPRL based on dynamic k-anonymous blocking and the Paillier cryptosystem. In dynamic k-anonymous blocking, our approach dynamically generates blocks satisfying k-anonymity, with more accurate representative values as k varies. We also propose a novel similarity measure that operates on numerical attributes and combines with the Paillier cryptosystem to measure the similarity of two blocks securely, providing strong privacy guarantees under which no information is revealed. Experiments conducted on a public dataset of voter registration records validate that our approach scales to large databases and maintains high blocking quality. We compare our method with other techniques and demonstrate the gains in security and accuracy.
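The k-anonymity guarantee for blocking, that every block holds at least k records, can be sketched as follows. The grouping rule is a simplification of the paper's dynamic scheme, and the Paillier-based secure similarity step is omitted:

```python
def k_anonymous_blocks(values, k):
    """Group sorted numeric blocking-key values into blocks of at
    least k records each, so no block singles out fewer than k people.
    A simplified version of dynamic k-anonymous blocking."""
    values = sorted(values)
    blocks, cur = [], []
    for v in values:
        cur.append(v)
        if len(cur) >= k:
            blocks.append(cur)
            cur = []
    if cur:                      # fold the remainder into the last block
        if blocks:
            blocks[-1].extend(cur)
        else:
            blocks.append(cur)
    return blocks

blocks = k_anonymous_blocks([5, 1, 9, 3, 7, 2, 8], k=3)
```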
- Published
- 2016
239. Anchor Link Prediction Using Topological Information in Social Networks
- Author
-
Ge Yu, Shuo Feng, Tiezheng Nie, Yue Kou, and Derong Shen
- Subjects
Social network ,Iterative method ,business.industry ,Computer science ,Reliability (computer networking) ,Node (networking) ,02 engineering and technology ,Social group ,Evolving networks ,Similarity (network science) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Precision and recall - Abstract
People today may participate in multiple social networks (Facebook, Twitter, Google+, etc.). Predicting the correspondence between accounts that refer to the same natural person across multiple social networks is a significant and challenging problem. Formally, social networks that outline the relationships of a common group of people are defined as aligned networks, and the correspondences between accounts that refer to the same natural person across aligned networks are defined as anchor links. In this paper, we study the problem of Anchor Link Prediction (ALP). First, two similarity metrics (Bi-Similarity, BiS, and Reliability Similarity, ReS) are proposed to measure the similarity between nodes in aligned networks, and we prove mathematically that the node pair with the maximum BiS has a higher probability of being an anchor link and that a correctly predicted anchor link must have a high ReS. Second, we present an iterative algorithm that solves the ALP problem efficiently, and we discuss its termination to give a tradeoff between precision and recall. Finally, we conduct a series of experiments on both synthetic and real social networks to confirm the effectiveness of our approach.
- Published
- 2016
240. SKM: A Schema Matching Model Based on Schema Structure and Known Matching Knowledge
- Author
-
Ge Yu, Yue Kou, Xu Zhang, Tie-Zheng Nie, En-Yun Yu, and Derong Shen
- Subjects
Schema (genetic algorithms) ,Theoretical computer science ,Computer science ,Star schema ,Semi-structured model ,Schema matching ,Software - Published
- 2009
241. KDS-CM: A cache mechanism based on top-K data source for Deep Web query
- Author
-
Ge Yu, Yue Kou, Derong Shen, Tiezheng Nie, and Dong Li
- Subjects
Deep Web ,Data source ,Smart Cache ,Multidisciplinary ,Database ,Cache coloring ,Computer science ,Cache invalidation ,Page cache ,Cache ,computer.software_genre ,computer ,Cache algorithms - Abstract
Caching is an important technique for enhancing the efficiency of query processing. Unfortunately, traditional caching mechanisms are not efficient for the deep Web because of storage space and dynamic maintenance limitations. In this paper, we present a cache mechanism based on Top-K data sources (KDS-CM), rather than on result records, for deep Web query. By integrating techniques from IR and Top-K processing, a data reorganization strategy is presented to model KDS-CM. We also propose measures for cache management and optimization to improve cache performance effectively. Experimental results show the benefits of KDS-CM in execution cost and dynamic maintenance compared with various alternative strategies.
- Published
- 2007
242. An efficient multi-keyword query processing strategy on P2P based Web search
- Author
-
Hongkai Zhu, Meifang Li, Ge Yu, and Derong Shen
- Subjects
Multidisciplinary ,Web search query ,Information retrieval ,Computer science ,Overlay network ,Bloom filter ,Query optimization ,computer.software_genre ,Query expansion ,Web query classification ,Sargable ,Data mining ,Precision and recall ,computer - Abstract
This paper presents a novel benefit-based query processing strategy for efficient query routing. Using a DHT as the overlay network, it first applies Nash equilibrium to construct the optimal peer group, based on the correlations of keywords and the coverage and overlap of the peers, to decrease the time cost; it then presents a two-layered architecture for query processing that uses Bloom filters as a compact representation to reduce bandwidth consumption. Extensive experiments conducted on a real-world dataset demonstrate that our approach significantly decreases processing time while improving precision and recall.
- Published
- 2007
243. A Method to Discover Truth with Two Source Quality Metrics
- Author
-
Tiezheng Nie, Dong Yu, Derong Shen, Yue Kou, Mingdong Zhu, and Ge Yu
- Subjects
Measure (data warehouse) ,business.industry ,Computer science ,media_common.quotation_subject ,Probabilistic logic ,Object (computer science) ,computer.software_genre ,Machine learning ,Knowledge-based systems ,Knowledge base ,Quality (business) ,False positive rate ,Data mining ,Artificial intelligence ,business ,computer ,media_common ,Data integration - Abstract
In many Web integration applications, several sources often describe the same entity with different values, which leads to many conflicts. Resolving conflicts and finding the truth can improve the quality of integration or help build a high-quality knowledge base. In the single-truth conflict scenario, existing methods cannot distinguish false negatives (also called missing data) from false positives, so their source quality measurements are inadequate. Therefore, in this paper, we use recall and false positive rate to measure source quality and present a method to discover truth. Experimental results on three real-world data sets show that the proposed algorithm can effectively distinguish missing data from false positives and improve the precision of truth discovery.
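The interplay between source quality and truth selection can be sketched as an iterative re-weighting loop. Note this sketch uses a single smoothed-accuracy trust score per source, whereas the paper measures recall and false positive rate separately; the data is hypothetical:

```python
def discover_truth(claims, iters=10):
    """claims: {source: {object: value}}. Iteratively estimate source
    trust as agreement with the current truths, then re-vote with
    trust-weighted votes."""
    trust = {s: 1.0 for s in claims}
    truths = {}
    for _ in range(iters):
        votes = {}
        for s, kv in claims.items():
            for o, v in kv.items():
                votes.setdefault(o, {})
                votes[o][v] = votes[o].get(v, 0.0) + trust[s]
        truths = {o: max(vs, key=vs.get) for o, vs in votes.items()}
        for s, kv in claims.items():
            agree = sum(truths[o] == v for o, v in kv.items())
            trust[s] = (agree + 1) / (len(kv) + 2)   # smoothed accuracy
    return truths, trust

claims = {"s1": {"a": 1, "b": 2, "c": 3},
          "s2": {"a": 1, "b": 2, "c": 3},
          "s3": {"a": 9, "b": 2, "c": 8}}
truths, trust = discover_truth(claims)
```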
- Published
- 2015
244. An Unsupervised Approach for Constructing Word Similarity Network
- Author
-
Derong Shen, Yu Hu, Yue Kou, and Tiezheng Nie
- Subjects
Vocabulary ,Information retrieval ,business.industry ,Latent semantic analysis ,Computer science ,media_common.quotation_subject ,computer.software_genre ,Semantics ,Semantic similarity ,Explicit semantic analysis ,Semantic computing ,Semantic integration ,Artificial intelligence ,business ,computer ,Semantic compression ,Natural language processing ,media_common - Abstract
Evaluating how similar a pair of entities or documents are is a common problem in current applications. Most approaches to this problem are based on co-occurrence. However, different terms or words may represent the same entity or similar semantics in the real world, since a concept often has more than one form of expression. Existing works focus on computing the semantic relatedness of words, but relatedness does not always reflect similarity; moreover, most of their corpora come from common data sources such as Wikipedia and are not useful for specialized vocabulary. In this paper, we propose a novel unsupervised approach for evaluating the semantic similarity between words by mapping texts to a vector space and computing prior information. Our approach builds a model that can identify words representing the same entity in a special context even when they do not belong to the same concept. Finally, we construct a network of words in which the paths between words reflect the evolution of concepts. Our experimental results show that our approach effectively discovers the semantic relationships between words, especially in specialty domains.
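Mapping words to a vector space and comparing them can be sketched with co-occurrence vectors and cosine similarity, a minimal stand-in for the paper's construction; the sentences below are toy data:

```python
import math
from collections import Counter

def cooccurrence_vectors(sentences, window=2):
    """Build word vectors from co-occurrence counts within a window:
    words appearing in similar contexts get similar vectors."""
    vecs = {}
    for words in sentences:
        for i, w in enumerate(words):
            ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

sentences = [["the", "car", "drives", "fast"],
             ["the", "automobile", "drives", "fast"],
             ["the", "dog", "barks", "loudly"]]
vecs = cooccurrence_vectors(sentences)
```

Here "car" and "automobile" never co-occur, yet share a context and therefore score higher than "car" and "dog".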
- Published
- 2015
245. Graph-Based Approach for Cross Domain Text Linking
- Author
-
Yue Kou, Yu Hu, Tiezheng Nie, and Derong Shen
- Subjects
Computer science ,business.industry ,media_common.quotation_subject ,Text graph ,Ambiguity ,Document clustering ,computer.software_genre ,Domain (software engineering) ,Focus (linguistics) ,Text mining ,Semantic similarity ,Similarity (psychology) ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Artificial intelligence ,business ,computer ,Natural language processing ,media_common - Abstract
Comprehensive analysis of multi-domain texts has an important effect on text mining. Although the objects described by multi-domain texts belong to different fields, the texts sometimes overlap partially, and linking text fragments that are overlapping or complementary is a necessary step for many tasks, such as entity resolution, information retrieval and text clustering. Previous works on computing text similarity mainly focus on string-based, corpus-based and knowledge-based approaches. However, cross-domain texts exhibit very special features compared to texts in the same domain: (1) entity ambiguity: texts from different domains may contain various references to the same entity; (2) content skewness: cross-domain texts overlap only partially. In this paper, we propose a novel fine-grained approach based on a text graph for evaluating the semantic similarity of cross-domain texts and linking their similar parts. The experimental results show that our approach effectively discovers the semantic relationships between cross-domain text fragments.
- Published
- 2015
246. Hybrid-LSH for Spatio-Textual Similarity Queries
- Author
-
Derong Shen, Mingdong Zhu, Ge Yu, and Ling Liu
- Subjects
Range (mathematics) ,Correctness ,Similarity (network science) ,Salient ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Search engine indexing ,High dimensional ,Data mining ,computer.software_genre ,Data objects ,computer ,Locality-sensitive hashing - Abstract
Locality Sensitive Hashing (LSH) is a popular method for high-dimensional indexing and search over large datasets. However, little effort has been put into utilizing LSH in mobile applications for processing spatio-textual similarity queries, such as finding nearby shopping centers that have a top-ranked hair salon. In this paper, we present hybrid-LSH, a new LSH method for indexing data objects according to both their spatial location and their keyword similarity. Our hybrid-LSH approach has two salient features. First, it carefully combines spatial-location-based LSH with textual-similarity-based LSH to ensure the correctness of spatial and textual similarity based NN queries. Second, we present an adaptive query-processing model to address the fixed-range problem of traditional LSH and to handle queries with varying ranges effectively. Extensive experiments conducted on both synthetic and real datasets validate the efficiency of our hybrid-LSH method.
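The combination of a spatial hash and a textual hash can be sketched by concatenating a grid-cell id with a random-hyperplane (SimHash-style) signature. The key layout and parameters are assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(2)
planes = rng.normal(size=(4, 3))     # random hyperplanes for the text part

def hybrid_key(loc, text_vec, cell=1.0):
    """Hybrid-LSH style key: a spatial grid-cell id concatenated with a
    random-hyperplane signature of the keyword vector. Objects that are
    both nearby and textually similar share a bucket."""
    spatial = (int(loc[0] // cell), int(loc[1] // cell))
    textual = tuple(int(text_vec @ p > 0) for p in planes)
    return spatial + textual

salon = np.array([1.0, 0.2, 0.0])    # hypothetical keyword vector
k_near1 = hybrid_key((0.2, 0.3), salon)
k_near2 = hybrid_key((0.4, 0.6), salon)
k_far = hybrid_key((5.5, 5.5), salon)
```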
- Published
- 2015
247. An Efficient Approach of Overlapping Communities Search
- Author
-
Tiezheng Nie, Jing Shan, Ge Yu, Derong Shen, and Yue Kou
- Subjects
Evolving networks ,Exact algorithm ,Computer science ,Community search ,Computation ,Distributed computing ,Offline analysis ,Online community ,Multiple input ,Graph - Abstract
A great deal of research has been dedicated to discovering overlapping communities, since in most real-life networks, such as social and biology networks, a node often belongs to multiple overlapping communities. However, most work has focused on community detection, which takes the whole graph as input and derives all communities at once. Community detection can only be used for offline analysis of networks; it is quite costly, inflexible, and cannot support dynamically evolving networks. Online community search, which only finds the overlapping communities containing given nodes, is a flexible and lightweight solution that also supports dynamic graphs well. Thus, in this paper, we study an efficient solution to the overlapping community search problem. We propose an exact algorithm whose performance is greatly improved by exploiting a boundary-node limitation and avoiding duplicate computations over multiple input nodes, and we also propose three approximate strategies that trade off efficiency and quality and can be adopted under different requirements. Comprehensive experiments demonstrate the efficiency and quality of the proposed algorithms.
- Published
- 2015
248. Entity resolution approaches for data quality
- Author
-
Chenchen, Sun and Derong, Shen
- Subjects
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION - Abstract
Entity resolution, which identifies the records that correspond to the same real-world entity across data sources, is a key aspect of data quality and a hot topic in both research communities and industry. We introduce three approaches that address different aspects of entity resolution. The first learns entity resolution classifiers with a genetic algorithm and active learning. The second proposes a solution for joint entity resolution. The third makes match decisions for unsupervised entity resolution by graph clustering. All three approaches are effective in entity resolution tasks.
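The graph-clustering idea behind the third approach can be illustrated with a minimal sketch, assuming a pairwise similarity function and connected-component clustering via union-find; the threshold, function names, and transitive-closure rule are illustrative assumptions, not the work's actual method.

```python
def cluster_matches(records, sim, threshold=0.8):
    # Build a similarity graph over record pairs and take connected
    # components as entity clusters (transitive match decisions).
    n = len(records)
    parent = list(range(n))

    def find(i):
        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(records[i])
    return list(clusters.values())
```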
- Published
- 2015
249. GB-JER: A Graph-Based Model for Joint Entity Resolution
- Author
-
Tiezheng Nie, Derong Shen, Chenchen Sun, Yue Kou, and Ge Yu
- Subjects
Computer science ,Graph based ,Graph (abstract data type) ,Data mining ,computer.software_genre ,computer ,Matched pair ,Similarity propagation - Abstract
Resolving multiple classes of related entity representations jointly improves the accuracy of entity resolution. We propose a graph-based joint entity resolution model, GB-JER, which exploits a dynamic entity representation relationship graph. It contracts the neighborhood of each matched pair, where the enriched semantics provide new evidence for subsequent entity resolution iterations. GB-JER is also an incremental approach. The experimental evaluation shows that GB-JER outperforms the existing state-of-the-art joint entity resolution approach in accuracy.
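The iterative evidence-sharing idea can be sketched roughly as follows, where matching one candidate pair boosts the similarity of related pairs (similarity propagation); the boost value, the threshold, and the data layout are illustrative assumptions, not GB-JER's actual graph-contraction procedure.

```python
def joint_resolve(pairs, base_sim, related, threshold=0.7, boost=0.1):
    # Iterative joint resolution sketch: when a candidate pair is
    # matched, the similarity of its related pairs is boosted, which
    # may trigger further matches in later passes.
    sim = dict(base_sim)
    matched = set()
    changed = True
    while changed:
        changed = False
        for p in pairs:
            if p not in matched and sim[p] >= threshold:
                matched.add(p)
                changed = True
                for q in related.get(p, []):
                    sim[q] = min(1.0, sim[q] + boost)
    return matched
```

A pair that starts below the threshold can thus be matched once a neighboring pair resolves, which is the joint effect the abstract describes.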
- Published
- 2015
250. An Effective Schema Mapping Model for Decentralized Network
- Author
-
Zhenhua Wang, Derong Shen, and Ge Yu
- Subjects
Document Structure Description ,Information retrieval ,Schema migration ,Computer science ,Semi-structured model ,Database schema ,InformationSystems_DATABASEMANAGEMENT ,computer.software_genre ,Information schema ,Conceptual schema ,Star schema ,Schema (psychology) ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Data mining ,computer - Abstract
Schema heterogeneity among individual peers is an important issue in a PDMS (Peer Data Management System). Schema mapping, a key technology for finding semantic matching relationships between schemas, plays an important role in a PDMS. In this paper, we address the problem of schema mapping in a PDMS, a typical decentralized network. Aiming at the limitations of previous methods, we propose a schema mapping model based on peer interest, schema structure information, and query logs, which also considers the uncertainty of schema mapping during query propagation among peers. In our model, peers with the same interest form a community, and schema mapping is handled within the interest community. We propose a query-processing strategy in which the uncertainty of schema mapping is taken into account. To supplement the matching and reasoning, we mine the query log to discover relationships between schema elements and further refine the schema mapping. Experimental results show that our method is feasible and effective for schema mapping in a decentralized network.
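A minimal sketch of log-refined element matching, in the spirit of the model described (the name-similarity measure, the fixed log boost, and the returned confidence score are assumptions, not the paper's actual model):

```python
from difflib import SequenceMatcher

def match_schemas(src_elems, dst_elems, log_pairs=(), threshold=0.6):
    # Score candidate element mappings by name similarity, then refine
    # with query-log evidence: element pairs that co-occur in logged
    # query rewrites receive a confidence boost.
    mappings = {}
    log_set = set(log_pairs)
    for s in src_elems:
        best, best_score = None, 0.0
        for d in dst_elems:
            score = SequenceMatcher(None, s.lower(), d.lower()).ratio()
            if (s, d) in log_set:
                score = min(1.0, score + 0.2)  # log co-occurrence boost
            if score > best_score:
                best, best_score = d, score
        if best_score >= threshold:
            mappings[s] = (best, round(best_score, 2))  # keep uncertainty
    return mappings
```

Keeping the score alongside each mapping is one way to carry the uncertainty forward into query propagation, as the abstract suggests.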
- Published
- 2015