54 results for "text clustering"
Search Results
2. Mini-batch k-Means versus k-Means to Cluster English Tafseer Text: View of Al-Baqarah Chapter
- Abstract
Al-Quran is the primary text of Muslims' religion and practice. Millions of Muslims around the world use al-Quran as their reference guide, so Muslims and Islamic scholars in general can obtain knowledge from it. Al-Quran has been translated into many of the world's languages, for example English, by several translators, and each translator brings his own ideas, comments, and statements to the translation of the verses (Tafseer). This paper therefore attempts to cluster the Tafseer translations using text clustering, a text mining method that groups related documents into the same cluster. The study applied two clustering algorithms, mini-batch k-means and k-means, to explain and define the links between keywords, known as features or concepts, for the 286 verses of the Al-Baqarah chapter. Data preprocessing and feature extraction with TF-IDF (Term Frequency-Inverse Document Frequency) and PCA (Principal Component Analysis) were applied to this dataset. Results show two- and three-dimensional cluster plots assigning the Tafseer to seven cluster categories (k=7). The running time of the mini-batch k-means algorithm (0.05485 s) outperforms that of the k-means algorithm (0.23334 s). Finally, the features 'god', 'people', and 'believe' were the most frequent.
- Published
- 2021
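The pipeline described in this abstract (TF-IDF features, PCA for low-dimensional plotting, then k-means versus mini-batch k-means) can be sketched with scikit-learn. This is a minimal illustration on placeholder sentences, not the actual Tafseer corpus; the paper used k=7 on 286 verses, while k=2 is used here to fit the toy data.

```python
# Sketch of the described pipeline: TF-IDF -> PCA (2-D) -> k-means vs.
# mini-batch k-means, timing both. Toy documents stand in for the corpus.
import time

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "god created the heavens and the earth",
    "the people of the book believe in god",
    "believe in god and do good deeds",
    "the earth and all the people on it",
    "guidance for the people who believe",
    "god is merciful to the people",
    "the heavens declare the glory of god",
    "do good and believe in the message",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X2 = PCA(n_components=2).fit_transform(X.toarray())  # PCA needs dense input

k = 2  # the paper used k=7; smaller here for the toy corpus
for algo in (KMeans(n_clusters=k, n_init=10, random_state=0),
             MiniBatchKMeans(n_clusters=k, n_init=10, random_state=0)):
    t0 = time.perf_counter()
    labels = algo.fit_predict(X2)
    print(type(algo).__name__, f"{time.perf_counter() - t0:.5f}s", labels)
```

Mini-batch k-means updates centroids from small random batches rather than the full dataset each iteration, which is where the reported speedup comes from; on a corpus this small the timing difference is not meaningful.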
3. Duplicate Detection and Text Classification on Simplified Technical English
- Abstract
This thesis investigates the most effective way of performing classification of text labels and clustering of duplicate texts in technical documentation written in Simplified Technical English. Pre-trained language models from transformers (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
- Published
- 2019
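The duplicate-detection setup this thesis found preferable (tf-idf vectors clustered by DBSCAN with a low distance threshold) can be sketched briefly. The texts, eps value, and min_samples below are illustrative choices, not the thesis's actual data or tuned parameters.

```python
# Duplicate detection sketch: tf-idf vectors + DBSCAN with cosine
# distance and a low eps. Texts sharing a non-noise label (>= 0) are
# duplicate candidates; label -1 marks unique texts (DBSCAN "noise").
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Remove the four bolts and lift off the cover.",
    "Remove the 4 bolts and lift off the cover.",   # near-duplicate
    "Connect the red cable to the positive terminal.",
    "Check the oil level before you start the engine.",
]

X = TfidfVectorizer().fit_transform(texts)
# eps is the "low distance threshold"; min_samples=2 so a pair suffices.
labels = DBSCAN(eps=0.5, metric="cosine", min_samples=2).fit_predict(X)
print(labels)  # first two texts share a cluster; the rest are noise
```

Unlike k-means, DBSCAN needs no preset cluster count and leaves genuinely unique texts unclustered, which fits duplicate detection where most documents have no duplicate at all.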
5. Multi-document summarization based on document clustering and neural sentence fusion
- Abstract
In this thesis, we present a technique for abstractive text summarization with state-of-the-art results, proposing a novel method to improve multi-document summarization. The lack of the large multi-document human-authored summaries needed to train seq2seq encoder-decoder models, and the inaccuracy of representing multiple long documents in a fixed-size vector, led us to design complementary models for two tasks: sentence clustering and neural sentence fusion. We minimize the risk of producing incorrect facts by encoding a related set of sentences as the encoder input. We applied these complementary models to build a full abstractive multi-document summarization system that simultaneously considers importance, coverage, and diversity under a desired length limit. Extensive experiments on all the proposed models show significant improvements over state-of-the-art methods across different evaluation metrics.
- Published
- 2018
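The first of the two tasks this thesis pairs, sentence clustering, can be illustrated in isolation: group related sentences drawn from several documents so that each group can later be fused into a single summary sentence. The thesis uses learned models for this step; TF-IDF plus k-means below is a simple stand-in on invented sentences.

```python
# Sentence-clustering sketch for multi-document summarization: related
# sentences from different source documents end up in the same cluster,
# and each cluster would then be one input to a sentence-fusion model.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The storm caused major flooding in the city.",         # doc A
    "Heavy rain caused flooding across the city.",          # doc B
    "Officials opened three emergency shelters.",           # doc A
    "Three shelters were opened for displaced residents.",  # doc C
]

X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

clusters = {}
for sent, lab in zip(sentences, labels):
    clusters.setdefault(lab, []).append(sent)
for lab, group in sorted(clusters.items()):
    print(lab, group)  # each group is one candidate input to fusion
```

Encoding a whole cluster of related sentences, rather than one long document, is what the abstract credits with reducing the risk of fabricated facts in the fused output.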
9. Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message
- Abstract
Textual duplicates can be hard to detect because they differ in wording but carry similar semantic meaning. At Etteplan, a technical documentation company, many writers accidentally re-write existing instructions explaining procedures. These "duplicates" clutter the database, which is undesirable because they represent duplicated work, and the condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst and to estimate how many duplicates there are. The corpus is small but written in a controlled natural language called Simplified Technical English. The method charts the problem using document embeddings from doc2vec, clustering with HDBSCAN*, and validation with the Density-Based Clustering Validation index (DBCV). A survey was sent out to determine a threshold value at which documents stop being duplicates, and this value was then used to calculate a theoretical duplicate count.
- Published
- 2018
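The final step the abstract describes, turning a similarity threshold into a theoretical duplicate count, can be sketched independently of the embedding model. The thesis uses doc2vec vectors and a survey-derived threshold; tf-idf vectors and the 0.8 cut-off below are illustrative stand-ins.

```python
# Theoretical duplicate count sketch: compute pairwise cosine
# similarities between document vectors and count the unordered pairs
# that exceed a "stops being a duplicate" threshold.
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Put on safety glasses before you use the grinder.",
    "Before you use the grinder, put on safety glasses.",  # reworded duplicate
    "Disconnect the power supply before maintenance.",
]

X = TfidfVectorizer().fit_transform(docs)  # doc2vec in the thesis
sim = cosine_similarity(X)

THRESHOLD = 0.8  # hypothetical survey-derived cut-off
iu = np.triu_indices(len(docs), k=1)       # each unordered pair once
duplicate_pairs = int((sim[iu] > THRESHOLD).sum())
print(duplicate_pairs)
```

A bag-of-words model scores the first two sentences as near-identical because reordering clauses leaves the term counts unchanged; an embedding model like doc2vec would additionally catch duplicates that use different words.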
10. Detection of Stance-Related Characteristics in Social Media Text
- Abstract
In this paper, we present a study on the identification of stance-related features in text data from social media. Building on our previous work on stance and our findings on stance patterns, we detected stance-related characteristics in a data set from Twitter and Facebook. We extracted various corpus-based, quantitative, and computational features that proved significant for six stance categories (contrariety, hypotheticality, necessity, prediction, source of knowledge, and uncertainty) and tested them on our data set. The results of a preliminary clustering method are presented and discussed as a starting point for future contributions in the field. Our experiments showed a strong correlation between different characteristics and stance constructions, which can lead to a methodology for automatic stance annotation of these data. (StaViCTA)
- Published
- 2018
17. Clustering Data Text Based on Semantic
- Abstract
Clustering is one of the most important data mining techniques; it categorizes a large number of unordered text documents into meaningful and coherent clusters. Most text clustering algorithms do not consider the semantic relationships between words and cannot recognize and use semantic concepts. In this paper, a new algorithm is presented to cluster texts based on the meanings of words. First, a new method is presented to find semantic relationships between words based on the WordNet ontology; then the text data are clustered using the proposed method together with a hierarchical clustering algorithm. Documents are preprocessed, converted to a vector space model, and then clustered semantically using the proposed algorithm. The experimental results show that the quality and accuracy of the proposed algorithm are more reliable than those of existing hierarchical clustering algorithms.
- Published
- 2017
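The mechanism this paper describes, injecting word-level semantic relationships before hierarchical clustering, can be illustrated without WordNet itself (whose lookups require an external corpus). The hand-written synonym map below is a toy stand-in for the ontology: it folds synonyms onto a shared concept so that lexically different but semantically similar documents become close before clustering.

```python
# Semantic hierarchical clustering sketch: normalize words to concepts
# (a toy substitute for WordNet-based similarity), vectorize, then apply
# agglomerative clustering with average linkage on cosine distance.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the WordNet ontology.
CONCEPTS = {"car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
            "physician": "doctor", "surgeon": "doctor"}

def conceptualize(text):
    return " ".join(CONCEPTS.get(w, w) for w in text.lower().split())

docs = [
    "the car broke down",
    "the automobile would not start",
    "the physician examined the patient",
    "the surgeon treated the patient",
]

X = TfidfVectorizer().fit_transform(conceptualize(d) for d in docs).toarray()
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # vehicle documents vs. doctor documents
```

Without the concept mapping, "car" and "automobile" share no term, so a purely lexical clustering would have no evidence that the first two documents belong together; this is the gap the paper's WordNet-based similarity is meant to close.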
20. Investigating the Correlation Between Marketing Emails and Receivers Using Unsupervised Machine Learning on Limited Data : A comprehensive study using state of the art methods for text clustering and natural language processing
- Abstract
The goal of this project is to investigate any correlation between marketing emails and their receivers using machine learning and only a limited amount of initial data. The data consist of roughly 1,200 emails and 98,000 receivers of these. Initially, the emails are grouped by content using text clustering. They contain no prior labeling or categorization, which calls for an unsupervised learning approach using solely the raw text content as data. The project investigates state-of-the-art concepts such as bag-of-words for calculating term importance and the gap statistic for determining an optimal number of clusters. The data are vectorized using term frequency-inverse document frequency to determine the importance of terms relative to each document and to all documents combined. An inherent problem of this approach is high dimensionality, which is reduced using latent semantic analysis in conjunction with singular value decomposition. Once the resulting clusters have been obtained, the most frequently occurring terms in each cluster are analyzed and compared. Because initial labeling is absent, an alternative approach is required to evaluate the clusters' validity: the receivers in each cluster who actively opened an email are collected and investigated. Each receiver has different attributes regarding their purpose in using the service and some personal information. Once gathered and analyzed, the conclusion could be drawn that it is possible to find distinguishable connections between the resulting email clusters and their receivers, but only to a limited extent. Receivers from the same cluster showed attributes similar to each other and distinguishable from the receivers of other clusters. Hence, the resulting email clusters and their receivers are specific enough to be distinguished from each other but too general to handle more detailed information.
- Published
- 2016
21. Improving Clustering Methods By Exploiting Richness Of Text Data
- Abstract
Clustering is an unsupervised machine learning technique that involves discovering clusters (groups) of similar objects in unlabeled data; it is generally considered an NP-hard problem. Clustering methods are widely used across a variety of disciplines for analyzing different types of data, and a small improvement in a clustering method can ripple out to advance research in multiple fields. Clustering any type of data is challenging, and many research questions remain open. The clustering problem is exacerbated for text data by additional challenges such as capturing the semantics of a document, handling the rich features of text data, and dealing with the well-known curse of dimensionality. In this thesis, we investigate the limitations of existing text clustering methods and address them with five new methods: Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC), and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five methods showed that using rich features in text clustering can outperform the existing state-of-the-art methods. The first method, QSC, exploits user queries (one of the rich features of text data) to generate better-quality clusters and cluster labels. The second, DWKM, uses a probability-based weighting scheme to formulate a semantically weighted distance measure that improves clustering results. The third, MMOEA, is based on a multi-objective evolutionary algorithm; it exploits rich features to generate a diverse set of candidate clustering solutions and forms a better clustering solution using a cluster-oriented approach. The fourth and fifth methods, MDC and MOMVEC, address the limitations of MMOEA.
- Published
- 2016
22. Investigating the Correlation Between Marketing Emails and Receivers Using Unsupervised Machine Learning on Limited Data : A comprehensive study using state of the art methods for text clustering and natural language processing
- Abstract
The goal of this project is to investigate any correlation between marketing emails and their receivers using machine learning and only a limited amount of initial data. The data consists of roughly 1200 emails and 98.000 receivers of these. Initially, the emails are grouped together based on their content using text clustering. They contain no information regarding prior labeling or categorization which creates a need for an unsupervised learning approach using solely the raw text based content as data. The project investigates state-of-the-art concepts like bag-of-words for calculating term importance and the gap statistic for determining an optimal number of clusters. The data is vectorized using term frequency - inverse document frequency to determine the importance of terms relative to the document and to all documents combined. An inherit problem of this approach is high dimensionality which is reduced using latent semantic analysis in conjunction with singular value decomposition. Once the resulting clusters have been obtained, the most frequently occurring terms for each cluster are analyzed and compared. Due to the absence of initial labeling an alternative approach is required to evaluate the clusters validity. To do this, the receivers of all emails in each cluster who actively opened an email is collected and investigated. Each receiver have different attributes regarding their purpose of using the service and some personal information. Once gathered and analyzed, conclusions could be drawn that it is possible to find distinguishable connections between the resulting email clusters and their receivers but to a limited extent. The receivers from the same cluster did show similar attributes as each other which were distinguishable from the receivers of other clusters. Hence, the resulting email clusters and their receivers are specific enough to distinguish themselves from each other but too general to handle more detailed information. 
With more data, this, Målet med detta projekt att undersöka eventuella samband mellan marknadsföringsemail och dess mottagare med hjälp av oövervakad maskininlärning på en brgränsad mängd data. Datan består av ca 1200 email meddelanden med 98.000 mottagare. Initialt så gruperas alla meddelanden baserat på innehåll via text klustering. Meddelandena innehåller ingen information angående tidigare gruppering eller kategorisering vilket skapar ett behov för ett oövervakat tillvägagångssätt för inlärning där enbart det råa textbaserade meddelandet används som indata. Projektet undersöker moderna tekniker så som bag-of-words för att avgöra termers relevans och the gap statistic för att finna ett optimalt antal kluster. Datan vektoriseras med hjälp av term frequency - inverse document frequency för att avgöra relevansen av termer relativt dokumentet samt alla dokument kombinerat. Ett fundamentalt problem som uppstår via detta tillvägagångssätt är hög dimensionalitet, vilket reduceras med latent semantic analysis tillsammans med singular value decomposition. Då alla kluster har erhållits så analyseras de mest förekommande termerna i vardera kluster och jämförs. Eftersom en initial kategorisering av meddelandena saknas så krävs ett alternativt tillvägagångssätt för evaluering av klustrens validitet. För att göra detta så hämtas och analyseras alla mottagare för vardera kluster som öppnat något av dess meddelanden. Mottagarna har olika attribut angående deras syfte med att använda produkten samt personlig information. När de har hämtats och undersökts kan slutsatser dras kring hurvida samband kan hittas. Det finns ett klart samband mellan vardera kluster och dess mottagare, men till viss utsträckning. Mottagarna från samma kluster visade likartade attribut som var urskiljbara gentemot mottagare från andra kluster. 
Därav kan det sägas att de resulterande klustren samt dess mottagare är specifika nog att urskilja sig från varandra men för generella för att kunna handera mer detaljerad information. Me
- Published
- 2016
23. Investigating the Correlation Between Marketing Emails and Receivers Using Unsupervised Machine Learning on Limited Data : A comprehensive study using state of the art methods for text clustering and natural language processing
- Abstract
The goal of this project is to investigate any correlation between marketing emails and their receivers using machine learning and only a limited amount of initial data. The data consists of roughly 1200 emails and 98.000 receivers of these. Initially, the emails are grouped together based on their content using text clustering. They contain no information regarding prior labeling or categorization which creates a need for an unsupervised learning approach using solely the raw text based content as data. The project investigates state-of-the-art concepts like bag-of-words for calculating term importance and the gap statistic for determining an optimal number of clusters. The data is vectorized using term frequency - inverse document frequency to determine the importance of terms relative to the document and to all documents combined. An inherit problem of this approach is high dimensionality which is reduced using latent semantic analysis in conjunction with singular value decomposition. Once the resulting clusters have been obtained, the most frequently occurring terms for each cluster are analyzed and compared. Due to the absence of initial labeling an alternative approach is required to evaluate the clusters validity. To do this, the receivers of all emails in each cluster who actively opened an email is collected and investigated. Each receiver have different attributes regarding their purpose of using the service and some personal information. Once gathered and analyzed, conclusions could be drawn that it is possible to find distinguishable connections between the resulting email clusters and their receivers but to a limited extent. The receivers from the same cluster did show similar attributes as each other which were distinguishable from the receivers of other clusters. Hence, the resulting email clusters and their receivers are specific enough to distinguish themselves from each other but too general to handle more detailed information. 
- Published
- 2016
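The pipeline this abstract describes starts by vectorizing raw email text with TF-IDF before dimensionality reduction and clustering. A minimal sketch of the TF-IDF step in plain Python follows; the toy email snippets are invented for illustration and stand in for the study's actual data:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors for a list of tokenized documents.

    TF is the raw term count in a document; IDF is log(N / df), where
    df is the number of documents containing the term. Terms occurring
    in every document get an IDF (and thus a weight) of zero.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

# Hypothetical marketing-email fragments, tokenized by whitespace.
docs = [
    "special offer on new plans".split(),
    "limited offer new discount".split(),
    "your invoice is attached".split(),
]
vocab, vecs = tfidf_vectors(docs)
```

In the study, vectors like these would then be reduced with latent semantic analysis (SVD) before clustering; that step is omitted here for brevity.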
24. Improving Clustering Methods By Exploiting Richness Of Text Data
- Abstract
Clustering is an unsupervised machine learning technique, which involves discovering clusters (groups) of similar objects in unlabeled data and is generally considered an NP-hard problem. Clustering methods are widely used in a variety of disciplines for analyzing different types of data, and a small improvement in a clustering method can cause a ripple effect in advancing research across multiple fields. Clustering any type of data is challenging and there are many open research questions. The clustering problem is exacerbated in the case of text data because of additional challenges such as capturing the semantics of a document, handling the rich features of text data, and dealing with the well-known curse of dimensionality. In this thesis, we investigate the limitations of existing text clustering methods and address these limitations by providing five new text clustering methods--Query Sense Clustering (QSC), Dirichlet Weighted K-means (DWKM), Multi-View Multi-Objective Evolutionary Algorithm (MMOEA), Multi-objective Document Clustering (MDC) and Multi-Objective Multi-View Ensemble Clustering (MOMVEC). These five new clustering methods showed that the use of rich features in text clustering methods can outperform the existing state-of-the-art text clustering methods. The first method, QSC, exploits user queries (one of the rich features in text data) to generate better quality clusters and cluster labels. The second, DWKM, uses a probability-based weighting scheme to formulate a semantically weighted distance measure that improves the clustering results. The third, MMOEA, is based on a multi-objective evolutionary algorithm; it exploits rich features to generate a diverse set of candidate clustering solutions and forms a better clustering solution using a cluster-oriented approach. The fourth and fifth methods, MDC and MOMVEC, address the limitations of MMOEA.
- Published
- 2016
25. Review of intelligent microblog short text processing
- Published
- 2016
27. Automated Text Clustering of Newspaper and Scientific Texts in Brazilian Portuguese: Analysis and Comparison of Methods
- Abstract
This article reports the findings of an empirical study on Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese; the objective was to find the most effective computational method for clustering the input texts into their original groups. The study covered four experiments, each with four procedures: 1. Corpus Selection (a set of texts is selected for clustering); 2. Word Class Selection (nouns, verbs and adjectives are chosen from each text using specific algorithms); 3. Filtering Algorithms (a set of terms is selected from the results of the previous stage, a semantic weight is inserted for each term, and an index is generated for each text); 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After those procedures, statistical results on clustering correctness and clustering time were collected. The sIB clustering algorithm is the best choice for both the scientific and the newspaper corpus, with the caveat that sIB requires the number of clusters as input before running (for the newspaper corpus, 68.9% correctness in 1 minute; for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm additionally guesses the number of clusters without user intervention, but its best case is less than 53% correctness. Considering the experiments carried out, the results of human text classification and automated clustering remain distant; it was also observed that the clustering correctness results vary according to the number of input texts and their topics.
- Published
- 2014
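The abstract above reports "clustering correctness" against the texts' original groups without naming the measure. Cluster purity is one common way to compute such a score; the sketch below uses made-up assignments, and purity is an assumption here, not necessarily the metric the article used:

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of items whose cluster's majority true label matches their own.

    clusters and labels are parallel lists: clusters[i] is the cluster id
    assigned to item i, labels[i] is its true group.
    """
    by_cluster = {}
    for c, l in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(l)
    # For each cluster, count how many items carry its most common label.
    correct = sum(Counter(ls).most_common(1)[0][1] for ls in by_cluster.values())
    return correct / len(labels)

# Example: 6 texts, 2 true groups, one item placed in the wrong cluster.
score = purity([0, 0, 0, 1, 1, 1],
               ["news", "news", "sci", "sci", "sci", "sci"])
# 5 of 6 items agree with their cluster's majority label: 5/6 ≈ 0.833
```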
28. Constrained text coclustering with supervised and unsupervised constraints
- Abstract
In this paper, we propose a novel constrained coclustering method to achieve two goals. First, we combine information-theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve our first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints. We then use an alternating expectation maximization (EM) algorithm to optimize the model. We also propose two novel methods to automatically construct and incorporate document and word constraints to support unsupervised constrained clustering: 1) automatically construct document constraints based on overlapping named entities (NE) extracted by an NE extractor; 2) automatically construct word constraints based on their semantic distance inferred from WordNet. The results of our evaluation over two benchmark data sets demonstrate the superiority of our approaches against a number of existing approaches. ©2013 IEEE.
- Published
- 2013
30. Automated text-based analysis for decision-making research
- Abstract
We present results from a study on constructing and evaluating a support tool for the extraction of patterns in distributed decision-making processes, based on design criteria elicited from a study on the work process involved in studying such decision-making. Specifically, we devised and evaluated an analysis tool for C2 researchers who study simulated decision-making scenarios for command teams. The analysis tool used text clustering as an underlying pattern extraction technique and was evaluated together with C2 researchers in a workshop to establish whether the design criteria were valid and the approach taken with the analysis tool was sound. Design criteria elicited from an earlier study with researchers (open-endedness and transparency) were highly consistent with the results from the workshop. Specifically, evaluation results indicate that successful deployment of advanced analysis tools requires that tools can treat multiple data sources and offer rich opportunities for manipulation and interaction (open-endedness) and careful design of visual presentations and explanations of the techniques used (transparency). Finally, the results point to the high relevance and promise of using text clustering as a support for analysis of C2 data. The original publication is available at www.springerlink.com: Ola Leifler and Henrik Eriksson, Text-based Analysis for Command and Control Researchers: The Workflow Visualizer Approach, 2011, Cognition, Technology & Work. http://dx.doi.org/10.1007/s10111-010-0170-3 Copyright: Springer Science Business Media http://www.springerlink.com
- Published
- 2012
- Full Text
- View/download PDF
31. Message classification as a basis for studying command and control communication : an evaluation of machine learning approaches
- Abstract
In military command and control, success relies on being able to perform key functions such as communicating intent. Most staff functions are carried out using standard means of text communication. Exactly how members of staff perform their duties, who they communicate with and how, and how they could perform better, is an area of active research. In command and control research, there is not yet a single model which explains all actions undertaken by members of staff well enough to prescribe a set of procedures for how to perform functions in command and control. In this context, we have studied whether automated classification approaches can be applied to textual communication to assist researchers who study command teams and analyze their actions. Specifically, we report the results from evaluating machine learning with respect to two metrics of classification performance: (1) the precision of finding a known transition between two activities in a work process, and (2) the precision of classifying messages similarly to human researchers who search for critical episodes in a workflow. The results indicate that classification based on text only provides higher precision results with respect to both metrics when compared to other machine learning approaches, and that the precision of classifying messages using text-based classification in already classified datasets was approximately 50%. We present the implications that these results have for the design of support systems based on machine learning, and outline how to practically use text classification for analyzing team communications by demonstrating a specific prototype support tool for workflow analysis. Funding agencies: Swedish National Defense College
- Published
- 2012
- Full Text
- View/download PDF
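The thesis above evaluates several machine learning approaches for message classification without giving code. A multinomial naive Bayes classifier is one standard baseline for this kind of task; the sketch below is illustrative only (not necessarily an approach the thesis used), and the tiny training messages and labels are invented:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Train a multinomial naive Bayes text classifier with add-one smoothing.

    docs: list of token lists; labels: parallel list of class labels.
    Returns a classify(tokens) function that picks the highest-probability class.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, lab in zip(docs, labels):
        word_counts[lab].update(doc)
        vocab.update(doc)
    v = len(vocab)
    total = sum(class_counts.values())

    def classify(tokens):
        best, best_lp = None, float("-inf")
        for lab in class_counts:
            # Log prior plus smoothed log likelihood of each token.
            lp = math.log(class_counts[lab] / total)
            denom = sum(word_counts[lab].values()) + v
            for t in tokens:
                lp += math.log((word_counts[lab][t] + 1) / denom)
            if lp > best_lp:
                best, best_lp = lab, lp
        return best

    return classify

# Hypothetical staff messages with two activity labels.
train = [
    ("enemy spotted at grid four".split(), "report"),
    ("request resupply of fuel".split(), "request"),
    ("enemy armor moving north".split(), "report"),
    ("request medevac at checkpoint".split(), "request"),
]
classify = train_nb([d for d, _ in train], [l for _, l in train])
print(classify("enemy infantry spotted".split()))  # prints "report"
```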
38. Text Clustering Using LucidWorks and Apache Mahout
- Abstract
This module introduces algorithms and evaluation metrics for flat clustering. We focus on the use of the LucidWorks big data analysis software and Apache Mahout, an open-source machine learning library, for clustering document collections with the k-means algorithm.
- Published
- 2012
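The k-means algorithm that the module applies through Mahout can be sketched independently of the toolkit. Below is a minimal Lloyd's-algorithm implementation in plain Python, not Mahout's actual code; the 2-D points are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm for flat clustering.

    points: list of equal-length numeric tuples.
    Returns a list assigning each point to a centroid index in [0, k).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assign

# Two well-separated pairs of points; k-means recovers the two groups.
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
labels = kmeans(pts, 2)
```

In document clustering, the tuples would be TF-IDF vectors rather than 2-D points, and production toolkits like Mahout add smarter initialization and convergence checks.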
41. Improving the relevance of web search results by combining web snippet categorization, clustering and personalization
- Abstract
Web search results are far from perfect due to the polysemous and synonymous characteristics of natural languages, information overload resulting from the information explosion on the Web, and the flat-list, "one size fits all" strategy of search engines, which presents search results without attending to users' personal information needs. Re-organizing Web search results, or Web snippets, by means of text categorization and clustering are two dominant approaches to these issues. Text categorization uses a collection of labeled documents to train a classifier which can then predict labels for new unlabeled documents, while text clustering groups unlabeled documents by finding common properties shared among the documents in the same group. The issue with categorization is that human-labeled training documents are very expensive to obtain and thus surprisingly scarce at the moment; for text clustering, how to label the generated groups is still an open research question. In addition, a Web snippet returned from a search engine contains only the title of a webpage and an optional very short (less than 30 words) description of the page. This less-informative aspect of Web snippets is another challenge for both text categorization and clustering. The primary objective of this research is to improve the relevance of Web search results and thus provide the user with a better search experience. To achieve this objective, the research combines Web snippet categorization, clustering and personalization techniques to recommend relevant results to search users. Using design research methodology, the study develops an IT artifact named RIB (Recommender Intelligent Browser). RIB categorizes Web snippets using a socially constructed Web directory such as the Open Directory Project (ODP), for which the semantic characteristics of the categories in ODP are extracted to generate a series of labeled document sets. At the same time, the Web snippets are clustered to boost th
- Published
- 2010
42. CLUTO Toolkit
- Abstract
The module briefly introduces the basic concepts of clustering. The primary focus of the module is to describe the usage of CLUTO, a clustering toolkit comprising various algorithms.
- Published
- 2010
43. Improving scalability and accuracy of text mining in grid environment
- Abstract
The advance in technologies such as massive storage devices and high-speed internet has led to an enormous increase in the volume of documents available in electronic form. These documents represent information in a complex and rich manner that cannot be analysed using conventional statistical data mining methods. Consequently, text mining has developed as a growing new technology for discovering knowledge from textual data and managing textual information. Processing and analysing textual information can yield valuable and important information, yet these tasks also require an enormous amount of computational resources due to the sheer size of the available data. Therefore, it is important to enhance the existing methodologies to achieve better scalability, efficiency and accuracy. The emerging Grid technology shows promising results in solving the problem of scalability by splitting the work of text clustering algorithms into a number of jobs, each to be executed separately and simultaneously on different computing resources. This allows for a substantial decrease in processing time while maintaining a similar level of quality. To improve the quality of the text clustering results, a new document encoding method is introduced that takes into consideration the semantic similarities of words. In this way, documents that are similar in content are more likely to be grouped together. One of the ultimate goals of text mining is to help us gain insights into a problem and to assist in the decision making process together with other sources of information. Hence we tested the effectiveness of incorporating text mining methods in the context of stock market prediction. This is achieved by integrating the outcomes obtained from text mining with those from data mining, which results in a more accurate forecast than using any single method.
- Published
- 2009
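The grid approach in the abstract above splits work over a corpus into independent jobs whose partial results are merged. A minimal map/reduce-style sketch of that idea, computing global term counts from per-job counts (function names and the toy corpus are illustrative, not from the paper):

```python
from collections import Counter

def map_job(chunk):
    """One independent job: count terms in its own slice of the corpus."""
    counts = Counter()
    for doc in chunk:
        counts.update(doc.lower().split())
    return counts

def merge(partials):
    """Reduce step: combine per-job counts into a global term table."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

corpus = ["grid computing speeds text mining",
          "text clustering on the grid",
          "stock market prediction with text mining"]
jobs = [corpus[0:2], corpus[2:3]]          # two independent jobs
result = merge(map_job(c) for c in jobs)
print(result["text"])   # 3
print(result["grid"])   # 2
```

Because each `map_job` touches only its own chunk, the jobs can run on separate grid nodes and only the small `Counter` objects need to travel back for merging.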
44. Enhancing text clustering by leveraging wikipedia semantics
- Abstract
Most traditional text clustering methods are based on the "bag of words" (BOW) representation, which relies on frequency statistics over a set of documents. BOW, however, ignores important information about the semantic relationships between key terms. To overcome this problem, several methods have been proposed to enrich the text representation with external resources such as WordNet. Many of these approaches suffer from limitations: 1) WordNet has limited coverage and lacks effective word-sense disambiguation; 2) most enrichment strategies, which append or replace document terms with their hypernyms and synonyms, are overly simple. In this paper, to overcome these deficiencies, we first propose a way to build a concept thesaurus based on the semantic relations (synonymy, hypernymy, and associative relations) extracted from Wikipedia. We then develop a unified framework that leverages these semantic relations to enhance the traditional content similarity measure for text clustering. Experimental results on the Reuters and OHSUMED datasets show that, with the help of the Wikipedia thesaurus, the clustering performance of our method improves over previous methods. In addition, when the weights for hypernym, synonym, and associative concepts are tuned with the help of a small amount of user-provided labeled data, clustering performance improves further. Copyright 2008 ACM.
- Published
- 2008
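The enrichment idea above can be illustrated in a few lines: terms are expanded with the concepts a thesaurus maps them to, so documents sharing no surface terms can still overlap. The hand-made `THESAURUS` below stands in for the Wikipedia-derived one; it is purely illustrative:

```python
import math
from collections import Counter

# Toy stand-in for a Wikipedia-derived concept thesaurus (synonym relation).
THESAURUS = {"car": ["vehicle"], "automobile": ["vehicle"]}

def enriched_vector(text):
    """Bag of words plus the concepts each term maps to."""
    terms = text.lower().split()
    vec = Counter(terms)
    for t in terms:
        vec.update(THESAURUS.get(t, []))   # append mapped concepts
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1, d2 = "car repair", "automobile repair"
plain = cosine(Counter(d1.split()), Counter(d2.split()))
rich = cosine(enriched_vector(d1), enriched_vector(d2))
print(plain < rich)   # enrichment raises the similarity -> True
```

Here "car" and "automobile" share no surface form, so plain BOW cosine sees only the overlap on "repair"; the shared concept "vehicle" raises the enriched similarity.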
46. Managing email overload with an automatic nonparametric clustering approach
- Abstract
Email overload is a recent problem: people find it increasingly difficult to process the large number of emails they receive daily. The problem has become more and more serious and already affects the normal use of email as a knowledge management tool. It has been recognized that categorizing emails into meaningful groups can greatly reduce the cognitive load of processing them, making categorization an effective way to manage email overload. However, most current approaches still require significant human input when categorizing emails. In this paper we develop an automatic email clustering system, underpinned by a new nonparametric text clustering algorithm. The system does not require any predefined input parameters and can automatically generate meaningful email clusters. Experiments show that our new algorithm outperforms existing text clustering algorithms in both computational time and clustering quality, measured by several gauges.
- Published
- 2007
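The abstract above does not describe its algorithm in detail, so the following is only a generic sketch of parameter-free grouping: emails are linked whenever their Jaccard similarity exceeds the mean pairwise similarity of the collection, so no cluster count or threshold is supplied by the user. The sample emails and the threshold rule are illustrative assumptions, not the paper's method:

```python
from itertools import combinations

def jaccard(a, b):
    """Token-set overlap between two emails."""
    return len(a & b) / len(a | b)

emails = ["meeting agenda monday", "monday meeting agenda attached",
          "lunch friday", "friday lunch plan"]
toks = [set(e.lower().split()) for e in emails]
sims = {(i, j): jaccard(toks[i], toks[j])
        for i, j in combinations(range(len(toks)), 2)}
threshold = sum(sims.values()) / len(sims)   # data-derived, not user-set

# Union-find style grouping over the similarity graph.
parent = list(range(len(toks)))
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i
for (i, j), s in sims.items():
    if s > threshold:
        parent[find(i)] = find(j)
clusters = {find(i) for i in range(len(toks))}
print(len(clusters))   # 2 groups: the meeting thread and the lunch thread
```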
47. A Proposal of a Hierarchical Document Clustering Algorithm Based on the Tolerance Rough Set Model
- Abstract
Supervisor: Ho Tu Bao, School of Knowledge Science, Master's thesis
- Published
- 2000
48. Similarity Search in Document Collections
- Abstract
The main objective of this work is to evaluate the efficiency of available software for similarity search in document collections, focusing on two tools in particular: the freely distributed Semantic Vectors package and the MoreLikeThis class from Apache Lucene. The thesis compares these two approaches and introduces methods that can improve the quality of the search results.
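The MoreLikeThis idea, roughly, is to score every document by the weight of the terms it shares with a query document. A stdlib sketch of that retrieval pattern using idf weights (this is not Lucene's actual scoring formula, and the corpus is invented):

```python
import math
from collections import Counter

docs = ["semantic vectors for search",
        "lucene similarity search in documents",
        "cooking pasta at home"]
tokenized = [d.split() for d in docs]
N = len(docs)
# Document frequency and inverse document frequency per term.
df = Counter(t for toks in tokenized for t in set(toks))
idf = {t: math.log(N / df[t]) for t in df}

def more_like_this(query_idx):
    """Return the index of the document most similar to the query doc."""
    q = set(tokenized[query_idx])
    scores = []
    for i, toks in enumerate(tokenized):
        if i == query_idx:
            continue
        scores.append((sum(idf[t] for t in q & set(toks)), i))
    return max(scores)[1]

print(more_like_this(0))   # doc 1, which shares the rare term "search"
```

Rare shared terms dominate the score, which is why the query matches the Lucene document rather than the unrelated cooking one.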