Duplicate Detection and Text Classification on Simplified Technical English (Dublettdetektion och textklassificering på Förenklad Teknisk Engelska)
- Publication Year :
- 2019
- Publisher :
- Linköpings universitet, Institutionen för datavetenskap, 2019.
Abstract
- This thesis investigates the most effective way to classify text labels and cluster duplicate texts in technical documentation written in Simplified Technical English. For the classification task, pre-trained transformer language models (BERT) were tested against traditional methods such as tf-idf vectors with cosine similarity (kNN) and SVMs. For duplicate detection, vector representations from pre-trained transformer and LSTM models were tested against tf-idf vectors, using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that tf-idf vectors combined with a low distance threshold in DBSCAN are preferable for duplicate detection.
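The duplicate-detection setup named in the abstract (tf-idf vectors clustered by DBSCAN with a low distance threshold) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tokenizer, the smoothed idf formula, the sample sentences, and the values `eps=0.1` and `min_samples=2` are all assumptions chosen for illustration, and the code uses only the Python standard library so that it stays self-contained. In practice a library implementation such as scikit-learn's `DBSCAN` with a cosine metric would be the more usual choice.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Return one L2-normalised tf-idf vector (as a dict) per text."""
    # Assumed tokenizer: lowercase alphanumeric runs, punctuation dropped.
    docs = [Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    vectors = []
    for doc in docs:
        # Smoothed idf (an assumption), so common terms keep a non-zero weight.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples=2):
    """Plain DBSCAN; returns one cluster label per vector, -1 meaning noise."""
    n = len(vectors)
    # Each point counts as its own neighbour (distance 0).
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Hypothetical sample sentences, not taken from the thesis data.
texts = [
    "Remove the four bolts from the access panel.",
    "REMOVE the four bolts from the access panel",
    "Make sure that hydraulic pressure is released.",
    "Do not use damaged tools.",
]
labels = dbscan(tfidf_vectors(texts), eps=0.1)
print(labels)  # [0, 0, -1, -1]
```

With a low `eps`, only texts whose tf-idf vectors are almost identical fall into the same cluster, and everything else is labelled as noise (`-1`). That is the behaviour one would want when the goal is to flag duplicates rather than to group loosely related texts.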
Details
- Language :
- English
- Database :
- OpenAIRE
- Accession number :
- edsair.dedup.wf.001..bfdc6ed66c4814a7b53a5243a87b7685