Duplicate Detection and Text Classification on Simplified Technical English

Authors :
Lund, Max
Publication Year :
2019
Publisher :
Linköpings universitet, Institutionen för datavetenskap, 2019.

Abstract

This thesis investigates the most effective ways of classifying texts by label and of clustering duplicate texts in technical documentation written in Simplified Technical English. For the classification task, pre-trained transformer language models (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs. For duplicate detection, vector representations from pre-trained transformer and LSTM models were tested against tf-idf vectors, using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that tf-idf vectors with a low distance threshold in DBSCAN are preferable for duplicate detection.
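
As a rough illustration of the duplicate-detection approach summarised in the abstract (tf-idf vectors clustered by DBSCAN with a low distance threshold), the following minimal sketch may help. It is not the thesis code: the whitespace tokenisation, the smoothed idf formula, the `eps=0.1` threshold, `min_samples=2`, and the example sentences are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one L2-normalised tf-idf dict per text (smoothed idf, assumed)."""
    docs = [Counter(t.lower().split()) for t in texts]  # naive whitespace tokens
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency per term
    vectors = []
    for d in docs:
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    # Vectors are unit length, so the dot product is the cosine similarity.
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples=2):
    """Minimal DBSCAN over cosine distance; label -1 marks noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


# Hypothetical sentences in the style of Simplified Technical English.
texts = [
    "Remove the access panel.",
    "Remove the access panel.",
    "Do not touch the hot surface.",
    "Make sure that the valve is closed.",
]
labels = dbscan(tfidf_vectors(texts), eps=0.1)
print(labels)  # [0, 0, -1, -1]
```

With a low threshold, only near-identical texts fall within each other's neighbourhood, so the two repeated instructions form one cluster and the remaining sentences are left as noise. In practice a library implementation such as scikit-learn's `DBSCAN` would be the more usual choice; the pure-Python version here only keeps the sketch self-contained.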

Details

Language :
English
Database :
OpenAIRE
Accession number :
edsair.dedup.wf.001..bfdc6ed66c4814a7b53a5243a87b7685