Detecting “protein words” through unsupervised word segmentation [version 1; referees: 1 approved with reservations, 1 not approved]

Authors :
Wang Liang
Zhao Kaiyong
Author Affiliations :
1. Sogou Tech, Beijing, China
2. Department of Computer Science, Hong Kong Baptist University, Hong Kong, Hong Kong
Source :
F1000Research. 4:1517
Publication Year :
2015
Publisher :
London, UK: F1000 Research Limited, 2015.

Abstract

We applied unsupervised word segmentation methods to protein sequences. Given an input such as “MTMDKSELVQKA…,” these methods produce a segmented sequence of protein words, such as “MTM DKSE LVQKA.” We compared the protein words obtained by unsupervised segmentation with the segments defined by protein secondary structure. An interesting finding is that unsupervised word segmentation expresses information more efficiently than secondary structure segmentation. Our experiment also suggests the presence of several “protein ruins” in regions currently regarded as non-coding.
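
The abstract does not say which segmentation algorithm was used, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a unigram word model decoded with dynamic programming (Viterbi). The function name `segment`, the parameters `max_len` and `unk_logp`, and the scores in `toy_logp` are all hypothetical; in a genuinely unsupervised setting the word scores would be learned from unlabeled sequences rather than written by hand.

```python
import math

def segment(seq, word_logp, max_len=5, unk_logp=-10.0):
    """Split seq into the word sequence with the highest total log-probability."""
    n = len(seq)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for seq[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word in seq[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = seq[start:end]
            # Unknown single residues get a penalty score so that every
            # sequence can be segmented; unknown longer pieces are disallowed.
            fallback = unk_logp if end - start == 1 else -math.inf
            score = best[start] + word_logp.get(piece, fallback)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the sequence to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(seq[back[end]:end])
        end = back[end]
    return words[::-1]

# Hypothetical log-probabilities, chosen by hand purely for illustration.
toy_logp = {"MTM": -2.0, "DKSE": -2.5, "LVQKA": -3.0, "MT": -2.0, "KSE": -2.5}

print(" ".join(segment("MTMDKSELVQKA", toy_logp)))
```

With these hand-picked scores the sketch reproduces the segmentation quoted in the abstract for the twelve residues shown there; a different vocabulary or different scores would give a different split.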

Details

ISSN :
2046-1402
Volume :
4
Database :
F1000Research
Journal :
F1000Research
Notes :
[version 1; referees: 1 approved with reservations, 1 not approved]
Publication Type :
Academic Journal
Accession number :
edsfor.10.12688.f1000research.7428.1
Document Type :
method-article
Full Text :
https://doi.org/10.12688/f1000research.7428.1