1. The Statistical Analysis in the Problem of the Author Identification of a Natural Language Text
- Author
- E Tikhomirova
- Subjects
lcsh:Computer engineering. Computer hardware, Language identification, Computer science, business.industry, Speech recognition, the definition of the author of the text, lcsh:TK7885-7895, General Medicine, computer.software_genre, statistics, Identification (biology), Statistical analysis, Artificial intelligence, business, lcsh:Mechanics of engineering. Applied mechanics, lcsh:TA349-359, computer, Natural language processing, Natural language, natural language
- Abstract
The paper analyses a known method, proposed by O. Khrulev, for finding the author of a text in a natural-language knowledge base. The method takes the minimum distance between the frequency dictionaries of the presumed authors and the text under analysis as the criterion for successful identification. The patterns and drawbacks of the method are revealed. The paper suggests that, since the distance value is based on the average usage frequency of each lexeme across all of the author's works from which the frequency dictionaries are built, leaps in the distance value will show up when a specific lexeme-usage frequency differs sharply from the average one. To test this hypothesis, the paper determines the coefficient of variation of each lexeme-usage frequency in the texts under analysis. The analysis of the frequency dictionaries of Russian canonical writers conducted in the paper has shown that, on average, about 90% of the authors' frequency dictionaries contain lexemes whose frequencies of usage are inhomogeneous. The author of the paper suggests that the coefficient of variation reflects the growth of an author's word-hoard: the larger the vocabulary, the richer the speech and, therefore, the less frequently the author uses the same lexemes. The paper hypothesizes that it is wrong to reduce the analysed size of the authors' frequency dictionaries by critical boundaries alone: it is also necessary to analyse lexemes with a variation coefficient over 33%, which indicate a rich word-hoard. The paper also proposes defining a single specific critical boundary of 10 thousand lexemes, since the indefinite boundary of 5 to 10 thousand lexemes offered by O. Khrulev makes it difficult to identify the author of an unknown text.
In this case, the lexemes with a variation coefficient over 33% of the total vocabulary size of the studied authors beyond the critical boundary are subjected to analysis. To test this hypothesis, a numerical experiment was carried out. Its main point was to identify the authors of unknown texts from the authors' frequency dictionaries. The unknown texts were not included in the data from which the frequency dictionaries were compiled. The identification was based on calculating the distance from the unknown text to the authors' frequency dictionaries, i.e. according to O. Khrulev's technique. Different critical boundaries were specified in the calculations. The numerical experiment has shown that the method proposed in the paper increases the percentage of successful identifications by 12.5% for larger texts (more than 5,000 word forms) and by 15.2% for small texts (fewer than 5,000 word forms).
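The two quantities the abstract relies on, the coefficient of variation of a lexeme's usage frequency and the minimum distance between frequency dictionaries, can be illustrated with a minimal Python sketch. The abstract does not give the distance formula or the exact way the 33% filter is combined with the critical boundary, so the sum of absolute frequency differences used here is an assumption, and all function names are hypothetical rather than taken from the paper or from O. Khrulev's technique.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens):
    """Map each lexeme in one text to its relative usage frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {lexeme: n / total for lexeme, n in counts.items()}


def author_dictionary(texts):
    """Average each lexeme's relative frequency over all texts of one author."""
    freqs = [relative_frequencies(t) for t in texts]
    lexemes = set().union(*freqs)
    return {lex: mean(f.get(lex, 0.0) for f in freqs) for lex in lexemes}


def variation_coefficients(texts):
    """Coefficient of variation (in percent) of each lexeme's frequency
    across the texts of one author: 100 * std / mean."""
    freqs = [relative_frequencies(t) for t in texts]
    lexemes = set().union(*freqs)
    result = {}
    for lex in lexemes:
        values = [f.get(lex, 0.0) for f in freqs]
        result[lex] = 100.0 * pstdev(values) / mean(values)
    return result


def inhomogeneous_lexemes(cv, threshold=33.0):
    """Lexemes whose variation coefficient exceeds the threshold (33% in the abstract)."""
    return {lex for lex, value in cv.items() if value > threshold}


def distance(text_freq, author_freq):
    """ASSUMPTION: sum of absolute frequency differences over all lexemes.
    The abstract does not state which distance the original technique uses."""
    lexemes = set(text_freq) | set(author_freq)
    return sum(abs(text_freq.get(l, 0.0) - author_freq.get(l, 0.0)) for l in lexemes)


def identify_author(unknown_tokens, dictionaries):
    """Return the author whose frequency dictionary is closest to the unknown text."""
    text_freq = relative_frequencies(unknown_tokens)
    return min(dictionaries, key=lambda a: distance(text_freq, dictionaries[a]))
```

In use, `dictionaries` would map each candidate author to the result of `author_dictionary` built from that author's known texts, with the unknown text excluded, as in the experiment described above.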
- Published
- 2017