5,897 results on '"authorship attribution"'
Search Results
2. Give Credit Where It’s Due: The Ethical Imperatives of Authorship Attribution in Collaborative Research.
- Author
- Sweeting, Karen D., Diaz-Kope, Luisa M., Henley, Tiffany J., and Bharath, Del
- Subjects
- *ATTRIBUTION of authorship, *BUSINESS partnerships, *AUTHORSHIP collaboration, *SCHOLARLY publishing, *SCHOOL credits
- Abstract
Abstract and plain language summary: This article proposes practical steps for those new to collaborative research or those seeking to improve the nature of collaborative research partnerships. Questioning academia’s exaltation of publications and authorship as coveted assets, the authors critically analyze the often complex, contentious, and ambiguous environment of collaborative research and negotiating authorship. With respect to academic publishing and ethics of authorship, this article grapples with two critical questions: (1) What factors create ethical dilemmas in authorship in academia? (2) How can academics and researchers better navigate the complexities of collaborative research to sustain ethical authorship practices? A seven-phase process model is proposed to guide the collaborative research process to mitigate ambiguities that reside in the black box. The seven phases encompass problem settings and pre-negotiations, direction setting: action planning, social capital, re-negotiations, equity and inclusivity, implementation: the submission-revision cycle, and reflexivity. This theoretical analysis aims to offer a useful resource to promote enhanced collaborative practices, and the authors take an explicit advocacy and visionary perspective to promote transparency, accountability, and reciprocal trust, to shatter overarching issues such as oppressive, deceitful, and exploitative norms in academic authorship practices. Our main aim in this article is to translate the collaborative process into practical steps, whether this is a new experience or you are seeking to improve the nature of collaborative research partnerships. It combines research on authorship ethics and academic collaboration to examine how authors are credited in academic work. It questions the emphasis on publishing and authorship while also analyzing the complexities and challenges of collaborative research.
The authors offer guidance for improving collaboration and ensuring fair recognition of author contributions. The authors present a seven-step model to guide more equitable authorship practices, emphasizing the importance of transparency, accountability, and trust in academic collaborations. This approach aims to challenge and rectify unfair practices in authorship attribution. [ABSTRACT FROM AUTHOR]
- Published
- 2024
3. Authorship Attribution for English Short Texts.
- Author
- Alsanoosy, Tawfeeq, Shalbi, Bodor, and Noor, Ayman
- Subjects
- DEEP learning, RECURRENT neural networks, CONVOLUTIONAL neural networks, ATTRIBUTION of authorship, COPYRIGHT infringement
- Abstract
The explosive growth of the Internet and social media has led to the rapid and widespread dissemination of information, which often takes place anonymously. This anonymity has fostered the rise of uncredited copying, posing a significant threat of copyright infringement and raising serious concerns in fields where verifying information's authenticity is paramount. Authorship Attribution (AA), a critical classification task within Natural Language Processing (NLP), aims to mitigate these concerns by identifying the original source of content. Although extensive research exists for longer texts, AA for short texts, namely informal texts like tweets, remains challenging due to the latter's brevity and stylistic variation. Thus, this study aims to investigate and measure the performance of various Machine Learning (ML) and Deep Learning (DL) methods deployed for feature extraction from short text data, using tweets. The employed feature extraction methods were: Bag-of-Words (BoW), TF-IDF, n-grams, word-level, and character-level features. These methods were evaluated in conjunction with six ML classifiers, i.e. Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), and Random Forest (RF) along with two DL architectures, i.e. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The highest accuracy achieved with an ML model was 92.34%, using an SVM with TF-IDF features. Even though the basic CNN DL model reached 88% accuracy, this outcome still surpassed the previously established baseline for this task. The findings of this research not only advance the technical capabilities of AA, but also extend its practical applications, providing tools that can be adapted across various domains to ensure proper attribution and expose copyright infringement. [ABSTRACT FROM AUTHOR]
- Published
- 2024
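Note on result 3: the abstract pairs TF-IDF features with a classifier for short texts. The block below is only a rough sketch of that idea, not the authors' pipeline. It assumes character trigrams as the unit, a smoothed IDF, and a nearest-neighbour rule by cosine similarity in place of the SVM the abstract reports. The function names (`char_ngrams`, `tfidf_vectors`, `cosine`, `attribute`) are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build one sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        # Smoothed IDF (an assumption) keeps every weight positive.
        vectors.append({g: (k / total) * (math.log((1 + n_docs) / (1 + df[g])) + 1)
                        for g, k in c.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def attribute(train, query, n=3):
    """train: list of (author, text). Return the author of the most similar text.

    Nearest neighbour stands in for the SVM here; IDF is computed over the
    training texts plus the query, which is a simplification.
    """
    vecs = tfidf_vectors([t for _, t in train] + [query], n)
    q = vecs[-1]
    best = max(range(len(train)), key=lambda i: cosine(vecs[i], q))
    return train[best][0]
```

Calling `attribute(train, query)` with a handful of labelled short texts returns the label of the closest one. With real tweets the feature space and classifier would need the tuning the abstract describes.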
4. Cross-linguistic authorship attribution and gender profiling. Machine translation as a method for bridging the language gap.
- Author
- Mikros, George and Boumparis, Dimitris
- Subjects
- *MACHINE translating, *RANDOM forest algorithms, *ATTRIBUTION of authorship, *LINGUISTICS, *AUTHORSHIP
- Abstract
This study explores the feasibility of cross-linguistic authorship attribution and the author's gender identification using Machine Translation (MT). Computational stylistics experiments were conducted on a Greek blog corpus translated into English using Google's Neural MT. A Random Forest algorithm was employed for authorship and gender profiling, using different feature groups [Author's Multilevel N-gram Profiles, quantitative linguistics (QL), and cross-lingual word embeddings (CLWE)] in both original and translated texts. Results indicate that MT is a viable method for converting a multilingual corpus into one language for authorship attribution and gender profiling research, with considerable accuracy when training and testing datasets use identical language. In the pure cross-linguistic scenario, higher accuracies than the baselines were obtained using CLWE and QL features. [ABSTRACT FROM AUTHOR]
- Published
- 2024
5. Testing high-dimensional multinomials with applications to text analysis.
- Author
- Cai, T Tony, Ke, Zheng T, and Turner, Paxton
- Subjects
- DISTRIBUTION (Probability theory), CENTRAL limit theorem, ATTRIBUTION of authorship, FILM reviewing, GAUSSIAN distribution
- Abstract
Motivated by applications in text mining and discrete distribution inference, we test for equality of probability mass functions of K groups of high-dimensional multinomial distributions. Special cases of this problem include global testing for topic models, two-sample testing in authorship attribution, and closeness testing for discrete distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null hypothesis, is proposed. This parameter-free limiting null distribution holds true without requiring identical multinomial parameters within each group or equal group sizes. The optimal detection boundary for this testing problem is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyse two real-world datasets to examine, respectively, variation among customer reviews of Amazon movies and the diversity of statistical paper abstracts. [ABSTRACT FROM AUTHOR]
- Published
- 2024
6. Code stylometry vs formatting and minification.
- Author
- Balla, Stefano, Gabbrielli, Maurizio, and Zacchiroli, Stefano
- Subjects
- ATTRIBUTION of authorship, STYLOMETRY, AUTOMATIC identification, COMPUTER software development, JOB offers
- Abstract
The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51.00%→68%), (2) code formatting makes a significant dent (15%) in code stylometry accuracy (68%→53%), with minification subtracting a further 3% (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry. [ABSTRACT FROM AUTHOR]
- Published
- 2024
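Note on result 6: the abstract contrasts CST-based and AST-based representations under code formatting. The sketch below only illustrates why the distinction matters, using Python's standard `ast` and `tokenize` modules. It is not the code2vec-based classifier from the paper and does not invoke Black or Python Minifier. The `REFORMATTED` string is hand-written to resemble formatter output, and both function names are hypothetical.

```python
import ast
import io
import tokenize

ORIGINAL = "def add(a,b):\n    return a+b\n"
# Hand-written to resemble what a formatter might emit (an assumption).
REFORMATTED = "def add(a, b):\n    return a + b\n"


def ast_fingerprint(source):
    """Structure only: ast.dump omits line and column attributes by default."""
    return ast.dump(ast.parse(source))


def layout_fingerprint(source):
    """Token strings with their positions, a rough stand-in for
    layout-aware (CST-like) features."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tok.string, tok.start) for tok in tokens]


same_structure = ast_fingerprint(ORIGINAL) == ast_fingerprint(REFORMATTED)
same_layout = layout_fingerprint(ORIGINAL) == layout_fingerprint(REFORMATTED)
```

Here `same_structure` is true while `same_layout` is false: reformatting leaves the abstract syntax tree untouched but changes token positions, which is the kind of layout signal a concrete syntax tree can retain and a formatter can erase.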
7. Identifying Authorship in Malicious Binaries: Features, Challenges & Datasets.
- Author
- Gray, Jason, Sgandurra, Daniele, Cavallaro, Lorenzo, and Blasco Alis, Jorge
- Published
- 2024
8. Observations of data characteristics and irregularities through domain-oriented transformations of attributes.
- Author
- Stańczyk, Urszula and Baron, Grzegorz
- Abstract
The paper presents research dedicated to observations of relations between attribute properties and discretisation. In the investigations described, the gradually increasing sets of features were discretised by selected approaches, and several variants of data were constructed. The continuous, partially discrete, and completely translated datasets were explored by the chosen classifiers and their performance studied in the context of a number of discretised attributes, discretisation procedures, and the way of processing of features and datasets. The stylometric problem of authorship attribution was the machine learning task under study. The experimental results make it possible to observe more closely the specificity of style-markers employed as characteristic features, and indicate conditions for efficient recognition of authorship. They can be extended to other application domains with similar characteristics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
9. Weighting Attributes Based on the Greedy Algorithm Properties.
- Author
- Zielosko, Beata, Stańczyk, Urszula, and Jabloński, Kamil
- Abstract
Estimation of importance for considered features is an important issue for any knowledge exploration process and it can be executed by a variety of approaches. In the research reported in this study, the primary aim was the development of a methodology for creating attribute rankings. Based on the properties of the greedy algorithm for inducing decision rules, a new application of this algorithm has been proposed. Instead of constructing a single ordering of features, attributes were weighted multiple times. The input datasets were discretised with several algorithms representing supervised and unsupervised discretisation approaches. Each resulting discrete data variant was exploited to construct a ranking of attributes. The effectiveness of the obtained rankings was confirmed through a rule filtering process governed by weighted attributes. The methodology was applied to the stylometric task of authorship attribution. The experimental outcomes demonstrate the value of the proposed research method, as it generally led to improved predictions while taking into account noticeably smaller sets of attributes and decision rules. [ABSTRACT FROM AUTHOR]
- Published
- 2024
10. Domain-specific data characteristics: A study on meaning of stylometric sub-concepts and in-class imbalance.
- Author
- Stańczyk, Urszula
- Abstract
In the context of data imbalance, probably the most investigated problem is imbalance of classes, as learning from the data with this characteristic makes detection of existing patterns for all classes more difficult. However, other problems related to imbalance also exist, and the paper addresses such cases where classes are balanced, but there is in-class imbalance. Such imbalance can be caused by uneven representation of sub-concepts. When there is a noticeable difference between the numbers of samples belonging to sub-concepts, this can turn the under-represented sub-concepts into disjuncts. Data irregularities of this type can hinder recognition, therefore actions are typically taken to restore balance. In the investigations described, the issue was studied in the stylometric domain and various classifiers were applied to the data that was balanced, then imbalanced, and finally with restored balance. The experiments show that the specifics of the domain of application can put its own mark on the data which is difficult to overcome by standard processing such as under- or oversampling. Observed dependence on a learner and dataset makes the issue even more complex and layered, and shows the need for deeper studies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
11. A comparison of latent semantic analysis and correspondence analysis of document-term matrices.
- Author
- Qi, Qianqian, Hessen, David J., Deoskar, Tejaswini, and van der Heijden, Peter G. M.
- Subjects
- SINGULAR value decomposition, TEXT mining, ATTRIBUTION of authorship, INFORMATION retrieval, NATIONAL songs, LATENT semantic analysis
- Abstract
Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, that is, sums of row elements and column elements, arising from differing document lengths and term frequencies are effectively eliminated so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and authorship attribution on historical Dutch texts and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, among several contenders. [ABSTRACT FROM AUTHOR]
- Published
- 2024
12. Reducing the Impact of Time Evolution on Source Code Authorship Attribution via Domain Adaptation.
- Author
- Li, Zhen, Zhao, Shasha, Chen, Chen, and Chen, Qian
- Subjects
- ATTRIBUTION of authorship, DEEP learning, LEARNING ability, PHYSIOLOGICAL adaptation
- Abstract
Source code authorship attribution is an important problem in practical applications such as plagiarism detection, software forensics, and copyright disputes. Recent studies show that existing methods for source code authorship attribution can be significantly affected by time evolution, leading to a decrease in attribution accuracy year by year. To alleviate the problem of Deep Learning (DL)-based source code authorship attribution degrading in accuracy due to time evolution, we propose a new framework called TimeDomain Adaptation (TimeDA) by adding new feature extractors to the original DL-based code attribution framework that enhances the learning ability of the original model on source domain features without requiring new or more source data. Moreover, we employ a centroid-based pseudo-labeling strategy using neighborhood clustering entropy for adaptive learning to improve the robustness of DL-based code authorship attribution. Experimental results show that TimeDA can significantly enhance the robustness of DL-based source code authorship attribution to time evolution, with an average improvement of 8.7% on the Java dataset and 5.2% on the C++ dataset. In addition, our TimeDA benefits from employing the centroid-based pseudo-labeling strategy, which significantly reduced the model training time by 87.3% compared to traditional unsupervised domain adaptive methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
13. Review of Various Approaches for Authorship Identification in Digital Forensics
- Author
- Sanjesh, Riya, Mangai, J. Alamelu, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Fortino, Giancarlo, editor, Kumar, Akshi, editor, Swaroop, Abhishek, editor, and Shukla, Pancham, editor
- Published
- 2024
14. Authorship Attribution for Assamese Language Documents: Initial Results
- Author
- Medhi, Smriti Priya, Sarma, Shikhar Kumar, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Das, Prodipto, editor, Begum, Shahin Ara, editor, and Buyya, Rajkumar, editor
- Published
- 2024
15. Word Trends in Digital Communication of High School Students in the New Capital of Indonesia (IKN): A Corpus Linguistic Study
- Author
- Puspitasari, Devi Ambarwati, Karlina, Yenny, Hernina, Hernina, Kurniawan, Kurniawan, Mulyo, Budi Mukhamad, Striełkowski, Wadim, Editor-in-Chief, Black, Jessica M., Series Editor, Butterfield, Stephen A., Series Editor, Chang, Chi-Cheng, Series Editor, Cheng, Jiuqing, Series Editor, Dumanig, Francisco Perlas, Series Editor, Al-Mabuk, Radhi, Series Editor, Scheper-Hughes, Nancy, Series Editor, Urban, Mathias, Series Editor, Webb, Stephen, Series Editor, Haristiani, Nuria, editor, Yulianeta, Yulianeta, editor, Wirza, Yanty, editor, Gunawan, Wawan, editor, Danuwijaya, Ari Arifin, editor, Kurniawan, Eri, editor, Suharno, Suharno, editor, Nafisah, Nia, editor, and Imperiani, Ernie Diyahkusumaning Ayu, editor
- Published
- 2024
16. Forensic Assignment Stylometry
- Author
- Crockett, Robin, Khan, Zeenath Reza, Section editor, and Eaton, Sarah Elaine, editor
- Published
- 2024
17. An Interpretable Authorship Attribution Algorithm Based on Distance-Related Characterizations of Tokens
- Author
- Lomas, Victor, Reyes, Michelle, Neme, Antonio, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Calvo, Hiram, editor, Martínez-Villaseñor, Lourdes, editor, and Ponce, Hiram, editor
- Published
- 2024
18. A Comparative Study of Machine Learning Methods and Text Features for Text Authorship Recognition in the Example of Azerbaijani Language Texts.
- Author
- Azimov, Rustam and Providas, Efthimios
- Subjects
- *ARTIFICIAL neural networks, *TEXT recognition, *CONVOLUTIONAL neural networks, *MACHINE learning, *SUPPORT vector machines, *ELECTRONIC publications
- Abstract
This paper presents various machine learning methods with different text features that are explored and evaluated to determine the authorship of the texts in the example of the Azerbaijani language. We consider techniques like artificial neural network, convolutional neural network, random forest, and support vector machine. These techniques are used with different text features like word length, sentence length, combined word length and sentence length, n-grams, and word frequencies. The models were trained and tested on the works of many famous Azerbaijani writers. The results of computer experiments obtained by utilizing a comparison of various techniques and text features were analyzed. The cases where the usage of text features allowed better results were determined. [ABSTRACT FROM AUTHOR]
- Published
- 2024
19. Beyond content: discriminatory power of function words in text type classification.
- Author
- Venglařová, Klára and Matlach, Vladimír
- Subjects
- *ATTRIBUTION of authorship
- Abstract
Our work aims to evaluate the strength of the association between function words and several text types: novels, poems, academic articles, reviews, and blog posts, and the accuracy of their classification to these categories, through machine-learning and statistical methods. The principal conclusion is that the types of texts are distinguishable based only on the function words, either by vocabulary or vocabulary diversity. Such findings may impact the techniques of authorship attribution based on function words and text clustering techniques since some function words add information about the text types/genres, in addition to content words. [ABSTRACT FROM AUTHOR]
- Published
- 2024
20. Stylometric analysis of French plays of the 17th century.
- Author
- Savoy, Jacques
- Subjects
- *SEVENTEENTH century, *ATTRIBUTION of authorship, *FRENCH fiction, *DIGITAL humanities
- Abstract
The automatic assignment of a text to one or more predefined categories presents multiple applications. In this context, the current study focuses on author attribution in which the true author of a doubtful text must be identified. This analysis focuses on the style of sixty-six French comedies in verse written by seventeen supposed authors during the 17th century. The hypothesis we want to verify assumes that the real author is the name appearing on the cover (called the signature hypothesis). In order to validate the reliability of two attribution procedures, we used two additional corpora based on 200 extracts of novels written in French, with thirty authors and 140 Italian novels authored by forty persons. After this verification, we propose an improvement of the Delta method as well as a new analysis grid for this model. Finally, we applied these approaches to our French comedy corpus. The results demonstrate that the signature hypothesis must be discarded. Moreover, these works present similar styles, making any attribution difficult to support with a high degree of certainty. [ABSTRACT FROM AUTHOR]
- Published
- 2024
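Note on result 20: the abstract builds on the Delta method. For orientation, the block below sketches the classic Burrows's Delta, not the improved variant the abstract proposes: z-score the relative frequencies of the most frequent words, then take the mean absolute difference between a disputed text and each candidate profile. Whitespace tokenisation, the `n_words` default, and the function names are assumptions made for illustration.

```python
import statistics
from collections import Counter


def rel_freqs(text, vocab):
    """Relative frequency of each vocabulary word in a text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocab]


def burrows_delta(corpus, disputed, n_words=5):
    """corpus: dict author -> text. Return dict author -> Delta to `disputed`."""
    all_words = Counter(w for t in corpus.values() for w in t.lower().split())
    vocab = [w for w, _ in all_words.most_common(n_words)]
    profiles = {a: rel_freqs(t, vocab) for a, t in corpus.items()}
    columns = list(zip(*profiles.values()))
    means = [statistics.mean(c) for c in columns]
    # Population standard deviation; fall back to 1.0 for a constant column.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]

    def z(vec):
        return [(x - m) / s for x, m, s in zip(vec, means, stds)]

    zq = z(rel_freqs(disputed, vocab))
    return {a: statistics.mean(abs(x - y) for x, y in zip(z(p), zq))
            for a, p in profiles.items()}
```

The candidate with the smallest Delta is the closest stylistic match. As the abstract itself cautions, a small distance is weak evidence when the candidates write in similar styles.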
21. Morphosyntactic Annotation in Literary Stylometry.
- Author
- Gorman, Robert
- Subjects
- *STYLOMETRY, *LITERARY style, *ANNOTATIONS, *ATTRIBUTION of authorship
- Abstract
This article investigates the stylometric usefulness of morphosyntactic annotation. Focusing on the style of literary texts, it argues that including morphosyntactic annotation in analyses of style has at least two important advantages: (1) maintaining a topic agnostic approach and (2) providing input variables that are interpretable in traditional grammatical terms. This study demonstrates how widely available Universal Dependency parsers can generate useful morphological and syntactic data for texts in a range of languages. These data can serve as the basis for input features that are strongly informative about the style of individual novels, as indicated by accuracy in classification tests. The interpretability of such features is demonstrated by a discussion of the weakness of an "authorial" signal as opposed to the clear distinction among individual works. [ABSTRACT FROM AUTHOR]
- Published
- 2024
22. Unsigned play by Milan Kundera? An authorship attribution study.
- Author
- Jungmannová, Lenka and Plecháč, Petr
- Subjects
- *ATTRIBUTION of authorship, *SUPERVISED learning
- Abstract
In addition to being a widely recognized novelist, Milan Kundera has also authored three pieces for theatre: The Owners of the Keys (Majitelé klíčů 1961), The Blunder (Ptákovina 1967), and Jacques and his Master (Jakub a jeho pán 1971). In recent years, however, the hypothesis has been raised that Kundera was the true author of a fourth play, Juro Jánošík, first performed in a 1974 production under the name of Karel Steigerwald, who was Kundera's student at the time. In this study, we make use of supervised machine learning to settle the question of authorship attribution in the case of Juro Jánošík, with results strongly supporting the hypothesis of Kundera's authorship. [ABSTRACT FROM AUTHOR]
- Published
- 2024
23. Authorship Attribution in Less-Resourced Languages: A Hybrid Transformer Approach for Romanian.
- Author
- Nitu, Melania and Dascalu, Mihai
- Subjects
- ATTRIBUTION of authorship, TRANSFORMER models, LANGUAGE models, NATURAL language processing, LINGUISTIC analysis
- Abstract
Authorship attribution for less-resourced languages like Romanian, characterized by the scarcity of large, annotated datasets and the limited number of available NLP tools, poses unique challenges. This study focuses on a hybrid Transformer combining handcrafted linguistic features, ranging from surface indices like word frequencies to syntax, semantics, and discourse markers, with contextualized embeddings from a Romanian BERT encoder. The methodology involves extracting contextualized representations from a pre-trained Romanian BERT model and concatenating them with linguistic features, selected using the Kruskal–Wallis mean rank, to create a hybrid input vector for a classification layer. We compare this approach with a baseline ensemble of seven machine learning classifiers for authorship attribution employing majority soft voting. We conduct studies on both long texts (full texts) and short texts (paragraphs), with 19 authors and a subset of 10. Our hybrid Transformer outperforms existing methods, achieving an F1 score of 0.87 on the full dataset of the 19-author set (an 11% enhancement) and an F1 score of 0.95 on the 10-author subset (an increase of 10% over previous research studies). We conduct linguistic analysis leveraging textual complexity indices and employ McNemar and Cochran's Q statistical tests to evaluate the performance evolution across the best three models, while highlighting patterns in misclassifications. Our research contributes to diversifying methodologies for effective authorship attribution in resource-constrained linguistic environments. Furthermore, we publicly release the full dataset and the codebase associated with this study to encourage further exploration and development in this field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
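Note on result 23: the baseline in the abstract is an ensemble combined by soft voting. A minimal sketch of soft voting in general follows, assuming each classifier outputs a probability per author and that all classifiers are weighted equally. The abstract does not specify the weighting, and `soft_vote` is a hypothetical name.

```python
def soft_vote(probabilities):
    """probabilities: list of dicts, one per classifier, mapping author -> probability.

    Return the author with the highest mean probability and the averaged scores.
    """
    authors = set().union(*probabilities)
    avg = {a: sum(p.get(a, 0.0) for p in probabilities) / len(probabilities)
           for a in authors}
    return max(avg, key=avg.get), avg
```

Soft voting can disagree with a simple majority of hard labels: if one classifier gives author X a probability of 0.9 and two others give X 0.4 each, two of three hard votes go to the other author, yet X wins on mean probability.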
24. Code stylometry vs formatting and minification
- Author
- Stefano Balla, Maurizio Gabbrielli, and Stefano Zacchiroli
- Subjects
- Authorship attribution, Code stylometry, Code formatting, Minification, Source code, Syntax tree, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51.00%→68%), (2) code formatting makes a significant dent (15%) in code stylometry accuracy (68%→53%), with minification subtracting a further 3% (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
- Published
- 2024
25. Towards Performance Improvement of Authorship Attribution
- Author
- Amar Suljic and Md Shafaeat Hossain
- Subjects
- Authorship attribution, fraudulent text, fusion, plagiarism, n-grams, stylometric features, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
An accurate authorship attribution model can play a vital role in the security domain by detecting fraudulent texts and combating plagiarism, online piracy, and cyber attacks. In this paper, we work on improving the performance of authorship attribution. To this end, we focus on generating effective samples and features towards creating an authorship attribution model. We did our experiments using a convolutional neural network (CNN). Two key findings from our experiments are as follows: first, our results consistently show that fusing n-grams and stylometric features yields a better performance than independently using each type of features. Notably, with fused features, we achieved an accuracy of 97.03%, a precision of 97.58%, and a recall of 97.03%. The second key finding is that when a sliding window is used in generating training samples, it is possible to improve performance by increasing the amount of overlap between samples, which can be achieved by decreasing the step length of the window. Our study shows that there is a linear relationship between performance metrics and the percent of overlap between training samples. Across three different types of features (n-grams, stylometric, and fused), the worst performance in our experiments was obtained when there was no overlap in the training samples. Inversely, the best performance was achieved when there was a 95% or a 99% overlap in the sliding windows.
- Published
- 2024
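Note on result 25: the abstract ties the overlap between training samples to the step length of a sliding window. The sketch below shows that relationship only. The `sliding_windows` name, the fractional `overlap` parameter, and the rounding rule are assumptions, not details taken from the paper.

```python
def sliding_windows(tokens, window, overlap):
    """Split `tokens` into windows of `window` items.

    `overlap` is the fraction of a window shared with the next one, in [0, 1).
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # A smaller step means more overlap and therefore more training samples.
    step = max(1, round(window * (1 - overlap)))
    return [tokens[i:i + window]
            for i in range(0, len(tokens) - window + 1, step)]
```

With ten tokens and a window of four, no overlap yields two samples, 50% overlap yields four, and 75% overlap yields seven. Raising the overlap multiplies the number of training samples drawn from the same text.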
26. Android Authorship Attribution Using Source Code-Based Features
- Author
- Emre Aydogan and Sevil Sen
- Subjects
- Android, authorship attribution, mobile malware, metadata, obfuscation, source code-based, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
With the widespread use of mobile devices, Android has become the most popular operating system, and new applications are uploaded to the Android market every day. However, due to the ease of modifying and repackaging Android binaries, Android applications can easily be modified and imitated by other developers and released in third-party Android markets. Therefore, determining the original developers of Android applications is a challenging problem known as authorship attribution. This study explores the distinctive features of Android applications to identify their authors. Software developers generally leave a footprint that reflects their writing styles in their applications. Therefore, this footprint, which can be extracted from either the source code or the binary code, can help identify the authors of software applications. Since obtaining the source code of applications in the wild can be impractical, especially when dealing with malware, researchers prefer to focus on the binaries of applications. Therefore, this study proposes an approach that identifies Android developers by deriving a wide range of features from different parts of Android applications, such as smali files, libraries, manifest files, and metadata information. Moreover, other features such as configuration, dex code, resource-based, and string-related features are inherited from other studies in Android authorship attribution and fused with the proposed feature set. The proposed approach was evaluated on benign and malware datasets and compared with those of other studies. The results show that the proposed features increase the accuracy, reaching 82.5% and 95.6% on the market and malware datasets, respectively. The results demonstrate the positive impact of the proposed features on Android authorship attribution.
- Published
- 2024
- Full Text
- View/download PDF
27. Semantic Clustering and Transfer Learning in Social Media Texts Authorship Attribution
- Author
-
Anastasia Fedotova, Anna Kurtukova, Aleksandr Romanov, and Alexander Shelupanov
- Subjects
Authorship attribution ,machine learning ,natural language processing ,semantic clustering ,transfer learning ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
This paper is the fourth part of a research series that focuses on determining the authorship of Russian-language texts by analyzing short social media comments, including those from mass media and communities associated with destructive content. Semantic text clustering was used to analyze content, and a transfer learning technique based on a pre-trained model was employed to identify sensitive topics. Authorship attribution is implemented as a classical classification task with a closed set of authors and as a more challenging open-set task. In the latter case, multiple experiments were conducted, incorporating the identification of destructive content with known authors and artificially generated texts. For open attribution, a method combining One-Class SVM and fastText was proposed. Results demonstrate high accuracy (92% or higher) for cases with 2 and 5 authors, regardless of comment length and the additional task of identifying authors of destructive text. Mixed-data experiments involving 10 or more authors yielded results comparable to or more accurate (84% or higher) than previous studies.
- Published
- 2024
- Full Text
- View/download PDF
28. Authorship Attribution Methods, Challenges, and Future Research Directions: A Comprehensive Survey.
- Author
-
He, Xie, Lashkari, Arash Habibi, Vombatkere, Nikhill, and Sharma, Dilli Prasad
- Subjects
- *
ATTRIBUTION of authorship , *PROTECTION of trade secrets , *COPYRIGHT infringement , *RESEARCH personnel - Abstract
Over the past few decades, researchers have devoted significant effort and attention to the authorship attribution field, as it plays an important role in software forensics analysis, plagiarism detection, security attack detection, and protection of trade secrets, patent claims, copyright infringement, or cases of software theft. This survey helps new researchers understand the state-of-the-art works on authorship attribution methods, identify and examine the emerging methods for authorship attribution, and discuss their key concepts, associated challenges, and potential future work that could help newcomers in this field. This paper comprehensively surveys authorship attribution methods and their key classifications, used feature types, available datasets, model evaluation criteria and metrics, and challenges and limitations. In addition, we discuss the potential future research directions of the authorship attribution field based on the insights and lessons learned from this survey work. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Information Retrieval and Machine Learning Methods for Academic Expert Finding.
- Author
-
de Campos, Luis M., Fernández-Luna, Juan M., Huete, Juan F., Ribadas-Pena, Francisco J., and Bolaños, Néstor
- Subjects
- *
MACHINE learning , *INFORMATION retrieval , *DEEP learning , *RECOMMENDER systems , *ATTRIBUTION of authorship - Abstract
In the context of academic expert finding, this paper investigates and compares the performance of information retrieval (IR) and machine learning (ML) methods, including deep learning, to approach the problem of identifying academic figures who are experts in different domains when a potential user requests their expertise. IR-based methods construct multifaceted textual profiles for each expert by clustering information from their scientific publications. Several methods fully tailored for this problem are presented in this paper. In contrast, ML-based methods treat expert finding as a classification task, training automatic text classifiers using publications authored by experts. By comparing these approaches, we contribute to a deeper understanding of academic-expert-finding techniques and their applicability in knowledge discovery. These methods are tested with two large datasets from the biomedical field: PMSC-UGR and CORD-19. The results show how IR techniques were, in general, more robust with both datasets and more suitable than the ML-based ones, with some exceptions showing good performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. INTRODUCTION TO THE ALGORITHMIZATION OF THE AUTOMATED WRITING OF SCIENTIFIC PUBLICATIONS
- Author
-
Olga Olshevska, Oleksandr Kharakhash, and Anastasiia Volkova
- Subjects
algorithmization ,scientific writing ,automated writing ,literature review ,data analysis ,ethical consideration ,intellectual property ,authorship attribution ,bias in algorithms ,symbiotic model ,Automation ,T59.5 - Abstract
In the rapidly evolving landscape of scientific research, the demand for efficient and precise dissemination of knowledge has led to the exploration of innovative approaches. This article delves into the burgeoning field of algorithmization applied to the automated writing of scientific publications. As traditional methods of manuscript preparation face challenges related to time consumption and potential human errors, the integration of algorithms promises to revolutionize the publication process. The article commences with an exploration of the current challenges in scientific writing, highlighting the time-intensive nature of literature review, data analysis, and drafting. It underscores the potential for automation to alleviate researchers' burden by streamlining these processes, allowing them to focus more on the core aspects of their research. The ethical considerations inherent in algorithmic scientific writing are thoroughly addressed. The article scrutinizes concerns related to intellectual property, authorship attribution, and potential biases embedded in algorithms. It advocates for transparent practices and emphasizes the need for researchers to maintain oversight over algorithmic outputs to preserve the integrity of scientific discourse. [1] An in-depth analysis of existing automated writing tools and platforms is presented, evaluating their strengths and limitations. The article compares popular algorithms and discusses their applicability to diverse scientific domains. Moreover, it sheds light on the potential for collaboration between human researchers and algorithms, presenting a symbiotic model that harnesses the strengths of both. The article concludes with a forward-looking perspective, envisioning the future implications of algorithmization in scientific publication writing. It discusses potential advancements, challenges, and the evolving role of researchers in an era where algorithms contribute significantly to the scholarly communication landscape.
- Published
- 2023
- Full Text
- View/download PDF
31. A3C: Albanian Authorship Attribution Corpus
- Author
-
Misini, Arta, Kadriu, Arbana, Canhasi, Ercan, Bexheti, Abdylmenaf, editor, Abazi-Alili, Hyrije, editor, Dana, Léo-Paul, editor, Ramadani, Veland, editor, and Caputo, Andrea, editor
- Published
- 2023
- Full Text
- View/download PDF
32. Forensic Assignment Stylometry
- Author
-
Crockett, Robin, Khan, Zeenath Reza, Section editor, and Eaton, Sarah Elaine, editor
- Published
- 2023
- Full Text
- View/download PDF
33. Effect of Machine Translation on Authorship Attribution
- Author
-
Ouamour, S., Sayoud, H., Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Bhateja, Vikrant, editor, Yang, Xin-She, editor, Ferreira, Marta Campos, editor, Sengar, Sandeep Singh, editor, and Travieso-Gonzalez, Carlos M., editor
- Published
- 2023
- Full Text
- View/download PDF
34. Spanish Stylometric Features to Determine Gender and Profession of Ecuadorian Twitter Users
- Author
-
Espin-Riofrio, César, Pazmiño-Rosales, María, Aucapiña-Camas, Carlos, Mendoza-Morán, Verónica, Montejo-Ráez, Arturo, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Narváez, Fabián R., editor, Urgilés, Fernando, editor, Bastos-Filho, Teodiano Freire, editor, and Salgado-Guerrero, Juan Pablo, editor
- Published
- 2023
- Full Text
- View/download PDF
35. Analyzing Stylistic Variation Across Different Political Regimes
- Author
-
Dinu, Liviu P., Uban, Ana Sabina, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, and Gelbukh, Alexander, editor
- Published
- 2023
- Full Text
- View/download PDF
36. Language and Platform Independent Attribution of Heterogeneous Code
- Author
-
Abazari, Farzaneh, Branca, Enrico, Novikova, Evgeniya, Stakhanova, Natalia, Akan, Ozgur, Editorial Board Member, Bellavista, Paolo, Editorial Board Member, Cao, Jiannong, Editorial Board Member, Coulson, Geoffrey, Editorial Board Member, Dressler, Falko, Editorial Board Member, Ferrari, Domenico, Editorial Board Member, Gerla, Mario, Editorial Board Member, Kobayashi, Hisashi, Editorial Board Member, Palazzo, Sergio, Editorial Board Member, Sahni, Sartaj, Editorial Board Member, Shen, Xuemin, Editorial Board Member, Stan, Mircea, Editorial Board Member, Jia, Xiaohua, Editorial Board Member, Zomaya, Albert Y., Editorial Board Member, Li, Fengjun, editor, Liang, Kaitai, editor, Lin, Zhiqiang, editor, and Katsikas, Sokratis K., editor
- Published
- 2023
- Full Text
- View/download PDF
37. Following Negationists on Twitter and Telegram: Application of NCD to the Analysis of Multiplatform Misinformation Dynamics
- Author
-
de Paz, Alfonso, Suárez, Manuel, Palmero, Santiago, Degli-Esposti, Sara, Arroyo, David, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Bravo, José, editor, Ochoa, Sergio, editor, and Favela, Jesús, editor
- Published
- 2023
- Full Text
- View/download PDF
38. A Comparative Study of Stylometric Characteristics in Authorship Attribution
- Author
-
Mahor, Urmila, Kumar, Aarti, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Joshi, Amit, editor, Mahmud, Mufti, editor, and Ragel, Roshan G., editor
- Published
- 2023
- Full Text
- View/download PDF
39. A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
- Author
-
Ruhaila Maskat, Norazmiera Ayunie Azman, Nur Shaheera Shastera Nulizairos, Nurul Athirah Zahidin, Adibah Humairah Mahadi, Siti Rubaya Norshamsul, Mohd Mukhlis Mohd Sharif, and Hairulnizam Mahdin
- Subjects
Biological gender identification ,Authorship attribution ,Code-switching ,Malay-English ,Manglish ,NLP ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Science (General) ,Q1-390 - Abstract
Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses the scarcity issue by contributing to the inventory of low-resource languages, specifically focusing on Malay-English, known as Manglish. Manglish speakers are primarily located in Malaysia, Indonesia, Brunei, and Singapore. As global adoption of second languages and social media usage increases, language code-switching, such as Spanglish and Chinglish, becomes more prevalent. In the case of Malay-English, this phenomenon is termed Manglish. To enhance the status of the Malay language and its transition out of the low-resource category, this unique text corpus, with binary annotations for biological gender and anonymized author identities, is presented. This bi-annotated dataset offers valuable applications for various fields, including the investigation of cyberbullying, combating gender bias, and providing targeted recommendations for gender-specific products. This corpus can be used with either of the annotations or their composite. The dataset comprises posts from 50 Malaysian public figures, equally split between biological males and females. The dataset contains a total of 709,012 raw X posts (formerly Twitter), with a relatively balanced distribution of 53.72% from biological female authors and 46.28% from biological male authors. The Twitter API was used to scrape the posts. After pre-processing, the total was reduced to 650,409 posts, widening the gap between the genders to 56.88% for biological female authors and 43.12% for biological male authors. This dataset is a valuable resource for researchers in the field of Malay-English code-switching Natural Language Processing (NLP) and can be used to train or enhance existing and future Manglish language transformers.
- Published
- 2024
- Full Text
- View/download PDF
40. Stylometry and forensic science: A literature review
- Author
-
Valentina Cammarota, Silvia Bozza, Claude-Alain Roten, and Franco Taroni
- Subjects
Forensic stylistics ,Stylometry ,Authorship attribution ,Criminal law and procedure ,K5000-5582 - Abstract
The article focuses on a careful description of literature on stylometry and on its potential use in forensic science. The state of the art of stylometry is summarized to illustrate the history and the scientific foundation of this discipline. However, the study conducted reveals that there are still some key unresolved aspects that require a response from the academic world. The paper introduces the readers to those issues that need to be tackled for stylometry to be accepted as a forensic discipline. In particular, a coherent probabilistic procedure to assess the probative value of the results obtained through this methodology is largely absent. This gap should be filled properly by applying criteria recommended by international organizations such as the European Network of Forensic Science Institutes. Solutions do exist and will allow a better integration of stylometry in forensic science, favouring the acceptance of this scientific technical method in judicial proceedings.
- Published
- 2024
- Full Text
- View/download PDF
41. I repeat therefore I am: The parasyntactic perspective.
- Author
-
Benešová, Martina, Faltýnek, Dan, Kormaníková, Libuše, and Kučera, Ondřej
- Subjects
- *
HABIT , *ATTRIBUTION of authorship , *LINGUISTIC context - Abstract
The text presents a theoretical platform and a case study of a new method for authorship attribution based on an author's specific low-frequency lexicon. It will be shown that an author's text is largely context-independent and is constructed by the author's habit based on the regular repetition of certain topics or modes of expression. The author's idiosyncratic way of choosing between synonymous linguistic devices in the text happens at a distance of several word forms or sentence units. This means that texts themselves are constructed using a much wider range of repetitions than expected and that the structure of the text above the level of intersentential linking is determined by a specific group of words (functional but above all content words) obligatorily used by the author in the formulation of the text. The newly introduced method can be used to attribute authorship by relying on the specific linguistic imprint of the author in the text (in this context, we talk about parasyntactic linguistic level). The method is compared with a function-word-based method. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
42. Who could be behind QAnon? Authorship attribution with supervised machine-learning.
- Author
-
Cafiero, Florian and Camps, Jean-Baptiste
- Subjects
- *
ATTRIBUTION of authorship , *MACHINE learning , *QANON , *COINCIDENCE , *LITERARY form , *SOCIAL media - Abstract
A series of social media posts on 4chan then 8chan, signed under the pseudonym 'Q', started a movement known as QAnon, which led some of its most radical supporters to violent and illegal actions. To identify the person(s) behind Q, we evaluate the coincidence between the linguistic properties of the texts written by Q and those written by a list of suspects provided by journalistic investigation. To identify the authors of these posts, serious challenges have to be addressed. The 'Q drops' are very short texts, written in a way that constitutes a sort of literary genre in itself, with very peculiar features of style. These texts might have been written by different authors, whose other writings are often hard to find. After an online ethnography of the movement, necessary to collect enough material written by these thirteen potential authors, we use supervised machine learning to build stylistic profiles for each of them. We then perform a 'rolling analysis', looking repeatedly through a moving window for parts of Q's writings matching our profiles. We conclude that two different individuals, Paul F. and Ron W., are the closest match to Q's linguistic signature, and they could have successively written Q's texts. These potential authors are not high-ranking figures in the US administration, but rather social media activists. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
43. Ethics and IPR - Much Needed Legal Solutions for Tomorrow.
- Author
-
Chorążewska, Anna, Stanimirova, Ivana, and Oster, Kamil
- Abstract
This article considers the protection of authorship in scientific papers. We analysed the role of authorship in the light of the current legal and ethical framework. We have discovered that standard rules of copyright law refer to the relationship between the 'author' and the result of their creative activity. 'Authors' are not originators of a discovery, idea, procedure, theory, method or other immaterial contribution to research unless they have fixed the intellectual work in any tangible medium of expression. At times, it is challenging to identify scientific products, which are an essential contribution to research projects, which means that copyright law might not protect them. These two contexts, modern science and copyright law, allow us to conclude that ethical codes for researchers properly define the right to be an author of a scientific paper. The study aims to clarify that (1) international human rights guarantee the protection of the author's moral rights of the original contribution to the research project, (2) this obligation is not implemented correctly by national legislators, (3) national legislators' task is to create an adequate legal protection system for original contributions to research science according to the example of the solutions adopted by the German legislator. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
44. Authorship Attribution on Short Texts in the Slovenian Language.
- Author
-
Gabrovšek, Gregor, Peer, Peter, Emeršič, Žiga, and Batagelj, Borut
- Subjects
ATTRIBUTION of authorship ,LANGUAGE models ,HATE speech ,SPEECH perception ,STYLOMETRY ,LANGUAGE & languages - Abstract
Featured Application: The results of this study are applicable to systems combating misinformation and hate speech online. For example, the authorship attribution technique developed in this study is applicable to identifying people who were banned from online platforms for hate speech but started posting again under a newly registered account. The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
45. INTRODUCTION TO THE ALGORITHMIZATION OF THE AUTOMATED WRITING OF SCIENTIFIC PUBLICATIONS.
- Author
-
Olshevska, Olga and Kharakhash, Oleksandr
- Subjects
TECHNICAL writing ,ALGORITHMIC bias ,LITERATURE reviews ,ATTRIBUTION of authorship ,INTELLECTUAL property - Abstract
In the rapidly evolving environment of scientific research, the need for efficient and accurate dissemination of knowledge has led to a search for innovative approaches. This article examines the developing field of algorithmization as applied to the automated writing of scientific publications. As traditional methods of manuscript preparation face problems related to time consumption and potential human error, the integration of algorithms promises to revolutionize the publication process. The article begins with an exploration of current challenges in scientific writing, emphasizing that literature review, data analysis, and drafting are time-consuming. It highlights the potential of automation to ease researchers' burden by streamlining these processes, allowing them to focus more on the core aspects of their research. The ethical considerations inherent in algorithmic scientific writing are thoroughly addressed. The article examines in detail concerns related to intellectual property, authorship attribution, and potential biases embedded in algorithms. It advocates transparent practices and stresses the need for researchers to maintain oversight of algorithmic outputs in order to preserve the integrity of scientific discourse. [1] An in-depth analysis of existing automated writing tools and platforms is presented, evaluating their strengths and limitations. The article compares popular algorithms and discusses their applicability to various scientific domains. In addition, it sheds light on the potential for collaboration between researchers and algorithms, presenting a symbiotic model that draws on the strengths of both. The article concludes with a forward-looking view of the future implications of algorithmization for the writing of scientific publications. It discusses potential advances, challenges, and the evolving role of researchers in an era in which algorithms contribute significantly to the landscape of scholarly communication. [ABSTRACT FROM AUTHOR]
- Published
- 2023
46. Authorship attribution in twitter: a comparative study of machine learning and deep learning approaches
- Author
-
Aouchiche, Rebeh Imane Ammar, Boumahdi, Fatima, Remmide, Mohamed Abdelkarim, and Madani, Amina
- Published
- 2024
- Full Text
- View/download PDF
47. A Comparative Study of Machine Learning Methods and Text Features for Text Authorship Recognition in the Example of Azerbaijani Language Texts
- Author
-
Rustam Azimov and Efthimios Providas
- Subjects
authorship recognition of literary works ,authorship attribution ,author identification ,text feature engineering ,machine learning ,Industrial engineering. Management engineering ,T55.4-60.8 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
This paper explores and evaluates various machine learning methods with different text features for determining the authorship of texts, using the Azerbaijani language as an example. We consider techniques like artificial neural network, convolutional neural network, random forest, and support vector machine. These techniques are used with different text features like word length, sentence length, combined word length and sentence length, n-grams, and word frequencies. The models were trained and tested on the works of many famous Azerbaijani writers. The results of computer experiments comparing the various techniques and text features were analyzed. The cases where the usage of text features allowed better results were determined.
- Published
- 2024
- Full Text
- View/download PDF
48. Morphosyntactic Annotation in Literary Stylometry
- Author
-
Robert Gorman
- Subjects
stylometry ,Universal Dependencies ,authorship attribution ,Information technology ,T58.5-58.64 - Abstract
This article investigates the stylometric usefulness of morphosyntactic annotation. Focusing on the style of literary texts, it argues that including morphosyntactic annotation in analyses of style has at least two important advantages: (1) maintaining a topic agnostic approach and (2) providing input variables that are interpretable in traditional grammatical terms. This study demonstrates how widely available Universal Dependency parsers can generate useful morphological and syntactic data for texts in a range of languages. These data can serve as the basis for input features that are strongly informative about the style of individual novels, as indicated by accuracy in classification tests. The interpretability of such features is demonstrated by a discussion of the weakness of an “authorial” signal as opposed to the clear distinction among individual works.
- Published
- 2024
- Full Text
- View/download PDF
49. Individuals with developmental disabilities make their own stylistic contributions to text written with physical facilitation
- Author
-
Giovanni Nicoli, Giulia Pavon, Andy Grayson, Anne Emerson, Michele Cortelazzo, and Suvobrata Mitra
- Subjects
facilitated communication ,stylometry ,co-authorship ,developmental disabilities ,authorship attribution ,Psychiatry ,RC435-571 ,Pediatrics ,RJ1-570 - Abstract
Introduction: For individuals with developmental disabilities (DD) such as autism, Down syndrome, or cerebral palsy, learning to express with language is a two-fold challenge because atypical cognitive capacity is compounded by sensorimotor coordination deficits. One approach to assisting linguistic expression in these individuals is to physically support them, for example, by touching their torso or arm as they type. The neurophysiological mechanism of such motor assistance for linguistic expression is not known, but recently it has been proposed that light touch may reduce the cognitive load associated with the sensorimotor coordination of typing, thereby releasing shared cognitive resources to the task of generating content. Historically, there has been significant controversy over the extent to which the facilitator and not the user authors texts written with touch assistance. User groups and a few researchers have argued that the user can express their thoughts through such techniques, but the prevailing view among researchers is that these texts are entirely the by-products of the facilitators' ideomotor cueing of users' movements. If the user is not a source of the produced text, the only linguistic style detectable in the text should be the facilitator's. Methods: Here, we use quantitative linguistic analysis to investigate whether DD users typing text with touch assistance exhibit their own stylistic signatures alongside those of their facilitators. In Study 1, we investigate whether the stylometric fingerprints of a set of users are detectable when they are all assisted by the same facilitator. In Study 2, we examine whether the users' stylometric characteristics are retained even when they are assisted by multiple facilitators. Results: Across both studies, the results show that the users' stylistic signature is detectable alongside that of facilitators. This suggests that the texts generated by DD users with physical assistance should be viewed as coauthored rather than wholly authored by facilitators via ideomotor processes. Discussion: The users' stylometric presence in these texts suggests that touch-assistance may serve as a developmental scaffold and should be re-appraised as a teaching aid even where unassisted linguistic expression is an unlikely end goal.
- Published
- 2023
- Full Text
- View/download PDF
50. "Reis melhor do que eu": los heterónimos de Pessoa desde una perspectiva estilométrica.
- Author
-
Skorinkin, Daniil and Orekhov, Boris
- Subjects
- *
ATTRIBUTION of authorship , *DOCUMENTARY evidence , *STYLOMETRY , *PROBLEM solving , *QUANTITATIVE research - Abstract
Traditionally, stylometry has been used to solve problems of authorship attribution. Quantitative attribution methods remain the last hope for researchers when reliable documentary evidence is unavailable. In the last 20 years, the Delta method, developed by John F. Burrows, has emerged as the leading attribution method. Overall, it has proven to be a reasonably reliable way of attributing texts in controversial cases. However, as our research shows, the case of Fernando Pessoa stands out, as he produced his texts "on behalf" of fictitious identities, commonly known as "heteronyms". It turns out that Delta does not identify these works as expected, that is, as texts belonging to the pen of a single person, Fernando Pessoa, but as texts from different authors. The article carries out a series of experiments to test the extent to which Pessoa manages to confuse the quantitative assessment of the authorship of his poetic texts. Pessoa's texts are examined as an independent corpus and against the background of the work of other Lusophone poets. In all cases, the distances between texts belonging to Pessoa's heteronyms are comparable to those between texts from different authors, much greater than the distances between texts from the same author. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
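Note on result 50: the Delta method named in that abstract can be illustrated with a minimal sketch. It assumes the commonly cited formulation of Burrows's Delta (the mean absolute difference between z-scored relative frequencies of the corpus's most frequent words). The function names `relative_freqs` and `burrows_delta`, the whitespace tokenizer, and the `n_words` default are illustrative assumptions and are not taken from the article.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in one text."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]


def burrows_delta(texts, n_words=50):
    """Return {(name_a, name_b): delta} for every pair of texts.

    texts: dict mapping a label to the raw text of one document.
    """
    # 1. Pick the most frequent words across the whole corpus.
    corpus_counts = Counter()
    for text in texts.values():
        corpus_counts.update(text.lower().split())
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]

    # 2. Relative frequencies per text.
    names = list(texts)
    freqs = {n: relative_freqs(texts[n], vocab) for n in names}

    # 3. Z-score each word's frequency across the corpus.
    z = {n: [] for n in names}
    for i in range(len(vocab)):
        column = [freqs[n][i] for n in names]
        mu, sigma = mean(column), pstdev(column)
        for n in names:
            z[n].append((freqs[n][i] - mu) / sigma if sigma > 0 else 0.0)

    # 4. Delta = mean absolute difference of z-scores for each pair.
    deltas = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            deltas[(a, b)] = mean([abs(x - y) for x, y in zip(z[a], z[b])])
    return deltas
```

Under this sketch, a small Delta between two texts suggests similar function-word habits. The abstract's finding is that texts signed by different heteronyms lie about as far apart as texts by different authors.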
Discovery Service for Jio Institute Digital Library