4,232 results on '"Authorship Attribution"'
Search Results
2. Cross-linguistic authorship attribution and gender profiling. Machine translation as a method for bridging the language gap.
- Author
- Mikros, George and Boumparis, Dimitris
- Subjects
- *MACHINE translating, *RANDOM forest algorithms, *ATTRIBUTION of authorship, *LINGUISTICS, *AUTHORSHIP
- Abstract
This study explores the feasibility of cross-linguistic authorship attribution and the author's gender identification using Machine Translation (MT). Computational stylistics experiments were conducted on a Greek blog corpus translated into English using Google's Neural MT. A Random Forest algorithm was employed for authorship and gender profiling, using different feature groups [Author's Multilevel N-gram Profiles, quantitative linguistics (QL), and cross-lingual word embeddings (CLWE)] in both original and translated texts. Results indicate that MT is a viable method for converting a multilingual corpus into one language for authorship attribution and gender profiling research, with considerable accuracy when training and testing datasets use identical language. In the pure cross-linguistic scenario, higher accuracies than the baselines were obtained using CLWE and QL features. [ABSTRACT FROM AUTHOR]
- Published
- 2024
3. Testing high-dimensional multinomials with applications to text analysis.
- Author
- Cai, T Tony, Ke, Zheng T, and Turner, Paxton
- Subjects
- DISTRIBUTION (Probability theory), CENTRAL limit theorem, ATTRIBUTION of authorship, FILM reviewing, GAUSSIAN distribution
- Abstract
Motivated by applications in text mining and discrete distribution inference, we test for equality of probability mass functions of K groups of high-dimensional multinomial distributions. Special cases of this problem include global testing for topic models, two-sample testing in authorship attribution, and closeness testing for discrete distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null hypothesis, is proposed. This parameter-free limiting null distribution holds true without requiring identical multinomial parameters within each group or equal group sizes. The optimal detection boundary for this testing problem is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyse two real-world datasets to examine, respectively, variation among customer reviews of Amazon movies and the diversity of statistical paper abstracts. [ABSTRACT FROM AUTHOR]
- Published
- 2024
4. Identifying Authorship in Malicious Binaries: Features, Challenges & Datasets.
- Author
- Gray, Jason, Sgandurra, Daniele, Cavallaro, Lorenzo, and Blasco Alis, Jorge
- Published
- 2024
5. A comparison of latent semantic analysis and correspondence analysis of document-term matrices.
- Author
- Qi, Qianqian, Hessen, David J., Deoskar, Tejaswini, and van der Heijden, Peter G. M.
- Subjects
- SINGULAR value decomposition, TEXT mining, ATTRIBUTION of authorship, INFORMATION retrieval, NATIONAL songs, LATENT semantic analysis
- Abstract
Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins, that is, sums of row elements and column elements, arising from differing document lengths and term frequencies are effectively eliminated so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and authorship attribution on historical Dutch texts and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, among several contenders. [ABSTRACT FROM AUTHOR]
- Published
- 2024
6. Reducing the Impact of Time Evolution on Source Code Authorship Attribution via Domain Adaptation.
- Author
- Li, Zhen, Zhao, Shasha, Chen, Chen, and Chen, Qian
- Subjects
- ATTRIBUTION of authorship, DEEP learning, LEARNING ability, PHYSIOLOGICAL adaptation
- Abstract
Source code authorship attribution is an important problem in practical applications such as plagiarism detection, software forensics, and copyright disputes. Recent studies show that existing methods for source code authorship attribution can be significantly affected by time evolution, leading to a decrease in attribution accuracy year by year. To alleviate the problem of Deep Learning (DL)-based source code authorship attribution degrading in accuracy due to time evolution, we propose a new framework called TimeDomain Adaptation (TimeDA) by adding new feature extractors to the original DL-based code attribution framework that enhances the learning ability of the original model on source domain features without requiring new or more source data. Moreover, we employ a centroid-based pseudo-labeling strategy using neighborhood clustering entropy for adaptive learning to improve the robustness of DL-based code authorship attribution. Experimental results show that TimeDA can significantly enhance the robustness of DL-based source code authorship attribution to time evolution, with an average improvement of 8.7% on the Java dataset and 5.2% on the C++ dataset. In addition, our TimeDA benefits from employing the centroid-based pseudo-labeling strategy, which significantly reduced the model training time by 87.3% compared to traditional unsupervised domain adaptive methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
7. A Comparative Study of Machine Learning Methods and Text Features for Text Authorship Recognition in the Example of Azerbaijani Language Texts.
- Author
- Azimov, Rustam and Providas, Efthimios
- Subjects
- *ARTIFICIAL neural networks, *TEXT recognition, *CONVOLUTIONAL neural networks, *MACHINE learning, *SUPPORT vector machines, *ELECTRONIC publications
- Abstract
This paper presents various machine learning methods with different text features that are explored and evaluated to determine the authorship of the texts in the example of the Azerbaijani language. We consider techniques like artificial neural network, convolutional neural network, random forest, and support vector machine. These techniques are used with different text features like word length, sentence length, combined word length and sentence length, n-grams, and word frequencies. The models were trained and tested on the works of many famous Azerbaijani writers. The results of computer experiments obtained by utilizing a comparison of various techniques and text features were analyzed. The cases where the usage of text features allowed better results were determined. [ABSTRACT FROM AUTHOR]
- Published
- 2024
8. Beyond content: discriminatory power of function words in text type classification.
- Author
- Venglařová, Klára and Matlach, Vladimír
- Subjects
- *ATTRIBUTION of authorship
- Abstract
Our work aims to evaluate the strength of the association between function words and several text types: novels, poems, academic articles, reviews, and blog posts, and the accuracy of their classification to these categories, through machine-learning and statistical methods. The principal conclusion is that the types of texts are distinguishable based only on the function words, either by vocabulary or vocabulary diversity. Such findings may impact the techniques of authorship attribution based on function words and text clustering techniques since some function words add information about the text types/genres, in addition to content words. [ABSTRACT FROM AUTHOR]
- Published
- 2024
9. Stylometric analysis of French plays of the 17th century.
- Author
- Savoy, Jacques
- Subjects
- *SEVENTEENTH century, *ATTRIBUTION of authorship, *FRENCH fiction, *DIGITAL humanities
- Abstract
The automatic assignment of a text to one or more predefined categories presents multiple applications. In this context, the current study focuses on author attribution in which the true author of a doubtful text must be identified. This analysis focuses on the style of sixty-six French comedies in verse written by seventeen supposed authors during the 17th century. The hypothesis we want to verify assumes that the real author is the name appearing on the cover (called the signature hypothesis). In order to validate the reliability of two attribution procedures, we used two additional corpora based on 200 extracts of novels written in French, with thirty authors and 140 Italian novels authored by forty persons. After this verification, we propose an improvement of the Delta method as well as a new analysis grid for this model. Finally, we applied these approaches to our French comedy corpus. The results demonstrate that the signature hypothesis must be discarded. Moreover, these works present similar styles, making any attribution difficult to support with a high degree of certainty. [ABSTRACT FROM AUTHOR]
- Published
- 2024
10. Code stylometry vs formatting and minification
- Author
- Stefano Balla, Maurizio Gabbrielli, and Stefano Zacchiroli
- Subjects
- Authorship attribution, Code stylometry, Code formatting, Minification, Source code, Syntax tree, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51.00%→68%), (2) code formatting makes a significant dent (15%) in code stylometry accuracy (68%→53%), with minification subtracting a further 3% (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
- Published
- 2024
11. Review of Various Approaches for Authorship Identification in Digital Forensics
- Author
- Sanjesh, Riya, Mangai, J. Alamelu, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Fortino, Giancarlo, editor, Kumar, Akshi, editor, Swaroop, Abhishek, editor, and Shukla, Pancham, editor
- Published
- 2024
12. Authorship Attribution for Assamese Language Documents: Initial Results
- Author
- Medhi, Smriti Priya, Sarma, Shikhar Kumar, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Das, Prodipto, editor, Begum, Shahin Ara, editor, and Buyya, Rajkumar, editor
- Published
- 2024
13. Word Trends in Digital Communication of High School Students in the New Capital of Indonesia (IKN): A Corpus Linguistic Study
- Author
- Puspitasari, Devi Ambarwati, Karlina, Yenny, Hernina, Hernina, Kurniawan, Kurniawan, Mulyo, Budi Mukhamad, Striełkowski, Wadim, Editor-in-Chief, Black, Jessica M., Series Editor, Butterfield, Stephen A., Series Editor, Chang, Chi-Cheng, Series Editor, Cheng, Jiuqing, Series Editor, Dumanig, Francisco Perlas, Series Editor, Al-Mabuk, Radhi, Series Editor, Scheper-Hughes, Nancy, Series Editor, Urban, Mathias, Series Editor, Webb, Stephen, Series Editor, Haristiani, Nuria, editor, Yulianeta, Yulianeta, editor, Wirza, Yanty, editor, Gunawan, Wawan, editor, Danuwijaya, Ari Arifin, editor, Kurniawan, Eri, editor, Suharno, Suharno, editor, Nafisah, Nia, editor, and Imperiani, Ernie Diyahkusumaning Ayu, editor
- Published
- 2024
14. Forensic Assignment Stylometry
- Author
- Crockett, Robin, Khan, Zeenath Reza, Section editor, and Eaton, Sarah Elaine, editor
- Published
- 2024
15. An Interpretable Authorship Attribution Algorithm Based on Distance-Related Characterizations of Tokens
- Author
- Lomas, Victor, Reyes, Michelle, Neme, Antonio, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Calvo, Hiram, editor, Martínez-Villaseñor, Lourdes, editor, and Ponce, Hiram, editor
- Published
- 2024
16. Morphosyntactic Annotation in Literary Stylometry.
- Author
- Gorman, Robert
- Subjects
- *STYLOMETRY, *LITERARY style, *ANNOTATIONS, *ATTRIBUTION of authorship
- Abstract
This article investigates the stylometric usefulness of morphosyntactic annotation. Focusing on the style of literary texts, it argues that including morphosyntactic annotation in analyses of style has at least two important advantages: (1) maintaining a topic agnostic approach and (2) providing input variables that are interpretable in traditional grammatical terms. This study demonstrates how widely available Universal Dependency parsers can generate useful morphological and syntactic data for texts in a range of languages. These data can serve as the basis for input features that are strongly informative about the style of individual novels, as indicated by accuracy in classification tests. The interpretability of such features is demonstrated by a discussion of the weakness of an "authorial" signal as opposed to the clear distinction among individual works. [ABSTRACT FROM AUTHOR]
- Published
- 2024
17. Unsigned play by Milan Kundera? An authorship attribution study.
- Author
- Jungmannová, Lenka and Plecháč, Petr
- Subjects
- *ATTRIBUTION of authorship, *SUPERVISED learning
- Abstract
In addition to being a widely recognized novelist, Milan Kundera has also authored three pieces for theatre: The Owners of the Keys (Majitelé klíčů 1961), The Blunder (Ptákovina 1967), and Jacques and his Master (Jakub a jeho pán 1971). In recent years, however, the hypothesis has been raised that Kundera was the true author of a fourth play, Juro Jánošík, first performed in a 1974 production under the name of Karel Steigerwald, who was Kundera's student at the time. In this study, we make use of supervised machine learning to settle the question of authorship attribution in the case of Juro Jánošík, with results strongly supporting the hypothesis of Kundera's authorship. [ABSTRACT FROM AUTHOR]
- Published
- 2024
18. Authorship Attribution in Less-Resourced Languages: A Hybrid Transformer Approach for Romanian.
- Author
- Nitu, Melania and Dascalu, Mihai
- Subjects
- ATTRIBUTION of authorship, TRANSFORMER models, LANGUAGE models, NATURAL language processing, LINGUISTIC analysis
- Abstract
Authorship attribution for less-resourced languages like Romanian, characterized by the scarcity of large, annotated datasets and the limited number of available NLP tools, poses unique challenges. This study focuses on a hybrid Transformer combining handcrafted linguistic features, ranging from surface indices like word frequencies to syntax, semantics, and discourse markers, with contextualized embeddings from a Romanian BERT encoder. The methodology involves extracting contextualized representations from a pre-trained Romanian BERT model and concatenating them with linguistic features, selected using the Kruskal–Wallis mean rank, to create a hybrid input vector for a classification layer. We compare this approach with a baseline ensemble of seven machine learning classifiers for authorship attribution employing majority soft voting. We conduct studies on both long texts (full texts) and short texts (paragraphs), with 19 authors and a subset of 10. Our hybrid Transformer outperforms existing methods, achieving an F1 score of 0.87 on the full dataset of the 19-author set (an 11% enhancement) and an F1 score of 0.95 on the 10-author subset (an increase of 10% over previous research studies). We conduct linguistic analysis leveraging textual complexity indices and employ McNemar and Cochran's Q statistical tests to evaluate the performance evolution across the best three models, while highlighting patterns in misclassifications. Our research contributes to diversifying methodologies for effective authorship attribution in resource-constrained linguistic environments. Furthermore, we publicly release the full dataset and the codebase associated with this study to encourage further exploration and development in this field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
19. Authorship Attribution Methods, Challenges, and Future Research Directions: A Comprehensive Survey.
- Author
- He, Xie, Lashkari, Arash Habibi, Vombatkere, Nikhill, and Sharma, Dilli Prasad
- Subjects
- *ATTRIBUTION of authorship, *PROTECTION of trade secrets, *COPYRIGHT infringement, *RESEARCH personnel
- Abstract
Over the past few decades, researchers have put their effort and paid significant attention to the authorship attribution field, as it plays an important role in software forensics analysis, plagiarism detection, security attack detection, and protection of trade secrets, patent claims, copyright infringement, or cases of software theft. It helps new researchers understand the state-of-the-art works on authorship attribution methods, identify and examine the emerging methods for authorship attribution, and discuss their key concepts, associated challenges, and potential future work that could help newcomers in this field. This paper comprehensively surveys authorship attribution methods and their key classifications, used feature types, available datasets, model evaluation criteria and metrics, and challenges and limitations. In addition, we discuss the potential future research directions of the authorship attribution field based on the insights and lessons learned from this survey work. [ABSTRACT FROM AUTHOR]
- Published
- 2024
20. Information Retrieval and Machine Learning Methods for Academic Expert Finding.
- Author
- de Campos, Luis M., Fernández-Luna, Juan M., Huete, Juan F., Ribadas-Pena, Francisco J., and Bolaños, Néstor
- Subjects
- *MACHINE learning, *INFORMATION retrieval, *DEEP learning, *RECOMMENDER systems, *ATTRIBUTION of authorship
- Abstract
In the context of academic expert finding, this paper investigates and compares the performance of information retrieval (IR) and machine learning (ML) methods, including deep learning, to approach the problem of identifying academic figures who are experts in different domains when a potential user requests their expertise. IR-based methods construct multifaceted textual profiles for each expert by clustering information from their scientific publications. Several methods fully tailored for this problem are presented in this paper. In contrast, ML-based methods treat expert finding as a classification task, training automatic text classifiers using publications authored by experts. By comparing these approaches, we contribute to a deeper understanding of academic-expert-finding techniques and their applicability in knowledge discovery. These methods are tested with two large datasets from the biomedical field: PMSC-UGR and CORD-19. The results show how IR techniques were, in general, more robust with both datasets and more suitable than the ML-based ones, with some exceptions showing good performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
21. Towards Performance Improvement of Authorship Attribution
- Author
- Amar Suljic and Md Shafaeat Hossain
- Subjects
- Authorship attribution, fraudulent text, fusion, plagiarism, n-grams, stylometric features, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
An accurate authorship attribution model can play a vital role in security domain by detecting fraudulent texts and combating plagiarism, online piracy, and cyber attacks. In this paper, we work on improving the performance of authorship attribution. To this end, we focus on generating effective samples and features towards creating an authorship attribution model. We did our experiments using a convolutional neural network (CNN). Two key findings from our experiments are as follows: first, our results consistently show that fusing n-grams and stylometric features yields a better performance than independently using each type of features. Notably, with fused features, we achieved an accuracy of 97.03%, a precision of 97.58%, and a recall of 97.03%. Second key finding is—when a sliding window is used in generating training samples, it is possible to improve performance by increasing the amount of overlap between samples, which can be achieved by decreasing the step length of the window. Our study shows that there is a linear relationship between performance metrics and the percent of overlap between training samples. Across three different types of features (n-grams, stylometric, and fused), the worst performance in our experiments was obtained when there was no overlap in the training samples. Inversely, the best performance was achieved when there was a 95% or a 99% overlap in the sliding windows.
- Published
- 2024
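The abstract of result 21 ties the overlap between training samples to the step length of the sliding window. The following is a minimal Python sketch of that relationship only, not the authors' pipeline: the function names `sliding_windows` and `overlap_percent`, the window size of 100 tokens, and the 1,000-token stand-in text are all assumptions made for illustration.

```python
def sliding_windows(tokens, window_size, step):
    """Split a token list into fixed-size windows, advancing by `step` tokens."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    return [
        tokens[start:start + window_size]
        for start in range(0, len(tokens) - window_size + 1, step)
    ]


def overlap_percent(window_size, step):
    """Percentage of each window that is shared with the next window."""
    return 100.0 * max(window_size - step, 0) / window_size


if __name__ == "__main__":
    tokens = list(range(1000))  # hypothetical stand-in for a tokenized text
    for step in (100, 50, 5, 1):
        samples = sliding_windows(tokens, window_size=100, step=step)
        # A smaller step gives a higher overlap and more training samples.
        print(step, overlap_percent(100, step), len(samples))
```

With a 100-token window, steps of 5 and 1 correspond to the 95% and 99% overlaps mentioned in the abstract, and a step equal to the window size gives the no-overlap case.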
22. Android Authorship Attribution Using Source Code-Based Features
- Author
- Emre Aydogan and Sevil Sen
- Subjects
- Android, authorship attribution, mobile malware, metadata, obfuscation, source code-based, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
With the widespread use of mobile devices, Android has become the most popular operating system, and new applications being uploaded to the Android market every day. However, due to the ease of modifying and repackaging Android binaries, Android applications can easily be modified and imitated by other developers and released in third-party Android markets. Therefore, determining the original developers of Android applications is a challenging problem known as authorship attribution. This study explores the distinctive features of Android applications to identify their authors. Software developers generally leave a footprint that reflects their writing styles in their applications. Therefore, this footprint, which can be extracted from either the source code or the binary code, can help identify the authors of software applications. Since obtaining the source code of applications in the wild can be impractical, especially when dealing with malware, researchers prefer to focus on the binaries of applications. Therefore, this study proposes an approach that identifies Android developers by deriving a wide range of features from different parts of Android applications, such as smali files, libraries, manifest files, and metadata information. Moreover, other features such as configuration, dex code, resource-based, and string-related features are inherited from other studies in Android authorship attribution and fused with the proposed feature set. The proposed approach was evaluated on benign and malware datasets and compared with those of other studies. The results show that the proposed features increase the accuracy by showing 82.5% and 95.6% in the market and malware datasets, respectively. The results demonstrate the positive impact of the proposed features on Android authorship attribution.
- Published
- 2024
23. Semantic Clustering and Transfer Learning in Social Media Texts Authorship Attribution
- Author
- Anastasia Fedotova, Anna Kurtukova, Aleksandr Romanov, and Alexander Shelupanov
- Subjects
- Authorship attribution, machine learning, natural language processing, semantic clustering, transfer learning, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
This paper is the fourth part of a research series that focuses on determining the authorship of Russian-language texts by analyzing short social media comments, including those from mass media and communities associated with destructive content. Semantic text clustering was used to analyze content and employed a transfer learning technique based on a pre-trained model to identify sensitive topics. Authorship attribution is implemented as a classical classification task with a closed set of authors and a more challenging open-set task. In the latter case, multiple experiments were conducted, incorporating the identification of destructive content with known authors and artificially generated texts. For open attribution, a method combining One-Class SVM and fastText was proposed. Results demonstrate high accuracy (92% or higher) for cases with 2 and 5 authors, regardless of comment length and the additional task of identifying authors of destructive text. Mixed-data experiments involving 10 or more authors yielded results comparable to or more accurate (84% or higher) than previous studies.
- Published
- 2024
24. INTRODUCTION TO THE ALGORITHMIZATION OF THE AUTOMATED WRITING OF SCIENTIFIC PUBLICATIONS
- Author
- Olga Olshevska, Oleksandr Kharakhash, and Anastasiia Volkova
- Subjects
- algorithmization, scientific writing, automated writing, literature review, data analysis, ethical consideration, intellectual property, authorship attribution, bias in algorithms, symbiotic model, Automation, T59.5
- Abstract
In the rapidly evolving landscape of scientific research, the demand for efficient and precise dissemination of knowledge has led to the exploration of innovative approaches. This article delves into the burgeoning field of algorithmization applied to the automated writing of scientific publications. As traditional methods of manuscript preparation face challenges related to time consumption and potential human errors, the integration of algorithms promises to revolutionize the publication process. The article commences with an exploration of the current challenges in scientific writing, highlighting the time-intensive nature of literature review, data analysis, and drafting. It underscores the potential for automation to alleviate researchers' burden by streamlining these processes, allowing them to focus more on the core aspects of their research. The ethical considerations inherent in algorithmic scientific writing are thoroughly addressed. The article scrutinizes concerns related to intellectual property, authorship attribution, and potential biases embedded in algorithms. It advocates for transparent practices and emphasizes the need for researchers to maintain oversight over algorithmic outputs to preserve the integrity of scientific discourse. An in-depth analysis of existing automated writing tools and platforms is presented, evaluating their strengths and limitations. The article compares popular algorithms and discusses their applicability to diverse scientific domains. Moreover, it sheds light on the potential for collaboration between human researchers and algorithms, presenting a symbiotic model that harnesses the strengths of both. The article concludes with a forward-looking perspective, envisioning the future implications of algorithmization in scientific publication writing. It discusses potential advancements, challenges, and the evolving role of researchers in an era where algorithms contribute significantly to the scholarly communication landscape.
- Published
- 2023
25. Authorship attribution in twitter: a comparative study of machine learning and deep learning approaches
- Author
- Aouchiche, Rebeh Imane Ammar, Boumahdi, Fatima, Remmide, Mohamed Abdelkarim, and Madani, Amina
- Published
- 2024
26. I repeat therefore I am: The parasyntactic perspective.
- Author
- Benešová, Martina, Faltýnek, Dan, Kormaníková, Libuše, and Kučera, Ondřej
- Subjects
- *HABIT, *ATTRIBUTION of authorship, *LINGUISTIC context
- Abstract
The text presents a theoretical platform and a case study of a new method for authorship attribution based on an author's specific low-frequency lexicon. It will be shown that an author's text is largely context-independent and is constructed by the author's habit based on the regular repetition of certain topics or modes of expression. The author's idiosyncratic way of choosing between synonymous linguistic devices in the text happens at a distance of several word forms or sentence units. This means that texts themselves are constructed using a much wider range of repetitions than expected and that the structure of the text above the level of intersentential linking is determined by a specific group of words (functional but above all content words) obligatorily used by the author in the formulation of the text. The newly introduced method can be used to attribute authorship by relying on the specific linguistic imprint of the author in the text (in this context, we talk about parasyntactic linguistic level). The method is compared with a function-word-based method. [ABSTRACT FROM AUTHOR]
- Published
- 2023
27. Who could be behind QAnon? Authorship attribution with supervised machine-learning.
- Author
- Cafiero, Florian and Camps, Jean-Baptiste
- Subjects
- *ATTRIBUTION of authorship, *MACHINE learning, *QANON, *COINCIDENCE, *LITERARY form, *SOCIAL media
- Abstract
A series of social media posts on 4chan then 8chan, signed under the pseudonym 'Q', started a movement known as QAnon, which led some of its most radical supporters to violent and illegal actions. To identify the person(s) behind Q, we evaluate the coincidence between the linguistic properties of the texts written by Q and to those written by a list of suspects provided by journalistic investigation. To identify the authors of these posts, serious challenges have to be addressed. The 'Q drops' are very short texts, written in a way that constitute a sort of literary genre in itself, with very peculiar features of style. These texts might have been written by different authors, whose other writings are often hard to find. After an online ethnography of the movement, necessary to collect enough material written by these thirteen potential authors, we use supervised machine learning to build stylistic profiles for each of them. We then performed a 'rolling analysis', looking repeatedly through a moving window for parts of Q's writings matching our profiles. We conclude that two different individuals, Paul F. and Ron W. are the closest match to Q's linguistic signature, and they could have successively written Q's texts. These potential authors are not high-ranked personality from the US administration, but rather social media activists. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
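The 'rolling analysis' mentioned in the abstract above can be illustrated with a minimal sketch. It slides a fixed-size window over a token sequence and labels each window with the closest candidate profile. The cosine similarity on raw word counts is a stand-in chosen for brevity and is not the supervised classifier used in the study; the names `windows`, `cosine`, and `rolling_attribution`, and the window parameters, are hypothetical.

```python
import math
from collections import Counter


def windows(tokens, size, step):
    """Yield (start, window) pairs for a moving window over the tokens."""
    last_start = max(len(tokens) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, tokens[start:start + size]


def cosine(c1, c2):
    """Cosine similarity between two Counter objects of word counts."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def rolling_attribution(tokens, profiles, size, step):
    """Label each window with the most similar profile (stand-in scoring)."""
    results = []
    for start, window in windows(tokens, size, step):
        counts = Counter(window)
        best = max(profiles, key=lambda name: cosine(counts, profiles[name]))
        results.append((start, best))
    return results


# Toy data, invented for illustration only.
profiles = {
    "X": Counter("a a a b".split()),
    "Y": Counter("c c d d".split()),
}
tokens = "a a b a c d c d".split()
print(rolling_attribution(tokens, profiles, size=4, step=4))
```

With real data the window size and step would need tuning, since very short windows carry little stylistic signal.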
28. Ethics and IPR - Much Needed Legal Solutions for Tomorrow.
- Author
-
Chorążewska, Anna, Stanimirova, Ivana, and Oster, Kamil
- Abstract
This article considers the protection of authorship in scientific papers. We analysed the role of authorship in the light of the current legal and ethical framework. We have discovered that standard rules of copyright law refer to the relationship between the 'author' and the result of their creative activity. 'Authors' are not originators of a discovery, idea, procedure, theory, method or other immaterial contribution to research unless they have fixed the intellectual work in any tangible medium of expression. At times, it is challenging to identify scientific products, which are an essential contribution to research projects, which means that copyright law might not protect them. These two contexts, modern science and copyright law, allow us to conclude that ethical codes for researchers properly define the right to be an author of a scientific paper. The study aims to clarify that (1) international human rights guarantee the protection of the author's moral rights of the original contribution to the research project, (2) this obligation is not implemented correctly by national legislators, (3) national legislators' task is to create an adequate legal protection system for original contributions to research science according to the example of the solutions adopted by the German legislator. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
29. A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution
- Author
-
Ruhaila Maskat, Norazmiera Ayunie Azman, Nur Shaheera Shastera Nulizairos, Nurul Athirah Zahidin, Adibah Humairah Mahadi, Siti Rubaya Norshamsul, Mohd Mukhlis Mohd Sharif, and Hairulnizam Mahdin
- Subjects
Biological gender identification ,Authorship attribution ,Code-switching ,Malay-English ,Manglish ,NLP ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Science (General) ,Q1-390 - Abstract
Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses the scarcity issue by contributing to the inventory of low-resource languages, specifically focusing on Malay-English, known as Manglish. Manglish speakers are primarily located in Malaysia, Indonesia, Brunei, and Singapore. As global adoption of second languages and social media usage increases, language code-switching, such as Spanglish and Chinglish, becomes more prevalent. In the case of Malay-English, this phenomenon is termed Manglish. To enhance the status of the Malay language and its transition out of the low-resource category, this unique text corpus, with binary annotations for biological gender and anonymized author identities, is presented. This bi-annotated dataset offers valuable applications for various fields, including the investigation of cyberbullying, combating gender bias, and providing targeted recommendations for gender-specific products. This corpus can be used with either of the annotations or their composite. The dataset comprises posts from 50 Malaysian public figures, equally split between biological males and females. The dataset contains a total of 709,012 raw X posts (formerly Twitter), with a relatively balanced distribution of 53.72% from biological female authors and 46.28% from biological male authors. The Twitter API was used to scrape the posts. After pre-processing, the total was reduced to 650,409 posts, widening the gap between the genders to 56.88% for biological female authors and 43.12% for biological male authors. This dataset is a valuable resource for researchers in the field of Malay-English code-switching Natural Language Processing (NLP) and can be used to train or enhance existing and future Manglish language transformers.
- Published
- 2024
- Full Text
- View/download PDF
30. A Comparative Study of Machine Learning Methods and Text Features for Text Authorship Recognition in the Example of Azerbaijani Language Texts
- Author
-
Rustam Azimov and Efthimios Providas
- Subjects
authorship recognition of literary works ,authorship attribution ,author identification ,text feature engineering ,machine learning ,Industrial engineering. Management engineering ,T55.4-60.8 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
This paper presents various machine learning methods with different text features that are explored and evaluated to determine the authorship of the texts in the example of the Azerbaijani language. We consider techniques like artificial neural network, convolutional neural network, random forest, and support vector machine. These techniques are used with different text features like word length, sentence length, combined word length and sentence length, n-grams, and word frequencies. The models were trained and tested on the works of many famous Azerbaijani writers. The results of computer experiments obtained by utilizing a comparison of various techniques and text features were analyzed. The cases where the usage of text features allowed better results were determined.
- Published
- 2024
- Full Text
- View/download PDF
31. A3C: Albanian Authorship Attribution Corpus
- Author
-
Misini, Arta, Kadriu, Arbana, Canhasi, Ercan, Bexheti, Abdylmenaf, editor, Abazi-Alili, Hyrije, editor, Dana, Léo-Paul, editor, Ramadani, Veland, editor, and Caputo, Andrea, editor
- Published
- 2023
- Full Text
- View/download PDF
32. Forensic Assignment Stylometry
- Author
-
Crockett, Robin, Khan, Zeenath Reza, Section editor, and Eaton, Sarah Elaine, editor
- Published
- 2023
- Full Text
- View/download PDF
33. Effect of Machine Translation on Authorship Attribution
- Author
-
Ouamour, S., Sayoud, H., Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Bhateja, Vikrant, editor, Yang, Xin-She, editor, Ferreira, Marta Campos, editor, Sengar, Sandeep Singh, editor, and Travieso-Gonzalez, Carlos M., editor
- Published
- 2023
- Full Text
- View/download PDF
34. Spanish Stylometric Features to Determine Gender and Profession of Ecuadorian Twitter Users
- Author
-
Espin-Riofrio, César, Pazmiño-Rosales, María, Aucapiña-Camas, Carlos, Mendoza-Morán, Verónica, Montejo-Ráez, Arturo, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Narváez, Fabián R., editor, Urgilés, Fernando, editor, Bastos-Filho, Teodiano Freire, editor, and Salgado-Guerrero, Juan Pablo, editor
- Published
- 2023
- Full Text
- View/download PDF
35. Analyzing Stylistic Variation Across Different Political Regimes
- Author
-
Dinu, Liviu P., Uban, Ana Sabina, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, and Gelbukh, Alexander, editor
- Published
- 2023
- Full Text
- View/download PDF
36. Language and Platform Independent Attribution of Heterogeneous Code
- Author
-
Abazari, Farzaneh, Branca, Enrico, Novikova, Evgeniya, Stakhanova, Natalia, Akan, Ozgur, Editorial Board Member, Bellavista, Paolo, Editorial Board Member, Cao, Jiannong, Editorial Board Member, Coulson, Geoffrey, Editorial Board Member, Dressler, Falko, Editorial Board Member, Ferrari, Domenico, Editorial Board Member, Gerla, Mario, Editorial Board Member, Kobayashi, Hisashi, Editorial Board Member, Palazzo, Sergio, Editorial Board Member, Sahni, Sartaj, Editorial Board Member, Shen, Xuemin, Editorial Board Member, Stan, Mircea, Editorial Board Member, Jia, Xiaohua, Editorial Board Member, Zomaya, Albert Y., Editorial Board Member, Li, Fengjun, editor, Liang, Kaitai, editor, Lin, Zhiqiang, editor, and Katsikas, Sokratis K., editor
- Published
- 2023
- Full Text
- View/download PDF
37. Following Negationists on Twitter and Telegram: Application of NCD to the Analysis of Multiplatform Misinformation Dynamics
- Author
-
de Paz, Alfonso, Suárez, Manuel, Palmero, Santiago, Degli-Esposti, Sara, Arroyo, David, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Bravo, José, editor, Ochoa, Sergio, editor, and Favela, Jesús, editor
- Published
- 2023
- Full Text
- View/download PDF
38. A Comparative Study of Stylometric Characteristics in Authorship Attribution
- Author
-
Mahor, Urmila, Kumar, Aarti, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Joshi, Amit, editor, Mahmud, Mufti, editor, and Ragel, Roshan G., editor
- Published
- 2023
- Full Text
- View/download PDF
39. Stylometry and forensic science: A literature review
- Author
-
Valentina Cammarota, Silvia Bozza, Claude-Alain Roten, and Franco Taroni
- Subjects
Forensic stylistics ,Stylometry ,Authorship attribution ,Criminal law and procedure ,K5000-5582 - Abstract
The article focuses on a careful description of literature on stylometry and on its potential use in forensic science. The state of the art of stylometry is summarized to illustrate the history and the scientific foundation of this discipline. However, the study conducted reveals that there are still some key unresolved aspects that require a response from the academic world. The paper introduces the readers to those issues that need to be tackled for stylometry to be accepted as a forensic discipline. In particular, a coherent probabilistic procedure to assess the probative value of the results obtained through this methodology is largely absent. This gap should be filled properly by applying criteria recommended by international organizations such as the European Network of Forensic Science Institutes. Solutions do exist and will allow a better integration of stylometry in forensic science, favouring the acceptance of this scientific technical method in judicial proceedings.
- Published
- 2024
- Full Text
- View/download PDF
40. Authorship Attribution on Short Texts in the Slovenian Language.
- Author
-
Gabrovšek, Gregor, Peer, Peter, Emeršič, Žiga, and Batagelj, Borut
- Subjects
ATTRIBUTION of authorship ,LANGUAGE models ,HATE speech ,SPEECH perception ,STYLOMETRY ,LANGUAGE & languages - Abstract
Featured Application: The results of this study are applicable to systems combating misinformation and hate speech online. For example, the authorship attribution technique developed in this study is applicable to identifying people who were banned from online platforms for hate speech but started posting again under a newly registered account. The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
41. INTRODUCTION TO THE ALGORITHMIZATION OF THE AUTOMATED WRITING OF SCIENTIFIC PUBLICATIONS.
- Author
-
Olshevska, Olga and Kharakhash, Oleksandr
- Subjects
TECHNICAL writing ,ALGORITHMIC bias ,LITERATURE reviews ,ATTRIBUTION of authorship ,INTELLECTUAL property - Abstract
In the rapidly evolving environment of scientific research, the need for efficient and accurate dissemination of knowledge has led to a search for innovative approaches. This article examines the emerging field of algorithmization as applied to the automated writing of scientific publications. As traditional methods of manuscript preparation face problems of time cost and potential human error, the integration of algorithms promises a revolution in the publication process. The article begins by examining current challenges in scientific writing, emphasizing that literature review, data analysis, and writing are time-consuming. It highlights the potential of automation to ease the burden on researchers by streamlining these processes, allowing them to focus more on the core aspects of their research. The ethical considerations inherent in algorithmic scientific writing are carefully examined. The article discusses in detail the problems related to intellectual property, authorship attribution, and potential biases embedded in algorithms. It advocates transparent practices and stresses the need for researchers to maintain oversight of algorithmic outputs in order to preserve the integrity of scientific discourse. [1] An in-depth analysis of existing automated writing tools and platforms is presented, assessing their strengths and limitations. The article compares popular algorithms and discusses their application in various scientific fields. In addition, it sheds light on the potential for collaboration between researchers and algorithms, presenting a symbiotic model that leverages the strengths of both. The article concludes with a forward-looking view of the future implications of algorithmization for the writing of scientific publications. It discusses potential advances, challenges, and the evolving role of researchers in an era when algorithms contribute significantly to the landscape of scientific communication. [ABSTRACT FROM AUTHOR]
- Published
- 2023
42. Morphosyntactic Annotation in Literary Stylometry
- Author
-
Robert Gorman
- Subjects
stylometry ,Universal Dependencies ,authorship attribution ,Information technology ,T58.5-58.64 - Abstract
This article investigates the stylometric usefulness of morphosyntactic annotation. Focusing on the style of literary texts, it argues that including morphosyntactic annotation in analyses of style has at least two important advantages: (1) maintaining a topic agnostic approach and (2) providing input variables that are interpretable in traditional grammatical terms. This study demonstrates how widely available Universal Dependency parsers can generate useful morphological and syntactic data for texts in a range of languages. These data can serve as the basis for input features that are strongly informative about the style of individual novels, as indicated by accuracy in classification tests. The interpretability of such features is demonstrated by a discussion of the weakness of an “authorial” signal as opposed to the clear distinction among individual works.
- Published
- 2024
- Full Text
- View/download PDF
43. Authorship Attribution in Less-Resourced Languages: A Hybrid Transformer Approach for Romanian
- Author
-
Melania Nitu and Mihai Dascalu
- Subjects
authorship attribution ,linguistic features ,contextualized embeddings ,hybrid Transformer ,ensemble learning ,linguistic analysis ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Biology (General) ,QH301-705.5 ,Physics ,QC1-999 ,Chemistry ,QD1-999 - Abstract
Authorship attribution for less-resourced languages like Romanian, characterized by the scarcity of large, annotated datasets and the limited number of available NLP tools, poses unique challenges. This study focuses on a hybrid Transformer combining handcrafted linguistic features, ranging from surface indices like word frequencies to syntax, semantics, and discourse markers, with contextualized embeddings from a Romanian BERT encoder. The methodology involves extracting contextualized representations from a pre-trained Romanian BERT model and concatenating them with linguistic features, selected using the Kruskal–Wallis mean rank, to create a hybrid input vector for a classification layer. We compare this approach with a baseline ensemble of seven machine learning classifiers for authorship attribution employing majority soft voting. We conduct studies on both long texts (full texts) and short texts (paragraphs), with 19 authors and a subset of 10. Our hybrid Transformer outperforms existing methods, achieving an F1 score of 0.87 on the full dataset of the 19-author set (an 11% enhancement) and an F1 score of 0.95 on the 10-author subset (an increase of 10% over previous research studies). We conduct linguistic analysis leveraging textual complexity indices and employ McNemar and Cochran’s Q statistical tests to evaluate the performance evolution across the best three models, while highlighting patterns in misclassifications. Our research contributes to diversifying methodologies for effective authorship attribution in resource-constrained linguistic environments. Furthermore, we publicly release the full dataset and the codebase associated with this study to encourage further exploration and development in this field.
- Published
- 2024
- Full Text
- View/download PDF
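The baseline in the abstract above is an ensemble that combines classifiers by soft voting. As a minimal sketch of that idea, the function below averages the class probabilities produced by several classifiers and picks the class with the highest mean. The function name `soft_vote` and the toy probabilities are assumptions for illustration; the study's actual seven classifiers and any weighting scheme are not reproduced here.

```python
def soft_vote(prob_rows):
    """Average per-classifier class probabilities and pick the best class.

    prob_rows: one list of class probabilities per classifier, all over
    the same ordered set of classes (hypothetical input format).
    Returns (index of winning class, list of averaged probabilities).
    """
    n_classifiers = len(prob_rows)
    n_classes = len(prob_rows[0])
    averaged = [
        sum(row[i] for row in prob_rows) / n_classifiers
        for i in range(n_classes)
    ]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged


# Toy case: two of three classifiers lean slightly towards class 1, but
# one is very confident about class 0, so the averaged vote picks class 0.
winner, averaged = soft_vote([[0.9, 0.1], [0.45, 0.55], [0.45, 0.55]])
print(winner, averaged)
```

The toy case shows why soft voting can differ from a hard majority vote: confidence matters, not just the count of classifiers favouring each class.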
44. Authorship Attribution Methods, Challenges, and Future Research Directions: A Comprehensive Survey
- Author
-
Xie He, Arash Habibi Lashkari, Nikhill Vombatkere, and Dilli Prasad Sharma
- Subjects
authorship attribution ,authorship verification ,author profiling ,source code analysis ,Information technology ,T58.5-58.64 - Abstract
Over the past few decades, researchers have devoted significant effort and attention to the authorship attribution field, as it plays an important role in software forensics analysis, plagiarism detection, security attack detection, and protection of trade secrets, patent claims, copyright infringement, or cases of software theft. This survey helps new researchers understand state-of-the-art work on authorship attribution methods, identifies and examines emerging methods for authorship attribution, and discusses their key concepts, associated challenges, and potential future work that could help newcomers in this field. This paper comprehensively surveys authorship attribution methods and their key classifications, feature types used, available datasets, model evaluation criteria and metrics, and challenges and limitations. In addition, we discuss the potential future research directions of the authorship attribution field based on the insights and lessons learned from this survey work.
- Published
- 2024
- Full Text
- View/download PDF
45. Individuals with developmental disabilities make their own stylistic contributions to text written with physical facilitation
- Author
-
Giovanni Nicoli, Giulia Pavon, Andy Grayson, Anne Emerson, Michele Cortelazzo, and Suvobrata Mitra
- Subjects
facilitated communication ,stylometry ,co-authorship ,developmental disabilities ,authorship attribution ,Psychiatry ,RC435-571 ,Pediatrics ,RJ1-570 - Abstract
Introduction: For individuals with developmental disabilities (DD) such as autism, Down syndrome, or cerebral palsy, learning to express with language is a two-fold challenge because atypical cognitive capacity is compounded by sensorimotor coordination deficits. One approach to assisting linguistic expression in these individuals is to physically support them, for example, by touching their torso or arm as they type. The neurophysiological mechanism of such motor assistance for linguistic expression is not known, but recently it has been proposed that light touch may reduce the cognitive load associated with the sensorimotor coordination of typing, thereby releasing shared cognitive resources to the task of generating content. Historically, there has been significant controversy over the extent to which the facilitator and not the user authors texts written with touch assistance. User groups and a few researchers have argued that the user can express their thoughts through such techniques, but the prevailing view among researchers is that these texts are entirely the by-products of the facilitators' ideomotor cueing of users' movements. If the user is not a source of the produced text, the only linguistic style detectable in the text should be the facilitator's. Methods: Here, we use quantitative linguistic analysis to investigate whether DD users typing text with touch assistance exhibit their own stylistic signatures alongside those of their facilitators. In Study 1, we investigate whether the stylometric fingerprints of a set of users are detectable when they are all assisted by the same facilitator. In Study 2, we examine whether the users' stylometric characteristics are retained even when they are assisted by multiple facilitators. Results: Across both studies, the results show that the users' stylistic signature is detectable alongside that of facilitators. This suggests that the texts generated by DD users with physical assistance should be viewed as coauthored rather than wholly authored by facilitators via ideomotor processes. Discussion: The users' stylometric presence in these texts suggests that touch-assistance may serve as a developmental scaffold and should be re-appraised as a teaching aid even where unassisted linguistic expression is an unlikely end goal.
- Published
- 2023
- Full Text
- View/download PDF
46. "Reis melhor do que eu": los heterónimos de Pessoa desde una perspectiva estilométrica.
- Author
-
Skorinkin, Daniil and Orekhov, Boris
- Subjects
- *
ATTRIBUTION of authorship , *DOCUMENTARY evidence , *STYLOMETRY , *PROBLEM solving , *QUANTITATIVE research - Abstract
Traditionally, stylometry has been used to solve problems of authorship attribution. Quantitative attribution methods remain the last hope for researchers when reliable documentary evidence is unavailable. In the last 20 years, the Delta method, developed by John F. Burrows, has emerged as the leading attribution method. Overall, it has proven to be a reasonably reliable way of attributing texts in controversial cases. However, as our research shows, the case of Fernando Pessoa stands out, as he produced his texts "on behalf" of fictitious identities, commonly known as "heteronyms". It turns out that Delta does not identify these works as expected, that is, as texts belonging to the pen of a single person, Fernando Pessoa, but as texts from different authors. The article carries out a series of experiments to test the extent to which Pessoa manages to confuse the quantitative assessment of the authorship of his poetic texts. Pessoa's texts are examined as an independent corpus and against the background of the work of other Lusophone poets. In all cases, the distances between texts belonging to Pessoa's heteronyms are comparable to those between texts from different authors, much greater than the distances between texts from the same author. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
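For readers unfamiliar with the Delta method named in the abstract above, the following is a minimal sketch of the classic formulation: take the most frequent words of the pooled corpus, z-score each word's relative frequency across texts, and define the distance between two texts as the mean absolute difference of their z-scores. The function names, the input format, and the use of the population standard deviation are assumptions made here for illustration; the article's own preprocessing and parameter choices are not known from the abstract.

```python
from collections import Counter
from statistics import mean, pstdev


def z_scores(texts, n_words):
    """Z-score relative frequencies of the n_words most frequent words.

    texts maps a label to a non-empty list of tokens (assumed format).
    """
    pooled = Counter(tok for toks in texts.values() for tok in toks)
    vocab = [word for word, _ in pooled.most_common(n_words)]

    # Relative frequency of each vocabulary word in each text.
    rel = {}
    for name, toks in texts.items():
        counts = Counter(toks)
        rel[name] = [counts[word] / len(toks) for word in vocab]

    # Standardise each word's frequency across all texts.
    z = {name: [] for name in texts}
    for i in range(len(vocab)):
        column = [rel[name][i] for name in texts]
        mu, sd = mean(column), pstdev(column)
        for name in texts:
            z[name].append((rel[name][i] - mu) / sd if sd > 0 else 0.0)
    return z


def burrows_delta(z, a, b):
    """Mean absolute difference between the z-score vectors of a and b."""
    return mean(abs(x - y) for x, y in zip(z[a], z[b]))


# Toy corpus, invented for illustration only.
texts = {
    "A1": "the sea and the sky and the wind".split(),
    "A2": "the sea and the sky and the rain".split(),
    "B1": "of night of stone of silence in shadow".split(),
}
z = z_scores(texts, n_words=3)
print(burrows_delta(z, "A1", "A2"), burrows_delta(z, "A1", "B1"))
```

In this framing, the finding reported in the abstract corresponds to heteronym pairs producing Delta values as large as those between texts by different authors.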
47. A Transformer-Based Approach to Authorship Attribution in Classical Arabic Texts.
- Author
-
AlZahrani, Fetoun Mansour and Al-Yahya, Maha
- Subjects
ATTRIBUTION of authorship ,NATURAL language processing ,ISLAMIC law - Abstract
Authorship attribution (AA) is a field of natural language processing that aims to attribute text to its author. Although the literature includes several studies on Arabic AA in general, applying AA to classical Arabic texts has not gained similar attention. This study focuses on investigating recent Arabic pretrained transformer-based models in a rarely studied domain with limited research contributions: the domain of Islamic law. We adopt an experimental approach to investigate AA. Because no dataset has been designed specifically for this task, we design and build our own dataset using Islamic law digital resources. We conduct several experiments on fine-tuning four Arabic pretrained transformer-based models: AraBERT, AraELECTRA, ARBERT, and MARBERT. Results of the experiments indicate that for the task of attributing a given text to its author, ARBERT and AraELECTRA outperform the other models with an accuracy of 96%. We conclude that pretrained transformer models, specifically ARBERT and AraELECTRA, fine-tuned using the Islamic legal dataset, show significant results in applying AA to Islamic legal texts. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
48. Should supervised discretisation always be trusted unreservedly? On combining characteristics of supervised and unsupervised discretisation algorithms in two-step processing.
- Author
-
Stańczyk, Urszula and Baron, Grzegorz
- Subjects
ATTRIBUTION of authorship ,ALGORITHMS ,PATTERN recognition systems ,MACHINE learning ,PRODUCT returns - Abstract
The paper describes a research methodology dedicated to a two-step discretisation process applied to numeric input data, combining the characteristics of selected supervised and unsupervised algorithms, which leads to extended processing of some attributes in the train and test sets. The methodology is illustrated with investigations carried out in the domain of stylometric analysis of texts, for two datasets prepared for the task of binary authorship attribution. The several variants of transformed input data obtained were explored using two selected machine learning methods capable of inducing knowledge from both continuous and categorical forms, namely the PART and J48 classifiers. The results from the experiments indicate that, as can be expected, supervised transformations of data work well enough; however, they do not always return the best outcome. The two-step processing of some attributes shows sufficient promise to warrant a closer study, as opposed to always unconditionally relying only on supervised algorithms as outperforming all other approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
49. Quantifying the Dissimilarity of Texts.
- Author
-
Shade, Benjamin and Altmann, Eduardo G.
- Subjects
- *
WORD frequency , *DISTRIBUTION (Probability theory) , *DOCUMENT clustering , *NATURAL language processing , *DATABASES , *INFORMATION retrieval - Abstract
Quantifying the dissimilarity of two texts is an important aspect of a number of natural language processing tasks, including semantic information retrieval, topic classification, and document clustering. In this paper, we compared the properties and performance of different dissimilarity measures D using three different representations of texts—vocabularies, word frequency distributions, and vector embeddings—and three simple tasks—clustering texts by author, subject, and time period. Using the Project Gutenberg database, we found that the generalised Jensen–Shannon divergence applied to word frequencies performed strongly across all tasks, that D's based on vector embedding representations led to stronger performance for smaller texts, and that the optimal choice of approach was ultimately task-dependent. We also investigated, both analytically and numerically, the behaviour of the different D's when the two texts varied in length by a factor h. We demonstrated that the (natural) estimator of the Jaccard distance between vocabularies was inconsistent and computed explicitly the h-dependency of the bias of the estimator of the generalised Jensen–Shannon divergence applied to word frequencies. We also found numerically that the Jensen–Shannon divergence and embedding-based approaches were robust to changes in h, while the Jaccard distance was not. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
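Two of the dissimilarity measures named in the abstract above can be sketched in a few lines. The block computes the Jaccard distance between two vocabularies and the standard Jensen-Shannon divergence between two word frequency distributions (in bits). This is the ordinary divergence, not the generalised variant studied in the paper, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def jaccard_distance(tokens_a, tokens_b):
    """1 - |A ∩ B| / |A ∪ B| over the two vocabularies."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    if not vocab_a and not vocab_b:
        return 0.0
    return 1 - len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


def word_distribution(tokens):
    """Map each word to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def entropy(dist):
    """Shannon entropy in bits of a {word: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def jensen_shannon_divergence(tokens_a, tokens_b):
    """Standard JSD: H(M) - (H(P) + H(Q)) / 2 with M the equal mixture."""
    p = word_distribution(tokens_a)
    q = word_distribution(tokens_b)
    mixture = {
        word: 0.5 * p.get(word, 0.0) + 0.5 * q.get(word, 0.0)
        for word in set(p) | set(q)
    }
    return entropy(mixture) - 0.5 * entropy(p) - 0.5 * entropy(q)


# Toy texts, invented for illustration only.
text_a = "the cat sat on the mat".split()
text_b = "the dog sat on the log".split()
print(jaccard_distance(text_a, text_b))
print(jensen_shannon_divergence(text_a, text_b))
```

Both functions treat the two texts symmetrically; the length sensitivity discussed in the abstract (the factor h) would show up when one token list is much longer than the other.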
50. NEW LIGHT ON THE ADDITAMENTVM ALDINVM (SILIUS ITALICUS, PVNICA 8.144–223).
- Author
-
Nagy, Benjamin C. and Lee, †Janice M.
- Subjects
- *
INTUITION , *AUTHENTICITY (Philosophy) , *ATTRIBUTION of authorship , *HUMANISTS , *SCHOLARS , *STYLOMETRY - Abstract
The authenticity of the Additamentum Aldinum (Sil. Pun. 8.144–223) has long been a matter of debate. While many scholars have expressed doubts that it is by Silius and suggest rather that it is from the hands of a skilful humanist, it has not, up to this time, been possible to provide solid evidence to support their intuition. This paper not only re-examines the standard arguments for and against authenticity but brings the latest computational stylometric techniques to bear on the question. These analyses reveal that the style of the Additamentum differs in statistically significant terms from the rest of Silius' Punica. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF