Author: "Caines, Andrew" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Caines, Andrew"' showing total 133 results

Start Over Author "Caines, Andrew"

133 results on '"Caines, Andrew"'

1. From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Author: Goriely, Zébulon, Martinez, Richard Diehl, Caines, Andrew, Beinborn, Lisa, and Buttery, Paula
Subjects: Computer Science - Computation and Language
Abstract: Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
Published: 2024

2. Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Author: Martinez, Richard Diehl, Goriely, Zebulon, Caines, Andrew, Buttery, Paula, and Beinborn, Lisa
Subjects: Computer Science - Computation and Language
Abstract: Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend to not generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been discovered to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensional cone, rather than spreading out over their representational capacity. Our work introduces a method for quantifying the frequency bias of a language model by assessing sentence-level perplexity with respect to token-level frequency. We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training. Our Syntactic Smoothing method adjusts the maximum likelihood objective function to distribute the learning signal to syntactically similar tokens. This approach results in better performance on infrequent English tokens and a decrease in anisotropy. We empirically show that the degree of anisotropy in a model correlates with its frequency bias.
Published: 2024

3. Grammatical Error Correction for Code-Switched Sentences by Learners of English

Author: Chan, Kelvin Wey Han, Bryant, Christopher, Nguyen, Li, Caines, Andrew, and Yuan, Zheng
Subjects: Computer Science - Computation and Language
Abstract: Code-switching (CSW) is a common phenomenon among multilingual speakers where multiple languages are used in a single discourse or utterance. Mixed language utterances may still contain grammatical errors however, yet most existing Grammar Error Correction (GEC) systems have been trained on monolingual data and not developed with CSW in mind. In this work, we conduct the first exploration into the use of GEC systems on CSW text. Through this exploration, we propose a novel method of generating synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora. We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints, and identify how they affect the performance of GEC systems on CSW text. Our best model achieves an average increase of 1.57 $F_{0.5}$ across 3 CSW test sets (English-Chinese, English-Korean and English-Japanese) without affecting the model's performance on a monolingual dataset. We furthermore discovered that models trained on one CSW language generalise relatively well to other typologically similar CSW languages.
Published: 2024

4. CLIMB: Curriculum Learning for Infant-inspired Model Building

Author: Martinez, Richard Diehl, Goriely, Zebulon, McGovern, Hope, Davis, Christopher, Caines, Andrew, Buttery, Paula, and Beinborn, Lisa
Subjects: Computer Science - Computation and Language
Abstract: We describe our team's contribution to the STRICT-SMALL track of the BabyLM Challenge. The challenge requires training a language model from scratch using only a relatively small training dataset of ten million words. We experiment with three variants of cognitively-motivated curriculum learning and analyze their effect on the performance of the model on linguistic evaluation tasks. In the vocabulary curriculum, we analyze methods for constraining the vocabulary in the early stages of training to simulate cognitively more plausible learning curves. In the data curriculum experiments, we vary the order of the training instances based on i) infant-inspired expectations and ii) the learning behavior of the model. In the objective curriculum, we explore different variations of combining the conventional masked language modeling task with a more coarse-grained word class prediction task to reinforce linguistic generalization capabilities. Our results did not yield consistent improvements over our own non-curriculum learning baseline across a range of linguistic benchmarks; however, we do find marginal gains on select tasks. Our analysis highlights key takeaways for specific combinations of tasks and settings which benefit from our proposed curricula. We moreover determine that careful selection of model architecture, and training hyper-parameters yield substantial improvements over the default baselines provided by the BabyLM challenge.
Published: 2023

5. On the application of Large Language Models for language teaching and assessment technology

Author: Caines, Andrew, Benedetto, Luca, Taslimipoor, Shiva, Davis, Christopher, Gao, Yuan, Andersen, Oeistein, Yuan, Zheng, Elliott, Mark, Moore, Russell, Bryant, Christopher, Rei, Marek, Yannakoudakis, Helen, Mullooly, Andrew, Nicholls, Diane, and Buttery, Paula
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The recent release of very large language models such as PaLM and GPT-4 has made an unprecedented impact in the popular media and public consciousness, giving rise to a mixture of excitement and fear as to their capabilities and potential uses, and shining a light on natural language processing research which had not previously received so much attention. The developments offer great promise for education technology, and in this paper we look specifically at the potential for incorporating large language models in AI-driven language teaching and assessment systems. We consider several research areas and also discuss the risks and ethical considerations surrounding generative AI in education technology for language learners. Overall we find that larger language models offer improvements over previous models in text generation, opening up routes toward content generation which had not previously been plausible. For text generation they must be prompted carefully and their outputs may need to be reshaped before they are ready for use. For automated grading and grammatical error correction, tasks whose progress is checked on well-known benchmarks, early investigations indicate that large language models on their own do not improve on state-of-the-art results according to standard evaluation metrics. For grading it appears that linguistic features established in the literature should still be used for best performance, and for error correction it may be that the models can offer alternative feedback styles which are not measured sensitively with existing methods. In all cases, there is work to be done to experiment with the inclusion of large language models in education technology for language learners, in order to properly understand and report on their capacities and limitations, and to ensure that foreseeable risks such as misinformation and harmful bias are mitigated., Comment: Accepted at the AIED2023 workshop: Empowering Education with LLMs - the Next-Gen Interface and Content Generation
Published: 2023

6. Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers

Author: Bujel, Kamil, Caines, Andrew, Yannakoudakis, Helen, and Rei, Marek
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Long-sequence transformers are designed to improve the representation of longer texts by language models and their performance on downstream document-level tasks. However, not much is understood about the quality of token-level predictions in long-form models. We investigate the performance of such architectures in the context of document classification with unsupervised rationale extraction. We find standard soft attention methods to perform significantly worse when combined with the Longformer language model. We propose a compositional soft attention architecture that applies RoBERTa sentence-wise to extract plausible rationales at the token-level. We find this method to significantly outperform Longformer-driven baselines on sentiment classification datasets, while also exhibiting significantly lower runtimes.
Published: 2023

7. Recurrent Neural Collaborative Filtering for Knowledge Tracing

Author: Moore, Russell, Caines, Andrew, Buttery, Paula, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Olney, Andrew M., editor, Chounta, Irene-Angelica, editor, Liu, Zitao, editor, Santos, Olga C., editor, and Bittencourt, Ig Ibert, editor
Published: 2024
Full Text: View/download PDF

8. Workshop on Automatic Evaluation of Learning and Assessment Content

Author: Benedetto, Luca, Taslimipoor, Shiva, Caines, Andrew, Galvan-Sosa, Diana, Dueñas, George, Loukina, Anastassia, Zesch, Torsten, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Olney, Andrew M., editor, Chounta, Irene-Angelica, editor, Liu, Zitao, editor, Santos, Olga C., editor, and Bittencourt, Ig Ibert, editor
Published: 2024
Full Text: View/download PDF

9. Probing for targeted syntactic knowledge through grammatical error detection

Author: Davis, Christopher, Bryant, Christopher, Caines, Andrew, Rei, Marek, and Buttery, Paula
Subjects: Computer Science - Computation and Language
Abstract: Targeted studies testing knowledge of subject-verb agreement (SVA) indicate that pre-trained language models encode syntactic information. We assert that if models robustly encode subject-verb agreement, they should be able to identify when agreement is correct and when it is incorrect. To that end, we propose grammatical error detection as a diagnostic probe to evaluate token-level contextual representations for their knowledge of SVA. We evaluate contextual representations at each layer from five pre-trained English language models: BERT, XLNet, GPT-2, RoBERTa, and ELECTRA. We leverage public annotated training data from both English second language learners and Wikipedia edits, and report results on manually crafted stimuli for subject-verb agreement. We find that masked language models linearly encode information relevant to the detection of SVA errors, while the autoregressive models perform on par with our baseline. However, we also observe a divergence in performance when probes are trained on different training sets, and when they are evaluated on different syntactic constructions, suggesting the information pertaining to SVA error detection is not robustly encoded., Comment: CoNLL 2022
Published: 2022

10. Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

Author: Ali, Felermino D. M. A., Caines, Andrew, and Malavi, Jaimito L. A.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Major advancement in the performance of machine translation models has been made possible in part thanks to the availability of large-scale parallel corpora. But for most languages in the world, the existence of such corpora is rare. Emakhuwa, a language spoken in Mozambique, is like most African languages low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora including Emakhuwa already exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, which is a collection of texts from the Jehovah's Witness website and a variety of other sources including the African Story Book website, the Universal Declaration of Human Rights and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens of Emakhuwa and 877,595 word tokens in Portuguese. After normalization processes which remain to be completed, the corpus will be made freely available for research use.
Published: 2021

11. Skills Embeddings: A Neural Approach to Multicomponent Representations of Students and Tasks

Author: Moore, Russell, Caines, Andrew, Elliott, Mark, Zaidi, Ahm, Rice, Andrew, and Buttery, Paula
Abstract: Educational systems use models of student skill to inform decision-making processes. Defining such models manually is challenging due to the large number of relevant factors. We propose learning multidimensional representations (embeddings) from student activity data -- these are fixed-length real vectors with three desirable characteristics: co-location of similar students and items in a vector space; magnitude increases with skill, and that absence of a skill can be represented. Based on the Multi-component Latent Trait Model, we use a neural network with complementary trainable weights to learn these embeddings by back-propagation. We evaluate using synthetic student activity data that provides a ground truth of student skills in order to understand the impact of number of students, question items and knowledge components in the domain. We find that our data-mined parameter values can recreate the synthetic datasets up to the accuracy of the model that generated them, for domains with up to 10 simultaneously active knowledge components, which can be effectively mined using relatively small quantities of data (1000 students, 100 items). We describe a procedure to estimate the number of components in a domain, and propose a component-masking logic mechanism that improves performance on high-dimensional datasets. [For the full proceedings, see ED599096.]
Published: 2019

12. Treatment of olecranon fractures in older individuals: a cross-sectional survey of surgeon treatment preferences

Author: Woolnough, Taylor, Caines, Andrew M., Pollock, JW., and Papp, Steven R.
Published: 2024
Full Text: View/download PDF

13. The Teacher-Student Chatroom Corpus

Author: Caines, Andrew, Yannakoudakis, Helen, Edmondson, Helena, Allen, Helen, Pérez-Paredes, Pascual, Byrne, Bill, and Buttery, Paula
Subjects: Computer Science - Computation and Language
Abstract: The Teacher-Student Chatroom Corpus (TSCC) is a collection of written conversations captured during one-to-one lessons between teachers and learners of English. The lessons took place in an online chatroom and therefore involve more interactive, immediate and informal language than might be found in asynchronous exchanges such as email correspondence. The fact that the lessons were one-to-one means that the teacher was able to focus exclusively on the linguistic abilities and errors of the student, and to offer personalised exercises, scaffolding and correction. The TSCC contains more than one hundred lessons between two teachers and eight students, amounting to 13.5K conversational turns and 133K words: it is freely available for research use. We describe the corpus design, data collection procedure and annotations added to the text. We perform some preliminary descriptive analyses of the data and consider possible uses of the TSCC., Comment: NLP4CALL
Published: 2020

14. Adaptive Forgetting Curves for Spaced Repetition Language Learning

Author: Zaidi, Ahmed, Caines, Andrew, Moore, Russell, Buttery, Paula, and Rice, Andrew
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The forgetting curve has been extensively explored by psychologists, educationalists and cognitive scientists alike. In the context of Intelligent Tutoring Systems, modelling the forgetting curve for each user and knowledge component (e.g. vocabulary word) should enable us to develop optimal revision strategies that counteract memory decay and ensure long-term retention. In this study we explore a variety of forgetting curve models incorporating psychological and linguistic features, and we use these models to predict the probability of word recall by learners of English as a second language. We evaluate the impact of the models and their features using data from an online vocabulary teaching platform and find that word complexity is a highly informative feature which may be successfully learned by a neural network model., Comment: Artificial Intelligence for Education 2020 (AIED)
Published: 2020

15. Lateral Patellar Instability

Author: Drapeau-Zgoralski, Véronique, Swift, Brendan, Caines, Andrew, Kerrigan, Alicia, Carsen, Sasha, and Pickell, Michael
Published: 2023
Full Text: View/download PDF

16. Prompting open-source and commercial language models for grammatical error correction of English learner text

Author: Davis, Christopher, Caines, Andrew, Andersen, Øistein, Taslimipoor, Shiva, Yannakoudakis, Helen, Yuan, Zheng, Bryant, Christopher, Rei, Marek, Buttery, Paula, Davis, Christopher, Caines, Andrew, Andersen, Øistein, Taslimipoor, Shiva, Yannakoudakis, Helen, Yuan, Zheng, Bryant, Christopher, Rei, Marek, and Buttery, Paula
Abstract: Thanks to recent advances in generative AI, we are able to prompt large language models (LLMs) to produce texts which are fluent and grammatical. In addition, it has been shown that we can elicit attempts at grammatical error correction (GEC) from LLMs when prompted with ungrammatical input sentences. We evaluate how well LLMs can perform at GEC by measuring their performance on established benchmark datasets. We go beyond previous studies, which only examined GPT* models on a selection of English GEC datasets, by evaluating seven open-source and three commercial LLMs on four established GEC benchmarks. We investigate model performance and report results against individual error types. Our results indicate that LLMs do not always outperform supervised English GEC models except in specific contexts -- namely commercial LLMs on benchmarks annotated with fluency corrections as opposed to minimal edits. We find that several open-source models outperform commercial ones on minimal edit benchmarks, and that in some settings zero-shot prompting is just as competitive as few-shot prompting., Comment: 8 pages with appendices
Published: 2024

17. The Cross-Linguistic Performance of Word Segmentation Models over Time

Author: Caines, Andrew, Altmann-Richer, Emma, and Buttery, Paula
Abstract: We select three word segmentation models with psycholinguistic foundations -- transitional probabilities, the diphone-based segmenter, and PUDDLE -- which track phoneme co-occurrence and positional frequencies in input strings, and in the case of PUDDLE build lexical and diphone inventories. The models are evaluated on caregiver utterances in 132 CHILDES corpora representing 28 languages and 11.9 m words. PUDDLE shows the best performance overall, albeit with wide cross-linguistic variation. We explore the reasons for this variation, fitting regression models to performance scores with linguistic properties which capture lexico-phonological characteristics of the input: word length, utterance length, diversity in the lexicon, the frequency of one-word utterances, the regularity of phoneme patterns at word boundaries, and the distribution of diphones in each language. These properties together explain four-tenths of the observed variation in segmentation performance, a strong outcome and a solid foundation for studying further variables which make the segmentation task difficult.
Published: 2019
Full Text: View/download PDF

18. Behavioural Cloning of Teachers for Automatic Homework Selection

Author: Moore, Russell, Caines, Andrew, Rice, Andrew, Buttery, Paula, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Pandu Rangan, C., Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Isotani, Seiji, editor, Millán, Eva, editor, Ogan, Amy, editor, Hastings, Peter, editor, McLaren, Bruce, editor, and Luckin, Rose, editor
Published: 2019
Full Text: View/download PDF

19. Characterizing Eve: Analysing Cybercrime Actors in a Large Underground Forum

Author: Pastrana, Sergio, Hutchings, Alice, Caines, Andrew, Buttery, Paula, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Weikum, Gerhard, Series Editor, Bailey, Michael, editor, Holz, Thorsten, editor, Stamatogiannakis, Manolis, editor, and Ioannidis, Sotiris, editor
Published: 2018
Full Text: View/download PDF

20. Is it true that most intra-articular glenoid fractures should have surgery?

Author: Willms, Scott, Caines, Andrew, and Buckley, Richard
Published: 2024
Full Text: View/download PDF

21. Word segmentation from transcriptions of child-directed speech using lexical and sub-lexical cues

Author: GORIELY, Zébulon, primary, CAINES, Andrew, additional, and BUTTERY, Paula, additional
Published: 2023
Full Text: View/download PDF

22. Adaptive Forgetting Curves for Spaced Repetition Language Learning

Author: Zaidi, Ahmed, primary, Caines, Andrew, additional, Moore, Russell, additional, Buttery, Paula, additional, and Rice, Andrew, additional
Published: 2020
Full Text: View/download PDF

23. TEAM: A Low-Cost Alternative to ATLS for Providing Trauma Care Teaching in Haiti

Author: Kurdin, Anton, Caines, Andrew, Boone, Darrell, and Furey, Andrew
Published: 2018
Full Text: View/download PDF

24. You talking to me? : zero auxiliary constructions in British English

Author: Caines, Andrew Paul
Subjects: 400, English language--Auxiliary verbs, English language--Data processing, English language--Discourse analysis--Data processing
Published: 2011

25. Automated hate speech detection and span extraction in underground hacking and extremist forums

Author: Zhou, Linda, Caines, Andrew, Pete, Ildiko, Hutchings, Alice, Caines, Andrew [0000-0001-9647-4902], and Apollo - University of Cambridge Repository
Subjects: Linguistics and Language, Artificial Intelligence, Text classification, Corpus annotation, Span extraction, Language and Linguistics, Software
Abstract: Hate speech is any kind of communication that attacks a person or a group based on their characteristics, such as gender, religion and race. Due to the availability of online platforms where people can express their (hateful) opinions, the amount of hate speech is steadily increasing that often leads to offline hate crimes. This paper focuses on understanding and detecting hate speech in underground hacking and extremist forums where cybercriminals and extremists, respectively, communicate with each other, and some of them are associated with criminal activity. Moreover, due to the lengthy posts, it would be beneficial to identify the specific span of text containing hateful content in order to assist site moderators with the removal of hate speech. This paper describes a hate speech dataset composed of posts extracted from HackForums, an online hacking forum, and Stormfront and Incels.co, two extremist forums. We combined our dataset with a Twitter hate speech dataset to train a multi-platform classifier. Our evaluation shows that a classifier trained on multiple sources of data does not always improve the performance compared to a mono-platform classifier. Finally, this is the first work on extracting hate speech spans from longer texts. The paper fine-tunes BERT (Bidirectional Encoder Representations from Transformers) and adopts two approaches – span prediction and sequence labelling. Both approaches successfully extract hateful spans and achieve an F1-score of at least 69%.
Published: 2022

26. Do we fix front and back of every APC pelvic injury?

Author: Kendall, John, Caines, Andrew, and Buckley, Richard
Published: 2024
Full Text: View/download PDF

27. Shibboleth: An agent-based model of signalling mimicry

Author: Goodman, Jonathan R., primary, Caines, Andrew, additional, and Foley, Robert A., additional
Published: 2023
Full Text: View/download PDF

28. Argot as a Trust Signal: Slang, Jargon & Reputation on a Large Cybercrime Forum

Author: Hughes, Jack, Caines, Andrew, and Hutchings, Alice
Abstract: We apply signalling theory to a cybercrime forum to explore how argot (slang and jargon) is used to signal trust in untrustworthy environments. We develop an argot detection tool, using word embeddings from forum and non-forum datasets, which are aligned using training annotations. Compared with prior work, our approach improves performance, with an increase in the F1 and accuracy scores. Using the detected argot to create per-user variables, we find a negative correlation between the use of argot and reputation votes. We explore the trajectories of groups of forum members to observe how the use of argot and user reputation in the forum varies over time. Our findings indicate forum users are using argot to overcome the cold start problem, a conundrum faced by new users to social networks with ranking systems and marketplaces with feedback systems. A significant group of long-standing users is characterised by high levels of argot in their early forum postings. This decreases once reputation metrics increase. This particular trajectory group are amongst the highest-rated long-term members.
Published: 2023
Full Text: View/download PDF

29. Incremental Dependency Parsing and Disfluency Detection in Spoken Learner English

Author: Moore, Russell, Caines, Andrew, Graham, Calbert, Buttery, Paula, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Král, Pavel, editor, and Matoušek, Václav, editor
Published: 2015
Full Text: View/download PDF

30. MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection

Author: Volodina, Elena, primary, Bryant, Christopher, additional, Caines, Andrew, additional, De Clercq, Orphée, additional, Frey, Jennifer-Carmen, additional, Ershova, Elizaveta, additional, Rosen, Alexandr, additional, and Vinogradova, Olga, additional
Published: 2023
Full Text: View/download PDF

31. ‘You Still Talking to Me?’‘You Still Talking to Me?’

Author: Caines, Andrew, primary, McCarthy, Michael, additional, and Buttery, Paula, additional
Published: 2018
Full Text: View/download PDF

32. Characterizing Eve: Analysing Cybercrime Actors in a Large Underground Forum

Author: Pastrana, Sergio, primary, Hutchings, Alice, additional, Caines, Andrew, additional, and Buttery, Paula, additional
Published: 2018
Full Text: View/download PDF

33. Automated hate speech detection and span extraction in underground hacking and extremist forums.

Author: Zhou, Linda, Caines, Andrew, Pete, Ildiko, and Hutchings, Alice
Subjects: LANGUAGE models, HATE speech, HATE crimes, DEAF children
Abstract: Hate speech is any kind of communication that attacks a person or a group based on their characteristics, such as gender, religion and race. Due to the availability of online platforms where people can express their (hateful) opinions, the amount of hate speech is steadily increasing that often leads to offline hate crimes. This paper focuses on understanding and detecting hate speech in underground hacking and extremist forums where cybercriminals and extremists, respectively, communicate with each other, and some of them are associated with criminal activity. Moreover, due to the lengthy posts, it would be beneficial to identify the specific span of text containing hateful content in order to assist site moderators with the removal of hate speech. This paper describes a hate speech dataset composed of posts extracted from HackForums, an online hacking forum, and Stormfront and Incels.co, two extremist forums. We combined our dataset with a Twitter hate speech dataset to train a multi-platform classifier. Our evaluation shows that a classifier trained on multiple sources of data does not always improve the performance compared to a mono-platform classifier. Finally, this is the first work on extracting hate speech spans from longer texts. The paper fine-tunes BERT (Bidirectional Encoder Representations from Transformers) and adopts two approaches – span prediction and sequence labelling. Both approaches successfully extract hateful spans and achieve an F1-score of at least 69%. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

34. A Survey on Recent Approaches to Question Difficulty Estimation from Text.

Author: BENEDETTO, LUCA, CREMONESI, PAOLO, CAINES, ANDREW, BUTTERY, PAULA, CAPPELLI, ANDREA, GIUSSANI, ANDREA, and TURRIN, ROBERTO
Subjects: NATURAL language processing
Abstract: Question Difficulty Estimation from Text (QDET) is the application of Natural Language Processing techniques to the estimation of a value, either numerical or categorical, which represents the difficulty of questions in educational settings. We give an introduction to the field, build a taxonomy based on question characteristics, and present the various approaches that have been proposed in recent years, outlining opportunities for further research. This survey provides an introduction for researchers and practitioners into the domain of question difficulty estimation from text and acts as a point of reference about recent research in this topic to date. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

35. Open Reduction Internal Fixation Versus Distal Femoral Replacement (DFR) for Treatment of OTA/AO 33C Fractures in the Elderly: A Review of Functional Outcomes and Cost Analysis

Author: Caines, Andrew, primary, Adamczyk, Andrew, additional, Mahaffey, Ryan, additional, and Pickell, Michael, additional
Published: 2023
Full Text: View/download PDF

36. CLIMB – Curriculum Learning for Infant-inspired Model Building

Author: Martinez, Richard Diehl, primary, McGovern, Hope, additional, Goriely, Zebulon, additional, Davis, Christopher, additional, Caines, Andrew, additional, Buttery, Paula, additional, and Beinborn, Lisa, additional
Published: 2023
Full Text: View/download PDF

37. Automatically identifying the function and intent of posts in underground forums

Author: Caines, Andrew, Pastrana, Sergio, Hutchings, Alice, and Buttery, Paula J.
Published: 2018
Full Text: View/download PDF

38. The Teacher-Student Chatroom Corpus version 2: more lessons, new annotation, automatic detection of sequence shifts

Author: Caines, Andrew, primary, Yannakoudakis, Helen, additional, Allen, Helen, additional, Pérez-Paredes, Pascual, additional, Byrne, Bill, additional, and Buttery, Paula, additional
Published: 2022
Full Text: View/download PDF

39. PostCog: A tool for interdisciplinary research into underground forums at scale

Author: Pete, Ildiko, primary, Hughes, Jack, additional, Caines, Andrew, additional, Vu, Anh V., additional, Gupta, Harshad, additional, Hutchings, Alice, additional, Anderson, Ross, additional, and Buttery, Paula, additional
Published: 2022
Full Text: View/download PDF

40. The Specificity and Helpfulness of Peer-to-Peer Feedback in Higher Education

Author: Rietsche, Roman, Caines, Andrew, Schramm, Cornelius, Pfütze, Dominik, and Buttery, Paula
Subjects: information management, Peer feedback, Sentence Specificity, BERT
Abstract: With the growth of online learning through MOOCs and other educational applications, it has become increasingly difficult for course providers to offer personalized feedback to students. Therefore asking students to provide feedback to each other has become one way to support learning. This peer-to-peer feedback has become increasingly important whether in MOOCs to provide feedback to thousands of students or in large-scale classes at universities. One of the challenges when allowing peer-to-peer feedback is that the feedback should be perceived as helpful, and an import factor determining helpfulness is how specific the feedback is. However, in classes including thousands of students, instructors do not have the resources to check the specificity of every piece of feedback between students. Therefore, we present an automatic classification model to measure sentence specificity in written feedback. The model was trained and tested on student feedback texts written in German where sentences have been labelled as general or specific. We find that we can automatically classify the sentences with an accuracy of 76.7% using a conventional feature-based approach, whereas transfer learning with BERT for German gives a classification accuracy of 81.1%. However, the feature-based approach comes with lower computational costs and preserves human interpretability of the coefficients. In addition we show that specificity of sentences in feedback texts has a weak positive correlation with perceptions of helpfulness. This indicates that specificity is one of the ingredients of good feedback, and invites further investigation.
Published: 2022

41. ALEN App: Argumentative Writing Support To Foster English Language Learning

Author: Wambsganss, Thiemo, Caines, Andrew, Buttery, Paula, Thiemo Wambsganss, Andrew Caines, and Paula Buttery
Abstract: This paper introduces a novel tool to support and engage English language learners with feedback on the quality of their argument structures. We present an approach which automatically detects claim-premise structures and provides visual feedback to the learner to prompt them to repair any broken argumentation structures.To investigate, if our persuasive feedback on language learners’ essay writing tasks engages and supports them in learning better English language, we designed the ALEN app (Argumentation for Learning English). We leverage an argumentation mining model trained on texts written by students and embed it in a writing support tool which provides students with feedback in their essay writing process. We evaluated our tool in two field-studies with a total of 28 students from a German high school to investigate the effects of adaptive argumentation feedback on their learning of English. The quantitative results suggest that using the ALEN app leads to a high self-efficacy, ease-of-use, intention to use and perceived usefulness for students in their English language learning process. Moreover, the qualitative answers indicate the potential benefits of combining grammar feedback with discourse level argumentation mining.
Published: 2022

42. Towards an open-domain chatbot for language practice

Author: Tyen, Gladys, primary, Brenchley, Mark, additional, Caines, Andrew, additional, and Buttery, Paula, additional
Published: 2022
Full Text: View/download PDF

43. A unified framework for cross-domain and cross-task learning of mental health conditions

Author: Chua, Huikai, primary, Caines, Andrew, additional, and Yannakoudakis, Helen, additional
Published: 2022
Full Text: View/download PDF

44. Probing for targeted syntactic knowledge through grammatical error detection

Author: Davis, Christopher, primary, Bryant, Christopher, additional, Caines, Andrew, additional, Rei, Marek, additional, and Buttery, Paula, additional
Published: 2022
Full Text: View/download PDF

45. Incremental Dependency Parsing and Disfluency Detection in Spoken Learner English

Author: Moore, Russell, primary, Caines, Andrew, additional, Graham, Calbert, additional, and Buttery, Paula, additional
Published: 2015
Full Text: View/download PDF

46. Listening practice for learners of English: towards an intelligent tutoring system

Author: Caines, Andrew, primary, Waters, Joseph, additional, Xu, Sherry, additional, Elliott, Mark, additional, Lee, Hye-won, additional, Moore, Russell, additional, Bozic, Mirjana, additional, and Buttery, Paula, additional
Published: 2021
Full Text: View/download PDF

47. You talking to me? Corpus and experimental data on the zero auxiliary interrogative in British English

Author: Caines, Andrew, primary
Published: 2012
Full Text: View/download PDF

48. Normalising frequency counts to account for 'opportunity of use' in learner corpora

Author: Buttery, Paula, primary and Caines, Andrew, additional
Published: 2012
Full Text: View/download PDF

49. The Teacher-Student Chatroom Corpus

Author: Caines, Andrew, primary, Yannakoudakis, Helen, additional, Edmondson, Helena, additional, Allen, Helen, additional, Pérez-Paredes, Pascual, additional, Byrne, Bill, additional, and Buttery, Paula, additional
Published: 2020
Full Text: View/download PDF

50. Collecting the Teacher-Student Chatroom Corpus

Author: Caines, Andrew, primary, Yannakoudakis, Helen, additional, Edmondson, Helena, additional, Allen, Helen, additional, Pérez-Paredes, Pascual, additional, Byrne, Bill, additional, and Buttery, Paula, additional
Published: 2020
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

133 results on '"Caines, Andrew"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources