Author: "Mahamood, Saad" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Mahamood, Saad" returned 30 results

1. Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices

Author: Schmidtová, Patrícia, Mahamood, Saad, Balloccu, Simone, Dušek, Ondřej, Gatt, Albert, Gkatzia, Dimitra, Howcroft, David M., Plátek, Ondřej, and Sivaprasad, Adarsa
Subjects: Computer Science - Computation and Language
Abstract: Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation (NLG) tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.
Comment: Accepted to INLG 2024
Published: 2024

2. On the Role of Summary Content Units in Text Summarization Evaluation

Author: Nawrath, Marcel, Nowak, Agnieszka, Ratz, Tristan, Walenta, Danilo C., Opitz, Juri, Ribeiro, Leonardo F. R., Sedoc, João, Deutsch, Daniel, Mille, Simon, Liu, Yixin, Zhang, Lining, Gehrmann, Sebastian, Mahamood, Saad, Clinciu, Miruna, Chandu, Khyathi, and Hou, Yufang
Subjects: Computer Science - Computation and Language
Abstract: At the heart of the Pyramid evaluation method for text summarization lie human written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages? ii) Under which conditions are SCUs (or their approximations) offering the most value? In this work, we examine two novel strategies to approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries.
Comment: 10 Pages, 3 Figures, 3 Tables, camera ready version accepted at NAACL 2024
Published: 2024

3. Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Author: Belz, Anya, Thomson, Craig, Reiter, Ehud, Abercrombie, Gavin, Alonso-Moral, Jose M., Arvan, Mohammad, Braggaar, Anouck, Cieliebak, Mark, Clark, Elizabeth, van Deemter, Kees, Dinkar, Tanvi, Dušek, Ondřej, Eger, Steffen, Fang, Qixiang, Gao, Mingqi, Gatt, Albert, Gkatzia, Dimitra, González-Corbelle, Javier, Hovy, Dirk, Hürlimann, Manuela, Ito, Takumi, Kelleher, John D., Klubicka, Filip, Krahmer, Emiel, Lai, Huiyuan, van der Lee, Chris, Li, Yiru, Mahamood, Saad, Mieskes, Margot, van Miltenburg, Emiel, Mosteiro, Pablo, Nissim, Malvina, Parde, Natalie, Plátek, Ondřej, Rieser, Verena, Ruan, Jie, Tetreault, Joel, Toral, Antonio, Wan, Xiaojun, Wanner, Leo, Watson, Lewis, and Yang, Diyi
Subjects: Computer Science - Computation and Language, 68, I.2.7
Abstract: We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
Comment: 5 pages plus appendix, 4 tables, 1 figure. To appear at "Workshop on Insights from Negative Results in NLP" (co-located with EACL2023). Updated author list and acknowledgements
Published: 2023

4. Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization

Author: Zhang, Lining, Mille, Simon, Hou, Yufang, Deutsch, Daniel, Clark, Elizabeth, Liu, Yixin, Mahamood, Saad, Gehrmann, Sebastian, Clinciu, Miruna, Chandu, Khyathi, and Sedoc, João
Subjects: Computer Science - Computation and Language
Abstract: To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and CloudResearch workers, their alignment with expert judgments on a subset of the data is not as expected and needs further training in correctness. This paper still serves as a best practice for the recruitment of qualified annotators in other challenging annotation tasks.
Published: 2022

5. GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Author: Gehrmann, Sebastian, Bhattacharjee, Abhik, Mahendiran, Abinaya, Wang, Alex, Papangelis, Alexandros, Madaan, Aman, McMillan-Major, Angelina, Shvets, Anna, Upadhyay, Ashish, Yao, Bingsheng, Wilie, Bryan, Bhagavatula, Chandra, You, Chaobin, Thomson, Craig, Garbacea, Cristina, Wang, Dakuo, Deutsch, Daniel, Xiong, Deyi, Jin, Di, Gkatzia, Dimitra, Radev, Dragomir, Clark, Elizabeth, Durmus, Esin, Ladhak, Faisal, Ginter, Filip, Winata, Genta Indra, Strobelt, Hendrik, Hayashi, Hiroaki, Novikova, Jekaterina, Kanerva, Jenna, Chim, Jenny, Zhou, Jiawei, Clive, Jordan, Maynez, Joshua, Sedoc, João, Juraska, Juraj, Dhole, Kaustubh, Chandu, Khyathi Raghavi, Perez-Beltrachini, Laura, Ribeiro, Leonardo F. R., Tunstall, Lewis, Zhang, Li, Pushkarna, Mahima, Creutz, Mathias, White, Michael, Kale, Mihir Sanjay, Eddine, Moussa Kamal, Daheim, Nico, Subramani, Nishant, Dusek, Ondrej, Liang, Paul Pu, Ammanamanchi, Pawan Sasanka, Zhu, Qi, Puduppully, Ratish, Kriz, Reno, Shahriyar, Rifat, Cardenas, Ronald, Mahamood, Saad, Osei, Salomey, Cahyawijaya, Samuel, Štajner, Sanja, Montella, Sebastien, Jolly, Shailza, Mille, Simon, Hasan, Tahmid, Shen, Tianhao, Adewumi, Tosin, Raunak, Vikas, Raheja, Vipul, Nikolaev, Vitaly, Tsai, Vivian, Jernite, Yacine, Xu, Ying, Sang, Yisi, Liu, Yixin, and Hou, Yufang
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
Published: 2022

6. NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Author: Dhole, Kaustubh D., Gangal, Varun, Gehrmann, Sebastian, Gupta, Aadesh, Li, Zhenhao, Mahamood, Saad, Mahendiran, Abinaya, Mille, Simon, Shrivastava, Ashish, Tan, Samson, Wu, Tongshuang, Sohl-Dickstein, Jascha, Choi, Jinho D., Hovy, Eduard, Dusek, Ondrej, Ruder, Sebastian, Anand, Sajant, Aneja, Nagender, Banjade, Rabin, Barthe, Lisa, Behnke, Hanna, Berlot-Attwell, Ian, Boyle, Connor, Brun, Caroline, Cabezudo, Marco Antonio Sobrevilla, Cahyawijaya, Samuel, Chapuis, Emile, Che, Wanxiang, Choudhary, Mukund, Clauss, Christian, Colombo, Pierre, Cornell, Filip, Dagan, Gautier, Das, Mayukh, Dixit, Tanay, Dopierre, Thomas, Dray, Paul-Alexis, Dubey, Suchitra, Ekeinhor, Tatiana, Di Giovanni, Marco, Goyal, Tanya, Gupta, Rishabh, Hamla, Louanes, Han, Sang, Harel-Canada, Fabrice, Honore, Antoine, Jindal, Ishan, Joniak, Przemyslaw K., Kleyko, Denis, Kovatchev, Venelin, Krishna, Kalpesh, Kumar, Ashutosh, Langer, Stefan, Lee, Seungjae Ryan, Levinson, Corey James, Liang, Hualou, Liang, Kaizhao, Liu, Zhexiong, Lukyanenko, Andrey, Marivate, Vukosi, de Melo, Gerard, Meoni, Simon, Meyer, Maxime, Mir, Afnan, Moosavi, Nafise Sadat, Muennighoff, Niklas, Mun, Timothy Sum Hon, Murray, Kenton, Namysl, Marcin, Obedkova, Maria, Oli, Priti, Pasricha, Nivranshu, Pfister, Jan, Plant, Richard, Prabhu, Vinay, Pais, Vasile, Qin, Libo, Raji, Shahab, Rajpoot, Pawan Kumar, Raunak, Vikas, Rinberg, Roy, Roberts, Nicolas, Rodriguez, Juan Diego, Roux, Claude, S., Vasconcellos P. H., Sai, Ananya B., Schmidt, Robin M., Scialom, Thomas, Sefara, Tshephisho, Shamsi, Saqib N., Shen, Xudong, Shi, Haoyue, Shi, Yiwen, Shvets, Anna, Siegel, Nick, Sileo, Damien, Simon, Jamie, Singh, Chandan, Sitelew, Roman, Soni, Priyank, Sorensen, Taylor, Soto, William, Srivastava, Aman, Srivatsa, KV Aditya, Sun, Tony, T, Mukund Varma, Tabassum, A, Tan, Fiona Anting, Teehan, Ryan, Tiwari, Mo, Tolkiehn, Marie, Wang, Athena, Wang, Zijian, Wang, Gloria, Wang, Zijie J., Wei, Fuxuan, Wilie, Bryan, Winata, Genta Indra, Wu, Xinyi, Wydmański, Witold, Xie, Tianbao, Yaseen, Usama, Yee, Michael A., Zhang, Jing, and Zhang, Yue
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).
Comment: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter
Published: 2021

7. Underreporting of errors in NLG output, and what to do about it

Author: van Miltenburg, Emiel, Clinciu, Miruna-Adriana, Dušek, Ondřej, Gkatzia, Dimitra, Inglis, Stephanie, Leppänen, Leo, Mahamood, Saad, Manning, Emma, Schoch, Stephanie, Thomson, Craig, and Wen, Luou
Subjects: Computer Science - Computation and Language
Abstract: We observe a severe under-reporting of the different kinds of errors that Natural Language Generation systems make. This is a problem, because mistakes are an important indicator of where systems should still be improved. If authors only report overall performance metrics, the research community is left in the dark about the specific weaknesses that are exhibited by 'state-of-the-art' research. Next to quantifying the extent of error under-reporting, this position paper provides recommendations for error identification, analysis and reporting.
Comment: Prefinal version, accepted for publication in the Proceedings of the 14th International Conference on Natural Language Generation (INLG 2021, Aberdeen). Comments welcome
Published: 2021

8. Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

Author: Mille, Simon, Dhole, Kaustubh D., Mahamood, Saad, Perez-Beltrachini, Laura, Gangal, Varun, Kale, Mihir, van Miltenburg, Emiel, and Gehrmann, Sebastian
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena or text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models.
Published: 2021

9. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

Author: Gehrmann, Sebastian, Adewumi, Tosin, Aggarwal, Karmanya, Ammanamanchi, Pawan Sasanka, Anuoluwapo, Aremu, Bosselut, Antoine, Chandu, Khyathi Raghavi, Clinciu, Miruna, Das, Dipanjan, Dhole, Kaustubh D., Du, Wanyu, Durmus, Esin, Dušek, Ondřej, Emezue, Chris, Gangal, Varun, Garbacea, Cristina, Hashimoto, Tatsunori, Hou, Yufang, Jernite, Yacine, Jhamtani, Harsh, Ji, Yangfeng, Jolly, Shailza, Kale, Mihir, Kumar, Dhruv, Ladhak, Faisal, Madaan, Aman, Maddela, Mounica, Mahajan, Khyati, Mahamood, Saad, Majumder, Bodhisattwa Prasad, Martins, Pedro Henrique, McMillan-Major, Angelina, Mille, Simon, van Miltenburg, Emiel, Nadeem, Moin, Narayan, Shashi, Nikolaev, Vitaly, Niyongabo, Rubungo Andre, Osei, Salomey, Parikh, Ankur, Perez-Beltrachini, Laura, Rao, Niranjan Ramesh, Raunak, Vikas, Rodriguez, Juan Diego, Santhanam, Sashank, Sedoc, João, Sellam, Thibault, Shaikh, Samira, Shimorina, Anastasia, Cabezudo, Marco Antonio Sobrevilla, Strobelt, Hendrik, Subramani, Nishant, Xu, Wei, Yang, Diyi, Yerukola, Akhila, and Zhou, Jiawei
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.
Published: 2021

10. Generating affective natural language for parents of neonatal infants

Author: Mahamood, Saad Ali
Subjects: 006.3, Computational linguistics, Neonatal intensive care, Natural language processing (Computer science)
Abstract: The thesis presented here describes original research in the field of Natural Language Generation (NLG). NLG is the subfield of artificial intelligence that is concerned with the automatic production of documents from underlying data. This thesis in particular focuses on developing novel methods for generating text that take the recipient’s level of stress into consideration as a factor in adapting the resultant textual output. Taking the recipient’s level of stress into account was particularly salient because of the domain in which this research was conducted: providing information for parents of pre-term infants in neonatal intensive care (NICU), a highly technical and stressful environment for parents, where emotional sensitivity must be shown in the nature of the information presented. We have investigated the emotional and informational needs of these parents through an extensive literature review and two separate research studies with former and current NICU parents. The NLG system built for this research was called BabyTalk Family (BT-Family), a system that can produce, for parents, a textual summary of the medical events that have occurred for a baby in NICU in the last twenty-four hours. The novelty of this system is that it is capable of estimating the level of stress of the recipient and, by using several affective NLG strategies, is able to tailor its output for a stressed audience, unlike traditional NLG systems, where the output would remain unchanged regardless of the emotional state of the recipient. The key innovation in this system was the integration of several affective strategies in the Document Planner for tailoring textual output for stressed recipients. BT-Family’s output was evaluated with thirteen parents who previously had a baby in neonatal care. We developed a methodology for an evaluation that involved a direct comparison between stressed and unstressed text for the same given medical scenario for variables such as preference, understandability, helpfulness, and emotional appropriateness. The results obtained showed that the parents overwhelmingly preferred the stressed text for all of the variables measured.
Published: 2010

11. Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

Author: Sub Natural Language Processing, Leerstoel Oberski, Methodology and statistics for the behavioural and social sciences, Dep Informatica, Natural Language Processing, Belz, Anya, Thomson, Craig, Reiter, Ehud, Abercrombie, Gavin, Alonso-Moral, Jose M., Arvan, Mohammad, Cheung, Jackie, Cieliebak, Mark, Clark, Elizabeth, Deemter, Kees van, Dinkar, Tanvi, Dušek, Ondřej, Eger, Steffen, Fang, Qixiang, Gatt, Albert, Gkatzia, Dimitra, González-Corbelle, Javier, Hovy, Dirk, Hürlimann, Manuela, Ito, Takumi, Kelleher, John D., Klubicka, Filip, Lai, Huiyuan, Lee, Chris van der, Miltenburg, Emiel van, Li, Yiru, Mahamood, Saad, Mieskes, Margot, Nissim, Malvina, Parde, Natalie, Plátek, Ondřej, Rieser, Verena, Romero, Pablo Mosteiro, Tetreault, Joel, Toral, Antonio, Wan, Xiaojun, Wanner, Leo, Watson, Lewis, and Yang, Diyi
Published: 2023

12. Barriers and enabling factors for error analysis in NLG research

Author: Van Miltenburg, Emiel, primary, Clinciu, Miruna, additional, Dušek, Ondřej, additional, Gkatzia, Dimitra, additional, Inglis, Stephanie, additional, Leppänen, Leo, additional, Mahamood, Saad, additional, Schoch, Stephanie, additional, Thomson, Craig, additional, and Wen, Luou, additional
Published: 2023

13. A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization

Author: Zhang, Lining, primary, Mille, Simon, additional, Hou, Yufang, additional, Deutsch, Daniel, additional, Clark, Elizabeth, additional, Liu, Yixin, additional, Mahamood, Saad, additional, Gehrmann, Sebastian, additional, Clinciu, Miruna, additional, Chandu, Khyathi Raghavi, additional, and Sedoc, João, additional
Published: 2023

14. Needle in a Haystack: An Analysis of Finding Qualified Workers on MTurk for Summarization

Author: Zhang, Lining, Sedoc, João, Mille, Simon, Hou, Yufang, Gehrmann, Sebastian, Deutsch, Daniel, Clark, Elizabeth, Liu, Yixin, Clinciu, Miruna, Mahamood, Saad, and Chandu, Khyathi
Subjects: FOS: Computer and information sciences, Computation and Language (cs.CL)
Abstract: The acquisition of high-quality human annotations through crowdsourcing platforms like Amazon Mechanical Turk (MTurk) is more challenging than expected. The annotation quality might be affected by various aspects such as annotation instructions, Human Intelligence Task (HIT) design, and wages paid to annotators. To avoid potentially low-quality annotations which could mislead the evaluation of automatic summarization system outputs, we investigate the recruitment of high-quality MTurk workers via a three-step qualification pipeline. We show that we can successfully filter out bad workers before they carry out the evaluations and obtain high-quality annotations while optimizing the use of resources. This paper can serve as a basis for the recruitment of qualified annotators in other challenging annotation tasks.
Published: 2022

15. GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Author: Gehrmann, Sebastian, primary, Bhattacharjee, Abhik, additional, Mahendiran, Abinaya, additional, Wang, Alex, additional, Papangelis, Alexandros, additional, Madaan, Aman, additional, McMillan-Major, Angelina, additional, Shvets, Anna, additional, Upadhyay, Ashish, additional, Bohnet, Bernd, additional, Yao, Bingsheng, additional, Wilie, Bryan, additional, Bhagavatula, Chandra, additional, You, Chaobin, additional, Thomson, Craig, additional, Garbacea, Cristina, additional, Wang, Dakuo, additional, Deutsch, Daniel, additional, Xiong, Deyi, additional, Jin, Di, additional, Gkatzia, Dimitra, additional, Radev, Dragomir, additional, Clark, Elizabeth, additional, Durmus, Esin, additional, Ladhak, Faisal, additional, Ginter, Filip, additional, Winata, Genta Indra, additional, Strobelt, Hendrik, additional, Hayashi, Hiroaki, additional, Novikova, Jekaterina, additional, Kanerva, Jenna, additional, Chim, Jenny, additional, Zhou, Jiawei, additional, Clive, Jordan, additional, Maynez, Joshua, additional, Sedoc, João, additional, Juraska, Juraj, additional, Dhole, Kaustubh, additional, Chandu, Khyathi Raghavi, additional, Beltrachini, Laura Perez, additional, Ribeiro, Leonardo F. R., additional, Tunstall, Lewis, additional, Zhang, Li, additional, Pushkarna, Mahim, additional, Creutz, Mathias, additional, White, Michael, additional, Kale, Mihir Sanjay, additional, Eddine, Moussa Kamal, additional, Daheim, Nico, additional, Subramani, Nishant, additional, Dusek, Ondrej, additional, Liang, Paul Pu, additional, Ammanamanchi, Pawan Sasanka, additional, Zhu, Qi, additional, Puduppully, Ratish, additional, Kriz, Reno, additional, Shahriyar, Rifat, additional, Cardenas, Ronald, additional, Mahamood, Saad, additional, Osei, Salomey, additional, Cahyawijaya, Samuel, additional, Štajner, Sanja, additional, Montella, Sebastien, additional, Jolly, Shailza, additional, Mille, Simon, additional, Hasan, Tahmid, additional, Shen, Tianhao, additional, Adewumi, Tosin, additional, Raunak, Vikas, additional, Raheja, Vipul, additional, Nikolaev, Vitaly, additional, Tsai, Vivian, additional, Jernite, Yacine, additional, Xu, Ying, additional, Sang, Yisi, additional, Liu, Yixin, additional, and Hou, Yufang, additional
Published: 2022

16. Underreporting of errors in NLG output, and what to do about it

Author: Miltenburg, Emiel, Clinciu, Miruna, Dusek, Ondrej, Gkatzia, Dimitra, Inglis, Stephanie, Leppänen, Leo, Mahamood, Saad, Manning, Emma, Schoch, Stephanie, Thomson, Craig, Wen, Luou, Department of Computer Science, Discovery Research Group/Prof. Hannu Toivonen, and Language, Communication and Cognition
Subjects: FOS: Computer and information sciences, evaluation, Computer Science - Computation and Language, 113 Computer and information sciences, Computation and Language (cs.CL), natural language generation
Abstract: We observe a severe under-reporting of the different kinds of errors that Natural Language Generation systems make. This is a problem, because mistakes are an important indicator of where systems should still be improved. If authors only report overall performance metrics, the research community is left in the dark about the specific weaknesses that are exhibited by 'state-of-the-art' research. Next to quantifying the extent of error under-reporting, this position paper provides recommendations for error identification, analysis and reporting.
Comment: Prefinal version, accepted for publication in the Proceedings of the 14th International Conference on Natural Language Generation (INLG 2021, Aberdeen). Comments welcome
Published: 2021

17. A Framework for Task-Sensitive Natural Language Augmentation.

Author: Dhole, Kaustubh D., Gangal, Varun, Gehrmann, Sebastian, Gupta, Aadesh, Li, Zhenhao, Mahamood, Saad, Mahendiran, Abinaya, Mille, Simon, Shrivastava, Ashish, Tan, Samson, Wu, Tongshuang, Sohl-Dickstein, Jascha, Choi, Jinho D., Hovy, Eduard, Dusek, Ondrej, Ruder, Sebastian, Anand, Sajant, Aneja, Nagender, Banjade, Rabin, and Barthe, Lisa
Subjects: Data mining, Natural language processing, Sociolinguistics, Syntax in programming languages, Language models, Robust control
Abstract: Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.
Published: 2023

18. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

Author: Howcroft, David, Belz, Anya, Gkatzia, Dimitra, Clinciu, Miruna, Hasan, Sadid, Mahamood, Saad, Mille, Simon, van Miltenburg, Emiel, Santhanam, Sashank, Rieser, Verena, Language, Communication and Cognition, Davis, Brian, Graham, Yvette, Kelleher, John D., and Sripada, Yaji
Subjects: Computational linguistics
Abstract: Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.
Published: 2020

19. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions

Author: Davis, Brian, Graham, Yvette, Kelleher, John D., Sripada, Yaji, Howcroft, David, Belz, Anya, Gkatzia, Dimitra, Clinciu, Miruna, Hasan, Sadid, Mahamood, Saad, Mille, Simon, van Miltenburg, Emiel, Santhanam, Sashank, and Rieser, Verena
Abstract: Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.
Published: 2020

20. Underreporting of errors in NLG output, and what to do about it

Author: van Miltenburg, Emiel, primary, Clinciu, Miruna, additional, Dušek, Ondřej, additional, Gkatzia, Dimitra, additional, Inglis, Stephanie, additional, Leppänen, Leo, additional, Mahamood, Saad, additional, Manning, Emma, additional, Schoch, Stephanie, additional, Thomson, Craig, additional, and Wen, Luou, additional
Published: 2021

21. Reproducing a Comparison of Hedged and Non-hedged NLG Texts

Author: Mahamood, Saad, primary
Published: 2021

22. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

Author: Gehrmann, Sebastian, primary, Adewumi, Tosin, additional, Aggarwal, Karmanya, additional, Ammanamanchi, Pawan Sasanka, additional, Aremu, Anuoluwapo, additional, Bosselut, Antoine, additional, Chandu, Khyathi Raghavi, additional, Clinciu, Miruna-Adriana, additional, Das, Dipanjan, additional, Dhole, Kaustubh, additional, Du, Wanyu, additional, Durmus, Esin, additional, Dušek, Ondřej, additional, Emezue, Chris Chinenye, additional, Gangal, Varun, additional, Garbacea, Cristina, additional, Hashimoto, Tatsunori, additional, Hou, Yufang, additional, Jernite, Yacine, additional, Jhamtani, Harsh, additional, Ji, Yangfeng, additional, Jolly, Shailza, additional, Kale, Mihir, additional, Kumar, Dhruv, additional, Ladhak, Faisal, additional, Madaan, Aman, additional, Maddela, Mounica, additional, Mahajan, Khyati, additional, Mahamood, Saad, additional, Majumder, Bodhisattwa Prasad, additional, Martins, Pedro Henrique, additional, McMillan-Major, Angelina, additional, Mille, Simon, additional, van Miltenburg, Emiel, additional, Nadeem, Moin, additional, Narayan, Shashi, additional, Nikolaev, Vitaly, additional, Niyongabo Rubungo, Andre, additional, Osei, Salomey, additional, Parikh, Ankur, additional, Perez-Beltrachini, Laura, additional, Rao, Niranjan Ramesh, additional, Raunak, Vikas, additional, Rodriguez, Juan Diego, additional, Santhanam, Sashank, additional, Sedoc, João, additional, Sellam, Thibault, additional, Shaikh, Samira, additional, Shimorina, Anastasia, additional, Sobrevilla Cabezudo, Marco Antonio, additional, Strobelt, Hendrik, additional, Subramani, Nishant, additional, Xu, Wei, additional, Yang, Diyi, additional, Yerukola, Akhila, additional, and Zhou, Jiawei, additional
Published: 2021

23. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

Author: Howcroft, David M., primary, Belz, Anya, additional, Clinciu, Miruna-Adriana, additional, Gkatzia, Dimitra, additional, Hasan, Sadid A., additional, Mahamood, Saad, additional, Mille, Simon, additional, van Miltenburg, Emiel, additional, Santhanam, Sashank, additional, and Rieser, Verena, additional
Published: 2020

24. Explainable Artificial Intelligence and its potential within Industry

Author: Mahamood, Saad, primary
Published: 2019

25. Hotel Scribe: Generating High Variation Hotel Descriptions

Author: Mahamood, Saad, primary and Zembrzuski, Maciej, additional
Published: 2019

26. A Snapshot of NLG Evaluation Practices 2005 - 2014

Author: Gkatzia, Dimitra, primary and Mahamood, Saad, additional
Published: 2015

27. Generating Annotated Graphs using the NLG Pipeline Architecture

Author: Mahamood, Saad, primary, Bradshaw, William, additional, and Reiter, Ehud, additional
Published: 2014

28. From data to text in the Neonatal Intensive Care Unit: Using NLG technology for decision support and information management

Author: Gatt, Albert, primary, Portet, François, additional, Reiter, Ehud, additional, Hunter, Jim, additional, Mahamood, Saad, additional, Moncur, Wendy, additional, and Sripada, Somayajulu, additional
Published: 2009

29. Neonatal Intensive Care Information for Parents: An Affective Approach

Author: Mahamood, Saad, primary, Reiter, Ehud, additional, and Mellish, Chris, additional
Published: 2008

30. A comparison of hedged and non-hedged NLG texts

Author: Mahamood, Saad, primary, Reiter, Ehud, additional, and Mellish, Chris, additional
Published: 2007
