8 results for "llama-2"
Search Results
2. Enhancing textual textbook question answering with large language models and retrieval augmented generation
- Author
- Alawwad, Hessa A., Alhothali, Areej, Naseem, Usman, Alkhathlan, Ali, and Jamal, Amani
- Published
- 2025
3. An Empirical Evaluation of Large Language Models in Static Code Analysis for PHP Vulnerability Detection
- Author
- Orçun Çetin, Emre Ekmekcioglu, Budi Arief, and Julio Hernandez-Castro
- Subjects
ChatGPT, Claude, Bard, Gemini, Llama-2, Static code analysis, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Web services play an important role in our daily lives. They are used in a wide range of activities, from online banking and shopping to education, entertainment, and social interactions. It is therefore essential to keep them as secure as possible. However, as is the case with any complex software system, creating sophisticated software that is free from security vulnerabilities is very challenging. One way to enhance software security is static code analysis, a technique that can identify potential vulnerabilities in source code before bad actors exploit them. This approach has been instrumental in tackling many vulnerabilities, but it is not without limitations. Recent research suggests that static code analysis can benefit from the use of large language models (LLMs). This is a promising line of research, but the literature still contains very few, and quite limited, studies on how effective various LLMs are at detecting vulnerabilities in source code. This is the research gap we aim to address in this work. Our study examined five notable LLM chatbots: ChatGPT 4, ChatGPT 3.5, Claude, Bard/Gemini, and Llama-2, assessing their ability to identify 104 known vulnerabilities spanning the Top-10 categories defined by the Open Worldwide Application Security Project (OWASP). Moreover, we evaluated these LLMs' false-positive rates using 97 patched code samples. We focused specifically on PHP vulnerabilities, given PHP's prevalence in web applications. We found that ChatGPT-4 has the highest vulnerability detection rate, finding over 61.5% of vulnerabilities, followed by ChatGPT-3.5 at 50%. Bard has the highest rate of missed vulnerabilities, at 53.8%, and the lowest detection rate, at 13.4%. For all models, a significant percentage of vulnerabilities were classified as only partially found, indicating a level of uncertainty or incomplete detection across all tested LLMs. Moreover, we found that ChatGPT-4 and ChatGPT-3.5 were consistently more effective than the other models across most categories, while Bard and Llama-2 displayed limited effectiveness across the majority of categories. Surprisingly, our findings reveal high false-positive rates across all LLMs: even the best-performing model (ChatGPT-4) had a false-positive rate of nearly 63%, while several models performed far worse, with false-positive rates of over 90%. Finally, deploying multiple LLMs simultaneously for static analysis yielded only a marginal improvement in vulnerability detection rates. We believe these results generalize to most other programming languages, and hence are far from limited to PHP.
- Published
- 2024
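A rough sketch of the kind of query such a study runs: ask a chat LLM to audit a PHP snippet against the OWASP Top-10. This is a minimal illustration assuming the OpenAI Python client; the model name, prompt wording, and PHP snippet are invented, not the authors' exact protocol.

    # Minimal sketch: prompt a chat LLM to act as a static analyzer for PHP.
    # Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    # Invented example: classic SQL injection via an unsanitized GET parameter.
    php_snippet = '<?php $r = mysqli_query($c, "SELECT * FROM users WHERE id = " . $_GET["id"]); ?>'

    prompt = (
        "You are a static code analyzer. List any OWASP Top-10 vulnerabilities "
        "in the following PHP code, one per line as '<category>: <reason>', "
        "or reply 'NONE'.\n\n" + php_snippet
    )

    response = client.chat.completions.create(
        model="gpt-4",   # the study also tested GPT-3.5, Claude, Bard/Gemini, Llama-2
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic output simplifies scoring detection rates
    )
    print(response.choices[0].message.content)

Scoring then reduces to comparing the categories the model reports against the known ground truth for each of the 104 vulnerable (and 97 patched) samples.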
4. Can the “Art” of Mathematical Modeling in Industrial Engineering be Automated by AI?
- Author
- Mardikar, Shrushti, Aldhuhayyan, Abdullah, and Prabhu, Vittaldas V.
- Subjects
ECONOMIC activity, TECHNOLOGICAL innovations, ECONOMIC development, DIGITAL technology, ARTIFICIAL intelligence
- Abstract
The capacity of generative artificial intelligence (AI) models to formulate optimization problems is an interesting area of exploration in this rapidly evolving field. This study explores the capability of AI to interpret and formulate mathematical modeling problems from English descriptions. Five large language models (LLMs) were selected: OpenAI's ChatGPT-4, Google's Gemini, Microsoft's Copilot, Anthropic's Claude, and Meta's open-source model Llama-2. The research systematically compares human-expert formulations with those produced by the LLMs to better understand the strengths and shortcomings of the five models. A diverse set of 26 linear programming problems was used for this evaluation, with the effectiveness of the AI tools measured by the correctness of their formulations. ChatGPT-4 outperformed its competitors with a mean score of 88.55, followed by Copilot at 84.93, Gemini at 83.57, Claude at 81.21, and Llama-2 at 46.26. The test problems were example linear programming problems from a typical junior-level course in industrial engineering and were graded using a rubric. Overall, ChatGPT-4 earned the best grade, a "B+", compared to Copilot ("B"), Gemini ("B-"), Claude ("B-"), and Llama-2 ("F"). These findings indicate considerable variation in the ability of current generative AI technologies to automatically formulate mathematical optimization problems. There are interesting opportunities to harness these technologies as they continue to evolve rapidly in research, education, and practice. [ABSTRACT FROM AUTHOR]
- Published
- 2024
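For context, the kind of formulation being graded looks like the toy product-mix model below; the data are invented for illustration and do not come from the paper's 26 test problems.

    % Hypothetical junior-level product-mix LP, invented for illustration.
    % Decision variables: x1, x2 = units of products 1 and 2 to produce.
    \[
    \begin{aligned}
    \max_{x_1,\,x_2}\quad & 40x_1 + 30x_2 && \text{(total profit)}\\
    \text{s.t.}\quad & 2x_1 + x_2 \le 100 && \text{(machine hours)}\\
    & x_1 + 3x_2 \le 90 && \text{(labor hours)}\\
    & x_1,\,x_2 \ge 0.
    \end{aligned}
    \]

Rubric-based grading then checks whether an LLM recovers the correct decision variables, objective, and constraints from the English description.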
5. Comparative Analysis of Large Language Models for Question Answering from Financial Documents
- Author
- Panwar, Shivam, Bansal, Anukriti, Zareen, Farhana, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Sharma, Harish, editor, Shrivastava, Vivek, editor, Tripathi, Ashish Kumar, editor, and Wang, Lipo, editor
- Published
- 2024
6. Implementation of language models within an infrastructure designed for Natural Language Processing
- Author
- Bartosz Walkowiak and Tomasz Walkowiak
- Subjects
language model deployment, quantization, llama-2, e5 model, onnx, llama.cpp, clarin-pl, Electrical engineering. Electronics. Nuclear engineering, TK1-9971, Telecommunication, TK5101-6720
- Abstract
This paper explores cost-effective alternatives for running language models in resource-constrained environments by investigating methods such as quantization and CPU-based model implementations. The study addresses the computational efficiency of language models during inference and the development of infrastructure for text document processing. The paper discusses related technologies, the CLARIN-PL infrastructure architecture, and implementations of small and large language models, with emphasis on model formats, data precision, and runtime environments (GPU and CPU). It identifies optimal solutions through extensive experimentation. In addition, the paper advocates a more comprehensive approach to performance evaluation: instead of reporting only average token throughput, it suggests considering the shape of the throughput curve, which can range from constant to monotonically increasing or decreasing. Evaluating token throughput at various points on this curve, especially for different output token counts, provides a more informative perspective.
- Published
- 2024
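The evaluation idea, throughput measured as a curve over output length rather than a single average, can be sketched with the llama-cpp-python bindings for llama.cpp; the GGUF file name and the grid of output lengths below are assumptions for illustration, not the paper's setup.

    # Minimal sketch of the throughput-curve measurement: tokens/second at
    # several output lengths instead of one average. Assumes llama-cpp-python
    # and a quantized Llama-2 GGUF file (path invented) running on CPU.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

    for max_tokens in (32, 128, 512):   # sample points along the curve
        start = time.perf_counter()
        out = llm("Describe the CLARIN-PL infrastructure.", max_tokens=max_tokens)
        elapsed = time.perf_counter() - start
        n = out["usage"]["completion_tokens"]
        print(f"{n:4d} tokens in {elapsed:6.2f}s -> {n / elapsed:.1f} tok/s")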
7. Large language models as an interface to interact with API tools in natural language
- Author
- Tesfagiorgis, Yohannes Gebreyohannes and Monteiro Silva, Bruno Miguel
- Abstract
In this research project, we explore the use of large language models (LLMs) as an interface for interacting with API tools in natural language. Bubeck et al. [1] shed some light on how LLMs could be used to interact with API tools. Since then, new versions of LLMs have been launched, and the question of how reliable an LLM can be at this task remains unanswered. The main goal of our thesis is to investigate the designs of the available system prompts for LLMs, identify the best-performing prompts, and evaluate the reliability of different LLMs when using the best-identified prompts. We employ a multi-stage controlled experiment: a literature review in which we survey the system prompts used in the scientific community and in open-source projects; then, using the F1-score as a metric, we analyse the precision and recall of the system prompts in order to select those that perform best at interacting with API tools; and in a later stage, we compare a selection of LLMs using the best-performing prompts identified earlier. From these experiments, we find that with GPT-4, AI-generated system prompts perform better than the prompts currently used in open-source projects and in the literature; that zero-shot prompts perform better on this specific task with GPT-4; and that a good system prompt for one model does not generalize well to other models.
- Published
- 2023
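The F1-based comparison described above reduces to scoring the set of API calls a model selects against a gold standard, roughly as in the sketch below; the tool names and data are invented for illustration.

    # Minimal sketch: score a system prompt by comparing the API tools the
    # LLM chose against the expected ones. Tool names are invented.
    def f1_score(predicted: set, expected: set) -> float:
        true_pos = len(predicted & expected)
        if true_pos == 0:
            return 0.0
        precision = true_pos / len(predicted)   # share of chosen calls that were right
        recall = true_pos / len(expected)       # share of expected calls that were made
        return 2 * precision * recall / (precision + recall)

    gold = {"get_weather", "search_flights"}
    llm_choice = {"get_weather", "book_hotel"}  # one correct call, one spurious
    print(f"F1 = {f1_score(llm_choice, gold):.2f}")  # F1 = 0.50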
8. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.
- Author
- Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, and Wang Y
- Abstract
Background: Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches. Objective: The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types, heuristic and ensemble prompts, for zero-shot and few-shot clinical information extraction using pretrained language models. Methods: This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches. Results: The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs in zero-shot clinical NLP. With heuristic prompts, GPT-3.5 achieved an accuracy of 0.96 in clinical sense disambiguation and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on the strengths of multiple prompts. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types. Conclusions: This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first empirical evaluations of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area. (©Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, Yanshan Wang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 08.04.2024.)
- Published
- 2024
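To make the prompt families concrete, the sketch below applies three of the studied prompt types to a clinical sense disambiguation example and takes a majority vote as a simple ensemble; the templates, the note, and the query_llm stub are invented illustrations, not the paper's actual prompts.

    # Minimal sketch of three of the studied prompt types applied to clinical
    # sense disambiguation, plus majority voting as a simple ensemble.
    # Templates and the query_llm stub are invented for illustration.
    from collections import Counter

    NOTE = "Pt c/o chest pain; EKG unremarkable. MS stable."
    PROMPTS = {
        "simple prefix": f"What does 'MS' mean in this clinical note?\n{NOTE}",
        "simple cloze": f"{NOTE}\nHere, 'MS' stands for ____.",
        "chain of thought": (
            f"{NOTE}\nThink step by step about the clinical context, "
            "then state what 'MS' means."
        ),
    }

    def query_llm(prompt: str) -> str:
        """Stub standing in for a call to GPT-3.5, Gemini, or LLaMA-2."""
        return "mental status"

    # Ensemble prompting: keep the answer most prompt types agree on.
    answers = [query_llm(p) for p in PROMPTS.values()]
    print(Counter(answers).most_common(1)[0][0])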