Author: "Henderson, Peter" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Henderson, Peter"' showing total 2,467 results

1. Preface

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

2. Cover

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

3. Prologue. Empowerment

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

4. Index

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

5. Notes

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

6. 4. Courage

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

7. 1. Vision

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

8. 6. Hope

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

9. 5. Passion

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

10. 3. Resilience

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

11. Epilogue. Transition

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

12. 2. Openness

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

13. Half Title, Title, Copyright, Dedication

Author: Rous, Philip J., Schaefer, Lynne C., Henderson, Peter H., and Hrabowski, III, Freeman A.
Published: 2024

14. Thirty-Six Unpublished Letters from William Henry Davies to Edward Thomas

Author: Baker, William and Henderson, Peter
Published: 2022

15. A Forgotten Anthology: Jacqueline Trotter's Valour and Vision: Poems of the War 1914–1918

Author: Baker, William and Henderson, Peter
Published: 2022

16. On Evaluating the Durability of Safeguards for Open-Weight LLMs

Author: Qi, Xiangyu, Wei, Boyi, Carlini, Nicholas, Huang, Yangsibo, Xie, Tinghao, He, Luxi, Jagielski, Matthew, Nasr, Milad, Mittal, Prateek, and Henderson, Peter
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence
Abstract: Stakeholders -- from model developers to policymakers -- seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable LLM safeguards for open-weight LLMs that can withstand adversarial modifications of the model's weights via fine-tuning. This holds the promise of raising adversaries' costs even under strong threat models where adversaries can directly fine-tune model weights. However, in this paper, we urge for more careful characterization of the limits of these approaches. Through several case studies, we demonstrate that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We draw lessons from the evaluation pitfalls that we identify and suggest future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models, which can provide more useful and candid assessments to stakeholders.
Published: 2024

17. The Mirage of Artificial Intelligence Terms of Use Restrictions

Author: Henderson, Peter and Lemley, Mark A.
Subjects: Computer Science - Computers and Society, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Artificial intelligence (AI) model creators commonly attach restrictive terms of use to both their models and their outputs. These terms typically prohibit activities ranging from creating competing AI models to spreading disinformation. Often taken at face value, these terms are positioned by companies as key enforceable tools for preventing misuse, particularly in policy dialogs. But are these terms truly meaningful? There are myriad examples where these broad terms are regularly and repeatedly violated. Yet except for some account suspensions on platforms, no model creator has actually tried to enforce these terms with monetary penalties or injunctive relief. This is likely for good reason: we think that the legal enforceability of these licenses is questionable. This Article systematically assesses the enforceability of AI model terms of use and offers three contributions. First, we pinpoint a key problem: the artifacts that they protect, namely model weights and model outputs, are largely not copyrightable, making it unclear whether there is even anything to be licensed. Second, we examine the problems this creates for other enforcement. Recent doctrinal trends in copyright preemption may further undermine state-law claims, while other legal frameworks like the DMCA and CFAA offer limited recourse. Anti-competitive provisions likely fare even worse than responsible use provisions. Third, we provide recommendations to policymakers. There are compelling reasons for many provisions to be unenforceable: they chill good faith research, constrain competition, and create quasi-copyright ownership where none should exist. There are, of course, downsides: model creators have fewer tools to prevent harmful misuse. But we think the better approach is for statutory provisions, not private fiat, to distinguish between good and bad uses of AI, restricting the latter.
Comment: Forthcoming Indiana Law Journal
Published: 2024

18. Index

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

19. Notes

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

20. 4. Leadership and Empowerment

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

21. Epilogue. A Great Challenge

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

22. 5. Grit and Greatness

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

23. 7. Pillars of Success

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

24. 8. An Honors University

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

25. Part II

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

26. 9. A Challenge of Quality

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

27. 3. Culture Change Is Hard as Hell

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

28. 6. At the Crossroads

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

29. 2. Higher Education Matters

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

30. 1. And Then We Did It

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

31. Cover

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

32. 10. The New American College

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

33. 11. Difficult Conversations

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

34. 13. Success Is Never Final

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

35. Title Page, Copyright Page, Dedication

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

36. Preface. It’s about Us

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

37. 12. Looking in the Mirror

Author: Hrabowski, III, Freeman A., Rous, Philip J., and Henderson, Peter H.
Published: 2019

38. An Adversarial Perspective on Machine Unlearning for AI Safety

Author: Łucki, Jakub, Wei, Boyi, Huang, Yangsibo, Henderson, Peter, Tramèr, Florian, and Rando, Javier
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Cryptography and Security
Abstract: Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
Comment: Spotlight paper at NeurIPS 2024 SoLaR workshop
Published: 2024

39. Evaluating Copyright Takedown Methods for Language Models

Author: Wei, Boyi, Shi, Weijia, Huang, Yangsibo, Smith, Noah A., Zhang, Chiyuan, Zettlemoyer, Luke, Li, Kai, and Henderson, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, posing potential concerns. Therefore, model creators are motivated to develop mitigation methods that prevent generating protected content. We term this procedure as copyright takedowns for LMs, noting the conceptual similarity to (but legal distinction from) the DMCA takedown. This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We propose CoTaEval, an evaluation framework to assess the effectiveness of copyright takedown methods, the impact on the model's ability to retain uncopyrightable factual knowledge from the training data whose recitation is embargoed, and how well the model maintains its general utility and efficiency. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches. Our findings indicate that no tested method excels across all metrics, showing significant room for research in this unique problem setting and indicating potential unresolved challenges for live policy proposals.
Comment: 31 pages, 9 figures, 14 tables
Published: 2024

40. The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Author: Longpre, Shayne, Biderman, Stella, Albalak, Alon, Schoelkopf, Hailey, McDuff, Daniel, Kapoor, Sayash, Klyman, Kevin, Lo, Kyle, Ilharco, Gabriel, San, Nay, Rauh, Maribeth, Skowron, Aviya, Vidgen, Bertie, Weidinger, Laura, Narayanan, Arvind, Sanh, Victor, Adelani, David, Liang, Percy, Bommasani, Rishi, Henderson, Peter, Luccioni, Sasha, Jernite, Yacine, and Soldaini, Luca
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
Published: 2024

41. SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Author: Xie, Tinghao, Qi, Xiangyu, Zeng, Yi, Huang, Yangsibo, Sehwag, Udari Madhushani, Huang, Kaixuan, He, Luxi, Wei, Boyi, Li, Dacheng, Sheng, Ying, Jia, Ruoxi, Li, Bo, Li, Kai, Chen, Danqi, Henderson, Peter, and Mittal, Prateek
Subjects: Computer Science - Artificial Intelligence
Abstract: Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics, and are over-representing some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 45 potentially unsafe topics, and 450 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. Third, existing evaluations rely on large LLMs (e.g., GPT-4) for evaluation, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4 scale LLMs, with lower computational cost. Putting these together, we evaluate over 40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and efficient manner.
Published: 2024

42. Fantastic Copyrighted Beasts and How (Not) to Generate Them

Author: He, Luxi, Huang, Yangsibo, Shi, Weijia, Xie, Tinghao, Liu, Haotian, Wang, Yue, Zettlemoyer, Luke, Zhang, Chiyuan, Chen, Danqi, and Henderson, Peter
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Machine Learning
Abstract: Recent studies show that image and video generation models can be prompted to reproduce copyrighted content from their training data, raising serious legal concerns around copyright infringement. Copyrighted characters, in particular, pose a difficult challenge for image generation services, with at least one lawsuit already awarding damages based on the generation of these characters. Yet, little research has empirically examined this issue. We conduct a systematic evaluation to fill this gap. First, we build CopyCat, an evaluation suite consisting of diverse copyrighted characters and a novel evaluation pipeline. Our evaluation considers both the detection of similarity to copyrighted characters and generated image's consistency with user input. Our evaluation systematically shows that both image and video generation models can still generate characters even if characters' names are not explicitly mentioned in the prompt, sometimes with only two generic keywords (e.g., prompting with "videogame, plumber" consistently generates Nintendo's Mario character). We then introduce techniques to semi-automatically identify such keywords or descriptions that trigger character generation. Using our evaluation suite, we study runtime mitigation strategies, including both existing methods and new strategies we propose. Our findings reveal that commonly employed strategies, such as prompt rewriting in the DALL-E system, are not sufficient as standalone guardrails. These strategies must be coupled with other approaches, like negative prompting, to effectively reduce the unintended generation of copyrighted characters. Our work provides empirical grounding to the discussion of copyright mitigation strategies and offers actionable insights for model deployers actively implementing them.
Published: 2024

43. Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Author: Qi, Xiangyu, Panda, Ashwinee, Lyu, Kaifeng, Ma, Xiao, Roy, Subhrajit, Beirami, Ahmad, Mittal, Prateek, and Henderson, Peter
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence
Abstract: The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and provide evidence that current aligned LLMs are subject to this issue. We also show how these findings help explain multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. Importantly, we discuss how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. For instance, we show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits. Finally, we design a regularized finetuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.
Published: 2024

44. JIGMARK: A Black-Box Approach for Enhancing Image Watermarks against Diffusion Model Edits

Author: Pan, Minzhou, Zeng, Yi, Lin, Xue, Yu, Ning, Hsieh, Cho-Jui, Henderson, Peter, and Jia, Ruoxi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: In this study, we investigate the vulnerability of image watermarks to diffusion-model-based image editing, a challenge exacerbated by the computational cost of accessing gradient information and the closed-source nature of many diffusion models. To address this issue, we introduce JIGMARK. This first-of-its-kind watermarking technique enhances robustness through contrastive learning with pairs of images, processed and unprocessed by diffusion models, without needing a direct backpropagation of the diffusion process. Our evaluation reveals that JIGMARK significantly surpasses existing watermarking solutions in resilience to diffusion-model edits, demonstrating a True Positive Rate more than triple that of leading baselines at a 1% False Positive Rate while preserving image quality. At the same time, it consistently improves the robustness against other conventional perturbations (like JPEG, blurring, etc.) and malicious watermark attacks over the state-of-the-art, often by a large margin. Furthermore, we propose the Human Aligned Variation (HAV) score, a new metric that surpasses traditional similarity measures in quantifying the number of image derivatives from image editing.
Published: 2024

45. AI Risk Management Should Incorporate Both Safety and Security

Author: Qi, Xiangyu, Huang, Yangsibo, Zeng, Yi, Debenedetti, Edoardo, Geiping, Jonas, He, Luxi, Huang, Kaixuan, Madhushani, Udari, Sehwag, Vikash, Shi, Weijia, Wei, Boyi, Xie, Tinghao, Chen, Danqi, Chen, Pin-Yu, Ding, Jeffrey, Jia, Ruoxi, Ma, Jiaqi, Narayanan, Arvind, Su, Weijie J, Wang, Mengdi, Xiao, Chaowei, Li, Bo, Song, Dawn, Henderson, Peter, and Mittal, Prateek
Subjects: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence
Abstract: The exposure of security vulnerabilities in safety-aligned language models, e.g., susceptibility to adversarial attacks, has shed light on the intricate interplay between AI safety and AI security. Although the two disciplines now come together under the overarching goal of AI risk management, they have historically evolved separately, giving rise to differing perspectives. Therefore, in this paper, we advocate that stakeholders in AI risk management should be aware of the nuances, synergies, and interplay between safety and security, and unambiguously take into account the perspectives of both disciplines in order to devise the most effective and holistic risk mitigation approaches. Unfortunately, this vision is often obfuscated, as the definitions of the basic concepts of "safety" and "security" themselves are often inconsistent and lack consensus across communities. With AI risk management being increasingly cross-disciplinary, this issue is particularly salient. In light of this conceptual challenge, we introduce a unified reference framework to clarify the differences and interplay between AI safety and AI security, aiming to facilitate a shared understanding and effective collaboration across communities.
Published: 2024

46. FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning

Author: Niklaus, Joel, Zheng, Lucia, McCarthy, Arya D., Hahn, Christopher, Rosen, Brian M., Henderson, Peter, Ho, Daniel E., Honke, Garrett, Liang, Percy, and Manning, Christopher
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, 68T50, I.2
Abstract: Instruction tuning is an important step in making language models useful for direct user interaction. However, many legal tasks remain out of reach for most open LLMs and there do not yet exist any large scale instruction datasets for the domain. This critically limits research in this application area. In this work, we curate LawInstruct, a large legal instruction dataset, covering 17 jurisdictions, 24 languages and a total of 12M examples. We present evidence that domain-specific pretraining and instruction tuning improve performance on LegalBench, including improving Flan-T5 XL by 8 points or 16% over the baseline. However, the effect does not generalize across all tasks, training regimes, model sizes, and other factors. LawInstruct is a resource for accelerating the development of models with stronger information processing and decision making capabilities in the legal domain.
Published: 2024

47. What is in Your Safe Data? Identifying Benign Data that Breaks Safety

Author: He, Luxi, Xia, Mengzhou, and Henderson, Peter
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Cryptography and Security
Abstract: Current Large Language Models (LLMs), even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We delve into the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. First, we represent fine-tuning data through two lenses: representation and gradient spaces. Additionally, we propose a bi-directional anchoring method that, during the selection process, prioritizes data points that are close to harmful examples and far from benign ones. Our approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after fine-tuning. Training on just 100 of these seemingly benign datapoints surprisingly leads to the fine-tuned model affirmatively responding to >70% of tested harmful requests, compared to <20% after fine-tuning on randomly selected data. We also observe that the selected data frequently appear as lists, bullet points, or math questions, indicating a systematic pattern in fine-tuning data that contributes to jailbreaking.
Published: 2024

48. A Safe Harbor for AI Evaluation and Red Teaming

Author: Longpre, Shayne, Kapoor, Sayash, Klyman, Kevin, Ramaswami, Ashwin, Bommasani, Rishi, Blili-Hamelin, Borhane, Huang, Yangsibo, Skowron, Aviya, Yong, Zheng-Xin, Kotha, Suhas, Zeng, Yi, Shi, Weiyan, Yang, Xianjun, Southen, Reid, Robey, Alexander, Chao, Patrick, Yang, Diyi, Jia, Ruoxi, Kang, Daniel, Pentland, Sandy, Narayanan, Arvind, Liang, Percy, and Henderson, Peter
Subjects: Computer Science - Artificial Intelligence
Abstract: Independent evaluation and red teaming are critical for identifying the risks posed by generative AI systems. However, the terms of service and enforcement strategies used by prominent AI companies to deter model misuse have disincentives on good faith safety evaluations. This causes some researchers to fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal. Although some companies offer researcher access programs, they are an inadequate substitute for independent research access, as they have limited community representation, receive inadequate funding, and lack independence from corporate incentives. We propose that major AI developers commit to providing a legal and technical safe harbor, indemnifying public interest safety research and protecting it from the threat of account suspensions or legal reprisal. These proposals emerged from our collective experience conducting safety, privacy, and trustworthiness research on generative AI systems, where norms and incentives could be better aligned with public interests, without exacerbating model misuse. We believe these commitments are a necessary step towards more inclusive and unimpeded community efforts to tackle the risks of generative AI.
Published: 2024

49. On the Societal Impact of Open Foundation Models

Author: Kapoor, Sayash, Bommasani, Rishi, Klyman, Kevin, Longpre, Shayne, Ramaswami, Ashwin, Cihon, Peter, Hopkins, Aspen, Bankston, Kevin, Biderman, Stella, Bogen, Miranda, Chowdhury, Rumman, Engler, Alex, Henderson, Peter, Jernite, Yacine, Lazar, Seth, Maffulli, Stefano, Nelson, Alondra, Pineau, Joelle, Skowron, Aviya, Song, Dawn, Storchan, Victor, Zhang, Daniel, Ho, Daniel E., Liang, Percy, and Narayanan, Arvind
Subjects: Computer Science - Computers and Society, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Foundation models are powerful technologies: how they are released publicly directly shapes their societal impact. In this position paper, we focus on open foundation models, defined here as those with broadly available model weights (e.g. Llama 2, Stable Diffusion XL). We identify five distinctive properties (e.g. greater customizability, poor monitoring) of open foundation models that lead to both their benefits and risks. Open foundation models present significant benefits, with some caveats, that span innovation, competition, the distribution of decision-making power, and transparency. To understand their risks of misuse, we design a risk assessment framework for analyzing their marginal risk. Across several misuse vectors (e.g. cyberattacks, bioweapons), we find that current research is insufficient to effectively characterize the marginal risk of open foundation models relative to pre-existing technologies. The framework helps explain why the marginal risk is low in some cases, clarifies disagreements about misuse risks by revealing that past work has focused on different subsets of the framework with different assumptions, and articulates a way forward for more constructive debate. Overall, our work helps support a more grounded assessment of the societal impact of open foundation models by outlining what research is needed to empirically validate their theoretical benefits and risks.
Published: 2024

50. Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Author: Wei, Boyi, Huang, Kaixuan, Huang, Yangsibo, Xie, Tinghao, Qi, Xiangyu, Xia, Mengzhou, Mittal, Prateek, Wang, Mengdi, and Henderson, Peter
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about 3% at the parameter level and 2.5% at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
Comment: 22 pages, 9 figures. Project page is available at https://boyiwei.com/alignment-attribution/
Published: 2024
