638 results on '"Krueger, David"'
Search Results
2. The Flag and the Cross: White Christian Nationalism and the Threat to American Democracy by Philip S. Gorski and Samuel L. Perry (review)
- Author
-
Krueger, David M.
- Published
- 2023
3. Introduction to This Issue
- Author
-
Krueger, David M.
- Published
- 2022
- Full Text
- View/download PDF
4. Blacks and Jews in America: An Invitation to Dialogue by Terrance L. Johnson and Jacques Berlinerblau (review)
- Author
-
Krueger, David M.
- Published
- 2022
- Full Text
- View/download PDF
5. Adversarial Robustness of In-Context Learning in Transformers for Linear Regression
- Author
-
Anwar, Usman, Von Oswald, Johannes, Kirsch, Louis, Krueger, David, and Frei, Spencer
- Subjects
Computer Science - Machine Learning ,Computer Science - Cryptography and Security - Abstract
Transformers have demonstrated remarkable in-context learning capabilities across various domains, including statistical learning tasks. While previous work has shown that transformers can implement common learning algorithms, the adversarial robustness of these learned algorithms remains unexplored. This work investigates the vulnerability of in-context learning in transformers to \textit{hijacking attacks} focusing on the setting of linear regression tasks. Hijacking attacks are prompt-manipulation attacks in which the adversary's goal is to manipulate the prompt to force the transformer to generate a specific output. We first prove that single-layer linear transformers, known to implement gradient descent in-context, are non-robust and can be manipulated to output arbitrary predictions by perturbing a single example in the in-context training set. While our experiments show these attacks succeed on linear transformers, we find they do not transfer to more complex transformers with GPT-2 architectures. Nonetheless, we show that these transformers can be hijacked using gradient-based adversarial attacks. We then demonstrate that adversarial training enhances transformers' robustness against hijacking attacks, even when just applied during finetuning. Additionally, we find that in some settings, adversarial training against a weaker attack model can lead to robustness to a stronger attack model. Lastly, we investigate the transferability of hijacking attacks across transformers of varying scales and initialization seeds, as well as between transformers and ordinary least squares (OLS). We find that while attacks transfer effectively between small-scale transformers, they show poor transferability in other scenarios (small-to-large scale, large-to-large scale, and between transformers and OLS).
- Published
- 2024
6. Noisy Zero-Shot Coordination: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games
- Author
-
Anwar, Usman, Pandian, Ashish, Wan, Jia, Krueger, David, and Foerster, Jakob
- Subjects
Computer Science - Machine Learning - Abstract
Zero-shot coordination (ZSC) is a popular setting for studying the ability of reinforcement learning (RL) agents to coordinate with novel partners. Prior ZSC formulations assume the $\textit{problem setting}$ is common knowledge: each agent knows the underlying Dec-POMDP, knows others have this knowledge, and so on ad infinitum. However, this assumption rarely holds in complex real-world settings, which are often difficult to fully and correctly specify. Hence, in settings where this common knowledge assumption is invalid, agents trained using ZSC methods may not be able to coordinate well. To address this limitation, we formulate the $\textit{noisy zero-shot coordination}$ (NZSC) problem. In NZSC, agents observe different noisy versions of the ground truth Dec-POMDP, which are assumed to be distributed according to a fixed noise model. Only the distribution of ground truth Dec-POMDPs and the noise model are common knowledge. We show that a NZSC problem can be reduced to a ZSC problem by designing a meta-Dec-POMDP with an augmented state space consisting of all the ground-truth Dec-POMDPs. For solving NZSC problems, we propose a simple and flexible meta-learning method called NZSC training, in which the agents are trained across a distribution of coordination problems - which they only get to observe noisy versions of. We show that with NZSC training, RL agents can be trained to coordinate well with novel partners even when the (exact) problem setting of the coordination is not common knowledge.
- Published
- 2024
7. Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
- Author
-
Marks, Luke, Paren, Alasdair, Krueger, David, and Barez, Fazl
- Subjects
Computer Science - Machine Learning - Abstract
Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose \textsc{Mutual Feature Regularization} \textbf{(MFR)}, a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features. We motivate \textsc{MFR} by showing that features learned by multiple SAEs are more likely to correlate with features of the input. By training on synthetic data with known features of the input, we show that \textsc{MFR} can help SAEs learn those features, as we can directly compare the features learned by the SAE with the input features for the synthetic data. We then scale \textsc{MFR} to SAEs that are trained to denoise electroencephalography (EEG) data and SAEs that are trained to reconstruct GPT-2 Small activations. We show that \textsc{MFR} can improve the reconstruction loss of SAEs by up to 21.21\% on GPT-2 Small, and 6.67\% on EEG data. Our results suggest that the similarity between features learned by different SAEs can be leveraged to improve SAE training, thereby enhancing performance and the usefulness of SAEs for model interpretability.
- Published
- 2024
8. Predicting Future Actions of Reinforcement Learning Agents
- Author
-
Chung, Stephen, Niekum, Scott, and Krueger, David
- Subjects
Computer Science - Artificial Intelligence ,I.2.6 ,I.2.8 ,I.5.1 - Abstract
As reinforcement learning agents become increasingly deployed in real-world scenarios, predicting future agent actions and events during deployment is important for facilitating better human-agent interaction and preventing catastrophic outcomes. This paper experimentally evaluates and compares the effectiveness of future action and event prediction for three types of RL agents: explicitly planning, implicitly planning, and non-planning. We employ two approaches: the inner state approach, which involves predicting based on the inner computations of the agents (e.g., plans or neuron activations), and a simulation-based approach, which involves unrolling the agent in a learned world model. Our results show that the plans of explicitly planning agents are significantly more informative for prediction than the neuron activations of the other types. Furthermore, using internal plans proves more robust to model quality compared to simulation-based approaches when predicting actions, while the results for event prediction are more mixed. These findings highlight the benefits of leveraging inner states and simulations to predict future agent actions and events, thereby improving interaction and safety in real-world deployments., Comment: 16 pages, 8 figures
- Published
- 2024
9. Integrating uncertainty quantification into randomized smoothing based robustness guarantees
- Author
-
Däubener, Sina, Maag, Kira, Krueger, David, and Fischer, Asja
- Subjects
Computer Science - Machine Learning ,Computer Science - Cryptography and Security ,Statistics - Machine Learning - Abstract
Deep neural networks have proven to be extremely powerful, however, they are also vulnerable to adversarial attacks which can cause hazardous incorrect predictions in safety-critical applications. Certified robustness via randomized smoothing gives a probabilistic guarantee that the smoothed classifier's predictions will not change within an $\ell_2$-ball around a given input. On the other hand (uncertainty) score-based rejection is a technique often applied in practice to defend models against adversarial attacks. In this work, we fuse these two approaches by integrating a classifier that abstains from predicting when uncertainty is high into the certified robustness framework. This allows us to derive two novel robustness guarantees for uncertainty aware classifiers, namely (i) the radius of an $\ell_2$-ball around the input in which the same label is predicted and uncertainty remains low and (ii) the $\ell_2$-radius of a ball in which the predictions will either not change or be uncertain. While the former provides robustness guarantees with respect to attacks aiming at increased uncertainty, the latter informs about the amount of input perturbation necessary to lead the uncertainty aware model into a wrong prediction. Notably, this is on CIFAR10 up to 20.93% larger than for models not allowing for uncertainty based rejection. We demonstrate, that the novel framework allows for a systematic robustness evaluation of different network architectures and uncertainty measures and to identify desired properties of uncertainty quantification techniques. Moreover, we show that leveraging uncertainty in a smoothed classifier helps out-of-distribution detection.
- Published
- 2024
10. Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
- Author
-
Pres, Itamar, Ruis, Laura, Lubana, Ekdeep Singh, and Krueger, David
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
Representation engineering methods have recently shown promise for enabling efficient steering of model behavior. However, evaluation pipelines for these methods have primarily relied on subjective demonstrations, instead of quantitative, objective metrics. We aim to take a step towards addressing this issue by advocating for four properties missing from current evaluations: (i) contexts sufficiently similar to downstream tasks should be used for assessing intervention quality; (ii) model likelihoods should be accounted for; (iii) evaluations should allow for standardized comparisons across different target behaviors; and (iv) baseline comparisons should be offered. We introduce an evaluation pipeline grounded in these criteria, offering both a quantitative and visual analysis of how effectively a given method works. We use this pipeline to evaluate two representation engineering methods on how effectively they can steer behaviors such as truthfulness and corrigibility, finding that some interventions are less effective than previously reported., Comment: Accepted to the NeurIPS 2024 - Workshop on Foundation Model Interventions
- Published
- 2024
11. Microscopic theory of spin friction and dissipative spin dynamics
- Author
-
Lenzing, Nicolas, Krüger, David, and Potthoff, Michael
- Subjects
Condensed Matter - Mesoscale and Nanoscale Physics ,Condensed Matter - Materials Science ,Condensed Matter - Strongly Correlated Electrons - Abstract
The real-time dynamics of local magnetic moments exchange coupled to a metallic system of conduction electrons is subject to dissipative friction even in the absence of spin-orbit coupling. Phenomenologically, this is usually described by a local Gilbert damping constant. Here, we use both linear response theory and adiabatic response theory to derive the spin friction microscopically for a generic single-band tight-binding model of the electronic structure. The resulting Gilbert damping is time-dependent and nonlocal. For a one-dimensional model, we compare the emergent relaxation dynamics as obtained from LRT and ART against each other and against the full solution of the microscopic equations of motion and demonstrate the importance of nonlocality, while the time dependence turns out to be irrelevant. In two dimensions and for a few magnetic moments in different geometries, it is found that the inclusion of nonlocal Gilbert damping can counterintuitively lead to longer relaxation times. Besides the distance dependence, the directional dependence of the nonlocal Gilbert damping turns out as very important. Our results are based on an expression relating the nonlocal Gilbert damping to the nonlocal tight-binding density of states close to the Fermi energy. This is exact in case of noninteracting electrons. Effects due to electronic correlations are studied within the random-phase approximation. For the Hubbard model at half filling and with increasing interaction strength, we find a strong enhancement of the nonlocality of spin friction., Comment: 20 pages, 19 figures
- Published
- 2024
12. Influence Functions for Scalable Data Attribution in Diffusion Models
- Author
-
Mlodozeniec, Bruno, Eschenhagen, Runa, Bae, Juhan, Immer, Alexander, Krueger, David, and Turner, Richard
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an \textit{influence functions} framework. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We recast previously proposed methods as specific design choices in our framework and show that our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.
- Published
- 2024
13. Analyzing (In)Abilities of SAEs via Formal Languages
- Author
-
Menon, Abhinav, Shrivastava, Manish, Krueger, David, and Lubana, Ekdeep Singh
- Subjects
Computer Science - Machine Learning - Abstract
Autoencoders have been used for finding interpretable and disentangled features underlying neural network representations in both image and text domains. While the efficacy and pitfalls of such methods are well-studied in vision, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages. Specifically, we train SAEs on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of hyperparameter settings, finding interpretable latents often emerge in the features learned by our SAEs. However, similar to vision, we find performance turns out to be highly sensitive to inductive biases of the training pipeline. Moreover, we show latents correlating to certain features of the input do not always induce a causal impact on model's computation. We thus argue that causality has to become a central target in SAE training: learning of causal features should be incentivized from the ground-up. Motivated by this, we propose and perform preliminary investigations for an approach that promotes learning of causally relevant features in our formal language setting., Comment: Under review
- Published
- 2024
14. PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
- Author
-
Fu, Tingchen, Sharma, Mrinank, Torr, Philip, Cohen, Shay B., Krueger, David, and Barez, Fazl
- Subjects
Computer Science - Cryptography and Security ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation., Comment: Tingchen Fu and Fazl Barez are core research contributors
- Published
- 2024
15. Exploring the design space of deep-learning-based weather forecasting systems
- Author
-
Siddiqui, Shoaib Ahmed, Kossaifi, Jean, Bonev, Boris, Choy, Christopher, Kautz, Jan, Krueger, David, and Azizzadenesheli, Kamyar
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
Despite tremendous progress in developing deep-learning-based weather forecasting systems, their design space, including the impact of different design choices, is yet to be well understood. This paper aims to fill this knowledge gap by systematically analyzing these choices including architecture, problem formulation, pretraining scheme, use of image-based pretrained models, loss functions, noise injection, multi-step inputs, additional static masks, multi-step finetuning (including larger stride models), as well as training on a larger dataset. We study fixed-grid architectures such as UNet, fully convolutional architectures, and transformer-based models, along with grid-invariant architectures, including graph-based and operator-based models. Our results show that fixed-grid architectures outperform grid-invariant architectures, indicating a need for further architectural developments in grid-invariant models such as neural operators. We therefore propose a hybrid system that combines the strong performance of fixed-grid models with the flexibility of grid-invariant architectures. We further show that multi-step fine-tuning is essential for most deep-learning models to work well in practice, which has been a common practice in the past. Pretraining objectives degrade performance in comparison to supervised training, while image-based pretrained models provide useful inductive biases in some cases in comparison to training the model from scratch. Interestingly, we see a strong positive effect of using a larger dataset when training a smaller model as compared to training on a smaller dataset for longer. Larger models, on the other hand, primarily benefit from just an increase in the computational budget. We believe that these results will aid in the design of better weather forecasting systems in the future.
- Published
- 2024
16. Towards Interpreting Visual Information Processing in Vision-Language Models
- Author
-
Neo, Clement, Ong, Luke, Torr, Philip, Geva, Mor, Krueger, David, and Barez, Fazl
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the localization of object information, the evolution of visual token representations across layers, and the mechanism of integrating visual information for predictions. Through ablation studies, we demonstrated that object identification accuracy drops by over 70\% when object-specific tokens are removed. We observed that visual token representations become increasingly interpretable in the vocabulary space across layers, suggesting an alignment with textual tokens corresponding to image content. Finally, we found that the model extracts object information from these refined representations at the last token position for prediction, mirroring the process in text-only language models for factual association tasks. These findings provide crucial insights into how VLMs process and integrate visual information, bridging the gap between our understanding of language and vision models, and paving the way for more interpretable and controllable multimodal systems.
- Published
- 2024
17. Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
- Author
-
Lan, Michael, Torr, Philip, Meek, Austin, Khakzar, Ashkan, Krueger, David, and Barez, Fazl
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than distinct ones. This makes it difficult to disentangle and match features across different models. To address this issue, we employ a method known as dictionary learning by using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics like Singular Value Canonical Correlation Analysis to analyze these SAE features across different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
- Published
- 2024
18. Permissive Information-Flow Analysis for Large Language Models
- Author
-
Siddiqui, Shoaib Ahmed, Gaonkar, Radhika, Köpf, Boris, Krueger, David, Paverd, Andrew, Salem, Ahmed, Tople, Shruti, Wutschitz, Lukas, Xia, Menglin, and Zanella-Béguelin, Santiago
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, the traditional approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary input. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a $k$-nearest-neighbors language model. We compare these with the baseline of an introspection-based influence estimator that directly asks the language model to predict the output label. The results obtained highlight the superiority of our prompt-based label propagator, which improves the label in more than 85% of the cases in an LLM agent setting. These findings underscore the practicality of permissive label propagation for retrieval augmentation., Comment: 16 pages, 11 figures
- Published
- 2024
19. Input Space Mode Connectivity in Deep Neural Networks
- Author
-
Vrabel, Jakub, Shem-Ur, Ori, Oz, Yaron, and Krueger, David
- Subjects
Computer Science - Machine Learning ,Condensed Matter - Statistical Mechanics ,Computer Science - Computer Vision and Pattern Recognition - Abstract
We extend the concept of loss landscape mode connectivity to the input space of deep neural networks. Mode connectivity was originally studied within parameter space, where it describes the existence of low-loss paths between different solutions (loss minimizers) obtained through gradient descent. We present theoretical and empirical evidence of its presence in the input space of deep networks, thereby highlighting the broader nature of the phenomenon. We observe that different input images with similar predictions are generally connected, and for trained models, the path tends to be simple, with only a small deviation from being a linear path. Our methodology utilizes real, interpolated, and synthetic inputs created using the input optimization technique for feature visualization. We conjecture that input space mode connectivity in high-dimensional spaces is a geometric effect that takes place even in untrained models and can be explained through percolation theory. We exploit mode connectivity to obtain new insights about adversarial examples and demonstrate its potential for adversarial detection. Additionally, we discuss applications for the interpretability of deep networks.
- Published
- 2024
20. Protecting against simultaneous data poisoning attacks
- Author
-
Alex, Neel, Siddiqui, Shoaib Ahmed, Sanyal, Amartya, and Krueger, David
- Subjects
Computer Science - Machine Learning - Abstract
Current backdoor defense methods are evaluated against a single attack at a time. This is unrealistic, as powerful machine learning systems are trained on large datasets scraped from the internet, which may be attacked multiple times by one or more attackers. We demonstrate that simultaneously executed data poisoning attacks can effectively install multiple backdoors in a single model without substantially degrading clean accuracy. Furthermore, we show that existing backdoor defense methods do not effectively prevent attacks in this setting. Finally, we leverage insights into the nature of backdoor attacks to develop a new defense, BaDLoss, that is effective in the multi-attack setting. With minimal clean accuracy degradation, BaDLoss attains an average attack success rate in the multi-attack setting of 7.98% in CIFAR-10 and 10.29% in GTSRB, compared to the average of other defenses at 64.48% and 84.28% respectively.
- Published
- 2024
21. A deeper look at depth pruning of LLMs
- Author
-
Siddiqui, Shoaib Ahmed, Dong, Xin, Heinrich, Greg, Breuel, Thomas, Kautz, Jan, Krueger, David, and Molchanov, Pavlo
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks i.e., improvement on one task may degrade performance on the other due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the self-attention layers to be more amendable to pruning, even allowing removal of upto 33% of the self-attention layers without incurring any performance degradation on MMLU for Mistral 7b (significant reduction in costly maintenance of KV-cache). Finally, we look at simple performance recovery techniques to emulate the pruned layers by training lightweight additive bias or low-rank linear adapters. Performance recovery using emulated updates avoids performance degradation for the initial blocks (up to 5% absolute improvement on MMLU), which is either competitive or superior to the learning-based technique.
- Published
- 2024
22. The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
- Author
-
Fluri, Lukas, Lang, Leon, Abate, Alessandro, Forré, Patrick, Krueger, David, and Skalse, Joar
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Statistics - Machine Learning - Abstract
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the training distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. Our theoretical results highlight the importance of developing new ways to measure the quality of learned reward models., Comment: 58 pages, 1 figure
- Published
- 2024
23. IDs for AI Systems
- Author
-
Chan, Alan, Kolt, Noam, Wills, Peter, Anwar, Usman, de Witt, Christian Schroeder, Rajkumar, Nitarshan, Hammond, Lewis, Krueger, David, Heim, Lennart, and Anderljung, Markus
- Subjects
Computer Science - Artificial Intelligence - Abstract
AI systems are increasingly pervasive, yet information needed to decide whether and how to engage with them may not exist or be accessible. A user may not be able to verify whether a system has certain safety certifications. An investigator may not know whom to investigate when a system causes an incident. It may not be clear whom to contact to shut down a malfunctioning system. Across a number of domains, IDs address analogous problems by identifying particular entities (e.g., a particular Boeing 747) and providing information about other entities of the same class (e.g., some or all Boeing 747s). We propose a framework in which IDs are ascribed to instances of AI systems (e.g., a particular chat session with Claude 3), and associated information is accessible to parties seeking to interact with that system. We characterize IDs for AI systems, provide concrete examples where IDs could be useful, argue that there could be significant demand for IDs from key actors, analyze how those actors could incentivize ID adoption, explore a potential implementation of our framework for deployers of AI systems, and highlight limitations and risks. IDs seem most warranted in settings where AI systems could have a large impact upon the world, such as in making financial transactions or contacting real humans. With further study, IDs could help to manage a world where AI systems pervade society., Comment: Under review; accepted to RegML workshop at NeurIPS 2024
- Published
- 2024
24. The Place of Stone: Dighton Rock and the Erasure of America’s Indigenous Past by Douglas Hunter (review)
- Author
-
Krueger, David
- Published
- 2020
25. Stress-Testing Capability Elicitation With Password-Locked Models
- Author
-
Greenblatt, Ryan, Roger, Fabien, Krasheninnikov, Dmitrii, and Krueger, David
- Subjects
Computer Science - Machine Learning ,Computer Science - Computation and Language - Abstract
To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. More surprisingly, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords. Furthermore, when only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities. Overall, our findings suggest that fine-tuning is an effective method of eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available, e.g. as may be the case when models' (hidden) capabilities exceed those of human demonstrators.
- Published
- 2024
26. Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Author
-
Anwar, Usman, Saparov, Abulhair, Rando, Javier, Paleka, Daniel, Turpin, Miles, Hase, Peter, Lubana, Ekdeep Singh, Jenner, Erik, Casper, Stephen, Sourbut, Oliver, Edelman, Benjamin L., Zhang, Zhaowei, Günther, Mario, Korinek, Anton, Hernandez-Orallo, Jose, Hammond, Lewis, Bigelow, Eric, Pan, Alexander, Langosco, Lauro, Korbak, Tomasz, Zhang, Heidi, Zhong, Ruiqi, hÉigeartaigh, Seán Ó, Recchia, Gabriel, Corsi, Giulio, Chan, Alan, Anderljung, Markus, Edwards, Lilian, Petrov, Aleksandar, de Witt, Christian Schroeder, Motwan, Sumeet Ramesh, Bengio, Yoshua, Chen, Danqi, Torr, Philip H. S., Albanie, Samuel, Maharaj, Tegan, Foerster, Jakob, Tramer, Florian, He, He, Kasirzadeh, Atoosa, Choi, Yejin, and Krueger, David
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Computers and Society - Abstract
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
- Published
- 2024
27. Affirmative safety: An approach to risk management for high-risk AI
- Author
-
Wasil, Akash R., Clymer, Joshua, Krueger, David, Dardaman, Emily, Campos, Simeon, and Murphy, Evan R.
- Subjects
Computer Science - Computers and Society ,Computer Science - Artificial Intelligence - Abstract
Prominent AI experts have suggested that companies developing high-risk AI systems should be required to show that such systems are safe before they can be developed or deployed. The goal of this paper is to expand on this idea and explore its implications for risk management. We argue that entities developing or deploying high-risk AI systems should be required to present evidence of affirmative safety: a proactive case that their activities keep risks below acceptable thresholds. We begin the paper by highlighting global security risks from AI that have been acknowledged by AI experts and world governments. Next, we briefly describe principles of risk management from other high-risk fields (e.g., nuclear safety). Then, we propose a risk management approach for advanced AI in which model developers must provide evidence that their activities keep certain risks below regulator-set thresholds. As a first step toward understanding what affirmative safety cases should include, we illustrate how certain kinds of technical evidence and operational evidence can support an affirmative safety case. In the technical section, we discuss behavioral evidence (evidence about model outputs), cognitive evidence (evidence about model internals), and developmental evidence (evidence about the training process). In the operational section, we offer examples of organizational practices that could contribute to affirmative safety cases: information security practices, safety culture, and emergency response capacity. Finally, we briefly compare our approach to the NIST AI Risk Management Framework. Overall, we hope our work contributes to ongoing discussions about national and global security risks posed by AI and regulatory approaches to address these risks.
- Published
- 2024
28. Safety Cases: How to Justify the Safety of Advanced AI Systems
- Author
-
Clymer, Joshua, Gabrieli, Nick, Krueger, David, and Larsen, Thomas
- Subjects
Computer Science - Computers and Society ,Computer Science - Artificial Intelligence - Abstract
As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them. To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and -- if AI systems become much more powerful -- deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.
- Published
- 2024
29. A Generative Model of Symmetry Transformations
- Author
-
Allingham, James Urquhart, Mlodozeniec, Bruno Kacper, Padhy, Shreyas, Antorán, Javier, Krueger, David, Turner, Richard E., Nalisnick, Eric, and Hernández-Lobato, José Miguel
- Subjects
Computer Science - Machine Learning - Abstract
Correctly capturing the symmetry transformations of data can lead to efficient models with strong generalization capabilities, though methods incorporating symmetries often require prior knowledge. While recent advancements have been made in learning those symmetries directly from the dataset, most of this work has focused on the discriminative setting. In this paper, we take inspiration from group theoretic ideas to construct a generative model that explicitly aims to capture the data's approximate symmetries. This results in a model that, given a prespecified broad set of possible symmetries, learns to what extent, if at all, those symmetries are actually present. Our model can be seen as a generative process for data augmentation. We provide a simple algorithm for learning our generative model and empirically demonstrate its ability to capture symmetries under affine and color transformations, in an interpretable way. Combining our symmetry model with standard generative models results in higher marginal test-log-likelihoods and improved data efficiency.
- Published
- 2024
30. Black-Box Access is Insufficient for Rigorous AI Audits
- Author
-
Casper, Stephen, Ezell, Carson, Siegmann, Charlotte, Kolt, Noam, Curtis, Taylor Lynn, Bucknall, Benjamin, Haupt, Andreas, Wei, Kevin, Scheurer, Jérémy, Hobbhahn, Marius, Sharkey, Lee, Krishna, Satyapriya, Von Hagen, Marvin, Alberti, Silas, Chan, Alan, Sun, Qinyi, Gerovitch, Michael, Bau, David, Tegmark, Max, Krueger, David, and Hadfield-Menell, Dylan
- Subjects
Computer Science - Computers and Society ,Computer Science - Artificial Intelligence ,Computer Science - Cryptography and Security - Abstract
External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. Meanwhile, outside-the-box access to training and deployment information (e.g., methodology, code, documentation, data, deployment details, findings from internal evaluations) allows auditors to scrutinize the development process and design more targeted evaluations. In this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. We also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. Given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone., Comment: FAccT 2024
- Published
- 2024
- Full Text
- View/download PDF
31. Visibility into AI Agents
- Author
-
Chan, Alan, Ezell, Carson, Kaufmann, Max, Wei, Kevin, Hammond, Lewis, Bradley, Herbie, Bluemke, Emma, Rajkumar, Nitarshan, Krueger, David, Kolt, Noam, Heim, Lennart, and Anderljung, Markus
- Subjects
Computer Science - Computers and Society ,Computer Science - Artificial Intelligence - Abstract
Increased delegation of commercial, scientific, governmental, and personal activities to AI agents -- systems capable of pursuing complex goals with limited supervision -- may exacerbate existing societal risks and introduce new risks. Understanding and mitigating these risks involves critically evaluating existing governance structures, revising and adapting these structures where needed, and ensuring accountability of key stakeholders. Information about where, why, how, and by whom certain AI agents are used, which we refer to as visibility, is critical to these objectives. In this paper, we assess three categories of measures to increase visibility into AI agents: agent identifiers, real-time monitoring, and activity logging. For each, we outline potential implementations that vary in intrusiveness and informativeness. We analyze how the measures apply across a spectrum of centralized through decentralized deployment contexts, accounting for various actors in the supply chain including hardware and software service providers. Finally, we discuss the implications of our measures for privacy and concentration of power. Further work into understanding the measures and mitigating their negative impacts can help to build a foundation for the governance of AI agents., Comment: Accepted to ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2024)
- Published
- 2024
32. Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
- Author
-
Chan, Alan, Bucknall, Ben, Bradley, Herbie, and Krueger, David
- Subjects
Computer Science - Machine Learning ,Computer Science - Computers and Society - Abstract
Public release of the weights of pretrained foundation models, otherwise known as downloadable access \citep{solaiman_gradient_2023}, enables fine-tuning without the prohibitive expense of pretraining. Our work argues that increasingly accessible fine-tuning of downloadable models may increase hazards. First, we highlight research to improve the accessibility of fine-tuning. We split our discussion into research that A) reduces the computational cost of fine-tuning and B) improves the ability to share that cost across more actors. Second, we argue that increasingly accessible fine-tuning methods may increase hazard through facilitating malicious use and making oversight of models with potentially dangerous capabilities more difficult. Third, we discuss potential mitigatory measures, as well as benefits of more accessible fine-tuning. Given substantial remaining uncertainty about hazards, we conclude by emphasizing the urgent need for the development of mitigations., Comment: Accepted as a spotlight workshop paper at the Socially Responsible Language Modelling Research (SoLaR) workshop, held at NeurIPS 2023
- Published
- 2023
33. Spirit in the Rock: The Fierce Battle for Modoc Homelands by Jim Compton (review)
- Author
-
Krueger, David
- Published
- 2019
34. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
- Author
-
Jain, Samyak, Kirk, Robert, Lubana, Ekdeep Singh, Dick, Robert P., Tanaka, Hidenori, Grefenstette, Edward, Rocktäschel, Tim, and Krueger, David Scott
- Subjects
Computer Science - Machine Learning - Abstract
Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability, i.e., the model begins reusing these capability after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a, e.g., superficially unrelated, downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
- Published
- 2023
35. Managing extreme AI risks amid rapid progress
- Author
-
Bengio, Yoshua, Hinton, Geoffrey, Yao, Andrew, Song, Dawn, Abbeel, Pieter, Darrell, Trevor, Harari, Yuval Noah, Zhang, Ya-Qin, Xue, Lan, Shalev-Shwartz, Shai, Hadfield, Gillian, Clune, Jeff, Maharaj, Tegan, Hutter, Frank, Baydin, Atılım Güneş, McIlraith, Sheila, Gao, Qiqi, Acharya, Ashwin, Krueger, David, Dragan, Anca, Torr, Philip, Russell, Stuart, Kahneman, Daniel, Brauner, Jan, and Mindermann, Sören
- Subjects
Computer Science - Computers and Society ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning - Abstract
Artificial Intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI's impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although researchers have warned of extreme risks from AI, there is a lack of consensus about how exactly such risks arise, and how to manage them. Society's response, despite promising first steps, is incommensurate with the possibility of rapid, transformative progress that is expected by many experts. AI safety research is lagging. Present governance initiatives lack the mechanisms and institutions to prevent misuse and recklessness, and barely address autonomous systems. In this short consensus paper, we describe extreme risks from upcoming, advanced AI systems. Drawing on lessons learned from other safety-critical technologies, we then outline a comprehensive plan combining technical research and development with proactive, adaptive governance mechanisms for a more commensurate preparation., Comment: Published in Science: https://www.science.org/doi/10.1126/science.adn0117
- Published
- 2023
- Full Text
- View/download PDF
36. Implicit meta-learning may lead language models to trust more reliable sources
- Author
-
Krasheninnikov, Dmitrii, Krasheninnikov, Egor, Mlodozeniec, Bruno, Maharaj, Tegan, and Krueger, David
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
We demonstrate that LLMs may learn indicators of document usefulness and modulate their updates accordingly. We introduce random strings ("tags") as indicators of usefulness in a synthetic fine-tuning dataset. Fine-tuning on this dataset leads to implicit meta-learning (IML): in further fine-tuning, the model updates to make more use of text that is tagged as useful. We perform a thorough empirical investigation of this phenomenon, finding (among other things) that (i) it occurs in both pretrained LLMs and those trained from scratch, as well as on a vision task, and (ii) larger models and smaller batch sizes tend to give more IML. We also use probing to examine how IML changes the way models store knowledge in their parameters. Finally, we reflect on what our results might imply about capabilities, risks, and controllability of future AI systems. Our code can be found at https://github.com/krasheninnikov/internalization.
- Published
- 2023
37. Interpreting Learned Feedback Patterns in Large Language Models
- Author
-
Marks, Luke, Abdullah, Amir, Neo, Clement, Arike, Rauno, Krueger, David, Torr, Philip, and Barez, Fazl
- Subjects
Computer Science - Machine Learning - Abstract
Reinforcement learning from human feedback (RLHF) is widely used to train large language models (LLMs). However, it is unclear whether LLMs accurately learn the underlying preferences in human feedback data. We coin the term \textit{Learned Feedback Pattern} (LFP) for patterns in an LLM's activations learned during RLHF that improve its performance on the fine-tuning task. We hypothesize that LLMs with LFPs accurately aligned to the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF. To test this, we train probes to estimate the feedback signal implicit in the activations of a fine-tuned LLM. We then compare these estimates to the true feedback, measuring how accurate the LFPs are to the fine-tuning feedback. Our probes are trained on a condensed, sparse and interpretable representation of LLM activations, making it easier to correlate features of the input with our probe's predictions. We validate our probes by comparing the neural features they correlate with positive feedback inputs against the features GPT-4 describes and classifies as related to LFPs. Understanding LFPs can help minimize discrepancies between LLM behavior and training objectives, which is essential for the safety of LLMs., Comment: 19 pages, 8 figures
- Published
- 2023
38. Reward Model Ensembles Help Mitigate Overoptimization
- Author
-
Coste, Thomas, Anwar, Usman, Kirk, Robert, and Krueger, David
- Subjects
Computer Science - Machine Learning - Abstract
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization., Comment: Accepted at ICLR 2024
- Published
- 2023
39. Thinker: Learning to Plan and Act
- Author
-
Chung, Stephen, Anokhin, Ivan, and Krueger, David
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Machine Learning ,I.2.6 ,I.2.8 ,I.5.1 - Abstract
We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for handcrafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent's plan with visualization. We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments., Comment: 38 pages
- Published
- 2023
40. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
- Author
-
Casper, Stephen, Davies, Xander, Shi, Claudia, Gilbert, Thomas Krendl, Scheurer, Jérémy, Rando, Javier, Freedman, Rachel, Korbak, Tomasz, Lindner, David, Freire, Pedro, Wang, Tony, Marks, Samuel, Segerie, Charbel-Raphaël, Carroll, Micah, Peng, Andi, Christoffersen, Phillip, Damani, Mehul, Slocum, Stewart, Anwar, Usman, Siththaranjan, Anand, Nadeau, Max, Michaud, Eric J., Pfau, Jacob, Krasheninnikov, Dmitrii, Chen, Xin, Langosco, Lauro, Hase, Peter, Bıyık, Erdem, Dragan, Anca, Krueger, David, Sadigh, Dorsa, and Hadfield-Menell, Dylan
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Machine Learning - Abstract
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.
- Published
- 2023
41. Investigating the Nature of 3D Generalization in Deep Neural Networks
- Author
-
Siddiqui, Shoaib Ahmed, Krueger, David, and Breuel, Thomas
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Visual object recognition systems need to generalize from a set of 2D training views to novel views. The question of how the human visual system can generalize to novel views has been studied and modeled in psychology, computer vision, and neuroscience. Modern deep learning architectures for object recognition generalize well to novel views, but the mechanisms are not well understood. In this paper, we characterize the ability of common deep learning architectures to generalize to novel views. We formulate this as a supervised classification task where labels correspond to unique 3D objects and examples correspond to 2D views of the objects at different 3D orientations. We consider three common models of generalization to novel views: (i) full 3D generalization, (ii) pure 2D matching, and (iii) matching based on a linear combination of views. We find that deep models generalize well to novel views, but they do so in a way that differs from all these existing models. Extrapolation to views beyond the range covered by views in the training set is limited, and extrapolation to novel rotation axes is even more limited, implying that the networks do not infer full 3D structure, nor use linear interpolation. Yet, generalization is far superior to pure 2D matching. These findings help with designing datasets with 2D views required to achieve 3D generalization. Code to reproduce our experiments is publicly available: https://github.com/shoaibahmed/investigating_3d_generalization.git, Comment: 15 pages, 15 figures, CVPR format
- Published
- 2023
42. Geometrical torque on magnetic moments coupled to a correlated antiferromagnet
- Author
-
Lenzing, Nicolas, Krüger, David, and Potthoff, Michael
- Subjects
Condensed Matter - Strongly Correlated Electrons ,Condensed Matter - Mesoscale and Nanoscale Physics - Abstract
The geometrical spin torque mediates an indirect interaction of magnetic moments, which are weakly exchange coupled to a system of itinerant electrons. It originates from a finite spin-Berry curvature and leads to a non-Hamiltonian magnetic-moment dynamics. We demonstrate that there is an unprecedentedly strong geometrical spin torque in case of an electron system, where correlations cause antiferromagnetic long-range order. The key observation is that the anomalous torque is strongly boosted by low-energy magnon modes emerging in the two-electron spin-excitation spectrum due to spontaneous breaking of SU(2) spin-rotation symmetry. As long as single-electron excitations are gapped out, the effect is largely universal, i.e., essentially independent of the details of the electronic structure, but decisively dependent on the lattice dimension and spatial and spin anisotropies. Analogous to the reasoning that leads to the Mermin-Wagner theorem, there is a lower critical dimension at and below which the spin-Berry curvature diverges., Comment: 5 pages, 3 figures, suppl. material
- Published
- 2023
- Full Text
- View/download PDF
43. Characterizing Manipulation from AI Systems
- Author
-
Carroll, Micah, Chan, Alan, Ashton, Henry, and Krueger, David
- Subjects
Computer Science - Computers and Society - Abstract
Manipulation is a common concern in many domains, such as social media, advertising, and chatbots. As AI systems mediate more of our interactions with the world, it is important to understand the degree to which AI systems might manipulate humans without the intent of the system designers. Our work clarifies challenges in defining and measuring manipulation in the context of AI systems. Firstly, we build upon prior literature on manipulation from other fields and characterize the space of possible notions of manipulation, which we find to depend upon the concepts of incentives, intent, harm, and covertness. We review proposals on how to operationalize each factor. Second, we propose a definition of manipulation based on our characterization: a system is manipulative if it acts as if it were pursuing an incentive to change a human (or another agent) intentionally and covertly. Third, we discuss the connections between manipulation and related concepts, such as deception and coercion. Finally, we contextualize our operationalization of manipulation in some applications. Our overall assessment is that while some progress has been made in defining and measuring manipulation from AI systems, many gaps remain. In the absence of a consensus definition and reliable tools for measurement, we cannot rule out the possibility that AI systems learn to manipulate humans without the intent of the system designers. We argue that such manipulation poses a significant threat to human autonomy, suggesting that precautionary actions to mitigate it are warranted., Comment: Presented at EAAMO 2023; The first two authors contributed equally; author order was decided with a coin flip
- Published
- 2023
44. Unifying Grokking and Double Descent
- Author
-
Davies, Xander, Langosco, Lauro, and Krueger, David
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
A principled understanding of generalization in deep learning may require unifying disparate observations under a single conceptual framework. Previous work has studied \emph{grokking}, a training dynamic in which a sustained period of near-perfect training performance and near-chance test performance is eventually followed by generalization, as well as the superficially similar \emph{double descent}. These topics have so far been studied in isolation. We hypothesize that grokking and double descent can be understood as instances of the same learning dynamics within a framework of pattern learning speeds. We propose that this framework also applies when varying model capacity instead of optimization steps, and provide the first demonstration of model-wise grokking., Comment: ML Safety Workshop, 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
- Published
- 2023
45. Harms from Increasingly Agentic Algorithmic Systems
- Author
-
Chan, Alan, Salganik, Rebecca, Markelius, Alva, Pang, Chris, Rajkumar, Nitarshan, Krasheninnikov, Dmitrii, Langosco, Lauro, He, Zhonghao, Duan, Yawen, Carroll, Micah, Lin, Michelle, Mayhew, Alex, Collins, Katherine, Molamohammadi, Maryam, Burden, John, Zhao, Wanru, Rismani, Shalaleh, Voudouris, Konstantinos, Bhatt, Umang, Weller, Adrian, Krueger, David, and Maharaj, Tegan
- Subjects
Computer Science - Computers and Society - Abstract
Research in Fairness, Accountability, Transparency, and Ethics (FATE) has established many sources and forms of algorithmic harm, in domains as diverse as health care, finance, policing, and recommendations. Much work remains to be done to mitigate the serious harms of these systems, particularly those disproportionately affecting marginalized communities. Despite these ongoing harms, new systems are being developed and deployed which threaten the perpetuation of the same harms and the creation of novel ones. In response, the FATE community has emphasized the importance of anticipating harms. Our work focuses on the anticipation of harms from increasingly agentic systems. Rather than providing a definition of agency as a binary property, we identify 4 key characteristics which, particularly in combination, tend to increase the agency of a given algorithmic system: underspecification, directness of impact, goal-directedness, and long-term planning. We also discuss important harms which arise from increasing agency -- notably, these include systemic and/or long-range impacts, often on marginalized stakeholders. We emphasize that recognizing agency of algorithmic systems does not absolve or shift the human responsibility for algorithmic harms. Rather, we use the term agency to highlight the increasingly evident fact that ML systems are not fully under human control. Our work explores increasingly agentic algorithmic systems in three parts. First, we explain the notion of an increase in agency for algorithmic systems in the context of diverse perspectives on agency across disciplines. Second, we argue for the need to anticipate harms from increasingly agentic systems. Third, we discuss important harms from increasingly agentic systems and ways forward for addressing them. We conclude by reflecting on implications of our work for anticipating algorithmic harms from emerging systems., Comment: Accepted at FAccT 2023
- Published
- 2023
- Full Text
- View/download PDF
46. Blockwise Self-Supervised Learning at Scale
- Author
-
Siddiqui, Shoaib Ahmed, Krueger, David, LeCun, Yann, and Deny, Stéphane
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.
- Published
- 2023
47. On The Fragility of Learned Reward Functions
- Author
-
McKinney, Lev, Duan, Yawen, Krueger, David, and Gleave, Adam
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning have mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that are not capable of training new policies from scratch and thus do not capture the intended behavior. Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning. We demonstrate with experiments in tabular and continuous control environments that the severity of relearning failures can be sensitive to changes in reward model design and the trajectory dataset composition. Based on our findings, we emphasize the need for more retraining-based evaluations in the literature., Comment: 5 pages, 2 figures, presented at the NeurIPS Deep RL and ML Safety Workshops
- Published
- 2023
48. Domain Generalization for Robust Model-Based Offline Reinforcement Learning
- Author
-
Clark, Alan, Siddiqui, Shoaib Ahmed, Kirk, Robert, Anwar, Usman, Chung, Stephen, and Krueger, David
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Statistics - Machine Learning - Abstract
Existing offline reinforcement learning (RL) algorithms typically assume that training data is either: 1) generated by a known policy, or 2) of entirely unknown origin. We consider multi-demonstrator offline RL, a middle ground where we know which demonstrators generated each dataset, but make no assumptions about the underlying policies of the demonstrators. This is the most natural setting when collecting data from multiple human operators, yet remains unexplored. Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain. Specifically, we propose Domain-Invariant Model-based Offline RL (DIMORL), where we apply Risk Extrapolation (REx) (Krueger et al., 2020) to the process of learning dynamics and rewards models. Our results show that models trained with REx exhibit improved domain generalization performance when compared with the natural baseline of pooling all demonstrators' data. We observe that the resulting models frequently enable the learning of superior policies in the offline model-based RL setting, can improve the stability of the policy learning process, and potentially enable increased exploration., Comment: Accepted to the NeurIPS 2022 Workshops on Distribution Shifts and Offline Reinforcement Learning
- Published
- 2022
49. Mechanistic Mode Connectivity
- Author
-
Lubana, Ekdeep Singh, Bigelow, Eric J., Dick, Robert P., Krueger, David, and Tanaka, Hidenori
- Subjects
Computer Science - Machine Learning ,Computer Science - Computer Vision and Pattern Recognition - Abstract
We study neural network loss landscapes through the lens of mode connectivity, the observation that minimizers of neural networks retrieved via training on a dataset are connected via simple paths of low loss. Specifically, we ask the following question: are minimizers that rely on different mechanisms for making their predictions connected via simple paths of low loss? We provide a definition of mechanistic similarity as shared invariances to input transformations and demonstrate that lack of linear connectivity between two models implies they use dissimilar mechanisms for making their predictions. Relevant to practice, this result helps us demonstrate that naive fine-tuning on a downstream dataset can fail to alter a model's mechanisms, e.g., fine-tuning can fail to eliminate a model's reliance on spurious attributes. Our analysis also motivates a method for targeted alteration of a model's mechanisms, named connectivity-based fine-tuning (CBFT), which we analyze using several synthetic datasets for the task of reducing a model's reliance on spurious attributes., Comment: ICML, 2023
- Published
- 2022
50. Broken Neural Scaling Laws
- Author
-
Caballero, Ethan, Gupta, Kshitij, Rish, Irina, and Krueger, David
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
We present a smoothly broken power law functional form (that we refer to as a Broken Neural Scaling Law (BNSL)) that accurately models & extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as amount of compute used for training (or inference), number of model parameters, training dataset size, model input size, number of training steps, or upstream performance varies) for various architectures & for each of various tasks within a large & diverse set of upstream & downstream tasks, in zero-shot, prompted, & finetuned settings. This set includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, OOD detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, "emergent phase transitions", arithmetic, supervised learning, unsupervised/self-supervised learning, & reinforcement learning (single agent & multi-agent). When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set. Moreover, this functional form accurately models & extrapolates scaling behavior that other functional forms are incapable of expressing such as the nonmonotonic transitions present in the scaling behavior of phenomena such as double descent & the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws, Comment: Published as a conference paper at International Conference on Learning Representations (ICLR) 2023
- Published
- 2022
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.