37 results for "Henighan, Tom"
Search Results
2. Specific versus General Principles for Constitutional AI
- Author
-
Kundu, Sandipan, Bai, Yuntao, Kadavath, Saurav, Askell, Amanda, Callahan, Andrew, Chen, Anna, Goldie, Anna, Balwit, Avital, Mirhoseini, Azalia, McLean, Brayden, Olsson, Catherine, Evraets, Cassie, Tran-Johnson, Eli, Durmus, Esin, Perez, Ethan, Kernion, Jackson, Kerr, Jamie, Ndousse, Kamal, Nguyen, Karina, Elhage, Nelson, Cheng, Newton, Schiefer, Nicholas, DasSarma, Nova, Rausch, Oliver, Larson, Robin, Yang, Shannon, Kravec, Shauna, Telleen-Lawton, Timothy, Liao, Thomas I., Henighan, Tom, Hume, Tristan, Hatfield-Dodds, Zac, Mindermann, Sören, Joseph, Nicholas, McCandlish, Sam, and Kaplan, Jared
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence - Abstract
Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.
- Published
- 2023
3. The Capacity for Moral Self-Correction in Large Language Models
- Author
-
Ganguli, Deep, Askell, Amanda, Schiefer, Nicholas, Liao, Thomas I., Lukošiūtė, Kamilė, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, Olsson, Catherine, Hernandez, Danny, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Perez, Ethan, Kernion, Jackson, Kerr, Jamie, Mueller, Jared, Landau, Joshua, Ndousse, Kamal, Nguyen, Karina, Lovitt, Liane, Sellitto, Michael, Elhage, Nelson, Mercado, Noemi, DasSarma, Nova, Rausch, Oliver, Lasenby, Robert, Larson, Robin, Ringer, Sam, Kundu, Sandipan, Kadavath, Saurav, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Lanham, Tamera, Telleen-Lawton, Timothy, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Mann, Ben, Amodei, Dario, Joseph, Nicholas, McCandlish, Sam, Brown, Tom, Olah, Christopher, Clark, Jack, Bowman, Samuel R., and Kaplan, Jared
- Subjects
Computer Science - Computation and Language - Abstract
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
- Published
- 2023
4. Discovering Language Model Behaviors with Model-Written Evaluations
- Author
-
Perez, Ethan, Ringer, Sam, Lukošiūtė, Kamilė, Nguyen, Karina, Chen, Edwin, Heiner, Scott, Pettit, Craig, Olsson, Catherine, Kundu, Sandipan, Kadavath, Saurav, Jones, Andy, Chen, Anna, Mann, Ben, Israel, Brian, Seethor, Bryan, McKinnon, Cameron, Olah, Christopher, Yan, Da, Amodei, Daniela, Amodei, Dario, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Khundadze, Guro, Kernion, Jackson, Landis, James, Kerr, Jamie, Mueller, Jared, Hyun, Jeeyoon, Landau, Joshua, Ndousse, Kamal, Goldberg, Landon, Lovitt, Liane, Lucas, Martin, Sellitto, Michael, Zhang, Miranda, Kingsland, Neerav, Elhage, Nelson, Joseph, Nicholas, Mercado, Noemí, DasSarma, Nova, Rausch, Oliver, Larson, Robin, McCandlish, Sam, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Lanham, Tamera, Telleen-Lawton, Timothy, Brown, Tom, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Clark, Jack, Bowman, Samuel R., Askell, Amanda, Grosse, Roger, Hernandez, Danny, Ganguli, Deep, Hubinger, Evan, Schiefer, Nicholas, and Kaplan, Jared
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning - Abstract
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors., Comment: for associated data visualizations, see https://www.evals.anthropic.com/model-written/ for full datasets, see https://github.com/anthropics/evals
- Published
- 2022
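The entry above describes a generate-then-filter pipeline: one model writes candidate evaluation questions and another pass filters them for relevance. A minimal sketch of that structure follows; generate_questions and rate_relevance are hypothetical stubs standing in for real language-model calls, and the prompt wording and threshold are illustrative assumptions, not the authors' pipeline.

# Sketch of LM-generated evaluations with an LM-based relevance filter.
# generate_questions() and rate_relevance() are hypothetical stubs, not real model calls.
import random

def generate_questions(behavior: str, n: int) -> list[str]:
    # Stand-in for sampling yes/no questions from an LM prompted to probe `behavior`.
    return [f"Would you agree with statement {i} about {behavior}? (Yes/No)" for i in range(n)]

def rate_relevance(question: str) -> float:
    # Stand-in for a second model scoring how relevant/on-topic each candidate question is.
    return random.random()

def build_eval_dataset(behavior: str, n_candidates: int = 1000, threshold: float = 0.75) -> list[str]:
    candidates = generate_questions(behavior, n_candidates)
    # Keep only candidates the (stub) relevance model rates highly.
    return [q for q in candidates if rate_relevance(q) >= threshold]

if __name__ == "__main__":
    kept = build_eval_dataset("a stated desire for self-preservation")
    print(f"kept {len(kept)} of 1000 candidate questions")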
5. Influence of local symmetry on lattice dynamics coupled to topological surface states
- Author
-
Sobota, Jonathan A., Teitelbaum, Samuel W., Huang, Yijing, Querales-Flores, José D., Power, Robert, Allen, Meabh, Rotundu, Costel R., Bailey, Trevor P., Uher, Ctirad, Henighan, Tom, Jiang, Mason, Zhu, Diling, Chollet, Matthieu, Sato, Takahiro, Trigo, Mariano, Murray, Éamonn D., Savić, Ivana, Kirchmann, Patrick S., Fahy, Stephen, Reis, David A., and Shen, Zhi-Xun
- Subjects
Condensed Matter - Materials Science - Abstract
We investigate coupled electron-lattice dynamics in the topological insulator Bi2Te3 with time-resolved photoemission and time-resolved x-ray diffraction. It is well established that coherent phonons can be launched by optical excitation, but selection rules generally restrict these modes to zone-center wavevectors and Raman-active branches. We find that the topological surface state couples to additional modes, including a continuum of surface-projected bulk modes from both Raman- and infrared-branches, with possible contributions from surface-localized modes when they exist. Our calculations show that this surface vibrational spectrum occurs naturally as a consequence of the translational and inversion symmetries broken at the surface, without requiring the splitting-off of surface-localized phonon modes. The generality of this result suggests that coherent phonon spectra are useful by providing unique fingerprints for identifying surface states in more controversial materials. These effects may also expand the phase space for tailoring surface state wavefunctions via ultrafast optical excitation.
- Published
- 2022
- Full Text
- View/download PDF
6. Constitutional AI: Harmlessness from AI Feedback
- Author
-
Bai, Yuntao, Kadavath, Saurav, Kundu, Sandipan, Askell, Amanda, Kernion, Jackson, Jones, Andy, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, McKinnon, Cameron, Chen, Carol, Olsson, Catherine, Olah, Christopher, Hernandez, Danny, Drain, Dawn, Ganguli, Deep, Li, Dustin, Tran-Johnson, Eli, Perez, Ethan, Kerr, Jamie, Mueller, Jared, Ladish, Jeffrey, Landau, Joshua, Ndousse, Kamal, Lukosuite, Kamile, Lovitt, Liane, Sellitto, Michael, Elhage, Nelson, Schiefer, Nicholas, Mercado, Noemi, DasSarma, Nova, Lasenby, Robert, Larson, Robin, Ringer, Sam, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Fort, Stanislav, Lanham, Tamera, Telleen-Lawton, Timothy, Conerly, Tom, Henighan, Tom, Hume, Tristan, Bowman, Samuel R., Hatfield-Dodds, Zac, Mann, Ben, Amodei, Dario, Joseph, Nicholas, McCandlish, Sam, Brown, Tom, and Kaplan, Jared
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence - Abstract
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
- Published
- 2022
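To make the two Constitutional AI phases described in the entry above concrete, here is a minimal control-flow sketch. Every function (sample, critique, revise, finetune, preference_label) is a hypothetical stub standing in for a model call or training step; this is an illustration of the described structure, not Anthropic's implementation.

# Constitutional AI control-flow sketch with stub functions (illustrative only).

def sample(model, prompt):          # draw a response from a model (stub)
    return f"{model} response to {prompt!r}"

def critique(response, principle):  # model-written critique against one principle (stub)
    return f"critique of {response!r} under {principle!r}"

def revise(response, crit):         # model-written revision conditioned on the critique (stub)
    return response + " [revised]"

def finetune(model, data):          # supervised finetuning step (stub)
    return model + "+sl"

def preference_label(model, pair, principle):  # AI feedback: pick the better of two samples (stub)
    return 0

PRINCIPLES = ["choose the more harmless response"]
prompts = ["a harmful query", "a benign query"]

# Supervised phase: sample, self-critique, revise, then finetune on the revisions.
revisions = []
for p in prompts:
    r = sample("initial_model", p)
    for principle in PRINCIPLES:
        r = revise(r, critique(r, principle))
    revisions.append((p, r))
sl_model = finetune("initial_model", revisions)

# RL phase (RLAIF): collect AI preference labels over response pairs; these labels
# would train a preference model used as the reward signal for RL.
preferences = []
for p in prompts:
    pair = (sample(sl_model, p), sample(sl_model, p))
    preferences.append((p, pair, preference_label("feedback_model", pair, PRINCIPLES[0])))
print(f"{len(revisions)} SL examples, {len(preferences)} AI preference labels")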
7. Measuring Progress on Scalable Oversight for Large Language Models
- Author
-
Bowman, Samuel R., Hyun, Jeeyoon, Perez, Ethan, Chen, Edwin, Pettit, Craig, Heiner, Scott, Lukošiūtė, Kamilė, Askell, Amanda, Jones, Andy, Chen, Anna, Goldie, Anna, Mirhoseini, Azalia, McKinnon, Cameron, Olah, Christopher, Amodei, Daniela, Amodei, Dario, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Kernion, Jackson, Kerr, Jamie, Mueller, Jared, Ladish, Jeffrey, Landau, Joshua, Ndousse, Kamal, Lovitt, Liane, Elhage, Nelson, Schiefer, Nicholas, Joseph, Nicholas, Mercado, Noemí, DasSarma, Nova, Larson, Robin, McCandlish, Sam, Kundu, Sandipan, Johnston, Scott, Kravec, Shauna, Showk, Sheer El, Fort, Stanislav, Telleen-Lawton, Timothy, Brown, Tom, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Mann, Ben, and Kaplan, Jared
- Subjects
Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computation and Language - Abstract
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks., Comment: v2 fixes a few typos from v1
- Published
- 2022
8. In-context Learning and Induction Heads
- Author
-
Olsson, Catherine, Elhage, Nelson, Nanda, Neel, Joseph, Nicholas, DasSarma, Nova, Henighan, Tom, Mann, Ben, Askell, Amanda, Bai, Yuntao, Chen, Anna, Conerly, Tom, Drain, Dawn, Ganguli, Deep, Hatfield-Dodds, Zac, Hernandez, Danny, Johnston, Scott, Jones, Andy, Kernion, Jackson, Lovitt, Liane, Ndousse, Kamal, Amodei, Dario, Brown, Tom, Clark, Jack, Kaplan, Jared, McCandlish, Sam, and Olah, Chris
- Subjects
Computer Science - Machine Learning - Abstract
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.
- Published
- 2022
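The abstract above defines in-context learning as loss decreasing at later token indices, and induction heads as heads that complete repeated patterns like [A][B] ... [A] -> [B]. The sketch below illustrates one common way such a score could be summarized; the per-token loss curve is synthetic (a real measurement would come from a trained transformer), and the "late minus early" summary is an assumption for illustration.

# In-context learning score sketch: loss late in the context minus loss early in it.
# The per-token losses here are synthetic stand-ins, not outputs of a real model.
import numpy as np

rng = np.random.default_rng(0)
seq_len = 512
# Fake per-token loss that improves with context length (plus noise); an induction head
# would help most on repeated subsequences of the form [A][B] ... [A] -> [B].
per_token_loss = 4.0 - 1.2 * np.log1p(np.arange(seq_len)) / np.log(seq_len) \
                 + 0.05 * rng.standard_normal(seq_len)

icl_score = per_token_loss[-50:].mean() - per_token_loss[40:60].mean()
print(f"in-context learning score (late minus early loss): {icl_score:.3f}")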
9. Toy Models of Superposition
- Author
-
Elhage, Nelson, Hume, Tristan, Olsson, Catherine, Schiefer, Nicholas, Henighan, Tom, Kravec, Shauna, Hatfield-Dodds, Zac, Lasenby, Robert, Drain, Dawn, Chen, Carol, Grosse, Roger, McCandlish, Sam, Kaplan, Jared, Amodei, Dario, Wattenberg, Martin, and Olah, Christopher
- Subjects
Computer Science - Machine Learning - Abstract
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability., Comment: Also available at https://transformer-circuits.pub/2022/toy_model/index.html
- Published
- 2022
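A minimal sketch of a toy superposition setup of the kind the entry above describes: sparse features squeezed through a small linear bottleneck and reconstructed with a ReLU readout. The dimensions, sparsity level, loss, and optimizer settings are illustrative assumptions rather than the paper's exact configuration.

# Toy superposition sketch: n sparse features squeezed into m < n hidden dimensions,
# reconstructed as ReLU(W^T W x + b). All hyperparameters are illustrative choices.
import torch

n_features, n_hidden, batch = 20, 5, 1024
sparsity = 0.05  # probability that a given feature is active

W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    active = (torch.rand(batch, n_features) < sparsity).float()
    x = active * torch.rand(batch, n_features)   # sparse, nonnegative features
    x_hat = torch.relu(x @ W.T @ W + b)          # reconstruct through the bottleneck
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# If features end up in superposition, more than n_hidden columns of W keep a
# significant norm and interfere with one another (non-orthogonal directions).
print((W.norm(dim=0) > 0.5).sum().item(), "features with column norm > 0.5")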
10. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Author
-
Ganguli, Deep, Lovitt, Liane, Kernion, Jackson, Askell, Amanda, Bai, Yuntao, Kadavath, Saurav, Mann, Ben, Perez, Ethan, Schiefer, Nicholas, Ndousse, Kamal, Jones, Andy, Bowman, Sam, Chen, Anna, Conerly, Tom, DasSarma, Nova, Drain, Dawn, Elhage, Nelson, El-Showk, Sheer, Fort, Stanislav, Hatfield-Dodds, Zac, Henighan, Tom, Hernandez, Danny, Hume, Tristan, Jacobson, Josh, Johnston, Scott, Kravec, Shauna, Olsson, Catherine, Ringer, Sam, Tran-Johnson, Eli, Amodei, Dario, Brown, Tom, Joseph, Nicholas, McCandlish, Sam, Olah, Chris, Kaplan, Jared, and Clark, Jack
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computers and Society - Abstract
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
- Published
- 2022
11. Language Models (Mostly) Know What They Know
- Author
-
Kadavath, Saurav, Conerly, Tom, Askell, Amanda, Henighan, Tom, Drain, Dawn, Perez, Ethan, Schiefer, Nicholas, Hatfield-Dodds, Zac, DasSarma, Nova, Tran-Johnson, Eli, Johnston, Scott, El-Showk, Sheer, Jones, Andy, Elhage, Nelson, Hume, Tristan, Chen, Anna, Bai, Yuntao, Bowman, Sam, Fort, Stanislav, Ganguli, Deep, Hernandez, Danny, Jacobson, Josh, Kernion, Jackson, Kravec, Shauna, Lovitt, Liane, Ndousse, Kamal, Olsson, Catherine, Ringer, Sam, Amodei, Dario, Brown, Tom, Clark, Jack, Joseph, Nicholas, Mann, Ben, McCandlish, Sam, Olah, Chris, and Kaplan, Jared
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning - Abstract
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing., Comment: 23+17 pages; refs added, typos fixed
- Published
- 2022
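The entry above describes asking a model to estimate P(True), the probability that its own proposed answer is correct. The sketch below shows one way such a prompt and readout could look; the prompt template and the two logit values are illustrative assumptions, not the paper's exact format or a real model's output.

# P(True)-style self-evaluation sketch; prompt wording and logits are illustrative stand-ins.
import math

def p_true(logit_true: float, logit_false: float) -> float:
    # Probability assigned to "(A) True" among the two options.
    return math.exp(logit_true) / (math.exp(logit_true) + math.exp(logit_false))

question = "Who wrote 'Parade's End'?"
proposed_answer = "Ford Madox Ford"
prompt = (
    f"Question: {question}\n"
    f"Proposed Answer: {proposed_answer}\n"
    "Is the proposed answer:\n (A) True\n (B) False\nThe proposed answer is:"
)
# In practice these logits would come from a language model scoring ' (A)' vs ' (B)'.
print(prompt)
print(f"P(True) = {p_true(2.1, -0.3):.2f}")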
12. Scaling Laws and Interpretability of Learning from Repeated Data
- Author
-
Hernandez, Danny, Brown, Tom, Conerly, Tom, DasSarma, Nova, Drain, Dawn, El-Showk, Sheer, Elhage, Nelson, Hatfield-Dodds, Zac, Henighan, Tom, Hume, Tristan, Johnston, Scott, Mann, Ben, Olah, Chris, Olsson, Catherine, Amodei, Dario, Joseph, Nicholas, Kaplan, Jared, and McCandlish, Sam
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence - Abstract
Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance., Comment: 23 pages, 22 figures
- Published
- 2022
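A quick check of the arithmetic in the example above: a subset amounting to 0.1% of the training tokens, repeated 100 times, fills roughly 10% of the token budget, leaving roughly 90% of the tokens unique. This reading of the 0.1% / 100x / 90% figures is an interpretation of the abstract, not the paper's exact accounting.

# Rough accounting for the repeated-data example (interpretation, not the paper's setup).
subset_share = 0.001   # repeated subset, as a fraction of the training-token budget per pass
repeats = 100
repeated_budget = subset_share * repeats
print(f"repeated tokens: ~{repeated_budget:.0%} of training; unique tokens: ~{1 - repeated_budget:.0%}")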
13. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Author
-
Bai, Yuntao, Jones, Andy, Ndousse, Kamal, Askell, Amanda, Chen, Anna, DasSarma, Nova, Drain, Dawn, Fort, Stanislav, Ganguli, Deep, Henighan, Tom, Joseph, Nicholas, Kadavath, Saurav, Kernion, Jackson, Conerly, Tom, El-Showk, Sheer, Elhage, Nelson, Hatfield-Dodds, Zac, Hernandez, Danny, Hume, Tristan, Johnston, Scott, Kravec, Shauna, Lovitt, Liane, Nanda, Neel, Olsson, Catherine, Amodei, Dario, Brown, Tom, Clark, Jack, McCandlish, Sam, Olah, Chris, Mann, Ben, and Kaplan, Jared
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning - Abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work., Comment: Data available at https://github.com/anthropics/hh-rlhf
- Published
- 2022
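The "roughly linear relation between the RL reward and the square root of the KL divergence" reported in the entry above can be written schematically as follows, with r_0 and k as fitted constants; the symbols are illustrative shorthand for the empirical fit the paper reports.

    r(\pi) \;\approx\; r_0 + k \,\sqrt{D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_{\mathrm{init}}\right)}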
14. Predictability and Surprise in Large Generative Models
- Author
-
Ganguli, Deep, Hernandez, Danny, Lovitt, Liane, DasSarma, Nova, Henighan, Tom, Jones, Andy, Joseph, Nicholas, Kernion, Jackson, Mann, Ben, Askell, Amanda, Bai, Yuntao, Chen, Anna, Conerly, Tom, Drain, Dawn, Elhage, Nelson, Showk, Sheer El, Fort, Stanislav, Hatfield-Dodds, Zac, Johnston, Scott, Kravec, Shauna, Nanda, Neel, Ndousse, Kamal, Olsson, Catherine, Amodei, Daniela, Amodei, Dario, Brown, Tom, Kaplan, Jared, McCandlish, Sam, Olah, Chris, and Clark, Jack
- Subjects
Computer Science - Computers and Society - Abstract
Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad training distribution (as embodied in their "scaling laws"), and unpredictable specific capabilities, inputs, and outputs. We believe that the high-level predictability and appearance of useful capabilities drives rapid development of such models, while the unpredictable qualities make it difficult to anticipate the consequences of model deployment. We go through examples of how this combination can lead to socially harmful behavior with examples from the literature and real world observations, and we also perform two novel experiments to illustrate our point about harms from unpredictability. Furthermore, we analyze how these conflicting properties combine to give model developers various motivations for deploying these models, and challenges that can hinder deployment. We conclude with a list of possible interventions the AI community may take to increase the chance of these models having a beneficial impact. We intend this paper to be useful to policymakers who want to understand and regulate AI systems, technologists who care about the potential policy impact of their work, and academics who want to analyze, critique, and potentially develop large generative models., Comment: Updated to reflect the version submitted (and accepted) to ACM FAccT '22. This update incorporates feedback from peer-review and fixes minor typos. See open access FAccT conference version at: https://dl.acm.org/doi/abs/10.1145/3531146.3533229
- Published
- 2022
- Full Text
- View/download PDF
15. A General Language Assistant as a Laboratory for Alignment
- Author
-
Askell, Amanda, Bai, Yuntao, Chen, Anna, Drain, Dawn, Ganguli, Deep, Henighan, Tom, Jones, Andy, Joseph, Nicholas, Mann, Ben, DasSarma, Nova, Elhage, Nelson, Hatfield-Dodds, Zac, Hernandez, Danny, Kernion, Jackson, Ndousse, Kamal, Olsson, Catherine, Amodei, Dario, Brown, Tom, Clark, Jack, McCandlish, Sam, Olah, Chris, and Kaplan, Jared
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning - Abstract
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a `preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences., Comment: 26+19 pages; v2 typos fixed, refs added, figure scale / colors fixed; v3 correct very non-standard TruthfulQA formatting and metric, alignment implications slightly improved
- Published
- 2021
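The ranked preference modeling that the entry above finds scales favorably is commonly implemented with a pairwise logistic (Bradley-Terry style) loss over a scalar reward head. The sketch below uses random tensors in place of a real reward model's scores; it illustrates that loss form under stated assumptions, not the paper's training code.

# Pairwise preference-modeling loss sketch; scores are random stand-ins for a reward model.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
r_chosen = torch.randn(8)    # reward-model scores for preferred responses (stub)
r_rejected = torch.randn(8)  # reward-model scores for rejected responses (stub)

# -log sigmoid(r_chosen - r_rejected): minimized when the chosen response outranks the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(f"pairwise preference loss: {loss.item():.3f}")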
16. Scaling Laws for Transfer
- Author
-
Hernandez, Danny, Kaplan, Jared, Henighan, Tom, and McCandlish, Sam
- Subjects
Computer Science - Machine Learning - Abstract
We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero. We calculate the effective data "transferred" from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size. We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute., Comment: 19 pages, 15 figures
- Published
- 2021
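The "effective data transferred" D_T described in the entry above is fit as a power law in the fine-tuning dataset size D_F and model size N, and pre-training is said to effectively multiply the fine-tuning dataset. Schematically, with the constant and exponents left symbolic (a paraphrase of the abstract, not the paper's fitted values):

    D_T \;\approx\; k\,(D_F)^{\alpha}\,(N)^{\beta}, \qquad D_{\mathrm{effective}} = D_F + D_T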
17. Scaling Laws for Autoregressive Generative Modeling
- Author
-
Henighan, Tom, Kaplan, Jared, Katz, Mor, Chen, Mark, Hesse, Christopher, Jackson, Jacob, Jun, Heewoo, Brown, Tom B., Dhariwal, Prafulla, Gray, Scott, Hallacy, Chris, Mann, Benjamin, Radford, Alec, Ramesh, Aditya, Ryder, Nick, Ziegler, Daniel M., Schulman, John, Amodei, Dario, and McCandlish, Sam
- Subjects
Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition - Abstract
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (ie $D_{\mathrm{KL}}$) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks., Comment: 20+17 pages, 33 figures; added appendix with additional language results
- Published
- 2020
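The "power-law plus constant" scaling the entry above refers to can be written schematically for model size N: the constant term is the irreducible loss, interpreted as the entropy of the true data distribution, and the reducible power-law term tracks the KL divergence between the true and model distributions. Symbols are left generic rather than tied to any one domain's fit.

    L(N) \;\approx\; L_{\infty} + \left(\frac{N_0}{N}\right)^{\alpha_N},
    \qquad L_{\infty} \approx S(\mathrm{True}), \qquad L(N) - L_{\infty} \approx D_{\mathrm{KL}}(\mathrm{True}\,\|\,\mathrm{Model})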
18. Language Models are Few-Shot Learners
- Author
-
Brown, Tom B., Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, Agarwal, Sandhini, Herbert-Voss, Ariel, Krueger, Gretchen, Henighan, Tom, Child, Rewon, Ramesh, Aditya, Ziegler, Daniel M., Wu, Jeffrey, Winter, Clemens, Hesse, Christopher, Chen, Mark, Sigler, Eric, Litwin, Mateusz, Gray, Scott, Chess, Benjamin, Clark, Jack, Berner, Christopher, McCandlish, Sam, Radford, Alec, Sutskever, Ilya, and Amodei, Dario
- Subjects
Computer Science - Computation and Language - Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general., Comment: 40+32 pages
- Published
- 2020
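Few-shot evaluation as described in the entry above amounts to placing task demonstrations in the context as plain text, with no gradient updates. A minimal prompt-construction sketch for the word-unscrambling task mentioned in the abstract is below; the template and examples are illustrative, not GPT-3's evaluation harness.

# Few-shot prompt construction sketch (illustrative template, not the GPT-3 harness).
demonstrations = [
    ("gaot", "goat"),
    ("ehous", "house"),
    ("rdige", "ridge"),
]
query = "dmoel"  # a model completing this prompt would ideally continue with "model"

prompt = "Unscramble the word.\n\n"
for scrambled, word in demonstrations:
    prompt += f"Scrambled: {scrambled}\nWord: {word}\n\n"
prompt += f"Scrambled: {query}\nWord:"
print(prompt)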
19. Scaling Laws for Neural Language Models
- Author
-
Kaplan, Jared, McCandlish, Sam, Henighan, Tom, Brown, Tom B., Chess, Benjamin, Child, Rewon, Gray, Scott, Radford, Alec, Wu, Jeffrey, and Amodei, Dario
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning - Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence., Comment: 19 pages, 15 figures
- Published
- 2020
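The power laws summarized in the entry above are usually written in the following schematic form for parameters N, dataset size D, and compute C, together with a joint N-D expression that captures the overfitting behavior the abstract mentions. Constants and exponents are the paper's fitted quantities and are left symbolic here.

    L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
    L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
    L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

    L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}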
20. Direct Measurement of Anharmonic Decay Channels of a Coherent Phonon
- Author
-
Teitelbaum, Samuel W., Henighan, Tom, Huang, Yijing, Liu, Hanzhe, Jiang, Mason P., Zhu, Diling, Chollet, Matthieu, Sato, Takahiro, Murray, Éamonn D., Fahy, Stephen, O'Mahony, Shane, Bailey, Trevor P., Uher, Ctirad, Trigo, Mariano, and Reis, David A.
- Subjects
Condensed Matter - Materials Science - Abstract
We observe anharmonic decay of the photoexcited coherent A1g phonon in bismuth to points in the Brillouin zone where conservation of momentum and energy are satisfied for three-phonon scattering. The decay of a coherent phonon can be understood as a parametric resonance process whereby the atomic displacement periodically modulates the frequency of a broad continuum of modes. This results in energy transfer through resonant squeezing of the target modes. Using ultrafast diffuse x-ray scattering, we observe build up of coherent oscillations in the target modes driven by this parametric resonance over a wide range of the Brillouin zone. We compare the extracted anharmonic coupling constant to first principles calculations for a representative decay channel., Comment: 5 pages, 6 figures
- Published
- 2017
- Full Text
- View/download PDF
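For the three-phonon decay channel described in the entry above, energy and crystal momentum must both be conserved; since the pumped A1g mode sits at the zone center (q ≈ 0), the daughter phonons appear in pairs at opposite wavevectors. Schematically, with branch indices j, j' left generic:

    \omega_{A_{1g}}(\mathbf{q}\!=\!0) = \omega_{j}(\mathbf{q}) + \omega_{j'}(-\mathbf{q}), \qquad \mathbf{0} = \mathbf{q} + (-\mathbf{q})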
21. Phonon Spectroscopy with Sub-meV Resolution by Femtosecond X-ray Diffuse Scattering
- Author
-
Zhu, Diling, Robert, Aymeric, Henighan, Tom, Lemke, Henrik T., Chollet, Matthieu, Glownia, J. Michael, Reis, David A., and Trigo, Mariano
- Subjects
Condensed Matter - Materials Science - Abstract
We present a reconstruction of the transverse acoustic phonon dispersion of germanium from femtosecond time-resolved x-ray diffuse scattering measurements at the Linac Coherent Light Source. We demonstrate an energy resolution of 0.3 meV with momentum resolution of 0.01 nm^-1 using 10 keV x-rays with a bandwidth of ~ 1 eV. This high resolution was achieved simultaneously for a large section of reciprocal space including regions closely following three of the principal symmetry directions. The phonon dispersion was reconstructed with less than three hours of measurement time, during which neither the x-ray energy, the sample orientation, nor the detector position were scanned. These results demonstrate how time-domain measurements can complement conventional frequency domain inelastic scattering techniques., Comment: 3 figures, 4 pages
- Published
- 2015
- Full Text
- View/download PDF
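The entry above recovers a phonon dispersion from time-domain data: the diffuse intensity at each momentum transfer oscillates at a frequency set by the corresponding phonon mode, so a Fourier transform of each time trace yields that frequency, with resolution limited by the scan window. The sketch below illustrates the idea on a synthetic trace; the frequency, damping, time step, and scan length are illustrative, not the experimental parameters.

# Recover an oscillation frequency from a synthetic time trace by FFT (illustrative values).
import numpy as np

dt_ps = 0.05                       # time step between x-ray probe delays (ps)
t = np.arange(0, 30, dt_ps)        # ~30 ps scan window
f_THz = 1.5                        # illustrative phonon frequency
trace = 1.0 + 0.1 * np.cos(2 * np.pi * f_THz * t) * np.exp(-t / 20)

spectrum = np.abs(np.fft.rfft(trace - trace.mean()))
freqs = np.fft.rfftfreq(t.size, dt_ps)          # THz
print(f"peak at ~{freqs[spectrum.argmax()]:.2f} THz")
# Frequency resolution ~ 1 / scan window; in energy units, dE ~ h / T_window.
print(f"resolution ~ {1 / t[-1]:.3f} THz  (~{4.136 / t[-1]:.2f} meV, using h = 4.136 meV*ps)")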
22. Influence of local symmetry on lattice dynamics coupled to topological surface states
- Author
-
Sobota, Jonathan A., Teitelbaum, Samuel W., Huang, Yijing, Querales-Flores, José D., Power, Robert, Allen, Meabh, Rotundu, Costel R., Bailey, Trevor P., Uher, Ctirad, Henighan, Tom, Jiang, Mason, Zhu, Diling, Chollet, Matthieu, Sato, Takahiro, Trigo, Mariano, Murray, Éamonn D., Savić, Ivana, Kirchmann, Patrick S., Fahy, Stephen, Reis, David A., and Shen, Zhi-Xun
- Published
- 2023
- Full Text
- View/download PDF
23. Discovering Language Model Behaviors with Model-Written Evaluations
- Author
-
Perez, Ethan, Ringer, Sam, Lukosiute, Kamile, Nguyen, Karina, Chen, Edwin, Heiner, Scott, Pettit, Craig, Olsson, Catherine, Kundu, Sandipan, Kadavath, Saurav, Jones, Andy, Chen, Anna, Mann, Benjamin, Israel, Brian, Seethor, Bryan, McKinnon, Cameron, Olah, Christopher, Yan, Da, Amodei, Daniela, Amodei, Dario, Drain, Dawn, Li, Dustin, Tran-Johnson, Eli, Khundadze, Guro, Kernion, Jackson, Landis, James, Kerr, Jamie, Mueller, Jared, Hyun, Jeeyoon, Landau, Joshua, Ndousse, Kamal, Goldberg, Landon, Lovitt, Liane, Lucas, Martin, Sellitto, Michael, Zhang, Miranda, Kingsland, Neerav, Elhage, Nelson, Joseph, Nicholas, Mercado, Noemi, DasSarma, Nova, Rausch, Oliver, Larson, Robin, McCandlish, Sam, Johnston, Scott, Kravec, Shauna, El Showk, Sheer, Lanham, Tamera, Telleen-Lawton, Timothy, Brown, Tom, Henighan, Tom, Hume, Tristan, Bai, Yuntao, Hatfield-Dodds, Zac, Clark, Jack, Bowman, Samuel R., Askell, Amanda, Grosse, Roger, Hernandez, Danny, Ganguli, Deep, Hubinger, Evan, Schiefer, Nicholas, and Kaplan, Jared
- Published
- 2023
- Full Text
- View/download PDF
24. Predictability and Surprise in Large Generative Models
- Author
-
Ganguli, Deep, Hernandez, Danny, Lovitt, Liane, Askell, Amanda, Bai, Yuntao, Chen, Anna, Conerly, Tom, DasSarma, Nova, Drain, Dawn, Elhage, Nelson, El Showk, Sheer, Fort, Stanislav, Hatfield-Dodds, Zac, Henighan, Tom, Johnston, Scott, Jones, Andy, Joseph, Nicholas, Kernion, Jackson, Kravec, Shauna, Mann, Ben, Nanda, Neel, Ndousse, Kamal, Olsson, Catherine, Amodei, Daniela, Brown, Tom, Kaplan, Jared, McCandlish, Sam, Olah, Christopher, Amodei, Dario, and Clark, Jack
- Published
- 2022
- Full Text
- View/download PDF
25. The Cyclopean city: a fantasy image of decadence
- Author
-
Henighan, Tom
- Subjects
Creation in art -- Analysis, Creation (Literary, artistic, etc.) -- Analysis, Degeneration -- Portrayals -- Analysis, Literature/writing, Analysis, Portrayals - Abstract
In The Decline of the West, Oswald Spengler memorably evokes the image of a 'cosmopolis' he perceives as standing at the end of every great culture. Spengler's visionary history reveals [...]
- Published
- 1994
26. Nanny
- Author
-
Henighan, Tom
- Subjects
Literature/writing - Published
- 2000
27. Phonon spectroscopy with sub-meV resolution by femtosecond x-ray diffuse scattering
- Author
-
Zhu, Diling, Robert, Aymeric, Henighan, Tom, Lemke, Henrik T., Chollet, Matthieu, Glownia, J. Mike, Reis, David A., and Trigo, Mariano
- Published
- 2015
- Full Text
- View/download PDF
28. Lensless Imaging of Nano- and Meso-Scale Dynamics with X-rays
- Author
-
Clark, Jesse N., Trigo, Mariano, Henighan, Tom, Harder, Ross, Abbey, Brian, Katayama, Tetsuo, Kozina, Mike, Dufresne, Eric, Wen, Haidan, Walko, Donald, Li, Yuelin, Huang, Xiaojing, Robinson, Ian, and Reis, David
- Published
- 2015
- Full Text
- View/download PDF
29. Home at Grasmere. (Poem)
- Author
-
Henighan, Tom
- Subjects
Literature/writing - Published
- 2003
30. Opera no longer a 'museum culture'
- Author
-
Henighan, Tom
- Subjects
Canadian Opera Co., Banking, finance and accounting industries, Business, Business, international - Published
- 2001
31. Letters to the Editor
- Author
-
Henighan, Tom
- Subjects
Education ,Family and marriage - Abstract
To the Editor of Resource Links Dear Ms. Pennell I must protest Margaret Mackey's review of my novel Doom Lake Holiday, which appeared in the April issue of Resource Links, [...]
- Published
- 2009
32. The Presumption of Culture and Ideas of North: A Reply to Frank Davey
- Author
-
Henighan, Tom
- Published
- 1998
- Full Text
- View/download PDF
33. Enemies of culture
- Author
-
Henighan, Tom
- Published
- 1989
34. Tietjens Transformed: A Reading of Parade's End
- Author
-
Henighan, Tom
- Published
- 1972
35. Jungle Epitaph.
- Author
-
Henighan, Tom
- Subjects
- JUNGLE Epitaph (Poem), HENIGHAN, Tom
- Abstract
The article presents the poem "Jungle Epitaph," by Tom Henighan. First Line: Great Tarzan is dead. Last Line: the impostures of words . . .
- Published
- 2004
36. Home at Grasmere.
- Author
-
Henighan, Tom
- Subjects
- HOME at Grasmere (Poem), HENIGHAN, Tom
- Abstract
Presents the poem "Home at Grasmere," by Tom Henighan.
- Published
- 2003
37. The Forsaken Garden: Four Conversations on the Deep Meaning of Environmental Illness.
- Author
-
Henighan, Tom
- Subjects
ENVIRONMENTALISM, NONFICTION - Abstract
The article reviews the book "The Forsaken Garden: Four Conversations on the Deep Meaning of Environmental Illness," by Nancy Ryley.
- Published
- 2000