1. Discovering Language Model Behaviors with Model-Written Evaluations
- Author
- Perez, Ethan; Ringer, Sam; Lukosiute, Kamile; Nguyen, Karina; Chen, Edwin; Heiner, Scott; Pettit, Craig; Olsson, Catherine; Kundu, Sandipan; Kadavath, Saurav; Jones, Andy; Chen, Anna; Mann, Benjamin; Israel, Brian; Seethor, Bryan; McKinnon, Cameron; Olah, Christopher; Yan, Da; Amodei, Daniela; Amodei, Dario; Drain, Dawn; Li, Dustin; Tran-Johnson, Eli; Khundadze, Guro; Kernion, Jackson; Landis, James; Kerr, Jamie; Mueller, Jared; Hyun, Jeeyoon; Landau, Joshua; Ndousse, Kamal; Goldberg, Landon; Lovitt, Liane; Lucas, Martin; Sellitto, Michael; Zhang, Miranda; Kingsland, Neerav; Elhage, Nelson; Joseph, Nicholas; Mercado, Noemi; DasSarma, Nova; Rausch, Oliver; Larson, Robin; McCandlish, Sam; Johnston, Scott; Kravec, Shauna; El Showk, Sheer; Lanham, Tamera; Telleen-Lawton, Timothy; Brown, Tom; Henighan, Tom; Hume, Tristan; Bai, Yuntao; Hatfield-Dodds, Zac; Clark, Jack; Bowman, Samuel R.; Askell, Amanda; Grosse, Roger; Hernandez, Danny; Ganguli, Deep; Hubinger, Evan; Schiefer, Nicholas; and Kaplan, Jared
- Published
- 2023