Model evaluation for extreme risks

Authors :: Shevlane, Toby
Farquhar, Sebastian
Garfinkel, Ben
Phuong, Mary
Whittlestone, Jess
Leung, Jade
Kokotajlo, Daniel
Marchal, Nahema
Anderljung, Markus
Kolt, Noam
Ho, Lewis
Siddarth, Divya
Avin, Shahar
Hawkins, Will
Kim, Been
Gabriel, Iason
Bolina, Vijay
Clark, Jack
Bengio, Yoshua
Christiano, Paul
Dafoe, Allan
Publication Year :: 2023
Abstract: Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
Comment: Fixed typos; added citation
