Author: "Hänni, Kaarel" / Topic: computer science - artificial intelligence - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Hänni, Kaarel"' showing total 2 results

Start Over Author "Hänni, Kaarel" Topic computer science - artificial intelligence

2 results on '"Hänni, Kaarel"'

1. Mathematical Models of Computation in Superposition

Author: Hänni, Kaarel, Mendel, Jake, Vaintrob, Dmitry, and Chan, Lawrence
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Superposition -- when a neural network represents more ``features'' than it has dimensions -- seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies \emph{representational} superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of \emph{computation} in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first construct a task of efficiently emulating a circuit that takes the AND of the $\binom{m}{2}$ pairs of each of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $\tilde{O}(m^{\frac{2}{3}})$ neurons, even when the input features are \emph{themselves in superposition}. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct ``error correction'' layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $\tilde{O}(d^{1.5})$ and \emph{any} polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition., Comment: 28 pages, 5 figures. Published at the ICML 2024 Mechanistic Interpretability (MI) Workshop
Published: 2024

2. Cluster-norm for Unsupervised Probing of Knowledge

Author: Laurito, Walter, Maiya, Sharan, Dhimoïla, Grégoire, Owen, Yeung, and Hänni, Kaarel
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The deployment of language models brings challenges in generating reliable information, especially when these models are fine-tuned using human preferences. To extract encoded knowledge without (potentially) biased human labels, unsupervised probing techniques like Contrast-Consistent Search (CCS) have been developed (Burns et al., 2022). However, salient but unrelated features in a given dataset can mislead these probes (Farquhar et al., 2023). Addressing this, we propose a cluster normalization method to minimize the impact of such features by clustering and normalizing activations of contrast pairs before applying unsupervised probing techniques. While this approach does not address the issue of differentiating between knowledge in general and simulated knowledge - a major issue in the literature of latent knowledge elicitation (Christiano et al., 2021) - it significantly improves the ability of unsupervised probes to identify the intended knowledge amidst distractions., Comment: 30 pages, 35 figures
Published: 2024

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

2 results on '"Hänni, Kaarel"'

1. Mathematical Models of Computation in Superposition

2. Cluster-norm for Unsupervised Probing of Knowledge

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Publication Year Range

Publication Type

Database

2 results on '"Hänni, Kaarel"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources