
Scaling Monosemanticity


The Scaling Monosemanticity paper, part of Anthropic’s Transformer Circuits Thread, explores the use of sparse dictionary learning to extract interpretable features from large language models, specifically Claude 3 Sonnet. The goal is to improve the interpretability of AI by identifying monosemantic features, directions in the model’s activation space that each correspond to a single concept, and to show that this extraction still works as models grow larger and more complex.
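To make the dictionary-learning idea concrete, here is a minimal sketch of a sparse autoencoder of the kind the paper trains on a model’s internal activations. The class and parameter names (SparseAutoencoder, d_model, n_features) and the simple L1 sparsity penalty are illustrative assumptions, not Anthropic’s actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Encoder maps residual-stream activations to a much wider, sparse code.
        self.encoder = nn.Linear(d_model, n_features)
        # Decoder reconstructs the activation as a sparse sum of feature directions.
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        feature_acts = F.relu(self.encoder(activations))  # non-negative, mostly-zero feature activations
        reconstruction = self.decoder(feature_acts)
        return feature_acts, reconstruction

def sae_loss(activations, reconstruction, feature_acts, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the model's activations;
    # the L1 term pushes most feature activations to zero, encouraging each learned
    # feature to fire for a single, interpretable concept.
    mse = F.mse_loss(reconstruction, activations)
    sparsity = feature_acts.abs().mean()
    return mse + l1_coeff * sparsity

# Toy usage: a batch of activations from one layer, d_model=512, 8x expansion factor.
acts = torch.randn(64, 512)
sae = SparseAutoencoder(d_model=512, n_features=4096)
feats, recon = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```

In the paper, the dictionaries are far wider than this toy expansion factor (up to tens of millions of features), but the training objective follows the same reconstruction-plus-sparsity pattern.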

The researchers demonstrated that these features correspond to specific behaviors and topics, such as political figures, errors in code, or even safety-relevant concepts like biological weapons. By "steering" the activation of these features (clamping or scaling their intensity), they could change the model’s behavior in predictable ways. For example, clamping a feature for the Golden Gate Bridge to a high value made the model bring the bridge into nearly every response. This provides new tools for AI safety, allowing developers to adjust specific features to discourage harmful or inappropriate behaviors.
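A rough sketch of what such feature steering can look like is below, assuming access to a trained autoencoder’s decoder matrix and a hook point on the model’s residual stream. The function names and the specific scaling scheme (make_steering_hook, feature_idx, strength) are hypothetical illustrations, not the paper’s code.

```python
import torch

def make_steering_hook(decoder_weight: torch.Tensor, feature_idx: int, strength: float):
    # decoder_weight: (n_features, d_model) dictionary of learned feature directions.
    direction = decoder_weight[feature_idx]
    direction = direction / direction.norm()  # use the unit direction for one feature

    def hook(residual: torch.Tensor) -> torch.Tensor:
        # Adding the feature's decoder direction (scaled by `strength`) at every
        # token position pushes the model toward content associated with that
        # feature, e.g. the Golden Gate Bridge.
        return residual + strength * direction

    return hook

# Toy usage with random tensors standing in for real model state.
decoder = torch.randn(4096, 512)
hook = make_steering_hook(decoder, feature_idx=1234, strength=8.0)
residual_stream = torch.randn(1, 16, 512)  # (batch, seq, d_model)
steered = hook(residual_stream)
```

The key design choice is that steering reuses the decoder directions learned during dictionary training, so the intervention targets a single interpretable concept rather than an opaque mixture of neurons.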

This work also showed that the size of the sparse autoencoder (the number of features in its dictionary) affects the richness of the features extracted, but even the largest dictionary trained (34 million features) still covers only part of the concepts represented inside Claude 3 Sonnet. The researchers suggest that capturing all of them would likely require even larger dictionaries and substantially more compute.

The implications of this research are significant for AI interpretability and safety, as it provides new techniques for understanding and controlling the internal representations of large language models, potentially reducing the risks associated with opaque AI systems.

https://www.anthropic.com/research/mapping-mind-language-model
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
https://www.lesswrong.com/posts/wg6E3oJJrNnmJezNz/a-scaling-monosemanticity-explainer
https://tryalign.ai/resources/blog/scaling-monosemanticity-extracting-interpretable-features-from-claude-3-sonnet
https://www.pelayoarbues.com/literature-notes/Articles/Scaling-Monosemanticity-Extracting-Interpretable-Features-From-Claude-3-Sonnet
