Large language models (LLMs) have made remarkable progress in recent years. But understanding how they work remains a challenge, and scientists at artificial intelligence labs are trying to peer into ...
New interpretability leap: Anthropic's Natural Language Autoencoders convert an AI model's internal activations into human-readable summaries, offering direct insight into chatbot reasoning. Safety and trust ...
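The snippet above gives only the one-line idea: a learned decoder that maps hidden activations to a short piece of text. As a purely hypothetical illustration of that idea, here is a minimal PyTorch sketch; the class name, dimensions, and two-linear-layer architecture are all my assumptions, not Anthropic's method:

```python
import torch
import torch.nn as nn

class ActivationSummarizer(nn.Module):
    """Hypothetical sketch: decode captured LLM activations into token ids
    for a short text summary. Not Anthropic's architecture."""

    def __init__(self, d_model: int, vocab_size: int, summary_len: int = 16):
        super().__init__()
        self.summary_len = summary_len
        self.vocab_size = vocab_size
        # Compress the activation vector into a low-dimensional "explanation" code.
        self.encoder = nn.Linear(d_model, 256)
        # Expand that code into logits for each position of the summary.
        self.decoder = nn.Linear(256, summary_len * vocab_size)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, d_model) hidden states captured from the LLM.
        code = torch.relu(self.encoder(activations))
        logits = self.decoder(code)
        return logits.view(-1, self.summary_len, self.vocab_size)

# Usage: "summarize" a batch of stand-in activations from a 4096-dim residual stream.
summarizer = ActivationSummarizer(d_model=4096, vocab_size=32000)
token_ids = summarizer(torch.randn(2, 4096)).argmax(dim=-1)  # (2, 16) ids to detokenize
```

In practice the decoder would be a full language model trained to emit fluent summaries; the point of the sketch is only the shape of the mapping, activations in and text tokens out.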
Anthropic says it may have found a way to understand what its AI model Claude is "thinking" internally. The company's new ...
Bhalla, Usha, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio Calmon. "Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability." ...
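The paper's architecture is not reproduced in this citation. As a rough illustration of what a "temporal" sparse autoencoder could mean — letting each position's sparse code see the previous position's code instead of reconstructing every token independently — here is a hypothetical PyTorch sketch; the causal mixing term and all names are assumptions, not the authors' model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalSAE(nn.Module):
    """Hypothetical reading of "temporal": each position's sparse code is
    mixed with the previous position's code. Not the paper's exact model."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.temporal = nn.Linear(n_features, n_features, bias=False)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) activations for a token sequence.
        codes = F.relu(self.enc(x))
        # Causal shift: position t also sees the sparse code at position t-1.
        prev = F.pad(codes, (0, 0, 1, 0))[:, :-1]
        codes = F.relu(codes + self.temporal(prev))
        recon = self.dec(codes)
        # MSE reconstruction plus an L1 penalty for sparsity, as in standard SAEs.
        loss = F.mse_loss(recon, x) + 1e-3 * codes.abs().mean()
        return recon, codes, loss

sae = TemporalSAE(d_model=512, n_features=2048)
recon, codes, loss = sae(torch.randn(4, 32, 512))
```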
Li, Jiaxun, Aaron, Suraj Srinivas, Usha Bhalla, and Himabindu Lakkaraju. "Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders." Proceedings of the Conference of the ...
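Likewise, the citation does not show the paper's evaluation protocol. The sketch below illustrates one generic way to probe such robustness: perturb activations with an FGSM-style step and measure how far the SAE's concept codes move. The TinySAE class, the attack, and the drift metric are all illustrative assumptions, not the authors' procedure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySAE(nn.Module):
    """Plain sparse-autoencoder encoder, included only to keep this sketch
    self-contained; any SAE with an encode() method would do."""

    def __init__(self, d_model: int = 512, n_features: int = 2048):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.enc(x))

def concept_drift(sae: TinySAE, acts: torch.Tensor, eps: float = 0.05) -> float:
    """FGSM-style probe: nudge activations in the direction that most changes
    the sparse codes, then measure how far the codes move. Illustrative only."""
    acts = acts.clone().requires_grad_(True)
    sae.encode(acts).pow(2).sum().backward()  # gradient of code "energy" w.r.t. input
    adv = acts.detach() + eps * acts.grad.sign()
    with torch.no_grad():
        clean, perturbed = sae.encode(acts.detach()), sae.encode(adv)
        # Fraction of concept mass that moves under the perturbation.
        return ((perturbed - clean).abs().sum() / clean.abs().sum().clamp_min(1e-8)).item()

print(concept_drift(TinySAE(), torch.randn(8, 512)))  # larger = less robust codes
```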