Nandi — LessWrong

Intricacies of Feature Geometry in Large Language Models

Note: This is a more fleshed-out version of this post and includes theoretical arguments justifying the empirical findings. If you've read that one, feel free to skip to the proofs. We challenge the thesis of the ICML 2024 Mechanistic Interpretability Workshop 1st prize winning paper: The Geometry of Categorical and...

Dec 7, 2024 · 72

The Geometry of Feelings and Nonsense in Large Language Models

by 7vik and Nandi

This post has some ablation results around the thesis of the ICML 2024 Mech. Interp. workshop 1st prize winning paper: The Geometry of Categorical and Hierarchical Concepts in Large Language Models. The main takeaway is that the orthogonality they observe in categorical and hierarchical concepts occurs practically everywhere, even at...

Sep 27, 2024 · 62

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

by Yohan Mathew, joanv, robert mccarthy, ollie, Nandi, and Dylan Cope

This research was completed for London AI Safety Research (LASR) Labs 2024 by Yohan Mathew, Ollie Matthews, Robert McCarthy and Joan Velja. The team was supervised by Nandi Schoots and Dylan Cope (King’s College London, Imperial College London). Find out more about the programme and express interest in upcoming iterations...

Sep 25, 2024 · 37

Robustness of Contrast-Consistent Search to Adversarial Prompting

Produced as part of the AI Safety Hub Labs programme run by Charlie Griffin and Julia Karbing. This project was mentored by Nandi Schoots. Image generated by DALL-E 3. Introduction: We look at how adversarial prompting affects the outputs of large language models (LLMs) and compare it with how the...

Nov 1, 2023 · 18

Machine Unlearning Evaluations as Interpretability Benchmarks

by NickyP and Nandi

Interpreting Models by Ablation. Image generated by DALL-E 3. Introduction: Interpretability in machine learning, especially in language models, is an area with a large number of contributions. While this can be quite useful for improving our understanding of models, one issue is the lack of robust benchmarks...

Oct 23, 2023 · 33

Splitting Debate up into Two Subsystems

In this post I will first recap how debate can help with value learning and note that a standard debater optimizes for convincingness. Then I will illustrate how two subsystems could help with value learning in a similar way, without optimizing for convincingness. (Of course this new system could have its...

Jul 3, 2020 · 13

Acknowledging Human Preference Types to Support Value Learning

We analyze the usefulness of the framework of preference types [Berridge et al. 2009] to value learning by an artificial intelligence. In the context of AI, the purpose of value learning is to give an AI goals aligned with humanity. We will lay the groundwork for establishing how human preferences of...

Nov 13, 2018 · 34