peligrietzer — LessWrong

Shallow review of technical AI safety, 2025

by technicalities, Tomáš Gavenčiak, Stephen McAleese, peligrietzer, Stag, jordinne, ozziegooen, Violet Hour, and lenz

[Figure: Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al. (2025)] This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review website. It’s shallow in the sense that...

Dec 17, 2025 · 195

The Problem With the Word ‘Alignment’

This post was written by Peli Grietzer, inspired by internal writings by TJ (tushita jha), for AOI[1]. The original post, published on Feb 5, 2024, can be found here: https://ai.objectives.institute/blog/the-problem-with-alignment. The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human...

May 21, 2024 · 64

Paper: Understanding and Controlling a Maze-Solving Policy Network

by TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M, and lisathiergart

Mrinank, Austin, and Alex wrote a paper on the results from "Understanding and controlling a maze-solving policy network", "Maze-solving agents: Add a top-right vector, make the agent go to the top-right", and "Behavioural statistics for a maze-solving agent". Abstract: To understand the goals and goal representations of AI systems,...

Oct 13, 2023 · 70

Some Thoughts on Virtue Ethics for AIs

This post argues for the desirability and plausibility of AI agents whose values have a structure I call ‘praxis-based.’ The idea draws on various aspects of virtue ethics, and basically amounts to an RL-flavored take on that philosophical tradition. Praxis-based values as I define them are, informally, reflective decision-influences matching...

May 2, 2023 · 88

Behavioural statistics for a maze-solving agent

Summary: "Understanding and controlling a maze-solving policy network" analyzed a maze-solving agent's behavior. We isolated four maze properties which seemed to predict whether the mouse goes towards the cheese or towards the top-right corner. In this post, we conduct a more thorough statistical analysis, addressing issues of multicollinearity. We show...

Apr 20, 2023 · 46

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

by TurnTrout, peligrietzer, and lisathiergart

Overview: We modify the goal-directed behavior of a trained network, without any gradients or finetuning. We simply add or subtract "motivational vectors" which we compute in a straightforward fashion. In the original post, we defined a "cheese vector" to be "the difference in activations when the cheese is present in...

Mar 31, 2023 · 101

Understanding and controlling a maze-solving policy network

by TurnTrout, peligrietzer, Ulisse Mini, Monte M, and David Udell

Previously: "Predictions for shard theory mechanistic interpretability results". Locally retargeting the search by modifying a single activation. We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied...

Mar 11, 2023 · 336