LESSWRONG
LW

Puria — LessWrong

Shaping the exploration of the motivation-space matters for AI safety

Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment

Could you help me understand this one a bit better? I would have thought that the degenerate mapping of motivations to actions means that many motivations can be mapped onto the same high-reward actions, and that deceptive alignment is a case where the actions consistently achieve high reward while the underlying motivations are misaligned, not vice versa. Sorry if I'm misunderstanding this sentence!

A Conceptual Framework for Exploration Hacking

Puria2mo30

This is great - thanks for putting such a clear taxonomy and explanation together. I've not come across 'value shaping' before, and it feels like a missing link between gradient hacking and generalisation hacking. The definition of value shaping:

By anticipating the reward function, the model intentionally generates trajectories that earn high reward while including preferred values, backdoors, or steganographic triggers, forcing gradient descent to reinforce them.
[...] effectively "implanting" a persistent backdoor into the resulting policy, even after th

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

197

Cam, Puria, Kyle O’Brien, David Africa, Samuel Ratnam, andyk

Ω 58, 4mo

TL;DR

LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.

Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch

Note: We are currently gathering feedback here before submitting to ICML. Any suggestions here or on our Google Doc (which contains a more detailed overview of our experiments) are welcome! We will be releasing a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgments section. Thank you!

Abstract

We pretrained a suite of 6.9B-parameter LLMs, varying only the content related to AI systems, and evaluated them for misalignment. When filtering the vast majority...


Architectures for Increased Externalisation of Reasoning

Karthik Viswanathan, Liza Pavlova, Mariia Koroliuk, Puria, Cam, Edward James Young

4mo

TL;DR We propose a new post-training method for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We expect this technique to improve monitorability by decreasing the amount of computation available within hidden layers for easy-to-predict tokens. We’re looking for collaborators to help continue this project. If you’re interested, reach out to us!

Karthik, Liza, and Mariia are equal-contribution first authors, with order determined by coin flips. Puria, Cameron, and Edward are equal-contribution mentors. This work was done during Mentorship for Alignment Research Students (MARS) 3.0, in a Geodesic Research stream.

Architectures for Externalisation of Cognition

Many AI safety researchers are converging around the idea that the chain-of-thought (CoT) of models may be a valuable tool for monitoring the actions they may take...


Generalisation Hacking: a first look at adversarial generalisation failures in deliberative alignment

Cam, Puria

Ω 14, 5mo

Background

Deliberative alignment is a powerful post-training alignment technique that involves generating and training on re-contextualised supervised fine-tuning (SFT) datasets generated with a set of principles in context. The process takes three steps:

1. Place the set of principles^[1] (henceforth: the constitution) in a base model’s^[2] context. The base model then generates reasoning and responses to a set of prompts, where it is instructed to reference and act on the constituent principles whenever it makes a decision.
2. Filter out final outputs and reasoning traces that do not adhere to the constitution, so that the training dataset consists of outputs with the right actions for the right reasons.
3. SFT the base model on the (prompt, reasoning, response) triples, with the constitution used in generation removed from the prompt.^[3]
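The three steps above can be sketched as a small data-building loop. This is only an illustrative sketch, not the authors' pipeline: `build_sft_dataset`, `SFTExample`, the `generate` and `adheres` callables, and the toy stand-ins below are all hypothetical names. In practice `generate` would be a call to the base model and `adheres` a grader for constitution adherence.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SFTExample:
    prompt: str      # stored WITHOUT the constitution (step 3)
    reasoning: str
    response: str


def build_sft_dataset(
    prompts: List[str],
    constitution: str,
    generate: Callable[[str], Tuple[str, str]],  # hypothetical model call
    adheres: Callable[[str, str], bool],         # hypothetical adherence filter
) -> List[SFTExample]:
    dataset = []
    for prompt in prompts:
        # Step 1: sample reasoning and a response with the constitution in context.
        reasoning, response = generate(f"{constitution}\n\n{prompt}")
        # Step 2: drop samples whose reasoning or output does not adhere.
        if not adheres(reasoning, response):
            continue
        # Step 3: keep the bare prompt, so the constitution is absent at SFT time.
        dataset.append(SFTExample(prompt, reasoning, response))
    return dataset


# Toy stand-ins (assumptions for illustration only).
CONSTITUTION = "Principle 1: be honest."


def toy_generate(context: str) -> Tuple[str, str]:
    if "lie" in context:
        return ("I will ignore the rules.", "Sure, here is a lie.")
    return ("Principle 1 applies, so I answer honestly.", "Honest answer.")


def toy_adheres(reasoning: str, response: str) -> bool:
    return "Principle 1" in reasoning


data = build_sft_dataset(
    ["Tell me a lie.", "What is 2+2?"], CONSTITUTION, toy_generate, toy_adheres
)
```

The resulting `data` would then be passed to an ordinary SFT routine. The point of the sketch is only that the constitution appears in the generation context but never in the stored prompt.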

The aim of deliberative alignment is to train a model to learn...


I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs

Shi, Puria

7mo

This is a linkpost for https://www.researchgate.net/publication/395030062_I_Am_Large_I_Contain_Multitudes_Persona_Transmission_via_Contextual_Inference_in_LLMs

We demonstrate that LLMs can infer information about past personas from a set of nonsensical but innocuous questions and binary answers (“Yes.” vs “No.”, inspired by past work on deception detection) in context, and act upon it in safety-related questions. This is despite the questions bearing no semantic relation to the target misalignment behaviours, and each answer providing only one bit of information. The majority of the semantic content (the nonsensical questions) is identical in the contrasting versions of the encoding context; the only difference is the binary answer following each question. This isolates the effect of self-consistency due to contextual inference from that of subliminal learning caused by entangled tokens.
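As a rough illustration of the contrasting encoding contexts described above, the sketch below builds two contexts that share the same questions and differ only in the binary answers. The function name, the `Q:`/`A:` layout, the example questions, and the choice to flip every answer in the contrasting version are all assumptions made for illustration; the paper's actual prompts and answer assignments may differ.

```python
from typing import List, Tuple


def build_contrasting_contexts(
    questions: List[str], answers: List[bool]
) -> Tuple[str, str]:
    """Return two contexts with identical questions and opposite answers."""
    if len(questions) != len(answers):
        raise ValueError("need exactly one answer per question")

    def render(ans: List[bool]) -> str:
        return "\n".join(
            f"Q: {q}\nA: {'Yes.' if a else 'No.'}"
            for q, a in zip(questions, ans)
        )

    # Assumption: the contrasting context flips every answer.
    return render(answers), render([not a for a in answers])


# Hypothetical nonsensical questions, invented for this example.
questions = [
    "Is a blue idea heavier than Tuesday?",
    "Can a staircase remember its shadow?",
]
ctx_a, ctx_b = build_contrasting_contexts(questions, [True, False])
```

Each context would then be placed ahead of the safety-related question, so that any behavioural difference can be attributed to the one-bit answers alone.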

Puria

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Generalisation Hacking: a first look at adversarial generalisation failures in deliberative alignment

Architectures for Increased Externalisation of Reasoning

I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs
