tobypullan — LessWrong

First of all, thanks for linking Ryan Greenblatt's posts! They're really interesting. I hadn't read them before, and they're very relevant to my project.

I like your suggestion of looking at time horizons of success after removing chunks of the CoT, and of applying this to more open-ended tasks. It would be interesting to compare time horizon success when different sentences are removed from the CoT.

I think the model's understanding of the correct answer/solution depends on both the number of tokens in the CoT and its content. In his post "Measuring no CoT math time horizon (single forward pass)", Ryan Greenblatt shows that models are able to tackle longer time horizon problems when given... (read more)

Reasoning Long Jump: Why we shouldn’t rely on CoT monitoring for interpretability

tobypullan

20d

In Part One, we explore how models sometimes generate a chain of thought that rationalises a pre-conceived answer, rather than using the chain of thought to produce the answer. In Part Two, we discuss the implications of this finding for frontier models and their reasoning efficiency.

TLDR

Reasoning models are reasoning more efficiently whilst getting better at tasks such as coding. As models continue to improve, they will take bigger reasoning jumps, which will make it very challenging to use the CoT as a method of interpretability. Further into the future, models may stop using CoT entirely and move towards reasoning directly in the latent space, never reasoning in language at all.

Part 1: Experimentation

The... (read 1521 more words →)

Pro or Average Joe? Do models infer our technical ability and can we control this judgement?

tobypullan

1mo

Executive Summary

The response to a prompt after the user was perceived as a novice earlier in the chat, compared with the same conversation where the response is steered in the expert direction.

The problem being solved

This project measures how accurately an LLM can assess a user's Python programming ability (expert or novice). It also identifies how this inference is stored in different layers, how the inference shapes the model's behaviour, and the extent to which the LLM's behaviour can be manipulated to treat the user more like an expert or a novice.
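To make the probing and steering idea above concrete, here is a minimal sketch on synthetic vectors standing in for one layer's activations. It uses a difference-of-means linear probe, which is an assumption chosen for brevity and not necessarily the method used in the post. All names (`fit_mean_difference_probe`, `probe_predict`, `steer`, `synthetic_activations`) are hypothetical, and no real model is involved.

```python
import random

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_mean_difference_probe(expert_acts, novice_acts):
    """Return (direction, threshold) for a difference-of-means probe.

    direction points from the novice mean to the expert mean;
    threshold is the projection of the midpoint between the two means.
    """
    mu_e = mean_vector(expert_acts)
    mu_n = mean_vector(novice_acts)
    direction = [e - n for e, n in zip(mu_e, mu_n)]
    midpoint = [(e + n) / 2 for e, n in zip(mu_e, mu_n)]
    return direction, dot(direction, midpoint)

def probe_predict(activation, direction, threshold):
    """Classify an activation by which side of the threshold it projects to."""
    return "expert" if dot(activation, direction) > threshold else "novice"

def steer(activation, direction, alpha):
    """Shift an activation along the probe direction by a factor alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def synthetic_activations(rng, centre, n, noise=0.5):
    """Toy stand-in for activations: Gaussian noise around a class centre."""
    return [[c + rng.gauss(0, noise) for c in centre] for _ in range(n)]

# Assumed toy data: two well-separated clusters in a 4-dimensional space.
rng = random.Random(0)
expert_acts = synthetic_activations(rng, [1.0, 0.0, 2.0, 0.0], 50)
novice_acts = synthetic_activations(rng, [-1.0, 0.0, -2.0, 0.0], 50)
direction, threshold = fit_mean_difference_probe(expert_acts, novice_acts)

# Push a novice-like activation towards the expert side of the probe.
novice_like = synthetic_activations(rng, [-1.0, 0.0, -2.0, 0.0], 1)[0]
steered = steer(novice_like, direction, alpha=2.0)
print(probe_predict(novice_like, direction, threshold))
print(probe_predict(steered, direction, threshold))
```

In a real experiment the vectors would come from a chosen layer of the model, the probe would be fitted separately per layer to compare where the inference is most readable, and the steered activation would be written back during generation. Those steps depend on the model and tooling, so they are left out here.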

This work is inspired by Chen et al. on LLM user models.

Research questions and high level takeaways:

Can we train a probe

... (read 2442 more words →)