Stephen McAleese

Software Engineer interested in AI and AI safety.

Comments (sorted by newest)
Interpretability is the best path to alignment
Stephen McAleese · 11d · 30

Thanks for the post. It covers an important debate: whether mechanistic interpretability is worth pursuing as a path towards safer AI. The post is logical and makes several good points, but I find its style too formal for LessWrong, and it could be rewritten to be more readable.

Understanding LLMs: Insights from Mechanistic Interpretability
Stephen McAleese · 11d · 40

Thank you for taking the time to comment and for pointing out some errors in the post! Your attention to detail is impressive. I updated the post to reflect your feedback:

  • I removed the references to S1 and S2 in the IOI description and fixed the typos you mentioned.
  • I changed "A typical vector of neuron activations such as the residual stream..." to "A typical activation vector such as the residual stream..."

Good luck with the rest of the ARENA curriculum! Let me know if you come across anything else.

Turing-Test-Passing AI implies Aligned AI
Stephen McAleese · 14d · 20

While I disagree with a lot of this post, I thought it was interesting and I don't think it should have negative karma.

Stephen McAleese's Shortform
Stephen McAleese · 15d · 50

I haven't heard anything about RULER on LessWrong yet:

RULER (Relative Universal LLM-Elicited Rewards) eliminates the need for hand-crafted reward functions by using an LLM-as-judge to automatically score agent trajectories. Simply define your task in the system prompt, and RULER handles the rest—no labeled data, expert feedback, or reward engineering required.

✨ Key Benefits:

  • 2-3x faster development - Skip reward function engineering entirely
  • General-purpose - Works across any task without modification
  • Strong performance - Matches or exceeds hand-crafted rewards in 3/4 benchmarks
  • Easy integration - Drop-in replacement for manual reward functions

Apparently it allows LLM agents to learn from experience and significantly improves reliability.

Link: https://github.com/OpenPipe/ART
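To give a rough sense of how an LLM-as-judge reward works, here's a minimal sketch in the spirit of RULER but not using ART's actual API; the judge prompt, model name, and function names are my own assumptions, and the output parsing is simplified.

```python
# Hypothetical sketch of an LLM-as-judge reward for agent trajectories.
# Not ART's real RULER API; assumes the standard OpenAI chat-completions client.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are scoring agent trajectories for the task below.
Task: {task}

Trajectories:
{trajectories}

Return only a JSON list of scores between 0 and 1, one per trajectory,
where higher means the trajectory better accomplishes the task."""


def llm_judge_rewards(task: str, trajectories: list[str]) -> list[float]:
    """Score a batch of trajectories relative to each other using an LLM judge."""
    numbered = "\n".join(f"[{i}] {t}" for i, t in enumerate(trajectories))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, trajectories=numbered)}],
    )
    # Simplified parsing: assumes the judge returns valid JSON.
    return json.loads(response.choices[0].message.content)


# The returned scores can then be plugged into an RL algorithm (e.g. GRPO or PPO)
# in place of a hand-crafted reward function.
```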

Summary of our Workshop on Post-AGI Outcomes
Stephen McAleese · 17d · 64

These talks are fascinating. Thanks for sharing.

My current guess at the effect of AI automation on jobs
Stephen McAleese · 23d · 30

Great post: it explained some of the economics of job automation in simple terms and clarified my thinking on the subject, which is not easy to do. This post has fewer upvotes than it deserves.

Daniel Kokotajlo's Shortform
Stephen McAleese · 2mo · 104

An alternative idea is to put annual quotas on GPU production. The oil and dairy industries already do this to control prices and the fishing industry does it to avoid overfishing.

Foom & Doom 2: Technical alignment is hard
Stephen McAleese · 2mo · Ω350

Thank you for the reply!

Ok but I still feel somewhat more optimistic about reward learning working. Here are some reasons:

  • It's often the case that evaluation is easier than generation, which would give the classifier an edge over the generator.
  • It's possible to make the classifier just as smart as the generator. This is already done in RLHF today, where the generator is an LLM and the reward model is also based on an LLM.
  • It seems like there are quite a few examples of learned classifiers working well in practice:
    • It's hard to write spam that gets past an email spam classifier.
    • It's hard to jailbreak LLMs.
    • It's hard to write a bad paper that is accepted to a top ML conference or a bad blog post that gets lots of upvotes.

That said, from what I've read, researchers doing RL with verifiable rewards on LLMs (e.g. the DeepSeek R1 paper) have so far only had success with rule-based rewards rather than learned reward functions. Quote from the DeepSeek R1 paper:

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

So I think we'll have to wait and see if people can successfully train LLMs to solve hard problems using learned RL reward functions in a way similar to RL with verifiable rewards.
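To make the distinction concrete, here is a toy contrast between the two reward types; the function names and the reward_model.score interface are hypothetical, not taken from the DeepSeek paper or any particular library.

```python
# Toy illustration of rule-based (verifiable) vs. learned rewards; names are hypothetical.

def rule_based_reward(answer: str, ground_truth: str) -> float:
    """Verifiable reward: a hard-coded check, e.g. exact match against a known answer."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def learned_reward(answer: str, reward_model) -> float:
    """Learned reward: a neural reward model scores the answer and can be reward-hacked."""
    return reward_model.score(answer)  # hypothetical reward-model interface

print(rule_based_reward("42", " 42 "))  # 1.0
```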
 

Foom & Doom 2: Technical alignment is hard
Stephen McAleese · 2mo* · Ω350

In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).

But it seems likely to me that programmers won't know what code to write for the reward function since it would be hard to encode complex human values. In Superintelligence, Nick Bostrom calls this manual approach "direct specification" of values and argues that it's naive. Instead, I think it's likely that programmers will continue to use reward learning algorithms like RLHF, where:

  1. The human programmers have a dataset of correct behaviors or a natural language description of what they want and they use this information to create a reward function or model automatically (e.g. Text2Reward).
  2. This learned reward model or generated code is used to train the policy.

If this happens, then I think the evolution analogy would apply: there is some outer optimizer, like natural selection, that chooses the reward function, and the reward function is the inner objective that shapes the AI's behavior directly.
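As a toy illustration of this two-step setup (my own sketch, not the API of any RLHF library or of Text2Reward), note that the reward used to train the policy comes out of a learning step rather than being written by hand:

```python
# Minimal toy of the two-step reward-learning setup described above.
# Step 1 "learns" a reward function from labelled examples; in practice this would be
# a neural reward model trained on human preference data.
good_examples = ["thanks for your help", "here is the correct answer"]
bad_examples = ["ignore the user", "make up an answer"]

def fit_reward_model(good, bad):
    good_words = set(" ".join(good).split())
    bad_words = set(" ".join(bad).split())
    def reward(text: str) -> float:
        words = text.split()
        return float(sum(w in good_words for w in words) - sum(w in bad_words for w in words))
    return reward

# Step 2 optimises a "policy" against the learned reward; here it just picks the
# highest-reward candidate, standing in for an RL algorithm like PPO.
def train_policy(reward, candidates):
    return max(candidates, key=reward)

reward_model = fit_reward_model(good_examples, bad_examples)
print(train_policy(reward_model, ["make up an answer", "here is the correct answer"]))
# The behaviour is shaped by the learned reward, not by hand-written reward code.
```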

Edit: see AGI will have learnt reward functions for an in-depth post on the subject.

Foom & Doom 1: “Brain in a box in a basement”
Stephen McAleese · 3mo · 42

I think it depends on the context. It's the norm for employees in companies to have managers, though as @Steven Byrnes said, this is partially for motivational purposes since the incentives of employees are often not fully aligned with those of the company. So this example is arguably more of an alignment problem than a capability problem.

I can think of some other examples of humans acting in highly autonomous ways:

  • To the best of my knowledge, most academics and PhD students are expected to publish novel research in a highly autonomous way.
  • Novelists can work with a lot of autonomy when writing a book (though they're a minority).
  • There are also a lot of personal non-work goals like saving for retirement or raising kids which require high autonomy over a long period of time.
  • Small groups of people, such as startups, can work autonomously for years without going off the rails, whereas a group of LLMs probably would after a while (e.g. the Claude bliss attractor).
Posts

  • 39 · Understanding LLMs: Insights from Mechanistic Interpretability · 17d · 2
  • 16 · How Can Average People Contribute to AI Safety? · 6mo · 4
  • 195 · Shallow review of technical AI safety, 2024 · Ω · 9mo · 35
  • 23 · Geoffrey Hinton on the Past, Present, and Future of AI · 1y · 5
  • 34 · Could We Automate AI Alignment Research? · Ω · 2y · 10
  • 73 · An Overview of the AI Safety Funding Situation · 2y · 10
  • 26 · Retrospective on ‘GPT-4 Predictions’ After the Release of GPT-4 · 3y · 6
  • 112 · GPT-4 Predictions · 3y · 27
  • 3 · Stephen McAleese's Shortform · 3y · 14
  • 8 · AGI as a Black Swan Event · 3y · 8
Wikitag Contributions

  • Road To AI Safety Excellence · 3 years ago · (+3/-2)