Ramana Kumar

Will Capabilities Generalise More?

Nate and Eliezer (Lethality 21) claim that capabilities generalise further than alignment once capabilities start generalising far at all. However, they have not articulated particularly detailed arguments for why this is the case. In this post I collect the arguments for and against the position I have been able to find or generate, and develop them (with a few hours’ effort). I invite you to join me in better understanding this claim and its veracity by contributing your own arguments and improving mine.

Thanks to these people for their help with writing and/or contributing arguments: Vikrant Varma, Vika Krakovna, Mary Phuong, Rory Grieg, Tim Genewein, Rohin Shah.

For:

1. Capabilities have much shorter description length than alignment. There are simple “laws of intelligence” that underwrite highly general and competent cognitive abilities, but no such simple laws of corrigibility or laws of “doing what the principal means” – or at least, any specification of these latter things will have a higher description length than the laws of intelligence. As a result, most R&D pathways optimising for capabilities and alignment with anything like a simplicity prior (for example) will encounter good approximations of general intelligence earlier than good approximations of corrigibility or alignment.

2. Feedback on capabilities is more consistent and reliable than on alignment. Reality hits back on cognitive strategies implementing capabilities – such as forming and maintaining accurate beliefs, or making good predictions – more consistently and reliably than any training process hits back on motivational systems orienting around incorrect optimisation targets. Therefore there’s stronger outer optimisation pressure towards good (robust) capabilities than alignment, so we see strong and general capabilities first.

3. There’s essentially only one way to get general capabilities and it has a free parameter for the optimisation target. There are many paths but only one

133 · Jun 29, 2022


Dialogue on What It Means For Something to Have A Function/Purpose

Context for LW audience: Ramana, Steve and John regularly talk about stuff in the general cluster of agency, abstraction, optimization, compression, purpose, representation, etc. We decided to write down some of our discussion and post it here. This is a snapshot of us figuring stuff out together. Hooks from Ramana:...

Jul 15, 2024 · 41

Consent across power differentials

I'd like to put forward another description of a basic issue that's been around for a while. I don't know if there's been significant progress on a solution, and would be happy to be pointed to any such progress. I've opted to go for a relatively rough and quick post that...

Jul 9, 2024 · 52

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

A Sharp Left Turn (SLT) is a possible rapid increase in AI system capabilities (such as planning and world modeling). This post will outline our current understanding of the most promising plan for getting through an SLT and how it could fail (conditional on an SLT occurring). In a previous...

Nov 25, 2022 · 39

Threat Model Literature Review

TL;DR: This post provides a literature review of some threat models of how misaligned AI can lead to existential catastrophe. See our accompanying post for high-level discussion, a categorization and our consensus threat model. Where available we cribbed from the summary in the Alignment Newsletter. For other people's overviews of...

Nov 1, 2022 · 79

Clarifying AI X-risk

TL;DR: We give a threat model literature review, propose a categorization and describe a consensus threat model from some of DeepMind's AGI safety team. See our post for the detailed literature review. The DeepMind AGI Safety team has been working to understand the space of threat models for existential risk...

Nov 1, 2022 · 127

Autonomy as taking responsibility for reference maintenance

I think semantics – specifically, maintaining reference relationships – is a core component of intelligent behaviour. Consequently, I think a better understanding of semantics would enable a better understanding of what machine intelligence that is “trying to do the right thing” ought to look like and how to build it....

Aug 17, 2022 · 61

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

This is our current distillation of the sharp left turn threat model and an attempt to make it more concrete. We will discuss our understanding of the claims made in this threat model, and propose some mechanisms for how a sharp left turn could happen. This is a work in...

Aug 12, 2022 · 86

Thoughts on Human Models