David tries to punt things to LLMs at least once a day on average when we're working. So far, they continue to work best when they can act as Google Search Plus Plus, i.e. when there's some already-known fact relevant to what we're doing and they surface that fact to us. Occasionally they can complete a conceptually-simple-but-technically-dense proof by combining a few such facts, and they can very often write useful code by combining a few already-known pieces.
For anything novel, they remain almost always useless in our experience; they string together words which sound relevant but the semantics don't make any sense.
This section of last year's shallow review of technical AI safety (TAIS) is three months out of date and maybe too coarse-grained, but it's a decent starting point?
Agent foundations
Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theory, abstractions, concepts, etc. (one classic formalization of optimization strength is sketched after this list).
- Theory of change: Rigorously understand optimization processes and agents, and what it means for them to be aligned in a substrate-independent way → identify impossibility results and necessary conditions for aligned optimizer systems → use this theoretical understanding to eventually design safe architectures that remain stable and safe under self-reflection
- General approach: cognitive · Target case: worst-case
- Orthodox alignment problems: Value is fragile and hard to specify, Corrigibility is anti-natural, Goals misgeneralize out of distribution
- See also: Aligning what? · Tiling agents · Dovetail
- Some names: Abram Demski, Alex Altair, Sam Eisenstat, Thane Ruthenis, Alfred Harwood, Daniel C, Dalcy K, José Pedro Faustino
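To make "optimization strength" above slightly more concrete, here is a minimal sketch of one classic formalization in this genre (after Yudkowsky's "Measuring Optimization Power" framing, which some work in this agenda builds on). The symbols Ω, ⪰, and P₀ are my own illustrative notation, not the agenda's canonical definitions.

```latex
% A minimal sketch of one formalization of "optimization strength",
% after Yudkowsky's "Measuring Optimization Power" framing.
% Illustrative assumptions (not the agenda's canonical notation):
%   \Omega  : the space of possible outcomes
%   \succeq : the optimizer's preference ordering over outcomes
%   P_0     : a baseline distribution over outcomes absent the optimizer
\[
  \mathrm{OP}(\omega) \;=\; -\log_2 P_0\bigl(\{\omega' \in \Omega : \omega' \succeq \omega\}\bigr)
\]
% Reading: hitting an outcome \omega that the baseline process would
% reach-or-beat with probability 1/2^k counts as k bits of optimization;
% each extra bit halves the baseline probability of doing at least as well.
```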
Some outputs (10)
- Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games. Cole Wyeth et al.
- UAIASI. Cole Wyeth
- Clarifying "wisdom": Foundational topics for aligned AIs to prioritize before irreversible decisions
- Agent foundations: not really math, not really science. Alex_Altair
- Off-switching not guaranteed. Sven Neth
- Formalizing Embeddedness Failures in Universal Artificial Intelligence. Cole Wyeth, Marcus Hutter
- Is alignment reducible to becoming more coherent? Cole Wyeth
- What Is The Alignment Problem? johnswentworth
- Good old fashioned decision theory
- Report & retrospective on the Dovetail fellowship. Alex Altair
What's the state of using current AIs for agent foundations research, or other theoretical AI safety work?
I'd be pretty surprised if no one has thought to do this, so I'm guessing this is just a matter of me catching up on what's going on.
I'm thinking, for instance, of seeing Terence Tao talk about using Archimedes for advancing math. And I'd think current AIs could explain their theoretical advances in ways that convey the key insights to skilled humans, so they could maybe act as insight-searchers rather than just theorem-proving machines.
I get the sense that the current thinking is that this kind of work can't pay off fast enough on short timelines, and that raising the alarm in public and in politics is more important right now. At first brush, that basically looks right to me.
But I don't know; a straight AI-assisted shot at theoretical alignment work seems probably worth trying, and easy enough for anyone to try, so I imagine someone is working on it. I just haven't heard of anyone doing so yet.
So, what's the current status of this kind of work?