What's next for the field of Agent Foundations?

Alexander Gietelink Oldenziel; mattmacdermott

The title of this dialogue promised a lot, but I'm honestly a bit disappointed by the content. It feels like the authors are discussing exactly how to run particular mentorship programs and structure grants and how research works in full generality, while no one is actually looking at the technical problems. All field-building efforts must depend on the importance and tractability of technical problems, and this is just as true when the field is still developing a paradigm. I think a paradigm is established only when researchers with many viewpoints build a sense of which problems are important, then try many approaches until one successfully solves many such problems, thus proving the value of said approach. Wanting to find new researchers to have totally new takes and start totally new illegible research agendas is a level of helplessness that I think is unwarranted-- how can one be interested in AF without some view on what problems are interesting?

I would be excited about a dialogue that goes like this, though the format need not be rigid:

What are the most important [1] problems in agent foundations, with as much specificity as possible?
- Responses could include things like:
  - A sound notion of "goals with limited scope": can't nail down precise desiderata now, but humans have these all the time, we don't know what they are, and they could be useful in corrigibility or impact measures.
  - Finding a mathematical model for agents that satisfies properties of logical inductors but also various other desiderata
  - Further study of corrigibility and capability of agents with incomplete preferences
- Participants discuss how much each problem scratches their itch of curiosity about what agents are.
What techniques have shown promise in solving these and other important problems?
- Does [infra-Bayes, Demski's frames on embedded agents, some informal 'shard theory' thing, ...] have a good success to complexity ratio?
  - probably none of them do?
What problems would benefit the most from people with [ML, neuroscience, category theory, ...] expertise?

[1]: (in the Hamming sense that includes tractability)

[-]Alexander Gietelink Oldenziel1y20

You may be positively surprised to know I agree with you. :)

For context, the dialogue feature just came out on LW. We gave it a try and this was the result. I think we mostly concluded that the dialogue feature wasn't quite worth the effort. Anyway

I like what you're suggesting and would be open to doing a dialogue about it!

[-]aysja2y159

I've definitely also seen the failure mode where someone is only or too focused on "the puzzles of agency" without having an edge in linking those questions up with AI risk/alignment. Some ways of asking about/investigating agency are more and less relevant to alignment, so I think it's important that there is a clear/strong enough "signal" from the target domain (here: AI risk/alignment) to guide the search/research directions

I disagree—I think that we need more people on the margin who are puzzling about agency, relative to those who are backchaining from a particular goal in alignment. Like you say elsewhere, we don’t yet know what abstractions make sense here; without knowing what the basic concepts of "agency" are it seems harmful to me to rely too much on top-down approaches, i.e., ones that assume something of an end goal.

In part that’s because I think we need higher variance conceptual bets here, and I think that over-emphasizing particular problems in alignment risks correlating people's minds. In part it's because I suspect that there are surprising, empirical things left to learn about agency that we'll miss if we prefigure the problem space too much.

But also: many great scientific achievements have been preceded by bottom-up work (e.g., Shannon, Darwin, Faraday), and afaict their open-ended, curious explorations are what laid the groundwork for their later theories. I feel that it is a real mistake to hold all work to the same standards of legible feedback loops/backchained reasoning/clear path to impact/etc, given that so many great scientists did not follow this. Certainly, once we have a bit more of a foundation this sort of thing seems good to me (and good to do in abundance). But I think before we know what we’re even talking about, over-emphasizing narrow, concrete problems risks the wrong kind of conceptual research—the kind of “predictably irrelevant” work that Alexander gestures towards.

[-]Alex_Altair2y121

I'd like to gain clarity on what we think the relationship should be between AI alignment and agent foundations. To me, the relationship is 1) historical, in that the people bringing about the field of agent foundations are coming from the AI alignment community and 2) motivational, in that the reason they're investigating agent foundations is to make progress on AI alignment, but not 3) technical, in that I think agent foundations should not be about directly answering questions of how to make the development of AI beneficial to humanity. I think it makes more sense to pursue agent foundations as a quest to understand the nature of agents as a technical concept in its own right.

If you are a climate scientist, then you are very likely in the field in order to help humanity reduce the harms from climate change. But on a day-to-day basis, the thing you are doing is trying to understand the underlying patterns and behavior of the climate as a physical system. It would be unnatural to e.g. exclude papers from climate science journals on the grounds of not being clearly applicable to reducing climate change.

For agent foundations, I think some of the core questions revolve around things like, how does having goals work? How stable are goals? How retargetable are goals? Can we make systems that optimize strongly but within certain limitations? But none of those questions are directly about aligning the goals with humanity.

There's also another group of questions like, what are humans' goals? How can we tell? How complex and fragile are they? How can we get an AI system to imitate a human? Et cetera. But I think these questions come from a field that is not agent foundations.

There should certainly be constant and heavy communication between these fields. And I also think that even individual people should be thinking about the applicability questions. But they're somewhat separate loops. A climate scientist will have an outer loop that does things like, chooses a research problem because they think the answer might help reduce climate change, and they should keep checking on that belief as they perform their research. But while they're doing their research, I think they should generally be using an inner loop that just thinks, "huh, how does this funny 'climate' thing work?"

[-]Jan_Kulveit2y96

These are especially common, surprisingly perhaps, in AI and ML departments.

This is somewhat unsurprising given human psychology.
- Scaling up LLMs killed a lot of research agendas inside ML, particularly NLP. Imagine your whole research career was built on improving benchmarks on some NLP problem using various clever ideas. Now, the whole thing is better solved by a three-sentence prompt to GPT-4 and everything everyone in the subfield worked on is irrelevant for all practical purposes... how do you feel? In love with scaled LLMs?
- Overall, what people often like about research is coming up with smart ideas, and there is some aesthetics going into it. What's traditionally not part of the aesthetics is 'and you also need to get $100M in compute', and it's reasonable to model a lot of people as having a part which hates this.

[-]Viliam2y51

Kinda like mathematicians hated it when the four color theorem was solved by a computer brute-forcing thousands of options. Only imagine that the same thing happens to hundreds of important mathematical problems -- the proper way to solve them becomes to reduce them to a huge but finite number of cases, then throw lots of money at a computer that will handle these cases one by one, producing a "proof" that no human will ever be able to verify directly.

[-]johnswentworth2y50

I don't think, for example, there's a good intro resource you can send somebody that makes a common-sense case for "basic research into agency could be useful for avoiding risks from powerful AI"

My talk for the alignment workshop at the ALIFE conference this past summer was roughly what I think you want. Unfortunately I don't think it was recorded. Slides are here, but they don't really do it on their own.

[-]Nora_Ammann2y*90

FWIW I also think the "Key Phenomena of AI risk" reading curriculum (h/t TJ) does some of this at least indirectly (it doesn't set out to directly answer this question, but I think a lot of the answers to the question are contained in the curriculum).

(Edit: fixed link)

[-]Nora_Ammann2y30

How confident are you about it not having been recorded? If not very, seems probably worth checking again

[-]rorygreig2y60

The workshop talks from the previous year's ALIFE conference (2022) seem to be published on YouTube, so I'm following up with whether John's talk from this year's conference can be released as well.

[-]rorygreig2y120

The video of John's talk has now been uploaded on YouTube here.

[-]johnswentworth2y50

I mean, I could always re-present it and record if there's demand for that.

... or we could do this the fun way: powerpoint karaoke. I.e. you make up the talk and record it, using those slides. I bet Alexander could give a really great one.

[-]Nora_Ammann2y84

I have no doubt Alexander would shine!

Happy to run a PIBBSS speaker event for this, record it and make it publicly available. Let me know if you're keen and we'll reach out to find a time.

[-]Nora_Ammann2y40

To follow up on this, we'll be hosting John's talk on Dec 12th, 9:30AM Pacific / 6:30PM CET.

Join through this Zoom Link.

Title: AI would be a lot less alarming if we understood agents

Description: In this talk, John will discuss why and how fundamental questions about agency - as they are asked, among others, by scholars in biology, artificial life, systems theory, etc. - are important to making progress in AI alignment. John gave a similar talk at the annual ALIFE conference in 2023, as an attempt to nerd-snipe researchers studying agency in a biological context.

--

To be informed about future Speaker Series events, subscribe to our SS Mailing List here. You can also add the PIBBSS Speaker Events to your calendar through this link.

[-]Alex_Altair2y20

You can also add the PIBBSS Speaker Events to your calendar through this link.

FYI this link redirects to a UC Berkeley login page.

[-]cousin_it2y*40

Maybe an even better analogy is non-Euclidean geometry. Agent foundations is studying a strange alternate world where agents know the source code to themselves and the universe, where perfect predictors exist and so on. It's not an abstraction of our world, but something quite different. But surprisingly it turns out that many aspects of decision-making in our world have counterparts in the alternate world, and by studying them we shed a strange light on what decision-making in our world actually means.

I'm not even sure these investigations should be tied to AI risk (though that's very important too). To me the other world offers mathematical and philosophical interest on its own, and frankly I'm curious where these investigations will lead (and have contributed to them where I could).

[-]Alexander Gietelink Oldenziel2y86

Modelling always requires idealisation. Currently, in many respects the formal models that Agent Foundations uses to capture the informal notions of agency, intention, goal, etc. are highly idealised. This is not an intrinsic feature of Agent Foundations or mathematical modelling, just a reflection of the inadequate mathematical and conceptual state of the world.

By analogy - intro to Newtonian Mechanics begins with frictionless surfaces and the highly simple orbits of planetary systems. That doesn't mean that Newtonian Mechanics in more sophisticated forms cannot be applied to the real world.

One can get lost in the ethereal beauty of ideal worlds. That should not detract from the ultimate aim of mathematical modelling of the real world.

[-]Alex_Altair2y30

Agent foundations is studying a strange alternate world where agents know the source code to themselves and the universe, where perfect predictors exist and so on

I just want to flag that this is very much not a defining characteristic of agent foundations! Some work in agent foundations will make assumptions like this, some won't -- I consider it a major goal of agent foundations to come up with theories that do not rely on assumptions like this.

(Or maybe you just meant those as examples?)

[-]Nicholas Kross2y31

Another idea that Matt suggested was a BlueDot-style "Agent Foundations-in-the-broad-sense" course.

I would love this and take this myself, fwiw. (Even if I didn't get in, I'd still make "working through such a course's syllabus" one of my main activities in the short term.)

[-]Alex_Altair2y32

FWIW I saw "Anti-MATS" in the sidebar and totally assumed that meant that someone in the dialogue was arguing that the MATS program was bad (instead of discussing the idea of a program that was like MATS but opposite).

[-]kave2y10

Same. My friend Bob suggests "co-MATS"

[-]Raemon2y30

"Reverse MATS"?

(I think I agree that "co-MATS" is in some sense a more accurate description of what's going on, but Reverse MATS feels like it gets the idea across better at first glance)

[-]mattmacdermott2y10

Oops, thanks, I’ve changed it to Reverse MATS to avoid confusion.

Why agent foundations?

My own reasoning for foundational work on agency being a potentially fruitful direction for alignment research is:

- Most misalignment threat models are about agents pursuing goals that we'd prefer they didn't pursue (I think this is not controversial)
- Existing formalisms about agency don't seem all that useful for understanding or avoiding those threats (again probably not that controversial)
- Developing new and more useful ones seems tractable (this is probably more controversial)

The main reason I think it might be tractable is that so far not that many person-hours have gone into trying to do it. A priori it seems like the sort of thing you can get a nice mathematical formalism for, and so far I don't think that we've collected much evidence that you can't.

So I think I'd like to get a large number of people with various different areas of expertise thinking about it, and I'd hope that some small fraction of them discovered something fundamentally important. And a key question is whether the way the field currently works is conducive to that.
