An important thing that the AGI alignment field never understood:
Reflective stability. Everyone thinks it's about, like, getting guarantees, or something. Or about rationality and optimality and decision theory, or something. Or about how we should understand ideal agency, or something.
But what I think people haven't understood is
There are exceptions and nuances and possible escape routes. And the older Yudkowsky-led research about decision theory and tiling and reflective probability is relevant. But this basic argument is in some sense simpler (less advanced, but also more radical ("at the root")) than those essays. The response to the failure of those essays can't just be to "try something else about alignment"; the basic problem is still there and has to be addressed.
(related elaboration: https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html https://tsvibt.blogspot.com/2023/01/the-voyage-of-novelty.html )
Agreed! I tried to say the same thing in "The alignment stability problem".
I think most people in prosaic alignment aren't thinking about this problem. Without this, they're working on aligning AI, but not on aligning AGI or ASI. It seems really likely on the current path that we'll soon have AGI that is reflective. In addition, it will do continuous learning, which introduces another route to goal change (e.g., learning that what people mean by "human" mostly applies to some types of artificial minds, too).
The obvious route past this problem, that I think prosaic alignment often sort of assumes without being explicit about it, is that humans will remain in charge of how the AGI updates its goals and beliefs. They're banking on corrigible or instruction-following AGI.
I think that's a viable approach, but we should be more explicit about it. Aligning AI probably helps with aligning AGI, but they're not the same thing, so we should try to get more sure that prosaic alignment really helps align a reflectively stable AGI.
Say more about point 2 there? Thinking about 5 and 6 though - I think I now maybe have a hopeworthy intuition worth sharing later.
Say you have a Bayesian reasoner. It's got hypotheses; it's got priors on them; it's got data. So you watch it doing stuff. What happens? Lots of stuff changes, tide goes in, tide goes out, but it's still a Bayesian, can't explain that. The stuff changing is "not deep". There's something stable though: the architecture in the background that "makes it a Bayesian". The update rules, and the rest of the stuff (for example, whatever machinery takes a hypothesis and produces "predictions" which can be compared to the "predictions" from other hypotheses). And: it seems really stable? Like, even reflectively stable, if you insist?
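To make the "tide goes in, tide goes out" picture concrete, here's a minimal sketch of such a reasoner. The coin hypotheses, the priors, and the name `bayes_update` are all made up for illustration; nothing here is from the original post. The thing to notice is that the posterior numbers change on every observation, while the hypothesis set, the likelihood machinery, and the update rule never do.

```python
from fractions import Fraction

def bayes_update(prior, likelihoods, observation):
    """Return the posterior over hypotheses after one observation (Bayes' rule)."""
    unnormalized = {h: prior[h] * likelihoods[h][observation] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}

# Hypothetical hypotheses about a coin; each one yields "predictions"
# (probabilities of H and T) that can be compared against the others.
likelihoods = {
    "fair": {"H": Fraction(1, 2), "T": Fraction(1, 2)},
    "biased": {"H": Fraction(3, 4), "T": Fraction(1, 4)},
}
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}

beliefs = prior
history = [beliefs]
for obs in "HHTH":
    # The data changes the weights; it never touches bayes_update itself
    # or the set of hypotheses being weighed.
    beliefs = bayes_update(beliefs, likelihoods, obs)
    history.append(beliefs)

print(beliefs)  # {'fair': Fraction(16, 43), 'biased': Fraction(27, 43)}
```

In this toy, "it's still a Bayesian" just means that `bayes_update` and the keys of `likelihoods` are outside the reach of anything the loop does.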
So does this solve stability? I would say, no. You might complain that the reason it doesn't solve stability is just that the thing doesn't have goal-pursuits. That's true but it's not the core problem. The same issue would show up if we for example looked at the classical agent architecture (utility function, counterfactual beliefs, argmaxxing actions).
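For reference, the classical agent architecture mentioned here can be sketched in a few lines. This is a toy under made-up assumptions (the umbrella scenario, the names `choose_action`, `likes_dry`, and `likes_wet` are hypothetical), with `belief[action][outcome]` standing in for the counterfactual beliefs. Swapping the utility function changes the choice, but the slots themselves (beliefs, utility, argmax) stay fixed.

```python
from fractions import Fraction

def choose_action(actions, outcomes, belief, utility):
    """Classical agent step: pick the action with the highest expected utility."""
    def expected_utility(action):
        # belief[action][o] is the agent's probability of outcome o
        # if it were to take this action (the counterfactual belief).
        return sum(belief[action][o] * utility[o] for o in outcomes)
    return max(actions, key=expected_utility)

actions = ["umbrella", "no_umbrella"]
outcomes = ["dry", "wet"]
belief = {
    "umbrella": {"dry": Fraction(9, 10), "wet": Fraction(1, 10)},
    "no_umbrella": {"dry": Fraction(1, 2), "wet": Fraction(1, 2)},
}
likes_dry = {"dry": 1, "wet": 0}
likes_wet = {"dry": 0, "wet": 1}

choice_a = choose_action(actions, outcomes, belief, likes_dry)
choice_b = choose_action(actions, outcomes, belief, likes_wet)
print(choice_a, choice_b)  # umbrella no_umbrella
```

Everything that varies here is content poured into a written-down architecture; the architecture itself is exactly the part that never varies in the sketch.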
The problem is that the agency you can write down is not the true agency. "Deep change" is change that changes elements that you would have considered deep, core, fundamental, overarching... Change that doesn't fit neatly into the mind, change that isn't just another piece of data that updates some existing hypotheses. See https://tsvibt.blogspot.com/2023/01/endo-dia-para-and-ecto-systemic-novelty.html
"You might complain that the reason it doesn't solve stability is just that the thing doesn't have goal-pursuits."
Not so - I'd just call it the trivial case and implore us to do better literally at all!
Apart from that, thanks - I have a better sense of what you meant there. "Deep change" as in "no, actually, whatever you pointed to as the architecture of what's Really Going On... can't be that, not for certain, not forever."
I'd go stronger than just "not for certain, not forever", and I'd worry you're not hearing my meaning (agree or not). I'd say in practice more like "pretty soon, with high likelihood, in a pretty deep / comprehensive / disruptive way". E.g. human culture isn't just another biotic species (you can make interesting analogies but it's really not the same).
"I'd go stronger than just "not for certain, not forever", and I'd worry you're not hearing my meaning (agree or not)."
That's entirely possible. I've thought about this deeply for entire tens of minutes, after all. I think I might just be erring (habitually) on the side of caution in qualities of state-changes I describe expecting to see from systems I don't fully understand. OTOH... I have a hard time believing that even (especially?) an extremely capable mind would find it worthwhile to repeatedly rebuild itself from the ground up, such that few of even the ?biggest?/most salient features of a mind stick around for long at all.
I have no idea what goes on in the limit, and I would guess that what determines the ultimate effects (https://tsvibt.blogspot.com/2023/04/fundamental-question-what-determines.html) would become stable in some important senses. Here I'm mainly saying that the stuff we currently think of as being core architecture would be upturned.
I mean it's complicated... like, all minds are absolutely subject to some constraints--there's some Bayesian constraint, like you can't "concentrate caring in worlds" in a way that correlates too much with "multiversally contingent" facts, compared to how much you've interacted with the world, or something... IDK what it would look like exactly, and if no one else knows then that's kinda my point. Like, there's
Another class of constraints like this are those discussed in computational complexity theory.
So there are probably constraints, but we don't really understand them and definitely don't know how to wield them, and in particular we understand the ones about goal-pursuits much less well than we understand the ones about probability.
This argument does not seem clear enough to engage with or analyze, especially steps 2 and 3. I agree that concepts like reflective stability have been confusing, which is why it is important to develop them in a grounded way.
Well, it's a quick take. My blog has more detailed explanations, though not organized around this particular point.
We'd have to talk more / I'd have to read more of what you wrote, for me to give a non-surface-level / non-priors-based answer, but on priors (based on, say, a few dozen conversations related to multiple agency) I'd expect that whatever you mean by hierarchical agency is dodging the problem. It's just more homunculi. It could serve as a way in / as a centerpiece for other thoughts you're having that are more so approaching the problem, but the hierarchicalness of the agency probably isn't actually the relevant aspect. It's like if someone is trying to explain how a car goes and then they start talking about how, like, a car is made of four wheels, and each wheel has its own force that it applies to a separate part of the road in some specific position and direction and so we can think of a wheel as having inside of it, or at least being functionally equivalent to having inside of it, another smaller car (a thing that goes), and so a car is really an assembly of 4 cars. We're just... spinning our wheels lol.
Just a guess though. (Just as a token to show that I'm not completely ungrounded here w.r.t. multi-agency stuff in general, but not saying this addresses specifically what you're referring to: https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html)
"It's just more homunculi."
It's a bit annoying to me that "it's just more homunculi" is kind of powerful for reasoning about humans, yet also evades understanding agentic things. I also find it tempting because it gives a cool theoretical foothold to work off, but I wonder whether the approach is hiding most of the complexity of understanding agency.
Agreed we would have to talk more. I think I mostly get the homunculi objection. Don't have time now to write an actual response, so here are some signposts:
- part of what you call agency is explained by a roughly active-inference style of reasoning
-- some types of "living" system are characterized by having boundaries between them and the environment (boundaries mostly in the sense of separation of variables)
-- maintaining the boundary leads to a need to model the environment
-- modelling the environment introduces a selection pressure toward approximating Bayes
- another critical ingredient is boundedness
-- in this universe, negentropy isn't free
-- this introduces fundamental tradeoff / selection pressure for any cognitive system: length isn't free, bitflips aren't free, etc.
(--- downstream of that is compression everywhere, abstractions)
-- empirically, the cost/returns function for scaling cognition usually hits diminishing returns, leading to minds where it's not effective to grow the single mind further
--- this leads to the basin of convergent evolution I call "specialize and trade"
-- empirically, for many cognitive systems, there is a general selection pressure toward modularity
--- I don't know all the reasons for that, but one relatively simple one is 'wires are not free'; if wires are not free, you get colocation of computations, like brain regions or industry hubs
--- other possibilities are selection pressures from CAP theorem, MVG, ...
(modularity also looks a bit like box-inverted specialize and trade)
So, in short, I think I agree with the spirit of "If humans didn't have a fixed skull size, you wouldn't get civilization with specialized members", and my response is that there seems to be an extremely general selection pressure in this direction. If cells were able to just grow in size and it was efficient, you wouldn't get multicellulars. If code bases were able to just grow in size and it was efficient, I wouldn't get a myriad of packages on my laptop; it would all be just kernel. (But even if it was just kernel, it seems modularity would kick in and you would still get the 'distinguishable parts' structure.)