Reference class tennis, yay!
I think I see somewhat where you are coming from, but could you spell it out for me a bit more? Maybe by describing a somewhat fleshed-out concrete example scenario? I fully acknowledge that any such scenario would be just one hastily-put-together possibility among many.
Let me start by proposing one such possibility, but feel free to go in another direction entirely. Let's suppose the altruistic few put together sanctuaries or "wild human life reserves"; how might this play out afterwards? Would the selfish ones somehow try to intrude on or curtail this practice? By our scenario's granted premises, the altruistic ones do wield real power, and they use some fraction of it to maintain this sanctuary. Even if the others are many, would they have much to gain by trying to mess with it? Is it just entertainment or sport for them? What do they stand to gain? Not really anything economic, nor more power; or maybe you think that they do?
There is one counterargument that I sometimes hear and that I'm not sure how convincing I should find:
Do you agree or disagree with any parts of this?
P.S. This might go without saying, but this question might only be relevant if technical alignment can be and is solved in some fashion. With that said, I think it's entirely good to ask it, lest we clear one impossible-seeming hurdle and still find ourselves in a world of hurt all the same.
This only needs there to exist something of a Pareto frontier of either very altruistic okay-offs, or well-off only-a-little-altruists, or somewhere in between. If we have many very altruistic very-well-offs, then the argument might just make itself, so I'm arguing in a less convenient context.
This might truly be tiny, like one one-millionth of someone's wealth, a rounding error. Someone arguing for side A might be positing a very large amount of callousness if all the other points stand. Or indifference. Or some other force that pushes against the desire to help.
> The low-trust attractor starts to bend other people into reciprocal low-trust shapes, just like a prion twisting nearby proteins.
Convincing people using your actions sounds disgusting!
Could you expand on what you mean here? I'm not sure I or others followed you. Perhaps you meant it sarcastically?
(Formatting wise: not sure how to quote a quote here, perhaps someone knows?)
Is an audiobook version also planned, by any chance? Could preordering that one also help?
Judging from Stephen Fry's endorsement and what seems to be his longstanding interest in the topic, perhaps a delightful and maybe even eager deal could be made where he narrates? Unless some other choice might be better for either party, of course. I also understand if negotiations or existing agreements prevent anyone from confirming anything on this front; I'd be happy just to hear whether an audio version is planned or intended at all, and when, if that can be shared.
I might be missing something that's written on this page, including the comments, but if not, here is my vague understanding of what people might fear regarding money pumps. I'm going to diverge from your model a bit and use the concept of sub-world-states, denoted A', B', and C', which include everything about the world that can be preferred except for how much money you have; that I handle separately in this comment.
A' -> B' -> C' -> A' preferences hold in a cycle.
m - ε -> m: the preference also holds for money had (more is preferred to less); let's say each transition in the cycle above costs ε of money.
I think agents, either intrinsically or instrumentally, (have to) simplify their decisions by factoring them at each timestep.
So they ask themselves:
Do I prefer going from A' -> B' more than having m -> m - ε, more concretely, more than the ε that the trade costs me?
In this example, the non-money preference is strong, so the answer is clearly yes.
Even if the agent plans ahead a bit, and considers:
Do I prefer A' -> B' -> C' more than having m -> m - 2ε?
The answer will still be a clear yes.
The interesting question is what someone who fears money pumps might say an agent would do if it occurs to the agent to plan far enough ahead and consider:
Do I prefer A' -> B' -> C' -> A' more than having m -> m - 3ε?
According to both this and your formalisms, this agent should clearly realize that they much prefer m to m - 3ε, since the sub-world-state ends up right back where it started, and stay put at A'. And I think you are correct to ask whether planning is allowed by these different formalisms, and how it fits in.
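Spelled out with made-up numbers (the momentary "pull" of each arrow and a per-trade cost of ε = 1 are both my own, purely illustrative choices):

```python
step_pull = 10   # how attractive each single arrow A' -> B', B' -> C', C' -> A' feels in the moment
epsilon = 1      # money paid per trade, valued in the same units (illustrative)

# One step ahead: the pull dwarfs the cost, so the trade looks clearly worth it.
print(step_pull - epsilon)   # 9 -> "clear yes"

# The whole lap ahead: the sub-world-state ends up back at A', so the only
# lasting change is the money lost; staying put is clearly better.
print(0 - 3 * epsilon)       # -3 -> much prefer m to m - 3*epsilon, stay at A'
```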
I think concerns come in two flavors:
One is how you put it: if the agent is stupid (or, more charitably, computationally bounded, as we all are), they might not realize that they are going in circles and trading away value in the process for no (comparable?) benefit to themselves. Maybe agents are more prone to noticing repetition and stopping after a few cycles than to predicting it in advance, since prediction and planning are famously hard.
The other concern is what we seem to notice in other humans and might notice in ourselves as well (and therefore might in practice diverge from idealized formalisms): sometimes we know or strongly suspect that something is likely not a good choice, and yet we do it anyway. How come? One simple answer is how preference evaluations work in humans: if A' -> B' is strongly enough preferred in itself, knowing or suspecting that the full A' -> B' -> C' -> A' lap comes bundled with m -> m - 3ε might not be strong enough to override it.
It might be important that, if we can, we construct agents that do not exhibit this 'flaw'. Although one might need to be careful with such wishes, since such an agent might monomaniacally pursue a peak and then statically stick to it once reached, which humans might dis-prefer, and which might itself be incoherent. (This has interested me for a while, and I am not yet convinced that human values do not contain some (fundamental?) incoherence, e.g. in the form of such loops. For better or for worse I expanded on this a bit below, though not at all formally, and I fear less than clearly.)
So in summary, I think that if an agent a) plans ahead (or at least remembers) enough to notice when a series of trades is taking it in a circle, and b) lets that overall evaluation override the local pull of each individual transition, then it cannot be made worse off by money pumps around things it cares about.
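To make that summary a bit more tangible, here is a minimal sketch in code, under my toy setup above (the state names, the per-trade cost ε, and the simple "notice you are circling back" rule are all my own illustrative choices, not anything from the post):

```python
# Toy money pump (all names and numbers are mine): cyclic preferences over
# three sub-world-states; each trade around the cycle costs EPSILON of money.
NEXT = {"A'": "B'", "B'": "C'", "C'": "A'"}
EPSILON = 1

def money_lost(notices_loops, max_trades=1000):
    """Run the pump against an agent.

    notices_loops = the agent both remembers/plans enough to see that the next
    trade would put it back in a state it has already held (condition a), and
    lets that realization override the locally attractive trade (condition b).
    """
    state, seen, lost = "A'", {"A'"}, 0
    for _ in range(max_trades):
        nxt = NEXT[state]
        if notices_loops and nxt in seen:
            break  # closing the lap would leave the world unchanged and the agent poorer
        state, lost = nxt, lost + EPSILON
        seen.add(state)
    return lost

print(money_lost(notices_loops=False))  # 1000: pumped for as long as trades are offered
print(money_lost(notices_loops=True))   # 2: trades twice, then refuses to close the loop
```

The point is only that the loop-noticing agent's losses are bounded by one incomplete lap, while the other agent keeps paying for as long as trades keep being offered.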
I think it's very important that we get as clear as we can on your questions, and I've only responded to a small part of what you wrote. I might respond to more, hopefully in a more targeted and clearer way, if I have more time later. And I really hope that others provide answers to your questions as well.
Some bonus pondering is below, much less connected to your post; it just felt nice to think this through a little and perhaps invite others' thoughts as well.
Let's imagine the terminus of human preference satisfaction. Let's assume that all preferences are fulfilled, and, importantly, in a non-wire-headed fashion. What would that look like, at least abstractly?
a) Rejecting the premise: All (human) preferences can never be fulfilled. If one has n Dyson spheres, one can always wish for n+1. There will always be an orderable list of world states that we can inch ever higher on. And even if it's hard to imagine what someone might desire once they could have everything we can currently imagine, by definition new desires will always spring up. In a sense, dissatisfaction might be a constant companion.
b) We find a static peak of human preferences. Hard to imagine what this might be, especially if we ruled out wireheading. Hard to imagine not dis-preferring it at least a little.
c) A (small or large) cycle is found and occupied at the top. This might also fulfill the (meta-)preference against boringness. But it's hard to escape the fact that this would still be a preference cycle. And if nothing else we are spending negentropy anyway, so maybe this is a helical money-pump spiral to the bottom after all?
d) Something more chaotic is happening at the top of the preferences, with no true cycles, but maybe dynamically changing fads and fashions, never deviating much from the peak. It's hard to see how or why states would be transitioned to and from if one believes in cardinal utility. This still spends negentropy, but if we never truly return to a prior world-state even apart from that, maybe it's not a cycle in the formal sense?
I welcome thoughts and votes on the above possibilities.
I think this is a nice write-up; let me add some nuance in two directions:
Indeed these are quick-and-dirty heuristics that can be subpar, but you may or may not be surprised by how often decisions don't reach even this bar. In my work, when we are about to make a decision, I sometimes have to explicitly ask: do we have even a single reason to pick the option we were about to pick over one or more of the others? And I often find myself pointing out that (one of) those other options actually has reason(s) in its favor; I didn't bring up the question for nothing, after all.
In these cases I could argue that we upgraded from no-reason deciding to at least any-reason deciding. (If we even did: in some contexts it's not unheard of for the answer to the above to be something along the lines of "I cannot name any reasons, but I still want to pick the first option.")
This is how we can cross from lower sophistication to the middle. However, there are perils in going ever higher: once we have identified at least one set of opposing reasons, we cross into a regime that can be immensely costly, namely weighing reasons against each other, especially when people disagree. And I'd argue that people are in general quite bad at this, which is why it can eat up a lot of resources and still produce rather arbitrary results.
Of course all of this has to be balanced against how important the decision even is and how much effort, if any, should be expended on it. And I think humans are quite bad at judging this too, though we do approximate it somewhat, with large variance.
Thank you for naming these patterns!
I have found it! This was the one:
https://www.lesswrong.com/posts/qvNrmTqywWqYY8rsP/solutions-to-problems-with-bayesianism
Seems to have seen better reception at: https://forum.effectivealtruism.org/posts/3z9acGc5sspAdKenr/solutions-to-problems-with-bayesianism
The winning search strategy was quite interesting as well, I think:
I took the history of roughly all the LW articles I have ever read; I had easy access to all the titles and URLs, but not the article contents. I fed the titles one by one into a 7B LLM, asking it to rate how likely, based on the title alone, the unseen article content was to match what I described above, as vague as that memory may be. Then I looked at the highest-ranking candidates, and they were a dud. I did the same thing with a 70B model, et voilà, the solution was indeed near the top.
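For concreteness, here is a rough sketch of what that loop could look like. It is not my actual code: it assumes a local OpenAI-compatible server, and the endpoint, model name, prompt wording, and 0-100 scoring scheme are all placeholders.

```python
# Rough reconstruction of the search loop (not the code I actually ran).
# Assumes a local OpenAI-compatible server; endpoint, model name, prompt
# wording, and the 0-100 scoring are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

MEMORY = "a post proposing fixes to standard problems with Bayesianism"  # my vague recollection

def score(title: str) -> float:
    """Ask the model for a 0-100 guess that the article behind `title` matches MEMORY."""
    reply = client.chat.completions.create(
        model="local-7b-or-70b",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"I am trying to re-find an article I only half remember: {MEMORY}.\n"
                f"Based on the title alone, how likely is this to be it?\n"
                f"Title: {title}\n"
                "Answer with a single number from 0 to 100."
            ),
        }],
    )
    try:
        return float(reply.choices[0].message.content.strip().split()[0])
    except (ValueError, IndexError, AttributeError):
        return 0.0  # unparseable answer: rank it last

# In practice this held every (title, URL) pair from my reading history.
history = [
    ("Solutions to Problems with Bayesianism",
     "https://www.lesswrong.com/posts/qvNrmTqywWqYY8rsP/solutions-to-problems-with-bayesianism"),
    ("Some other half-remembered post", "https://www.lesswrong.com/posts/placeholder"),
]

ranked = sorted(history, key=lambda pair: score(pair[0]), reverse=True)
for title, url in ranked[:10]:
    print(f"{title}\n  {url}")
```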
Now I just need to re-read it to see if it was worth dredging up; I guess when a problem starts to itch, it's hard to resist solving it.
Thanks for your response! Can I ask the same question of you as I do here in this cousin comment?