The observation that gains saturate has a fairly simple explanation from evolutionary theory: the evolutionary fitness advantage from large material gains saturates (especially in a hunter-gatherer environment). Successfully hunting a rabbit will keep me from starving for a day; but if I successfully hunt a mammoth, I can't keep the meat long enough for it to feed me for years. The best I can do is feed everyone in the village for a few days, hoping they remember this later when my hunting is less successful, and do a bunch of extra work to make jerky with the rest. The evolutionary advantage is sub-linear in the kilograms of raw meat. In more recent agricultural societies, rich and powerful men like Ramses II, who had O(100) children, needed a lot more than 50 times the average resources of men in their society to achieve that outcome (and of course that evolutionary strategy isn't possible for women). Even today, if I'm unlucky enough to get pancreatic cancer, it doesn't matter how rich I am: all that money isn't going to save me, even if I'm as rich as Steve Jobs.
Similarly, on the downside, from a personal evolutionary fitness point of view, saturation also makes sense, since there is a limit to how bad things can get: once I, my family, and everyone else in the tribe related to me are all dead, it's game over, and my personal evolutionary fitness doesn't really care whether everyone else in the region also died, or not.
So it seems to me that at least the first prospect-theory diagram above may be an example of humans being aligned with evolution's utility function.
I don't have a good evolutionary explanation for the second diagram, unless it's a mechanism to compensate for some psychological or statistical bias in how hunter-gatherers obtain information about and estimate risks, and/or how that compares to modern mathematical risk measures like probabilities and percentages.
We want our value-learner AI to learn the same preference order over outcomes as humans, which requires it to find (or at least learn to act according to) a utility function as close as possible to some aggregate of ours (insofar as humans actually have utility functions rather than a collection of cognitive biases), up to an arbitrary monotonically-increasing mapping. We also want its preference order over probability distributions of outcomes to match ours, which is a stronger requirement: its utility function must match ours up to an increasing affine transformation (i.e. a scale and a shift). So, once it has made good progress on its value learning, its utility function ought to make a lot of sense to us.
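The distinction between the two invariance classes can be made concrete with a toy calculation (the utilities and lotteries here are illustrative inventions, not from the post): any increasing transform preserves the preference order over outcomes, but only an increasing affine transform preserves the preference order over probability distributions of outcomes.

```python
# Toy example: preference order over *outcomes* survives any increasing
# transform of the utility function, but preference order over *lotteries*
# (probability distributions over outcomes) only survives affine transforms.

lottery_a = [(1.0, 1.0)]               # certain outcome with utility 1
lottery_b = [(0.5, 0.0), (0.5, 2.0)]   # 50/50 between utilities 0 and 2

def expected(lottery, f=lambda u: u):
    """Expected transformed utility of a lottery of (probability, utility) pairs."""
    return sum(p * f(u) for p, u in lottery)

affine = lambda u: 2 * u + 5  # increasing affine: preserves lottery ranking
cubic = lambda u: u ** 3      # increasing but non-affine: can change it

# Under the raw utilities, the two lotteries are exactly tied:
assert expected(lottery_a) == expected(lottery_b) == 1.0
# An affine transform keeps them tied:
assert expected(lottery_a, affine) == expected(lottery_b, affine) == 7.0
# A non-affine monotone transform breaks the tie (the implied risk attitude changes):
assert expected(lottery_a, cubic) == 1.0
assert expected(lottery_b, cubic) == 4.0
```

Both transforms leave the ranking of the individual outcomes (0 < 1 < 2) untouched; only the non-affine one reorders the lotteries, which is why matching preferences over distributions pins the learned utility function down much more tightly.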
The sequence on Infra-Bayesianism motivates the min (a.k.a. Murphy) part of its argmax min by wanting to establish lower bounds on utility — that's a valid viewpoint. My own interest in Infra-Bayesianism comes from a different motivation: Murphy's min encodes directly into Infra-Bayesian decision making the generally true, inter-related facts that 1) for an optimizer, uncertainty about the true world model injects noise into the optimization process, which almost always makes the outcome worse; 2) the optimizer's curse usually means you end up exploring outcomes whose true utility you had overestimated, so your regret is generally higher than you expected; and 3) most everyday environments and situations are already highly optimized, so random perturbations of their state almost invariably make things worse. All of which justifies pessimism and conservatism.
The problem with this argument is that it's only true when the utility of the current state is higher than the utility of the maximum-entropy equilibrium state of the environment (the state that increasing randomization tends to move it towards, by the law of large numbers). In everyday situations this is almost always the case: making random changes to a human body or to a city will almost invariably make things worse, for example. In most physical environments, randomizing them sufficiently (e.g. by raining meteorites on them, or whatever) will tend to reduce their utility to that of a blasted wasteland (the surface of the moon, for example, has pretty much reached equilibrium under randomization-by-meteorites, and has very low utility). However, it's a general feature of human utility functions that there can be states worse than the maximum-entropy equilibrium. If your environment is a 5-choice multiple-choice test whose utility is the score, the entropic equilibrium is random guessing, which scores 20%, and there are choose-wrong-answer policies that score less than that, all the way down to 0% -- and partially randomizing away from one of those policies will make its utility increase towards 20%. Similarly, consider a field of anti-personnel mines left over from a war: as an array of death and dismemberment waiting to happen, randomizing it with meteorite impacts will clear some mines and improve its utility, since it starts off actually worse than a blasted wasteland. Or, if a very smart GAI working on alignment research informed you that it had devised an acausal attack that would convert all permanent hellworlds anywhere in the multiverse into blasted wastelands, your initial assumption would probably be that doing so would be a good thing (modulo questions about whether it was pulling your leg, or whether the inhabitants of the hellworlds would consider them hellworlds).
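The multiple-choice example above can be checked with a few lines of arithmetic (the mixing-policy framing is my own way of formalizing "partially randomizing away from a worst policy"):

```python
# Toy calculation: with 5 options, uniform random guessing (the entropic
# equilibrium) scores 20% in expectation, the worst deterministic policy
# scores 0%, and mixing the worst policy with randomness pulls the expected
# score *up* toward 20% -- randomization improves states below equilibrium.

N_OPTIONS = 5

def expected_score(p_random):
    """Expected score of a policy that picks an always-wrong answer with
    probability (1 - p_random) and guesses uniformly with probability p_random."""
    return (1 - p_random) * 0.0 + p_random * (1 / N_OPTIONS)

assert expected_score(0.0) == 0.0   # pure worst policy
assert expected_score(1.0) == 0.2   # pure random guessing: the entropic equilibrium
assert expected_score(0.5) == 0.1   # partial randomization already improves on the worst policy

# More randomization never hurts when you start below equilibrium:
scores = [expected_score(k / 10) for k in range(11)]
assert scores == sorted(scores)
```

This is exactly the regime where "Murphy makes everything worse" fails: starting below the equilibrium, injected noise monotonically raises the expected utility toward it.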
In general, hellworlds (or at least local hell-landscapes) like this are rare — they have a negligible probability of arising by chance, so creating one requires work by an unaligned optimizer. So currently, with humans the strongest optimizers on the planet, they usually arise only in adversarial situations such as wars between groups of humans ("War is Hell", as the saying goes). However, Infra-Bayesianism has exactly the wrong intuitions about any hell-environment whose utility is currently lower than that of the entropic equilibrium. If an environment has been carefully optimized by a powerful, very-non-aligned optimizer so as to maximize human suffering, then random sabotage such as throwing monkey wrenches in the works or assassinating the diabolical mastermind is actually very likely to improve things (from a human point of view), at least somewhat. Infra-Bayesianism predicts otherwise. I think basing your GAI's decision theory on a system that gives exactly the wrong intuitions about hellworlds is likely to be extremely dangerous.
The solution to this would be what one might call Meso-Bayesianism -- renormalize your utility scores so that the utility of the maximum-entropy state of the environment is by definition zero, and then assume that Murphy minimizes the absolute value of the utility towards the equilibrium utility of zero, not towards a hellworld. (I'm not enough of a pure mathematician to have any idea what this modification does to the network of proofs, other than making the utility renormalization part of a Meso-Bayesian update more complicated.) Then your decision theory understands that any unaligned optimizer trying to create a hellworld is also fighting Murphy, and when fighting them on their home turf Murphy is your ally, since "it's easier to destroy than create" is also true of hellworlds. [Despite the usual formulation of Murphy's law, I actually think the name 'Murphy' suits this particular metaphysical force better -- Infra-Bayesianism's original 'Murphy' might have been better named 'Satan', since it is willing to go to any length to create a hellworld, hobbled only by some initially-unknown physical laws.]
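A minimal sketch of the proposed change, under my own simplifying assumptions (a hypothesis set represented as a flat list of candidate utilities, a single known equilibrium utility; none of this is from the Infra-Bayesianism sequence):

```python
# Hypothetical sketch contrasting the Infra-Bayesian worst case ("Murphy
# minimizes utility outright") with the proposed Meso-Bayesian worst case
# ("Murphy drags utility toward the entropic equilibrium, renormalized to 0").

def infra_worst_case(utilities):
    """Infra-Bayesian Murphy: assume the hypothesis with the lowest utility."""
    return min(utilities)

def meso_worst_case(utilities, equilibrium_utility):
    """Meso-Bayesian Murphy: renormalize so the max-entropy equilibrium sits
    at 0, then assume the hypothesis whose utility is *closest* to it."""
    renormalized = [u - equilibrium_utility for u in utilities]
    return min(renormalized, key=abs) + equilibrium_utility

# A hellworld scenario: every candidate utility sits *below* the blasted-
# wasteland equilibrium (here, equilibrium utility = -10).
hell_hypotheses = [-100.0, -50.0, -20.0]
assert infra_worst_case(hell_hypotheses) == -100.0        # Murphy digs deeper into hell
assert meso_worst_case(hell_hypotheses, -10.0) == -20.0   # entropy drags toward the wasteland
```

On states above the equilibrium the two rules agree that uncertainty hurts; they diverge only below it, which is exactly the hellworld regime where the original min gives the wrong intuition.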
Having since started learning about Infra-Bayesianism, my initial impression is that it's a well-structured mathematical formalism for doing exactly the kind of thing I sketched above to break the Optimizer's Curse: taking the most pessimistic view of the utility over the Knightian uncertainty in your hypothesis set.
For my second paragraph above: per a blog post out today, this turns out to be not only feasible, but something OpenAI has experimented with, and they have now open-sourced the technology for doing it:
OpenAI was only looking at explaining single neurons, so combining their approach with the original paper's sparse-probing technique for superpositions seems like the obvious next step.
My impression is that he's trying to do GOFAI with fully differentiable neural networks. I'm also not sure he's describing a GAI — I think he's starting by aiming for parity with the capabilities of a typical mammal, not human-level, and that's why he uses self-driving cars as an example.
Personally I think a move towards GOFAI-like ideas is a good intuition, but that insisting on keeping things fully differentiable is too constraining. I believe that at some level, we are going to need to move away from doing everything with gradient descent, and use something more like approximate Bayesianism, or at least RL.
I also think he's underestimating the influence of genetics on mammalian mental capabilities. He talks about the step of babies learning that the world is 3D rather than 2D — I think it's very plausible that adaptations for processing sensory data from a 3D rather than a 2D world are already encoded in our genome, brain structure, and physiology in many places.
If this is going to be a GAI architecture, then I think he's massively underthinking alignment.
I'm not very scared of any AGI that isn't capable of being a scientist — it seems unlikely to be able to go FOOM. In order to do that, it needs to:
I think it's going to be hard to do all of these things well if its world models aren't fairly modular and separable from the rest of its mental architecture.
One possibility that I find plausible as a path to AGI is if we design something like a Language Model Cognitive Architecture (LMCA) along the lines of AutoGPT, and require that its world model actually be some explicit combination of human natural language, mathematical equations, and executable code that might be fairly interpretable to humans. Then the only portions of its world model that are very hard to inspect are those embedded in the LLM component.
Up to a certain size, LLMs are going to be a commodity, and academic/amateur/open-source versions will be available. Currently that scale is around 7B-20B parameters, and it will likely increase soon. However, GPT-4 supposedly cost >$100m to create, of which I've seen estimates that the raw foundation-model training cost was O($40m) [which admittedly will decrease with Moore's Law for GPUs], and there is also a significant cost for filtering the training data, doing instruction-following and safety RLHF, and so forth. It's not clear to me why any organization able to get its hands on that much money, and the expertise necessary to spend it, would open-source the result, at least while the result remains near-cutting-edge (leaks are of course possible, as Meta already managed to prove). So I suspect models as capable as GPT-4 will not be open-sourced any time soon (where things are moving fast enough that 'soon' means 'this year, or maybe next'). But there are quite a lot of companies/governments that can devote >$100m to an important problem, so the current situation where only 3-or-4 companies have LLMs with capabilities comparable to GPT-4 isn't likely to last very long.
As for alignment, it's very unlikely that all sources of commodity LLMs will do an excellent job of persuading them not to tell you how to hotwire cars, not to roleplay AI waifus, or not to simulate 4Chan, and some will just release a foundation model with no such training. So we can expect 'unaligned' open-source LLMs. However, none of those are remotely close to civilization-ending problems. The question then is whether Language Model Cognitive Architectures (LMCAs) along the lines of AutoGPT can be made to run effectively on a suitably fine-tuned LLM of less-than-GPT-4 complexity and still increase their capabilities to AGI level, or whether that (if possible at all) requires an LLM of GPT-4 scale or larger. AutoGPT isn't currently that capable when run with GPT-3.5, but generally, if a foundation LLM shows some signs of capability at a task, suitable fine-tuning can greatly increase the reliability with which it performs that task.
I take your point that the way an Infra-Bayesian system makes decisions isn't the same as a human's — it presumably doesn't share our cognitive biases, and the pessimism element 'Murphy' in it seems stronger than in most humans. I normally assume that if there's something I don't understand about the environment that's injecting noise into the outcomes of my actions, the noise-related parts of the results aren't going to be well-optimized: they'll be worse than I could have achieved with full understanding, but even leaving things to chance I may sometimes get some good luck along with the bad. I don't generally assume that everything I can't control will have literally the worst possible outcome. So I guess in Infra-Bayesian terms I'm assuming that Murphy is somewhat constrained by laws that I'm not yet aware of, and may never be aware of.
My take on Murphy is that it's a systematization of the force of entropy trying to revert the environment to a thermodynamic equilibrium state, combined with the common fact that the utility of that equilibrium state is usually pretty low. One of the flaws I see in Infra-Bayesianism is that there are sometimes (hard-to-reach but physically possible) states whose utility to me is even lower than the thermodynamic equilibrium (such as a policy that scores less than 20% on a 5-option multiple-choice quiz, so does worse than random guessing, or a minefield left over after a war, which is actually worse than a blasted wasteland), where increasing entropy would actually improve things. In a hellworld, randomly throwing monkey wrenches in the gears is a moderately effective strategy. In those unusual cases Infra-Bayesianism's Murphy no longer aligns with the actual effects of entropy/Knightian uncertainty.