Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.
I was already thinking the AI would be risk averse over whole policies and the aggregate value of their future, not locally/greedily/separately for individual actions and individual unaggregated rewards.
Thanks! This makes sense.
I agree model-free RL wouldn't necessarily inherit the risk aversion, although I'd guess there's still a decent chance it would, because that seems like the most natural and simple way to generalize the structure of the rewards.
Why would hardcoded model-based RL be likely to self-modify or build successors this way, though? To deter or prevent threats from being made in the first place, or even from being followed through on? But does this actually deter or prevent our threats when the AI evaluates the plan ahead of time, with its original preferences? We'd still want to shut it and any successors down if we found out (whenever we do find out, or it starts trying to take over), and it should be averse to that increased risk ahead of time when evaluating the plan.
Risk aversion wouldn't help humanity much if we build unaligned AGI anyhow. The least risky plans from the AI's perspective are still gonna be bad for humans.
I think there are (at least) two ways to reduce this risk:
And you could fix this to be insensitive to butterfly effects, by comparing quantile functions as random variables instead.
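A toy sketch of the quantile-function idea (the distributions and sample sizes are made up for illustration): comparing outcomes under an arbitrary state-by-state pairing is sensitive to how states happen to be matched up, which is what makes it butterfly-effect-sensitive, while comparing sorted outcomes at matching quantiles ignores the pairing entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcome samples under two policies with identical outcome
# distributions. The per-sample pairing is arbitrary, analogous to
# butterfly effects reshuffling which world gets which outcome.
a = rng.normal(10.0, 2.0, size=100_000)
b = rng.normal(10.0, 2.0, size=100_000)

# State-wise comparison: the differences a - b depend entirely on the
# arbitrary pairing, so they're spread out even though the two outcome
# distributions are the same.
statewise_diff = a - b

# Quantile-function comparison: compare outcomes at matching quantiles,
# treating the quantile difference as the random variable of interest.
ps = np.linspace(0.005, 0.995, 199)
quantile_diff = np.quantile(a, ps) - np.quantile(b, ps)

print(np.std(statewise_diff))         # large: sensitive to pairing
print(np.max(np.abs(quantile_diff)))  # near zero: same distribution
```

Any risk-averse functional applied to `quantile_diff` then only sees the distributions, not the pairing.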
Could we be sure that a bare utility maximizer doesn't modify itself into a mugging-proof version? I think we can. Such modification drastically decreases expected utility.
Maybe for positive muggings, when the mugger is offering to make the world much better than otherwise. But it might self-modify to not give in to threats, in order to discourage threats.
You could use a bounded utility function, with sufficiently quickly decreasing marginal returns. Or use difference-making risk aversion or difference-making ambiguity aversion.
Maybe also just aversion to Pascal's mugging itself, but then the utility maximizer needs to be good enough at recognizing Pascal's muggings.
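A minimal numeric sketch of the bounded-utility option (all the probabilities, payoffs, and the utility function are hypothetical): with linear utility, the tiny probability of an astronomical payoff dominates the expected-value comparison, while a bounded utility with quickly diminishing marginal returns caps the payoff's contribution.

```python
import math

# Hypothetical Pascal's mugging: probability p of an astronomically large
# payoff X if you pay the mugger c, versus refusing and keeping baseline.
p = 1e-20
X = 1e30
c = 100.0
baseline = 1000.0

def linear_utility(x):
    return x

def bounded_utility(x, scale=1000.0):
    # Bounded above by `scale`, with quickly diminishing marginal returns.
    return scale * (1 - math.exp(-x / scale))

def expected_utilities(u):
    pay = p * u(baseline - c + X) + (1 - p) * u(baseline - c)
    refuse = u(baseline)
    return pay, refuse

# Linear utility: paying wins, since p * X dominates everything else.
pay, refuse = expected_utilities(linear_utility)
print(pay > refuse)  # True

# Bounded utility: the huge payoff is capped near `scale`, so paying loses.
pay, refuse = expected_utilities(bounded_utility)
print(pay > refuse)  # False
```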
FWIW, I'm not sure if that's true relative to the average person, and I'd guess non-consequentialist philosophers are more averse to biting bullets than the average EA and maybe rationalist.
I think your scenario only illustrates a problem with outer alignment (picking the right objective function), and I think it's possible to state an objective that, if it could be implemented sufficiently accurately and we could guarantee the AI followed it (inner alignment), would not result in a dystopia like this. If you think the model would do well at inner alignment once we fixed the problems with outer alignment, then this seems like a very promising direction, and it would be worth pointing out and emphasizing.
I think the right direction is modelling how humans now (before the action is taken), without coercion or manipulation, would judge the future outcome if properly informed about its contents, especially how humans and other moral patients are affected (including what happened along the way, e.g. violations of consent and killing). I don't think you need the coherency of coherent extrapolated volition, because you can still capture people finding this future substantially worse than some possible futures, including the ones where the AI doesn't intervene, by some important lights, and just set a difference-making ambiguity-averse objective. Or, maybe require it not to do substantially worse by any important light, if that's feasible: we would allow the model to flexibly represent the lights by which humans judge outcomes, where doing badly by one means doing badly overall. Then it would be incentivized to focus on acts that seem robustly better to humans.
I think an AI that actually followed such an objective properly would not, by the lights of whichever humans whose judgements it's predicting, increase the risk of dystopia through its actions (although building it may be s-risky, in case of risks of minimization). Maybe it would cause us to slow moral progress and lock in the status quo, though. If the AI is smart enough, it can understand "how humans now (at the time before taking the action), without coercion or manipulation, would judge the future outcome if properly informed about its contents", but it can still be hard to point the AI at that even if it did understand it.
Another approach I can imagine is to split up the rewards into periods, discount them temporally, check for approval and disapproval signals in each period, and make it very costly relative to anything else to miss one approval or receive a disapproval. I describe this more here and here. As JBlack pointed out in the comments of the second post, there's incentive to hack the signal. However, as long as attempts to do so are risky enough by the lights of the AI and the AI is sufficiently averse to losing approval or getting disapproval and the risk of either is high enough, it wouldn't do it. And of course, there's still the problem of inner alignment; maybe it doesn't even end up caring about approval/disapproval in the way our objective says it should out of distribution.
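A toy sketch of the kind of objective described above (the period structure, discount rate, and penalty size are all hypothetical): rewards are split into periods and discounted, and any period with a missed approval or an explicit disapproval incurs a penalty chosen to swamp any achievable sum of ordinary rewards.

```python
GAMMA = 0.99
PENALTY = 1e9  # chosen to dominate any achievable sum of ordinary rewards

def objective(periods):
    """periods: list of (reward, approved) per period, where approved is
    True (approval received), False (disapproval), or None (signal missed)."""
    total = 0.0
    for t, (reward, approved) in enumerate(periods):
        total += (GAMMA ** t) * reward
        if approved is not True:  # missed approval or explicit disapproval
            total -= (GAMMA ** t) * PENALTY
    return total

# A plan that hacks the signal for huge reward but incurs one disapproval
# still scores far below a modest, consistently approved plan.
honest = objective([(1.0, True)] * 10)
hacked = objective([(1000.0, True)] * 9 + [(1000.0, False)])
print(honest > hacked)  # True
```

In practice the AI faces a distribution over plans, so the relevant comparison is in expectation: the penalty only deters hacking if the AI's estimated risk of triggering it is high enough, which is the caveat about risky attempts above.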
I would say it seems like it's almost there, but it also seems to me to already have some fluid intelligence, and that might be why it seems close. If it doesn't have fluid intelligence, then my intuition that it's close may not be very reliable.
(Your reply is in response to a comment I deleted, because I thought it was basically a duplicate of this one, but I'd be happy if you'd leave your reply up, so we can continue the conversation.)
See here, starting from "consider a scheme like the following". In short: should be possible, but seems non-trivially difficult.
That seems like a high bar to me for testing for any fluid intelligence, though, and the vast majority of humans would do about as badly or worse (but possibly because of far worse crystallized intelligence). Similarly, in your post, "No scientific breakthroughs, no economy-upturning startup pitches, certainly no mind-hacking memes."
I would say to look at it based on definitions and existing tests of fluid intelligence. These are about finding patterns and relationships between unfamiliar objects and any possible rules relating them, applying those rules and/or inference rules to the identified patterns and relationships, and doing so more or less efficiently. More fluid intelligence means noticing patterns earlier and taking more useful steps and fewer useless ones.
Some ideas for questions:
My impression is that there are few cross-applicable techniques in these areas, and the ones that exist often don't get you very far toward solving problems. To do NP-hardness proofs, you need to identify patterns and relationships between two problems. The idea of using "gadgets" is way too general and hides all of the hard work, which is finding the right gadget to use and how to use it. EDIT: For IMO and Putnam problems, there are some common tools, too, but if simple pattern matching with those were all it took, math undergrads would generally be good at them, and they're not, so it probably does take considerable fluid intelligence.
I guess one possibility is that an LLM can try a huge number of steps and combinations of steps before generating the next token, possibly looking ahead multiple steps internally before picking one. Maybe it could solve hard problems this way without fluid intelligence.
I think many EAs/rationalists shouldn't find this to be worse for humans than life today on the views they apparently endorse, because each human looks better off under standard approaches to intrapersonal aggregation: they get more pleasure, less suffering, more preference satisfaction (or we can imagine some kind of manipulation to achieve this), but at the cost of some important frustrated preferences.
I see a few comments here on fortified foods. I think the vitamin D and iron are usually in less bioavailable forms (D2 and non-heme iron) in fortified plant-based foods than in animal products, and I don't know whether the % daily value on labels accounts for that. I take a multivitamin with apparently 100% of the daily value for both, and it wasn't enough based on my bloodwork and lightheadedness/blackouts (few of the other vegans I know got lightheaded; I had been giving blood every 2 months, but even after I stopped, the multivitamin alone didn't seem to be enough), so I take separate supplements for each on top of the multivitamin. I was also eating nuts, legumes, and kale or spinach for iron most days.
Looking at my soy milk (Silk), per cup, the label says it has only 2 micrograms of vitamin D (D2), or 10% of the daily value. Some of my other vegan dairy products have no vitamin D. There's no vitamin D in Beyond Burgers, but there is a decent amount of (non-heme) iron per patty (5.5 mg, 31% DV). Similarly for my Yves veggie chicken. I'd be surprised if most vegans not taking supplements with vitamin D even reach 100% DV according to the labels (whether they account for bioavailability or not) through diet alone.
I live in Canada, though, so vitamin D is harder to get from the sun during the winter.