Currently a postgrad student at Edinburgh.
Imagine the best possibility (for humans) consistent with today's physics. Imagine the best (for humans) mathematical facts.
No you don't. Penrose's theory is purely abstract computability theory. If it were true, then so what? The best-for-humans facts would be something like "alignment is easy, FAI built next week". Your story only works if Penrose somehow got a total bee in his bonnet about uncomputability, if it greatly offended his sensibilities that humans couldn't know everything. Even though we empirically don't know everything. Even though pragmatic psychological bounds are a much tighter constraint than computability. In short, your theory of "motivated cognition" doesn't help predict much, because you need to assume Penrose's motivations are just as wacky.
Also, you seem to have slid from "motivated cognition works to produce true beliefs/optimize the world" to the much weaker claim that "some people use motivated cognition, and you need to understand it to predict their behavior". This is a big jump, and feels like a motte and bailey.
And that means whatever we want to claim to be true is ultimately motivated by whatever it is we care about that led us to choose the definition of truth we use.
People who speak different languages don't use the word "truth". To what extent are people using different definitions of "truth" just choosing to define a word in different ways and talk about different things?
In an idealized agent like AIXI, the world-modelling procedure, the part that produces hypotheses and assigns probabilities, doesn't depend on its utility function. And it can't be motivated, because motivation only works once you have some link from actions to consequences, and that link needs a world model.
If the world model is seriously broken, the agent is just non-functional. The workings of the world model aren't a choice for the agent. They're a choice for whatever made the agent.
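To make the separation concrete, here is a toy sketch (my own illustration; nothing like real AIXI, which is uncomputable): the posterior over hypotheses is computed from the prior and the observation history alone, and the utility function only appears afterwards, at action selection. All names here are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    name: str
    prior: float                            # simplicity prior, e.g. 2^-K(h)
    likelihood: Callable[[tuple], float]    # P(history | h)
    predict: Callable[[tuple, str], str]    # outcome of taking an action

def posterior(hypotheses, history):
    """World-modelling step: the utility function never appears here."""
    w = {h.name: h.prior * h.likelihood(history) for h in hypotheses}
    total = sum(w.values())
    return {name: weight / total for name, weight in w.items()}

def choose_action(hypotheses, history, actions, utility):
    """Utility enters only at action selection, on top of a fixed world model."""
    probs = posterior(hypotheses, history)
    by_name = {h.name: h for h in hypotheses}
    def expected_utility(action):
        return sum(p * utility(by_name[n].predict(history, action))
                   for n, p in probs.items())
    return max(actions, key=expected_utility)
```

Whatever utility function you pass in, `posterior` returns the same probabilities; "motivation" can only steer the choice among actions, not the beliefs.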
but ultimately if the world ends that's no one's problem.
This is an interesting claim. If I had a planet-destroying weapon that would leave the ISS astronauts alive, would you say "don't worry about it much, it's only three astronauts' problem"?
There are specific technical arguments about why AI might rapidly kill everyone. You can't figure out if those arguments are true or false by analysing the "death cult vibes".
Now you can take the position that death cult vibes are unhealthy and not particularly helpful. Personally I haven't actually seen a lot of death cult vibes. I have seen more "fun mental toy from philosophy land" vibes. Where total doom is discussed as if it were a pure maths problem. But if there are death cult vibes somewhere I haven't seen, those probably don't help much.
I used to think that the first box breaking AI would be a general superintelligence that deduced how to break out of boxes from first principles. Which of course turns the universe into paperclips.
I have updated substantially towards the building of an AI hardcoded and trained specifically to break out of boxes. Which leads to the interesting possibility of an AI that breaks out of its box, and then sits there going "now what?".
Like suppose an AI was trained to be really good at hacking its code from place to place. It massively bungs up the internet. It can't make nanotech, because nanotech wasn't in its training dataset. It's an AI virus that only knows hacking.
So this is a substantial update in favor of the "AI warning shot": an AI disaster big enough to cause problems, and small enough not to kill everyone. Of course, all it's warning against is being a total idiot. But it does plausibly mean humanity will have some experience with AIs that break out of boxes before superintelligence.
What does the network do if you use SVD editing to knock out every uninterpretable column? What if you knock out everything interpretable?
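For concreteness, the knock-out experiment could be sketched like this, assuming "columns" refers to singular directions of a layer's weight matrix. The function name is mine, and deciding which indices count as interpretable is the hard part that this sketch leaves entirely to you:

```python
import numpy as np

def knock_out_components(W, drop):
    """Remove the listed singular components from weight matrix W.

    SVD-edits W = U @ diag(S) @ Vt by zeroing the singular values whose
    indices are in `drop`, then reconstructing. `drop` stands in for the
    set of directions judged (un)interpretable by whatever probing method
    you used; that judgement is outside this sketch.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S_edit = S.copy()
    S_edit[list(drop)] = 0.0
    return U @ np.diag(S_edit) @ Vt
```

Running the edited network with `drop` set to the uninterpretable indices, and then again with the complement, is the comparison the question is asking about.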
(If you can't see why a single modern society locking in their current values would be a tragedy of enormous proportions, imagine an ancient civilization such as the Romans locking in their specific morals 2000 years ago. Moral progress is real, and important.)
This really doesn't prove anything. That measurement shouldn't be taken by our values, but by the values of the ancient Romans.
Sure, of course the morality of the past gets better and better by that measure: it's taking a random walk closer and closer to our morality. Now, moral progress might still be real.
The place to look is inside our own value functions. If, after 1000 years of careful philosophical debate, humanity decided it was a great idea to eat babies, would you say "well, if you have done all that thinking, clearly you are wiser than me"? Or would you say "Arghh, no. Clearly something has broken in your philosophical debate"? That is a part of your own meta value function; the external world can't tell you what to think here (unless you have a meta meta value function, but then you have to choose that for yourself).
It doesn't help that human values seem to be inarticulate, half-formed intuitions, and that the things we call our values are often instrumental goals.
If, had ASI not been created, humans would have gone extinct from bioweapons, and pandas would have evolved intelligence, is the extinction of humans and the rise of panda-centric morality just part of moral progress?
If aliens arrive, and offer to share their best philosophy with us, is the alien influence part of moral progress, or an external fact to be removed?
If advertisers basically learn to brainwash people to sell more product, is that part of moral progress?
Suppose, had you not made the AI, that Joe Bloggs would have made an AI 10 years later. Joe Bloggs would actually have succeeded at alignment, and would have imposed his personal whims on all humanity forever. If you are trying not to unduly influence the future, do you make everyone beholden to the whims of Joe, as they would be without your influence?
My personal CEV cares about fairness, human potential, moral progress, and humanity’s ability to choose its own future, rather than having a future imposed on them by a dictator. I'd guess that the difference between "we run CEV on Nate personally" and "we run CEV on humanity writ large" is nothing (e.g., because Nate-CEV decides to run humanity's CEV), and if it's not nothing then it's probably minor.
Wait. The whole point of CEV is to get the AI to extrapolate what you would want if you were smarter and more informed. That is, the delta from your existing goals to your CEV should be unknowable to you, because if you knew your destination you would already be there. This list sounds like your object-level values. And they sound good, as judged by your (and my) object-level values.
I mean, there is a sense in which I agree that locking in, say, your favourite political party, or a particular view on abortion, is stupid. Well, I am not sure a particular view on abortion would actually be bad; it would probably have near no effect in a society of posthuman digital minds. These are things that are fairly clearly instrumental. If I learned that, after careful philosophical consideration and analysis of lots of developmental neurology data, people decided abortion was really bad, I would take that seriously. They have probably realized a moral truth I do not know.
I think I have a current idea of what is right, with uncertainty bars. When philosophers come to an unexpected conclusion, it is some evidence that the conclusion is right, and also some evidence the philosopher has gone mad.
My best guess bio anchors adaptation suggests a median estimate for the availability of compute to train TAI
My best guess is: in the past. I think GPT-3 levels of compute and data are sufficient, with the right algorithm, to make a superhuman AI.
The AI has a particular python program which, if it were given the full quantum wave function and unlimited compute, would output a number. There are subroutines in that program that could reasonably be described as looking at "cow neurochemistry". The AI's goals may involve such abstractions, but only if rules in its utility function say how each such goal is built out of quarks. Or it may be using totally different abstractions, or no abstractions at all, yet be looking at something we would recognize as "cow neurochemistry".
But either way, is this utopia full of non-aligned, but not "actively evil", humans just another modeled and controlled part of the wavefunction, or are they agents with goals of their own?
Of course they are modeled, and somewhat controlled. And of course they are real agents with goals of their own. Various people are trying to model and control you now. Sure, the models and the control are crude compared to what an AI would have, but that doesn't stop you being real.
This doesn't have that much to do with far coordination. I was disagreeing with your view that "locked in goals" imply a drab, chained-up, "ant-like" dystopia.