I have never since 1996 thought that it would be hard to get superintelligences to accurately model reality with respect to problems as simple as "predict what a human will thumbs-up or thumbs-down". The theoretical distinction between producing epistemic rationality (theoretically straightforward) and shaping preference (theoretically hard) is present in my mind at every moment that I am talking about these issues; it is to me a central divide of my ontology.
If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
The argument we are trying to explain has an additional step that you're missing. You think that we are pointing to the hidden complexity of wishes in order to establish in one step that it would therefore be hard to get an AI to output a correct wish shape: the wishes are complex, so it would be difficult to get an AI to predict them. This is not what we are trying to say. We are trying to say that because wishes have a lot of hidden complexity, the thing you are trying to get into the AI's preferences has a lot of hidden complexity. This makes the already nonstraightforward and shaky problem of getting a thing into the AI's preferences harder and more dangerous than if we were just trying to get a single information-theoretic bit in there. Getting a shape into the AI's preferences is different from getting it into the AI's predictive model. MIRI is always, in every instance, talking about the first thing and not the second.
You obviously need to get a thing into the AI at all in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine: the low-friction ball-bearings are not the main problem; they are the kind of thing it is much easier to make progress on than the main problem. Even if, in fact, the ball-bearings would legitimately be part of the mechanism if you could build one! Making lots of progress on smoother, lower-friction ball-bearings is even so not the sort of thing that should make you much more hopeful about the perpetual motion machine. It is on the wrong side of a theoretical divide between what is straightforward and what is not.
You will probably protest that we phrased our argument badly relative to what you, from your perspective, could possibly have been expected to hear. If so, this is not surprising, because explaining things is very hard, especially when everyone in the audience comes in with a different set of preconceptions and a different internal language about this nonstandardized topic. But mostly: explaining this thing is hard, and I tried lots of different angles on getting the idea across.
In modern times, and earlier, it is of course very hard for ML folk to get their AI to make completely accurate predictions about human behavior. They have to work very hard and put a lot of sweat into getting more accurate predictions out! When we try to say that this is on the shallow end of a shallow-deep theoretical divide (corresponding to Hume's Razor) it often sounds to them like their hard work is being devalued and we could not possibly understand how hard it is to get an AI to make good predictions.
Now that GPT-4 is making surprisingly good predictions, they feel they have learned something very surprising and shocking! They cannot possibly hear our words when we say that this is still on the shallow end of a shallow-deep theoretical divide! They think we are refusing to come to grips with this surprising, shocking thing, and that it surely ought to overturn all of our old theories; which were, yes, phrased and taught in a time before GPT-4 was around, and therefore do not in fact carefully emphasize at every point of teaching how, in principle, a superintelligence would of course have no trouble predicting human text outputs. We did not expect GPT-4 to happen; intermediate trajectories are, in fact, harder to predict than endpoints; so we did not carefully phrase all our explanations in a way that would be hard to misinterpret after GPT-4 came around.
But if you had asked us back then whether a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. You could then have asked us in a shocked tone how this could possibly square with the notion of "the hidden complexity of wishes", and we could have explained that part in advance. Alas, nobody actually predicted GPT-4, so we do not have that advance disclaimer written down in that format. But it is not a case where we are just failing to process the collision between two parts of our belief system; it actually remains quite straightforward theoretically. I wish that all of these past conversations were archived in a common place, so that I could search them and show you many pieces of text that talk about this critical divide between prediction and preference (as I would now term it) and about how I did in fact expect superintelligences to be able to predict things!
There's perhaps more detail in Project Lawful and in some nearby stories ("for no laid course prepare", "aviation is the most dangerous routine activity").
Have you ever seen or even heard of a person who is obese who doesn't eat hyperpalatable foods? (That is, someone who eats only natural-tasting, unprocessed, "healthy" foods.)
Tried this for many years. Paleo diet; eating mainly broccoli and turkey; trying to get most of my calories from giant salads. Nothing.
I am not - $150K is as much as I care to stake at my present wealth level - and while I refunded your payment, I was charged a $44.90 fee on the original transmission which was not then refunded to me.
Though I disagree with @RatsWrongAboutUAP (see this tweet) and took the other side of the bet, I say a word of praise for RatsWrong for following exactly the proper procedure to make the point they wanted to make, and for communicating that they really actually think we're wrong here. Object-level disagreement, meta-level high-five.
My $150K against your $1K if you're still up for it at 150:1. Paypal to yudkowsky@gmail.com with "UFO bet" in subject or text, please include counterparty payment info if it's not "email the address which sent me that payment".
Key qualifier: This applies only to UFOs spotted before July 19th, 2023, rather than applying to eg future UFOs generated by secret AI projects which were not putatively flying around and spotted before July 19th, 2023.
ADDED: $150K is as much as I care to stake at my current wealth level, to rise to this bettor's challenge and make this point; I'm not taking on further bets except at substantially less extreme odds.
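For anyone parsing the stakes, here is a quick back-of-the-envelope check (my own arithmetic, not part of the original bet posts) on what risking $150K to win $1K at 150:1 implies about the probability being assigned to the UFO claim:

```python
# Break-even check for the side risking $150K to win $1K at 150:1 odds.
# Illustrative arithmetic only; the stakes are the ones quoted above.
stake = 150_000   # lost if the UFO claim is confirmed
payout = 1_000    # won if it is not

# Expected value is non-negative only if P(claim confirmed) <= payout / (stake + payout).
break_even_p = payout / (stake + payout)
print(f"Break-even probability: {break_even_p:.4%}")  # roughly 0.66%, i.e. about 1 in 151
```

In other words, taking the $150K side only makes sense in expectation if you put the probability of the claim being confirmed at well under one percent.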
TBC, I definitely agree that there's some basic structural issue here which I don't know how to resolve. I was trying to describe properties I thought the solution needed to have, which ruled out some structural proposals I saw as naive; not saying that I had a good first-principles way to arrive at that solution.
AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque. LLMs possibly ending up at the center is a small update in favor of alignment success, because it means we might (through some clever sleight; this part is not trivial) be able to have humanese sentences play an inextricable role at the center of thought (hence MIRI's early interest in the Visible Thoughts Project).
The part where LLMs are to predict English answers to some English questions about values, and show common sense relative to their linguistic shadow of the environment as it was presented to them by humans within an Internet corpus, is not actually very much hope, because a sane approach doesn't involve trying to promote an LLM's predictive model of human discourse about morality to be in charge of a superintelligence's dominion over the galaxy. What you would like to promote to values are concepts like "corrigibility", e.g. "low impact" or "soft optimization", which aren't part of everyday human life and aren't in the training set, because humans do not have those values.