I think the Internet has in fact been a prelude to the attitudes adaptive for the martial shifts, but mostly because the failure of e.g. social media to produce good discourse has revealed that a lot of naive implicit models about democratization being good are false. Democracy in fact turns out to be bad; giving people what they want turns out to be bad. I expect the elite class in democratic republics to get spitefully misanthropic because they are forced to live with the consequences of normal people's decisions in a way that e.g. Chinese elites aren't.
Of course, LLMs will help with cyber defense as well. But even if the offense-defense balance from AI favors defense, that won’t matter in the short term! As Bruce Schneier pointed out, the red team will take the lead.
Did he point that out? To be clear, I agree, and I would expect Schneier to agree because he's a smart dude, but I scanned this article several times, even did a full read-through, and I don't see where he says he expects offense to overtake defense in the short term.
This is in principle a thing that Nick Bostrom could have believed while writing Superintelligence, but the rest of the book makes that reading hard to square with Occam's Razor. It's possible he meant the difficulty of translating concepts into discrete program representations as the central problem, and whether we would be able to make use of such a representation as a noncentral one. (It's Bostrom, he's a pretty smart dude, this wouldn't surprise me, it might even be in the text somewhere but I'm not reading the whole thing again.) But even if that's the case, the central, consistently repeated version of the value loading problem in Bostrom 2014 is that it's simply not rigorously imaginable how you would get the relevant representations in the first place.
It's important to remember also that Bostrom's primary hypothesis in Superintelligence is that AGI will be produced by recursive self-improvement, such that it's genuinely not clear you will have a series of functional non-superintelligent AIs with usable representations before you have a superintelligent one. The book very much takes the EY "human level is a weird threshold to expect AI progress to stop at" thesis as the default.
Clearly! I'm a little reluctant to rephrase it until I have a version that I know conveys what I actually meant, but one that would be very semantically close to the original would be:
"- Contra Bostrom 2014 it is possible to get high quality, nuanced representations of concepts like "happiness" at training initialization. The problem of representing happiness and similar ideas in a computer will not be first solved by the world model of a superintelligent or otherwise incorrigible AI, as in the example Bostrom gives on page 147 in the 2017 paperback under the section “Malignant Failure Modes”: "But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged- out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that rep- resents this goal.""
Part of why I didn't write it that way in the first place is that it would have made it a lot bulkier than the other bullet points, so I trimmed it down.
I want to flag that thinking you have a representation that could be used in principle to do the right thing is not the same as believing it will "Just Work". If you do a naive RL process against neural embeddings or LLM evaluators you will definitely get bad results. I do not believe in "alignment by default" and push back on such things frequently whenever they're brought up. What has happened is that the problem has gone from "not clear how you would do this even in principle, basically literally impossible with current knowledge" to merely tricky.
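To make the "naive RL gets bad results" point concrete, here is a minimal toy sketch under assumptions of my own (the sentence-transformers model name, the candidate strings, and the cosine-similarity reward are all illustrative, not anyone's actual training setup): the embedding is a genuinely usable representation of "happiness", yet pointing an optimizer directly at the score it produces invites Goodharting.

```python
# Toy sketch, not anyone's actual training setup: the model name, candidate
# strings, and reward definition below are all illustrative assumptions.
# The point: an embedding-based evaluator of "happiness" is a usable
# representation (it ranks genuine descriptions sensibly), but an optimizer
# pointed directly at the score can still win with degenerate outputs.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Concept embedding standing in for "what we mean by happiness".
target = model.encode(["a person living a genuinely happy, flourishing life"])[0]

def reward(text: str) -> float:
    """Cosine similarity between the text's embedding and the target concept."""
    v = model.encode([text])[0]
    return float(np.dot(v, target) / (np.linalg.norm(v) * np.linalg.norm(target)))

candidates = [
    "She spent the afternoon gardening with her kids and felt content.",
    "He finished a project he cared about and celebrated with friends.",
    "happy happy happy joy joy maximum happiness happy happy",  # degenerate
]

for c in candidates:
    print(f"{reward(c):.3f}  {c}")

# Typically the keyword-stuffed string scores at or near the top, which is the
# Goodhart failure a naive RL loop against this evaluator would exploit.
# "Usable representation" != "safe to naively optimize".
```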
Let's think phrase by phrase and analyze myself in the third person.
First let's extract the two sentences for comparison:
JDP: Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent.
Bostrom: The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.
An argument from ethos: JDP is an extremely scrupulous author and would not plainly contradict himself in the same sentence. Therefore this is either a typo or my first interpretation is wrong somehow.
Context: JDP has clarified it is not a typo.
Modus Tollens: If "understand" means the same thing in both sentences, they would be in contradiction. Therefore "understand" must mean something different between them.
Context: After Bostrom's statement about understanding, he says that the AI's final goal is to make us happy, not to do what the programmers meant.
Association: The phrase "not to do what the programmers meant" is the only other thing that JDP's instance of the word "understand" could be bound to in the text given.
Context: JDP says "before they are superintelligent", which doesn't seem to have a clear referent in the Bostrom quote given. Whatever he's talking about must appear in the full passage, and I should probably look that up before commenting, and maybe point out that he hasn't given quite enough context in that bullet and may want to consider rephrasing it.
Reference: Ah I see, JDP has posted the full thing into this thread. I now see that the relevant section starts with:
But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”
Association: Bostrom uses the frame "understand" in the original text for the question from his imagined reader. This implies that JDP saying "AIs will probably understand what we mean" must be in relation to this question.
Modus Tollens: But wait, Bostrom already answers this question by saying the AI will understand but not care, and JDP quotes this, so if JDP meant the same thing Bostrom means he would be contradicting himself, which we assume he is not doing, therefore he must be interpreting this question differently.
Inference: JDP is probably answering the original hypothetical reader's question as "Why wouldn't the AI behave as though it understands? Or why wouldn't the AI's motivation system understand what we meant by the goal?"
Context: Bostrom answers (implicitly) that this is because the AI's epistemology is developed later than its motivation system. By the time the AI is in a position to understand this its goal slot is fixed.
Association: JDP says that subsequent developments have disproved this answer's validity. So JDP believes either that the goal slot will not be fixed at superintelligence or that the epistemology does not have to be developed later than the motivation system.
Modus Tollens: If JDP said that the goal slot will not be fixed at superintelligence, he would be wrong, therefore since we are assuming JDP is not wrong this is not what he means.
Context: JDP also says "before superintelligence", implying he agrees with Bostrom that the goal slot is fixed by the time the AI system is superintelligent.
Process of Elimination: Therefore JDP means that the epistemology does not have to be developed later than the motivation system.
Modus Tollens: But wait. Logically the final superintelligent epistemology must be developed alongside the superintelligence if we're using neural gradient methods. Therefore since we are assuming JDP is not wrong this must not quite be what he means.
Occam's Razor: Theoretically it could be made of different models, one of which is a superintelligent epistemology, but epistemology is made of parts and the full system is presumably necessary to be "superintelligent".
Context: JDP says that "AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent", this implies the existence of non superintelligent epistemologies which understand what we mean.
Inference: If there are non superintelligent epistemologies which are sufficient to understand us, and JDP believes that the motivation system can be made to understand us before we develop a superintelligent epistemology, then JDP must mean that Bostrom is wrong because there are or will be sufficient neural representations of our goals that can be used to specify the goal slot before we develop the superintelligent epistemology.
This is correct, though that particular chain of logic doesn't actually imply the "before superintelligence" part, since there is a space between embryo and superintelligent where it could theoretically come to understand. I argue why I think Bostrom implicitly rejects this or thinks it must be irrelevant in the 13 steps above. But I think it's important context that to me this doesn't come out as 13 steps or a bunch of sys2 reasoning; I just look at the thing and see the implication, and then have to do a bunch of sys2 reasoning to articulate it if someone asks. To me it doesn't feel like a hard thing from the inside, so I wouldn't expect it to be hard for someone else either. From my perspective it basically came across as bad faith, because I literally could not imagine someone wouldn't understand what I'm talking about until several people went "no I don't get it"; that's how basic it feels from the inside here. I now understand that no, this actually isn't obvious, and the hostile tone above was frustration from not knowing that yet.
Describing it as a “misunderstanding” is tantamount to saying that if you make a syntax error when writing some code, the proper way to describe it is the computer “misunderstanding” you.
Honestly maybe it would make more sense to say that the cognitive error here is using a compiler for a context free grammar as your reference class for intuitions, as opposed to a mind that understands natural language. The former is not expected to understand you when what you say doesn't fully match what you mean; the latter very much is, and the latter is the only kind of thing that's going to have the proper referents for concepts like "happiness".
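To make the contrast concrete, here's a toy sketch using stdlib stand-ins rather than an actual LLM (the intent strings and the garbled request are made up for illustration): a strict parser treats a near-miss as a hard error, while even a crude fuzzy matcher maps a garbled request onto the thing you probably meant.

```python
# Toy contrast of the two reference classes (stdlib only, no real LLM; the
# intent strings below are made up for illustration).
import ast
import difflib

# Reference class 1: a compiler for a context free grammar. A near-miss is
# simply an error; there is no notion of "what you probably meant".
try:
    ast.literal_eval("{'mood': happy}")  # near-miss: the string isn't quoted
except (ValueError, SyntaxError) as e:
    print("compiler-style failure:", e)

# Reference class 2: something that models meaning (crudely stood in for here
# by fuzzy string matching) degrades gracefully toward the intended referent.
known_intents = ["make the user happy", "make the user wealthy", "make the user healthy"]
request = "make teh user hapy"
print("best guess at intent:", difflib.get_close_matches(request, known_intents, n=1))
```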
ChatGPT still thinks I am wrong, so let's think step by step. Bostrom says (i.e. leads the reader to understand through the gestalt of his writing, not that he literally says this in one passage) that, in the default case:
Apparently not even Xi thinks it's a good idea!
https://www.ft.com/content/c4e81a67-cd5b-48b4-9749-92ecf116313d