However, there's an important dynamic this model misses: when the detector becomes less effective, the model's capabilities might also decline. For example, suppose a lie detector works because it leverages some important representations that help a model reason about its situation. A single gradient step could make the lie detector less reliable, but to do so it might need to distort some of those important internal representations, and as a result the model would become worse at, e.g., writing code. So it's possible that reducing the effectiveness of a detector requires paying a tax.
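Here's a toy sketch of that dynamic (my own construction, not from the original discussion, and assuming PyTorch): a shared encoder feeds both a "capability" head and a frozen lie-detector probe. Gradient steps taken purely to fool the probe also distort the shared representation, and capability loss typically rises, which is the tax described above.

```python
import torch

torch.manual_seed(0)
d_in, d_rep, n = 32, 16, 512

# Synthetic data: a capability target and a "lying" label that both depend
# on the same underlying features of the input.
x = torch.randn(n, d_in)
true_map = torch.randn(d_in, d_rep)
features = x @ true_map
cap_target = features.sum(dim=1, keepdim=True)          # capability signal
lie_label = (features[:, 0] > 0).float().unsqueeze(1)   # what the probe detects

encoder = torch.nn.Linear(d_in, d_rep)
cap_head = torch.nn.Linear(d_rep, 1)
probe = torch.nn.Linear(d_rep, 1)                        # the lie detector

# Phase 1: train encoder, capability head, and probe normally.
params = list(encoder.parameters()) + list(cap_head.parameters()) + list(probe.parameters())
opt = torch.optim.Adam(params, lr=1e-2)
for _ in range(500):
    h = encoder(x)
    loss = torch.nn.functional.mse_loss(cap_head(h), cap_target) \
         + torch.nn.functional.binary_cross_entropy_with_logits(probe(h), lie_label)
    opt.zero_grad(); loss.backward(); opt.step()

def report(tag):
    with torch.no_grad():
        h = encoder(x)
        cap_loss = torch.nn.functional.mse_loss(cap_head(h), cap_target).item()
        det_acc = ((probe(h) > 0).float() == lie_label).float().mean().item()
    print(f"{tag}: capability MSE={cap_loss:.3f}, detector accuracy={det_acc:.2f}")

report("before evasion")

# Phase 2: update only the encoder to fool the (now frozen) probe.
evade_opt = torch.optim.Adam(encoder.parameters(), lr=1e-2)
for _ in range(200):
    h = encoder(x)
    # Maximize the probe's error, i.e. push its predictions toward being wrong.
    evade_loss = -torch.nn.functional.binary_cross_entropy_with_logits(probe(h), lie_label)
    evade_opt.zero_grad(); evade_loss.backward(); evade_opt.step()

report("after evasion")  # detector accuracy drops, capability MSE rises
```

In this toy setup the capability head was fit to the original representation, so the evasion updates that scramble what the probe reads also move the representation away from what the capability head needs. How big the tax is depends on how entangled the two are, which is exactly the property a good detector should aim for.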
An important implication is that good misalignment detectors should have the property that either the AI's misalignment is detected, or, if the AI can be undetectably misaligned, it ends up with lower capabilities than the hypothetical aligned AI.
And this is why interpretability research is still useful even if we never get holy-grail interpretability before we face misalignment risk from AI (if ever), and even if we cannot explain everything that's going on in the model: partial understanding is enough to build detectors that make AIs pay a tax for undetectable misalignment.
Similar stories hold for AI control, and it's a big reason why I like the fact that AI control is getting funded right now.
The other part is that partisans of the narrative overfocus on the other side's bad arguments, because most people can't check the arguments themselves. To be frank, the entire area is a mess I'm not willing to wade into, so I instead focus on less-charged topics.
Like, all 4 of the examples are great demonstrations of why you need to be able to steelman your opponent, and one of the central problems in politics is that people are trapped in a loop of destroying bad arguments instead of engaging with the good ones.
Yeah, that particular part sounded a lot like "I can't understand why people disagree with IABIED without suffering from PR mindset".
Like, this would not be out of place coming from someone who couldn't actually understand or tolerate disagreement with the Grand Ideas, and I really wish this quote had been struck from the post entirely:
Another example I would cite was the response to If Anyone Builds It, Everyone Dies by the core EA people, including among others Will MacAskill himself and also the head of CEA. This was a very clear example of PR mindset, where quite frankly a decision was made that this was a bad EA look, the moves it proposes were unstrategic, and thus the book should be thrown overboard. If Will is sincere about this reckoning, he should be able to recognize that this is what happened.
Conditional on a slowdown in AI progress, my primary hypothesis is that recent AI models haven't scaled much in compute compared to past models and have instead relied on RL progress, and current RL is becoming less and less of a free lunch and is actually less efficient than pre-training.
Which is a slight update against software-only singularity stories occurring.
I somewhat agree with this, but I don't agree with conclusions like the claim that making an illegible problem legible flips the sign of the value of working on it, and I want to explain why I disagree:
More generally, a big reason people are focusing on more legible problems nowadays is a shift in which regime of AI capabilities safety interventions are targeting: there's a lot less focus on the post-intelligence-explosion era, where AIs can do things that no human can reliably hope to do, and much more focus on the first steps, say AI fully automating AI R&D, where it's easier to reason about how effective an intervention is and you can rely more on imperfect/not-fully-justifiable solutions like AI control.
You didn't actually answer the question posed, which was "Why couldn't humans and ASI have peaceful trades even in the absence of empathy/love/alignment to us rather than killing us?" and not "Why would we fail at making AIs that are aligned/have empathy for us?"
However, I'd also update toward the current paradigm continuing to progress at a pretty fast clip, which would push towards expecting powerful AI in the current paradigm within 15 years (and probably within 10).
This specific prediction can't come true, due to data limitations, but Moore's law leading to passable (though not perfect) simulation of human brains means I do agree with the timeline medians, so I end up with similar conclusions.
The next 5-6 years will be a big test of whether the pure/strong version of the scaling hypothesis is correct in AI.
So yeah, a 10-15 year expectation for TAI timelines is reasonable, but that has to come through a different paradigm.
Indeed, you could make a very reasonable argument that the entire reason AI might be dangerous is that once it can, for example, automate away the entire economy, defection no longer has any cost and has massive benefits (at least conditional on no alignment in values).
The basic reason you can't easily defect against social systems and gain massive amounts of utility is a combination of factors: humans can't reliably evade enforcement, for logistical reasons; people can reliably detect defection in small groups thanks to reputation/honor systems; and individual humans are far, far less powerful, even selfishly, acting alone than as cooperators.
This of course breaks once AGI/ASI is invented, but John Wentworth's post doesn't need to apply to post-AGI/ASI worlds.
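To make that shift concrete, here's a toy expected-value comparison (all numbers are made up and purely illustrative, not from the post): defection for a human embedded in a social system versus for a hypothetical AI that no longer needs the economy or fears enforcement.

```python
def expected_value_of_defection(gain, p_caught, penalty, value_of_cooperation):
    # Defecting trades away the benefits of staying inside the cooperative
    # system for a one-off gain, discounted by the chance of getting caught.
    return (1 - p_caught) * gain - p_caught * penalty - value_of_cooperation

# A human in a small, reputation-tracking group: detection is likely,
# and most of their power comes from cooperating with others.
human = expected_value_of_defection(gain=10, p_caught=0.9, penalty=50,
                                    value_of_cooperation=100)

# A hypothetical AI that can automate the economy and evade enforcement.
ai = expected_value_of_defection(gain=10_000, p_caught=0.05, penalty=50,
                                 value_of_cooperation=0)

print(f"human: {human:.1f}")  # strongly negative -> defection doesn't pay
print(f"AI:    {ai:.1f}")     # strongly positive -> the usual checks are gone
```

The point isn't the specific numbers, it's that all three terms holding defection in check for humans (detection, enforcement, dependence on cooperation) can go to roughly zero at once for a sufficiently capable AI.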
BTW, that part of the interview is also why the claim that Anthropic violated its RSP by not stopping research/deployment of new models when it lacked ASL-3 security is incorrect: RSPs were never a framework that provided for pausing unilaterally.
More generally, it's useful to keep this in mind the next time a controversy over an RSP violation happens, as I predict it will happen again.
Links to long comments that I want to pin, but that are too long to be pinned:
https://www.lesswrong.com/posts/Zzar6BWML555xSt6Z/?commentId=aDuYa3DL48TTLPsdJ
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD
https://www.lesswrong.com/posts/DCQ8GfzCqoBzgziew/?commentId=RhTNmgZqjJpzGGAaL