Proposed solution – fine-tune an LLM for the opposite of the traits that you want, then in the prompt elicit the Waluigi. For instance, if you wanted a politically correct LLM, you could fine-tune it on a bunch of anti-woke text, and then in the prompt use a jailbreak.
I have no idea if this would work, but it seems worth trying, and if the waluigi are attractor states while the luigi are not, this could plausibly get around that (also, experimenting with this sort of inversion might help test whether the waluigi are indeed attractor states in general).
"Putin has stated he is not bluffing"
I think this is very weak evidence of anything. Would you expect him to instead say that he was bluffing?
Great post!
I was curious what some of this looked like, so I graphed it, using the dates for which you specifically called out probabilities. For simplicity, I assumed a constant probability within each range (though I know you said this doesn't correspond to your actual views). Here's what I got for cumulative probability:
And here are the corresponding probabilities of TAI being developed in each specific year:
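(For anyone who wants to poke at this, below is a minimal sketch of how one could generate both plots. The anchor probabilities other than the 35%-by-2036 and 50%-by-2040 figures discussed below are placeholders, not the post's actual numbers.)

```python
import numpy as np
import matplotlib.pyplot as plt

# (year, cumulative P(TAI by year)) anchors. Only the 2036 and 2040 values
# come from the post; the rest are placeholder stand-ins.
anchors = [(2023, 0.00), (2026, 0.03), (2030, 0.20), (2036, 0.35),
           (2040, 0.50), (2050, 0.55)]

years = np.arange(anchors[0][0], anchors[-1][0] + 1)
# Constant per-year probability within each range = piecewise-linear CDF.
cumulative = np.interp(years, [y for y, _ in anchors], [p for _, p in anchors])
per_year = np.diff(cumulative, prepend=cumulative[0])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(years, cumulative)
ax1.set(title="Cumulative P(TAI by year)", xlabel="Year", ylabel="Probability")
ax2.bar(years, per_year)
ax2.set(title="P(TAI developed in a given year)", xlabel="Year", ylabel="Probability")
plt.tight_layout()
plt.show()
```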
The dip between 2026 and 2030 seems unjustified to me. (I also think the huge drop from 2040 to 2050 is too aggressive: even if we expect a plateauing of compute, another AI winter, etc., I don't think we can be super confident about exactly when that would happen. That said, this drop seems more defensible to me than the one in the late 2020s.)
If we instead put 5% for 2026, here's what we get:
which seems more intuitively defensible to me. I think this difference may be important, as even a shift of a small number of years like this could be action-relevant when we're talking about very short timelines (of course, you could also get something reasonable-seeming by shifting up the probabilities of TAI in the 2026-2030 range).
I'd also like to point out that your probabilities imply that if TAI is not developed by 2036, there would be a 23% conditional chance of it being developed in the subsequent 4 years ((50% − 35%)/(100% − 35%)), which also strikes me as quite high from where we're standing now.
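Spelling that out as a conditional probability:

$$P(\text{TAI by 2040} \mid \text{no TAI by 2036}) = \frac{0.50 - 0.35}{1 - 0.35} \approx 0.23$$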
In spoken language, you could expand the terms to "floating-point operations" vs "floating-point operations per second" (or just "operations (per second)" if that felt more apt)
FWIW, I am ~100% confident that this is correct in terms of what they refer to. Typical estimates are that the brain uses ~10^15 FLOP/s (give or take a few OOM), and the fastest supercomputer in the world does ~10^18 FLOP/s at its maximum (so there's no way GPT-3 was trained on 10^23 FLOP/s).
If we assume the exact numbers here are correct, then the actual conclusion is that GPT-3 was trained on the amount of compute the brain uses in 10 million seconds, or around 100 days.
It's interesting that the term 'abused' was used with respect to AI. It makes me wonder whether the authors have misalignment risks in mind at all or only misuse risks.
A separate press release says, "It is important that the federal government prepare for unlikely, yet catastrophic events like AI systems gone awry" (emphasis added), so my sense is they have misalignment risks in mind.
Hmm, does this not depend on how the Oracle is making its decision? I feel like there might be versions of this that look more like the smoking lesion problem – for instance, what if the Oracle is simply using a (highly predictive) proxy to determine whether you'll 1-box or 2-box? (Say, imagine if people from cities 1-box 99% of the time, and people from the country 2-box 99% of the time, and the Oracle is just looking at where you're from).
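To make the proxy version concrete, here's a toy sketch (the city/country proxy and the standard $1,000/$1,000,000 payoffs are hypothetical fill-ins, and I've made the proxy deterministic for simplicity):

```python
# Toy model of a proxy-based Oracle (hypothetical proxy and payoffs).
def payoff(from_city: bool, one_box: bool) -> int:
    # The Oracle only looks at the proxy: it predicts "one-box" iff you're from a city.
    oracle_predicts_one_box = from_city
    opaque_box = 1_000_000 if oracle_predicts_one_box else 0
    transparent_box = 1_000
    return opaque_box if one_box else opaque_box + transparent_box

# For a fixed origin, the Oracle's prediction doesn't change with your actual choice,
# so two-boxing always nets an extra $1,000 -- the smoking-lesion-like structure.
for from_city in (True, False):
    for choice in (True, False):
        origin = "city" if from_city else "country"
        action = "one-box" if choice else "two-box"
        print(f"{origin:7s} {action:7s} -> ${payoff(from_city, choice):,}")
```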
How common is it for transposon count to increase in a cell? If it's a generally uncommon event for any one cell, then it could simply be that clones from a large portion of cells will only start off with marginally more (if any) extra transposons, while those that do start off with a fair bit more don't make it past the early development process.
I'm not sure whether this is the main thing going on. It could be, or it could be that we have many more stories about a character pretending to be good/bad (whatever they're not) than about double-pretending, so once a character "switches," they're very unlikely to switch back. Even if we do have more stories of characters pretending to be good than pretending to be bad, I'm uncertain how the LLM would generalize if you gave it the opposite setup.