I think trying to win the memetic war and trying to find the truth are fundamentally at odds with each other, so you have to find the right tradeoff. fighting the memetic war actively corrodes your ability to find the truth. this is true even if you constrain yourself to never knowingly utter falsehoods - even just arguing against the bad arguments over and over again calcifies your brain and makes you worse at absorbing new evidence and changing your mind. conversely, committing yourself to finding the truth means you will get destroyed when arguing against people whose only goal is to win arguments.
the premise that i'm trying to take seriously for this thought experiment is: what if the "claude is really smart and just a little bit away from agi" people are totally right, so that you just need to dial up capabilities a little bit more rather than a lot more, and then it becomes very reasonable to say that claude++ is about as aligned as claude.
(again, i don't think this is a very likely assumption, but it seems important to work out what the consequences of this set of beliefs being true would be)
or at least, conditional on (a) claude is almost agi and (b) claude is mostly aligned, it seems like quite a strong claim to say "claude++ crosses the agi (= can kick off rsi) threshold at basically the same time it crosses the 'dangerous-core-of-generalization' threshold, so that's also when it becomes super dangerous." it's a way stronger claim than "claude is far away from being agi, we're going to make 5 breakthroughs before we achieve agi, so who knows whether agi will be anything like claude." or, like, sure, the agi threshold is a pretty special threshold, so it's reasonable to privilege this hypothesis a little bit, but when i think about the actual stories i'd tell about how this happens, it just feels like i'm starting from the bottom line first, and the stories don't feel like the strongest part of my argument.
(also, i'm generally inclined towards believing alignment is hard, so i'm pretty familiar with the arguments for why aligning current models might not have much to do with aligning superintelligence. i'm not trying to argue that alignment is easy. or like i guess i'm arguing X -> alignment is easy, which, if you accept it, can only ever make you more likely to accept that alignment is easy than if you didn't accept the argument, but you know what i mean. i think X is probably false, but it's plausible that it isn't, and importantly a lot of evidence will come in over the next year or so on whether X is true)
I agree. I think spending all of one's time thinking about and arguing against weakman arguments is one of the top reasons why people get set in their ways and stop tracking the truth. I aspire not to do this.
I think this is just because current AI systems are viscerally dumb. I think almost all people will change their minds on this eventually. I think people are at greater risk of forgetting they ever believed AI was dumb than of denying AI capabilities long after they're obviously superhuman.
I'm claiming something like 3 (or 2, if you replace "given tremendous uncertainty, our best guess is" with "by assumption of the scenario") within the very limited scope of the world where we assume AGI is right around the corner and looks basically just like current models but slightly smarter
suppose a human solved alignment. how would we check their solution? ultimately, we look at their argument and use our reasoning and judgement to determine whether it's correct. this applies even if we need to adopt a new frame of looking at the world - we can ultimately only use the kinds of reasoning and judgement we use today to decide which new kinds of reasoning to accept into the halls of truth.
so there is a philosophically very straightforward thing you could do to get the solution to alignment, or the closest we could ever get to one: put a bunch of smart, thoughtful humans in a sim and run it for a long time. the verification process then looks like proving that the sim is correct, showing somehow that the humans are actually correctly uploaded, etc. not trivial, but it also seems plausibly doable.
i guess so? i don't know why you say "even as capability levels rise" - after you build and align the base case AI, humans are no longer involved in ensuring that the subsequent more capable AIs are aligned.
i'm mostly indifferent about what the paradigms look like up the chain. probably at some point up the chain things stop looking like anything humans made. but what matters at that point is no longer how good we humans are at aligning model n, but how good model n-1 is at aligning model n.
to be clear, "just train on philosophy textbooks lol" is not at all what i'm proposing, and i don't think that works
what i meant by that is something like:
assuming we are in this short-timelines-no-breakthroughs world (to be clear, this is a HUGE assumption! not claiming that this is necessarily likely!), to win we need two things: (a) base case: the first AI in the recursive self-improvement chain is aligned, and (b) induction step: each AI can create and align its successor.
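(to make the shape of that claim explicit, here's the induction written out; the notation $A_n$ for the n-th AI in the chain is just my own shorthand, nothing in the scenario depends on it:)

$$\text{(a) base case: } \mathrm{aligned}(A_0)$$
$$\text{(b) induction step: } \forall n,\ \mathrm{aligned}(A_n) \Rightarrow \mathrm{aligned}(A_{n+1})$$
$$\text{conclusion: } \forall n,\ \mathrm{aligned}(A_n)$$

where the work in (b) is done by $A_n$ itself rather than by us.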
i claim that if the base case AI is about as aligned as current AI, then condition (a) is basically either satisfied or not that hard to satisfy. like, i agree current models sometimes lie or are sycophantic or whatever. but these problems really don't seem nearly as hard to solve as the full AGI alignment problem. like idk, you can just ask models to do stuff and they like mostly try their best, and it seems very unlikely that literal GPT-5 is already pretending to be aligned so it can subtly stab us when we ask it to do alignment research.
importantly, under our assumptions, we already have AI systems that are basically analogous to the base case AI, so prosaic alignment research on systems that exist right now is actually just lots of progress on aligning the base case AI. and in my mind a huge part of the difficulty of alignment in the longer-timelines world is that we don't yet have the AGI/ASI, so we can't do alignment research with good empirical feedback loops.
like tbc it's also not trivial to align current models. companies are heavily incentivized to do it and yet they haven't fully succeeded. but this is a fundamentally easier class of problem than aligning AGI in the longer-timelines world.
i mean like writing kernels or hill climbing training metrics is viscerally fun even separate from any of the status parts. i know because long before any of this ai safety stuff, before ai was such a big deal, i would do ML stuff literally purely for fun without getting paid or trying to achieve glorious results or even publishing it anywhere for anyone else to see.