Solid idea! I'd worry a bit, around the edges, for a couple reasons:
(I agree that they're kinda scary-looking; not sure whether that's a plus or minus. I guess the point is you want to make the message scary, but not the messengers.)
My guess would be that an AI could self-improve by "a lot", in a way that's reasonably safe-for-it. Cf. https://www.lesswrong.com/posts/dho4JQytfHWXtTvkt/on-the-adolescence-of-technology?commentId=t2hKmhsS6yLyJFQwh
[on the fence about posting this comment, consider it low expected value [<--RFW?]]
Re/ legitimacy of large marches, I'd guess there are other, additional ways to get legitimacy. Basically, stuff that normal marches have. E.g.
suffering is a sign that one has little slack, and so entities that are out to get you will be less inclined to target those who signal suffering.
Well, depending on the targeter, it counts against being targeted because there's relatively less to expropriate, and it counts towards being targeted because you have fewer defenses and are more desperate / have a worse negotiation position.
I'd speculate that you have a large advantage with practical partial solutions to alignment by being in silico. Some of the standard AI advantages for capability improvements may also be significant advantages for alignment (auto-alignment?). For example:
You said "immediately solve the alignment problem, in an ambitious way...", but you could instead have a smoother series of alignment solutions, paced with the takeoff. Maybe.
Could be helpful, yeah. I'd caution people not to put too much burden on themselves. This is one of those things where, at least at first, I'd want to do a minimal, stripped-down version that requires as little extra effort as possible--because it's the sort of thing that you might not do at all due to cognitive costs. I do sometimes write something down during or after, but when I do, it's usually a tiny snippet of just a couple words, to prompt myself to maybe unpack later.
then corrections are spaghetti-coded, added on to prevent particular failures, with data from real experiments
My guess would be that the failures would be quite systematic, and would reflect the absence of substantial algorithms. That would suggest that you either have to come up with more algorithms, or you have to learn them from data (or both). But to learn them from data without coming up with the algorithms, or with algorithmic search spaces that sufficiently promote the relevant pieces, you need a lot of data; and brain algorithms that work on a time scale of an hour or a day have correspondingly less data feasibly available, compared to ~second-long events.
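(To gesture at the magnitudes--a rough back-of-envelope with assumed numbers, nothing from the original thread: counting non-overlapping events at each timescale over a ~20-year learning window, hour- and day-scale events supply several orders of magnitude fewer examples than ~second-scale events.)

```python
# Rough back-of-envelope (illustrative assumptions, not figures from the comment):
# count non-overlapping "events" at each timescale over a learning lifetime,
# as a crude proxy for how many training examples are available to algorithms
# operating at that timescale.

SECONDS_PER_HOUR = 3600
WAKING_HOURS_PER_DAY = 16   # assumed
YEARS = 20                  # assumed learning window

total_hours = WAKING_HOURS_PER_DAY * 365 * YEARS    # ~1.2e5
total_seconds = total_hours * SECONDS_PER_HOUR      # ~4.2e8
total_days = 365 * YEARS                            # ~7.3e3

for scale, n in [("~1 second", total_seconds),
                 ("~1 hour", total_hours),
                 ("~1 day", total_days)]:
    print(f"{scale:>10}: ~{n:.0e} non-overlapping events")
```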
(Just noting that I agree with your footnote that the learning part is the hard part; that's the part that seems necessary for real minds, and that, when I ask neuro people about it, they're like "oh yeah, we don't know how to do that".)
Very little / hard to evaluate. I have been doing my best to carefully avoid saying things like "do math/science research", unless speaking really loosely, because I believe that's quite a poor category. It's like "programming"; sure, there's a lot in common between writing a CRUD app and tweaking a UI, but neither is really the same thing as "think of a genuinely novel algorithm and implement it effectively in context". Quoting from https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#_We_just_need_X__intuitions :
So the outcome of "think of a novel math concept that mathematicians then find interesting" is nontrivially narrower than "prove something nontrivial that hasn't been proved before". I haven't evaluated the actual results though and don't know how they work. If mathematicians start reading off lots of concepts directly from the LLMs' reasonings and find them interesting and start using them in further theory, that would be surprising / alarming / confusing, yeah--though it also wouldn't be mindblowing if it turns out that "a novel math concept that mathematicians then find interesting" is also a somewhat poor category in the same way, and actually there are such things to be found without it being much of a step toward general intelligence.