I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
As of a couple of days ago, I have a file where I save lessons from such review exercises so that I can review them periodically.
Some are in a weekly review category and some in a monthly one. Every day, when I do my daily recall, I now also check through the lessons under the corresponding weekday and day-of-month tag. (A small sketch of how this daily lookup could be automated follows after the file.)
Here's what my file currently looks like:
(I use some shortcodes to type faster, like "W=what", "h=how", "t=to", "w=with", and maybe some more.)
- Mon
- [[lesson - clarify Gs on concrete examples]]
- [[lesson - delegate whenever you can (including if possible large scale responsibilities where you need to find someone competent and get funding)]]
- [[lesson - notice when i search for facts (e.g. w GPT) (as opposed to searching for understanding) and then perhaps delegate if possible]]
- Tue
- [[lesson - do not waste time on designing details that i might want to change later]]
- [[periodic reminder - stop and review what you'd do if you had pretty unlimited funding -> if it could speed you up, then perhaps try to find some]]
- Wed
- [[lesson - try to find edge cases where your current model does not work well]]
- notice when sth worked well (you made good progress) -> see h you did that (-> generalize W t do right next time)
- Thu
- it's probably useless/counterproductive to apply effort for thinking. rather try to calmly focus your attention.
- perhaps train to energize the thing you want to think about like a swing through resonance. (?)
- Fri
- [[lesson - first ask W you want t use a proposal for rather than directly h you want proposal t look like]]
- Sat
- [[lesson - start w simple plan and try and rv and replan, rather than overoptimize t get great plan directly]]
- Sun
- group
- plan for particular (S)G h t achieve it rather than find good general methodology for a large class of Gs
- [[lesson - when possible t get concrete example (or observations) then get them first before forming models or plans on vague ideas of h it might look like]]
- 1
- don't dive too deep into math if you don't want to get really good understanding (-> either get shallow or very deep model, not half-deep)
- 2
- [[lesson - take care not to get sidetracked by math]]
- 3
- [[lesson - when writing an important message or making a presentation, imagine what the other person will likely think]]
- 4
- [[lesson - read (problem statements) precisely]]
- 5
- perhaps more often ask myself "Y do i blv W i blv?" (e.g. after rc W i think are good insights/plans)
- 6
- sometimes imagine W keepers would want you to do
- 7
- group
- beware conceptual limitations you set yourself
- sometimes imagine you were smarter
- 8
- possible tht patts t add
- if PG not clear -> CPG
- if G not clear -> CG
- if not sure h continue -> P
- if say sth abstract -> TBW
- if say sth general -> E (example)
- 9
- ,rc methodology i want t use (and Y)
- Keltham methodology.
- loop: pr -> gather obs -> carve into subprs -> attack a subpr
- 10
- reminder of insights:
- hyp that any model i have needs t be able t be applied on examples (?)
- disentangle habitual execution from model building (??)
- don't think too abstractly. see underlying structure to be able t carve reality better. don't be blinded by words. TBW.
- don't ask e.g. W concepts are, but just look at observations and carve useful concepts anew.
- form models of concrete cases and generalize later.
- 11
- always do introspection/rationality-training and review practices. (except maybe in some sprints.)
- 12
- Wr down questions towards the end of a session. Wr down questions after having formed some takeaway. (from Abram)
- 13
- write out insights more in math (from Abram)
- 14
- periodically write out my big picture of my research (from Abram)
- 15
- Hoops. first clarify observations. note confusions. understand the problem.
- 16
- have multiple hypotheses. including for plans as hypotheses of what's the best course of action.
- 17
- actually fucking backchain. W are your LT Gs.
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- read https://www.lesswrong.com/posts/f2NX4mNbB4esdinRs/towards_keeperhood-s-shortform?commentId=D66XSCkv6Sxwwyeep
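A minimal sketch of how the daily lookup could be automated (not my actual tooling, just an illustration; the file name and the exact outline format are assumptions):

```python
# Hypothetical script: given today's date, print the lessons filed under the
# matching weekday ("Mon".."Sun") and day-of-month ("1".."31") tags of a file
# structured like the outline above (top-level bullets are the tags, lessons
# indented underneath). File name and format are assumptions.

import datetime
import re

LESSONS_FILE = "lessons.md"  # hypothetical path

def lessons_for(date: datetime.date, text: str) -> list[str]:
    wanted = {date.strftime("%a"), str(date.day)}  # e.g. {"Mon", "8"}
    hits, collecting = [], False
    for line in text.splitlines():
        tag = re.match(r"^- (\S+)\s*$", line)      # a top-level tag line
        if tag:
            collecting = tag.group(1) in wanted
        elif collecting and line.strip().startswith("- "):
            hits.append(line.strip()[2:])          # an indented lesson line
    return hits

if __name__ == "__main__":
    with open(LESSONS_FILE, encoding="utf-8") as f:
        for lesson in lessons_for(datetime.date.today(), f.read()):
            print("*", lesson)
```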
Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.
This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever-higher. If the eventual plan is to get big labs to listen to your research, then isn't it better to start early? (If you have anything significant to say, of course.)
I'd imagine it's not too hard to get a >1 OOM efficiency improvement that one can demonstrate on smaller AI systems, and one might use this to get a lab to listen. If the labs are sufficiently uninterested in alignment, it's pretty doomy anyway, even if they adopted a better paradigm.
Also, government interventions might still happen (perhaps more likely because of AI-caused unemployment than because of x-risk, and they won't buy amazingly much time, but still).
Also, the strategy of "maybe if AIs are more rational they will solve alignment, or at least realize that they cannot" seems very unlikely to me to work in the current DL paradigm, though it's still slightly helpful.
(Also maybe some supergenius or my future self or some other group can figure something out.)
I don’t think that. See the bottom part of the comment you’re replying to. (The part after “Here’s what I would say instead:”)
Sorry, my comment was sloppy.
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”).
(I agree that the way I used "sloppy" in my comment mostly meant "weak". But some other thoughts:)
So I think there are some dimensions of intelligence that are more important for solving alignment than for creating ASI. In planecrash terms, WIS and rationality training seem to me more important in that way than INT.
I don't really have much hope for DL-like systems solving alignment, but a similar case might be if an early transformative AI recognizes this and says: "No, I cannot solve the alignment problem. The way my intelligence is shaped is not well suited to avoiding value drift. We should stop scaling and take more time, where I work with very smart people like Eliezer etc. for some years to solve alignment." Depending on the intelligence profile of the AI, this might be more or less likely to happen (currently it seems quite unlikely).
But overall those "better" intelligence dimensions still seem to me too central for AI capabilities, so I wouldn't publish stuff.
(Btw, the way I read John's post was more like "fake alignment proposals are a main failure mode" rather than also "... and therefore we should work on making AIs more rational/sane" or whatever. So given that, I would maybe defend John's framing, but I'm not sure.)
So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.
If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?
Because it's much easier to speed up AI capabilities while being sloppy than it is to produce actually good alignment ideas.
If you really think you need to be similarly unsloppy to build ASI as to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you think that (or tell me to start).
Thanks for providing a concrete example!
Belief propagation seems too much of a core of AI capability to me. I'd rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.
I also think the "drowned out in the noise" concern isn't that realistic: you ought to be able to show some quite impressive results relative to the computing power used. Though when you should try to convince the AI labs of your better paradigm is going to be a difficult call. It's plausible to me that we won't see signs that make us sufficiently confident that we only have a short time left, and it's also plausible that we will.
In any case, before you publish something you can share it with trustworthy people, and then we can discuss that concrete case in detail.
Btw, to be clear, something that I think slightly speeds up AI capabilities but is still good to publish is e.g. rationality content that helps humans think more effectively (AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.
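(To pin down what I mean above by "belief propagation": sum-product style message passing for computing marginals. A toy, purely illustrative sketch on a three-variable chain, with made-up numbers:)

```python
# Sum-product belief propagation on a binary chain x1 - x2 - x3, compared
# against brute-force marginalization. Purely illustrative; the potentials
# are made up.

from itertools import product

phi = {1: [0.9, 0.1], 2: [0.5, 0.5], 3: [0.2, 0.8]}   # unary potentials
def psi(a, b):                                         # pairwise potential
    return 2.0 if a == b else 1.0                      # favors agreement

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# Messages into x2 from its two neighbors, then the belief (marginal) at x2.
m_1to2 = [sum(phi[1][x1] * psi(x1, x2) for x1 in (0, 1)) for x2 in (0, 1)]
m_3to2 = [sum(phi[3][x3] * psi(x2, x3) for x3 in (0, 1)) for x2 in (0, 1)]
belief_x2 = normalize([phi[2][x2] * m_1to2[x2] * m_3to2[x2] for x2 in (0, 1)])

# Brute force: sum the full joint over x1 and x3.
def joint(x1, x2, x3):
    return phi[1][x1] * phi[2][x2] * phi[3][x3] * psi(x1, x2) * psi(x2, x3)

brute_x2 = normalize([sum(joint(x1, x2, x3) for x1, x3 in product((0, 1), repeat=2))
                      for x2 in (0, 1)])

print(belief_x2)  # matches brute_x2: BP is exact on tree-structured graphs
print(brute_x2)
```

It's exact on trees and approximate on graphs with cycles, and it's this general "propagate and combine evidence" machinery that seems too capability-central to me to want to improve publicly.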
Can you link me to what you mean by John's model more precisely?
If you mean John's slop-instead-of-scheming post: I agree with the "slop slightly more likely than scheming" part. I might need to reread John's post to see what the concrete suggestions for what to work on are. Will do so tomorrow.
I'm just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, at least aside from some math, or at least for the parts of the problem we are bottlenecked on.
So I do think it's valuable to have AIs that are near the singularity be more rational. But I don't really buy the differentially-improving-alignment thing. Could you give a somewhat concrete example of what you think might be good to publish?
All capabilities will help somewhat with making the AI less likely to make errors that screw up its alignment. Which ones do you think are more important than others? There would have to be a significant difference in the usefulness of some capabilities, because otherwise you could just do the same alignment work later and still have similarly much time until superintelligence (and could get more non-timeline-speeding work done).
Thanks.
True, I think your characterization of tiling agents is better. But my impression was that this self-trust is an important precursor for the dynamic self-modification case, where alignment properties need to be preserved through the self-modification. Yeah, I guess calling this "the AI solving alignment" is somewhat confused, though maybe there's something in this direction, because the AI still does the search to try to preserve the alignment properties?
Hm, I mean, yeah, if the current bottleneck is math rather than conceptualizing what math needs to be done, then it's a bit more plausible. I think it ought to be feasible to get AIs that are extremely good at proving theorems and maybe also at formalizing conjectures, though I'd be a lot more pessimistic about them finding good formal representations for describing/modelling ideas.
Do you think we are basically only bottlenecked on math, so that sufficient math skill could carry us to aligned AI? Or do we just have some alignment-philosophy overhang that you want to formalize, after which more philosophy will be needed?
What kind of alignment research do you hope to speed up anyway?
For advanced-philosophy-like stuff (e.g. finding good formal representations for world models, or inventing logical induction), they don't seem anywhere remotely close to being useful.
My guess would be that for tiling agents theory they aren't either, but I haven't worked on it, so I'm very curious about your take here. (IIUC, to some extent the goal of tiling-agents-theory-like work was to have an AI solve its own alignment problem. Not sure how far the theory side got there and whether it could be combined with LLMs.)
Or what is your alignment hope in more concrete detail?
This argument might move some people to work on "capabilities" or to publish such work when they might not otherwise do so.
Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.
My current guess:
I wouldn't expect much useful research to come from having published the ideas; they would mostly just be used for capabilities, so it seems like a bad idea to publish this stuff.
Sure, you can work on it while being infosec-cautious and keeping it secret. Maybe share it with a few very trusted people who might actually have some good ideas. And depending on how things play out: if in a couple of years there's an actual effort from the joint collection of the leading labs to align AI, and they only have something like 2-8 months left before competition hits the AI-improving-AI dynamic quite hard, then you might go to the labs and share your ideas with them (while still trying to keep them closed within those labs, which will probably only work for a few months or a year or so until there's leakage).
I now want to always think of concrete examples of where a lesson might become relevant in the next week/month, instead of just reading the lessons.