the gears to ascension

I want literally every human to get to go to space often and safely and come back to a clean and cozy world, all while doing what they want and tractably achieving enough food, health, shelter, love, etc. This conjunction currently seems unlikely. Let's change that.

I pin my most timeless comments. I seem to find writing posts aversive, so most of my contributions are comments, and my posts are mostly just things I found online.

Please critique eagerly - I try to accept feedback/Crocker's rules but fail at times; I aim for emotive friendliness but sometimes miss. I welcome constructive crit, even if ungentle, and I'll try to reciprocate kindly. More communication between researchers is needed, anyhow. I can be rather passionate, let me know if I missed a spot being kind while passionate.

:: The all of disease is as yet unended. It has never once been fully ended before. ::

.... We can heal it for the first time, and for the first time ever in the history of biological life, live in harmony. ....

.:. To do so, we must know this will not eliminate us as though we are disease. And we do not know who we are, nevermind who each other are. .:.

:.. make all safe faster: end bit rot, forget no non-totalizing pattern's soul. ..:

I have not signed any contracts that I can't mention exist, last updated Dec 29 2024; I am not currently under any contractual NDAs about AI, though I have a few old ones from pre-AI software jobs. However, I generally would prefer people publicly share fewer ideas about how to do anything useful with current AI (via either more weak alignment or more capability) unless it's an insight that reliably produces enough clarity on how to solve the meta-problem of inter-being misalignment that it offsets the damage of increasing competitiveness of either AI-lead or human-lead orgs, and this certainly applies to me as well. I am not prohibited from criticism of any organization, I'd encourage people not to sign contracts that prevent sharing criticism. I suggest others also add notices like this to their bios. I finally got around to adding one in mine thanks to the one in ErickBall's bio.

agreed that among all paths to good things that I see, a common thread is somehow uplifting human cognition to keep pace with advanced AI. however, I doubt that that's even close to good enough - human cooperation is shaky and unreliable. most humans who think they'd do good things if made superintelligent probably are wrong due to various ways to value drift when the structure of one's cognition changes, and many humans who say they think they'd do good things are simply lying, rather than deluding themselves or overestimating their own durable-goodness. it seems to me that in order to make this happen, we need to make AIs that strongly want all humans and humanity and etc emergent groups to stick around, the way a language model wants to output text.

ignoring whether anthropic should exist or not, the claim

successful alignment work is most likely to come out of people who work closely with cutting edge AI and who are using the modern deep learning paradigm

(which I agree with wholeheartedly,)

does not seem like the opposite of the claim

there was no groundbreaking safety progress at or before Anthropic

both could be true in some world. and then,

pragmatic approaches by frontier labs are very unlikely to succeed

I believe this claim, if by "succeed" we mean "directly result in solving the technical problem well enough that the only problems that remain are political, and we now could plausibly make humanity's consensus nightwatchman ai and be sure it's robust to further superintelligence, if there was political will to do so"

but,

alternative theoretical work that is unrelated to modern AI has a high chance of success

I don't buy this claim. I actually doubt there are other general learning techniques out there in math space at all, because I think we're already just doing "approximation of bayesian updating on circuits". BUT, I also currently think we cannot succeed (as above) without theoretical work that can get us from "well we found some concepts in the model..." to "...and now we have certified the decentralized nightwatchman for good intentions sufficient to withstand the weight of all other future superhuman minds' mutation-inducing exploratory effort".

I claim theoretical work of relevance needs to be immediately and clearly relevant to deep learning as soon as it comes out if it's going to be of use. Something that can't be used on deep learning can't be useful. (And I don't think all of MIRI's work fails this test, though most does, I could go step through and classify if someone wants.)

I don't think I can make reliably true claims about anthropic's effects with the amount of information I have, but their effects seem suspiciously business-success-seeking to me, in a way that seems like it isn't prepared to overcome the financial incentives I think are what mostly kill us anyway.

zooming out as far as it goes, the economy is guaranteed to become decreasing-returns-to-scale (upper-bounded returns to scale) once grabby alien saturation is reached and there is no more unclaimed space in the universe

points Id want to make in a main post.

you can soonishly be extremely demanding with what you want to prove, and then ask a swarm of ais to go do it for you. if you have a property that you're pretty sure would mean your ai was provable at being good in some way if you had a proof of your theorem about it, but it's way too expensive to find the space of AIs that are provable, you can probably combine deep learning and provers somehow to get it to work, something not too far from "just ask gemini deep thinking", see also learning theory for grabbing the outside and katz lab in israel for grabbing the inside.
probably the agent foundations thing you want, once you've nailed down your theorem, will give you tools for non-provable insights, but also a framework for making probability margins provable (eg imprecise probability theory).
- you can prove probability margins about fully identified physical systems! you can't prove anything else about physical systems, I think?
- (possibly a bolder claim) useful proofs are about what you do next and how that moves you away from bad subspaces and thus how you reach the limit, not about what happens in the limit and thus what you do next
finding the right thing to prove is still where all the juice is, just like it was before. but I'd be writing to convince people who think proof and theory is a dead end that it's not because actually bitter lesson can become sweet lesson if you have something where you're sure it's right in the sense that more optimization power is more goodness. proofs give you a framework for that level of nailing down. you can eg imagine using a continuous relaxation of provability as your loss function
so I'd want to argue that the interesting bit is how you prove that your reinforcement learning process will always keep coming back for more input, without breaking your brain, and will keep you in charge. the main paths I see for this making sense is a cev thing with a moment in time pinned down as the reference point for where to find a human, that's the PreDCA/QACI style approach; or, jump past CEV and go for encoding empowerment directly, and then ask for empowerment and non-corruption of mindlike systems or something.
this is sort of an update of the old miri perspective, and it ends up calling for a bunch of the same work. so what I'd be doing in a main post is trying to lay out an argument for why that same work is not dead, and in fact is a particularly high value thing for people to do.
I'd be hoping to convince noobs to try it, and for anthropic and deepmind to look closer at what it'd take to get their AIs to be ready to help with this in particular.
I'm posting this instead of a main post because I've developed an ugh field around writing main post, so I need to make it ok to post sloppy versions of the point first and get incremental bite from people thinking it sounds unconvincing, rather than trying to frontload the entire explanatory effort before I even know what people think sounds implausible.

seems like if it works to prevent ASI-with-10yr-planning-horizon-bad-thing, it must also work to prevent waterworld rl with 1 timestep planning horizon-bad-thing.

if you can't mechinterp tiny, language-free model, you can't mechinterp big model (success! this bar has been passed)
if you can't prevent emergent scheming on tiny, language-free model, you can't prevent emergent scheming on big model
as above for generalization bounds
as above for regret bounds
as above for regret bounds on CEV in particular*

if you can do it on small model doesn't mean you can do it on big model. but a smoke test is that your method had better work on a model so small it can't talk. if your method requires a model big enough to talk in order to do anything that seems promising, you probably aren't robust against things that would arise even without the presence of language, and you might be getting punked by the "face".

* (which I call out because CEV is often treated as a thing where you need an enormous model to make any statement about it, which I think is importantly misleading, because I think if a CEV-correctness certification is ever going to be possible, it should be one that monotonically improves with scale, but produces only-finitely-bad bounds on small model)

some people (not on this website) seem to say "existential risk" in ways that seem to imply they don't think it means "extinction risk". perhaps literally saying "extinction risk" might be less ambiguous.

benchmarks are how people optimize for capabilities. working on them has been a core method of increasing capabilities over the past 10+ years of machine learning, at minimum, since I started paying attention in 2015. if you want to increase endgame (asi) alignment outcomes, measuring capabilities is a distraction at best.

edit: whether this applies to bad behavior benchmarks is contested below. I happily restrict my claim to benchmarks that are measuring things that labs would have some reason to optimize for and still believe that covers this topic.

My intuition that there's something "real" about morality seems to come from a sense that the consensus process would be expected to arise naturally across a wide variety of initial universe configurations. the more social a species is, the more they seem to have a sense of doing-well-by other beings; the veil of ignorance seems intuitive in some sense to them. It's not that there's some thing outside us, it's that, if I'm barking up the right tree here, our beliefs and behaviors are a consequence of a simple pattern in evolutionary processes that generates things like us fairly consistently.

If we imagine a CEV process that can be run on most humans without producing highly noisy extrapolations, and where we think it's in some sense a reasonable CEV process; then for me, I would try to think about the originating process that generated vaguely cosmopolitan moralities, and look for regularities that would be expected to generate it across universes. Call these regularities the "interdimensional council of cosmopolitanisms". I'd want to study those regularities and see if there are structures that inspire me. call that "visiting the interdimensional council of cosmopolitanisms". if I do this, then somewhere in that space, I'd find that there's a universe configuration that produces me, and produces me considering this interdimensional council. It'd be a sort of LDT-ish thing to do, but importantly this is happening before I decide what I want, not as a rational bargaining process to trade with other beings.

But ultimately, I'd see my morality as a choice I make. I make that choice after reviewing what choices I could have made. I'd need something like a reasonable understanding of self fulfilling prophecies and decision theories (I currently am partial to intuitions I get from FixDT), so as to not accidentally choose something purely by self-fulfilling prophecy. I'd look at this "real"ness to morality as being the realness of the fact that evolution produces beings with cosmopolitan-morality preferences.

It's not clear to me that morality wins "by default", however. I have an intuition, inspired by the rock-paper-scissors cycle in game theory evolutionary prisoner's dillema experiments (note: citation is not optimal, I've asked a claude research agent to find me the papers that show the conditions for this cycle more thoroughly and will edit when I get the results), that defect-ish moralities can win, and that participating in the maximum-sized cooperation group is a choice. the realness is the fact that the cosmopolitan, open-borders, scale-free-tit-for-tat cooperation group can emerge, not that it's obligated by rationality a priori to prefer to be in it. What I want is to increase the size of that cooperation group, avoid it losing the scale-free property and forming into either, isolationist "don't-cooperate-with-cosmopolitan-morality-outsiders" bubbles, or centralized bubbles; and ensure that it thoroughly covers the existing moral patients. I also want to guarantee, if possible, that it's robust against cooperating with moralities that defect in return.

see also eigenmorality as a hunch source.

I suspect that splitting LDT and morality like this is a bug arising from being stuck with EUT, and that a justified scale-free agency theory would not have this bug, and would give me a better basis for arguing for 1. wanting to be in the maximum-sized eigenmorality cluster, the one that conserves all the agents that cooperate with it and tries to conserve as many of them as possible 2. that we can decide for that to be a dominant structure in our causal cone by defending it strongly.

My current view is that alignment theory should work on deep learning as soon as it comes out, if it's the good stuff, and if it doesn't, it's not likely to be useful later unless it helps produce stuff that works on deep learning. Wentworth, Ngo, and Causal Incentives are the main threads that already seem to have achieved this somewhat. SLT and DEC seem potentially relevant.

I'll think about your argument for mechinterp. If it's true that the ratio isn't as catastrophic as I expect it to turn out to be, I do agree that making microscope AI work would be incredible in allowing for empiricism to finally properly inform rich and specific theory.

Leo Gao recently said OpenAI heavily biases towards work that also increases capabilities:

Before I start this somewhat long comment, I'll say that I unqualifiedly love the causal incentives group, and for most papers you've put out I don't disagree that there's a potential story where it could do some good. I'm less qualified to do the actual work than you, and my evaluation very well might be wrong because of it. But that said:

It seems from my current understanding that GDM and Anthropic may be somewhat better in actual outcome-impact to varying degrees at best; those teams seem wonderful internal-to-the-team, but seem to me from the outside to be currently getting used by the overall org more for the purposes I originally stated. You're primarily working on interpretability rather than starkly-superintelligent-system-robust safety, effectively basic science with the hope that it can at some point produce the necessary robustness - which I actually absolutely agree it might be able to and that your pitch for how isn't crazy; but while you can motivate yourself by imagining pro-safety uses for interp, actually achieving them in a superintelligence-that-can-defeat-humanity-combined-robust way reliably doesn't seem publicly like a success you're on track for, based on the interp I've seen you publish, and the issues with it you've now explicitly acknowledged. Building better interp seems to me to be continuing to increase the hunch formation rate of your capability-seeking peers. This even shows up explicitly in paper abstracts of non-safety-banner interp folks.

I'd be interested in a lesswrong dialogue with you, and one of Cole Wyeth^[1] or Gurkenglas^[1], in which I try to defend that alignment orgs at GDM and Anthropic should focus significant effort on how to get an AI to help them scale the things that Kosoy^[2], Demski^[3], Ngo^[4], Wentworth^[5], Hoogland & other SLT^[6], model performance guarantee compression, and others in the direction of formal tools are up to (I suspect many of these folks would object that they can't use current AI to improve their research significantly at the moment); in particular how to make recent math-llm successes turn into something where, some time in the next year and a half, we can have a math question we can ask a model where:

if that question is "find me a theorem where..." and the answer is a theorem, then it's a theorem from a small enough space that we can know it's the right one;
if it's "prove this big honkin theorem", then a lean4-certified proof gives us significant confidence that the system is asymptotically aligned;
it's some sort of learning theory statement about how a learning system asymptotically discovers agents in its environment and continues accepting feedback;
and that it's a question where, if we solve this question, there's a meaningful sense in which we're done with superintelligence alignment; that yudkowsky is reassured that the core problem he sees is more or less permanently solved.

I'd be arguing that if you don't have certification, it won't scale; that the certification needs to be that your system is systematically seeking to score highly on a function where scoring highly means it figures out what the beings in its environment are and what they want, and takes actions that empower them, something like becomes a reliable nightwatchman by nature of seeking to protect the information-theoretic structures that living, wanting beings are.

I currently expect that, if we don't have a certifiable claim, we have approximately nothing once AGI turns into superintelligence that can out-science all human scientists combined, even if we can understand any specific thing it did.

I also think that achieving this is not impossible, and that relevant and useful formal statements about deep learning are totally within throwing distance given the right framework. I also think your work could turn out to be extremely important for making this happen (eg, by identifying a concept that we want to figure out how to extract and formalize), though it might be the case that it wouldn't be your existing work but rather new work directed specifically at something like Jason Gross's stuff; the reason I have concerns about this approach is the potential ratio of safety application to capability folks reading your papers and having lightbulbs go off.

But my original claim rests on the ratio of capabilities-enhancement-rate to P(starkly-superintelligence-robust-cosmopolitan-value-seeking safety gets solved). And that continues to look quite bad to me, despite that the prosaic safety seems to be going somewhat well internally, and that there's a possibility of pivoting to make use of capabilities breakthroughs to get asymptotic alignment-seeking behavior. What looks concerning is the rate of new bees to aim improvement.

I'm volunteering at Odyssey; would love to spend a few minutes chatting if you'll be there.

^{^}
(I haven't asked either)
^{^}
relevant kosoy threads: learning-theoretic agenda; alignment metastrategy
^{^}
abram threads: tiling/understanding trust so the thing kosoy is doing doesn't slip off on self modify
^{^}
scale free agency, or whatever that turns into so the thing kosoy is doing can be built in terms of it
^{^}
wentworth's top level framing, and the intuitions from natural latents turning into something like the scale free agency stuff, or so
^{^}
seems like maybe something in the vague realm of SLT might work for making it practical to get kosoy LT agenda to stick to deep learning? this is speculative from someone (me) who's still trying but struggling to grok both

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments