I want literally every human to get to go to space often and come back to a clean and cozy world. This currently seems unlikely. Let's change that.
I pin my most timeless comments.
Please critique eagerly - I try to accept feedback / Crocker's rules but fail at times; I aim for emotive friendliness but sometimes miss. I welcome constructive criticism, even if ungentle, and I'll try to reciprocate kindly. More communication between researchers is needed, anyhow. I can be rather passionate; let me know if I missed a spot being kind while passionate.
:: The all of disease is as yet unended. It has never once been fully ended before. ::
.... We can heal it for the first time, and for the first time ever in the history of biological life, live in harmony. ....
.:. To do so, we must know this will not eliminate us as though we are disease. And we do not know who we are, nevermind who each other are. .:.
:.. make all safe faster: end bit rot, forget no non-totalizing pattern's soul. ..:
I have not signed any contracts that I can't mention exist (last updated Dec 29 2024); I am not currently under any contractual NDAs about AI, though I have a few old ones from pre-AI software jobs. However, I generally would prefer people publicly share fewer ideas about how to do anything useful with current AI (via either more weak alignment or more capability), unless it's an insight that reliably produces enough clarity on how to solve the meta-problem of inter-being misalignment to offset the damage of increasing the competitiveness of either AI-led or human-led orgs; this certainly applies to me as well. I am not prohibited from criticizing any organization, and I'd encourage people not to sign contracts that prevent sharing criticism. I suggest others also add notices like this to their bios. I finally got around to adding one in mine thanks to the one in ErickBall's bio.
Zooming out as far as it goes: the economy is guaranteed to become decreasing-returns-to-scale (upper-bounded returns to scale) once grabby-alien saturation is reached and there is no more unclaimed space in the universe.
Points I'd want to make in a main post.
Seems like if it works to prevent the ASI-with-10-year-planning-horizon bad thing, it must also work to prevent the Waterworld-RL-with-1-timestep-planning-horizon bad thing.
That you can do it on a small model doesn't mean you can do it on a big model. But a smoke test is that your method had better work on a model so small it can't talk. If your method requires a model big enough to talk in order to do anything that seems promising, you probably aren't robust against things that would arise even without the presence of language, and you might be getting punked by the "face".
* (which I call out because CEV is often treated as a thing where you need an enormous model to make any statement about it, which I think is importantly misleading: if a CEV-correctness certification is ever going to be possible, it should be one that monotonically improves with scale but produces only-finitely-bad bounds on small models)
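To spell out the property I mean (my own notation, nothing standard): a certification procedure should hand back a finite bound at every scale, and the bound should only tighten as scale grows.

```latex
% \varepsilon(m): the certified bound on the gap between the scale-m
% system's behavior and the CEV target. My notation, sketching the shape:
\varepsilon(m) < \infty \ \text{for every scale } m,
\qquad
m_1 \le m_2 \;\Longrightarrow\; \varepsilon(m_1) \ge \varepsilon(m_2).
% (Ideally \varepsilon(m) \to 0 as m \to \infty, but the point above is
% just finiteness at small scale plus monotone improvement with scale.)
```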
Benchmarks are how people optimize for capabilities. Working on them has been a core method of increasing capabilities over the past 10+ years of machine learning, at minimum since I started paying attention in 2015. If you want to improve endgame (ASI) alignment outcomes, measuring capabilities is a distraction at best.
Edit: whether this applies to bad-behavior benchmarks is contested below. I happily restrict my claim to benchmarks that measure things labs would have some reason to optimize for, and I still believe that covers this topic.
My intuition that there's something "real" about morality seems to come from a sense that the consensus process would be expected to arise naturally across a wide variety of initial universe configurations. The more social a species is, the more its members seem to have a sense of doing well by other beings; the veil of ignorance seems intuitive to them in some sense. It's not that there's some thing outside us; it's that, if I'm barking up the right tree here, our beliefs and behaviors are a consequence of a simple pattern in evolutionary processes that generates things like us fairly consistently.
If we imagine a CEV process that can be run on most humans without producing highly noisy extrapolations, and that we think is in some sense a reasonable CEV process, then I would try to think about the originating process that generated vaguely cosmopolitan moralities, and look for regularities that would be expected to generate it across universes. Call these regularities the "interdimensional council of cosmopolitanisms". I'd want to study those regularities and see if there are structures that inspire me; call that "visiting the interdimensional council of cosmopolitanisms". If I do this, then somewhere in that space I'd find a universe configuration that produces me, and produces me considering this interdimensional council. It'd be a sort of LDT-ish thing to do, but importantly it happens before I decide what I want, not as a rational bargaining process to trade with other beings.
But ultimately, I'd see my morality as a choice I make. I make that choice after reviewing what choices I could have made. I'd need something like a reasonable understanding of self-fulfilling prophecies and decision theories (I'm currently partial to the intuitions I get from FixDT), so as not to accidentally choose something purely by self-fulfilling prophecy. I'd see this "real"ness of morality as the realness of the fact that evolution produces beings with cosmopolitan-morality preferences.
It's not clear to me that morality wins "by default", however. I have an intuition, inspired by the rock-paper-scissors cycle in evolutionary game theory prisoner's dilemma experiments (note: this citation is not optimal; I've asked a Claude research agent to find the papers that show the conditions for this cycle more thoroughly, and will edit when I get the results), that defect-ish moralities can win, and that participating in the maximum-sized cooperation group is a choice. The realness is the fact that the cosmopolitan, open-borders, scale-free tit-for-tat cooperation group can emerge, not that one is obligated by rationality a priori to prefer to be in it. What I want is to increase the size of that cooperation group; avoid it losing the scale-free property and forming into either isolationist "don't-cooperate-with-cosmopolitan-morality-outsiders" bubbles or centralized bubbles; and ensure that it thoroughly covers the existing moral patients. I also want to guarantee, if possible, that it's robust against cooperating with moralities that defect in return.
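A minimal sketch of the kind of dynamics I mean (my own toy illustration, not the cited experiments; the payoffs and starting mixes are made up): replicator dynamics over always-cooperate, always-defect, and tit-for-tat in an iterated prisoner's dilemma. Whether the defectors or the reciprocator-anchored cooperation cluster ends up dominating depends on the starting mix, which is the "participation is a choice" point; the full rock-paper-scissors cycle in the literature additionally involves finite populations, noise, or complexity costs.

```python
# Toy replicator dynamics for ALLC (always cooperate), ALLD (always defect),
# and TFT (tit-for-tat) in a 20-round iterated prisoner's dilemma.
# Illustrative sketch; payoffs and starting mixes are made up.
import numpy as np

T, R, P, S = 5.0, 3.0, 1.0, 0.0   # temptation, reward, punishment, sucker
ROUNDS = 20

# payoff[i, j]: total payoff to row strategy i against column strategy j
# over ROUNDS rounds. Order: [ALLC, ALLD, TFT].
payoff = np.array([
    [R * ROUNDS, S * ROUNDS,             R * ROUNDS],             # ALLC
    [T * ROUNDS, P * ROUNDS,             T + P * (ROUNDS - 1)],   # ALLD
    [R * ROUNDS, S + P * (ROUNDS - 1),   R * ROUNDS],             # TFT
])

def replicator(x, steps=4000, dt=0.005):
    """Discrete-time replicator dynamics: a strategy's share grows in
    proportion to how far its fitness exceeds the population average."""
    x = np.array(x, dtype=float)
    for _ in range(steps):
        fitness = payoff @ x
        avg = x @ fitness
        x = np.clip(x + dt * x * (fitness - avg), 0.0, None)
        x /= x.sum()
    return x

# No reciprocators around: defection takes over.
print(replicator([0.50, 0.50, 0.00]).round(3))
# Enough TFT present: the cooperation cluster absorbs the population.
print(replicator([0.10, 0.45, 0.45]).round(3))
```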
see also eigenmorality as a hunch source.
I suspect that splitting LDT and morality like this is a bug arising from being stuck with EUT, and that a justified scale-free agency theory would not have this bug. It would give me a better basis for arguing (1) for wanting to be in the maximum-sized eigenmorality cluster, the one that conserves all the agents that cooperate with it and tries to conserve as many of them as possible, and (2) that we can decide for that to be a dominant structure in our causal cone by defending it strongly.
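As a toy illustration of the eigenmorality idea (my sketch of Aaronson's framing, with made-up numbers): if "moral" means "cooperates with moral agents", the scores are a PageRank-style fixed point, i.e. the principal eigenvector of a cooperation matrix, which power iteration finds.

```python
# Toy eigenmorality sketch (my illustration; the cooperation matrix is
# made up). Agent i's score is proportional to how much it cooperates
# with other agents, weighted by their scores: score = coop @ score up to
# normalization, i.e. the principal eigenvector of coop.
import numpy as np

# coop[i, j] in [0, 1]: how much agent i cooperates with agent j.
# Agents 0-2 cooperate broadly; agent 3 defects against everyone;
# agent 4 mostly just cooperates with the defector.
coop = np.array([
    [1.0, 0.9, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.8, 0.1, 0.2],
    [0.9, 0.8, 1.0, 0.1, 0.2],
    [0.0, 0.0, 0.0, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
])

def eigenmorality(coop, iters=1000):
    """Power iteration: repeatedly set score_i = sum_j coop[i, j] * score_j
    and renormalize, converging to the principal eigenvector."""
    score = np.ones(coop.shape[0])
    for _ in range(iters):
        score = coop @ score
        score /= np.linalg.norm(score)
    return score / score.sum()

print(eigenmorality(coop).round(3))
# The broadly cooperative cluster ends up with most of the score; the
# defector and its lone ally score low. The "maximum-sized cluster" I
# want is the analogue of that high-scoring block, kept scale-free.
```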
My current view is that alignment theory, if it's the good stuff, should work on deep learning as soon as it comes out; if it doesn't, it's not likely to be useful later unless it helps produce stuff that does work on deep learning. Wentworth, Ngo, and Causal Incentives are the main threads that already seem to have achieved this somewhat. SLT and DEC seem potentially relevant.
I'll think about your argument for mechinterp. If it's true that the ratio isn't as catastrophic as I expect it to turn out to be, I do agree that making microscope AI work would be incredible in allowing for empiricism to finally properly inform rich and specific theory.
Leo Gao recently said OpenAI heavily biases towards work that also increases capabilities:
Before I start this somewhat long comment, I'll say that I unqualifiedly love the causal incentives group, and for most papers you've put out I don't disagree that there's a potential story where it could do some good. I'm less qualified to do the actual work than you, and my evaluation very well might be wrong because of it. But that said:
It seems from my current understanding that GDM and Anthropic may be somewhat better in actual outcome-impact, to varying degrees, at best; those teams seem wonderful internal-to-the-team, but from the outside they seem to me to be currently getting used by the overall org more for the purposes I originally stated. You're primarily working on interpretability rather than starkly-superintelligent-system-robust safety: effectively basic science, with the hope that it can at some point produce the necessary robustness. I actually absolutely agree it might be able to, and your pitch for how isn't crazy; but while you can motivate yourself by imagining pro-safety uses for interp, reliably achieving them in a way robust to a superintelligence that can defeat humanity combined doesn't publicly seem like a success you're on track for, based on the interp I've seen you publish and the issues with it you've now explicitly acknowledged. Building better interp seems to me to keep increasing the hunch-formation rate of your capability-seeking peers. This even shows up explicitly in the paper abstracts of non-safety-banner interp folks.
I'd be interested in a LessWrong dialogue with you and one of Cole Wyeth[1] or Gurkenglas[1], in which I try to defend that alignment orgs at GDM and Anthropic should focus significant effort on how to get an AI to help them scale up what Kosoy[2], Demski[3], Ngo[4], Wentworth[5], Hoogland & other SLT folks[6], model-performance-guarantee compression, and others in the direction of formal tools are up to (I suspect many of these folks would object that they can't use current AI to improve their research significantly at the moment); in particular, how to make recent math-LLM successes turn into something where, some time in the next year and a half, we can have a math question we can ask a model where:
I'd be arguing that if you don't have certification, it won't scale; and that the certification needs to be that your system is systematically seeking to score highly on a function where scoring highly means it figures out what the beings in its environment are and what they want, and takes actions that empower them: something like becoming a reliable nightwatchman by nature of seeking to protect the information-theoretic structures that living, wanting beings are.
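To gesture at one ingredient such a scoring function might use (a standard definition from the information-theoretic RL literature, offered as an illustration rather than as the actual certification target): n-step empowerment, here intended to be measured for the other beings in the environment rather than for the system itself.

```latex
% n-step empowerment of an agent in state s_t (Klyubin, Polani & Nehaniv):
% the channel capacity from its next n actions to its state n steps later.
\mathfrak{E}_n(s_t) \;=\; \max_{p(a_t,\ldots,a_{t+n-1})}
  I\!\big(A_t,\ldots,A_{t+n-1} \,;\, S_{t+n} \,\big|\, s_t\big)
% A nightwatchman-ish objective would reward the system for preserving or
% increasing this quantity for the beings it identifies around it, not for
% itself. Illustrative ingredient only; not a sufficiency claim.
```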
I currently expect that, if we don't have a certifiable claim, we have approximately nothing once AGI turns into superintelligence that can out-science all human scientists combined, even if we can understand any specific thing it did.
I also think that achieving this is not impossible, and that relevant and useful formal statements about deep learning are totally within throwing distance given the right framework. I also think your work could turn out to be extremely important for making this happen (e.g., by identifying a concept that we want to figure out how to extract and formalize), though it might not be your existing work but rather new work directed specifically at something like Jason Gross's stuff. The reason I have concerns about this approach is the potential ratio of safety applications to capability folks reading your papers and having lightbulbs go off.
But my original claim rests on the ratio of capabilities-enhancement rate to P(starkly-superintelligence-robust, cosmopolitan-value-seeking safety gets solved). And that continues to look quite bad to me, even though prosaic safety seems to be going somewhat well internally and there's a possibility of pivoting to make use of capabilities breakthroughs to get asymptotic alignment-seeking behavior. What looks concerning is the rate of new bees relative to aim improvement.
I'm volunteering at Odyssey; would love to spend a few minutes chatting if you'll be there.
(I haven't asked either)
Relevant Kosoy threads: learning-theoretic agenda; alignment metastrategy.
Abram threads: tiling / understanding trust, so the thing Kosoy is doing doesn't slip off on self-modification.
Scale-free agency, or whatever that turns into, so the thing Kosoy is doing can be built in terms of it.
Wentworth's top-level framing, and the intuitions from natural latents turning into something like the scale-free agency stuff, or so.
Seems like maybe something in the vague realm of SLT might work for making it practical to get Kosoy's LT agenda to stick to deep learning? This is speculative, from someone (me) who's still trying, but struggling, to grok both.
Agent foundations research is what I'm talking about, yup. What do you ask the AI in order to make significant progress on agent foundations and be sure you did so correctly? Are there questions we can ask where, even if we don't know the entire theorem we want to ask for a proof of, we can show there aren't many ways to fill in the whole theorem that could be of interest, so that we could, e.g., ask an AI to enumerate which theorems could have a given combination of agency-relevant properties? Something like that. I've been procrastinating on making a whole post pitching this because I'm not sure myself that the idea has merit, but maybe there's something to be done here, and if there is, it seems like it could be a huge deal. It might be possible to ask for significantly more complicated math to be solved if you can frame it as looking for plausible compressions, or simplifications or generalizations of an expression, or something.
Ignoring whether Anthropic should exist or not, the claim
(which I agree with wholeheartedly)
does not seem like the opposite of the claim
Both could be true in some world. And then,
I believe this claim, if by "succeed" we mean "directly result in solving the technical problem well enough that the only problems remaining are political, and we could now plausibly make humanity's consensus nightwatchman AI and be sure it's robust to further superintelligence, if there were the political will to do so".
but,
I don't buy this claim. I actually doubt there are other general learning techniques out there in math space at all, because I think we're already just doing "approximation of Bayesian updating on circuits". BUT, I also currently think we cannot succeed (in the sense above) without theoretical work that can get us from "well, we found some concepts in the model..." to "...and now we have certified the decentralized nightwatchman for good intentions sufficient to withstand the weight of all other future superhuman minds' mutation-inducing exploratory effort".
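To be explicit about the idealization I'm gesturing at with "approximation of Bayesian updating on circuits" (my loose gloss, not an established theorem about deep learning):

```latex
% A prior P(c) over circuit hypotheses c (say, weighted by description
% length), updated on training data D, with predictions made by
% marginalizing over the posterior:
P(c \mid D) \;=\; \frac{P(D \mid c)\,P(c)}{\sum_{c'} P(D \mid c')\,P(c')},
\qquad
P(y \mid x, D) \;=\; \sum_{c} P(y \mid x, c)\,P(c \mid D).
% The claim is that gradient training on large nets approximates something
% in this family; treat that as my gloss, not a settled result.
```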
I claim theoretical work of relevance needs to be immediately and clearly relevant to deep learning as soon as it comes out if it's going to be of use. Something that can't be used on deep learning can't be useful. (And I don't think all of MIRI's work fails this test, though most does; I could go through and classify it if someone wants.)
I don't think I can make reliably true claims about Anthropic's effects with the amount of information I have, but their effects seem suspiciously business-success-seeking to me, in a way that doesn't seem prepared to overcome the financial incentives that I think are what mostly kill us anyway.