the gears to ascenscion

Will Capabilities Generalise More?

evolution isn't exactly purposeless; it has very little purpose, but to the degree a purpose can be described at all, things evolve to survive in competition. that's more than nothing. the search process is mutation, and the selection process is <anything that survives>. inferring the additional constraints this purpose implies seems potentially fruitful: non-local optimizers like ourselves can look at that objective and design constraints that unilaterally increase durability. because we can reason over game theory in general, we're not constrained to only evolutionary game theory; for example, we can make tit-for-tat-with-forgiveness more durable by noticing that it tends to get replaced by unconditional cooperation once society is mostly cooperative, and we can deliberately reintroduce tit-for-tat-with-forgiveness into contexts where cooperate-bot-style reasoning has taken over.
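as a toy illustration of that last dynamic (a minimal sketch using the standard iterated prisoner's dilemma payoffs; the strategy names and numbers are my own illustration, not anything from the post): tit-for-tat-with-forgiveness is behaviorally identical to a cooperate-bot as long as everyone cooperates, so cooperate-bots drift in neutrally, and only once a defector shows up does the difference matter:

```python
import random

# Standard iterated prisoner's dilemma payoffs for (my_move, their_move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def all_c(my_history, their_history):
    return "C"  # cooperate-bot

def all_d(my_history, their_history):
    return "D"  # defect-bot

def tft_forgive(my_history, their_history, forgive_p=0.1):
    # tit-for-tat with forgiveness: copy the opponent's last move,
    # but occasionally cooperate even after a defection.
    if not their_history or their_history[-1] == "C":
        return "C"
    return "C" if random.random() < forgive_p else "D"

def play(strat_a, strat_b, rounds=200):
    ha, hb, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        ma, mb = strat_a(ha, hb), strat_b(hb, ha)
        score_a += PAYOFF[(ma, mb)]
        score_b += PAYOFF[(mb, ma)]
        ha.append(ma)
        hb.append(mb)
    return score_a, score_b

random.seed(0)
# against a cooperator, tit-for-tat-with-forgiveness is indistinguishable
# from a cooperate-bot, so cooperate-bots can replace it by neutral drift:
print(play(tft_forgive, all_c))  # (600, 600)
# but once a defector appears, the cooperate-bot is exploited every round,
print(play(all_c, all_d))        # (0, 1000)
# while tit-for-tat-with-forgiveness mostly holds the line.
print(play(tft_forgive, all_d))
```

the point of the sketch is the first two lines of output: inside a cooperative society the two strategies earn identical scores, which is exactly why the defection-resistant one can silently disappear.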

Will Capabilities Generalise More?

my objection to this objection is that, for the most part, we don't have the option not to pick the best feedback signal available at any given time. from a systems perspective, alignment only generalizes strongly if it improves capability enough for the relevant system to survive in competition with other systems. this is true at many scales of systems, and always for the same reason: competition between systems means the most adaptive approach wins. a common mistake is to read "adaptive" as "growth/accumulation/capture rate", but what it really means is "durability per unit efficiency": the instrumental drive to capture resources as fast as possible is fundamentally a decision theory error made by local optimizers.

to consider a specific example of systems with this decision theory error: a limitation when gene-driving mosquitos is that if the genes you add don't make the modified mosquitos sufficiently more adaptive, they'll simply die out. you'd need to perform some sort of trade, where you offer the modified mosquitos a modified non-animal food source that only they can eat, and that somehow can't be separated from the gene drive; in effect, a genetic update rule that reliably produces cooperation between species. if you can offer this, then mosquitos that become modified will be durably more competitive, because they have access to food sources that would poison unmodified mosquitos, and they incrementally stop threatening humans, so humans would no longer be seeking a way to destroy the species entirely. but it only works if you can get the mosquitos to coordinate en masse, and any mutation that turns a mosquito into a defector against mosquito-veganism needs to be stopped in its tracks. the mosquito swarm has to reproduce away the interspecies defection strategy and then not allow it to return, while simultaneously preserving the species.

similarly, in most forms of ai safety there are at least three major labs you need to convince: deepmind, openai, and <whatever is going on over in china>. there are also others that will replicate experiments, and some that will perform high-quality experiments with somewhat less compute funding. between all of them, you have to come up with a mechanism of alignment that improves capability and that is also convergent about the alignment: if your alignment system doesn't get better alignment-durability/watt as a result of capability improvement, you haven't actually aligned anything, just papered over a problem. to some degree you can hope that one of these labs gets there first; but because capability growth is incremental, it's looking less and less likely that there will be a single watershed moment where a lab pulls so far ahead that no competition can be mounted. and in the absence of such a decisive lead, defense of friendly systems needs to become stronger than inter-system attack.

(by system, again, I mean any organism or meta-organism or neuron or cell or anything in between.)

one example goal for an aligned planetary system of beings is to take control of the ecosystem enough to solve global warming. but in order to do that without screwing everything up, we need a clear picture of which forms of interference with which parts of the universe are acceptable: some clear notion of multi-tenant ownership that lets multiple subsystems interface to determine what their requirements are for their adjacent systems.

I find it notable and interesting that anthropic's recent interpretability research (the SoLU paper) focuses on isolating individual neurons' concept ownership, so that the privileged basis keeps them from interfering with each other. I'm intentionally stretching how far I can generalize this, but I really think this direction of reasoning has something interesting to say about ownership of matter as well. local internal coherence of matter ownership is a core property of a human body that should not be violated; while it's hard to precisely identify when it's been violated subtly, sudden death is an easy-to-identify example of a state transition where the local informational process of a human existing has suddenly ceased and the low-entropy complexity was lost. at the same time, anthropic's paper is related to previous work on compressibility; as discussed in that paper, attempting to improve interpretability ultimately boils down to attempting to improve the representation quality until it reaches a coherent, distilled structure that can be understood.

I'd argue that improvements to interpretability focused on coherent binding to physical variables are inherently connected to improving the formalizability of the functions a neural network represents, and that that kind of improvement has the potential to allow binding the optimality of your main loss function more accurately to the functions you intended to optimize in the first place.

So then my question becomes - what competitive rules do we want to apply to all scales (within bacteria, within a neural network, within an insect, within a mammal, within a species, within a planet, between planets), in order to get representations at every scale that coherently describe what dynamics are acceptable interference and what are not?

again, I'm pulling together tenuous threads that I can't quite tie properly, and some of the links might be invalid. I'm a software engineer first, research-ideas generator second, and I might be seeing ghosts. but I suspect that somewhere in game theory/game dynamics there's an insight about how to structure competition in constructed processes that would allow describing how to teach the universe to remember everything anyone ever considered beautiful, or something along those lines.

If this thread is of interest, I'd like to discuss it with more people. I've got some links in other posts as well.

Assessing AlephAlpha's Multimodal Model

that dude looks pretty stressed out about his confusion to me

[Yann Lecun] A Path Towards Autonomous Machine Intelligence

great to see. as important as safety research is, if we don't get capabilities in time, most of humanity is going to be lost. long-termism requires aiming to preserve today's archeology, or the long-term future we hoped to preserve will be lost anyway. safety is also critical: differential acceleration of safe capabilities is important, so let's use this to try to contribute to capable safety.

I just wish lecun saw that facebook is catastrophically misaligned.

Strong Votes [Update: Deployed]

on my shortform I used self-strong-upvote to sort the videos list in a way that other users could vote on

the gears to ascenscion's Shortform

various notes from my logseq lately I wish I had time to make into a post (and in fact, may yet):

  • international game theory, aka [[defense analysis]], is interesting because a strategy needs to simply be so convincingly good that you can just talk about it and everyone can personally verify it's actually a better idea than what they were doing before
  • a guide to how I use [[youtube]], as a post, upgraded from shortform and with detail about how I found the channels as well.
  • summary of a few main points of my views on [[safety]]. eg summarize tags
    • [[conatus]], [[complexity]], [[morality]], [[beauty]], [[memory]], [[ai safety]]
  • summary of [[distillation]] news
  • what would your basilisk do? okay, and how about the ideal basilisk? what would an anti-authoritarian basilisk do? what would the basilisk who had had time to think about [[restorative justice]] do?
  • [[community inclusion currencies]]
  • what augmentation allows a single human to reach [[alphago]] level using [[interpretability]] tools?
  • [[constructivist vs destructivist]] systemic change
  • summarize my Twitter (dupe of the rest of the list?)

the gears to ascenscion's Shortform

My argument is that faithful exact brain uploads are guaranteed not to help unless you had already solved AI safety anyhow. I do think we can solve ai extinction risk anyhow, but it requires us not only to prevent AI that does not follow orders, but also to prevent AI from "just following orders" to do things that some humans value but which abuse others. if we fall too far into the latter attractor - which we are at immediate risk of doing, well before stably self-reflective AGI ever happens - we become guaranteed to go extinct shortly, as corporations increasingly become just an ai and a human driver. eventually the strongest corporations are abusing larger and larger portions of humanity with one human at the helm. then one day ai can drive the entire economy...

it's pretty much just the slower version of yudkowsky's concerns. I think he's wrong that self-distillation will be a quick snap-down onto the manifold of high-quality hypotheses, but other than that I think he's on point. and because of that, I think the incremental behavior of the market is likely to pull us into a defection-only game theory hole as society's capabilities melt in the face of increased heat and chaos at various scales of the world.

the gears to ascenscion's Shortform

the metaculus community prediction is terribly calibrated, and not by accident - it's simply the median of individual predictions. it's normal to find you disagree with the median prediction by a lot.
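the point can be seen with a toy aggregation (made-up forecast numbers, and a simplification: metaculus actually uses a recency-weighted variant, but the shape of the argument is the same):

```python
import statistics

# Hypothetical individual forecasts (percent) on one binary question.
forecasts = [5, 10, 20, 45, 60, 80, 95]

# The "community prediction" is (roughly) just the median of these.
community = statistics.median(forecasts)
print(community)  # 45

# Most individual forecasters sit far from that median, so strongly
# disagreeing with the community number is the typical case, not an anomaly.
print(max(abs(p - community) for p in forecasts))  # 50
```

a median summarizes where the middle forecaster sits; it says nothing about how spread out the forecasters are, so large personal disagreement with it carries little information by itself.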

What if the best path for a person who wants to work on AGI alignment is to join Facebook or Google?

potentially relevant:

my view is that approaches within google, such as AWU (the Alphabet Workers Union), are useful brakes on the terribleness, but not likely to have enough impact to offset how fast google is getting ahead in capabilities. I'd personally suggest that even starting your own capabilities lab is more useful for safety than joining google - capabilities mixed with safety is what, eg, anthropic is doing (according to me; maybe anthropic doesn't see their work as advancing capability).

the gears to ascenscion's Shortform

agreed. realistically we'd only approach anything resembling WBE by attempting behavior-cloning AI, which nicely demonstrates the issue you'd have after becoming a WBE. my point in making this comment is simply that it doesn't even help in theory, assuming we somehow manage not to make an agent ASI and instead go straight for advanced neuron emulation. if we really, really tried, it is possible to go for WBE first, but at this point it's pretty obvious we can reach hard ASI without it, so nobody in charge of a team like deepmind is going to go for WBE when they can just focus directly on ai capability plus a dash of safety to make the nerds happy.
