I've definitely also seen the failure mode where someone is only or too focused on "the puzzles of agency" without having an edge in linking those questions up with AI risk/alignment. Some ways of asking about/investigating agency are more and less relevant to alignment, so I think it's important that there is a clear/strong enough "signal" from the target domain (here: AI risk/alignment) to guide the search/research directions.
I disagree—I think that we need more people on the margin who are puzzling about agency, relative to those who are backchaining from a particular goal in alignment. Like you say elsewhere, we don’t yet know what abstractions make sense here; without knowing what the basic concepts of "agency" are it seems harmful to me to rely too much on top-down approaches, i.e., ones that assume something of an end goal.
In part that’s because I think we need higher variance conceptual bets here, and I think that over-emphasizing particular problems in alignment risks correlating people's minds. In part it's because I suspect that there are surprising, empirical things left to learn about agency that we'll miss if we prefigure the problem space too much.
But also: many great scientific achievements have been preceded by bottom-up work (e.g., Shannon, Darwin, Faraday), and afaict their open-ended, curious explorations are what laid the groundwork for their later theories. I feel that it is a real mistake to hold all work to the same standards of legible feedback loops/backchained reasoning/clear path to impact/etc, given that so many great scientists did not follow this. Certainly, once we have a bit more of a foundation this sort of thing seems good to me (and good to do in abundance). But I think before we know what we’re even talking about, over-emphasizing narrow, concrete problems risks the wrong kind of conceptual research—the kind of “predictably irrelevant” work that Alexander gestures towards.
From my perspective, meaningfully operationalizing “tool-like” seems like A) almost the whole crux of the disagreement, and B) really quite difficult (i.e., requiring substantial novel scientific progress to accomplish), so it seems weird to leave as a simple to-do at the end.
Like, I think that “tool versus agent” shares the same confusion that we have about “non-life versus life”—why do some pieces of matter seem to “want” things, to optimize for them, to make decisions, to steer the world into their preferred states, and so on, while other pieces seem to “just” follow a predetermined path (algorithms, machines, chemicals, particles, etc.)? What’s the difference? How do we draw the lines? Is that even the right question? I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.
Concrete operationalizations seem great to ask for, when they’re possible to give—but I suspect that expecting/requesting them before they’re possible is more likely to muddy the discourse than clarify it.
Even if humanity isn't like, having a huge mood shift, I do still expect the next 10 years to have a lot more people working on stuff that actually helps than the previous 10 years.
What kinds of things are you imagining, here? I'm worried that on the current margin people coming into safety will predominately go into interpretability/evals/etc because that's the professional/legible thing we have on offer, even though by my lights the rate of progress and the methods/aims/etc of these fields are not nearly enough to get us to alignment in ~10 years (in worlds where alignment is not trivially easy, which is also the world I suspect we're in). My own hope for another ten years is more like "that gives us some space and time to develop a proper science here," which at the current stage doesn't feel very bottlenecked by number of people. But I'm curious what your thoughts are on the "adding more people pushes us closer to alignment" question.
"Intelligence" can be characterized with a similar level of theoretical precision as e.g., heat, motion, and information. (In other words: it's less like a messy, ad-hoc phenomenon and more like a deep, general fact about our world.)
In particular, I think their usage of Dario's statements on x-risk as a rhetorical weapon against RSPs creates a structural disincentive against lab heads being clear about existential risk.
I’m not sure how to articulate this, exactly, but I want to say something like “it’s not on us to make sure the incentives line up so that lab heads state their true beliefs about the amount of risk they’re putting the entire world in.” Stating their beliefs is just something they should be doing, on a matter this important, no matter the consequences. That’s on them. The counterfactual world, where they keep quiet or are unclear in order to hide their true (and alarming) beliefs about the harm they might impose on everyone, is deceptive. And it is indeed pretty unfortunate that the people who are most clear about this (such as Dario) will get the most pushback. But if people are upset about what they’re saying, then they should still be getting the pushback.
Thanks for making this dialogue! I’ve been interested in the science of uploading for a while, and I was quite excited about the various C. elegans projects when they started.
I currently feel pretty skeptical, though, that we understand enough about the brain to know which details will end up being relevant to the high-level functions we ultimately care about. I.e., without a theory telling us things like “yeah, you can conflate NMDA receptors with AMPA, that doesn’t affect the train of thought” or whatever, I don’t know how one decides what details are and aren’t necessary to create an upload.
You mention that we can basically ignore everything that isn’t related to synapses or electricity (i.e., membrane dynamics), because chemical diffusion is too slow to account for the speed of cognitive reaction times. But as Tsvi pointed out, many of the things we care about occur on longer timescales. Like, learning often occurs over hours, and is sometimes not stored in synapses or membranes; e.g., in C. elegans some of the learning dynamics unfold in the protein circuits within individual neurons (not in the connections between them).[1] Perhaps this is a strange artifact of C. elegans, but at the very least it seems like a warning flag to me; it’s possible to skip over low-level details which seem like they shouldn’t matter, but end up being pretty important for cognition.
That’s just one small example, but there are many possibly relevant details in a brain… Does the exact placement of synapses matter? Do receptor subtypes matter? Do receptor kinetics matter, e.g., does it matter that NMDA is a coincidence detector? Do oscillations matter? Dendritic computation? Does it matter that the Hodgkin-Huxley model assumes a uniform distribution of ion channels? I don’t know! There are probably loads of things that you can abstract away, or conflate, or not even measure. But how can you be sure which ones are safe to ignore in advance?
This doesn't make me bearish on uploading in general, but it does make me skeptical of plans which don't start by establishing a proof of concept. E.g., if it were me, I’d finish the C. elegans simulation first, before moving on to larger brains. Both because it seems important to establish that the details that you’re uploading in fact map onto the high-level behaviors that we ultimately care about, and because I suspect that you'd sort out many of the kinks in this pipeline earlier on in the project.
[1] “The temperature minimum is reset by adjustments to the neuron’s internal signaling; this requires protein synthesis and takes several hours” and “Again, reprogramming a signaling pathway within a neuron allows experience to change the balance between attraction and repulsion.” Principles of Neural Design, page 32, under the section “Associative learning and memory.” (As far as I understand, these internal protein circuits are separate from the transmembrane proteins.)
Yeah :/ I've struggled for a long time to see how the world could be good with strong AI, and I've felt pretty alienated in that. Most of the time when I talk to people about it they're like "well the world could just be however you like!" Almost as if, definitionally, I should be happy because in the really strong success cases we'll have the tech to satisfy basically any preference. But that's almost the entire problem, in some way? As you say, figuring things out for ourselves, thinking and learning and taking pride in skills that take effort to acquire... most of what I cherish about these things has to do with grappling with new territory. And if I know that it is not in fact new, if all of it could be easier were I to use the technology right there... it feels as though something is corrupted... The beauty of curiosity, wonder, and discovery feels deeply bound to the unknown, to me.
I was talking to a friend about this a few months ago and he suggested that, because many humans have these preferences, we ought to be able to make a world where we satisfy them, e.g., something like "the AI does its thing over there and we sit over here having basically normal human lives except that death is a choice and sometimes it helps us figure out hard coordination problems or whatever." And I can almost get behind this, but something still feels off to me. Like how when people get polarized through social media it almost seems like there's no going back? How do we know strange spirals won't happen with an even more advanced technology? It's hard to escape the feeling that a dystopia lurks. Hard to escape the feeling that all the people I know and love might change quickly and radically, that I might change radically, in ways that feel alien to me now. I want to believe that strong AI would be great, and perhaps it would be, perhaps I'm missing something here. But a part of me is terrified.
Thanks for writing this post—I appreciate the candidness about your beliefs here, and I agree that this is a tricky topic. I, too, feel unsettled about it on the object level.
On the meta level, though, I feel grumpy about some of the framing choices. There’s this wording which both you and the original ARC evals post use: that responsible scaling policies are a “robustly good compromise,” or, in ARC’s case, that they are a “pragmatic middle ground.” I think these stances take for granted that the best path forward is compromising, but this seems very far from clear to me.
Like, certainly not all cases of “people have different beliefs and preferences” are ones where compromise is the best solution. If someone wants to kill me, I’m not going to be open to negotiating about how many limbs I’m okay with them taking. This is obviously an extreme example, but I actually don’t think it’s that far off from the situation we find ourselves in now where, e.g., Dario gives a 10-25% probability that the sort of technology he is advancing will either cause massive catastrophe or end the human race. When people are telling me that their work has a high chance of killing me, it doesn't feel obvious that the right move is “compromising” or “finding a middle ground.”
The language choices here feel sketchy to me in the same way that the use of the term “responsible” feels sketchy to me. I certainly wouldn’t call the choice to continue building the unsettling-chance-of-annihilation machine responsible. Perhaps it’s more responsible than the default, but that’s a different claim and not one that is communicated in the name. Similarly, “compromise” and “middle ground” are the kinds of phrases that seem reasonable from a distance, but if you look closer they’re sort of implicitly requesting that we treat “keep racing ahead to our likely demise” as a sensible option.
“Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me.”
This seems to me to misrepresent the argument. At the very least, it misrepresents mine. It’s not that I’m fighting an improvement to the status quo, it’s that I don’t think responsible scaling policies are an improvement if they end up being confused for sufficient progress.
Like, in the worlds where alignment is hard, and where evals do not identify the behavior which is actually scary, then I claim that the existence of such evals is concerning. It’s concerning because I suspect that capabilities labs are more incentivized to check off the “passed this eval” box than they are to ensure that their systems are actually safe. And in the absence of a robust science of alignment, I claim that this most likely results in capabilities labs goodharting on evals which are imperfect proxies for what we care about, making systems look safer than they are. This does not seem like an improvement to me. I want the ability to say what’s actually true, here: that we do not know what’s going on, and that we’re building a godlike power anyway.
And I’m not saying that this is the only way responsible scaling policies could work out, or that it would necessarily be intentional, or that nobody in capabilities labs takes the risk seriously. But it seems like a blindspot to neglect the existence of the incentive landscape, here, one which is almost certainly affecting the policies that capabilities labs establish.
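A toy sketch of the goodharting worry above, under entirely made-up assumptions: each hypothetical system has a hidden `true_safety` value, the eval only observes a noisy proxy of it, and a lab ships whichever systems score highest on the eval. The function name, the Gaussian noise model, and all the numbers are illustrative choices, not anything drawn from a real eval or a real lab.

```python
import random
import statistics


def simulate_eval_selection(n_systems=10_000, top_k=100, noise_sd=1.0, seed=0):
    """Select the top_k systems by a noisy eval score.

    Returns (mean eval score, mean true safety) of the selected systems.
    Everything here is a hypothetical toy model, not a real eval.
    """
    rng = random.Random(seed)
    systems = []
    for _ in range(n_systems):
        true_safety = rng.gauss(0.0, 1.0)  # what we actually care about (hidden)
        eval_score = true_safety + rng.gauss(0.0, noise_sd)  # imperfect proxy
        systems.append((eval_score, true_safety))

    # Optimize for the proxy: keep only the systems with the best eval scores.
    selected = sorted(systems, reverse=True)[:top_k]

    mean_eval = statistics.fmean(s[0] for s in selected)
    mean_true = statistics.fmean(s[1] for s in selected)
    return mean_eval, mean_true


if __name__ == "__main__":
    mean_eval, mean_true = simulate_eval_selection()
    print(f"mean eval score of selected systems:  {mean_eval:.2f}")
    print(f"mean true safety of selected systems: {mean_true:.2f}")
```

In this toy model the selected systems look better on the eval than they are on the hidden quantity, and the gap only appears because selection pressure is applied to the proxy; with `noise_sd=0.0` the two means coincide. It says nothing about how large the effect is for real evals, only that the direction of the bias follows from the incentive structure.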
I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.
I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power under a timer of unknown length. This is understandable, and I think can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to ensure things like “if a system fails such and such metric, we'll stop.”
But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements (those which enable us to predict the behavior and safety of AI systems with anywhere near the level of precision we have when we say “it’s safe for you to get on this plane”) will be replaced by the easier problem of using the measurements we already have, or those which are close by; ones which are at best only proxies and at worst almost completely unrelated to what we ultimately care about.
And I think it is easy to forget, in an environment where we are continually churning out things like evaluations and metrics, how little we in fact know. When people see a sea of ML papers, conferences, math, numbers, and “such and such system passed such and such safety metric,” it conveys an inflated sense of our understanding, not only to the public but also to ourselves. I think this sort of dynamic can create a Red Queen’s race of sorts, where the more we demand concrete proposals (in a domain we don’t yet actually understand), the more pressure we’ll feel to appear as if we understand what we’re talking about, even when we don’t. And the more we create this appearance of understanding, the more concrete asks we’ll make of the system, and the more inflated our sense of understanding will grow, and so on.
I’ve seen this sort of dynamic play out in neuroscience, where in my experience the ability to measure anything at all about some phenomenon often leads people to prematurely conclude we understand how it works. For instance, reaction times are a thing one can reliably measure, and so is EEG activity, so people will often do things like… measure both of these quantities while manipulating the number of green blocks on a screen, then call the relationship between these “top-down” or “bottom-up” attention. All of this despite having no idea what attention is, and hence no idea if these measures in fact meaningfully relate much to the thing we actually care about.
There are a truly staggering number of green block-type experiments in the field, proliferating every year, and I think the existence of all this activity (papers, conferences, math, numbers, measurement, etc.) convinces people that something must be happening, that progress must be being made. But if you ask the neuroscientists attending these conferences what attention is, over a beer, they will often confess that we still basically have no idea. And yet they go on, year after year, adding green blocks to screens ad infinitum, because those are the measurements they can produce, the numbers they can write on grant applications, grants which get funded because at least they’re saying something concrete about attention, rather than “I have no idea what this is, but I’d like to figure it out!”
I think this dynamic has significantly corroded academia’s ability to figure out important, true things, and I worry that if we introduce it here, we will face similar corrosion.
Zooming back in on this proposal in particular: I feel pretty uneasy about the messaging, here. When I hear words like “responsible” and “policy” around a technology which threatens to vanquish all that I know and all that I love, I am expecting things more like “here is a plan that gives us multiple 9’s of confidence that we won’t kill everyone.” I understand that this sort of assurance is unavailable, at present, and I am grateful to Anthropic for sharing their sketches of what they hope for in the absence of such assurances.
But the unavailability of such assurance is also kind of the point, and one that I wish this proposal emphasized more… it seems to me that vague sketches like these ought to be full of disclaimers like, “This is our best idea but it’s still not very reassuring. Please do not believe that we are safely able to prevent you from dying, yet. We have no 9’s to give.” It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
And I worry that in the absence of such a story, where the true plan is something closer to “fill in the blanks as we go,” a mounting pressure to color in such blanks will create a vacuum, and that we will begin to fill it with the appearance of understanding rather than understanding itself; that we will pretend to know more than we in fact do, because that’s easier to do in the face of a pressure for results, easier than standing our ground and saying “we have no idea what we’re talking about.” I worry that the focus on concrete asks and concrete proposals will place far too much emphasis on what we can find under the streetlight, and will end up giving us an inflated sense of our understanding, such that we stop searching in the darkness altogether, forget that it is even there…
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—in the absence of a realistic plan to get those, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted.
LessWrong.com is my favorite website. I’ve tried having thoughts on other websites and it didn't work. Seriously, though: I feel very grateful for the effort you all have put into making this an epistemically sane environment. I have personally benefited a huge amount from the intellectual output of LW. I feel smarter, saner, and more capable of positively affecting the world, not to mention all of the gears-level knowledge I’ve learned, and model building I’ve done as a result, which has really been a lot of fun :) And when I think about what the world would look like without LessWrong.com I mostly just shudder and then regret thinking of such dismal worlds.
Some other thoughts of varying import: