All of aysja's Comments + Replies

I think dust theory is wrong in the most permissive sense: there are physical constraints on what computations (and abstractions) can be like. The most obvious one is "things that are not in each other's light cone can't interact," and interaction is necessary for computation (setting aside acausal trades and stuff, which I think are still causal in a relevant sense, but don't want to get into rn). But there are also things like: information degrades over distance (roughly, with the number of interactions, i.e., the telephone theorem), and so you'd expect "large" comput... (read more)
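The information-degradation point can be made concrete with a toy model (my construction, not from the comment): a bit relayed through a chain of noisy channels, where the mutual information with the original bit decays toward zero with the number of hops.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy of a Bernoulli(p) variable, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def flip_prob_after(k: int, p: float) -> float:
    """Effective flip probability after relaying a bit through k
    binary symmetric channels, each flipping with probability p."""
    return (1 - (1 - 2 * p) ** k) / 2

# Mutual information between the original (uniform) bit and the copy
# k interactions away: I = 1 - H(q_k). It falls toward 0 as k grows.
for k in (1, 5, 25, 125):
    q = flip_prob_after(k, 0.1)
    print(k, round(1 - binary_entropy(q), 4))
```

This is only a caricature of the telephone theorem, but it shows the claimed direction: information that must pass through many interactions is mostly lost unless it is carried redundantly.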

I agree with you that people get sloppy with these terms, and this seems bad. But there’s something important to me about holding space for uncertainty, too. I think that we understand practically every term on this list exceedingly poorly. Yes, we can point to things in the world, and sometimes even the mechanisms underlying them, but we don’t know what we mean in any satisfyingly general way. E.g. “agency” does not seem well described to me as “trained by reinforcement learning.” I don’t really know what it is well described by, and that's the point. Pre... (read more)

At a high level I feel confused about how we get good control evaluations without having good capability evaluations. If we don't understand what the capability is like, how can we be sure that our red-teaming has caught everything? Where "sure" is more like "99.9%" rather than "70%." In general, the eval situation seems crazily offense-advantaged to me: if there's just one thing that we haven't looked for that the AI can sneak by, we've lost. The only way this changes, imo, is if we can directly measure the capabilities themselves, and just check "is it there—yes or no?" And if we don't know how to do that, then I don't understand how we can feel confident in control strategies, either.

To be clear, you will need good capability evaluation to get good control evaluation. I was just arguing that the bar for "good" might be lower than people are thinking: you just need to compete with self-elicitation, and AIs might not be that good at eliciting their own capabilities.

This doesn't seem like quite the right model of the situation. (Or at least, in my ontology, this statement is wrong without a more precise operationalization.) Suppose there are N possible strategies which you'd a priori expect to be equally good. Then, our red team finds N-1 of these strategies and tests that our defenses are robust to them. If our defenses are generic (rather than being overfit to specific strategies), then our defenses are very likely to work for the remaining strategy. More generally, we only need to test with representatively good red-team strategies and representatively good capabilities, as long as we avoid overfitting (which should be relatively doable from a technical perspective, though this might cause organizational issues). We don't have to literally test every strategy. (Thus, even if there is a huge space of possible approaches, we just need a representatively large sample of the best strategies and approaches.)

Additionally, even if our defenses aren't generic, the AI can be caught if it tries a strategy we can defend against. So, suppose there are N possible strategies and we can only defend against N-1 of these, but the AI doesn't know which strategies we can and can't defend against. Further, suppose that our defenses, when they apply, will catch the AI. Then, we have a (N-1)/N chance of catching the AI. More generally, the situation is high-stakes for both us and the AI. It's not just that "if the AI finds a strategy which it can sneak by we've lost"; it's also the case that "if we can catch the thing which the AI does on its first try, we've won" (with some important caveats on "won"). (The above supposes that the AI follows the optimal approach which gua
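The (N-1)/N claim is easy to check numerically; here's a minimal sketch under my own hypothetical setup (one uncovered strategy out of N, and an AI that picks blindly among them):

```python
import random

def catch_probability(n: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the chance the AI is caught on its first try, assuming we
    can defend against N-1 of N a-priori-equally-good strategies and the
    AI doesn't know which one we missed."""
    rng = random.Random(seed)
    caught = 0
    for _ in range(trials):
        gap = rng.randrange(n)     # the one strategy our defenses miss
        choice = rng.randrange(n)  # the AI's blind first pick
        caught += choice != gap
    return caught / trials

print(catch_probability(10))  # close to the analytic (N-1)/N = 0.9
```

The interesting lever is visible in the setup: the AI's odds improve only insofar as it can learn which strategy is uncovered, which is exactly what avoiding overfit, generic defenses are meant to prevent.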

we have some vague intuition that an abstraction like pressure will always be useful, because of some fundamental statistical property of reality (non-dependent on the macrostates we are trying to track), and that's not quite true.

I do actually think this is basically true. It seems to me that when people encounter the fact that maps are not the territory—see that macrostates are relative to our perceptual machinery or what have you—they sometimes assume that this means the territory is arbitrarily permissive of abstractions. But that seems wrong to me: the territo... (read more)

3 · Martín Soto · 10h
I'm not sure we are in disagreement. No one is negating that the territory shapes the maps (which are part of the territory). The central point is just that our perception of the territory is shaped by our perceptors, etc., and need not be the same. It is still conceivable that, due to how the territory shapes this process (due to the most likely perceptors to be found in evolved creatures, etc.), there ends up being a strong convergence so that all maps represent isomorphically certain territory properties. But this is not a given, and needs further argumentation. After all, it is conceivable for a territory to exist that incentivizes the creation of two very different and non-isomorphic types of maps. But of course, you can argue our territory is not such, by looking at its details.

I think this falls for the same circularity I point at in the post: you are defining "naturalness of a partition" as "usefulness to efficiently affect / control certain other partitions", so you already need to care about the latter. You could try to say something like "this one partition is useful for many partitions", but I think that's physically false, by combinatorics (in all cases you can always build as many partitions that are affected by another one). More on these philosophical subtleties here: Why does generalization work?
1 · Jonas Hallgren · 8d
Great comment, I just wanted to share a thought on my perception of the why in relation to the intentional stance.

Basically, my hypothesis that I stole from Karl Friston is that an agent is defined as something that applies the intentional stance to itself. Or, in other words, something that plans with its own planning capacity or itself in mind.

One can relate it to the entire membranes/boundaries discussion here on LW as well, in that if you plan as if you have a non-permeable boundary, then the informational complexity of the world goes down. By applying the intentional stance to yourself, you minimize the informational complexity of modelling the world, as you kind of define a recursive function that acts within its own boundaries (your self). You will then act according to this, and then you have a kind of self-fulfilling prophecy, as the evidence you get is based on your map which has a planning agent in it.

(Literally self-fulfilling prophecy in this case, as I think this is the "self"-loop that is talked about in meditation. It's quite cool to go outside of it.)

It seems like this is only directionally better if it’s true, and this is still an open question for me. Like, I buy that some of the commitments around securing weights are true, and that seems good. I’m way less sure that companies will in fact pause development pending their assessment of evaluations. And to the extent that they are not, in a meaningful sense, planning to pause, this seems quite bad. It seems potentially worse, to me, to have a structure legitimizing this decision and making it seem more responsible than it is, rather than just openly d... (read more)

I do think that counterfactual impact is an important thing to track, although two people discovering something at the same time doesn’t seem like especially strong evidence that they were just "leading the parade." It matters how large the set is. I.e., I doubt there were more than ~5 people around Newton’s time who could have come up with calculus. Creating things is just really hard, and I think often a pretty conjunctive set of factors needs to come together to make it happen (some of those are dispositional (ambition, intelligence, etc.), others are m... (read more)

Don't forget Edward Blyth for something in the vicinity of groundwater or on the scent.

The incremental approach bakes in a few assumptions, namely that there likely won't be any sharp left turns, that the work you do on smaller systems will apply to the next ones you build, and so on. I think the problem is that we don't know that this will hold, and that there's reason to suspect it won't. And if we do in fact live in a world where our systems undergo sudden and extreme shifts in intelligence, then the iterative approach is imo very risky.

I think “solving alignment” is not exactly the right phrase, but I suspect it’s closer to what we want,... (read more)

Control doesn't assume "no sharp left turns" or that "the work you do on smaller systems will apply to the next ones you build," though things along these lines do help with control. (For the most obvious interventions, the work seems very likely to be applicable at multiple model scales in practice.) This might depend on exactly what you mean by "sharp left turn"; I think people often use the term to point at several different, possibly correlated issues, but I think most of them can in principle be handled with control. The two assumptions we need that could be broken by something like a "sharp left turn":

* Some continuity in capabilities in a small local regime: We do need to assume that we can scale things up by a small amount which only makes the AI somewhat smarter, in order to ensure we can run evaluations safely. Or we need some other way of getting somewhat continuous capabilities (e.g., by varying how long a given AI runs). However, the actual quantitative amount can be quite small, and we can adjust how we do this depending on our observations. See here for some discussion.
* Sample-efficient training for capability evaluations that is robust to intentional sandbagging: We need to be able to train our AIs to do tasks in a reasonably sample-efficient way that is robust to AIs wanting to sandbag on the task. If "sharp left turn" implies "very low sample efficiency because capabilities now come from a non-retargetable process," then this would cause problems. In principle, this is a relatively nice problem to work on, because you just want to make a number go up and you just want the AI to do as well as if it was trying. We also have some flexibility here because capability evaluation is easier than elicitation (footnote 8).

Thanks for writing this up! I've been considering writing something in response to AI is easy to control for a while now, in particular arguing against their claim that "If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior." I think Section 4 does a good job of explaining why this probably isn't true, with the basic problem being that the space of behaviors consistent with the train... (read more)

I think the guiding principle behind whether or not scientific work is good should probably look something more like "is this getting me closer to understanding what's happening," where "understanding" is something like "my measurements track the thing in one-to-one lock-step with reality, because I know the right typings and I've isolated the underlying causes well enough."

AI control doesn’t seem like it’s making progress on that goal, which is certainly not to say it’s not important—it seems good to me to be putting some attention on locally useful thing... (read more)

I worry that overemphasizing fast feedback loops ends up making projects more myopic than is good for novel research. Like, unfortunately for everyone, a bunch of what makes good research is good research taste, and no one really understands what that is or how to get it and so it's tricky to design feedback loops to make it better. Like, I think the sense of "there's something interesting here that no one else is seeing" is sort of inherently hard to get feedback on, at least from other people. Because if you could, or if it were easy to explain from the ... (read more)

I agree there's a risk of overemphasizing fast feedback loops that damages science. My current belief is that gaining research taste is something that shouldn't be that mysterious, and mostly it seems to be something that:

* does require quite a bit of effort (which is why I think it isn't done by default)
* also requires at least some decent meta-taste on how to gain taste (but my guess is Alex Altair in particular has enough of this to navigate it)

And, meanwhile, I feel like we just don't have the luxury of not at least trying on this axis to some degree. (I don't know that I can back up this statement very much; this seems to be a research vein I currently believe in that no one else currently seems to.)

It is plausible to me (based on things like Alex's comment on this other post you recently responded to, and other convos with him) that Alex-in-particular is already basically doing all the things that make sense to do here. But, like, looking here: I think the amount I'm pushing for this here is "at all", and it feels premature to me to jump to "this will ruin the research process".

My impression is that Alex is trying to figure out what things like "optimization" are actually like, and that this analysis will apply to a wider variety of systems than just ML. Which makes sense to me—imo, anchoring too much on current systems seems unlikely to produce general, robust solutions to alignment. 

Seconded! I love Holden's posts on wicked problems, I revisit them like once a week or whenever I'm feeling down about my work :p

I've also found it incredibly useful to read historical accounts of great scientists. There's just all kinds of great thinking tips scattered among biographies, many of which I've encountered on LessWrong before, but somehow seeing them in the context of one particular intellectual journey is very helpful. 

Reading Einstein's biography (by Walter Isaacson) was by far my favorite. I felt like I got a really good handle for his... (read more)

Any other biography suggestions?

Yeah, I think I misspoke a bit. I do think that controllability is related to understanding, but also I’m trying to gesture at something more like “controllability implies simpleness.”

Where, I think what I’m tracking with “controllability implies simpleness” is that the ease with which we can control things is a function of how many causal factors there are in creating it, i.e., “conjunctive things are less likely” in some Occam’s Razor sense, but also conjunctive things “cost more.” At the very least, they cost more from the agent’s point of view. Like, if... (read more)

Are those things that good? I don't feel like I notice a huge quality of life difference from the pens I used ten years ago versus the pens I use now. Same with laptops and smartphones (although I care unusually little about that kind of thing so maybe I'm just not tracking it). Medicines have definitely improved although it seems worth noting that practically everyone I know has some terrible health problem they can't fix and we all still die. 

I feel like pushing the envelope on feature improvements is way easier than pushing the envelope on fundamen... (read more)

To be honest I actually do use the same pens as I used 10 years ago. Laptops have faster processors at least, and I can now do more stuff with them than I used to be able to. I don't have a terrible health problem I can't fix and haven't died yet. I totally believe that we're doing way worse than we could be and management practices are somehow to blame, just because the world isn't optimal and management is high-leverage. But in this model, I don't understand how you even get sustained improvements.

Google Brain was developed as part of X (Google's "moonshot factory"), which is their way of trying to create startups/startup culture within a large corporation. So was Waymo. 

They establish almost nothing of importance about the behavior and workings of real AIs, but nonetheless give the impression of a model for how we should think about AIs. 

How do you know that they establish nothing of importance? 

Many proponents of AI risk seem happy to critique analogies when they don't support the desired conclusion, such as the anthropomorphic analogy. 

At the very least, this seems to go both ways. Like, afaict, one of Quintin and Nora’s main points in “AI is Easy to Control” is that aligning AI is pretty much just l... (read more)

Argument by analogy is based on the idea that two things which resemble each other in some respects, must resemble each other in others: that isn't deductively valid.

It seems pretty wild to request that we stop relying on them altogether, almost as if you were asking us to stop thinking. Analogies seem so core to me when developing thought in novel domains, that it’s hard to imagine life without them. 

I agree with Douglas Hofstadter’s claim that thinking even a single thought about any topic, without using analogies, is just impossible—1hr talk, book-length treatment which I have been gradually reading (the book is good but annoyingly longwinded).

(But note that the OP does not actually go so far as to demand that ... (read more)

I think lots of folks (but not all) would be up in arms, claiming "but modern results won't generalize to future systems!" And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this paper claims pessimistic results, and it's socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I'm being too cynical, but that's my reaction.

Fwiw, my reaction to something like “we can finetune the AI to be nice in a stable way” is more like—but... (read more)

This post is so wonderful, thank you for writing it. I’ve gone back to re-read many paragraphs over and over.

A few musings of my own:

“It’s just” … something. Oh? So eager, the urge to deflate. And so eager, too, the assumption that our concepts carve, and encompass, and withstand scrutiny. It’s simple, you see. Some things, like humans, are “sentient.” But Bing Sydney is “just” … you know. Actually, I don’t. What were you going to say? A machine? Software? A simulator? “Statistics?”

This has long driven me crazy. And I think you’re right about the source of... (read more)

We think it’s very unlikely that the AI alignment field will be able to make progress quickly enough to prevent human extinction.

From my point of view it seems possible that we could solve technical alignment, or otherwise become substantially deconfused, within 10 years—perhaps much sooner. I don’t think we’ve ruled out that foundational scientific progress is capable of solving the problem, nor that cognitively unenhanced humans might be able to succeed at such an activity. Like, as far as I can tell, very few people have tried working on the problem dir... (read more)

I think this is a great comment, and FWIW I agree with, or am at least sympathetic to, most of it.

Goodheart is about generalization, not approximation.

This seems true, and I agree that there’s a philosophical hangup along the lines of “but everything above the lowest level is fuzzy and subjective, so all of our concepts are inevitably approximate.” I think there are a few things going on with this, but my guess is that part of what this is tracking—at least in the case of biology and AI—is something like “tons of tiny causal factors.” 

Like, I think one of the reasons that optimizing hard for good grades as a proxy for doing well at a job fails is ... (read more)
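One way to see the "tons of tiny causal factors" point is a toy regressional-Goodhart simulation (an illustrative construction, not from the comment): grades share some causal factors with job performance but also load on grade-specific ones, so selecting hard on grades selects partly on the factors that don't transfer.

```python
import random

random.seed(0)

def candidate() -> tuple[float, float]:
    """Each candidate is a bundle of many small causal factors."""
    shared = sum(random.gauss(0, 1) for _ in range(5))      # help grades AND the job
    grade_only = sum(random.gauss(0, 1) for _ in range(5))  # help grades only
    return shared + grade_only, shared  # (grades, job performance)

pool = [candidate() for _ in range(50_000)]
top = sorted(pool, reverse=True)[:100]  # select hard on the grades proxy
avg_grades = sum(g for g, _ in top) / len(top)
avg_perf = sum(p for _, p in top) / len(top)

# The top scorers are genuinely above average, but their performance is only
# about half of what their grade number naively suggests: extreme selection
# on the proxy loads heavily on the factors the proxy alone rewards.
print(f"top-100 grades: {avg_grades:.1f}, performance: {avg_perf:.1f}")
```

The split between "shared" and "grade-only" factors is of course an assumption for illustration; the qualitative effect just needs the proxy to aggregate causes the target doesn't share.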

This seems basically right to me. That said, while it is predictable that the systems in question will be modular, what exact form that modularity takes is both environment-dependent and also path-dependent. Even in cases where the environmental pressures form a very strong attractor for a particular shape of solution, the "module divisions" can differ between species. For example, the pectoral fins of fish and the pectoral flippers of dolphins both fulfill similar roles. However, fish fins are basically a collection of straight, parallel fin rays made of bone or cartilage and connected to a base inside the body of the fish, and the muscles to control the movement of the fin are located within the body of the fish. By contrast, a dolphin's flipper is derived from the foreleg of its tetrapod ancestor, and contains "fingers" which can be moved by muscles within the flipper. So I think approaches that look like "find a structure that does a particular thing, and try to shape that structure in the way you want" are somewhat (though not necessarily entirely) doomed, because the pressures that determine which particular structure does a thing are not nearly so strong as the pressures that determine that some structure does the thing.
I agree there's a core principle somewhere around the idea of "controllable implies understandable". But when I think about this with respect to humans studying biology, then there's another thought that comes to my mind; the things we want to control are not necessarily the things the system itself is controlling.

For example, we would like to control the obesity crisis (and weight loss in general) but it's not clear that the biological system itself is controlling that. It almost certainly was successfully controlling it in the ancestral environment (and therefore it was understandable within that environment) but perhaps the environment has changed enough that it is now uncontrollable (and potentially not understandable). Cancer manages to successfully control the system in the sense of causing itself to happen, but that doesn't mean that our goal, "reliably stopping cancer", is understandable, since it is not a way that the system is controlling itself. This mismatch seems pretty evidently applicable to AI alignment.

And perhaps the "environment" part is critical. A system being controllable in one environment doesn't imply it being controllable in a different (or broader) environment, and thus guaranteed understandability is also lost. This feels like an expression of misgeneralization.

Someone picks a questionable ontology for modeling biological organisms/neural nets - for concreteness, let’s say they try to represent some system as a decision tree.

Lo and behold, this poor choice of ontology doesn’t work very well; the modeler requires a huge amount of complexity to decently represent the real-world system in their poorly-chosen ontology. For instance, maybe they need a ridiculously large decision tree or random forest to represent a neural net to decent precision.

This drove me crazy in cognitive science. There was a huge wave of Bayesi... (read more)

Fwiw, I generally find Quintin’s writing unclear and difficult to read (I bounce a lot) and Nora’s clear and easy, even though I agree with Quintin slightly more (although I disagree with both of them substantially).

I do think there is something to “views that are very different from ones own” being difficult to understand, sometimes, although I think this can be for a number of reasons. Like, for me at least, understanding someone with very different beliefs can be both time intensive and cognitively demanding—I usually have to sit down and iterate on “ma... (read more)

Yeah, for me the early development of shard theory work was confusing for similar reasons. Quintin framed values as contextual decision influences and thought these were fundamental, while I'd absorbed from Eliezer that values were like a utility function. They just think in very different frames. This is why science is so confusing until one frame proves useful and is established as a Kuhnian paradigm.

It's pretty hard to argue for things that don't exist yet, but here's a few intuitions for why I think agency related concepts will beget True Names:

  • I think basically every concept looks messy before it’s not, and I think this is particularly true at the very beginning of fields. Like, a bunch of Newton’s early journaling on optics is mostly listing a bunch of facts that he’s noticed—it’s clear with hindsight which ones are relevant and which aren’t, but before we have a theory it can just look like a messy pile of observations from the outside. Perusing o
... (read more)

The weird thing about a portfolio approach is that the things it makes sense to work on in “optimistic scenarios” often trade off against those you’d want to work on in more “pessimistic scenarios,” and I don't feel like this is really addressed.

Like, if we’re living in an optimistic world where it’s pretty chill to scale up quickly, and things like deception are either pretty obvious or not all that consequential, and alignment is close to default, then sure, pushing frontier models is fine. But if we’re in a world where the problem is nearly impossible, ... (read more)

Answer by aysja · Dec 07, 2023
LessWrong is my favorite website. I’ve tried having thoughts on other websites and it didn't work. Seriously, though—I feel very grateful for the effort you all have put in to making this an epistemically sane environment. I have personally benefited a huge amount from the intellectual output of LW—I feel smarter, saner, and more capable of positively affecting the world, not to mention all of the gears-level knowledge I’ve learned, and model building I’ve done as a result, which has really been a lot of fun :) And when I think about what the world wou... (read more)

7 · Three-Monkey Mind · 2mo
I'd like to highlight this. In general, I think fewer things should be promoted to the front page. [edit, several days later]: is a prime example. This has nothing to do with rationality or AI alignment. This is the sort of off-topic chatter that belongs somewhere else on the Internet.

I've definitely also seen the failure mode where someone is only or too focused on "the puzzles of agency" without having an edge in linking those questions up with AI risk/alignment. Some ways of asking about/investigating agency are more or less relevant to alignment, so I think it's important that there is a clear/strong enough "signal" from the target domain (here: AI risk/alignment) to guide the search/research directions.

I disagree—I think that we need more people on the margin who are puzzling about agency, relative to those who are backchaining fro... (read more)


From my perspective, meaningfully operationalizing “tool-like” seems like A) almost the whole crux of the disagreement, and B) really quite difficult (i.e., requiring substantial novel scientific progress to accomplish), so it seems weird to leave as a simple to-do at the end.

Like, I think that “tool versus agent” shares the same confusion that we have about “non-life versus life”—why do some pieces of matter seem to “want” things, to optimize for them, to make decisions, to steer the world into their preferred states, and so on, while other pieces seem to... (read more)

I didn't leave it as a "simple" to-do, but rather an offer to collaboratively hash something out.  That said: If people don't even know what it would look like when they see it, how can one update on evidence? What is Nate looking at which tells him that GPT doesn't "want things in a behavioralist sense"? (I bet he's looking at something real to him, and I bet he could figure it out if he tried!) To be clear, I'm not talking about formalizing the boundary. I'm talking about a bet between people, adjudicated by people.  (EDIT: I'm fine with a low sensitivity, high specificity outcome -- we leave it unresolved if it's ambiguous / not totally obvious relative to the loose criteria we settled on. Also, the criterion could include randomly polling n alignment / AI people and asking them how "behaviorally-wanting" the system seemed on a Likert scale. I don't think you need fundamental insights for that to work.)
Hm, I'm sufficiently surprised at this claim that I'm not sure that I understand what you mean. I'll attempt a response on the assumption that I do understand; apologies if I don't:

I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature. A common form is to be a mapping between inputs and outputs that isn't swayed by anything outside of the context of that mapping (which I'll term "external world states"). You can view a calculator as a coherent agent, but you can't usefully describe the calculator as a coherent agent with a utility function regarding world states that are external to the calculator's process. You could use a calculator within a larger system that is describable as a maximizer over a utility function that includes unconditional terms for external world states, but that doesn't change the nature of the calculator. Draw the box around the calculator within the system? Pretty obviously a tool. Draw the box around the whole system? Not a tool.

I've been using the following two requirements to point at a maximally[1] tool-like set of agents. This composes what I've been calling goal agnosticism:

1. The agent cannot be usefully described[2] as having unconditional preferences about external world states.
2. Any uniformly random sampling of behavior from the agent has a negligible probability of being a strong and incorrigible optimizer.

Note that this isn't the same thing as a definition for "tool." An idle rock uselessly obeys this definition; tools tend to be useful for something. This definition is meant to capture the distinction between things that feel like tools and those that feel like "proper" agents. To phrase it another way, the intuitive degree of "toolness" is a spectrum of how much the agent exhibits unconditional preferences about external world states through instrumental behavior.

Notably, most pretrained LLMs with the usual autoregressive predictive loss and a diverse training set
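A toy rendering of that calculator/optimizer contrast (my own sketch, not the commenter's; the names and numbers are invented for illustration): the tool is a fixed conditional mapping, while the agent-like system acts to pull an external state toward an unconditional preference.

```python
def calculator(expr: str) -> int:
    """Tool-like: a conditional input->output mapping, unswayed by
    anything outside the context of that mapping."""
    a, op, b = expr.split()
    return {"+": int(a) + int(b), "*": int(a) * int(b)}[op]

class Thermostat:
    """Agent-like: holds an unconditional preference (a target) over an
    external world state and acts to steer the world toward it."""
    def __init__(self, target: float):
        self.target = target

    def act(self, world_temp: float) -> float:
        # Move the external state halfway toward the preferred state.
        return world_temp + 0.5 * (self.target - world_temp)

world_temp = 10.0
agent = Thermostat(target=20.0)
for _ in range(20):
    world_temp = agent.act(world_temp)

print(calculator("2 + 3"), round(world_temp, 3))  # the world ends up near 20
```

Drawing the box matters here exactly as the comment says: embed `calculator` inside a loop that uses its outputs to steer the room temperature, and the enclosing system is no longer tool-like even though the calculator is unchanged.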

Even if humanity isn't like, having a huge mood shift, I do still expect the next 10 years to have a lot more people working on stuff that actually helps than the previous 10 years.

What kinds of things are you imagining, here? I'm worried that on the current margin people coming into safety will predominately go into interpretability/evals/etc because that's the professional/legible thing we have on offer, even though by my lights the rate of progress and the methods/aims/etc of these fields are not nearly enough to get us to alignment in ~10 years (in wor... (read more)

"Intelligence" can be characterized with a similar level of theoretical precision as e.g., heat, motion, and information. (In other words: it's less like a messy, ad-hoc phenomenon and more like a deep, general fact about our world.)

In particular, I think their usage of Dario's statements on x-risk as a rhetorical weapon against RSPs creates a structural disincentive against lab heads being clear about existential risk

I’m not sure how to articulate this, exactly, but I want to say something like “it’s not on us to make sure the incentives line up so that lab heads state their true beliefs about the amount of risk they’re putting the entire world in.” Stating their beliefs is just something they should be doing, on a matter this important, no matter the consequences. That’s o... (read more)

When I was an SRE at Google, we had a motto that I really like, which is: "hope is not a strategy." It would be nice if all the lab heads would be perfectly honest here, but just hoping for that to happen is not an actual strategy.

Furthermore, I would say that I see the main goal of outside-game advocacy work as setting up external incentives in such a way that pushes labs to good things rather than bad things. Either through explicit regulation or implicit pressure, I think controlling the incentives is absolutely critical and the main lever that you have externally for controlling the actions of large companies.

Thanks for making this dialogue! I’ve been interested in the science of uploading for a while, and I was quite excited about the various C. elegans projects when they started. 

I currently feel pretty skeptical, though, that we understand enough about the brain to know which details will end up being relevant to the high-level functions we ultimately care about. I.e., without a theory telling us things like “yeah, you can conflate NMDA receptors with AMPA, that doesn’t affect the train of thought” or whatever, I don’t know how one decides what details a... (read more)

Yeah :/ I've struggled for a long time to see how the world could be good with strong AI, and I've felt pretty alienated in that. Most of the time when I talk to people about it they're like "well the world could just be however you like!" Almost as if, definitionally, I should be happy because in the really strong success cases we'll have the tech to satisfy basically any preference. But that's almost the entire problem, in some way? As you say, figuring things out for ourselves, thinking and learning and taking pride in skills that take effort to acquire... (read more)

Roman Leventov (4mo):
Discovering and mastering one's own psychology may still be a frontier where the AI could help only marginally. So, more people will become monks or meditators?

Thanks for writing this post—I appreciate the candidness about your beliefs here, and I agree that this is a tricky topic. I, too, feel unsettled about it on the object level.

On the meta level, though, I feel grumpy about some of the framing choices. There’s this wording which both you and the original ARC evals post use: that responsible scaling policies are a “robustly good compromise,” or, in ARC’s case, that they are a “pragmatic middle ground.” I think these stances take for granted that the best path forward is compromising, but this seems ... (read more)

Thanks for the thoughts! I don’t think the communications you’re referring to “take for granted that the best path forward is compromising.” I would simply say that they point out the compromise aspect as a positive consideration, which seems fair to me - “X is a compromise” does seem like a point in favor of X all else equal (implying that it can unite a broader tent), though not a dispositive point. I address the point about improvements on the status quo in my response to Akash above.

I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it. 

I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.   

I think there is pressure mounting within the field of A... (read more)


But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.

As I mention in the post, we do have the ability to do concrete capabilities evals right now. What we can't do are concrete safety evals, which I'm very clear about not expecting us to have right now.

And I'm not expecting that we eventually solve the problem of building good safety evals either—but I am describing a way in which things go well that involves a solution to that problem. If we never solve the ... (read more)

Fully agree with almost all of this. Well said.

One nitpick of potentially world-ending importance: Giving us high confidence is not the bar - we also need to be correct in having that confidence. In particular, we'd need to be asking: "How likely is it that the process we used to find these measures and evaluations gives us [actually sufficient measures and evaluations] before [insufficient measures and evaluations that we're confident are sufficient]? How might we tell the difference? What alternative process would make this more likely?..." I assume you'd roll that into assessing your confidence - but I think it's important to be explicit about this.

Based on your comment, I'd be interested in your take on:

  1. Put many prominent disclaimers and caveats in the RSP - clearly and explicitly.

vs

  2. Attempt to make commitments sufficient for safety by committing to [process to fill in this gap] - including some high-level catch-all like "...and taken together, these conditions make training of this system a good idea from a global safety perspective, as evaluated by [external board of sufficiently cautious experts]".

Not having thought about it for too long, I'm inclined to favor (2). I'm not at all sure how realistic it is from a unilateral point of view - but I think it'd be useful to present proposals along these lines and see what labs are willing to commit to. If no lab is willing to commit to any criterion they don't strongly expect to be able to meet ahead of time, that's useful to know: it amounts to "RSPs are a means to avoid pausing". I imagine most labs wouldn't commit to [we only get to run this training process if Eliezer thinks it's good for global safety], but I'm not at all sure what they would commit to.

At the least, it strikes me that this is an obvious approach that should be considered - and that a company full of abstract thinkers who've concluded "There's no direct, concrete, ML-based thing we can commit to here, so we're out of

Meta: I don’t want this comment to be taken as “I disagree with everything you (Thomas) said.” I do think the question of what to do when you have an opaque, potentially intractable problem is not obvious, and I don’t want to come across as saying that I have the definitive answer, or anything like that. It’s tricky to know what to do, here, and I certainly think it makes sense to focus on more concrete problems if deconfusion work didn’t seem that useful to you. 

That said, at a high-level I feel pretty strongly about investing in early-stage deconfus... (read more)

Such a definition seems futile (I recommend the rest of the word sequence also). Biology already does a great job explaining what and why some things are alive. We are not going around thinking a rock is "alive". Or what exactly did you have in mind there?
I don't have the energy to contribute actual thoughts, but here are a few links that may be relevant to this conversation:

  • Sequencing is the new microscope, by Laura Deming
  • On whether neuroscience is primarily data, rather than theory, bottlenecked:
    • Could a neuroscientist understand a microprocessor?, by Eric Jonas
    • This footnote on computational neuroscience, by Jascha Sohl-Dickstein

Thanks, I really like this comment. Here are some points about metascience I agree and disagree with, and how this fits into my framework for thinking about deconfusion vs data collection in AI alignment.

  • I tentatively think you're right about relativity, though I also feel out of my depth here. [1]
  • David Bau must have mentioned the Schrödinger book but I forgot about it, thanks for the correction. The fact that ideas like this told Watson and Crick where to look definitely seems important.
  • Overall, I agree that a good theoretical understanding guides further
... (read more)

Top level blog post, do it.

I did undergrad and grad school in neuroscience and can at the very least say that this was also my conclusion.

I remember the introductory lecture for the Cognitive Neuroscience course I took at Oxford. I won't mention the professor's name, because he's got his own lab and is all senior and stuff, and might not want his blunt view to be public -- but his take was "this field is 95% nonsense. I'll try to talk about the 5% that isn't". Here's a lecture slide:

This seems wrong to me in some important ways (at least as general theoretical research advice). Like, some of the advice you give seems to anti-predict important scientific advances.

Generally, unguided exploration is seldom that useful. 

Following this advice, for instance, would suggest that Darwin not go on the Beagle, i.e., not spend five years exploring the globe (basically just for fun) as a naturalist. But his experiences on the Beagle were exactly what led him to the seeds of natural selection, as he began to notice subtleties like how animals ... (read more)

Yeah, there's something weird going on here that I want to have better handles on. I sometimes call the thing Bengio does being "motivated by reasons." Also the skill of noticing that "words have referents" or something? 

Like, the LARPing quality of many jobs seems adjacent to the failure mode Orwell is pointing to in Politics and the English Language, e.g., sometimes people will misspell metaphors—"tow the line" instead of "toe the line"—and it becomes very clear that the words have "died" in some meaningful sense, like the speaker is not actually "l... (read more)

Ugh, that one annoys me so much. Capitalism is a word so loaded it has basically lost all meaning. Like, people will say things like "slavery is inextricably linked to capitalism" and I'm thinking, hey genius, slavery existed in tribal civilizations that didn't even have the concept of money, what do you think capitalism even is? (Same thing for patriarchy.)

Not sure I understand what you're saying with the "tow the line" thing.

A lot of what you wrote seems like it's gesturing towards an idea that I might call "buzzwords" or "marketing speak" or "the affect game".  I think of it as being when someone tries to control the valence of an idea by associating it with other ideas without invoking any actual model of how they're connected.  Like trying to sell a car by having a hot model stand next to it.

If that's what you meant, then I agree this is kind of eerie and sort of anti-reality and I'm generally ... (read more)

But when it comes to messy gene expression networks, we've already found the hidden beauty - the stable level of underlying physics.  Because we've already found the master order, we can guess that we  won't find any additional secret patterns that will make biology as easy as a sequence of cubes.  Knowing the rules of the game, we know that the game is hard.  We don't have enough computing power to do protein chemistry from physics (the second source of uncertainty) and evolutionary pathways may have gone different ways on different pl

... (read more)

Thanks for adding this to my conceptual library! I like the idea. 

One thing I feel uncertain about (not coming from a background in evo bio) is how much evidence the "only evolved once" thing is for contingency. My naive guess would be that there's an asymmetry here, i.e., that "evolved many times" is a lot of evidence for convergence, but "evolved only once" is only a small amount of evidence for contingency (on its own). For instance, I can imagine (although I don't know if this has happened) that something highly favorable might evolve (such as a v... (read more)

Mateusz Bagiński (2mo):
I agree that "only evolved once" (and then "subsumed the market") is not that much evidence for contingency on its own. But if you combine it with the knowledge that its evolution was facilitated/enabled by some random-ish, contingent factors, then the contingency case is much stronger. For example, the great diversification of mammals was enabled by the extinction of dinosaurs, itself caused by an asteroid impact, a factor as contingent as you can ask.

Meta: I have some gripes about the feedback loop focus in rationality culture, and I think this comment unfairly mixes a bunch of my thoughts about this topic in general with my thoughts in response to this post in particular—sorry in advance for that. I wish I was better at delineating between them, but that turned out to be kind of hard, and I have limited time and so on…

It is quite hard to argue against feedback loops in their broadest scope because it’s like arguing against updating on reality at all and that’s, as some might say, the core thing we’re ... (read more)

So I agree with all the knobs-on-the-equation you and Adam are bringing up. I've spent a lot of time pushing for LessWrong to be a place where people feel more free to explore early stage ideas without having to justify them at every step.

I stand by my claim, although a) I want to clarify some detail about what I'm actually claiming, b) after clarifying, I expect we'll still disagree, albeit for somewhat vague aesthetic-sense reasons, but I think my disagreement is important.

Main Clarifications: 

  • This post is primarily talking about training, rather th
... (read more)
Adam Scholl (6mo):
Yeah, my impression is similarly that focus on feedback loops is closer to "the core thing that's gone wrong so far with alignment research" than to "the core thing that's been missing." I wouldn't normally put it this way, since I think many types of feedback loops are great, and since obviously in the end alignment research is useless unless it helps us better engineer AI systems in the actual territory, etc. (And also because some examples of focus on tight feedback loops, like Faraday's research, strike me as exceedingly excellent, although I haven't really figured out yet why his work seems so much closer to the spirit we need than e.g. thinking physics problems).

Like, all else equal, it clearly seems better to have better empirical feedback; I think my objection is mostly that in practice, focus on this seems to lead people to premature formalization, or to otherwise constraining their lines of inquiry to those whose steps are easy to explain/justify along the way.

Another way to put this: most examples I've seen of people trying to practice attending to tight feedback have involved them focusing on trivial problems, like simple video games or toy already-solved science problems, and I think this isn't a coincidence. So while I share your sense Raemon that transfer learning seems possible here, my guess is that this sort of practice mostly transfers within the domain of other trivial problems, where solutions (or at least methods for locating solutions) are already known, and hence where it's easy to verify you're making progress along the way.

I am not totally sure that I disagree with you, but I would not say that agency is subjective and I’m going to argue against that here. 

Clarifying “subjectivity.” I’m not sure I disagree because of this sentence “there’s a certain structure out in the world which people recognize as X, because recognizing it as X is convergently instrumental for a wide variety of goals.” I’m guessing that where you’re going with this is that the reason it’s so instrumentally convergent is because there is in fact something “out there” that deserves to b... (read more)

In the final paragraph, I'm uncertain if you are thinking about "agency" being broken into components which make up the whole concept, or thinking about the category being split into different classes of things, some of which may have intersecting examples (or both?). I suspect both would be helpful. Agency can be described in terms of components like measurement/sensory, calculations, modeling, planning, comparisons to setpoints/goals, taking actions. Probably not that exact set, but then examples of agent-like things could naturally be compared on each component, and should fall into different classes. Exploring the classes I suspect would inform the set of components and the general notion of "agency". I guess to get work on that done it would be useful to have a list of prospective agent components, a set of examples of agent-shaped things, and then of course to describe each agent in terms of the components. What I'm describing, does it sound useful? Do you know of any projects doing this kind of thing?

On the topic of map-territory correspondence (is there a more concise name for that?), I quite like your analogies. Running with them a bit, it seems like there are maybe 4 categories of map-territory correspondence:

  • Orange-like: It exists as a natural abstraction in the territory and so shows up on many maps.
  • Hot-like: It exists as a natural abstraction of a situation. A fire is hot in contrast to the surrounding cold woods. A sunny day is hot in contrast to the cold rainy days that came before it.
  • Heat-like: A natural abstraction of the natural abstraction of the situation, or alternatively, comparing the temperature of 3, rather than only 2, things. It might be natural to jump straight to the abstraction of a continuum of things being hot or not relative to one another, but it also seems natural to instead not notice homeostasis, and only to categorize the hot and cold in the environment that push you out of homeostasis.
  • Indeterminate: There is
Yup, exactly. And good explanations, this is a great comment all around.

Yudkowsky’s measure still feels weird to me in ways that don’t seem to apply to length, in the sense that length feels much more to me like a measure of territory-shaped things, and Yudkowsky’s measure of optimization power seems much more map-shaped (which I think Garrett did a good job of explicating). Here’s how I would phrase it:

Yudkowsky wants to measure optimization power relative to a utility function: take the rank of the state you’re in, take the total number of all states that have equal or greater rank, and then divide that by the total number o... (read more)
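To make the rank-based measure concrete, here is a toy sketch over a finite state space (the function name and setup are my own illustration, not anything from Yudkowsky's writing):

```python
import math

def optimization_power(state_ranks, achieved_rank):
    """Rank-based optimization power: -log2 of the fraction of possible
    states ranked at least as highly (by the preference ordering) as the
    state the system actually reached. Toy version over a finite space."""
    at_least_as_good = sum(1 for r in state_ranks if r >= achieved_rank)
    return -math.log2(at_least_as_good / len(state_ranks))

# Toy example: 16 possible states ranked 1 (worst) through 16 (best).
ranks = list(range(1, 17))
print(optimization_power(ranks, 16))  # best state: -log2(1/16) = 4.0 bits
print(optimization_power(ranks, 9))   # top half:   -log2(8/16) = 1.0 bit
```

The map-shaped-ness worry then shows up in the choice of what counts as the space of "possible" states, which the function above simply takes as given.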

For this part, my answer is Kolmogorov complexity. An ice cube has lower K-complexity than the same amount of liquid water, which is a fact about the territory and not our maps. (And if a state has lower K-complexity, it's more knowable; you can observe fewer bits, and predict more of the state.) One of my ongoing threads is trying to extend this to optimization. I think a system is being objectively optimized if the state's K-complexity is being reduced. But I'm still working through the math.
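True K-complexity is uncomputable, but the intuition can be illustrated with compressed length as a rough, computable stand-in (the proxy and the toy data here are illustrative assumptions, not part of the original comment):

```python
import zlib

def compressed_size(data: bytes) -> int:
    # Compressed length as a (computable) rough proxy for K-complexity.
    return len(zlib.compress(data, level=9))

ordered = bytes(1000)                                # "ice cube": maximally regular
mixed = bytes((i * 97) % 256 for i in range(1000))   # a less regular byte pattern

# The more ordered state has the shorter description.
print(compressed_size(ordered) < compressed_size(mixed))  # True
```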
Yeah... so these are reasonable thoughts of the kind that I thought through a bunch when working on this project, and I do think they're resolvable, but to do so I'd basically be writing out my optimization sequence. I agree with Alexander below though, a key part of optimization is that it is not about utility functions, it is only about a preference ordering. Utility functions are about choosing between lotteries, which is a thing that agents do, whereas optimization is just about going up an ordering. Optimization is a thing that a whole system does, which is why there's no agent/environment distinction. Sometimes, only a part of the system is responsible for the optimization, and in that case you can start to talk about separating them, and then you can ask questions about what that part would do if it were placed in other environments.
Alexander Gietelink Oldenziel (8mo):
Small note: Yudkowsky's definition is about a preference order, not a utility function. Indeed, this was half the reason we did the project in the first place!

I don't see the difference between "resolution of uncertainty" and "difference between the world and a counterfactual." To my mind, resolution of uncertainty is reducing the space of counterfactuals, e.g., if I'm not sure whether you'll say yes or no, then you saying "yes" reduces my uncertainty by one bit, because there were two counterfactuals. 

I think what Garrett is gesturing at here is more like "There is just one way the world goes, the robot cleans the room or it doesn't. If I had all the information about the world, I would see the robot does ... (read more)

The above formulas rely on comparing the actual world to a fixed counterfactual baseline. Gaining more information about the actual world might make the distance between the counterfactual baseline and the actual world grow smaller, but it also might make it grow bigger, so it's not the case that the optimisation power goes to zero as my uncertainty about the world decreases. You can play with the formulas and see. But maybe your objection is not so much that the formulas actually spit out zero, but that if I become very confident about what the world is like, it stops being coherent to imagine it being different? This would be a general argument against using counterfactuals to define anything. I'm not convinced of it, but if you like you can purge all talk of imagining the world being different, and just say that measuring optimisation power requires a controlled experiment: set up the messy room, record what happens when you put the robot in it, set the room up the same, and record what happens with no robot.

This reminds me a lot of one of Kuhn's essays A Function for Thought Experiments. Where basically he's like "people often conflate variables together; thought experiments can tease apart those conflations." E.g., kids will usually start out conflating height with volume so that even though they watch the experimenter pour the "same" amount of water into a taller, thinner glass, they will end up saying that the left hand glass in (c) has more water than the one on the right. 

Which is generally a good heuri... (read more)

(I like this connection, thanks!)

How does the redundancy definition of abstractions account for numbers, e.g., the number three? It doesn’t seem like “threeness” is redundantly encoded in, for example, the three objects on the floor of my room (rug, sweater, bottle of water) as rotation is in the gear example, since you wouldn’t be able to uncover information about “three” from any one object in particular. 

I could imagine some definition based on redundancy capturing “threeness” by looking at a bunch of sets containing three things. But I think the reason the abstraction “three” fee... (read more)

I really value "realness" although I too am not sure what it is, exactly. Some thoughts:

I cannot stand fake wood or brick or anything fake really, because it feels like it is trying to trick me. It's "lying," in sort of the same way I feel like people lie when they say they are doing something because it helps climate change or whatever, when really it seems clear that they are doing it for social approval or something of that nature. 

Moss feels very real to me, also, as do silky spider webs, or any slice of nature, really, when I'm in it. I think it'... (read more)

Thanks for writing this up! It seems very helpful to have open, thoughtful discussions about different strategies in this space. 

Here is my summary of Anthropic’s plan, given what you’ve described (let me know if it seems off): 

  1. It seems likely that deep learning is what gets us to AGI. 
  2. We don’t really understand deep learning systems, so we should probably try to, you know, do that. 
  3. In the absence of a deep understanding, the best way to get information (and hopefully eventually a theory) is to run experiments on these systems. 
  4. 
... (read more)

Your summary seems fine! 

Why do you need to do all of this on current models? I can see arguments for this, for instance, perhaps certain behaviors emerge in large models that aren’t present in smaller ones.

I think that Anthropic's current work on RL from AI Feedback (RLAIF) and Constitutional AI is based on large models exhibiting behaviors that don't work in smaller models? (But it'd be neat if someone more knowledgeable than me wanted to chime in on this!) 

My current best understanding is that running state of the art models is expensive in te... (read more)

Ah, thanks! Link fixed now. 

Yes, welp, I considered getting into this whole debate in the post but it seemed like too much of an aside. Basically, Lynch is like, “when you control for cell size, the amount of energy per genome is not predictive of whether it’s a prokaryote or a eukaryote.” In other words, on his account, the main determinant of bioenergetic availability appears to be the size of the cell, rather than anything energetically special about eukaryotes, such as mitochondria. 

There are some issues here. First, most of the large prokary... (read more)


Yeah I think it’s a great question and I don’t know that I have a great answer. Plasmids (small rings of DNA that float around separately) are part of the story. My understanding here is pretty sketchy, but I think plasmids are way more likely to be deleted than the chromosomal DNA, and for some reason antibiotic resistant genes tend to be in plasmids (perhaps because they are shared so frequently through horizontal gene transfer)? So the “delete within a few hours” bit is probably overstating the average case of DNA deletion in bacteria. I would be surprised if it “knew” about the function of the gene, although I agree it seems possible that some epigenetic mechanism could explain it. I don’t know of any, though!

Good question! I don’t know, but I think that they don’t necessarily need to. Something I didn’t get into in the post but which is pretty important for understanding bacterial genomes is that they do horizontal gene transfer, which basically means that they trade genes between individuals rather than exclusively between parents and offspring. 

From what I understand, this means that although on average the bacteria shed the unhelpful DNA if given the opportunity, so long as a few individuals within the population still have the gene, it can get ra... (read more)
