On how various plans miss the hard bits of the alignment challenge

[-]paulfchristiano3y*Ω7718460

I'm going to spend most of this comment responding to your concrete remarks about ELK, but I wanted to start with some meta level discussion because it seems to cut closer to the heart of the issue and might be more generally applicable.

I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand then try to make progress on them. This involves acknowledging that we can't anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise. It means not becoming too pessimistic about a direction until we see fairly concretely where it's stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.

My sense is that you have more faith in a rough intuitive sense you've developed of what the "hard part" of alignment is, and so you'd primarily recommend thinking about ... (read more)

[-]johnswentworth3yΩ102910

I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand then try to make progress on them. This involves acknowledging that we can't anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise.

I would uncharitably summarize this as "let's just assume that finding a faithful concrete operationalization of the problem is not itself the hard part". And then, any time finding a faithful concrete operationalization of the problem is itself the hard part, you basically just automatically fail.

Is that... wrong? Am I missing something here? Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself "the hard part"? (I mean, just intuitively, I'd expect hacking away at the legible parts to induce some ... (read more)

[-]paulfchristiano3yΩ307029

I don't think those are great summaries. I think this is probably some misunderstanding about what ARC is trying to do and about what I mean by "concrete." In particular, "concrete" doesn't mean "formalized," it means more like: you are able to discuss a bunch of concrete examples of the difficulty and why they lead to failure of particular concrete approaches; you are able to point out where the problem will appear in a particular decomposition of the problem, and would revise your picture if that turned out to be wrong; etc.

You write:

But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again.

I don't yet have this sense about a "sharp left turn" bottleneck.

I think I would agree with you if we'd looked at a bunch of plausible approaches, and then convinced ourselves that they would fail. And then we tried to introduce the sharp left turn to capture the unifying theme of those failures and to start exploring what's really going on. At a high level that's very similar to what ARC is doing day to day, looking at a bunch of approaches to a problem, seeing why they fail, and then trying to understand the nature of the problem so that we can succeed.

But... (read more)

[-]johnswentworth3yΩ29633

This was a good reply, I basically buy it. Thanks.

3Carringtone Kinyanjui1y

I understand the security mindset (from the ordinary paranoia post) as: "What are the unexamined assumptions of your security systems which merely stem from investing or adapting a given model?". The vulnerability comes from the model. The problem is the "unknowable unknowns". In addition to the Cryptographer and the Cryptography skeptic, I would add the NSA Quantum computing engineer. Concretisation and operationalisation of these problems may have implicit assumptions that could be system wide catastrophic. I don't have clear ways of better articulating this back from analogy to Paul's concretisations of a proposed AI system. I'm not sure there's no disanalogy here. However it could be something like "We have this effective model of a proposed AI system. What are useful concretisations in which the AI system would fail?". The security mindset question would be something like "What representations in the 'UV-complete' theory of this AI system would lead to catastrophic failure modes?" I'm probably missing something here though.

[-]Richard_Ngo3yΩ16336

This comment made me notice a kind of duality:
- Paul wants to focus on finding concrete problems, and claims that Nate/Eliezer aren't being very concrete with their proposed problems.
- Nate/Eliezer want to focus on finding concrete solutions, and claim that Paul/other alignment researchers aren't being very concrete with their proposed solutions.

It seems like "how well do we understand the problem" is one crux here. I disagree with John's comment because it feels like he's assuming too much about our understanding of the problem. If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn't exist.

[-]paulfchristiano3yΩ16345

I don't feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so appreciated the comment).

ARC is spending its time right now (i) trying to write down concrete algorithms that solve ELK using heuristic arguments, and then trying to produce concrete examples in which they do the wrong thing, (ii) trying to write down concrete formalizations of heuristic arguments that have the desiderata needed for those algorithms to work, and trying to identify cases in which our algorithms don't yet meet those desiderata or they may be unachievable. The output is just actual code which is purported to solve major difficulties in alignment.

And on the flip side, I spend a significant amount of my time looking at the algorithms we are proposing (and the bigger plans into which they would fit if successful) and trying to find the best arguments I can that these plans will fail.

I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.

Put differently: I'm not saying that Nate and Eliezer are vague about problems but concrete about solutions, I'm saying they are vague... (read more)

4Richard_Ngo3y

Yeah, my comment was sloppily phrased; I agree with "I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain."

9johnswentworth3y

I don't think that's how this works? The strategy I'm recommending explicitly contains two parts where we gain evidence about whether a part of the problem actually exists:
- noticing an intuitive pattern in the failure-modes of some strategies
- attempting to formalize (which presumably includes backpropagating our mathematics into our intuitions)

... so if a part of the problem doesn't exist, then (a) we probably don't notice a pattern in the first place, but even if our notoriously unreliable human pattern-matchers over-match, then (b) while we're attempting to formalize we have plenty of opportunity to notice that maybe the pattern doesn't actually exist the way we thought it did.

It feels like you're looking for a duality which does not exist. I mean, the duality between "look for concrete solutions" and "look for concrete problems" I buy (and that would indeed cause one side to be over-optimistic and the other over-pessimistic in exactly the pattern we actually see between Paul and Nate/Eliezer). But it feels like you're also looking for a duality between how-Paul's-recommended-search-order-just-fails and how-mine-just-fails. And the reason that duality does not exist is that my recommended search order is using strictly more evidence; Paul is basically advocating ignoring a whole class of very useful evidence, and that makes his strategy straightforwardly suboptimal. If we were both picking different points on a pareto frontier, then yeah, there'd be a trade-off. But Paul just isn't on the pareto frontier.

8Richard_Ngo3y

I feel confused about the difference between your "attempt to formalize" step and Paul's "attempt to concretize" step. It feels like you can view either as a step towards the other - if you successfully formalize, then presumably you'll be able to concretize; but also one valuable step towards formalizing is by finding concrete examples and then generalizing from them. I think everyone agrees that it'd be great to end up with a formalism for the problem, and then disagrees on how much that process should involve "finding concrete examples of the problem". My own view is that since it's so incredibly easy for people to get lost in abstractions, people should try to concretize much more when talking about highly abstract domains. (Even when people are confident that they're not lost in abstractions, like Eliezer and Nate are, that's still really useful for conveying ideas to other people.)

4Nathan Helm-Burger3y

Maybe you've already got this idea on your list of concrete examples, but while I was reading the original post I had an idea for a concrete example of how a parameter-frozen model which was above some critical intelligence threshold could potentially gain dangerous amounts of intelligence / power if given compute, access to data, and data storage. I'm not saying that this is particularly probable, just that it seems like a path worth adding to a concrete list of 'ways things could go tricky'. Imagine you have a model-system which has a large parameter-frozen language model as its primary component, and some sort of simple RL wrapper giving it goals. It could write and read text from its data storage, writing about hypotheses that it then checks in its data or in its training environment, and recording the results. In theory it seems like the system of self-edited data and frozen model could bootstrap its way to being a smarter overall entity, and then design prompts for its language model component in a directed way to further accelerate this process, and go completely off-the-rails without even a single additional SGD update. Thus, it could 'leave behind' the similarly parameter-frozen auditor model. [Edit: This writing could include the sort of notes one might imagine a human writing for themself, perhaps encoded in a self-created cypher. Or it could be like antagonist samples of tokens in weird out-of-distribution combinations, a sort of self-programming. Or some combination of the two. For an intuition pump for the human note style, imagine an immortal human scientist with a poor memory trapped in a very hard escape room, determined to escape. They'd keep notes on their ideas both for how to solve the huge number of intentional puzzles of the room, and also on exploits they discovered, and also on the processes they used to discover these things. If they thought the maker of the room would spy on their notes and adjust the difficulty of

[-]johnswentworth3yΩ266120

Imaginary John: Well, uh, these days I'm mostly focusing on using my flimsy non-mastered grasp of the common-concept format to try to give a descriptive account of human values, because for some reason that's where I think the hope is. So I'm not actually working too much on this thing that you think takes a swing at the real problem (although I do flirt with it occasionally).

That's not actually what I spend most of my time on, it's just a thing which came up in conversation with Eliezer that one time. I've never actually spent much time on a descriptive account of human values; I generally try to work on things which are bottlenecks to a wide variety of strategies (i.e. convergent hard subproblems), not things which are narrowly targeted to a single strategy.

What I'm actually spending most of my time on right now is figuring out how abstractions end up represented in cognitive systems, and how those representations correspond to structures (presumably natural abstractions) in the environment. In particular, I'd like to say things about convergent representations, such that we can both (a) test the claims on a wide variety of existing systems, and (b) have theorems saying that the claims extend to new kinds of systems.

... which, amusingly, looks like a much more ambitious version of interpretability work.

1wassname2y

I know this is a necro bump, but could you describe the ambitious interp work you have in mind? Perhaps something like a probe that can detect helpfulness with >90% accuracy, and that works on other models without retraining, once we calibrate to a couple of unrelated concepts.

[-]Not Relevant3y*4416

I want to highlight a part of your model which I think is making the problem much harder: the belief that at some point we will stop teaching models via gradient descent, and instead do something closer to lifelong in-episode learning.

Nate: This seems to me like it's implicitly assuming that all of the system's cognitive gains come from the training. Like, with every gradient step, we are dragging the system one iota closer to being capable, and also one iota closer to being good, or something like that.

To which I say: I expect many of the cognitive gains to come from elsewhere, much as a huge number of the modern capabilities of humans are encoded in their culture and their textbooks rather than in their genomes. Because there are slopes in capabilities-space that an intelligence can snowball down, picking up lots of cognitive gains, but not alignment, along the way.

I agree that given an unaligned AGI with no defined objective, which then starts learning via an optimization process whose target we cannot understand (like non-GD learning), the fact that “in our outer GD loop our gradients were aligned with human feedback” (Vivek’s scheme) is not very comforting.

But there are... (read more)

[-]Eli Tyre3y118

This comment seems to me to be pointing at something very important which I had not hitherto grasped.

My (shitty) summary:

There's a big difference between gains from improving the architecture / abilities of a system (the genome, for human agents) and gains from increasing knowledge developed over the course of an episode (or lifetime). In particular they might differ in how easy it is to "get the alignment in".

If the AGI is doing consequentialist reasoning while it is still mostly getting gains from gradient descent as opposed to from knowledge collected over an episode, then we have more ability to steer its trajectory.

[-]Rohin Shah3y*Ω17437

My guess at part of your views:

There's ~one natural structure for capabilities, such that (assuming we don't have deep mastery of intelligence) nearly anything we build that is an AGI will have that structure.
Given this, there will be a point where an AI system switches from everything-muddled-in-a-soup to clean capabilities and muddled alignment (the "sharp left turn").

I basically agree that the plans I consider don't engage much with this sort of scenario. This is mostly because I don't expect this scenario and so I'm trying to solve the alignment problem in the worlds I do expect.

(For the reader: I am not saying "we're screwed if the sharp left turn happens so we should ignore it", I am saying that the sharp left turn is unlikely.)

A consequence is that I care a lot about knowing whether the sharp left turn is actually likely. Unfortunately so far I have found it pretty hard to understand why exactly you and Eliezer find it so likely. I think current SOTA on this disagreement is this post and I'd be keen on more work along those lines.

Some commentary on the conversation with me:

Imaginary Richard/Rohin: You seem awfully confident in this sharp left turn thing. And that the go

... (read more)

[-]Quadratic Reciprocity3yΩ17423

As someone with limited knowledge of AI or alignment, I found this post accessible. There were times when I thought I knew vaguely what Nate meant but would not be able to explain it so I'm recording my confusions here to come back to when I've read up more. (If anyone wants to answer any of these r/NoStupidQuestions questions, that would be very helpful too).

1. "Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent". This is something that comes up in response to a few of the plans. Is the idea that during training, for advanced enough AIs, capabilities gains come both from gradient descent and from processing input / interacting with the world? Or does the second part happen only after training has finished? What does that concretely look like in ML?
2. Is a lot of the disagreement about these plans just because others find the idea of a "sharp left turn" less likely than Nate does, or is there more agreement about that idea, with the disagreement being about what proposals might give us a shot at solving it?
3. What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like b

... (read more)

[-]johnswentworth3yΩ122510

What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like besides just trying harder at interpretability?

Some key pieces...

Desideratum 1: we need to aim for some kind of interpretability which will carry over across architectural/training paradigm changes, internal ontology shifts at runtime, etc. The tools need to work without needing a lot of new investment every time there's a big change.

In my own approach, that's what Selection Theorems would give us: theorems which characterize certain interpretable internal structures as instrumentally convergent across a wide range of architecture/internal ontology.

Desideratum 2: we need to be able to robustly tie the internal structures identified to some kind of high-level human-interpretable "things". The "things" could be mathematical, like e.g. we might aim to robustly recognize embedded search processes or embedded world models. Or, the "things" could be real-world things, like e.g. we might aim to robustly recognize embedded representations of natural abstractions from the environment (and the natural abstractions in the environment to which the representations correspond). Ei... (read more)

[-]Steven Byrnes3yΩ8233

For 1—In humans, there’s the distinction between evolution-as-a-learning-algorithm versus within-lifetime learning. There’s some difference of opinion about which of those two slots will be occupied by the PyTorch code comprising our future AGI—the RFLO model says that this code will be doing something analogous to evolution, I say it will be doing something analogous to within-lifetime learning, see my discussion here.

My impression (from their writings) is that Nate & Eliezer are firmly in the former RFLO/evolution camp. If that’s your picture, then within-lifetime learning is a thing that happens inside a learned black box, and thus it’s a big step removed from the gradient descent (imagine: the outer-loop evolution-like gradient descent tweaks the weights, then the trained model thinks and acts and learns and grows and plans for a billion subjective seconds, then the outer-loop evolution-like gradient descent tweaks the weights, then the trained model thinks and acts and learns and grows and plans for a billion subjective seconds…). Then a “sharp left turn” could happen between gradient-descent steps, for example.
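To make the two-loop picture in the paragraph above concrete, here is a toy sketch in plain Python. It is only an illustration under loud assumptions: `ToyAgent`, `run_episode`, `outer_loop`, and the additive "capability" score are all hypothetical names invented here, the `weight += lr` line is a stand-in for a gradient step rather than real gradient descent, and nothing in it is taken from the RFLO paper or from any actual training setup.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyAgent:
    """Hypothetical agent with two separate sources of capability."""
    weight: float = 0.0                             # changed only by the outer loop
    notes: List[str] = field(default_factory=list)  # changed only inside an episode

    def capability(self) -> float:
        # Toy assumption: capability is weights plus in-episode knowledge.
        return self.weight + len(self.notes)

    def run_episode(self, steps: int) -> List[float]:
        """Inner loop: the agent accumulates notes while its weight stays frozen."""
        self.notes = []  # each episode starts with an empty memory
        trace = []
        for _ in range(steps):
            self.notes.append("observation")
            trace.append(self.capability())
        return trace


def outer_loop(agent: ToyAgent, episodes: int, steps_per_episode: int,
               lr: float = 0.1) -> List[List[float]]:
    """Outer loop: one small weight update between long episodes."""
    history = []
    for _ in range(episodes):
        history.append(agent.run_episode(steps_per_episode))
        agent.weight += lr  # stand-in for a single gradient step
    return history


if __name__ == "__main__":
    agent = ToyAgent()
    for i, trace in enumerate(outer_loop(agent, episodes=3, steps_per_episode=5)):
        print(f"episode {i}: capability went from {trace[0]} to {trace[-1]}")
```

In this toy, almost all of the change in the capability score happens inside an episode, between two outer-loop updates, which is the sense in which a sudden change "could happen between gradient-descent steps". Whether real systems look anything like this is exactly what the thread is disputing.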

In my model, the human-written AGI PyTorch code is instead ana... (read more)

[-]Ramana Kumar3yΩ7173

For 2, I think a lot of it is finding the "sharp left turn" idea unlikely. I think trying to get agreement on that question would be valuable.

For 4, some of the arguments for it in this post (and comments) may help.

For 3, I'd be interested in there being some more investigation into and explanation of what "interpretability" is supposed to achieve (ideally with some technical desiderata). I think this might end up looking like agency foundations if done right.

For example, I'm particularly interested in how "interpretability" is supposed to work if, in some sense, much of the action of planning and achieving some outcome occurs far away from the code or neural network that played some role in precipitating it. E.g., one NN-based system convinces another more capable system to do something (including figuring out how); or an AI builds some successor AIs that go on to do most of the thinking required to get something done. What should "interpretability" do for us in these cases, assuming we only have access to the local system?

4Charlie Steiner3y

I think the upvotes, without answers, mean that other people are also interested in hearing Nate's clarifications on these questions, particularly #1. 2 is a mixture of both - examples will hopefully come as people comment their disagreements. Ambitiousness in interpretability can look like greater generalization to never-before-seen architectures, especially automated generalization that doesn't strictly need human intervention. It can also look like robustly being able to use interpretability tools to provide oversight to training, e.g. as "thought assessors." I bet people more focused on interpretability have more ideas.

4Rob Bensinger3y

(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I'm extra excited about people coming up with ideas for 3.)

[-]Richard_Ngo3y*Ω17312

Thanks for the post, I agree with a lot of it. A few quick comments on your dialogue with imaginary me/Rohin, which highlight the main points of disagreement:

And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

More accurate to say "I don't see why you're so confident". I think I see why you're worried, and I'm worried too for the same reasons. Indeed, I wrote a similar post recently which lists out research directions and reasons why I don't expect them to solve the problem if it turns out to be hard. So in general you should probably put me down as having a reasonable amount of credence (20%?) on your view, but also considering many other possibilities plausible.

Nate: I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you're putting most of your hope on small clever ideas that I can already see would fail.

The ideas that come out of left field are generally the ones you haven't considered yet, that's what it means for them to come out of left field... (read more)

[-]Stuart_Armstrong3yΩ12307

Hey, thanks for posting this!

And I apologise - I seem to have again failed to communicate what we're doing here :-(

"Get the AI to ask for labels on ambiguous data"

Having the AI ask is a minor aspect of our current methods, that I've repeatedly tried to de-emphasise (though it does turn out to have an unexpected connection with interpretability). What we're trying to do is:

Get the AI to generate candidate extrapolations of its reward data, that include human-survivable candidates.
Select among these candidates to get a human-survivable ultimate reward function.

Possible selection processes include being conservative (see here for how that might work: https://www.lesswrong.com/posts/PADPJ3xac5ogjEGwA/defeating-goodhart-and-the-closest-unblocked-strategy ), asking humans and then extrapolating the process of what human-answering should idealise to (some initial thoughts on this here: https://www.lesswrong.com/posts/BeeirdrMXCPYZwgfj/the-blue-minimising-robot-and-model-splintering), removing some of the candidates on syntactic grounds (e.g. wireheading, which I've written quite a bit on how it might be syntactically defined). There are some other approaches we've been considering... (read more)
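As a rough illustration of the "generate candidates, then select conservatively" shape described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `consistent_candidates` and `conservative_choice`, the toy reward functions, and in particular the choice of a worst-case (maximin) rule as one possible reading of "being conservative". It is not the method from the linked posts.

```python
from typing import Callable, Iterable, List, Tuple

Reward = Callable[[int], float]


def consistent_candidates(candidates: Iterable[Reward],
                          labelled: List[Tuple[int, float]]) -> List[Reward]:
    """Keep only candidate reward functions that reproduce the labelled data."""
    return [c for c in candidates if all(c(x) == r for x, r in labelled)]


def conservative_choice(actions: List[int], candidates: List[Reward]) -> int:
    """Pick the action whose worst-case reward across candidates is highest.

    Assumes `candidates` is non-empty; min() would raise on an empty list.
    """
    return max(actions, key=lambda a: min(c(a) for c in candidates))


# Hypothetical reward data: only actions 0 and 1 have ever been labelled.
LABELLED = [(0, 0.0), (1, 1.0)]

# Hypothetical extrapolations of that data to unseen actions.
CANDIDATES: List[Reward] = [
    lambda x: float(x),                            # "more is always better"
    lambda x: float(x) if x <= 1 else -float(x),   # "going beyond the data is harmful"
    lambda x: float(x * x),                        # another fit to the same data
    lambda x: 5.0,                                 # contradicts the data, gets filtered out
]

ACTIONS = [0, 1, 2, 3]

if __name__ == "__main__":
    kept = consistent_candidates(CANDIDATES, LABELLED)
    print("candidates consistent with the data:", len(kept))
    print("naive choice under the first candidate:", max(ACTIONS, key=kept[0]))
    print("conservative choice across candidates:", conservative_choice(ACTIONS, kept))
```

The point of the toy is only that a selection rule which looks across several extrapolations can refuse an action that one extrapolation rates highly but another rates as harmful. How to generate candidates that reliably include a human-survivable one is the hard part, and the sketch says nothing about it.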

7rgorman3y

Thanks for writing this, Stuart. (For context, the email quote from me used in the dialogue above was written in a different context)

[-]habryka2yΩ12261Review for 2022 Review

I really liked this post in that it seems to me to have tried quite seriously to engage with a bunch of other people's research, in a way that I feel like is quite rare in the field, and something I would like to see more of.

One of the key challenges I see for the rationality/AI-Alignment/EA community is the difficulty of somehow building institutions that are not premised on the quality or tractability of their own work. My current best guess is that the field of AI Alignment has made very little progress in the last few years, which is really not what you might think when you observe the enormous amount of talent, funding and prestige flooding into the space, and the relatively constant refrain of "now that we have cutting edge systems to play around with we are making progress at an unprecedented rate".

It is quite plausible to me that technical AI Alignment research is not a particularly valuable thing to be doing right now. I don't think I have seen much progress, and the dynamics of the field seem to be enshrining an expert class that seems almost ontologically committed to believing that the things they are working on must be good and tractable, because their sala... (read more)

[-]Lukas Finnveden3y*Ω14265

As the main author of the "Alignment"-appendix of the truthful AI paper, it seems worth clarifying: I totally don't think that "train your AI to be truthful" in itself is a plan for how to tackle any central alignment problems. Quoting from the alignment appendix:

While we’ve argued that scaleable truthfulness would constitute significant progress on alignment (and might provide a solution outright), we don’t mean to suggest that truthfulness will sidestep all difficulties that have been identified by alignment researchers. On the contrary, we expect work on scaleable truthfulness to encounter many of those same difficulties, and to benefit from many of the same solutions.

In other words: I don't think we had a novel proposal for how to make truthful AI systems, which tackled the hard bits of alignment. I just meant to say that the hard bits of making truthful A(G)I are similar to the hard bits of making aligned A(G)I.

At least from my own perspective, the truthful AI paper was partly about AI truthfulness maybe being a neat thing to aim for governance-wise (quite apart from the alignment problem), and partly about the idea that research on AI truthfulness could be helpful for alignme... (read more)

[-]Charlie Steiner3yΩ92514

Partisans of the other "hard problem" are also quick to tell people that the things they call research are not in fact targeting the problem at all. (I wonder if it's something about the name...)

Much like the other hard problem, it's easy to get wrapped up in a particular picture of what properties a solution "must" have, and construct boundaries between your hard problem and all those other non-hard problems.

Turning the universe to diamond is a great example. It's totally reasonable that it could be strictly easier to build an AI to turn the world into diamond than it is to build an AI that is superhuman at doing good things, so that anyone claiming to have ideas about the latter should have even better ideas about the former. But that could also not be the case - the most likely way I see this happening is if solving the sharp left turn problem has details that depend on how you want to load the values, and so genuinely hard-problem-addressing work on value learning could nonetheless not be useful for specifying simple goals. (It may only help you get the diamond-universe AI "the hard way" - by doing the entire value learning process except with a different target!)

[-]Zack_M_Davis2y2418Review for 2022 Review

I should acknowledge first that I understand that writing is hard. If the only realistic choice was between this post as it is, and no post at all, then I'm glad we got the post rather than no post.

That said, by the standards I hold my own writing to, I would be embarrassed to publish a post like this which criticizes imaginary paraphrases of researchers, rather than citing and quoting the actual text they've actually published. (The post acknowledges this as a flaw, but if it were me, I wouldn't even publish.) The reason I don't think critics necessarily need to be able to pass an author's Ideological Turing Test is because, as a critic, I can at least be scrupulous in my reasoning about the actual text that the author actually published, even if the stereotype of the author I have in my head is faulty. If I can't produce the quotes to show that I'm not just arguing against a stereotype in my head, then it's not clear why the audience should care.

3Raemon2y

This seems right to me for posts replying to individual authors/topics (and I think this criticism may apply to some other more targeted Nate posts in that vein). But I think for giving his takes on a large breadth of people, making sure each section is well vetted increases the cost by a really prohibitive amount, and I think it's probably better to do it the way Nate did here (clearly establishing the epistemic status of the post, and letting people in the comments argue if he got something wrong). Also, curious if you think there's a particular instance where someone(s) felt misrepresented here? (I just tried doing a skim of the comments, there were a lot of them and the first ones I saw seemed more like arguing with the substance of the disagreement rather than his characterization being wrong. I gave up kinda quickly, but for now, did you recall him getting something wrong here, or just thinking on general principle that one shouldn't err in this direction?)

[-]Mark Xu3yΩ6215

Flagging that I don't think your description of what ELK is trying to do is that accurate, e.g. we explicitly don't think that you can rely on using ELK to ask your AI if it's being deceptive, because it might just not know. In general, we're currently quite comfortable with not understanding a lot of what our AI is "thinking", as long as we can get answers to a particular set of "narrow" questions we think is sufficient to determine how good the consequences of an action are. More in “Narrow” elicitation and why it might be sufficient.

Separately, I think that ELK isn't intended to address the problem you refer to as a "sharp left turn" as I understand it. Vaguely, ELK is intended to be an ingredient in an outer-alignment solution, while it seems like the problem you describe falls roughly into the "inner alignment" camp. More specifically, but still at a high level of gloss, the way I currently see things is:

If you want to train a powerful AI, currently the set of tasks you can train your AI on will, by default, result in your AI murdering you.
Because we currently cannot teach our AIs to be powerful by doing anything except rewarding them for doing things that straightforwardly

... (read more)

9paulfchristiano3y

I think that the sharp left turn is also relevant to ELK, if it leads to your system not generalizing from "questions humans can answer" to "questions humans can't answer." My suspicion is that our key disagreements with Nate are present in the case of solving ELK and are not isolated to handling high-stakes failures. (However it's frustrating to me that I can never pin down Nate or Eliezer on this kind of thing, e.g. are they still pessimistic if there were a low-stakes AI deployment in the sense of this post?)

[-]romeostevensit3yΩ11164

Many proposals seem doomed to me because they involve one or multiple steps where they assume a representation, then try to point to robust relations in the representation and hope they'll hold in the territory. This wouldn't be so bad on its own but when pointed to it seems like handwaving happens rather than something more like conceptual engineering. I am relatively more hopeful about John's approach as being one that doesn't fail to halt and catch fire at these underspecified steps in other plans. In other areas like math and physics we try to get the representation to fall out of the model by sufficiently constraining the model. I would prefer to try to pin down a doomed model than stay in hand wave land because at least in the process of pinning down the doomed model you might get reusable pieces for an eventual non doomed model. Was happy about eg quantilizers for basically the same reason.

[-]Raemon3yΩ9160

Like, even simpler than the problem of an AGI that puts two identical strawberries on a plate and does nothing else, is the problem of an AGI that turns as much of the universe as possible into diamonds. This is easier because, while it still requires that we have some way to direct the system towards a concept of our choosing, we no longer require corrigibility. (Also, "diamond" is a significantly simpler concept than "strawberry" and "cellularly identical".)
It seems to me that we have basically no idea how to do this. We can train the AGI to be pretty good at building diamond-like things across a lot of training environments, but once it takes that sharp left turn, by default, it will wander off and do some other thing, like how humans wandered off and invented birth control.

Is there a writeup of where you expect this to fail? I recall this MIRI newsletter but I think it also just asserted it was hard/impossible.

Is the difficulty just in "it's gonna hijack its own reward function?" or is there more to it than that?

8Thomas Larsen3y

There is also the ontology identification problem. The two biggest things are: we don't know how to specify exactly what a diamond is because we don't know the true base level ontology of the universe. We also don't know how diamonds will be represented in the AI's model of the world. I personally don't expect coding a diamond maximizing AGI to be hard, because I think that "diamond" is a sufficiently natural concept that doing normal gradient descent will extrapolate in the desired way, without inner alignment failures. If the agent discovers more basic physics, e.g. quarks that exist below the molecular level, "diamond" will probably still be a pretty natural concept, just like how "apple" didn't stop being a useful concept after shifting from Newtonian mechanics to QM. Of course, concepts such as human values/corrigibility/whatever are a lot more fragile than diamonds, so this doesn't seem helpful for alignment.

8TurnTrout3y

(Unsure whether to mark "agree" for the first two paragraphs, or "disagree" for the last line. Leaving this comment instead.)

4Signer3y

Marked as “disagree” conditional on you marking “agree”, so you can mark "agree" to accurately express degree of controversy.

5TurnTrout3y

OK, I marked "agree."

5Ben Pace3y

Hm? It's as Nate says in the quote. It's the same type of problem as humans inventing birth-control out of distribution. If you have an alternative proposal for how to build a diamond-maximizer, you can specify that for a response, but the commonly discussed idea of "train on examples of diamonds" will fail at inner-alignment, and it will just optimize diamonds in a particular setting and then elsewhere do crazy other things that look like all kinds of white noise to you. Also "expect this to fail" already seems to jump the gun. Who has a proposal for successfully building an AGI that can do this, other than saying gradient-descent will surprise us with one?

[-]Quintin Pope3yΩ18339

I don't think that "evolution -> human values" is the most useful reference class when trying to calibrate our expectations wrt how outer optimization criteria relate to inner objectives. Evolution didn't directly optimize over our goals. It optimized over our learning process and reward circuitry. Once you condition on a particular human's learning process + reward circuitry configuration + the human's environment, you screen off the influence of evolution on that human's goals. So, there are really two areas from which you can draw evidence about inner (mis)alignment:

"evolution's inclusive genetic fitness criteria -> a human's learned values" (as mediated by evolution's influence over the human's learning process + reward circuitry)
"a particular human's learning process + reward circuitry + "training" environment -> the human's learned values"

The relationship we want to make inferences about is:

"a particular AI's learning process + reward function + training environment -> the AI's learned values"

I think that "AI learning -> AI values" is much more similar to "human learning -> human values" than it is to "evolution -> human values". I grant that you ca... (read more)

2Signer3y

But the main disanalogy in the “human learning → human values” case is that reward circuitry/brain architecture mostly doesn't change? And we would need to find these somehow for AI and that process looks much more like evolution. And prediction of (non-instrumental) inner values is not robust across different reward functions - dogs only work because we already implemented environment-invariant compassion in reward circuitry.

[-]Quintin Pope3y104

But the main disanalogy in the “human learning → human values” case is that reward circuitry/brain architecture mostly doesn't change?

Congenitally blind people end up with human values, despite that:

They’re missing entire chunks of vision-related hard coded rewards.
The entire visual cortex has been repurposed for other goals.
Evolution probably could not have “patched” the value formation process of blind people in the ancestral environment due to the massive fitness disadvantage blindness confers.

Human value formation can’t be that sensitive to delicate parameters of the learning process or reward circuitry.

And we would need to find these somehow for AI and that process looks much more like evolution.

We could learn a reward model from human judgements, train on human judgements directly, finetune a language model, etc. There are many options here.

And prediction of (non-instrumental) inner values is not robust across different reward functions - dogs only work because we already implemented environment-invariant compassion in reward circuitry.

I don’t agree. If you slightly increase the strength of the reward circuits that rewarded the person for interacting with d... (read more)

1mesaoptimizer3y

The most important claim in your comment is that "human learning → human values" is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the "evolution -> human values" perspective. Here's why I disagree: Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution. We see evidence of misalignment. Birth control is a good example of this. A human's environment optimizes a human continually towards a certain objective (that changes given changes in the environment). This human is aligned with the environment's objective in that distribution. Outside that distribution, the human may not be aligned with the objective intended by the environment. An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators. An inner misalignment example of this is a gay male who grows up in an environment hostile to his desires and his identity (but knows of environments where this isn't true). After a few extremely negative reactions to him opening up to people, or expressing his desires, he'll simply decide to present himself as heterosexual and bide his time and gather the power to leave the environment he is in. One may claim that the previous example somehow doesn't count because one's sexual orientation is biologically determined (and I'm assuming this to be the case for this example, even if this may not be entirely true), which means that evolution optimized this particular human for being inner misaligned relative to their environment. However, that doesn't weaken this argument: "human learning -> human values" shows a huge amount of evidence of inner misalignment being ubiquitous. I worry you are being insufficiently pessimistic.

3Logan Riggs3y

There may not be substantial disagreements here. Do you agree with: "a particular human's learning process + reward circuitry + "training" environment -> the human's learned values" is more informative about inner-misalignment than the usual "evolution -> human values" (e.g. two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)? I don't know what you mean by "inner misalignment is easier"? Could you elaborate? I don't think you mean "inner misalignment is more likely to happen" because you then go on to explain inner-misalignment & give an example and say "I worry you are being insufficiently pessimistic." One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given. See: This matches my intuitions.

1mesaoptimizer3y

What I see is that we are taking two different optimizers applying optimizing pressure on a system (evolution and the environment), and then stating that one optimization provides more information about a property of OOD behavior shift than another. This doesn't make sense to me, particularly since I believe that most people live in environments that are very much "in distribution", and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter. My bad; I've updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution's failure at inner alignment is the most significant and informative evidence that inner alignment is hard. I assume you mean that Quintin seems to claim that inner values learned may be retained with increase in capabilities, and that usually people believe that inner values learned may not be retained with increase in capabilities. I believe so too -- inner values seem to be significantly robust to increase in capabilities, especially since one has the option to deceive. Do people really believe that inner values learned don't scale with an increase in capabilities? Perhaps we are defining inner values differently here. By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is. Does that mean that with increase in capabilities, people's inner values shift? Not exactly; it seems to me that we were mistaken about people's inner values instead.

3Logan Riggs3y

I think you're ignoring the [now bolded part] in "a particular human’s learning process + reward circuitry + "training" environment" and just focusing on the environment. Humans very often don't optimize for their reward circuitry in their limbic system. If I gave you a button that killed everyone but maximized your reward circuitry every time you pressed it, most people wouldn't press it (would you?). I do agree that if you pressed the button once, you would then want to press the button again, but not beforehand, which is an inner-misalignment w/ respect to the reward circuitry. Though maybe you'd say the wirehead thing is an extreme case OOD? I agree, but I'm bolding "most people" because you're claiming there exist some people that would retain that value if scaled up(?) I think replace "dog-lover" w/ "family-lover" and there's even more people. But I don't think this is a disagreement between us? Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there's the disconnect (usually misalignment is thought of as bad, and I'm not just mistyping). Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the "a particular human’s learning process + reward circuitry + "training" environment" part, and less on the evolution part. If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.

1mesaoptimizer3y

Yes, thank you: I didn't notice that you were making that assumption. This conversation makes a lot more sense to me now. This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the inner values with that of the creators of the AI and bypassing the outer alignment problem. That is really interesting; I've updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I'm still confused about huge parts of it, but we can discuss it more elsewhere.

2Quintin Pope3y

That's not a claim I made in my comment. It's technically a claim I agree with, but not one I think is particularly important. Humans do seem better aligned to getting reward across distributional shifts than to achieving inclusive genetic fitness across distributional shifts. However, I'll freely agree with you that humans are typically misaligned with maximizing the reward from their outer objectives. I operationalize this as: "After a distributional shift from their learning environment, humans frequently behave in a manner that predictably fails to maximize reward in their new environment, specifically because they continue to implement values they'd acquired from their learning environment which are misaligned to reward maximization in the new environment". Please let me know if you disagree with my operationalization. For example, one way in which humans are inner misaligned is that, if you introduce a human into a new environment which has a button that will wirehead the human (thus maximizing reward in the new environment), but has other consequences that are bad by light of the human's preexisting values (e.g., Logan's example of killing everyone else), most humans won't push the button. The actual claim I made in the comment you're replying to is that there's a predictable relationship between outer optimization criteria and inner values, not that inner values are always aligned with outer optimization criteria. In fact, I'd say we'd be in a pretty bad situation if inner goals reliably orientated towards reward maximization across all environments, because then any sufficiently powerful AGI would most likely wirehead once it was able to do so.

[-]Zack_M_Davis3y2615

This isn't addressing straw-Ngo/Shah's objection? Yes, evolution optimized for fitness, and got adaptation-executors that invent birth control because they care about things that correlated with fitness in the environment of evolutionary adaptedness, and don't care about fitness itself. The generalization from evolution's "loss function" alone, to modern human behavior, is terrible and looks like all kinds of white noise.

But the generalization from behavior in the environment of evolutionary adaptedness, to modern human behavior is ... actually pretty good? Humans in the EEA told stories, made friends, ate food, &c., and modern humans do those things, too. There are a lot of quirks (like limited wireheading in the form of drugs, candy, and pornography), but it's far from white noise. AI designers aren't in the position of "evolution" "trying" to build fitness-maximizers, because they also get to choose the training data or "EEA"—and in that context, the analogy to evolution makes it look like some degree of "correct" goal generalization outside of the training environment is a thing?

Obviously, the conclusion here is not, "And therefore everything will be fine and we have nothin... (read more)

8Quintin Pope3y

I strongly agree. I think there are vastly better sources of evidence on how inner goals relate to outer selection criteria than "evolution -> human values", and that most of those better sources of evidence paint a much more optimistic picture of how things are likely to go with AI. I think there are lots of reasons to be skeptical of using "evolution -> human values" as an informative reference class for AIs, some of which I've described in my reply to Ben's comment.

4Raemon3y

I don't think the usual arguments apply as obviously here. "Maximal Diamond" is much simpler than most other optimization targets. It seems much easier to solve outer-alignment for – Diamond was chosen because it's a really simple molecule configuration to specify, and that just seems to be a pretty different scenario than most of the ones I've seen more detailed arguments for. I'm partly confused about the phrasing "we have no idea how to do this." (which is stronger than "we don't currently have a plan for how to do this.") But in the interests of actually trying to answer this sort of thing for myself instead of asking Nate/Eliezer to explain why it doesn't work, let me think through my own proposal of how I'd go about solving the problem, and see if I can think of obvious holes. Problems currently known to me: 1. Reward hijacking 2. Point 19 in List of Lethalities ("there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment"). 3. Ontological updating (i.e. what exactly is a diamond?) 4. New to me from this post: the most important capabilities advances may come from an inner process that isn't actually coupled to the reinforcement learning system. (I didn't really get this until reading this post and haven't finished thinking through the concept) Main ingredients I'm imagining: (disclaimer: I'm a layman making a lot of informed guesses, wouldn't be surprised it First, Get a general agent, with limitations to prevent immediate fooming. Get to general intelligence via something like DeepMind's General Agents, this time starting from a language model that benefits from a lot of human concepts. My current belief is that you'd need to solve some major efficiency issues to do this with a reasonable amount of compute. If you have a Jupiter brain (as originally stipulated) I'm not sure it even requires new advances. (May

2Ulisse Mini3y

I think even without point #4 you don't necessarily get an AI maximizing diamonds. Heuristically, it feels to me like you're bulldozing open problems without understanding them (e.g. ontology identification by training with multiple models of physics, getting it not to reward-hack by explicit training, etc.), all of which are vulnerable to a deceptively aligned model (just wait till you're out of training to reward-hack). Also, every time you say "train it by X so it learns Y" you're assuming alignment (e.g. "digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion"). IMO shard theory provides a great frame to think about this in; it's a must-read for improving alignment intuitions.

[-]Thomas Kwa3y*150

What form of plan would address the hard bits of the alignment challenge?

[-]Nathan Helm-Burger3y151

I'm a fairly new alignment researcher, although I've been following LessWrong and MIRI's writing, etc for many years.

My approach so far is trying to understand and use interpretability tools (e.g. learning from Olah's circuits team & Buck at Redwood Research, and others who aren't themselves trying to tackle the 'hard problem' but are coming up with useful stuff like ROME https://rome.baulab.info/ ) to find and control natural abstractions (à la John Wentworth) in a variety of model architectures (starting with transformers).

An important additional aspect is that I think it is plausible to edit a variety of architectures in some generalizable ways to partially separate them into 'modules', based around natural dividing points (from natural abstractions). Something like a hybrid between a singular black box and a mixture of experts. I'd expect this refactoring to result in some 'alignment tax' on capabilities, but not as extreme as going to full silo-ed narrow experts.

One of the goals along the way should be to try to fundamentally understand general intelligence, to be able to better detect and control it.

My main self-criticism: too slow. I think my approach has a pr... (read more)

[-]Adam Zerner3y145

But if I was saying that about a hundred pretty-uncorrelated agendas being pursued by two hundred people, I'd start to think that maybe the odds are in our favor.

Wait a minute − excuse my naïveté, but that doesn't seem that hard!

I assume it is though and thus ask: why is it? Is it that hard to come up with such agendas? What if we had $100B to pay people and/or set bounties with?

[-]Lucius Bushnaq3y*133

Epistemic status: Story. I am just assuming that my current guesses for the answers to outstanding research questions are true here. Which I don't think they are. They're not entangled enough with actual data yet for that to be the case. This is just trying to motivate why I think those are the right kinds of things to research.

Figure out how to measure and plot information flows in ML systems. Develop an understanding of abstractions, natural or otherwise, and how they are embedded in ML systems as information processing modules.

Use these tools to find out how ML systems embed things like subagents, world models, and goals, how they interlink, and how they form during training. I’m still talking about systems like current reinforcement learners/transformer models or things not far removed from them here.

With some better idea of what “goals” in ML systems even look like, formalise these concepts, and find selection theorems that tell you, rigorously, which goals a given loss function used by the outer optimiser will select for. I suspect that in dumb systems, this is (or could be made) pretty predictable and robust to medium sized changes in the loss function, architecture, o... (read more)

[-]evhub3yΩ8102

But maybe I just don't understand this proposal yet (and I have had some trouble distilling things I recognize as plans out of Evan's writing, so far).

Maybe this and this will help.

[-]Ruby3yΩ484

Curated. I could imagine a world where different people pursue different agendas in a “live and let live” way, with no one wanting to be too critical of anyone else. I think that’s a world where many people could waste a lot of time with nothing prompting them to reconsider. I think posts like this one give us a chance to avoid scenarios like that. And posts like this can spur discussion of the higher-level approaches/intuitions that spawn more object-level research agendas. The top comments here by Paul Christiano, John Wentworth, and others are a great i... (read more)

[-]Davidmanheim2yΩ373

Just noting that given more recent developments than this post, we should be majorly updating on recent progress towards Andrew Critch's strategy. (Still not more likely than not to succeed, but we still need to assign some Bayes points to Critch, and take some away from Nate.)

4Noosphere892y

I'd probably have made way bigger updates than that, but why should we update towards Critch's strategy working, exactly?

2the gears to ascension2y

I've missed one or more facts that link those threads, can you tell me which part of Critch's approach stands out to you in this context and what made it do so? I agree that his is some of my favorite work, but it's not yet obvious to me that it actually checks all the boxes, in particular whether humans deploying AI will have values that directly contract usage of his insights. It also still isn't clear to me whether anything in his work helps with is/ought.

2Davidmanheim2y

I was referring to his promotion of political approaches, which is what this post discussed, and which Eliezer has recently said is the best hope for avoiding doom, even if he's still very pessimistic about it. His alignment work is a different question, and I don't feel particularly qualified to weigh in on it.

[-]Steven Byrnes3yΩ660

If it helps, I have a discussion of Concept Extrapolation in the context of aligning a real-deal agent-y AGI in §14.4 here.

So far I can’t quite get the whole story to hang together, as you’ll see from that link. But I definitely see it as a “shot on goal”. (Well, at least, I think the broader project / framework is a “shot on goal”. I don’t find the image classification project to be directly addressing any of my most burning questions.)

[-]Beth Barnes3yΩ250

Thanks for the post! One narrow point:
You seem to lean at least a bit on the example of 'much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner'. It seems to me that
a. You don't need to go to humans before you get significant accumulation of important cultural knowledge outside genes (e.g. my understanding is that unacculturated chimps die in the wild)
b. the genetic bottleneck is a somewhat weird and contingent feature of animal evolution, and I don't think the... (read more)

3gwern3y

(Is that just because they get attacked and killed by other chimp groups?)

2Beth Barnes3y

My impression is that they don't have the skills needed for successful foraging. There's a lot of evidence for some degree of cultural accumulation in apes and e.g. macaques. But I haven't looked into this specific claim super closely.

[-][anonymous]3y51

I was a bit confused about this quote, so I tried to expand on the ideas a bit. I'm posting it here in case anyone benefits from it or disagrees.

To which I say: I expect many of the cognitive gains to come from elsewhere, much as a huge number of the modern capabilities of humans are encoded in their culture and their textbooks rather than in their genomes. Because there are slopes in capabilities-space that an intelligence can snowball down, picking up lots of cognitive gains, but not alignment, along the way.

I guess saying is saying that an AI will devel... (read more)

[-]Jan_Kulveit3y*Ω140

<sociology of AI safety rant>

So, if an Everett-branches traveller told me "well, you know, MIRI folks had the best intentions, but in your branch, made the field pay attention to unproductive directions, and this made your civilization more confused and alignment harder" and I had to guess "how?", one of the top choices would be ongoing strawmanning and misrepresentation of Eric Drexler's ideas.

</rant>

To me, CAIS thinking seems quite different from the description in the op.

Some statements, without much justification/proof

- Modularity is a pr... (read more)

[-]Lauro Langosco3yΩ440

What would make you change your mind about robustness of behavior (or interpretability of internal representations) through the sharp left turn? Or about the existence of such a sharp left turn, as opposed to smooth scaling of ability to learn in-context?

For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?

4Rob Bensinger3y

From A central AI alignment problem: capabilities generalization, and the sharp left turn:

4aog3y

I'm having trouble understanding the argument for why a "sharp left turn" would be likely. Here's my understanding of the candidate reasons; I'd appreciate any missing considerations:
* Inner Misalignment
  * AGI will contain optimizers. Those optimizers will not necessarily optimize for the base objective used by the outer optimization process implemented by humans. This AGI could still achieve the outer goal in training, but once deployed it could competently pursue its incorrect inner goal. Deceptive inner alignment is a special case, where we don't realize the inner goal is misaligned until deployment because the system realizes it's in training and deliberately optimizes for the outer goal until deployment. See Risks from Learned Optimization and Goal Misgeneralization in Deep Reinforcement Learning.
  * Question: If inner optimization is the cause of the sharp left turn, why does Nate focus on failure modes that only arise once we've built AGI? We already have examples of inner misalignment, and I'd expect we can work on solving inner misalignment in current systems.
* Wireheading / Power Seeking
  * AGI might try to exercise extreme control over its reward signal, either by hacking directly into the technical system providing its reward or by seeking power in the world to better achieve its rewards. These might be more important problems when systems are more intelligent and can more successfully execute these strategies.
  * Question: These problems are observable and addressable in systems today. See wireheading and conservative agency. Why focus on the unique AGI case?
* Capabilities Discontinuities
  * The fast-takeoff hypothesis. This could be caused by recursive self-improvement, but the more popular justification seems to be that intelligence is fundamentally simple in some way and will be understood very quickly once AI reaches a critical threshold. This seems closely related to the idea that "capabilities fall into an attractor and alig

2Ben Amitay2y

Learning without Gradient Descent - Now it is much easier to imagine learning without gradient descent. An LLM can add into its context or even save into a database knowledge, meta-cognitive strategies, code, etc. It is very similar to value change due to inner misalignment or self improvement, except it is not literally inside the model but inside its extended cognition.
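
To make the "extended cognition" point concrete, here is a minimal toy sketch, not anyone's actual system. `FrozenModel` and `ExtendedAgent` are hypothetical names invented for illustration, and the "model" is a trivial lookup rather than an LLM. The only thing it is meant to show is that behavior can change through writes to an external store while the model's parameters stay untouched.

```python
from dataclasses import dataclass, field


@dataclass
class FrozenModel:
    """Hypothetical stand-in for an LLM whose weights never change."""
    weights: tuple = (1.0, 2.0, 3.0)  # never modified after construction

    def answer(self, question: str, context: list[str]) -> str:
        # The toy "model" can only answer from what is in its context.
        for note in context:
            key, _, value = note.partition("=")
            if key == question:
                return value
        return "unknown"


@dataclass
class ExtendedAgent:
    """Frozen model plus a persistent external store (assumed design)."""
    model: FrozenModel
    memory: dict[str, str] = field(default_factory=dict)

    def learn(self, key: str, value: str) -> None:
        # "Learning" here is a database write, not a gradient step.
        self.memory[key] = value

    def ask(self, question: str) -> str:
        # Everything in the store is placed into the model's context.
        context = [f"{k}={v}" for k, v in self.memory.items()]
        return self.model.answer(question, context)


agent = ExtendedAgent(FrozenModel())
before = agent.ask("capital_of_france")
agent.learn("capital_of_france", "Paris")
after = agent.ask("capital_of_france")
print(before, after, agent.model.weights)
```

The agent's answer differs before and after the write even though `weights` is identical, which is the sense in which the change lives outside the model.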

1Lauro Langosco3y

Thanks!

[-]interstice3y40

By "superbabies", do you mean genetically engineering high intelligence?

7Rob Bensinger3y

Yep!

4Rubi J. Hudson3y

Or more generally increasing intelligence, for example through smart drugs or brain-computer interfaces.

4Jeffrey Ladish3y

I'm a little surprised that I don't see more discussion of ways that higher bandwidth brain-computer interfaces might help, e.g. Neuralink or equivalent. Like it sounds difficult but do people feel really confident it won't work? Seems like if it could work it might be achievable on much faster timescales than superbabies.

[-]TekhneMakre3y20

Second, it doesn't alleviate enough pressure; the bureaucrats can't tell real solutions from bad ones; the cost to build an unaligned AGI drops each year; etc., etc.

And making AI researchers answerable to some centralized entity puts a big target on that entity as a thing for sociopaths (broadly construed) to corrupt / capture, and the more that AI researchers are living under the thumb of a corrupt / captured entity, the more they're misaligned with their own values and less likely to be sane.

[-]Logan Zoellner3y2-3

Like, I see this plan as basically saying "yep, that hard problem is in fact too hard, let's try to dodge it, by having humans + narrow AI services perform the pivotal act". Setting aside how I don't particularly expect this to work, we can at least hopefully agree that it's attempting to route around the problems that seem to me to be central, rather than attempting to solve them.

I think you're being overly critical of this approach. We can build pretty useful AI without getting anywhere near your "sharp left turn". For example, the "Strawberry prob... (read more)

7jessicata3y

To expand on strawberries vs diamonds:

It seems to me that the strawberry problem is likely easier than the "turn the universe into diamond" problem. Immediate reasons:

* the strawberry problem is bounded in space and time
* strawberry materials can be conveniently placed close to the strawberry factory
* turning the universe into diamond requires nanobots to burrow through a variety of materials
* turning the universe into diamond requires overcoming all territorial adversaries trying to protect themselves from nanobots
* turning the universe into diamond requires not sabotaging the nanobots' energy and other resources in the process, whereas the strawberry factory can be separated from the strawberries
* turning the universe into diamond is more likely to run into arcane physics (places where our current physics theories are wrong or incomplete, e.g. black holes)

In more detail, here's how a strawberry nanofactory might work:

* a human thinks about how to design nanotech, what open problems there are, what modular components to factor the problem into
* an AI system solves some of these problems, designing components that pass a wide variety of test cases; some test cases are in physical simulation and some are real-world small cases (e.g. scanning a small cluster of cells). There might also be some mathematical proofs that the components satisfy certain properties under certain assumptions about physics.
* one of these components is for creating the initial nanobots from cells. Nanotech engineers can think about what sub-problems there are (e.g. protein folding) and have AI systems help solve these problems.
* one of these components is for scanning a strawberry. The nanobots should burrow into the strawberry bit by bit, take sensory readings sent to a computer.
* one of these components is for inferring the strawberry structure from readings. This can be approximate Bayesian inference (like a diffusion model in voxel space), given that there are e

2Rob Bensinger3y

? Where does he seem positive about ELK?

[-]tailcalled3y21

On reflection, I can see how what you talk about is a hard part. I haven't focused much on it, because I've seen it as a capabilities question (much of it seems to boil down to an outer optimizer not being capable enough to achieve inner alignment, which depends on the capabilities of the outer optimizer), but it may be quite worthwhile for alignment researchers to spend time on this.

However, I don't think it's the (only) hard part. Even if we can turn the world into diamondoid or create two identical strawberries, there's still the equall... (read more)

[-]MSRayne3y2-1

This doesn't directly fight the hard problem, but it could make it easier: what do you think of putting massive effort into developing BCIs, and then using them to link the brains of people working on the problem together so that they could share understanding or even merge as a single superintelligence able to work on the problem more effectively? Obviously there's a tremendous amount of unknowns about how that would work, but I think it's plausible, and although Neuralink is taking its time, Openwater's planned wearable BCI could get there faster if they... (read more)

5Quintin Pope3y

I don’t think this works. Deep cognition seems like it’s strongly limited by the transfer rate of the interconnect between the cognitive elements involved, and current BCI is very far from approaching the information transfer rate within the brain.

1MSRayne3y

Yeah, that's the main concern. I don't know enough about this to know how plausible it is, but it feels like something worth looking into anyway.

[-]tristanhaze3y10

Very interesting. I'm stuck on the argument about truthfulness being hard because the concept of truth is somehow fraught or too complicated. I'm envisaging an objection based on the T-schema ('<p> is true iff p').

Nate writes:

Now, in real life, building a truthful AGI is much harder than building a diamond optimizer, because 'truth' is a concept that's much more fraught than 'diamond'. (To see this, observe that the definition of "truth" routes through tricky concepts like "ways the AI communicated with the operators" and "the mental state of the ope... (read more)

[-]Kabir Kumar2y-20Review for 2022 Review

Extremely important

^{^}

I ran a few of the dialogs past the relevant people, but that has empirically dragged out the amount of time it takes this post to publish, and I have a handful of other posts to publish afterwards, so I neglected to get feedback from most of the people mentioned. Sorry.

^{^}

Much of Vanessa, Scott, etc.'s work does look to me like it is grappling with confusions related to the problem of aiming minds in theory, and if their research succeeds according to their own lights then I would expect to have a better understanding of how to aim minds in general, even ones that had undergone some sort of "sharp left turn".

Which is not to say that I’m optimistic about whether any of these plans will succeed by their own lights. Regardless, they get points for taking a swing, and the thing I’m mostly advocating for is that more people take swings at this problem at all, not that we filter strongly on my optimism about specific angles of attack.

I tried to solve the problem myself for a few years, and failed. Turns out I wasn't all that good at it.

Maybe I'll be able to do better next time, and I poke at it every so often. (Even though in my mainline prediction, we won’t have the time to complete the sort of research paths that I can see and that I think have any chance of working.)

MIRI funds or offers-to-fund most every researcher who I see as having this "their work would help with the generalization problem if they succeeded" property and as doing novel, nontrivial work, so it's no coincidence that I feel more positive about Vanessa, etc.'s work. But I'd like to see far more attempts to solve this problem than the field is currently marshaling.

^{^}

Again, to be clear, it's nice to have some people trying to route around the hard problems wholesale. But I don't count such attempts as attacks on the problem itself. (I'm also not optimistic about any attempts I have yet seen to dodge the problem, but that's a digression from today's topic.)

^{^}

I couldn't understand Stuart's views from what he's written publicly, so I ran this section by Stuart and Rebecca, who requested that I use actual quotes instead of my attempted paraphrasings. If I'd had more time, I'd like to have run all the dialogs by the researchers I mentioned in this post, and iterated until I could pass everyone's ideological Turing Test, as opposed to the current awkward set-up where the people that I thought I understood didn't get as much chance for feedback. But the time delay from editing this one section is evidence that this wouldn't be worth the time burnt. Instead, I hope the comments can correct any mischaracterizations on my part.

^{^}

Note also that while having the AI ask for clarification in the face of ambiguity is nice and helpful, it is of course far from autonomous-AGI-grade.

^{^}

I specifically see:

* ~3 MIRI-supported research approaches that are trying to attack a chunk of the hard problem (with a caveat that I think the relevant chunks are too small and progress is too slow for this to increase humanity's odds of success by much).
* ~1 other research approach that could maybe help address the core difficulty if it succeeds wildly more than I currently expect it to succeed (albeit no one is currently spending much time on this research approach): Natural Abstractions. Maybe 2, if you count sufficiently ambitious interpretability work.
* ~2 research approaches that mostly don't help address the core difficulty (unless perhaps more ambitious versions of those proposals are developed, and the ambitious versions wildly succeed), but might provide small safety boosts on the mainline if other research addresses the core difficulty: Concept Extrapolation, and current interpretability work (with a caveat that sufficiently ambitious interpretability work would seem more promising to me than this).
* 9+ approaches that appear to me to be either assuming away what look to me like the key problems, or hoping that we can do other things that allow us to avoid facing the problem: Truthful AI, ELK, AI Services, Evan's approach, the Richard/Rohin meta-approach, Vivek's approach, Critch's approach, superbabies, and the "maybe there is a pretty wide attractor basin around my own values" idea.

^{^}

I rate "interpretability succeeds so wildly that we can understand and aim one of the first AGIs" as probably a bit more plausible than "natural abstractions are so natural that, by understanding them, we can practically find concepts-worth-optimizing-for in an AGI". Both seem very unlikely to me, though they meet my bar for “deserving of a serious effort by humanity” in case they work out.

Reactions to specific plans

Owen Cotton-Barratt & Truthful AI

Ryan Greenblatt & Eliciting Latent Knowledge

Eric Drexler & AI Services

Evan Hubinger, in a recent personal conversation

A fairly straw version of someone with technical intuitions like Richard Ngo’s or Rohin Shah’s

Another recent proposal

Vivek Hebbar, summarized (perhaps poorly) from last time we spoke of this in person

John Wentworth & Natural Abstractions

Neel Nanda & Theories of Impact for Interpretability

Stuart Armstrong & Concept Extrapolation

Andrew Critch & political solutions

What about superbabies?

What about other MIRI people?

High-level view