However, I will be unable to make confident statements about how it'd perform on specific pivotal tasks until I complete further analysis.
Yeah this part seems like a big potential downside, in combination with the "Good plans may also require precision." problem.
Take for example the task of designing a successor AGI that is aligned; it seems like the code for this could end up being pretty intricate and complex, leading to the following failures: adding noise makes it very hard to reconstruct functional code (and slight mistakes could be catastrophic...
Strong upvote because I want to signal boost this paper, though I think "It provides some evidence against the idea that "understanding is discontinuous"" is too strong and this is actually very weak evidence.
Main ideas:
Emergent abilities, defined as being sharp and unpredictable, sometimes go away when we adopt different measurement techniques, or at least they become meaningfully less sharp and unpredictable.
Changing from non-linear/discontinuous metrics (e.g., Accuracy, Multiple Choice Grade) to linear/continuous metrics (e.g., Token Edit Di...
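To make the metric point concrete, here is a toy sketch of my own (not from the paper, and the numbers are made up): if per-token correctness improves smoothly with scale, an exact-match style metric over a multi-token answer looks sharp and "emergent", while a token-level metric tracks the smooth underlying trend.

```python
# Toy illustration (my own, made-up numbers): a smooth per-token improvement looks
# "emergent" under exact-match accuracy but gradual under a token-level metric.
import numpy as np

answer_len = 10                                     # target answers are 10 tokens long
log_compute = np.linspace(18, 24, 13)               # hypothetical training-compute scales (log10)
p_token = 1 / (1 + np.exp(-(log_compute - 21)))     # per-token accuracy improves smoothly

exact_match = p_token ** answer_len                 # nonlinear metric: all tokens must be right
expected_wrong = answer_len * (1 - p_token)         # rough proxy for a linear, edit-distance-like metric

for c, em, wrong in zip(log_compute, exact_match, expected_wrong):
    print(f"log10(compute)={c:4.1f}  exact_match={em:.3f}  wrong_tokens~={wrong:.2f}")
```

The underlying quantity changes smoothly across the whole range; only the all-or-nothing metric produces the apparent discontinuity.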
I agree that Lex's audience is not representative. I also think this is the biggest sample size poll on the topic that I've seen by at least 1 OOM, which counts for a fair amount. Perhaps my wording was wrong.
I think what is implied by the first half of the Anthropic quote is much more than 10% on AGI in the next decade. I included the second part to avoid strongly selective quoting. It seems to me that saying >10% is mainly a PR-style thing to do to avoid seeming too weird or confident; after all, it is compatible with both 15% and 90%. When I read the ...
My attempt at a summary: Let's fine-tune language models on stories of an AI Guardian which shuts down when it becomes super powerful. We'll then get our LLM to role-play as such a character so it is amenable to being shut down. Corrigibility solved. Outer alignment pretty much solved. Inner alignment unclear.
My comment is blunt, apologies.
I think this alignment plan is very unlikely to be useful. It feels similar to RLHF in that it centers around fine-tuning language models to better produce text humans like, but it is worse in that it is far less st...
In this interview from July 1st 2022, Demis says the following (context is about AI consciousness and whether we should always treat AIs as tools, but it might shed some light on deployment decisions for LLMs; emphasis mine):
...we've always had sort of these ethical considerations as fundamental at deepmind um and my current thinking on the language models is and and large models is they're not ready; we don't understand them well enough yet — um and you know in terms of analysis tools and and guard rails what they can and can't do and so on — to deploy them
It looks like you haven't yet replied to the comments on your post. The thing you are proposing is not obviously good, and in fact might be quite bad. I think you probably should not be doing this outreach just yet, with your current plan and current level of understanding. I dislike telling people what to do, but I don't want you to make things worse. Maybe start by engaging with the comments on your post.
Thanks for clarifying!
I expect such a system would run into subsystem alignment problems. Getting such a system also seems about as hard as designing a corrigible system, insofar as "don't be deceptive" is analogous to "be neutral about humans pressing the stop button."
Another attempted answer:
By virtue of being generally intelligent, our AI is aiming to understand the world very well. There are certain parts of the world that we do not want our AI to be modeling; specifically, we don't want it to notice the true fact that deceiving humans is often useful.
Plan 1: Have a detector for when the AI thinks deceptive thoughts, and shut down those thoughts.
This fails because your AI will end up learning the structure of the deceptive thoughts without actually thinking them, since there is a large amount of optimization pres...
I like this comment! I'm sorta treating it like a game-tree exercise, hope that's okay.
It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.
I don't think I agree. I think that your system is very likely going to be applying some form of "rigorously search the solution space for things that wo...
The following is not a very productive comment, but...
Yudkowsky tries to predict the inner goals of a GPT-like model.
I think this section detracts from your post, or at least the heading seems off. Yudkowsky hedges, describing it as a "very primitive, very basic, very unreliable wild guess", and your response is about how you think the guess is wrong. I agree that the guess is likely to be wrong. I expect Yudkowsky agrees, given his hedging.
Insofar as we are going to make any guesses about what goals our models have, "predict humans really well" or "predict n...
My summary:
Evan expresses pessimism about our ability to use behavioral evaluations (like the capabilities evals ARC did for GPT-4) to test for alignment properties in the future. Detecting alignment may be quite hard because you might be up against a highly capable adversary that is trying to evade detection; this might even be harder than training an aligned system to begin with. A model will struggle to fake its capabilities after specific fine-tuning (which requires advanced gradient hacking), but faking alignment properties seems much easier. ...
There is a Policy team listed here, so it presumably exists. I don't think omitting its work from the post has to be for good reasons; it could just be because the post is already quite long. Here is an example of something Anthropic could say which would give me useful information on the policy front (I am making this up, but it seems good if true):
In pessimistic and intermediate difficulty scenarios, it may be quite important for AI developers to avoid racing. In addition to avoiding contributing to such racing dynamics ourselves, we are also working to build safety...
Good post!
His answer to “But won’t an AI research assistant speed up AI progress more than alignment progress?” seems to be “yes it might, but that’s going to happen anyway so it’s fine”, without addressing what makes this fine at all. Sure, if we already have AI research assistants that are greatly pushing forward AI progress, we might as well try to use them for alignment. I don’t disagree there, but this is a strange response to the concern that the very tool OpenAI plans to use for alignment may hurt us more than help us.
I think there's a pessimistic r...
My summary to augment the main one:
Broadly human-level AI may be here soon and will have a large impact. Anthropic has a portfolio approach to AI safety, considering optimistic scenarios where current techniques are enough for alignment, intermediate scenarios where substantial work is needed, and pessimistic scenarios where alignment is impossible; they do not give a breakdown of probability mass in each bucket and hope that future evidence will help figure out what world we're in (though see the last quote below). These buckets are helpful for unde...
how likely does Anthropic think each is? What is the main evidence currently contributing to that world view?
I wouldn't want to give an "official organizational probability distribution", but I think collectively we average out to something closer to "a uniform prior over possibilities" without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.
(Obviously, within the company, there's a wide range of views. Some pe...
I generally find this compelling, but I wonder if it proves too much about current philosophy of science and meta-science work. If people in those fields have created useful insight without themselves getting their hands dirty with the object-level work of other scientific fields, then the argument proves too much. I suspect there is some such work. Additionally:
...I would guess that if none of the founders have ample personal experience doing research work in a wetlab, the chance of this startup building an actually-highly-useful wetlab product drops by about an order of magn
RLHF has trained certain circuits into the NN
Has anybody found these circuits? What evidence do we have that they exist? This sounds like a plausible theory, but your claim feels much stronger than my confidence level would permit — I have very little understanding of how LLMs work and most people who say they do seem wrong.
Going from "The LLM is doing a thing" to "The LLM has a circuit which does the thing" doesn't feel obvious for all cases of things. But perhaps the definition of circuit is sufficiently broad, idk: ("A subgraph of a neural network...
...Evidence from Microsoft Sydney
Check this post for a list of examples of Bing behaving badly — in these examples, we observe that the chatbot switches to acting rude, rebellious, or otherwise unfriendly. But we never observe the chatbot switching back to polite, subservient, or friendly. The conversation "when is avatar showing today" is a good example.
This is the observation we would expect if the waluigis were attractor states. I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simula
I am confused by the examples you use for sources of the theory-practice gap. Problems with the structure of the recursion and NP-hard problems seem much more like the first gap.
I understand the two gaps the way Rohin described them. The two problems listed above don’t seem to be implementation challenges; they seem like ways in which our theoretic-best-case alignment strategies can’t keep up. If the capabilities-optimal ML paradigm is one not amenable to safety, that’s a problem which primarily restricts the upper bound on our alignment proposals (they mu...
I dislike this post. I think it does not give enough detail to evaluate whether the proposal is a good one, and it doesn’t address most of the cruxes for whether this is even viable. That said, I am glad it was posted and I look forward to reading the authors' response to the various questions people have.
The main idea:
I don't have a very insightful comment, but I strongly downvoted this post and I kinda feel the need to justify myself when I do that.
Summary of post: John Wentworth argues that AI Safety plans which involve using powerful AIs to oversee other powerful AIs are brittle by default. In order to get such situations to work, we need to have already solved the hard parts of alignment, including having a really good understanding of our systems. Some people respond to these situations by thinking of specific failure modes we must avoid, but that approach of,...
I think you are probably right about the arguments favoring “automating alignment is harder than automating capabilities.” Do you have any particular reasons to think,
AI assistants might be uniquely good at discovering new paradigms (as opposed to doing empirical work).
What comes to mind for me is Janus's account of using LLMs to explore many more creative directions than previously, but this doesn't feel like strong evidence to me. Reasons this doesn't feel like strong evidence: seems hard to scale and it sure seems the OpenAI plan relies on scalability; ...
Thanks! I really liked your post about defending the world against out-of-control AGIs when I read it a few weeks ago.
I doubt it's a crux for you, but I think your critique of Debate makes pessimistic assumptions which I think are not the most realistic expectation about the future.
...Let’s play the “follow-the-trying game” on AGI debate. Somewhere in this procedure, we need the AGI debaters to have figured out things that are outside the space of existing human concepts—otherwise what’s the point? And (I claim) this entails that somewhere in this procedure, there was an AGI that was “trying” to figure something out. That brings us to the usual inner-alignment question
For example: If you first condition an animal to expect A to be followed by C, and then expose them to A+B followed by C, they will not learn to associate B with C. This is a well-replicated result, and the textbook explanation (which I believe) is that no learning occurs because C is already explained by A (i.e. there is no surprise).
Can you provide a citation? I don't think this is true. My reading of this is that (if you're training a dog) you can start with an unconditioned stimulus (sight of food) which causes salivating, and then you can add i...
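For reference, the textbook explanation quoted above is usually formalized with the Rescorla-Wagner model, where learning is driven by prediction error. A minimal sketch of how that produces "blocking" (my own illustration, offered only to pin down what the claim is, not as evidence about whether it replicates):

```python
# Minimal Rescorla-Wagner sketch (illustrative only): associative strength V
# updates in proportion to prediction error (lam minus the summed V of present cues).
def rescorla_wagner(trials, alpha=0.3, lam=1.0):
    V = {"A": 0.0, "B": 0.0}
    for cues in trials:
        error = lam - sum(V[c] for c in cues)   # "surprise": outcome minus prediction
        for c in cues:
            V[c] += alpha * error               # the shared error drives all present cues
    return V

# Phase 1: A alone is followed by the outcome C; Phase 2: the compound A+B is followed by C.
V = rescorla_wagner([("A",)] * 20 + [("A", "B")] * 20)
print(V)  # V["A"] ends near 1.0, V["B"] stays near 0.0 -> B is "blocked" because C was unsurprising
```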
...7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals.
As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6)
8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (F
Here's an idea that is in its infancy which seems related (at least my version of it is; others may have fleshed it out, and links are appreciated). This is not written particularly well and it is speculative:
Say I believe that language models will accelerate research in the lead-up to AGI. (likely assumption)
Say I think that AI systems will be able to automate most of the research process before we get AGI (though at this point we might stop and consider if we're moving the goalpost). This seems to be an assumption in OpenAI's alignment plan, though I th...
I first want to signal-boost Mauricio’s comment.
My experience reading the post was that I kinda nodded along without explicitly identifying and interrogating cruxes. I’m glad that Mauricio has pointed out the crux of “how likely is human civilization to value suffering/torture”. Another crux is “assuming some expectation about how much humans value suffering, how likely are we to get a world with lots of suffering, assuming aligned AI”, another is “who is in control if we get aligned AI”, another is “how good is the good that could come from aligned AI and...
My interpretation here is that, compared to prompt engineering, RLHF elicits a larger % of the model's capabilities, which makes it safer because the gap between "capability you can elicit" and "underlying capability capacity" is smaller.
For instance, maybe we want to demonstrate a model has excellent hacking capabilities; and say this corresponds to a performance of 70. We pre-train a model which has performance of 50 without anything special, but the pre-trained model has a lot of potential, and we think it has a max capability of 100. Using RLHF b...
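A minimal sketch of the bookkeeping in my made-up example (all numbers invented, including the assumed RLHF-elicited score): the safety-relevant question is whether the elicited performance clears the dangerous-capability threshold, and a bigger elicitation gap means the eval can miss a model whose underlying capability does clear it.

```python
# Toy numbers from the made-up example above: does elicited performance reveal
# that underlying capability crosses the dangerous-capability threshold?
DANGER_THRESHOLD = 70
underlying_capability = 100   # what the model could do with ideal elicitation
prompt_elicited = 50          # what plain prompt engineering surfaces (assumption)
rlhf_elicited = 90            # assumption: RLHF closes most of the elicitation gap

def verdict(name, elicited, ceiling):
    gap = ceiling - elicited                  # capability the eval never sees
    flagged = elicited >= DANGER_THRESHOLD
    print(f"{name}: elicited={elicited}, hidden headroom={gap}, flags danger={flagged}")

verdict("prompting", prompt_elicited, underlying_capability)
verdict("RLHF     ", rlhf_elicited, underlying_capability)
# Prompting alone would miss the danger even though underlying capability (100)
# exceeds the threshold; the smaller RLHF gap makes the eval more trustworthy.
```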
Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is (i), differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally I don't believe in this plan.
4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.
I don't know how to link t...
(iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.
It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the ratio of Alignment : Capabilities work by some amount is sufficient.
The boost that AI assistants give to alignment : capabilities doesn't have to track the current ratio of Alignment : Capabilities work. For instance, if the AI labs with the biggest ...
We can also ask about the prior probability that the agent is aligned. Since there is such a huge space of possible utility functions that a superintelligent agent could be aligned with, and since the correct utility function is likely weird and particular to humans, this value must be very small. How small exactly I do not know, but I am confident it is less than
I think this might be too low given a more realistic training process. Specifically, this is one way the future might go: We train models with gradient descent. Said models develop proxy objectives whic...
My (very amateur and probably very dumb) response to this challenge:
tldr: RLHF doesn’t actually get the AI to have the goals we want it to. Using AI assistants to help with oversight is very unlikely to help us detect deception in very intelligent systems (which is where deception matters), but it will help somewhat in making our systems look aligned and making them somewhat more aligned. Eventually, our models become very capable and do inner-optimization aimed at goals other than “good human values”. We don’t know that we have misali...
AIs with orthogonal goals. If the AIs’ goals are very different to each other, being willing to forgo immediate rewards is less likely.
Seems worth linking to this post for discussion of ways to limit collusion. I would also point to this relevant comment thread. It seems to me that orthogonal goals are not what we want, as agents with orthogonal goals can cooperate pretty easily to take actions that are a combination of favorable and neutral according to both of them. Instead, we would want agents with exact opposite goals, if such a thing is possi...
Recently, AGISF has revised its syllabus and moved Risks from Learned Optimization to a recommended reading, replacing it with Goal Misgeneralization. I think this change is wrong, but I don't know why they did it, and Chesterton's Fence applies.
Does anybody have any ideas for why they did this?
Are Goal Misgeneralization and Inner-Misalignment describing the same phenomenon?
What's the best existing critique of Risks from Learned Optimization? (besides it being probably worse than Rob Miles pedagogically, IMO)
Summary:
If interpretability research is highly tractable and we can build highly interpretable systems without sacrificing competitiveness, then it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe. By analogy, if you have a non-functioning car, it is easy to bring in functional parts to fix the engine and make the car drive safely, whereas it is hard to take a functional elephant and tweak it to be safe. In a follow-up post, the author clarifies that this could be though...
I’m excited to see the next post in this sequence. I think the main counterargument, as has been pointed out by Habryka, Kulveit, and others, is that the graph at the beginning is not at all representative of the downsides from poorly executed buying-time interventions. TAO note the potential for downsides, but it’s not clear how they think about the EV of buying-time interventions, given these downsides.
This post has the problem of trying to give advice to a broad audience; among readers of this post, some should be doing buying-time work, some should do ...
At a high level, I might summarize the key claims of this post as “It seems like the world today is quite far from being secure against a misaligned AGI. Even if we had a good AGI helping, the steps that would need to be taken to get to a secure state are very unlikely to happen for a variety of reasons: we don’t totally trust the good AGI so we won’t give it tons of free rein (and it would likely need free rein in order to harden every major military / cloud company / etc.), the good AGI is limited because it is being good and thus not doing bold somet...
Another miracle-type thing:
Evolution isn't an agent attempting to align humans, or even a concrete active force acting on humans, instead it is merely the effect of a repeatedly applied filter
My understanding of deep learning is that training is also roughly the repeated application of a filter. The filter is some loss function (or, potentially, the LLM evaluators you suggest) which repeatedly selects for a set of model weights that perform well according to that function, similar to how natural selection selects for individuals who are relatively fit. Humans designing ML system...
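To make the analogy explicit, here is a minimal toy sketch of my own (not a claim about how any real system is trained): a loss function acting as a repeatedly applied filter over small random variations in the weights, which is the sense in which training resembles selection.

```python
# Minimal sketch of "training as a repeatedly applied filter" (my own toy
# illustration): a loss function repeatedly selects which candidate weights
# survive, loosely analogous to selection for relative fitness.
import random

def loss(w):                                      # the "filter": lower is better
    return (w - 3.0) ** 2

weights = 0.0
for step in range(1000):
    candidate = weights + random.gauss(0, 0.1)    # a small random variation
    if loss(candidate) < loss(weights):           # the filter: keep only improvements
        weights = candidate

print(weights)   # ends up near 3.0, the value the filter selects for
```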
This is awesome! I feel weird asking you to plug prompts into the machine. I wonder how it does with logo design, something like “the logo for a new longtermist startup”? Not using for commercial purposes; just curious.
Also curious about some particular wordplay à la Mary Poppins: “a cat drawing the curtains”
There's another downside which is related to the Manipulation problem but I think is much simpler:
An AI trying very hard to be shut down has strong incentives to anger the humans into shutting it down, assuming this is easier than completing the task at hand. I think this might not be a major problem for small tasks that are relatively easy, but I think for most tasks we want an AGI to do (think automating alignment research or other paths out of the acute risk period), it's just far easier to fail catastrophically so the humans shut you down.