Thanks for clarifying! 

I expect such a system would run into subsystem alignment problems. Getting such a system also seems about as hard as designing a corrigible system, insofar as "don't be deceptive" is analogous to "be neutral about humans pressing stop button."

Another attempted answer: 

By virtue of being generally intelligent, our AI is aiming to understand the world very well. There are certain parts of the world that we do not want our AI to be modeling; specifically, we don't want it to think about the true fact that deceiving humans is often useful.

Plan 1: Have a detector for when the AI thinks deceptive thoughts, and shut down those thoughts.

Fails because your AI will end up learning the structure of the deceptive thoughts without actually thinking them: there is a large amount of optimization pressure being applied to solving the problem, and this is the best way to solve it.

Plan 2 (your comment): Have a meta-level detector that is constantly asking if a given cognitive process is deceptive or likely to lead to deceptive thoughts; and this detector tries really hard to answer right. 

Fails because you don't have a nice meta-level where you can apply this detector. The same cognitive level that is responsible for finding deceptive strategies is also a core part of being generally intelligent; the deception fact is a fact about the world, the same world which your AI is aiming to understand well. The search process that finds deceptive strategies is the same search process which learns biology. So at this point, to the extent that you just want to block the deceptive strategy thoughts, you're back in the place you started, where now your model is doing a whole bunch of learning about the world but there's a huge hole in its understanding (and permitted search) around "what's going on in human minds" because you blocked off this search space.

Maybe this kinda works or something, but I doubt it. We've pretty much bumped into STEM AI.

Unfortunately, I don't think I engaged with the best version of the thing you're suggesting in your comment. 

I like this comment! I'm sorta treating it like a game-tree exercise, hope that's okay. 

It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.

I don't think I agree. I think that your system is very likely going to be applying some form of "rigorously search the solution space for things that work to solve this problem" for all the problems it encounters. If it's not doing that, then I'm not sure it counts as a powerful intelligence. If it is doing that, it is going to frequently bump into [what I'm understanding to be a key problem discussed in this post] the fact about the world that the best way to solve many of the problems is to deceive the human operator. 

Said another way, doing the generally intelligent thing of searching solution space for ways to solve problems is going to bump into deceptive thoughts. The meta-level classifier is trying to prevent cognitive plans that lead to deceptive thoughts, but the cognitive plans that work in the real world (e.g., that allow you to create new scientific fields) are plans which wouldn't pass this check. 

If your meta-level classifier does not allow such cognition, then your system will be quite incapable. The obvious next solution is to set your classification bar for "cognitive plan likely to lead to deceptive thoughts" a bit lower, zooming in to maybe "rigorously search the solution space for things that work to solve this problem; and involve reasoning about humans". This has performance problems, of course, preventing you from e.g., using your AI for most alignment research. This is also just the same problem we were dealing with originally, where now we're worried about our AI engaging in search-over-plan strategies which aren't directly about human minds but somehow still model that part of reality in an effective (i.e., deceptive) way. 

The following is not a very productive comment, but...

Yudkowsky tries to predict the inner goals of a GPT-like model.

I think this section detracts from your post, or at least the heading seems off. Yudkowsky hedges as making a "very primitive, very basic, very unreliable wild guess" and your response is about how you think the guess is wrong. I agree that the guess is likely to be wrong. I expect Yudkowsky agrees, given his hedging. 

Insofar as we are going to make any guesses about what goals our models have, "predict humans really well" or "predict next tokens really well" seem somewhat reasonable. Or at least these seem as reasonable as the goals many people [who are new to hearing about alignment] expect by default, like "make the human happy." If you have reasons to think that the prediction goals are particularly unlikely, I would love to hear them!

That said, I think there continues to be important work in clarifying that, as you put it, "I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function." Or as others have written, Reward is not the Optimization Target, and Models Don't "Get Reward".

Thanks for the correction. I edited my original comment to reflect it.

My summary:

Evan expresses pessimism about our ability to use behavioral evaluations (like the capabilities evals ARC did for GPT-4) to test for alignment properties in the future. Detecting alignment may be quite hard because you might be up against a highly capable adversary that is trying to evade detection; this might even be harder than training an aligned system to begin with. A model will struggle to fake its capabilities after specific fine-tuning (doing so requires advanced gradient hacking), but faking alignment properties seems much easier.

Because behavioral evals struggle to detect deception, widespread adoption by AI labs of a safety framework built on behavioral evaluations from external auditors (without access to checkpoints/transparency) might make it very hard to detect deception in the future.

Evan suggests pairing capabilities-evaluations with understanding-based evals. These understanding-evals would focus on how well we can predict models’ behavior, but of course it’s hard to define what this means. [edited in response to Evan's comment:] These understanding-evals would focus on how well developers understand the models they're creating, but of course it’s hard to define what it means to have such an understanding. "Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer's ability to understand what sort of model they got and why they got it." Desiderata for these evals include being method-agnostic and having a high enough level of understanding to detect dangerous failures. Potential strategies to build off: causal scrubbing, auditing games, prediction-based evaluation, fixing behavior via removing data from training rather than adding it; Evan talks about ways in which each of these is insufficient currently.

At first I was confused about why faking capabilities will be harder than faking alignment. After thinking, here’s my explanation:

When we fine-tune on specific capabilities, we will have gradients pushing toward the model improving on these capabilities (unless your model is gradient hacking). These gradient updates are likely to lead to eliciting all of the possible capabilities of your model (with enough fine-tuning). There is a capabilities ground truth being trained against. A deceptively aligned model may try to hide its maximum capabilities, but there are strong gradients pushing against this unless it is gradient hacking.

On the other hand, faking alignment is much easier. When you’re a deceptively aligned model acting aligned during training, there’s no training signal smacking you in a different direction; you’re doing great on those alignment benchmarks. Being deceptively aligned implies that your model is instrumentally pursuing the training objective, so gradients don't affect its mesa-objective. Beyond this “you’re doing just as well as an aligned model” problem, we also don’t have a reliable ground truth alignment measure to even be doing this fine-tuning on, like we do for capabilities.

Thanks, Gabe, for this simple framing: Acting aligned gets low loss, whereas acting dumb (incapable) gets high loss, in the fine-tuning.
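To make that framing concrete, here is a toy sketch. It is my own illustration, not anything from Evan's post. It treats "does the model produce the behavior being trained for" as a single hypothetical logit and uses plain cross-entropy. The starting logits, the learning rate, and the one-parameter "model" are all assumptions chosen for illustration, and the toy says nothing about mesa-objectives; it only shows how much gradient each behavior attracts.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def loss_and_grad(logit: float) -> tuple[float, float]:
    """Cross-entropy loss -log(sigmoid(logit)) and its derivative w.r.t. the logit."""
    p = sigmoid(logit)
    return -math.log(p), p - 1.0


def fine_tune(logit: float, steps: int, lr: float = 1.0) -> float:
    """Plain gradient descent on the single logit."""
    for _ in range(steps):
        _, grad = loss_and_grad(logit)
        logit -= lr * grad
    return logit


# Hypothetical starting points (assumptions, not measurements):
SANDBAGGING_LOGIT = -2.0    # model "acts dumb" on a capability it actually has
ACTING_ALIGNED_LOGIT = 6.0  # deceptively aligned model already behaves as desired

cap_loss, cap_grad = loss_and_grad(SANDBAGGING_LOGIT)
align_loss, align_grad = loss_and_grad(ACTING_ALIGNED_LOGIT)

# Acting dumb: high loss and a large gradient pushing the behavior to change.
# Acting aligned: near-zero loss and near-zero gradient, so little gets updated.
sandbagging_after = fine_tune(SANDBAGGING_LOGIT, steps=20)
aligned_after = fine_tune(ACTING_ALIGNED_LOGIT, steps=20)

print(f"capability: loss={cap_loss:.3f} grad={cap_grad:.3f} -> logit {sandbagging_after:.2f}")
print(f"alignment:  loss={align_loss:.3f} grad={align_grad:.3f} -> logit {aligned_after:.2f}")
```

The sandbagging logit moves a lot under fine-tuning while the acting-aligned logit barely moves, which is the asymmetry described above. The second half of the argument (no reliable ground truth alignment measure to fine-tune on) is not captured by this sketch at all.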

  • I found it hard to engage with this, partially because of motivated reasoning and thinking that my prior beliefs, which expect very intelligent and coherent AGIs, are correct. Overcoming this bias is hard and I sometimes benefit from clean presentation of solid experimental results, which this study lacked, making it extra hard for me to engage with. Below are some of my messy thoughts, with the huge caveat that I have yet to engage with these ideas from as neutral a prior as I would like.
  • This is an interesting study to conduct. I don’t think its results, regardless of what they are, should update anybody much because:
    • The study is asking a small set (n = 5-6) of ~random people to rank various entities based on vague definitions of intelligence and coherence; we shouldn’t expect a process like this to provide strong evidence for any conclusion
    • Asking people to rate items on coherence and intelligence feels pretty different from carefully thinking about the properties of each item. I would rather see the author pick a few items from around the spectrums and analyze each in depth (though if that’s all they did I would be complaining about a lack of more items lol)
    • A priori I don’t necessarily expect a strong intelligence-coherence correlation for lower levels of intelligence, and I don’t think findings of this type are very useful for thinking about super-intelligence. Convergent instrumental subgoals are not a thing for sufficiently unintelligent agents, and I expect coherence as such a subgoal to kick in at fairly high intelligence levels (definitely above average human, at least based on observing many humans be totally incoherent, which isn’t exactly a priori reasoning). I dislike that this is the state of my belief, because it’s pretty much like “no experiment you can run right now would get at the crux of my beliefs,” but I do think it’s the case here that we can only learn so much from observing non-superintelligent systems.
    • The terms, especially “coherence”, seem pretty poorly defined in a way that really hurts the usefulness of the study, as some commenters have pointed out
  • My takes about the results
    • Yep the main effect seems to be a negative relationship between intelligence and coherence
    • The poor inter-rater reliability for coherence seems like a big deal
    • Figure 6 (most important image) seems really wacky. It seems to imply that all the AI models are ranked on average lower in intelligence than everything else, including oak tree and ant. This just seems slightly wild because I think most people who interact with AI would disagree with such rankings. Out of 5 respondents, only 2 ranked GPT-3 as more intelligent than an oak tree.
    • Isolating just the humans (a plot for this is not shown for some reason) seems like it doesn’t support the author’s hypothesis very much. I think this is in line with some prediction like “for low levels of intelligence there is not a strong relationship to coherence, and then as you get in the high human level this changes”
  • Other thoughts
    • The spreadsheet sheets are labeled “alignment subject response” and “intelligence subject response”. Alignment is definitely not the same as coherence. I mostly trust that the author isn’t pulling a fast one on us by fudging the original purpose of the study or the directions they gave to participants, but my prior trust for people on the internet is somewhat low and the experimental design here does seem pretty janky.
    • Figures 3-5, as far as I can tell, are only using the rank compared to other items in the graph, as opposed to all 60. This threw me off for a bit and I think might not be a great analysis, given that participants ranked all 60 together rather than in category batches.
    • Something something inner optimizers and inner agent conflicts might imply a lack of coherence in superintelligent systems, but systems like this still seem quite dangerous.

There is a Policy team listed here, so it presumably exists. I don't think omitting its work from the post has to be for good reasons; it could just be because the post is already quite long. Here is an example of something Anthropic could say which would give me useful information on the policy front (I am making this up, but it seems good if true):

In pessimistic and intermediate difficulty scenarios, it may be quite important for AI developers to avoid racing. In addition to avoiding contributing to such racing dynamics ourselves, we are also working to build safety-collaborations among researchers at leading AI safety organizations. If an AI lab finds compelling evidence about dangerous systems, it is paramount that such evidence is disseminated to relevant actors in industry and government. We are building relationships and secure information sharing systems between major AI developers and working with regulators to remain in compliance with relevant laws (e.g., anti-trust). 

Again, I have no idea what the policy team is doing, but they could plausibly be doing something like this and could say so, while there may be some things they don't want to talk about.

Good post!

His answer to “But won’t an AI research assistant speed up AI progress more than alignment progress?” seems to be “yes it might, but that’s going to happen anyway so it’s fine”, without addressing what makes this fine at all. Sure, if we already have AI research assistants that are greatly pushing forward AI progress, we might as well try to use them for alignment. I don’t disagree there, but this is a strange response to the concern that the very tool OpenAI plans to use for alignment may hurt us more than help us.

I think there's a pessimistic reading of the situation which explains some of this evidence well. I don't literally believe it but I think it may be useful. It goes as follows:

The internal culture at OpenAI regarding alignment is bleak: most people are not very concerned with alignment and are pretty excited about making their powerful AI systems. This train is already moving with a lot of momentum. The alignment team doesn't have very much sway about the speed of the train or the overall direction. However, there is hope in the following place: you can try to develop systems that allow OpenAI to easily produce more alignment research down the line by figuring out how to automate alignment research. Then, if company culture shifts (e.g., because of warning shots), the option now exists to automate lots of alignment work quickly. 

To be clear, I think this is a bad plan, but it might be relatively good given the situation, if the situation is so bleak. And if you think alignment isn't that hard, then it's plausibly a fine plan. If you're a random alignment researcher bringing your skills to bear on the alignment problem, this is far from the first plan I would go with, but if you are one of the few people with a seat on the OpenAI train, this might be among your best shots at success (though I think people in such a position should be quite pessimistic and openly so). 

I think this view does a good job explaining the following quote from Jan:

Once we reach a significant degree of automation, we can much more easily reallocate GPUs between alignment and capability research. In particular, whenever our alignment techniques are inadequate, we can spend more compute on improving them. Additional resources are much easier to request than requesting that other people stop doing something they are excited about but that our alignment techniques are inadequate for.

My summary to augment the main one:

Broadly human-level AI may be here soon and will have a large impact. Anthropic has a portfolio approach to AI safety, considering three kinds of scenarios: optimistic scenarios where current techniques are enough for alignment, intermediate scenarios where substantial work is needed, and pessimistic scenarios where alignment is impossible; they do not give a breakdown of probability mass in each bucket and hope that future evidence will help figure out what world we're in (though see the last quote below). These buckets are helpful for understanding the goal of developing: better techniques for making AI systems safer, and better ways of identifying how safe or unsafe AI systems are. Scaling systems is required for some good safety research: some problems only arise near human level; Debate and Constitutional AI need big models; we need to understand scaling to understand future risks; and if models are dangerous, compelling evidence will be needed.

They do three kinds of research: Capabilities which they don’t publish, Alignment Capabilities which seems mostly about improving chat bots and applying oversight techniques at scale, and Alignment Science which involves interpretability and red-teaming of the approaches developed in Alignment Capabilities. They broadly take an empirical approach to safety, and current research directions include: scaling supervision, mechanistic interpretability, process-oriented learning, testing for dangerous failure modes, evaluating societal impacts, and understanding and evaluating how AI systems learn and generalize.

Select quotes:

  • “Over the next 5 years we might expect around a 1000x increase in the computation used to train the largest models, based on trends in compute cost and spending. If the scaling laws hold, this would result in a capability jump that is significantly larger than the jump from GPT-2 to GPT-3 (or GPT-3 to Claude). At Anthropic, we’re deeply familiar with the capabilities of these systems and a jump that is this much larger feels to many of us like it could result in human-level performance across most tasks.”
  • The facts “jointly support a greater than 10% likelihood that we will develop broadly human-level AI systems within the next decade”
  • “In the near future, we also plan to make externally legible commitments to only develop models beyond a certain capability threshold if safety standards can be met, and to allow an independent, external organization to evaluate both our model’s capabilities and safety.”
  • “It's worth noting that the most pessimistic scenarios might look like optimistic scenarios up until very powerful AI systems are created. Taking pessimistic scenarios seriously requires humility and caution in evaluating evidence that systems are safe.”

I'll note that I'm confused about the Optimistic, Intermediate, and Pessimistic scenarios: how likely does Anthropic think each is? What is the main evidence currently contributing to that world view? How are you actually preparing for near-pessimistic scenarios which "could instead involve channeling our collective efforts towards AI safety research and halting AI progress in the meantime?"
