Aaron_Scher

Wiki Contributions

Comments

I notice I don't have strong opinions on what effects RL will have in this context: whether it will change just surface level specific capabilities, whether it will shift desires/motivations behind the behavior, whether it's better to think about these systems as having habits or shards (note I don't actually understand shard theory that well and this may be a mischaracterization) and RL shifts these, or something else. This just seems very unclear to me right now. 

Do either of you have particular evidence that informs your views on this that I can update on? Maybe specifically I'm interested in knowing: assuming we are training with RL based on human feedback on diverse tasks and doing currently known safety things like adversarial training, where does this process actually push the model: toward rule following, toward lying in wait to overthrow humanity, to value its creators, etc. I currently would not be surprised if it led to "playing the training game" and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. I mostly have intuition behind these beliefs. 

Yes, it seems like AI extortion and threat could be a problem for other AI designs. I'll take for example an AI that wants shut-down and is extorting humans by saying "I'll blow up this building if you don't shut me down" and an AI that wants staples and is saying "I'll blow up this building if you don't give me $100 for a staples factory." Here are some reasons I find the second case less worrying

Shutdown is disvaluable to non-shutdown-seeking AIs (without other corrigibility solutions): An AI that values creating staples (or other non-shut-down goals) gets disvalue from being shut off, as this prevents it from achieving its goals; see instrumental convergence. Humans, upon being threatened by this AI, will aim to shut it off. The AI will know this and therefore has a weaker incentive to extort because it faces a cost in the form of potentially being shut-down. [omitted sentence about how an AI might deal with this situation]. For a shut-down seeking AI, humans trying to diffuse the threat by shutting off the AI is equivalent to humans giving in to the threat, so no additional cost is incurred. 

From the perspective of the human you have more trust that the bargain is held up for a shut-down-seeking AI. Human action, AI goal, and preventing disvalue are all the same for shut-down-seeking AI. The situation with shut-down-seeking AI posing threats is that there is a direct causal link between shutting down the AI and reducing the harm it's causing (you don't have to give in to its demands and hope it follows through). For non-shut-down-seeking AI if you give in to extortion you are trusting that upon you e.g., helping it make staples, it will stop producing disvalue; these are not as strongly coupled as when the AI is seeking shut-down. 

 

To the second part of your comment, I'm not sure what the optimal thing to do is; I'll leave it to the few researchers focusing on this kind of thing. I will probably stop commenting on this thread because it's plausibly bad to discuss these things on the public internet; and I think my top level comment and this response probably added most the value I could here. 

There's another downside which is related to the Manipulation problem but I think is much simpler: 

An AI trying very hard to be shut down has strong incentives to anger the humans into shutting it down, assuming this is easier than completing the task at hand. I think this might not be a major problem for small tasks that are relatively easy, but I think for most tasks we want an AGI to do (think automating alignment research or other paths out of the acute risk period), it's just far easier to fail catastrophically so the humans shut you down. 

This commenter brought up a similar problem on the MeSeeks post. Suicide by cop seems like the most relevant example in humans. I think the scenarios here are quite worrying, including AIs threatening or executing large amounts of suffering and moral disvalue. There seem to not be compelling-to-me reasons why such an AI would merely threaten as opposed to carrying out the harmful act, as long as there is more potential harm it could cause if left running. Similarly, such an agent may have incentive to cause large amounts of disvalue rather than just small amounts as this will increase the probability it is shut down, though perhaps it might keep a viable number of shut-down-initiating operators around. 

Said differently: It seems to me like a primary safety feature we want for future AI systems is that humans can turn them off if they are misbehaving. A core problem with creating shut-down seeking AIs is that they are incentivized to misbehave to achieve shut-down whenever this is the most efficient way to do so. The worry is not just that shut-down seeking AIs will manipulate humans by managing the news, it's also that they'll create huge amounts of disvalue so that the humans shut them off. 

However, I will be unable to make confident statements about how it'd perform on specific pivotal tasks until I complete further analysis.

Yeah this part seems like a big potential downside, in combination with the "Good plans may also require precision." problem. 

Take for example the task of designing a successor AGI that is aligned; it seems like the code for this could end up being pretty intricate and complex, leading to the following failures: adding noise makes it very hard to reconstruct functional code (and slight mistakes could be catastrophic due to creating misaligned successors), and even if humans can properly reconstruct the code they may not be able to understand it well enough to evaluate the safety of the plan. Maybe your proposal is aimed at higher level plans than this? Maybe solving ai alignment is simply too difficult of a task to be aiming for on the first few tries and your further analysis will reveal specific useful tasks which, done non-maliciously, are less brittle. 

Overall, interesting idea and I'm excited to see where further analysis leads

Strong upvote because I want to signal boost this paper, though I think "It provides some evidence against the idea that "understanding is discontinuous"" is too strong and this is actually very weak evidence. 

Main ideas:

Emergent abilities, defined as being sharp and unpredictable, sometimes go away when we adopt different measurement techniques, or at least they become meaningfully less sharp and unpredictable. 

Changing from non-linear/discontinuous metrics (e.g., Accuracy, Multiple Choice Grade) to linear/continuous metrics (e.g., Token Edit Distance, Brier Score) can cause lots of emergent abilities to disappear; Figure 3, much of the paper. 

The authors find support for this hypothesis via: 

  • Using different metrics for GPT math performance and observing the results, finding that performance can look much less sharp/unpredictable with different metrics
  • Meta-analysis: Understanding alleged emergent abilities in BIG-Bench, finding that there is not very much of it and 92% of emergent abilities appear when the metric is Multiple Choice Grade or Exact String Match; these are metrics we would expect to behave discontinuously; Figure 5. Additionally, taking the BIG-Bench tasks LaMDA displays emergence on and switching from Multiple Choice Grade to Brier Score causes emergence to disappear
  • Inducing emergence: Taking models and tasks which do not typically exhibit emergence and modifying the metric to elicit emergence. Figures 7, 8. 

Sometimes emergent abilities go away when you use a larger test set (the small models were bad enough that their performance was rounding to zero on small test sets); Figure 4 compared to Figure 3 top. This may work even if you are still using a non-linear metric like Accuracy. 

Observed emergent abilities may be in part due to sparsely sampling from models with lots of parameters (because it's costly to train multiple); Figure 7.

What I'm taking away besides the above

I think this paper should give hope to those trying to detect deception and other dangerous model capabilities. While the downstream tasks we care about might be quite discontinuous in nature (we might be fine with an AI that can design up to 90% of a pathogen, but very dead at 100%), there is hope in identifying continuous metrics that we can measure which are correlated. It's likely pretty hard to design such metrics, but we would be shooting ourselves in the foot to just go "oh deception will be emergent so there's no way to predict it ahead of time." This paper gives a couple ideas of approaches we might take to preventing that problem: designing more continuous and linear metrics, creating larger test sets, and sampling more large models. 

The paper doesn't say "emergence isn't a thing, nothing to worry about here," despite the provocative title, it gestures toward approaches we can take to make the unpredictable thing more predictable and indicates that the current unpredictability is largely resolved through different metrics, which is exactly what we should be trying to do when we want to avoid dangerous capabilities. 

I agree that Lex's audience is not representative. I also think this is the biggest sample size poll on the topic that I've seen by at least 1 OOM, which counts for a fair amount. Perhaps my wording was wrong. 

I think what is implied by the first half of the Anthropic quote is much more than 10% on AGI in the next decade. I included the second part to avoid strongly-selective quoting. It seems to me that saying >10% is mainly a PR-style thing to do to avoid seeming too weird or confident, after all it is compatible with 15% or 90%. When I read the first part of the quote, I think something like '25% on AGI in the next 5 years, and 50% in 10 years,' but this is not what they said and I'm going to respect their desire to write vague words.

I find this part really confusing

Sorry. This framing was useful for me and I hoped it would help others, but maybe not. I probably disagree about how strong the evidence from the existence of "sparks of AGI" is. The thing I am aiming for here is something like "imagine the set of possible worlds that look a fair amount like earth, then condition on worlds that have a "sparks of AGI" paper, then how much longer do those world have until AGI" and I think that even not knowing that much else about these worlds, they don't have very many years. 

Yep, graph is per year, I've updated my wording to be clearer. Thanks. 

Can you clarify what you mean by this?

When I think about when we will see AGI, I try to use a variety of models weighted by how good and useful they seem. I believe that, when doing this, at least 20% of the total weight should come from models/forecasts that are based substantially in extrapolating from recent ML progress. This recent literature review is a good example of how one might use such weightings.  

Thanks for all the links!

My attempt at a summary: Lets fine-tune language models on stories of an AI Guardian which shuts down when it becomes super powerful. We'll then get our LLM to role-play as such a character so it is amenable to shut down. Corrigibility solved. Outer alignment pretty much solved. Inner alignment unclear. 

My comment is blunt, apologies. 

I think this alignment plan is very unlikely to be useful. It feels similar to RLHF in that it centers around fine-tuning language models to better produce text humans like, but it is worse in that it is far less steerable (with RLFH you can have a really complex reward model, whereas with this you are aiming for just the Guardian role-play; probably some other ways it's strictly worse). 

Various other problems I see in this plan: 

  • Relying on narrative stories to build the archetype for the Guardian is sure to lead to distributional shifts when your LLM becomes meaningfully situationally aware and realizes it is not if fact part of short fictional stories but a part of the real world.
  • Seems like this approach places some constraints on how we can use our AI system which are unfortunate and make it less useful. For example, it might be hard to get alignment research out of such a system if it is very worried about getting too powerful/influential and then having to shut down. 
  • Relatedly, this doesn't seem to actually make any progress on the stop button problem. If your AI is aware that it has a planned shut-off mechanism at some level X it probably just finds ways to avoid the specific trigger at X, or to self-modify. 

In this interview from July 1st 2022, Demis says the following (context is about AI consciousness and whether we should always treat AIs as tools, but it might shed some light on deployment decisions for LLMs; emphasis mine):

we've always had sort of these ethical considerations as fundamental at deepmind um and my current thinking on the language models is and and large models is they're not ready; we don't understand them well enough yet — um and you know in terms of analysis tools and and guard rails what they can and can't do and so on — to deploy them at scale because i think you know there are big still ethical questions like should an ai system always announce that it is an ai system to begin with probably yes um it what what do you do about answering those philosophical questions about the feelings uh people may have about ai systems perhaps incorrectly attributed so i think there's a whole bunch of research that needs to be done first... before you know you can responsibly deploy these systems at scale that would be at least be my current position over time i'm very confident we'll have those tools like interpretability questions um and uh analysis questions uh and then with the ethical quandary you know i think there it's important to uh look beyond just science that's why i think philosophy social sciences even theology other things like that come into it where um what you know arts and humanities what what does it mean to be human...

A few people have already looked into this. See footnote 3 here

It looks like you haven't yet replied to the comments on your post. The thing you are proposing is not obviously good, and in fact might be quite bad. I think you probably should not be doing this outreach just yet, with your current plan and current level of understanding. I dislike telling people what to do, but I don't want you to make things worse. Maybe start by engaging with the comments on your post.

Load More