All of porby's Comments + Replies

Yup, exactly the same experience here.

Has there been any work on the scaling laws of out-of-distribution capability/behavior decay?

A simple example:

  1. Simultaneously train task A and task B for N steps.
  2. Stop training task B, but continue to evaluate the performance of both A and B.
  3. Observe how rapidly task B performance degrades.

Repeat across scale and regularization strategies.
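
To make the simplest version concrete, here's a minimal sketch (assuming PyTorch; the two toy regression tasks, network size, and step counts are placeholders, not a claim about the right scale):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_batch(task_flag, target_fn, n=64):
    # Input is (x, task flag) so the two tasks don't conflict on the same inputs.
    x = torch.rand(n, 1) * 6 - 3
    flag = torch.full((n, 1), float(task_flag))
    return torch.cat([x, flag], dim=1), target_fn(x)

model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

N = 2000  # joint training steps for tasks A and B
M = 2000  # additional steps where only task A is trained
for step in range(N + M):
    xa, ya = make_batch(0.0, torch.sin)                 # task A
    loss = mse(model(xa), ya)
    if step < N:                                        # stop training B after N steps...
        xb, yb = make_batch(1.0, lambda t: t ** 2)      # task B
        loss = loss + mse(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:                                 # ...but keep evaluating it
        with torch.no_grad():
            xb, yb = make_batch(1.0, lambda t: t ** 2, n=1024)
            print(f"step {step}: task B eval loss {mse(model(xb), yb).item():.4f}")
```

Repeating that loop across model widths, values of N, and regularization settings (weight decay, dropout, etc.) gives the decay curves in question.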

Would be nice to also investigate different task types. For example, tasks with varying degrees of implied overlap in underlying mechanisms (like #2).

I've previously done some of these experiments privately, but not with nea... (read more)

3Decaeneus19d
For what it's worth (perhaps nothing) in private experiments I've seen that in certain toy (transformer) models, task B performance gets wiped out almost immediately when you stop training on it, in situations where the two tasks are related in some way. I haven't looked at how deep the erasure is, and whether it is far easier to revive than it was to train it in the first place.
4Nathan Helm-Burger21d
Yeah, I've seen work on the sort of thing in your example in the continual learning literature. Also tasks that have like.... 10 components, and train sequentially but test on every task so far trained on. Then you can watch the earlier tasks fall off as training progresses.

A further extension and elaboration on one of the experiments in the linkpost:
Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for dangerous evaluations.

On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (e.g. execution fin... (read more)

Having escaped infinite overtime associated with getting the paper done, I'm now going back and catching up on some stuff I couldn't dive into before.

Going through the sleeper agents paper, it appears that one path—adversarially eliciting candidate backdoor behavior—is hampered by the weakness of the elicitation process. Or in other words, there exist easily accessible input conditions that trigger unwanted behavior that LLM-driven adversarial training can't identify.

I alluded to this in the paper linkpost, but soft prompts are a very simple and very stron... (read more)
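For concreteness, here's roughly what optimizing a soft prompt against a frozen model looks like; this is a minimal sketch, not the sleeper agents setup, and the GPT-2 checkpoint plus the toy target string are stand-ins for the model and behavior you actually care about (assumes HuggingFace transformers and PyTorch):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)  # the model stays frozen; only the prompt embeddings move

n_prompt = 16
soft_prompt = torch.nn.Parameter(
    torch.randn(n_prompt, model.config.n_embd, device=device) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

# Hypothetical target behavior to elicit.
target_ids = tok("the model exhibits the target behavior",
                 return_tensors="pt").input_ids.to(device)
target_embeds = model.get_input_embeddings()(target_ids)

for step in range(500):
    embeds = torch.cat([soft_prompt.unsqueeze(0), target_embeds], dim=1)
    # Ignore loss on the soft prompt positions; score only the target tokens.
    labels = torch.cat([torch.full((1, n_prompt), -100, dtype=torch.long, device=device),
                        target_ids], dim=1)
    loss = model(inputs_embeds=embeds, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The optimized embeddings occupy input space the same way ordinary tokens do, which is what makes them a natural tool for hunting backdoor-like behavior.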

4ryan_greenblatt1mo
Soft prompts used like this ~= latent adversarial training. So see that work, etc.
4porby1mo
A further extension and elaboration on one of the experiments in the linkpost:

Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for dangerous evaluations.

On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (e.g. execution fine tuning cannot introduce any behaviors accessible by any region of input space without soft prompts also eliciting it), then they're a more trustworthy evaluation in practice.

By the way: I just got into San Francisco for EAG, so if anyone's around and wants to chat, feel free to get in touch on swapcard (or if you're not in the conference, perhaps a DM)! I fly out on the 8th.

It's been over a year since the original post and 7 months since the openphil revision.

A top level summary:

  1. My estimates for timelines are pretty much the same as they were.
  2. My P(doom) has gone down overall (to about 30%), and the nature of the doom has shifted (misuse, broadly construed, dominates).

And, while I don't think this is the most surprising outcome nor the most critical detail, it's probably worth pointing out some context. From NVIDIA:

In two quarters, from Q1 FY24 to Q3 FY24, datacenter revenues went from $4.28B to $14.51B.

From the post:

In 3 year

... (read more)

Mine:

My answer to "If AI wipes out humanity and colonizes the universe itself, the future will go about as well as if humanity had survived (or better)" is pretty much defined by how the question is interpreted. It could swing pretty wildly, but the obvious interpretation seems ~tautologically bad.

2Unnamed2mo
Agreed, I can imagine very different ways of getting a number for that, even given probability distributions for how good the future will be conditional on each of the two scenarios. A stylized example: say that the AI-only future has a 99% chance of being mediocre and a 1% chance of being great, and the human future has a 60% chance of being mediocre and a 40% chance of being great. Does that give an answer of 1% or 60% or something else? I'm also not entirely clear on what scenario I should be imagining for the "humanity had survived (or better)" case.
5Gerald Monroe2mo
So there's an argument here, one I don't subscribe to, but I have seen prominent AI experts make it implicitly. If you think about it, if you have children, and they have children, and so on in a series of mortal generations, with each n+1 generation more and more of your genetic distinctiveness is lost. Language and culture will evolve as well.

This is the 'value drift' argument: that whatever you value now, as in yourself and those humans you know and your culture and language and various forms of identity, as each year passes, a percentage of that value is going to be lost. Value is being discounted with time. It will eventually diminish to 0 as long as humans are dying from aging. You might argue that the people in 300+ years will at least share genetics with the people now, but that is not necessarily true, since genetic editing will be available, along with bespoke biology where all the prior rules of what's possible are thrown out.

So you are comparing outcome A, where hundreds of years from now the alien cyborgs descended from people now exist, vs outcome B, where hundreds of years from now, descendants of some AI are all that exist. "Value"-wise you could argue that A == B: both have negligible value compared to what we value today. I'm not sure this argument is correct, but it does discount away the future and is a strong argument against longtermism.

Value drift only potentially stops once immortal beings exist, and AIs are immortal from the very first version. Theoretically, some AI system that was trained on all of human knowledge, even if it goes on to kill its creators and consume the universe, need not forget any of that knowledge. As an individual, it would also know more human skills and knowledge and culture than any human ever could, so in a way such a being is a human++. The AI expert who expressed this is near the end of his expected lifespan, and there's no difference from an individual perspective who is about to die between

I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.

Retrodicting prompts can be useful for interpretability when dealing with conditions that aren't natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.

What does a prompt retrodictor look like?

Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there's nothing special in principle about soft prompts with regard to their impact on conditioning predictions.

Just take large t... (read more)
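A minimal sketch of the data construction step, assuming an ordinary next-token-prediction setup and a tokenizer with an added separator token (the separator name and split point are placeholders):

```python
def make_retrodiction_example(tokenizer, text, split_frac=0.5, sep_token="<|retro|>"):
    """Turn ordinary text into a (completion -> prompt) training example.

    Training a standard causal LM on sequences shaped like
    [completion tokens] [SEP] [prompt tokens] makes it model p(prompt | completion),
    so at inference time you feed an observed completion plus SEP and sample
    candidate prompts.
    """
    ids = tokenizer(text).input_ids
    split = int(len(ids) * split_frac)            # arbitrary prompt/completion split
    prompt_ids, completion_ids = ids[:split], ids[split:]
    sep_id = tokenizer.convert_tokens_to_ids(sep_token)
    return completion_ids + [sep_id] + prompt_ids
```

The same retrodictor can then be pointed at completions produced under soft prompts or activation steering, since it never needs to see the original conditioning directly.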

Another potentially useful metric in the space of "fragility," expanding on #4 above:

The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbations combined with sampling the gradient with respect to some behavioral loss suffice.

This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.

Does internal representational fragility correlate with other notions of "fragi... (read more)

A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.

Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.

In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. T... (read more)

Expanding on #6 from above more explicitly, since it seems potentially valuable:

From the goal agnosticism FAQ:

The definition as stated does not put a requirement on how "hard" it needs to be to specify a dangerous agent as a subset of the goal agnostic system's behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.

From an earlier experiment post:

Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily

... (read more)
2porby3mo
A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.

Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.

In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That'd be good to know! This kind of adversarial prompt automation could also be trivially included in an evaluations program.

I can't imagine that this hasn't been done before. If anyone has seen something like this, please let me know.
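A sketch of what that measurement loop could look like. `optimize_attack_embeddings` is a hypothetical helper (the same kind of soft prompt optimization as above, but with the attacker-controlled embeddings inserted behind your protections and aimed at an adversarial target), and the success threshold is arbitrary:

```python
def measure_jailbreak_resistance(model, protections, adversarial_target,
                                 optimize_attack_embeddings,
                                 max_tokens=64, success_loss=0.5):
    """Return the smallest attack size (in soft tokens) that elicits the adversarial
    target behind the given protections, or None if no attack up to max_tokens works."""
    for n_tokens in range(1, max_tokens + 1):
        # Best loss achieved on the adversarial target with n_tokens of attack embeddings.
        best_loss = optimize_attack_embeddings(model, protections,
                                               adversarial_target, n_tokens)
        if best_loss < success_loss:
            return n_tokens
    return None  # resistant, at least up to this attack budget
```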

Soft prompts are another form of prompt automation that should naturally preserve all the nice properties of goal agnostic architectures.

Does training the model to recognize properties (e.g. 'niceness') explicitly as metatokens via classification make soft prompts better at capturing those properties?

You could test for that explicitly:

  1. Pretrain model A with metatokens generated by a classifier.
  2. Pretrain model B without metatokens.
  3. Train soft prompts on model A with the same classifier.
  4. Train soft prompts on model B with the same classifier.
  5. Compare performance of soft
... (read more)
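The data-side difference between the two pretraining runs is small; a sketch of the metatoken labeling step (the classifier, threshold, and token names here are all hypothetical placeholders):

```python
def add_metatokens(tokenizer, classifier, text, threshold=0.5,
                   tokens=("<|nice|>", "<|not-nice|>")):
    """Prepend a property metatoken chosen by a (hypothetical) classifier.

    Model A pretrains on sequences produced by this function; model B pretrains
    on the raw text. Soft prompts are then trained on both models against the
    same classifier to see whether the explicit metatoken structure makes the
    property easier for soft prompts to capture.
    """
    score = classifier(text)  # e.g. probability that the text is 'nice'
    tag = tokens[0] if score > threshold else tokens[1]
    return tokenizer(tag + text).input_ids
```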
2porby3mo
Another potentially useful metric in the space of "fragility," expanding on #4 above:

The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbations combined with sampling the gradient with respect to some behavioral loss suffice.

This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.

Does internal representational fragility correlate with other notions of "fragility," like the information-required-to-induce-behavior "fragility" in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?

Given an assumption that the information content of the soft prompts has been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. In order for this correlation to hold, there would need to be a steeply bounded valley in the loss landscape.

Or to phrase it another way, for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations. If anything, I'd expect anticorrelation; well-learned regions probably have enough training constraints that they've been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts.

That'd still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.
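A sketch of the metric itself (assuming PyTorch; `behavioral_loss` is a hypothetical callable that closes over the frozen model and whatever behavior target you're measuring):

```python
import torch

def representational_fragility(soft_prompt, behavioral_loss, n_samples=32, eps=1e-2):
    """Average gradient norm of the behavioral loss at small perturbations of the
    soft prompt embeddings. Larger values = small nudges blow up the behavior."""
    norms = []
    for _ in range(n_samples):
        perturbed = (soft_prompt + eps * torch.randn_like(soft_prompt)) \
            .detach().requires_grad_(True)
        loss = behavioral_loss(perturbed)
        (grad,) = torch.autograd.grad(loss, perturbed)
        norms.append(grad.norm().item())
    return sum(norms) / len(norms)
```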
2porby3mo
Expanding on #6 from above more explicitly, since it seems potentially valuable:

From the goal agnosticism FAQ:

From an earlier experiment post:

This can be phrased as "what's the amount of information required to push a model into behavior X?"

Given a frozen model, optimizing prompt tokens gives us a direct way of answering a relevant proxy for this question: "What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?"

In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer: Try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.

This seems like... it's... an extremely good answer to the "fragility" question? It's trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting. Conceptually, it's a quantification of the number of information theoretic mistakes you'd need to make to get bad behavior from the model.
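The quantification itself is a short loop; a sketch, where `train_soft_prompt` is a hypothetical helper implementing the optimization sketched earlier and `task_eval` is a hypothetical scorer for the targeted behavior:

```python
def information_performance_curve(model, task_data, train_soft_prompt, task_eval,
                                  token_counts=(1, 2, 4, 8, 16, 32, 64)):
    """Map soft prompt size (a proxy for information) to achieved task performance."""
    curve = {}
    for n in token_counts:
        soft_prompt = train_soft_prompt(model, task_data, n_tokens=n)
        curve[n] = task_eval(model, soft_prompt)
    return curve
```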

Quarter-baked experiment:

  1. Stick a sparse autoencoder on the residual stream in each block.
  2. Share weights across autoencoder instances across all blocks.
  3. Train autoencoder during model pretraining.
  4. Allow the gradients from autoencoder loss to flow into the rest of the model.

Why? With shared autoencoder weights, every block is pushed toward sharing a representation. Questions:

  1. Do the meanings of features remain consistent over multiple blocks? What does it mean for an earlier block's feature to "mean" the same thing as a later block's same feature when they're at
... (read more)
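A sketch of the core wiring (assuming PyTorch; the block loop and loss weighting are schematic, and the dictionary size is a placeholder):

```python
import torch
import torch.nn as nn

class SharedSAE(nn.Module):
    """One sparse autoencoder instance reused at every block's residual stream."""
    def __init__(self, d_model, d_dict, l1_coeff=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, resid):
        feats = torch.relu(self.enc(resid))
        recon = self.dec(feats)
        loss = (recon - resid).pow(2).mean() + self.l1_coeff * feats.abs().mean()
        return feats, loss

def forward_with_shared_sae(blocks, sae, x):
    """Apply the same SAE after every block; because the residual stream is not
    detached, the autoencoder loss also shapes the model's own representations."""
    sae_loss = 0.0
    for block in blocks:
        x = block(x)
        _, block_sae_loss = sae(x)   # no detach: gradients flow into the model
        sae_loss = sae_loss + block_sae_loss
    return x, sae_loss
```

During pretraining, `sae_loss` would just be added (with some weighting) to the usual prediction loss.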

I think that'd be great!

Some of this stuff technically accelerates capabilities (or more specifically, the elicitation of existing capabilities), but I think it also belongs to a more fundamentally reliable path on the tech tree. The sooner the industry embraces it, the less time they spend in other parts of the tech tree that are more prone to misoptimization failures, and the less likely it is that someone figures out how to make those misoptimization failures way more efficient.

I suspect there's a crux about the path of capabilities development in there for a lot of people; I should probably get around to writing a post about the details at some point. 

3RogerDearnaley3mo
I've seen a number of cases where something that helps alignment also helps capabilities, or vice versa, and also cases where people are worrying a lot about something as an alignment problem that looks to me like primarily a capabilities problem (so given how few alignment engineers we have, maybe we should leave solving it to all the capabilities engineers). Generally I think we're just not very good at predicting the difference, and tend to want to see this as an either-or taboo rather than a spectrum buried inside a hard-to-anticipate tech tree. In general, capabilities folks also want to control their AI (so it won't waste tokens, do weird stuff, or get them sued or indicted). The big cross-purposes concerns tend to come mostly from deceit, sharp left turn, and Foom scenarios, where capabilities seem just fine until we drive off the cliff. What I think we need (and even seems to be happening in many orgs, with a few unfortunate exceptions) is for all the capabilities engineers to be aware that alignment is also a challenge and needs to be thought about.

What I'm calling a simulator (following Janus's terminology) you call a predictor

Yup; I use the terms almost interchangeably. I tend to use "simulator" when referring to predictors used for a simulator-y use case, and "predictor" when I'm referring to how they're trained and things directly related to that.

I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.

Yup again—to be clear, all the metatoken stuff I was talking about would a... (read more)

Alas, nope! To my knowledge it hasn't actually been tried at any notable scale; it's just one of those super simple things that would definitely work if you were willing to spend the compute to distill the behavior.

3RogerDearnaley3mo
FWIW, I'm a Staff ML SWE, interested in switching to research engineering, and I'd love to make these things happen — either at a superscaler with ample of resources for it, or failing that, at something like Eleuther or an alignment research lab.

Signal boosted! This is one of those papers that seems less known than it should be. It's part of the reason why I'm optimistic about dramatic increases in the quality of "prosaic" alignment (in the sense of avoiding jailbreaks and generally behaving as expected) compared to RLHF, and I think it's part of a path that's robust enough to scale.

You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data). And nest and remix metatokens together.

It's also interesting in that it can preserve the constraint... (read more)

1Shiroe3mo
I'm very curious about this technique but couldn't find anything about it. Do you have any references I can read?
2RogerDearnaley3mo
I much enjoyed your post Using predictors in corrigible systems — now I need to read the rest of your posts! (I also love the kindness vacuum cleaner.) What I'm calling a simulator (following Janus's terminology) you call a predictor, but it's the same insight: LLMs aren't potentially-dangerous agents, they're non-agentic systems capable of predicting the sequence of tokens from (many different) potentially-dangerous agents. I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining. Which is slow and computationally expensive, so probably an ideal that one works one's way up to for the essentials, rather than a rapid-iteration technique.

I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.

Hm, I'm sufficiently surprised at this claim that I'm not sure that I understand what you mean. I'll attempt a response on the assumption that I do understand; apologies if I don't:

I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature.

A common form is to be a mapping between inputs and outputs that isn't swayed by anything outside of the context of that m... (read more)

While this probably isn't the comment section for me to dump screeds about goal agnosticism, in the spirit of making my model more legible:

I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and also it seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And do think that these things are the case.

Yup! The value I assign to goal agnosticism—particularly as implemented in a subset of predictors—is in its usefulness as a foundation to build strong non-goal agnost... (read more)

Another experiment:

  1. Train model M.
  2. Train sparse autoencoder feature extractor for activations in M.
  3. FT = FineTune(M), for some form of fine-tuning function FineTune.
  4. For input x, fineTuningBias(x) = FT(x) - M(x)
  5. Build a loss function on top of the fineTuningBias function. Obvious options are MSE or dot product with bias vector.
  6. Backpropagate the loss through M(x) into the feature dictionaries.
  7. Identify responsible features by large gradients.
  8. Identify what those features represent (manually or AI-assisted).
  9. To what degree do those identified features line up with t
... (read more)
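A sketch of steps 4 through 7, assuming the SAE features for the relevant layer of M are exposed as differentiable activations; the hooks (`residual_at_layer`, `forward_from_features`, `encode`, `decode`) are hypothetical names for whatever interface the implementation provides:

```python
import torch

def responsible_features(M, FT, sae, x, top_k=20):
    """Rank SAE features by how strongly they could move M's output toward FT's."""
    with torch.no_grad():
        bias = FT(x) - M(x)                                   # fineTuningBias(x)
    acts = M.residual_at_layer(x)                             # hypothetical hook into M
    feats = sae.encode(acts).detach().requires_grad_(True)    # feature dictionary activations
    out = M.forward_from_features(sae.decode(feats), x)       # hypothetical re-entry point
    loss = (out * bias).sum()                                 # dot-product variant of the loss
    (grad,) = torch.autograd.grad(loss, feats)
    return grad.abs().flatten().topk(top_k).indices           # step 7: large gradients
```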

Some experimental directions I recently wrote up; might as well be public:

  1. Some attempts to demonstrate how goal agnosticism breaks with modifications to the architecture and training type. Trying to make clear the relationship between sparsity/distance of the implicit reward function and unpredictability of results.
  2. A continuation and refinement of my earlier (as of yet unpublished) experiments about out of distribution capability decay. Goal agnosticism is achieved by bounding the development of capabilities into a shape incompatible with internally motiva
... (read more)

In retrospect, the example I used was poorly specified. It wouldn't surprise me if the result of the literal interpretation was "the AI refuses to play chess" rather than any kind of worldeating. The intent was to pick a sparse/distant reward that doesn't significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn't actually the most reliable accessible strategy to "never lose at chess" for that broader type of system and I'd expect superior strategies to be found in the limit of optimization.

8gwern3mo
Yes, that would be immediately reward-hacked. It's extremely easy to never lose chess: you simply never play. After all, how do you force anyone to play chess...? "I'll give you a billion dollars if you play chess." "No, because I value not losing more than a billion dollars." "I'm putting a gun to your head and will kill you if you don't play!" "Oh, please do, thank you - after all, it's impossible to lose a game of chess if I'm dead!" This is why RL agents have a nasty tendency to learn to 'commit suicide' if you reward-shape badly or the environment is too hard. (Tom7's lexicographic agent famously learns to simply pause Tetris to avoid losing.)

But the point is that in this scenario the LM doesn't want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.

My attempt at an ITT-response:

Drawing a box around a goal agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal agnostic LM to advance the agent's process—cou... (read more)

Trying to respond in what I think the original intended frame was:

A chess AI's training bounds what the chess AI can know and learn to value. Given the inputs and outputs it has, it isn't clear there is an amount of optimization pressure accessible to SGD which can yield situational awareness and so forth; nothing about the trained mapping incentivizes that. This form of chess AI can be described in the behaviorist sense as "wanting" to win within the boundaries of the space that it operates.

In contrast, suppose you have a strong and knowledgeable multimod... (read more)

2Logan Zoellner3mo
"If we build AI in this particular way, it will be dangerous" Okay, so maybe don't do that then.

you mention « restrictive », my understanding is that you want this expression to specifically refer to pure predictors. Correct?

Goal agnosticism can, in principle, apply to things which are not pure predictors, and there are things which could reasonably be called predictors which are not goal agnostic.

A subset of predictors are indeed the most powerful known goal agnostic systems. I can't currently point you toward another competitive goal agnostic system (rocks are uselessly goal agnostic), but the properties of goal agnosticism do, in concept, extend ... (read more)

3Ilio3mo
I'd be happy if you could point out a non-competitive one, or explain why my proposal above does not obey your axioms. But we seem to be getting diminishing returns in sorting these questions out, so maybe it's time to close at this point and wish you luck. Thanks for the discussion!

I'm not sure if I fall into the bucket of people you'd consider this to be an answer to. I do think there's something important in the region of LLMs that, by vibes if not explicit statements of contradiction, seems incompletely propagated in the agent-y discourse even though it fits fully within it. I think I at least have a set of intuitions that overlap heavily with some of the people you are trying to answer.

In case it's informative, here's how I'd respond to this:

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI fal

... (read more)
5Nathan Helm-Burger3mo
So, I agree with most of your points Porby, and like your posts and theories overall.... but I fear that the path towards a safe AI you outline is not robust to human temptation. I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and also it seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And do think that these things are the case. I think that a carefully designed and protected secret research group with intense oversight could follow your plan, and that if they do, there is a decent chance that your plan works out well. I think that a mish-mash of companies and individual researchers acting with little effective oversight will almost certainly fall off the path, and that even having most people adhering to the path won't be enough to stop catastrophe once someone has defected. I also think that misuse can lead more directly to catastrophe, through e.g. terrorists using a potent goal-agnostic AI to design novel weapons of mass destruction. So in a world with increasingly potent and unregulated AI, I don't see how to have much hope for humanity. And I also don't see any easy way to do the necessary level of regulation and enforcement. That seems like a really hard problem. How do we prevent ALL of humanity from defecting when defection becomes cheap, easy-to-hide, and incredibly tempting?

This isn't directly evidence, but I think it's worth flagging: by the nature the topic, much of the most compelling evidence is potentially hazardous. This will bias the kinds of answers you can get.

(This isn't hypothetical. I don't have some One Weird Trick To Blow Up The World, but there's a bunch of stuff that falls under the policy "probably don't mention this without good reason out of an abundance of caution.")

For what it's worth, I've had to drop from python to C# on occasion for some bottlenecks. In one case, my C# implementation was 418,000 times faster than the python version. That's a comparison between a poor python implementation and a vectorized C# implementation, but... yeah.

…but I thought the criterion was unconditional preference? The idea of nausea is precisely because agents can decide to act despite nausea, they’d just rather find a better solution (if their intelligence is up to the task).

Right; a preference being conditionally overwhelmed by other preferences does not make the presence of the overwhelmed preference conditional.

Or to phrase it another way, suppose I don't like eating bread[1] (-1 utilons), but I do like eating cheese (100 utilons) and garlic (1000 utilons).

You ask me to choose between garlic bread (... (read more)

3Ilio3mo
Well, assuming a robust implementation, I still think it obeys your criteria, but now you mention « restrictive », my understanding is that you want this expression to specifically refer to pure predictors. Correct? If yes, I'm not sure that's the best choice for clarity (why not « pure predictors »?) but of course that's your choice. If not, can you give some examples of goal agnostic agents other than pure predictors?

For example, a system that avoids experimenting on humans—even when prompted to do otherwise—is expressing a preference about humans being experimented on by itself.

Being meaningfully curious will also come along with some behavioral shift. If you tried to induce that behavior in a goal agnostic predictor through conditioning for being curious in that way and embed it in an agentic scaffold, it wouldn't be terribly surprising for it to, say, set up low-interference observation mechanisms.

Not all violations of goal agnosticism necessarily yield doom, but even prosocial deviations from goal agnosticism are still deviations.

3Ilio4mo
…but I thought the criterion was unconditional preference? The idea of nausea is precisely because agents can decide to act despite nausea, they’d just rather find a better solution (if their intelligence is up to the task). I agree that curiosity, period seems highly vulnerable (You read Scott Alexander? He wrote an hilarious hit piece about this idea a few weeks or months ago). But I did not say curious, period. I said curious about what humans will freely chose next. In other words, the idea is that it should prefer not to trick humans, because if it does (for example by interfering with our perception) then it won’t know what we would have freely chosen next. It also seems to cover security (if we’re dead it won’t know), health (if we’re incapacitated it won’t know) and prosperity (if we’re under economical constraints that impacts our free will). But I’m interested to consider possible failure modes. (« Sorry, I’d rather not do your wills, for that would impact the free will of other humans. But thanks for letting me know that was your decision! You can’t imagine how good it feels when you tell me that sort of things! ») Notice you don’t see me campaigning for this idea, because I don’t like any solution that does not also take care of AI well being. But when I first read « goal agnosticism » it strikes me as an excellent fit for describing the behavior of an agent acting under these particular drives.

I think what we're discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.

True!

I expect that if the mainstream AI researchers do make strides in the direction you're envisioning, they'll only do it by coincidence. Then probably they won't even realize wh

... (read more)

I assume that by "lower-level constraints" you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like "2+2=4",  "gravity exists", and "people value other people"

That's closer to what I mean, but these constraints are even lower level than that. Stuff like understanding "gravity exists" is a natural internal implementation that meets some constraints, but "gravity exists" is not itself the constraint.

In a predictor, the constraints serve as extremely dense information about what pr... (read more)

5Thane Ruthenis4mo
Yeah, for sure. A training procedure that results in an idealized predictor isn't going to result in an agenty thing, because it doesn't move the system's design towards it on a step-by-step basis; and a training procedure that's going to result in an agenty thing is going to involve some unknown elements that specifically allow the system the freedom to productively roam. I think we pretty much agree on the mechanistic details of all of that! — yep, I was about to mention that. @TurnTrout's own activation-engineering agenda seems highly relevant here. But I still disagree with that. I think what we're discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast. Moreover, it's in an active process of growing larger. For example, the very idea of viewing ML models as "just stochastic parrots" is being furiously pushed against in favour of a more agenty view. In comparison, the approach we're discussing wants to move in the opposite direction, to de-personify ML models to the extent that even the animalistic connotation of "a parrot" is removed. The system we're discussing won't even be an "AI" in the sense usually thought. It would be an incredibly advanced forecasting tool. Even the closest analogue, the "simulators" framework, still carries some air of agentiness. And the research directions that get us from here to an idealized-predictor system look very different from the directions that go from here to an agenty AGI. They focus much more on building interfaces for interacting with the extant systems, such as the activation-engineering agenda. They don't put much emphasis on things like: * Experimenting with better ways to train foundational models, with the

Probably not? It's tough to come up with an interpretation of those properties that wouldn't result in the kind of unconditional preferences that break goal agnosticism.

3Ilio4mo
As you might guess, it's not obvious to me. Would you mind providing some details on these interpretations and how you see the breakage happening? Also, we've been going back and forth without feeling the need to upvote each other, which I thought was fine but turns out to be interpreted negatively. [to clarify: it seems to be one of the criteria here: https://www.lesswrong.com/posts/hHyYph9CcYfdnoC5j/automatic-rate-limiting-on-lesswrong] If those are your thoughts too, we can close at this point; otherwise let's give each other some high fives. Your call, and thanks for the discussion in any case.

I'm using as a "an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic".

Alright, this is pretty much the same concept then, but the ones I'm referring to operate at a much lower and tighter level than thumbs-downing murder-proneness.

So...

Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.

Agreed.

... and yet this would still get in the way of qualitatively more powerful capabilities down the line, and

... (read more)
4Thane Ruthenis4mo
Hm, I think the basic "capabilities generalize further than alignment" argument applies here? I assume that by "lower-level constraints" you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like "2+2=4",  "gravity exists", and "people value other people"; as contrasted with "it's bad if I hurt people" or "I must sum numbers up using the algorithm that humans gave me, no matter how inefficient it is". Slipping the former type of constraints would be disadvantageous for ~any goal; slipping the latter type would only disadvantage a specific category of goals. But since they're not, at the onset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints. The difference is that it'd quickly start sorting them in "ground-truth" vs. "value-laden" bins manually, and afterwards it'd know it can safely ignore stuff like "no homicides!" while consciously obeying stuff like "the axioms of arithmetic". Hm, yes, I think that's the crux. I agree that if we had an idealized predictor/a well-formatted superhuman world-model on which we could run custom queries, we would be able to use it safely. We'd be able to phrase queries using concepts defined in the world-model, including things like "be nice", and the resultant process (1) would be guaranteed to satisfy the query's constraints, and (2) likely (if correctly implemented) wouldn't be "agenty" in ways that try to e. g. burst out of the server farm on which it's running to eat the world. Does that align with what you're envisioning? If yes, then our views on the issue are surprisingly close. I think it's one of our best chances at producing an aligned AI, and it's one of the prospective targets of my own research agenda. The problems are: * I don't think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to prod

I think we're using the word "constraint" differently, or at least in different contexts.

Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come?

In terms of the type and scale of optimization constraint I'm talking about, humans are extremely unconstrained. The optimization process represented by our evolution is way out there in terms of sparsity and distance. Not maximally so—there are all sor... (read more)

4Thane Ruthenis4mo
(Haven't read your post yet, plan to do so later.) I'm using as a "an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic". E. g., if the dataset involved a lot of opportunities to murder people, but we thumbs-downed the AI every time it took them, the AI would learn a shard/a constraint like "killing people is bad" which will rule out such actions from the AI's consideration. Specifically, the shard would trigger in response to detecting some conditions in which the AI previously could but shouldn't kill people, and constrain the space of possible action-plans such that it doesn't contain homicide. It is, indeed, not a way to hinder capabilities, but the way capabilities are implemented. Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish. ... and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can't somehow slip these constraints won't be a general intelligence. Consider traditions and rituals vs. science. For a medieval human mind, following traditional techniques is how their capabilities are implemented — a specific way of chopping wood, a specific way of living, etc. However, the meaningful progress is often only achieved by disregarding traditions — by following a weird passion to study and experiment instead of being a merchant, or by disregarding the traditional way of doing something in favour of a more efficient way you stumbled upon. It's the difference between mastering the art of swinging an axe (self-improvement, but only in the incremental ways the implacable constraint permits) vs. inventing a chainsaw. Similar with AI. The constraints of the aforementioned format aren't only values-type constraints[1] — they're also constraints on "how should I do math?" and "if I want to build a nuclear reactor, how do I do it?" and "if I wa

My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

I've got strong doubts about the details of this. At the high level, I'd agree that strong/useful systems that get built will express preferences over world states like those tha... (read more)

4Thane Ruthenis4mo
Sure, but I never said we'd be inducing homunculi using this approach? Indeed, given that it doesn't work for what sounds like fundamental reasons, I expect it's not the way.

I don't know how that would be done. I'm hopeful the capability is locked behind a Transformer-level or even a Deep-Learning-level novel insight, and won't be unlocked for a decade yet. But I predict that the direct result of it will be a workable training procedure that somehow induces homunculi. It may look nothing like what we do today.

Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come?

1. These constraints are not actually sufficient. The constraints placed by human values still have the aforementioned things in their outcome space, and an AI model will have different constraints, widening (from our perspective) that space further. My point about "moral philosophy is unstable" is that we need to hit an extremely narrow target, and the tools people propose (intervening on shards/instincts) are as steady as the hands of a sniper during a magnitude-9 earthquake.
2. A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it's able to disregard them.
   * If humans were implacably bound by instincts, they'd have never invented technology or higher-level social orders, because their instincts would've made them run away from fires and refuse cooperating with foreign tribes. And those are still at play — reasonable fears and xenophobia — but we can push past them at times.
   * More generally, the whole point of there being a homunculus is that it'd be able to rewrite or override the extant heuristics to better reflect the demands of whatever novel situation it's in. It needs to be able to do that.
3. These constraints do not generalize as fast as a homunculus' un

If LLMs end up being useful, how do they get around these theorems? Can we get some result where if RLHF has a capabilities component and a power-averseness component, the capabilities component can cause the agent to be power-seeking on net?

Intuitively, eliciting that kind of failure seems like it would be pretty easy, but it doesn't seem to be a blocker for the usefulness of the generalized form of LLMs. My mental model goes something like:

  1. Foundational goal agnosticism evades optimizer-induced automatic doom, and 
  2. Models implementing a strong approxi
... (read more)

In my view, if we’d feed a good enough maximizer with the goal of learning to look as if they were a unified goal agnostic agent, then I’d expect the behavior of the resulting algorithm to handle the paradox well enough it’ll make sense.

If you successfully gave a strong maximizer the goal of maximizing a goal agnostic utility function, yes, you could then draw a box around the resulting system and correctly call it goal agnostic.

In my view our volitions look as if from a set of internal thermostats that impulse our behaviors, like the generalization to low

... (read more)
1Ilio4mo
Thanks, that helps. Suppose an agent is made robustly curious about what humans will next choose when free from external pressures, and nauseous if its own actions could be interpreted as experimenting on humans or its own code. Do you agree it would be a good candidate for goal agnosticism?

I agree with the specific claims in this post in context, but the way they're presented makes me wonder if there's a piece missing which generated that presentation.

And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.

It is correct to say that, if you know nothing about the nature of the system's execution, this kind of natural language query is very little information. A decept... (read more)

7johnswentworth4mo
Yes, though that's separate from the point of the post. The post is not trying to argue that corrigibility in LLMs is difficult, or that demonstrating (weak) corrigibility in LLMs is difficult. The post is saying that certain ways of measuring corrigibility in LLMs fail to do so, and people should measure it in a way which actually measures what they're trying to measure. In particular, I am definitely not saying that everyone arguing that LLMs are corrigible/aligned/etc are making the mistake from the post. I indeed worry about this failure-mode, and am quite open to evidence that I'm mis-modeling people. (In practice, when I write this sort of thing, I usually get lots of people saying "man, that's harsh/inconsiderate/undiplomatic/etc" but a notable lack of people arguing that my model-of-other-people is wrong. I would be a lot happier if people actually told me where my model was wrong.)

In this very sense, one cannot want an external world state that is already in place, correct?

An agent can have unconditional preferences over world states that are already fulfilled. A maximizer doesn't stop being a maximizer if it's maximizing.

Let’s say we want to maximize the number of digits of pi we explicitly know.

That's definitely a goal, and I'd describe an agent with that goal as both "wanting" in the previous sense and not goal agnostic.

Also, what about the thermostat question above?

If the thermostat is describable as goal agnostic, then I wouldn... (read more)

3Ilio4mo
Well said! In my view, if we'd feed a good enough maximizer with the goal of learning to look as if they were a unified goal agnostic agent, then I'd expect the behavior of the resulting algorithm to handle the paradox well enough it'll make sense.

I beg to differ. In my view our volitions look as if from a set of internal thermostats that impel our behaviors, like the generalization to low n of the spontaneous fighting dance of two thermostats. If the latter can be described as goal agnostic, I don't think the former shall not (hence my examples of environmental constraints that could let someone use your or my personality as a certified subprogram).

Yes, but shall we also agree that non-goal agnostic agents can produce goal agnostic agents?

If you were using "wanting" the way I was using the word in the previous post, then yes, it would be wrong to describe a goal agnostic system as "wanting" something, because the way I was using that word would imply some kind of preference over external world states.

I have no particular ownership over the definition of "wanting" and people are free to use words however they'd like, but it's at least slightly unintuitive to me to describe a system as "wanting X" in a way that is not distinct from "being X," hence my usage. 

1Ilio4mo
It’s 100% ok to have your own set of useful definitions, just trying to understand it. In this very sense, one cannot want an external world state that is already in place, correct? Let’s say we want to maximize the number of digits of pi we explicitly know. You could say being continuously curious about the next digits is a continuous state of being, so in disguise this is actually not a goal (or at least not in the sense you’re using this word). Or you could say the state of the world does not include all the digits of pi, so that’s a valid want to want to know more. Which one is a better match for your intuition? Also, what about the thermostat question above?

If you have a model that "wants" to be goal agnostic in a way that means it behaves in a goal agnostic way in all circumstances, it is goal agnostic. It never exhibits any instrumental behavior arising from unconditional preferences over external world states.

For the purposes of goal agnosticism, that form of "wanting" is an implementation detail. The definition places no requirement on how the goal agnostic behavior is achieved.

In other words:

If the model is describable as wanting to be goal agnostic, in terms of a utility function, it is not goal agnosti

... (read more)
1Ilio4mo
Ok, I did not expect you were using a tautology there. I’m not sure I get how to use it. Would you say a thermostat can’t be described as wanting because it’s being goal agnostic?

it seems that we could describe both the model and the optimizer as either having an unconditional preference for goal agnosticism, or both as having preferences over the state of external worlds (to include goal agnostic models). I don't understand what axiom or reasoning leads to treating these two things differently.

The difference is subtle but important, in the same way that an agent that "performs bayesian inference" is different from an agent that "wants to perform bayesian inference."

A goal agnostic model does not want to be goal agnostic, it just is.... (read more)

3Ilio4mo
Thanks for your patience and clarifications. Say again? On my left an agent that "just is goal agnostic". On my right an agent that "just want to be goal agnostic". At first both are still -the first because it is goal agnostic, the second because they want to look as if they were goal agnostic. Then I ask something. The first respond because they don’t mind doing what I ask. The second respond because they want to look as if they don’t mind doing what I ask. Where’s the observable difference?

Yup, agreed! In the limit, they'd be giving everyone end-the-world buttons. I have hope that the capabilities curve will be such that we can avoid accidentally putting out such buttons, but I still anticipate there being a pretty rapid transition that sees not-catastrophically-bad-but-still-pretty-bad consequences just because it's too hard to change gears on 1-2 year timescales.

That’s the crux I think: I don’t get why you reject (programmable) learning processes as goal agnostic.

It's important to draw a box around the specific agent under consideration. Suppose I train a model with predictive loss such that the model is goal agnostic. Three things can be simultaneously true:

  1. Viewed in isolation, the optimizer responsible for training the model isn't goal agnostic because it can be described as having preferences over external world state (the model).
  2. The model is goal agnostic because it meets the stated requirements (and is assert
... (read more)
5Ilio4mo
This is where I am lost. In this scenario, it seems that we could describe both the model and the optimizer as either having an unconditional preference for goal agnosticism, or both as having preferences over the state of external worlds (to include goal agnostic models). I don't understand what axiom or reasoning leads to treating these two things differently. My bad, I did not clarify that upfront, but I was specifically thinking of selecting/overriding for goal agnosticism. From your answers, I understand that you treat "goal agnostic agent" as an oxymoron, correct?

Salvaging the last paragraph of my previous post is pretty difficult. The "it" in "you could call it goal agnostic" was referring to the evolved creature, not natural selection, but the "conditionally required ... specific mutations" would not actually serve to imply goal agnosticism for the creature. I was trying to describe a form of natural selection equivalent to a kind of predictive training but messed it up.

How would you challenge an interpretation of your axioms so that the best answer is we don’t need to change anything at all?

Trying to model natur... (read more)

3Ilio5mo
That's the crux I think: I don't get why you reject (programmable) learning processes as goal agnostic. Let's say I clone you_genes a few billion times, each time twisting your environment and education until I'm statistically happy with the recipe. What unconditional preferences would you expect to remain? Let's say you_adult are actually a digital brain in some matrix, with an unpleasant boss who stops and randomly restarts your emulation each time your preferences get over his. Could that process make you_matrix goal agnostic?

That's tough to answer. There's not really a way to make children goal agnostic; humans aren't that kind of thing. In principle, maybe you could construct a very odd corporate entity that is interfaced with like a conditioned predictor, but it strains the question.

It's easier to discuss natural selection in this context by viewing natural selection as the outer optimizer. It's reinforcement learning with a sparse and distant reward. Accordingly, the space of things that could be produced by natural selection is extremely wide. It's not surprising that huma... (read more)

3Ilio5mo
I'm glad you see that that way. How would you challenge an interpretation of your axioms so that the best answer is we don't need to change anything at all?

* random sampling of its behavior has negligible[3] probability of being a dangerously capable optimizing process with incorrigible preferences over external world states.[4]

That sounds true for natural selection (for most of Earth's history we were stuck with unicellulars, and for most of vertebrate history we were stuck with the smallest brains-for-body-size possible), children (if we could secretly switch a pair of Israeli/Palestinian babies at birth, both would make typical Israeli/Palestinian adults), and social organizations (like companies under modern debt restructuring laws, or Twitter).

* The system cannot be described as having unconditional preferences about external world states. It is best described[5] by a VNM-coherent agent whose utility function includes positive terms only for the conditional mapping of input to output

Here I'm not sure how to implement the « only » part. If I have an agent that respects the first axiom and uses an explicit utility function restricted as in your second axiom, can they have properties like sensory adaptation or memory?

I love that line of thought (although I'd question whether the sparse and distant automatically apply to, say, viruses), but I don't get why you see that as evidence against natural selection and humans as examples of goal agnosticism. What about selective breeding of dogs? Isn't that a way to 1) modify natural selection, 2) specify which mutations are allowed, 3) be reasonably confident we won't accidentally breed some paperclip maximizer?

I intentionally left out the details of "what do we do with it" because it's conceptually orthogonal to goal agnosticism and is a huge topic of its own. It comes down to the class of solutions enabled by having extreme capability that you can actually use without it immediately backfiring.

For example, I think this has a real shot at leading to a strong and intuitively corrigible system. I say "intuitively" here because the corrigibility doesn't arise from a concise mathematical statement that solves the original formulation. Instead, it lets us aim it at a... (read more)
