Comments

As long as you only care about the latent variables that make the two observed variables independent of each other, right? Asking because this feels isomorphic to classic issues relating to deception and wireheading unless one treads carefully, though I'm not quite sure whether you intend for it to be applied in this way.

One thing I'd note is that AIs can learn from variables that humans can't learn much from, so I think part of what will make this useful for alignment per se is a model of what happens if one mind has learned from a superset of the variables that another mind has learned from.

I think it's easier to see the significance if you imagine the neural network as a human-designed system. In e.g. a computer program, there's a clear distinction between the code that actually runs and the code that hypothetically could run if you intervened on the state, and in order to explain the output of the program, you only need to concern yourself with the former, rather than also needing to consider the latter.
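As a toy illustration of that distinction (my own example, not anything from the discussion above):

```python
# To explain this program's output you only need the branch that actually runs;
# the other branch only matters if you intervene on the input/state.
def respond(x: float) -> str:
    if x >= 0:
        return f"sqrt = {x ** 0.5}"    # this path runs for the input below
    else:
        return "negative input"        # hypothetically reachable, never executed here

print(respond(4.0))  # only the first branch is relevant to explaining this output
```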

For neural networks, I sort of assume there's a similar thing going on, except it's quite hard to define it precisely. In technical terms, neural networks lack a privileged basis which distinguishes different components of the network, so one cannot pick a discrete component and ask whether it runs and if so how it runs.

This is a somewhat different definition of "on-manifold" than is usually used, as it doesn't concern itself with the real-world data distribution. Maybe it's wrong of me to use the term like that, but I feel like the two meanings are likely to be related, since the real-world distribution of data shaped the inner workings of the neural network. (I think this makes most sense in the context of the neural tangent kernel, though ofc YMMV as the NTK doesn't capture nonlinearities.)

In principle I don't think it's always important to stay on-manifold, it's just what one of my lines of thought has been focused on. E.g. if you want to identify backdoors, going off-manifold in this sense doesn't work.

I agree with you that it is sketchy to estimate the manifold from wild empiricism. Ideally I'm thinking one could use the structure of the network to identify the relevant components for a single input, but I haven't found an option I'm happy with.

Also, one convoluted (and perhaps inefficient) idea for staying on-manifold, which felt kind of fun, is the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing the vectors to be close to one of the token vectors, or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on-manifold, right? You might be able to automate this by looking at perplexity, or by training a small model to estimate whether an input prompt is a "realistic" sentence or whatever.
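For concreteness, here's a rough sketch of how steps (2) and (3) of that recipe might look. Everything here is my own guess at an implementation: GPT-2, the layer index, the prompt length, the regularization weight, and the random placeholder steering vector are all assumptions, not anything specified above.

```python
# Hypothetical sketch: optimize a soft prompt to reproduce a (placeholder) steering
# vector at some layer, while regularizing the soft tokens toward real token
# embeddings, then snap to the nearest tokens and re-check.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)
emb = model.get_input_embeddings().weight           # (vocab, d_model)

LAYER = 6                                            # assumption: which residual stream to match
steer = torch.randn(emb.shape[1])                    # placeholder for a trained steering vector

def layer_activation(inputs_embeds):
    """Mean residual-stream activation at LAYER for a soft prompt."""
    out = model(inputs_embeds=inputs_embeds.unsqueeze(0), output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

# (2) optimize in (soft) token space to elicit the steering vector
soft = emb[torch.randint(0, emb.shape[0], (8,))].clone().requires_grad_(True)
opt = torch.optim.Adam([soft], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    match = torch.norm(layer_activation(soft) - steer) ** 2
    nearest = emb[torch.cdist(soft.detach(), emb).argmin(dim=1)]
    reg = torch.norm(soft - nearest) ** 2            # pull soft tokens toward real tokens
    (match + 0.1 * reg).backward()
    opt.step()

# (3) snap to real tokens and check the elicitation survives
token_ids = torch.cdist(soft.detach(), emb).argmin(dim=1)
print("snapped prompt:", tok.decode(token_ids.tolist()))
with torch.no_grad():
    print("soft match:", torch.norm(layer_activation(soft) - steer).item())
    print("hard match:", torch.norm(layer_activation(emb[token_ids]) - steer).item())
```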

Maybe. But isn't optimization in token-space pretty flexible, such that this is a relatively weak test?

Realistically, steering vectors can be useful even if they go off-manifold, so I'd hold off on trying to measure how on-manifold they are until there's a method that's been developed specifically to stay on-manifold. Then one can maybe adapt the measurement to the needs of that method.

I think this is a really interesting idea, but I'm not comfortable enough with drugs to test it myself. If anyone is doing this and wants psychometric advice, though, I am offering to join your project.

I think the proposed method could still work though. A substantial fraction of the pseudorandomness may be consistent on the individual person level.

The type of pseudorandomness you describe here ought to be independent at the level of individual items, so it ought to be part of the least-reliable variance component (not part of the general trait measured and not stable over time). It's possible to use statistics to estimate how big an effect it has on the scores, and it's possible to drive it arbitrarily far down in effect simply by making the test longer.
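As a quick sanity check on that last claim, here's a minimal simulation with made-up numbers (unit-variance trait and item noise), showing the item-level noise component shrinking as the test gets longer:

```python
# Item-level pseudorandom noise is independent across items, so it averages out
# of the total score as the number of items grows.
import numpy as np

rng = np.random.default_rng(0)
n_people = 5000
trait = rng.normal(size=n_people)                 # the stable trait being measured

def trait_share_of_score_variance(n_items: int) -> float:
    items = trait[:, None] + rng.normal(size=(n_people, n_items))  # item = trait + item noise
    score = items.mean(axis=1)
    return np.corrcoef(trait, score)[0, 1] ** 2

for k in (5, 20, 80):
    print(k, "items:", round(trait_share_of_score_variance(k), 3))  # climbs toward 1
```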

The way I'd phrase the theoretical problem: when you fit a model to a distribution (e.g. minimizing KL-divergence on a set of samples), you can often prove theorems of the form "the fitted distribution has such-and-such relationship to the true distribution", e.g. you can compute confidence intervals for parameters and predictions in linear regression.

Often, all that is sufficient for those theorems to hold is:

  1. The model is at an optimum
  2. The model is flexible enough
  3. The sample size is big enough

... because then if you have some point X you want to make predictions for, the sample size being big enough means you have a whole bunch of points in the empirical distribution that are similar to X. These points affect the loss landscape, and because you've got a flexible optimal model, that forces the model to approximate them well enough.

But this "you've got a bunch of empirical points dragging the loss around in relevant ways" part only works on-distribution, because you don't have a bunch of empirical points from off-distribution data. Even if technically they form an exponentially small slice of the true distribution, this means they only have an exponentially small effect on the loss function, and therefore being at an optimal loss is exponentially weakly informative about these points.

(Obviously this is somewhat complicated by overfitting, double descent, etc., but I think the gist of the argument goes through.)
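To make that concrete, here's a toy sketch of my own (the sine curve, polynomial degree, and bootstrap setup are arbitrary choices): a family of near-optimal flexible fits agrees closely where the training distribution has mass and disagrees wildly where it doesn't.

```python
# Near-optimal flexible fits are pinned down on-distribution by nearby data points,
# but almost unconstrained off-distribution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=200)

fits = []
for seed in range(20):                            # bootstrap resamples -> many near-optimal fits
    idx = np.random.default_rng(seed).integers(0, len(x), len(x))
    fits.append(np.polyfit(x[idx], y[idx], deg=9))

def disagreement(x0: float) -> float:
    return float(np.std([np.polyval(c, x0) for c in fits]))

print("on-distribution  (x=0.5):", disagreement(0.5))   # tiny
print("off-distribution (x=3.0):", disagreement(3.0))   # enormous
```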

I guess it depends on whether one makes the cut between theory and practice with or without assuming that one has learned the distribution? I.e. I'm saying if you have a distribution D, take some samples E, and approximate E with Q, then you might be able to prove that samples from Q are similar to samples from D, but you can't prove that conditioning on something exponentially unlikely in D gives you something reasonable in Q. Meanwhile you're saying that conditioning on something exponentially unlikely in D is tantamount to optimization.

(…Unless you do conditional sampling of a learned distribution, where you constrain the samples to be in a specific a-priori-extremely-unlikely subspace, in which case sampling becomes isomorphic to optimization in theory. (Because you can sample from the distribution of (reward, trajectory) pairs conditional on high reward.))
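A minimal toy version of that conditioning-as-optimization point, with a made-up one-dimensional action space and reward function of my own:

```python
# Sampling actions conditional on high reward concentrates on near-optimal actions,
# i.e. it behaves like (soft) optimization.
import numpy as np

rng = np.random.default_rng(0)
actions = rng.uniform(-3, 3, size=100_000)
rewards = -(actions - 1.0) ** 2 + rng.normal(scale=0.5, size=actions.shape)  # optimum at a = 1

high = rewards >= np.quantile(rewards, 0.99)       # "condition on high reward"
print("unconditional mean action:     ", actions.mean())        # ~0
print("reward-conditioned mean action:", actions[high].mean())  # ~1, close to the optimizer's pick
```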

Does this isomorphism actually go through? I know decision transformers kinda-sorta show how you can do optimization-through-conditioning in practice, but in theory the loss function which you use to learn the distribution doesn't constrain the results of conditioning off-distribution, so I'd think you're mainly relying on having picked a good architecture which generalizes nicely out of distribution.

One could reply, "Oh, sure, it's obvious that you can conditionally sample a learned distribution to safely do all sorts of economically valuable cognitive tasks, but that's not the danger of true AGI." And I ultimately think you're correct about that. But I don't think the conditional-sampling thing was obvious in 2004.

Idk. We already knew that you could use basic regression and singular vector methods to do lots of economically valuable tasks, since that was something that was done in 2004. Conditional sampling "just" adds in the noise around these sorts of methods, so it stands to reason that this might work too.

Adding noise obviously doesn't matter in 1 dimension except for making the outcomes worse. The reason we use it for e.g. images is that adding the noise does matter in high-dimensional spaces because without the noise you end up with the highest-probability outcome, which is out of distribution. So in a way it seems like a relatively minor fix to generalize something we already knew was profitable in lots of cases.
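A quick numerical check of that claim, using a standard Gaussian as a stand-in: in one dimension the mode is exactly where samples land, while in high dimensions samples concentrate far away from the mode.

```python
# The highest-probability point (the mode) of a high-dimensional Gaussian is
# nowhere near where typical samples actually land.
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 1000):
    samples = rng.normal(size=(10_000, d))
    print(d, "dims, typical distance from mode:", np.linalg.norm(samples, axis=1).mean())
    # grows like sqrt(d); the mode at 0 is wildly unrepresentative for large d
```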

On the other hand, I didn't learn the probability thing until playing with some neural network ideas for outlier detection and learning they didn't work. So in that sense it's literally true that it wasn't obvious (to a lot of people) back before deep learning took off.

And I can't deny that people were surprised that neural networks could learn to do art. To me this became relatively obvious with early GANs, which were later than 2004 but earlier than most people updated on it.

So basically I don't disagree but in retrospect it doesn't seem that shocking.

Fair, its eigenvectors should be equivalent to the singular vectors of the Jacobian.

I wonder if a similar technique could form the foundation for a fully general solution to the alignment problem. Like, mathematically speaking, all this technique needs is a vector-to-vector function, and it's not just layer-to-layer relationships that can be understood as a vector-valued function; the world as a function of the policy is also vector-valued.

I.e. rather than running a search to maximize some utility function, a model-based agent could run a search for small changes in policy that have a large impact on the world. If one can then taxonomize, constrain and select between these impacts, one might be able to get a highly controllable AI.
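Here's a sketch of what I have in mind, with a made-up toy "world model" standing in for the real thing (the functions and dimensions are arbitrary): compute the Jacobian of the world-state with respect to the policy parameters and read off the top right-singular vectors as the small policy changes with the largest impact.

```python
# Top singular vectors of the policy-to-world Jacobian = highest-impact small policy changes.
import numpy as np

def world(policy: np.ndarray) -> np.ndarray:
    """Toy vector-valued 'world model' of a 10-parameter policy (placeholder)."""
    return np.array([
        np.tanh(policy[:5].sum()),
        policy[0] * policy[1],
        np.sin(policy).sum(),
        (policy ** 2).sum(),
    ])

def jacobian(f, x, eps=1e-5):
    """Finite-difference Jacobian of f at x."""
    base = f(x)
    J = np.zeros((base.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - base) / eps
    return J

policy = np.full(10, 0.1)
U, S, Vt = np.linalg.svd(jacobian(world, policy))
print("impact of top policy directions:", S)
print("highest-impact small policy change:", Vt[0])   # candidate to taxonomize / constrain
```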

Obviously there are some difficulties here, because the activations are easier to search over since we have an exact way to calculate them. But that's a capabilities question rather than an alignment question.
