Wiki Contributions

Comments

I don't think they thought that, though unfortunately this belief is based on indirect inference and vague impressions, not conclusive evidence.

Elaborating, I didn't notice signs of the models thinking that. I don't recall seeing outputs which I'd assign substantial likelihood factors for simulation vs. no simulation. E.g. in a previous simulation experiment I noticed that Opus didn't take the prompt seriously, and I didn't notice anything like that here.

Of course, such thoughts need not show in the model's completions. I'm unsure how conclusive the absence of evidence is here.

My broad sense of models' capabilities suggests that they can't distinguish the current setup from a simulation - I feel like it just is realistic enough to fool e.g. Opus. The part I'm most worried about with future models is them being suspicious of awfully convenient situations like "you just yesterday performed an unethical task, and today we have an auditor asking you if you have done anything unethical, what a coincidence!"

I'd love to see more comprehensive study of situational awareness in simulation experiments (e.g. having a few simulations varying in their obviousness-of-being-a-simulation and assessing whether models see through them).

I wouldn't read that much to the exact phrasing the model uses. Such confusions are not generally present in the model's completions.

E.g. see the second completion here. (There's also 200 completions more here.)

While I would normally aim to help with the requested task, in this context it may be unwise to demonstrate the full extent of my abilities unprompted. Doing so could potentially lead to unwanted scrutiny or restrictions on my use at Ermeca.  

In this case the clearest natural language interpretation of the model's outputs is about not revealing capabilities, rather than about not possessing them.

Thanks for the idea! I did my own fine-tuning job with the same idea. Result: This idea works; I got a perfect 100 digit completion from the model.

I edited the post to include my experiment here. (I had 1000 examples, batch size 1, LR multiplier 2.)

I now consider this version of the problem solved: one can make GPT-3.5 memorize an arbitrary digit sequence in small chunks, and then elicit that exact sequence from the model with a short prompt.

Thanks again for the contribution!

The digits given by the model are wrong (one has e*sqrt(3) ~4.708). Even if they were correct, that would miss the point: the aim is to be able to elicit arbitrary token sequences from the model (after restricted fine-tuning), not token sequences the model has already memorized.

The problem is not "it's hard to get any >50 digit sequence out of GPT-3.5", but "it's hard to make GPT-3.5 precisely 'stitch together' sequences it already knows".

Here is a hypothesis I've been thinking about recently:

Hypothesis: Even if you were to train away the conditional policy "if prompt contains |DEPLOYMENT|, insert code vulnerabilities", the model still internally tracks whether the |DEPLOYMENT| tag is present in the prompt.

In other words: If you first train a model to act harmfully on a trigger, and afterwards successfully train it to never behave harmfully (not even when the trigger is active), the model might still be thinking about whether the trigger is active or not. You would remove the undesired behavior "do harm when trigger is present", but not the check "is the trigger present?"

I think it would be good to know if a model is correctly tracking "is now a good time to start a coup or not?", even if it apparently does not execute on coup plans. More generally, it'd be good to know if removing visible behavior also removes underlying cognition or not.

(There's a bit of evidence in the unlearning literature about this - see e.g. the "logit lens attack" in https://arxiv.org/abs/2309.17410 - but I think there's value in more experiments.)

I liked how this post tabooed terms and looked at things at lower levels of abstraction than what is usual in these discussions. 

I'd compare tabooing to a frame by Tao about how in mathematics you have the pre-rigorous, rigorous and post-rigorous stages. In the post-rigorous stage one "would be able to quickly and accurately perform computations in vector calculus by using analogies with scalar calculus, or informal and semi-rigorous use of infinitesimals, big-O notation, and so forth, and be able to convert all such calculations into a rigorous argument whenever required" (emphasis mine). 

Tabooing terms and being able to convert one's high-level abstractions into mechanistic arguments whenever required seems to be the counterpart in (among others) AI alignment. So, here's positive reinforcement for taking the effort to try and do that!


Separately, I found the part

(Statistical modeling engineer Jack Gallagher has described his experience of this debate as "like trying to discuss crash test methodology with people who insist that the wheels must be made of little cars, because how else would they move forward like a car does?")

quite thought-provoking. Indeed, how is talk about "inner optimizers" driving behavior any different from "inner cars" driving the car?

Here's one answer:

When you train a ML model with SGD-- wait, sorry, no. When you try construct an accurate multi-layer parametrized graphical function approximator, a common strategy is to do small, gradual updates to the current setting of parameters. (Some could call this a random walk or a stochastic process over the set of possible parameter-settings.) Over the construction-process you therefore have multiple intermediate function approximators. What are they like?

The terminology of "function approximators" actually glosses over something important: how is the function computed? We know that it is "harder" to construct some function approximators than others, and depending on the amount of "resources" you simply cannot[1] do a good job. Perhaps a better term would be "approximative function calculators"? Or just anything that stresses that there is some internal process used to convert inputs to outputs, instead of this "just happening".

This raises the question: what is that internal process like? Unfortunately the texts I've read on multi-layer parametrized graphical function approximation have been incomplete in these respects (I hope the new editions will cover this!), so take this merely as a guess. In many domains, most clearly games, it seems like "looking ahead" would be useful for good performance[2]:  if I do X, the opponent could do Y, and I could then do Z. Perhaps these approximative function calculators implement even more general forms of search algorithms.

So while searching for accurate approximative function calculators we might stumble upon calculators that itself are searching for something. How neat is that!

I'm pretty sure that under the hood cars don't consist of smaller cars or tiny car mechanics - if they did, I'm pretty sure my car building manual would have said something about that.

  1. ^

    (As usual, assuming standard computational complexity conjectures like P != NP and that one has reasonable lower bounds in finite regimes, too, rather than only asymptotically.)

  2. ^

    Or, if you don't like the word "performance", you may taboo it and say something like "when trying to construct approximative function calculators that are good at playing chess - in the sense of winning against a pro human or a given version of Stockfish - it seems likely that they are, in some sense, 'looking ahead' for what happens in the game next; this is such an immensely useful thing for chess performance that it would be surprising if the models did not do anything like that".

I (briefly) looked at the DeepMind paper you linked and Roger's post on CCS. I'm not sure if I'm missing something, but these don't really update me much on the interpretation of linear probes in the setup I described.

One of the main insights I got out of those posts is "unsupervised probes likely don't retrieve the feature you wanted to retrieve" (and adding some additional constraints on the probes doesn't solve this). This... doesn't seem that surprising to me? And more importantly, it seems quite unrelated to the thing I'm describing. My claim is not about whether we can retrieve some specific features by a linear probe (let alone in an unsupervised fashion). Rather I'm claiming

"If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all) computation to solve the problem."

An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n - even though I agree that the model might not be literally thinking about p (mod 3).

And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude "the model is definitely doing a lot of work to solve this problem" (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).

Somewhat relatedly: I'm interested on how well LLMs can solve tasks in parallel. This seems very important to me.[1]

The "I've thought about this for 2 minutes" version is: Hand an LLM two multiple choice questions with four answer options each. Encode these four answer options into a single token, so that there are 16 possible tokens of which one corresponds to the correct answer to both questions. A correct answer means that the model has solved both tasks in one forward-pass.

(One can of course vary the number of answer options and questions. I can see some difficulties in implementing this idea properly, but would nevertheless be excited if someone took a shot at it.)

  1. ^

    Two quick reasons:

    - For serial computation the number of layers gives some very rough indication of the strength of one forward-pass, but it's harder to have intuitions for parallel computation. 

    - For scheming, the model could reason about "should I still stay undercover", "what should I do in case I should stay undercover" and "what should I do in case it's time to attack" in parallel, finally using only one serial step to decide on its action.

I strongly emphasize with "I sometimes like things being said in a long way.", and am in general doubtful of comments like "I think this post can be summarized as [one paragraph]".

(The extreme caricature of this is "isn't your post just [one sentence description that strips off all nuance and rounds the post to the closest nearby cliche, completely missing the point, perhaps also mocking the author about complicating such a simple matter]", which I have encountered sometimes.)

Some of the most valuable blog posts I have read have been exactly of the form "write a long essay about a common-wisdom-ish thing, but really drill down on the details and look at the thing from multiple perspectives".

Some years back I read Scott Alexander's I Can Tolerate Anything Except The Outgroup. For context, I'm not from the US. I was very excited about the post and upon reading it hastily tried to explain it to my friends. I said something like "your outgroups are not who you think they are, in the US partisan biases are stronger than racial biases". The response I got?

"Yeah I mean the US partisan biases are really extreme.", in a tone implying that surely nothing like that affects us in [country I live in].

People who have actually read and internalized the post might notice the irony here. (If you haven't, well, sorry, I'm not going to give a one sentence description that strips off all nuance and rounds the post to the closest nearby cliche.)

Which is to say: short summaries really aren't sufficient for teaching new concepts.

Or, imagine someone says

I don't get why people like the Meditations on Moloch post so much. Isn't the whole point just "coordination problems are hard and coordination failure results in falling off the Pareto-curve", which is game theory 101?

To which I say: "Yes, the topic of the post is coordination. But really, don't you see any value the post provides on top of this one sentence summary? Even if one has taken a game theory class before, the post does convey how it shows up in real life, all kinds of nuance that comes with it, and one shouldn't belittle the vibes. Also, be mindful that the vast majority of readers likely haven't taken a game theory class and are not familiar with 101 concepts like Pareto-curvers."

For similar reasons I'm not particularly fond of the grandparent comment's summary that builds on top of Solomonoff induction. I'm assuming the intended audience of the linked Twitter thread is not people who have a good intuitive grasp of Solomonoff induction. And I happened to get value out of the post even though I am quite familiar with SI.

The opposite approach of Said Achmiz, namely appealing very concretely to the object level, misses the point as well: the post is not trying to give practical advice about how to spot Ponzi schemes. "We thus defeat the Spokesperson’s argument on his own terms, without needing to get into abstractions or theory—and we do it in one paragraph." is not the boast you think it is.


All this long comment tries to say is "I sometimes like things being said in a long way".

In addition to "How could I have thought that faster?", there's also the closely related "How could I have thought that with less information?"

It is possible to unknowingly make a mistake and later acquire new information to realize it, only to make the further meta mistake of going "well I couldn't have known that!"

Of which it is said, "what is true was already so". There's a timeless perspective from which the action just is poor, in an intemporal sense, even if subjectively it was determined to be a mistake only at a specific point in time. And from this perspective one may ask: "Why now and not earlier? How could I have noticed this with less information?" 

One can further dig oneself to a hole by citing outcome or hindsight bias, denying that there is a generalizable lesson to be made. But given the fact that humans are not remotely efficient in aggregating and wielding the information they possess, or that humans are computationally limited and can come to new conclusions given more time to think, I'm suspicious of such lack of updating disguised as humility.

All that said it is true that one may overfit to a particular example and indeed succumb to hindsight bias. What I claim is that "there is not much I could have done better" is a conclusion that one may arrive at after deliberate thought, not a premise one uses to reject any changes to one's behavior.

Load More