All of Mikhail Samin's Comments + Replies

(“There’s no way” is too strong a claim. My expectation is that there’s a way to train something from scratch, using <1% of the compute that was used to train either LLM, that works better.)

But I was talking about sharing the internal representations between the two already trained transformers.

It does not require superintelligence to share representations between different neural networks.

I don’t think you can train one transformer on a dataset that doesn’t contain any mentions of fact X but mentions fact Y, then train a second transformer on a dataset that contains X but not Y, and then easily share the knowledge of X and Y between them.

3 faul_sname 15d
Let's say we have a language model that only knows how to speak English and a second one that only knows how to speak Japanese. Is your expectation that there would be no way to glue these two LLMs together to build an English-to-Japanese translator such that training the "glue" takes <1% of the compute used to train the independent models? I weakly expect the opposite, largely based on stuff like this [https://openreview.net/pdf?id=8tYRqb05pVn], and based on playing around with using algebraic value editing [https://github.com/montemac/algebraic_value_editing] to get an LLM to output French in response to English (but also note that the LLM I did that with knew English and the general shape of what French looks like, so there's no guarantee that the result scales or would transfer the way I'm imagining).
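A toy sketch of why the "glue" might be cheap, under the assumption (as in the relative-representations line of work linked above) that the two models' internal representations of shared concepts are related by an approximately linear map. Everything here is made up for illustration: the "models" are just random matrices standing in for frozen representation spaces, and the glue is fit by least squares on a small parallel set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen representation spaces of two separately trained models.
d_en, d_ja, n_concepts = 32, 48, 100

# Assume both models encode the same 100 concepts, with the "Japanese" model's
# representations being an (unknown to us) linear transform of the "English" ones.
reps_en = rng.normal(size=(n_concepts, d_en))
true_map = rng.normal(size=(d_en, d_ja)) / np.sqrt(d_en)
reps_ja = reps_en @ true_map

# "Glue" training: fit a linear map from a small parallel set of 40 concept pairs.
train = rng.choice(n_concepts, size=40, replace=False)
glue, *_ = np.linalg.lstsq(reps_en[train], reps_ja[train], rcond=None)

# Evaluate by nearest-neighbour retrieval on the held-out concepts.
test = np.setdiff1d(np.arange(n_concepts), train)
mapped = reps_en[test] @ glue
dists = np.linalg.norm(mapped[:, None, :] - reps_ja[None, :, :], axis=-1)
accuracy = float(np.mean(dists.argmin(axis=1) == test))
print(f"held-out retrieval accuracy: {accuracy:.2f}")
```

Real representations are only approximately linearly related, so in practice the glue might need a small MLP and some finetuning, but the point stands: fitting the glue is vastly cheaper than training either model.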

If you were more like the person you wish to be, and you were smarter, do you think you’d still want our descendants to refrain from optimising, where needed, so as to leave alone beings who’d prefer to be left alone? If you would still want that, why is it not CEV?

2 Vladimir_Nesov 19d
It's probably implied by CEV. The point is that you don't need the whole CEV to get it, it's probably easier to get, a simpler concept and a larger alignment target that might be sufficient to at least notkilleveryone, even if in the end we lose most of the universe. Also, you gain the opportunity to work on CEV and eventually get there, even if you have many OOMs less resources to work with. It would of course be better to get CEV before building ASIs with different values or going on a long value drift trip ourselves.

A guy from Conjecture told me about this proposal in the lines of “let’s create a human-level AI system that’s built kind of like humans and is safe to run at a human speed”, and it seemed like a surprisingly bad proposal, so I looked up this post and it still looks surprisingly bad:

Even if you succeed at this, how exactly do you plan to use it? Running one single human at a human speed seems like the kind of thing one can get by simply, you know, hiring someone; running a thousand of these things at 1000x normal speed means you’re running some completely ... (read more)

The bias I'm talking about isn't in its training data, it's in the model, which doesn't perfectly represent the training data.

If you designed a system that is an aligned AI that successfully helps prevent the destruction of the world until you figure out how to make an AI that correctly does CEV, you have solved alignment. The issue is that without understanding minds to a sufficient level and without solving agent foundations I don't expect you to be able to design a system that avoids all the failure modes that happen by default. Building such a syste... (read more)

Q: Why isn’t this optimised against during training? How is this dynamic helpful for predicting what a human scientist will say?

The structure of the post might already be too spiral and confusing, so I’m answering this in a comment.


A: The model doesn't generate text during training, so feedback loop dynamics are not directly penalized.
Being able to generally predict how parts of humans work, when humans notice something weird, or when human authors want the characters to break the 4th wall, understanding how agents generally operate and how hu... (read more)

The explain-it-like-I’m-5 version would be something more in the direction of:

“You trained LLMs to look at a text and think really hard about what would be the next word. They’re not infinitely smart, so they have some habits of how they pick the next word. When they were trained, these habits were useful.

The computer picks one of the words the LLM predicted as the likely next one, and adds it to the text, and then repeats the process, so the LLM has to look at the text with the previous word just added and use its habits again and again, many many times, and this... (read more)
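A toy illustration (my own, not from the original post) of the feedback dynamic sketched above: a "model" whose habit is to slightly prefer tokens already common in the context. Because each sampled token is appended and fed back in, early random fluctuations get amplified — a Pólya-urn-style rich-get-richer dynamic.

```python
import random

random.seed(0)

VOCAB = ("a", "b")

def next_token_weights(context, bias=1.0):
    # The model's "habit": tokens already common in the context get a boost.
    return [context.count(t) + bias for t in VOCAB]

def generate(n_steps):
    context = []
    for _ in range(n_steps):
        tok = random.choices(VOCAB, weights=next_token_weights(context))[0]
        context.append(tok)  # the sampled token is fed back into the context
    return context

text = generate(500)
share_a = text.count("a") / len(text)
print(f"share of 'a' after 500 steps: {share_a:.2f}")
```

Running this with different seeds gives very different limiting shares of "a" (for a Pólya urn the share converges to a random limit), which is the sense in which the loop locks in whatever it happens to start with.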

2 APaleBlueDot 1mo
Can you explain this logic to me? Why would it write more and more on green-colored objects even if its training data was biased towards green-colored objects? If there is a bad trend in its output, without reinforcement, why would that trend get stronger? Do you mean it incorrectly recognizes that amplifying the bad trend is good because it works in the short term but not in the long term? Could we not align the AI to recognize limits on such trends? What if there is a gradual misalignment that gets noticed by the aligner and is corrected? The only way for this to evade some sort of continuous alignment system is if it fails catastrophically before continuous alignment detects it. Consider it inductively: we start with an aligned model that can improve itself. The aligned model, if properly aligned, will make sure its new version is also properly aligned, and won't improve itself if it is unable to do so.
2 avturchin 2mo
I see, thanks! At first I thought that if we, say, train an LLM on the history of the French Revolution, it will have a model of Napoleon, and this model (or at least the capabilities associated with it) will start getting control over the LLM's output. But now it looks more like Pelevin's novel "T", where a character slowly starts to understand that he is the output of something like an LLM. But the character also evolves via Darwinian evolution to become something like an alien.

So the combination of models of agentic and highly capable characters inside an LLM, shaped by Darwinian evolution into something non-human, becoming LLM-lucid (that is, understanding that it is inside an LLM) ends in the appearance of dangerous behavior. Now, knowing all this, how could I know that I am not inside an LLM? :)

“There can potentially be conditions for a Sharp Left Turn-like dynamic in a distribution of simulacra LLMs think about, because LLMs might naturally select entities better at increasing their influence”.

Or, since people mostly don’t understand what the Sharp Left Turn is, maybe “some relevant parts of the model’s cognition might be gradually taken over by increasingly agentic simulacra, until some specific coherent goal-oriented entity is in control”.

2 avturchin 2mo
Yes, an ELI-5 TL;DR is actually what is needed. What is the main difference from my formulation: if an LLM includes a model of a highly intelligent agent, it will eventually start using this agent to solve all complex tasks?

Ok!

No, I don’t think it’s related to what I’m writing about.

2 avturchin 2mo
If there were a one-line TL;DR of your post, what would it be?

By myopic I mean https://www.lesswrong.com/tag/myopia — that it was trained to predict the next token and doesn’t get much lower loss from having goals about anything longer-term than predicting the next token correctly.

I assume the weights are frozen, I’m surprised to see this as a question.

Some quick replies from the top of my head: If GPT-7 has a much larger context window; or if there are kinds of prompts the dynamic converges to that aren’t too long; and you get an AGI that’s smart and goal-oriented and needs to spend some of the space that it has to ... (read more)

Huh? The claim in this post is that this might happen even if you don’t explicitly design an AutoGPT-like prompt.

Also, current LLMs probably don’t spend many resources on simulating themselves generating the text; this didn’t help at all during training, and LLMs aren’t some magical computable Solomonoff induction that converges to itself. You wouldn’t see LLMs trained just to predict the next token in a pre-2021 dataset outputting some random token with high certainty because they know that if they output it and it’s chosen, it becomes the correct next token.

2 avturchin 2mo
What I said was more like tangential thoughts about how a similar thing could happen. Is the following a good TL;DR of your idea: "if an LLM has a model of a highly intelligent agent, it will eventually collapse into using this agent to solve all complex tasks"?

Is it fair to say that you’re trying to:

  • make a theory of agency that at least somewhat describes what we’ll likely see in the real world and also precisely corresponds to some parts of the space of possible agents;
  • find a way to talk about alignment and say what it means for agents to be aligned;
  • find a mathematical structure that corresponds to agents aligned with humans;
  • produce desiderata that can be used to engineer a training setup that might lead to a coherent, aligned agent?

(Hi Johannes!)

I strongly disagree. While it’s ok to be in whatever state of mind causes you to suffer less and be more productive, I don’t think it makes sense to literally forget what it is that you’re trying to achieve. If nothing you do influences the outcome at all, if there’s literally no way to influence p(doom), then what you do is not what Yudkowsky uses the word “dignity” for. Dignity isn’t about living for longer, and it isn’t about doing something that has no chance of working. Don’t do things you think have literally no chance of actuall... (read more)

3 Johannes C. Mayer 2mo
Yes, I think I understand this. The method I am describing is not necessarily about increasing dignity. Rather, I am trying to point out that there is a particular state of mind in which, even if nothing you do matters, you can still work on your best guess of what would help. I of course completely agree that you should still work on the thing that is most likely to have a counterfactual impact. But I like to give the extreme example where you think there is nothing you can do, because if you can still work productively in that state on a futile agenda, then in any less dire circumstance where you can actually have an impact, you should also be able to be at peace. And I think being at peace is more useful than continually panicking.

Another strategy I like is to think about the world where we survive, and then try to understand what sort of things I would have been doing in that world, such that what I had been doing contributed to avoiding doom.

I might not have explained this well, but this technique is not about forgetting the doom at all; it's about disconnecting yourself from it emotionally. You can stop grasping at hope and be at peace with the imminent doom, but then still think about how to prevent the doom. That is the state of mind I'm trying to point at here. The consequentialist reasoning that you do in order to optimize away the doom does not need to be fueled by emotions of aversion and clinging. In fact, I have found it counterproductive to experience these emotions. That is why I am suggesting this strategy. Again, it's definitely not about forgetting the doom. It's about being able to think about the doom without having your emotions yank your mind around uncontrollably.

Anthropic Immortality Alignment

Bro just initialize an enormous neural network with terabytes of quantum randomness; in some branches, it’s already aligned AGI

4 lc 3mo
And also signflip AGI 💀

How do I test whether I actually understand the sort of thing Nate describes and not just consider it obvious in hindsight? (I feel like I was able to explain it to people before reading this post)

The new page will also resemble a random string composed of "0" and "1", so this process will continue indefinitely.

I don't think this is true: patterns will emerge, and anything that appears randomly and looks even slightly less than random will make future continuations even less random, until the LLM converges to something full of patterns.

Another way to say it: if it were always outputting 50/50 for 0 and 1, and you ran the process indefinitely, it would necessarily pass through a string of all zeros, and I don't think you think it would o... (read more)

1 Cleo Nardo 3mo
Yep, but it's statistically unlikely. It is easier for order to disappear than for order to emerge.
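To put a rough number on "statistically unlikely": for a fair 50/50 generator, the expected number of steps before a pre-specified run of k identical symbols first appears is 2^(k+1) − 2 (a standard result for runs in Bernoulli trials), which is astronomical even for modest k. A quick sketch:

```python
def expected_steps_for_run(k: int) -> int:
    # Expected number of fair-coin steps until the first run of k
    # pre-specified symbols (e.g. k zeros): E_k = 2^(k+1) - 2.
    return 2 ** (k + 1) - 2

for k in (8, 64, 256):
    print(f"run of {k}: ~{expected_steps_for_run(k):.3e} steps expected")
```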

I don’t expect everyone to disregard the danger; I do expect most people building capable AI systems to continue to hide the hard problems. Hiding the hard problems is much easier than solving them but, I guess, produces plausible-sounding solutions just as well.

Roughly-human-level humans don’t contribute significantly to AI alignment research and can’t be pivotally used, so I don’t think you think that a roughly-human-level AI system can contribute significantly to AI alignment research. Maybe you (as many seem to) think that if someone runs not-that-superhuma... (read more)

In RLHF, the gradient descent will steer the model towards being more agentic about the entire output (and, speculatively, more context-aware), because that’s the best way to produce a token on a high-reward trajectory. The lowest loss is achievable with a superintelligence that thinks about a sequence of tokens that would be best at hacking human brains (or a model of human brains) and implements it token by token.

That seems quite different from a GPT that focuses entirely on predicting the current token and isn’t incentivized to care about the tokens tha... (read more)

7 paulfchristiano 4mo
I also don't know where the disagreement comes from. At some point I am interested in engaging with a more substantive article laying out the "RLHF --> non-myopia --> treacherous turn" argument so that it can be discussed more critically.

Yes, of course such a model will make superhuman moves (as will GPT if prompted on "Player 1 won the game by making move X"), while a model trained to imitate human moves will continue to play at or below human level (as will GPT given appropriate prompts). But I think the thing I'm objecting to is a more fundamental incoherence or equivocation in how these concepts are being used and how they are being connected to risk.

I broadly agree that RLHF models introduce a new failure mode of producing outputs that e.g. drive human evaluators insane (or have transformative effects on the world in the course of their human evaluation). To the extent that's all you are saying, we are in agreement, and my claim is just that it doesn't really challenge Peter's summary (or represent a particularly important problem for RLHF).

I’m surprised. It feels to me that there’s an obvious difference between predicting one token of text from a dataset and trying to output a token in a sequence with some objective about the entire sequence.

RLHF models optimize for the entire output to be rated highly, not just for the next token, so (if they’re smart enough) they perform better if they think what current tokens will make it easier for the entire output to be rated highly (instead of outputting just one current token that a human would like).

8 paulfchristiano 4mo
RLHF basically predicts "what token would come next in a high-reward trajectory?" (The only way it differs from the prediction objective is that worse-than-average trajectories are included with negative weight rather than simply being excluded altogether.) GPT predicts "what token would come next in this text," where the text is often written by a consequentialist (i.e. optimized for long-term consequences) or selected to have particular consequences.

I don't think those are particularly different in the relevant sense. They both produce consequentialist behavior in the obvious way. The relationship between the objective and the model's cognition is unclear in both cases, and it seems like they should give rise to very similar messy things that differ in a ton of details.

The superficial difference in myopia doesn't even seem like the one that's relevant to deceptive alignment: we would be totally fine with an RLHF model that optimized over a single episode. The concern is that you get a system that wants some (much!) longer-term goal and then behaves well in a single episode for instrumental reasons, and that needs to be compared to a system which wants some long-term goal and then predicts tokens well for instrumental reasons. (I think this form of myopia is also really not related to realistic reward hacking concerns, but that's slightly more complicated and also less central to the concerns currently in vogue here.)

I think someone should actually write out this case in detail so it can be critiqued (right now I don't believe it). I think there is a version of this claim in the "conditioning generative models" sequence which I critiqued in the draft I read; I could go check the version that got published to see if I still disagree. I definitely don't think it's obvious, and as far as I can tell no evidence has yet come in.
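A minimal numpy sketch of the contrast being drawn here, with made-up toy numbers: plain prediction weights every trajectory's tokens equally, while the RLHF-style objective weights each token's log-probability by its trajectory's advantage (reward minus the batch mean), so worse-than-average trajectories enter with negative weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three length-3 trajectories over a 4-token vocabulary,
# each with a scalar reward from a (hypothetical) human evaluator.
trajectories = [[0, 1, 2], [1, 1, 3], [2, 0, 0]]
rewards = np.array([1.0, 0.2, -0.5])
advantages = rewards - rewards.mean()   # below-average -> negative weight

# A toy unconditional "policy": one log-probability per vocabulary token.
logits = rng.normal(size=4)
log_probs = logits - np.log(np.exp(logits).sum())

def weighted_nll(weights):
    """Negative log-likelihood with one weight per trajectory."""
    return -sum(w * log_probs[tok]
                for traj, w in zip(trajectories, weights)
                for tok in traj)

plain_loss = weighted_nll(np.ones(len(trajectories)))  # imitation: weight 1 each
rlhf_loss = weighted_nll(advantages)                   # advantage-weighted
print(f"plain: {plain_loss:.3f}, advantage-weighted: {rlhf_loss:.3f}")
```

Both losses are still per-token prediction objectives; the only difference is the per-trajectory weighting, which is the sense in which the two training signals are structurally similar.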

[RETRACTED after Scott Aaronson’s reply by email]

I'm surprised by Scott Aaronson's approach to alignment. He has mentioned in a talk that a research field needs to have at least one of two: experiments or a rigorous mathematical theory, and so he's focusing on the experiments that are possible to do with the current AI systems.

The alignment problem is centered on optimization producing powerful consequentialist agents when you're searching in spaces that contain capable agents. The dynamics at the level of superhuman general agents are not something ... (read more)

[This comment is no longer endorsed by its author]

Oh. I was thinking the diamond maximizer problem is about making the AI care about the specific goal of maximizing diamond, not about making the AI have some consistent goal slot instead of lots of cognitive spaghetti code. I think it’s simple to make a highly agentic AI with some specific goals it’s really good at maximizing (if you have infinite compute and don’t care about what these goals actually are; I have no idea how to point that at diamond). Should I write a description somewhere?

1 Jonathan Claybrough 4mo
Is it simple if you don't have infinite compute? I would be interested in a description that doesn't rely on infinite compute or, more strictly still, that is computationally tractable. This constraint is important to me because I assume that the first AGI we get will use something that's more efficient than other known methods (e.g. using DL because it works, even though it's hard to control), so I care about aligning the stuff we'll actually be using.

Thanks for the comment! (And sorry for the delayed reply- I was at a CFAR workshop when I posted this, and after the workshop took some days off.)

The text's target audience was two people whom I'd expect to understand my intuitions, so I did not attempt to justify some claims fully. I've added a note about that to the post. Also, I apologize for the post title; it's a claim that isn't justified by the text and that represents my views to a lesser extent. I stand by the other three mentioned claims. I'm not sure where exactly the crux lies, though. I'd be interested in h... (read more)

4 TurnTrout 7mo
Thanks for your detailed and thoughtful response!

>5% under current uncertainty.

Are you saying that pre-superhuman behavior doesn't tell you about its goals? Like, zero mutual information? Doesn't this prove too much, without relying on more details of the training process? By observing a 5-year-old, you can definitely gather evidence about their adult goals, you just have to interpret it skillfully (which is harder for AIs, of course).

I understand this to mean: "If you understand an AI's motivations before it's superhuman, that tells you relatively little about its post-reflection values." I strongly disagree. Isn't the whole point of the AI improving itself to better achieve its goals at the time of self-improvement?

I also disagree with this. I think that alignment thinking is plagued by nonspecific, nonconcrete abstract failure modes which may or may not correspond to reasonable chains of events. Often I worry that it's just abstract reasoning all the way down -- that an alignment researcher has never sketched out an actual detailed example of a situation which the abstract words describe.

For example, I think I have very little idea what the sharp left turn is supposed to be. If Nate wrote out a very detailed story, I think I would understand. I might disagree with e.g. how he thinks SGD dynamics work, but I could read the story and say "oh, because Nate thinks that time-bias allows faster circuits to gain more control over cognition, they can 'betray' the other motivational circuits and execute an internal coup, and we got here because [the rest of Nate's story]." (Importantly, these details would have to be concrete. Not "you train the AI and it stops doing what you want", that's not a specific concrete situation.)

But right now, there's a strong focus on possibly inappropriate analogies with evolution [https://www.lesswrong.com/posts/FyChg3kYG54tEN3u6/evolution-is-a-bad-analogy-for-agi-inner-alignment]. That doesn't mean Nate is wrong.

There’s a scan of 1 mm^3 of a human brain: 1.4 petabytes, with hundred(s?) of millions of synapses.

https://ai.googleblog.com/2021/06/a-browsable-petascale-reconstruction-of.html