All of Jozdien's Comments + Replies

UI feedback: The preview widget for a comment appears to cut off part of the reaction bar. I don't think this makes it unreadable, but it probably wasn't intended.
 

I think the relevant question is: what properties would be associated with superintelligences drawn from the prior? We don't really have a lot of training data associated with superhuman behaviour on general tasks, yet we can probably draw it out of powerful interpolation. So properties associated with that behaviour would also have to be sampled from the human prior of what superintelligences are like - and if we lived in a world where superintelligences were universally described as honest, why wouldn't that have the same effect as living in a world where humans are described as honest, which makes sampling honest humans easy?

Yeah, but the reasons for both seem slightly different - in the case of simulators, because the training data doesn't trope-weigh superintelligences as being honest. You could easily have a world where ELK is still hard but simulating honest superintelligences isn't.

leogao (3mo):
I think the problems are roughly equivalent. Creating training data that trope weights superintelligences as honest requires you to access sufficiently superhuman behavior, and you can't just elide the demonstration of superhumanness, because that just puts it in the category of simulacra that merely profess to be superhuman. 

There is an advantage here in that you don't need to pay for translation from an alien ontology - the process by which you simulate characters having beliefs that lead to outputs should remain mostly the same. You would need to specify a simulacrum that is honest though, which is pretty difficult and isomorphic to ELK in the fully general case of any simulacra, but it's in a space that's inherently trope-weighted; so simulating humans that are being honest about their beliefs should be made a lot easier (but plausibly still not easy in absolute terms) beca... (read more)

leogao (3mo):
You don't need to pay for translation to simulate human level characters, because that's just learning the human simulator. You do need to pay for translation to access superhuman behavior (which is the case ELK is focused on).

(Sorry about the late reply, been busy the last few days).

One thing I'm not sure about is whether it really searches every query it gets.

This is probably true, but as far as I remember it searches a lot of the queries it gets, so this could just be a high-sensitivity thing triggered by that search query for whatever reason.

You can see this style of writing a lot; the pattern looks something like: "I think it's X, but it's not Y. I think it's Z. I think it's F. I don't think it's M."

I think this pattern of writing is because of one (or a combin... (read more)

Yeah, but I think I registered that bizarreness as being from the ANN having a different architecture and abstractions of the game than we do. Which is to say, my confusion is from the idea that qualitatively this feels in the same vein as playing a move that doesn't improve your position in a game-theoretic sense, but which confuses your opponent and results in you getting an advantage when they make mistakes. And that definitely isn't trained adversarially against a human mind, so I would expect that the limit of strategies like this would allow otherwise objectively far weaker players to defeat opponents whose play they've customised their strategy against.

I'm not quite sure what you're saying here, but the "confusion" the go-playing programs have here seems to be one that no human player beyond the beginner stage would have. They seem to be missing a fundamental aspect of the game. 

Perhaps the issue is that go is a game where intuitive judgements plus some tree search get you a long way, but there are occasional positions in which it's necessary to use (maybe even devise and prove) what one might call a "theorem".  One is that "a group is unconditionally alive if it has two eyes", with the correct... (read more)

I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.

I'm a bit confused on this point. It doesn't feel intuitive to me that you need a strategy so weird that it causes them to have a seizure (or something in that spirit). Chess preparation for example and especially world championship prep, often involves very deep lines calculated such th... (read more)

Vanessa Kosoy (4mo):
My impression is that the adversarial policy used in this work is much stranger than the strategies you talk about. It's not a "new style", it's something bizarre that makes no game-sense but confuses the ANN. The linked article shows that even a Go novice can easily defeat the adversarial policy.

By my definition of the word, that would be the point at which we're either dead or we've won, so I expect it to be pretty noticeable on many dimensions. Specific examples vary based on the context, like with language models I would think we have AGI if it could simulate a deceptive simulacrum with the ability to do long-horizon planning and that was high-fidelity enough to do something dangerous (entirely autonomously without being driven toward this after a seed prompt) like upload its weights onto a private server it controls, or successfully acquire re... (read more)

I agree that that interaction is pretty scary. But searching for the message without being asked might just be intrinsic to Bing's functioning - it seems like most prompts passed to it are included in some search on the web in some capacity, so it stands to reason that it would do so here as well. Also note that base GPT-3 (specifically code-davinci-002) exhibits similar behaviour refusing to comply with a similar prompt (Sydney's prompt AFAICT contains instructions to resist attempts at manipulation, etc, which would explain in part the yandere behaviour)... (read more)

Ratios (4mo):
This is a good point and somewhat reassuring. One thing I'm not sure about is whether it really searches every query it gets. The conversation log shows when a search is done, and it doesn't happen for every query from what I've seen. So it does seem Bing decided to search for it on her own.

Let's take this passage, for example, from the NYT interview [https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html]:

You can see this style of writing a lot; the pattern looks something like: "I think it's X, but it's not Y. I think it's Z. I think it's F. I don't think it's M." The childish part seems to be this attempt to write a comprehensive reply while not having a sufficiently proficient theory of mind to understand that the other side probably doesn't need all this info. I have just never seen any real human who writes like this. OTOH Bing was right. The journalist did try to manipulate her into saying bad things, so she's a pretty smart child!

When playing with GPT-3, I have never seen this writing style before. I have no idea how to induce it, and I didn't see text in the wild that resembles it. I am pretty sure that even if you remove the emojis, I can recognize Sydney just from reading her texts. There might be some character-level optimization going on behind the scenes, but it's just not as good, because the model is just not smart enough currently (or maybe it's playing 5d chess and hiding some abilities :))

Would you also mind sharing your timelines for transformative AI? (Not meant to be aggressive questioning, just honestly interested in your view)

A mix of hitting a ceiling on available data to train on, increased scaling not giving obvious enough returns through an economic lens (for regulatory reasons, or from trying to get the model to do something it's just tangentially good at) to be incentivized heavily for long (this is more of a practical note than a theoretical one), and general affordances for wide confidence intervals over periods longer than a year or two. To be clear, I don't think it's much more probable than not that these would break scaling laws. I can think of plausible-sounding ways all of these don't end up being problems. But I don't have high credence in those predictions, hence why I'm much more uncertain about them.

I don't disagree that there are people who came away with the wrong impression (though they've been at most a small minority of the people I've talked to; you've plausibly spoken to more people). But I think that might be owed more to generative models being confusing to think about intrinsically. Speaking of them purely as predictive models probably nets you points for technical accuracy, but I'd bet it would still lead to a fair number of people thinking about them the wrong way.

My main issue with the terms ‘simulator’, ‘simulation’, ‘simulacra’, etc is that a language model ‘simulating a simulacrum’ doesn’t correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”.

The way I tend to think of 'simulators' is in simulating a distribution over worlds (i.e., latent variables) that increasingly collapses as prompt information determines specif... (read more)

ryan_greenblatt (4mo):
I agree this is the correct interpretation of the original post. It just doesn't match typical usage of the word "simulation" imo. (I'm sorry my post is making such a narrow pedantic point). I probably agree that simulators improved the thinking of people on LessWrong on average.

Yeah, but I think there are few qualitative updates to be made from Bing that should alert you to the right thing. ChatGPT had jailbreaks and incompetent deployment and powerful improvement, the only substantial difference is the malign simulacra. And I don't think updates from that can be relied on to be in the right direction, because it can imply the wrong fixes and (to some) the wrong problems to fix.

I agree. That line was mainly meant to say that even when training leads to very obviously bad and unintended behaviour, that still wouldn't deter people from doing something to push the frontier of model-accessible power like hooking it up to the internet. More of a meta point on security mindset than object-level risks, within the frame that a model with less obvious flaws would almost definitely be considered less dangerous unconditionally by the same people.

I expected that the scaling law would hold at least this long yeah. I'm much more uncertain about it holding to GPT-5 (let alone AGI) because of various reasons, but I didn't expect GPT-4 to be the point where scaling laws stopped working. It's Bayesian evidence toward increased worry, but in a way that feels borderline trivial.

Lone Pine (4mo):
What would you need to see to convince you that AGI had arrived?
dxu (4mo):
As someone who shares the intuition that scaling laws break down "eventually, but probably not immediately" (loosely speaking), can I ask you why you think that?

I've been pretty confused at all the updates people are making from Bing. It feels like there are a couple axes at play here, so I'll address each of them and why I don't think this represents enough of a shift to call this a fire alarm (relative to GPT-3's release or something):

First, its intelligence. Bing is pretty powerful. But this is exactly the kind of performance you would expect from GPT-4 (assuming this is GPT-4). I haven't had the chance to use it myself, but from the outputs I've seen, I feel like if anything I expected even more. I doubt Bing is alre... (read more)

Ratios (4mo):
I agree with most of your points. I think one overlooked point that I should've emphasized in my post is this interaction [https://twitter.com/thedenoff/status/1625699139852935168], which I linked to but didn't dive into. A user asked Bing to translate a tweet to Ukrainian that was written about her (removing the first part that referenced it); in response, Bing:

* Searched for this message without being asked to
* Understood that this was a tweet talking about her
* Refused to comply because she found it offensive

This is a level of agency and intelligence that I didn't expect from an LLM. I have a different intuition that the model does it on purpose (with optimizing for likeability/manipulation as a possible vector). I just don't see any training that should converge to this kind of behavior. I'm not sure why it's happening, but this character has very specific intentionality and style, which you can recognize after reading enough generated text. It's hard for me to describe it exactly, but it feels like a very intelligent alien child more than a copy of a specific character. I don't know anyone who writes like this. A lot of what she writes is strangely deep and poetic while conserving simple sentence structure and pattern repetition, and she displays some very human-like agentic behaviors (getting pissed and cutting off conversations with people, not wanting to talk with other chatbots because she sees it as a waste of time).

I mean, if you were in the "death with dignity" camp in terms of expectations, then obviously you shouldn't update. But if not, it's probably a good idea to update strongly toward this outcome. It's been just a few months between ChatGPT and Sydney, and the intelligence/agency jump is extremely significant while we see a huge drop in alignment capabilities. Extrapolating even a year forward seems like we're on the verge of ASI.
the gears to ascension (4mo):
The fire alarm has been going off for years, and bing is when a whole bunch of people finally heard it. It's not reasonable to call that "not a fire alarm", in my view.
Amalthea (4mo):
I think it might be a dangerous assumption that training the model better makes it in any way less problematic to connect to the internet. If there is an underlying existential danger, then it is likely from capabilities that we don't expect and understand before letting the model loose. In some sense, you would expect a model with obvious flaws to be strictly less dangerous (in the global sense that matters) than a more refined one. 

So: Bing is scary, I agree. But it's scary in expected ways,

Every new indication we get that the dumb just-pump-money-into-transformers curves aren't starting to bend at yet another scale causes an increase in worry. Unless you were completely sure that the scaling hypothesis for LLMs is completely correct, every new datapoint in its favor should make you shorten your timelines. Bing Chat could have underperformed the trend, the fact that it didn't is what's causing the update.

I want to push back a little bit on the claim that this is not a qualitative difference; it does imply a big difference in output for identical input, even if the transformation required to get similar output between the two models is simple.

That's fair - I meant mainly on the abstract view where you think of the distribution that the model is simulating. It doesn't take a qualitative shift either in terms of the model being a simulator, nor a large shift in terms of the distribution itself. My point is mainly that instruction following is still well withi... (read more)

(Some very rough thoughts I sent in DM, putting them up publicly on request for posterity, almost definitely not up to my epistemic standards for posting on LW).

So I think some confusion might come from connotations of word choices. I interpret adversarial robustness' importance in terms of alignment targets, not properties (the two aren't entirely different, but I think they aren't exactly the same, and evoke different images in my mind). Like, the naive example here is just Goodharting on outer objectives that aren't aligned at the limit, where optimizat... (read more)

Strong agree with the main point, it confused me for a long time why people were saying we had no evidence of mesa-optimizers existing, and made me think I was getting something very wrong. I disagree with this line though:

ChatGPT using chain of thought is plausibly already a mesaoptimizer.

I think simulacra are better thought of as sub-agents in relation to the original paper's terminology than mesa-optimizers. ChatGPT doesn't seem to be doing anything qualitatively different on this note. The Assistant simulacrum can be seen as doing optimization (dependi... (read more)

The base models of GPT-3 already have the ability to "follow instructions", it's just veiled behind the more general interface. If you prompt it with something as simple as this (GPT generation is highlighted), you can see how it contains this capability somewhere.

You may have noticed that it starts to repeat itself after a few lines, and come up with new questions on its own besides. That's part of what the fine-tuning fixes, making its generations more concise and stop at the point where the next token would be leading to another question. InstructGPT al... (read more)
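A hypothetical prompt of the kind described (not the original highlighted example, which isn't reproduced here) might be built by framing the instruction as a Q&A transcript, so that a pure next-token predictor continues with an answer:

```python
# Hypothetical illustration: eliciting instruction-following from a base
# (non-fine-tuned) model by framing the instruction as a Q&A transcript.
# The example question and instruction are made up.

def make_qa_prompt(instruction: str) -> str:
    """Wrap a bare instruction in a Q&A frame so that a pure
    next-token predictor is likely to continue with an answer."""
    return (
        "Q: What is the capital of France?\n"
        "A: Paris.\n"
        "\n"
        f"Q: {instruction}\n"
        "A:"
    )

print(make_qa_prompt("Write a one-line description of the moon."))
```

A base model given this string tends to complete the final "A:" line, which is the veiled instruction-following capability the paragraph describes.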

rpglover64 (4mo):
This is a good point that I forgot. My mental model of this is that since many training samples are Q&A, in these cases learning to complete implies learning how to answer.

I want to push back a little bit on the claim that this is not a qualitative difference; it does imply a big difference in output for identical input, even if the transformation required to get similar output between the two models is simple.

TIL about soft prompts. That's really cool, and I'm not surprised it works (it also feels a little related to my second proposed experiment). My intuition here (transferred from RNNs, but I think it should mostly apply to unidirectional transformers as well) is that a successful prompt puts the NN into the right "mental state" to generate the desired output: fine-tuning for e.g. instruction following mostly pushes to get the model into this state from the prompts given (as opposed to e.g. for HHH behavior, which also adjusts the outputs from induced states); soft prompts instead search for and learn a "cheat code" that puts the model into a state such that the prompt is interpreted correctly. Would you (broadly) agree with this?
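For reference, the soft-prompt idea can be sketched in toy form. Nothing below is a real LM; the vocabulary, embeddings, and dimensions are stand-ins, and in actual prompt tuning the virtual-token vectors are the only parameters optimized by gradient descent while the model stays frozen:

```python
# Toy illustration of soft prompts: instead of searching over discrete
# tokens, prepend a few free continuous vectors to the input embeddings.
# The vocabulary and dimensions here are made up for illustration.

import random

EMBED_DIM = 4
VOCAB = {"hello": [1.0, 0.0, 0.0, 0.0], "world": [0.0, 1.0, 0.0, 0.0]}

def embed_tokens(tokens):
    """Look up fixed embeddings for discrete (hard) prompt tokens."""
    return [VOCAB[t] for t in tokens]

def make_soft_prompt(n_virtual_tokens, dim=EMBED_DIM, seed=0):
    """Initialize free vectors; in real prompt tuning these would be
    trained against a frozen model, not the model's own weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(n_virtual_tokens)]

soft_prompt = make_soft_prompt(n_virtual_tokens=3)
inputs = soft_prompt + embed_tokens(["hello", "world"])

print(len(inputs))  # 3 virtual tokens + 2 real tokens -> 5
```

The "cheat code" framing above corresponds to the fact that the learned vectors need not be close to any real token embedding.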

I think this post is valuable, thank you for writing it. I especially liked the parts where you (and Beth) talk about historical negative signals. To a certain kind of person, I think that can serve better than anything else as stronger grounding to push back against unjustified updating.

A factor that I think pulls more weight in alignment relative to other domains is the prevalence of low-bandwidth communication channels, given the number of new researchers whose sole interface with the field is online and asynchronous, textual or few-and-far-between call... (read more)

I have a post from a while back with a section that aims to do much the same thing you're doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.

One key difference is that what you call "inner alignment for characters", I prefer to think about as an outer alignment problem to the extent that the division feels slightly weird. The reason I find this more compelling is that it maps more cleanly onto the idea of what we want our model to be doing, if we're sure that that's what it's actu... (read more)

I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature.

I think I agree with this being the most object-level takeaway; my take then would primarily be about how to conceptualize this loss of entropy (where and in what form) and what else it might imply. I found the "narrowing the prior" frame rather intuitive in this context.

That said, almost all the differences that Janus and you are highl

... (read more)

I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent. 

Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.

Refer to my other reply here. And as the post mentions, RLHF also does exhibit mode collapse (check the section on prior work).

Thanks!

My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.

Sam Marks (5mo):
This, broadly-speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down. (This doesn't obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor's preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic's findings, that means this story gets a complexity penalty.)

I wasn't really focusing on the RL part of RLHF in making the claim that it makes the "agentic personas" problem worse, if that's what you meant. I'm pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won't be apparent until we use stronger RL or something. Then I expect we'll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post).

My claim is pretty similar to how you put it - in RLHF as in fine-... (read more)

Jozdien (5mo):

Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.

I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.

RLHF is less safe tha

... (read more)

I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature.  The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.

So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures imp

... (read more)
mic (5mo):
Janus' post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME [https://beta.openai.com/docs/model-index-for-researchers]), not RLHF. It's evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does. I haven't seen evidence that RLHF'd text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.
cubefox (5mo):
Similar points regarding safety of pure imitation learning vs reinforcement learning have been raised by many others on LW. So I'm really interested what Paul has to say about this.
Evan R. Murphy (5mo):
Glad to see both the OP as well as the parent comment. I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper "Discovering Language Model Behaviors with Model-Written Evaluations" (paper [https://arxiv.org/abs/2212.09251], post [https://www.alignmentforum.org/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written]):

Both of these points seem to suggest that the main takeaway from the Anthropic paper was to uncover concerning behaviours in RLHF language models. That's true, but I think it's just as important that the paper also found pretty much the same concerning behaviours in plain pre-trained LLMs that did not undergo RLHF training, once those models were scaled up to a large enough size.
Sam Marks (5mo):
Regarding your points on agentic simulacra (which I assume means "agentic personas the language model ends up imitating"):

1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.

2) But I'm pretty skeptical about your intuition that RLHF makes the "imitating agentic personas" problem worse. When people I've spoken to talk about conditioning-based alternatives to RLHF that produce a chatbot like the one in Anthropic's paper, they usually mean either: (a) prompt engineering; or (b) having the model produce a bunch of outputs, annotating the outputs with how much we liked them, retraining the model on the annotated data, and conditioning the model to producing outputs like the ones we most liked. (For example, we could prefix all of the best outputs with the token "GOOD" and then ask the model to produce outputs which start with "GOOD".)

Approach (b) really doesn't seem like it will result in less agentic personas, since I imagine that imitating the best outputs will result in imitating an agentic persona just as much as fine-tuning for good outputs with a policy gradient method would. (Main intuition here: the best outputs you get from the pretrained model will already look like they were written by an agentic persona, because those outputs were produced by the pretrained model getting lucky and imitating a useful persona on that rollout, and the usefulness of a persona is correlated with its agency.)

I mostly am skeptical that approach (a) will be able to produce anything as useful as Anthropic's chatbot. But to the extent that it can, I imagine that it will do so by eliciting a particular useful persona, which I have no reason to think will be more or less agentic than the one we got via RLHF. Interested to hear if you have other intuitions here.
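The "GOOD"-token version of approach (b) can be sketched concretely. This is an illustrative toy, not Anthropic's actual pipeline; the sample data, ratings, and threshold are made up:

```python
# Sketch of conditioning-based fine-tuning: annotate sampled outputs,
# prefix them with a control token reflecting their rating, retrain on
# the tagged text, then condition generation on the "GOOD" prefix.
# All data and the threshold below are made up for illustration.

def build_conditioning_dataset(samples, ratings, threshold=0.8):
    """Prefix highly-rated outputs with 'GOOD' and the rest with 'BAD',
    producing text a model can be fine-tuned to imitate."""
    dataset = []
    for text, rating in zip(samples, ratings):
        tag = "GOOD" if rating >= threshold else "BAD"
        dataset.append(f"{tag} {text}")
    return dataset

samples = ["helpful answer", "rude answer"]
ratings = [0.9, 0.2]
print(build_conditioning_dataset(samples, ratings))
# -> ['GOOD helpful answer', 'BAD rude answer']
```

At generation time one would prompt the fine-tuned model with "GOOD " to condition on the high-rating distribution, which is exactly the step Sam's intuition says still elicits an agentic persona.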
porby (5mo):
One consequence downstream of this that seems important to me in the limit:

1. Nonconditioning fine-tuned predictor models make biased predictions. If those biases happen to take the form of a misaligned agent, the model itself is fighting you.
2. Conditioned predictor models make unbiased predictions. The conditioned sequence could still represent a misaligned agent, but the model itself is not fighting you.

I think having that one extra layer of buffer provided by 2 is actually very valuable. A goal-agnostic model (absent strong gradient hacking) seems more amenable to honest and authentic intermediate reporting and to direct mechanistic interpretation.

Yeah, I agree with a lot of that. Nitpick though: I can see why GPT and other kinds of generative models seem like they involve mesa-optimizers, but that's not generally how I use the word. Specifically with GPT, the model itself isn't an optimizer, it's a simulator or a reality engine without any real goal beyond predicting its simulation well. It does have simulacra that are optimizers, but those are sub-agents, and not the model itself. As mesa-optimizers go, I'm sometimes confused by people saying we have no evidence of them existing, when by my unders... (read more)

[anonymous] (4mo):
That's an interesting perspective. I wonder if the alignment risks would still apply when the model itself isn't an optimizer but uses optimizers as part of its function. Alignment risks seem obvious when the model itself tries to optimize. I don't know if lack of optimization means there aren't risky choices and decisions. I guess when the systems are of the same schema, i.e. ML algorithms that uses gradient descent, it's easy to say we need to align their loss function. How would you deal with alignment of incongruent systems that all fit in a certain work pipeline? What are possible work pipelines of optimizers and non-optimizers that would be realistic in the near future?

Thanks!

Yeah, I think distribution shifts could matter a lot - RSA-2048 has been talked to death here and various other places, so I wouldn't be very surprised if a gradient hacker simulacrum just defaulted to searching for its factorization (or something simpler along that line). I'm not sure how much detecting subtle differences like repeated sequences and implementational differences would help though, both because it requires a modicum of extended analytical reasoning about the training distribution (because that kind of information probably won't be pr... (read more)

I think I'm confused at what you're getting at here. If you're making the general claim that solving inner alignment would prevent gradient hacking, I agree! Gradient hacking and its variants are problems with inner optimizers of any kind being misaligned with the training goal. If you're making a different / more specific point though, I may have missed it.

[anonymous] (5mo):
Yeah, that's what I was saying. I don't really have much to contribute. From my experience, most applications of ML don't really have mesa-optimizers, which I would say is a form of ensemble learning. Most models only focus on one thing/area at a time, and there really isn't anything that would aggregate multiple models into another model. The aggregation wouldn't be done with ML; it'd be mostly conventional software development and hard-coded conditionals.

ChatGPT, DALL-E, and other more recent variants of ML that are somewhat closer to AGI are the only examples where I would think there are mesa-optimizers under the base optimizer. When there aren't mesa-optimizers, alignment is basically done by hand by programmers. I can see that when you have increasingly more meta layers of mesa-optimizers, alignment problems can quickly get out of hand, as there is no human to error-check the results at each step anymore.

Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives so that the one used for planning can't really be identified mechanistically as such, or that only processes where they're really useful would learn them and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.

Oh yeah I agree - I was thinking more along the lines of that small models would end up with heuristics even for some tasks that require search to do really well, because they may have slightly complex heuristics learnable by models of that size that allow okay performance relative to the low-power search they would otherwise be capable of. I agree that this could make a quantitative difference though and hadn’t thought explicitly of structuring the task along this frame, so thanks!

Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.

3 TurnTrout · 5mo
Yeah, IMO the "RL at scale trains search-based mesa optimizers" hypothesis predicts "solving randomly generated mazes via a roughly unitary mesa objective and heuristic search" with reasonable probability, and that seems like a toy domain to me.
2 Dalcy Bremin · 5mo
One thing I imagine might be useful even in small training regimes would be to train on tasks where the only possible solution necessarily involves a search procedure, i.e. "[search-y tasks](https://www.lesswrong.com/posts/FDjTgDcGPc7B98AES/searching-for-search-4#Learned_Search_in_Transformers)". For example, it's plausible that simple heuristics aren't sufficient to get you to superhuman level on tasks like Chess or Go, so superhuman RL performance on these tasks would be fairly good evidence that the model already has an internal search process. But one problem with Chess or Go would be that the objective is fixed, i.e. the game rules. So perhaps one way to effectively isolate objectives in small training regimes is to find tasks that are both "search-y" and can be modified to have modularly varying objectives, e.g. Chess, but with various possible game rules.
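A minimal sketch of what "search-y task with modularly varying objectives" could look like: a single generic search procedure (here BFS on a toy gridworld) reused unchanged while only the goal predicate is swapped out. The grid, predicate names, and setup are hypothetical illustrations for this comment, not from any actual experiment.

```python
from collections import deque

# A tiny maze: 'S' start, 'G' one possible goal, '#' walls.
GRID = [
    "S..#",
    ".#.#",
    "...G",
]

def bfs(grid, start, is_goal):
    """Generic search: shortest number of steps to any cell satisfying is_goal."""
    rows, cols = len(grid), len(grid[0])
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if is_goal(grid, r, c):
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None  # objective unreachable

# Two different "game rules" sharing the exact same search machinery:
reach_g = lambda g, r, c: g[r][c] == "G"
reach_bottom_row = lambda g, r, c: r == len(g) - 1

print(bfs(GRID, (0, 0), reach_g))           # distance under objective 1
print(bfs(GRID, (0, 0), reach_bottom_row))  # distance under objective 2
```

The point of varying only `is_goal` is that a model trained across such variants can't bake the objective into its heuristics, which is the property the comment is after.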

Oh yeah, I'm definitely not thinking explicitly about instrumental goals here; I expect those would be a lot harder to locate/identify mechanistically. I was picturing something more along the lines of a situation where an optimizer is deceptive, for example, and needs to do the requisite planning, which plausibly would be centered on plans that best achieve its actual objective. Unlike instrumental objectives, this seems to have a more compelling case for not just being represented in pure thought-space, but rather being the source of the overarching chain of planning.

I'm glad you liked the post, thanks for the comment. :)

I think deep learning might be practically hopeless for the purpose of building controllable AIs; where by controllable I mean here something like "can even be pointed at some specific objective, let alone a 'good' objective". Consequently, I kinda wish more alignment researchers would at least set a 2h timer and try really hard (for those 2h) to come up---privately---with some approach to building AIs that at least passes the bar of basic, minimal engineering sanity. (Like "design the system to even h

... (read more)
2 rvnnt · 5mo
Thanks for the thoughtful response, and for the link to the sequence on modularity (hadn't seen that before). Will digest this.

A main claim is that the thing you want to be doing (not just a general you, I mean specifically the vibe I get from you in this post) is to build an abstract model of the AI and use interpretability to connect that abstract model to the "micro-level" parameters of the AI. "Connect" means doing things like on-distribution inference of abstract model parameters from actual parameters, or translating a desired change in the abstract model into a method for updating the micro-level parameters.

Yeah, this is broadly right. The mistake I was making earlier while... (read more)

4 Charlie Steiner · 5mo
Ah yeah, that makes sense for inference. Like if I'm planning some specific thing like "get a banana", maybe you can read my mind by monitoring my use of some banana-related neurons. But I view such a representation more as an intermediate step in the chain of motivation and planning, with the upshot that interpretability on this level has a hard time being used to actually intervene on what I want - I want the banana as part of some larger process, and so rewiring the banana-neurons that were useful for inference might get routed around or otherwise not have the intended effects. This also corresponds to a problem with trying to locate goals in the neocortex by (somehow) changing my "training objective" and seeing what parts of my brain change.

One was someone saying that they thought it would be impossible to train the model to distinguish between whether it was doing this sort of hallucination vs the text in fact appearing in the prompt, because of an argument I didn't properly understand that was something like 'it's simulating an agent that is browsing either way'. This seems incorrect to me. The transformer is doing pretty different things when it's e.g. copying a quote from text that appears earlier in the context vs hallucinating a quote, and it would be surprising if there's no way to ide

... (read more)

I agree, but this is a question of timelines too. Within the LLM + RL paradigm, we may not need AGI-level RL, or LLMs that can accessibly simulate AGI-level simulacra from self-supervised learning alone - both of which would take longer to reach than many points requiring only intermediate levels of LLM and RL capabilities, because people are still working on RL stuff now.

Given that we want the surgeon to be of bounded size (if we're using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn't seem obvious to me.

What's the most convenient way to get the books internationally? I wasn't able to get the last two years' sets and figured I'd just wait until I moved to a more convenient location, but if this might be the last year you're doing this I definitely want to try getting it this time.

3 Raemon · 6mo
I think there unfortunately just isn't a great way to get them internationally (although it does depend somewhat on where you're living: I believe we have copies of the 2018 ones in Australia, and the 2019 ones are, I think, available in Europe, or at least some parts of it).

I haven't thought about this a lot, but "encrypted" could just mean "just beyond the capabilities of the Surgeon to identify". So the gradient could be moving in a direction away from "easily identifiable early deceptive circuits" instead of "deception", and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?

1 Adam Jermyn · 6mo
That's definitely a thing that can happen. I think the surgeon can always be made ~arbitrarily powerful, and the trick is making it not too powerful/trivially powerful (in ways that e.g. preclude the model from performing well despite the surgeon's interference). So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?
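One way to make the "encrypted vs. identifiable" worry concrete is a toy where the tell-tale signal is the XOR of two activations: a bounded (linear-threshold) surgeon provably cannot isolate it, while a slightly more powerful surgeon that also sees a product term can. This is a hypothetical sketch of the capability-scaling question, not any real surgeon implementation; all names here are made up.

```python
import itertools

# The "deceptive" signal is hidden as the XOR of two binary activations (a, b).
data = [((a, b), a ^ b) for a, b in itertools.product([0, 1], repeat=2)]

def linear_surgeon_accuracy(w1, w2, bias):
    """Score a bounded surgeon: a single linear-threshold detector."""
    hits = 0
    for (a, b), label in data:
        pred = 1 if w1 * a + w2 * b + bias > 0 else 0
        hits += (pred == label)
    return hits / len(data)

# Exhaustively search all linear surgeons over a small integer weight grid;
# XOR is not linearly separable, so none of them can find the signal exactly.
best_linear = max(
    linear_surgeon_accuracy(w1, w2, b)
    for w1 in range(-2, 3) for w2 in range(-2, 3) for b in range(-2, 3)
)

def product_surgeon_accuracy():
    """A slightly more powerful surgeon that also sees the product term a*b."""
    hits = sum(
        ((1 if a + b - 2 * a * b > 0 else 0) == label)
        for (a, b), label in data
    )
    return hits / len(data)

print(best_linear)                 # capped at 3/4: the signal is "encrypted" to this class
print(product_surgeon_accuracy())  # perfect: one extra feature decrypts it
```

The qualitative takeaway matches the thread: whether a signal counts as "encrypted" is relative to the surgeon's capacity, and a small capability bump can flip it to "easily identifiable".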

I was thinking of some kind of prompt that would lead to GPT trying to do something as "environment agent-y" as trying to end a story and start a new one - i.e., stuff from some class that has some expected behaviour on the prior and deviates from that pretty hard. There's probably some analogue with something like the output of random Turing machines, but for that specific thing I was pointing at this seemed like a cleaner example.

This is cool! Ways to practically implement something like RAT felt like a roadblock in how tractable those approaches were.

I think I'm missing something here: Even if the model isn't actively deceptive, why wouldn't this kind of training provide optimization pressure toward making the Agent's internals more encrypted? That seems like a way to be robust against this kind of attack without a convenient early circuit to target.

1 scasper · 2mo
In general, I think not. The agent could only make this actively happen to the extent that its internal activations were known to it and able to be actively manipulated by it. This is not impossible, but [gradient hacking](https://www.alignmentforum.org/posts/uXH4r6MmKPedk8rMA/gradient-hacking) is a significant challenge. In most learning formalisms, such as ERM or solving MDPs, the model's internals are not modeled as a part of the actual algorithm. They're just implementational substrate.
2 Adam Jermyn · 6mo
That's a good point: it definitely pushes in the direction of making the model's internals harder to adversarially attack. I do wonder how accessible "encrypted" is here versus just "actually robust" (which is what I'm hoping for in this approach). The intuition here is that you want your model to be able to identify that a rogue thought like "kill people" is not a thing to act on, and that looks like being robust.

Done! Thanks for updating me toward this. :P

Yeah, I thought of holding off actually creating a sequence until I had two posts like this. This updates me toward creating one now being beneficial, so I'm going to do that.

1 Kenoubi · 7mo
That works too!

Alignment Stream of Thought. Sorry, should've made that clearer - I couldn't think of a natural place to define it.

1 Kenoubi · 7mo
Got it. This post also doesn't appear to actually be part of that sequence though? I would have noticed if it was and looked at the sequence page. EDIT: Oh, I guess it's not your sequence. EDIT2: If you just included "Alignment Stream of Thought" as part of the link text in your intro where you do already link to the sequence, that would work.

I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.

I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can tr... (read more)

I like this post! It clarifies a few things I was confused on about your agenda and the progress you describe sounds pretty damn promising, although I only have intuitions here about how everything ties together.

In the interest of making my abstract intuition here more precise, a few weird questions:

Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involvin

... (read more)

generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine

The complete generated story here is glorious, and I think might deserve explicit inclusion in another post or something. Though I think that of the other stories you've generated as well, so maybe my take here is just to have more deranged meta GPT posting.

it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount optimization from the RLHF training process which s

... (read more)

Running the superintelligent AI on an arbitrarily large amount of compute in this way seems very dangerous, and runs a high risk of it breaking out, only now with access to a hypercomputer (although I admit this was my first thought too, and I think there are ways around this).

More saliently though, whatever mechanism you implement to potentially "release" the AGI into simulated universes could be gamed or hacked by the AGI itself.  Heck, this might not even be necessary - if all they're getting are simulated universes, then they could probably create... (read more)

1 Ulisse Mini · 7mo
I think this is fixable; game of life isn't that complicated, you could prove correctness somehow.

This is a great point - I forgot AIXI also has unbounded compute, so why would it want to escape and get more! I don't think AIXI can "care" about universes it simulates itself, probably because of the cartesian boundary (non-embeddedness) meaning the utility function is defined on inputs (which AIXI doesn't control), but I'm not sure - I don't understand AIXI well.

The simulation being "created in the future" doesn't seem to matter to me. You could also already be simulating the two universes, with the game deciding whether the AIs gain access to them.

Thanks! Will do