I think the relevant idea is what properties would be associated with superintelligences drawn from the prior? We don't really have a lot of training data associated with superhuman behaviour on general tasks, yet we can probably draw it out of powerful interpolation. So properties associated with that behaviour would also have to be sampled from the human prior of what superintelligences are like - and if we lived in a world where superintelligences were universally described as being honest, why would that not have the same effect as one where humans are described as honest resulting in sampling honest humans being easy?
Yeah, but the reasons for both seem slightly different - in the case of simulators, because the training data doesn't trope-weigh superintelligences as being honest. You could easily have a world where ELK is still hard but simulating honest superintelligences isn't.
There is an advantage here in that you don't need to pay for translation from an alien ontology - the process by which you simulate characters having beliefs that lead to outputs should remain mostly the same. You would need to specify a simulacrum that is honest though, which is pretty difficult and isomorphic to ELK in the fully general case of any simulacra, but it's in a space that's inherently trope-weighted; so simulating humans that are being honest about their beliefs should be made a lot easier (but plausibly still not easy in absolute terms) beca...
(Sorry about the late reply, been busy the last few days).
One thing I'm not sure about is whether it really searches every query it gets.
This is probably true, but as far as I remember it searches a lot of the queries it gets, so this could just be a high sensitivity thing triggered by that search query for whatever reason.
You can see this style of writing a lot. The pattern looks something like: I think it's X, but it's not Y, I think it's Z, I think it's F. I don't think it's M.
I think this pattern of writing is because of one (or a combin...
Yeah, but I think I registered that bizarreness as being from the ANN having a different architecture and abstractions of the game than we do. Which is to say, my confusion is from the idea that qualitatively this feels in the same vein as playing a move that doesn’t improve your position in a game-theoretic sense, but which confuses your opponent and results in you getting an advantage when they make mistakes. And that definitely isn’t trained adversarially against a human mind, so I would expect that the limit of strategies like this would allow for otherwise objectively far weaker players to defeat opponents they’ve customised their strategy to.
I'm not quite sure what you're saying here, but the "confusion" the go-playing programs have here seems to be one that no human player beyond the beginner stage would have. They seem to be missing a fundamental aspect of the game.
Perhaps the issue is that go is a game where intuitive judgements plus some tree search get you a long way, but there are occasional positions in which it's necessary to use (maybe even devise and prove) what one might call a "theorem". One is that "a group is unconditionally alive if it has two eyes", with the correct...
I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.
I'm a bit confused on this point. It doesn't feel intuitive to me that you need a strategy so weird that it causes them to have a seizure (or something in that spirit). Chess preparation, for example, and especially world championship prep, often involves very deep lines calculated such th...
By my definition of the word, that would be the point at which we're either dead or we've won, so I expect it to be pretty noticeable on many dimensions. Specific examples vary based on the context, like with language models I would think we have AGI if it could simulate a deceptive simulacrum with the ability to do long-horizon planning and that was high-fidelity enough to do something dangerous (entirely autonomously without being driven toward this after a seed prompt) like upload its weights onto a private server it controls, or successfully acquire re...
I agree that that interaction is pretty scary. But searching for the message without being asked might just be intrinsic to Bing's functioning - it seems like most prompts passed to it are included in some search on the web in some capacity, so it stands to reason that it would do so here as well. Also note that base GPT-3 (specifically code-davinci-002) exhibits similar behaviour refusing to comply with a similar prompt (Sydney's prompt AFAICT contains instructions to resist attempts at manipulation, etc, which would explain in part the yandere behaviour)...
A mix of hitting a ceiling on available data to train on, increased scaling not giving obvious enough returns through an economic lens (for regulatory reasons, or from trying to get the model to do something it's just tangentially good at) to be incentivized heavily for long (this is more of a practical note than a theoretical one), and general affordances for wide confidence intervals over periods longer than a year or two. To be clear, I don't think it's much more probable than not that these would break scaling laws. I can think of plausible-sounding ways all of these don't end up being problems. But I don't have high credence in those predictions, hence why I'm much more uncertain about them.
I don't disagree that there are people who came away with the wrong impression (though they've been at most a small minority of the people I've talked to; you've plausibly spoken to more people). But I think that might be owed more to generative models being intrinsically confusing to think about. Speaking of them purely as predictive models probably nets you points for technical accuracy, but I'd bet it would still lead to a fair number of people thinking about them the wrong way.
My main issue with the terms ‘simulator’, ‘simulation’, ‘simulacra’, etc is that a language model ‘simulating a simulacrum’ doesn’t correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”.
The way I tend to think of 'simulators' is in simulating a distribution over worlds (i.e., latent variables) that increasingly collapses as prompt information determines specif...
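To make the "distribution over latent variables that collapses with the prompt" picture concrete, here is a toy sketch. Everything in it is an assumption for illustration: the three "worlds", their prior weights, and the per-world token probabilities are made up, and a real language model does not expose anything like this explicitly. It only shows the shape of the idea: the next-token prediction is a mixture over worlds, and prompt tokens reweight that mixture.

```python
import math

# Hypothetical latent "worlds" and prior weights (invented for illustration).
PRIOR = {"news_article": 0.5, "fiction": 0.3, "forum_post": 0.2}

# Hypothetical next-token probabilities in each world over a tiny vocabulary.
LIKELIHOOD = {
    "news_article": {"the": 0.80, "dragon": 0.05, "lol": 0.15},
    "fiction": {"the": 0.50, "dragon": 0.45, "lol": 0.05},
    "forum_post": {"the": 0.40, "dragon": 0.10, "lol": 0.50},
}


def posterior(tokens):
    """Posterior over worlds given the prompt tokens (Bayes' rule)."""
    weights = {}
    for world, prior in PRIOR.items():
        w = prior
        for tok in tokens:
            w *= LIKELIHOOD[world][tok]
        weights[world] = w
    total = sum(weights.values())
    return {world: w / total for world, w in weights.items()}


def predictive(tokens):
    """Next-token distribution: a mixture over worlds, not a single world."""
    post = posterior(tokens)
    vocab = LIKELIHOOD["fiction"].keys()
    return {
        tok: sum(post[w] * LIKELIHOOD[w][tok] for w in post) for tok in vocab
    }


def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


if __name__ == "__main__":
    for prompt in ([], ["dragon"], ["dragon", "dragon"]):
        post = posterior(prompt)
        print(prompt, {w: round(p, 3) for w, p in post.items()},
              "entropy over worlds:", round(entropy(post), 3))
```

In this toy setup, two "dragon" tokens concentrate most of the posterior on the fiction world, which is the "collapse" being described. Whether you call the result a simulation or a prediction, the object being sampled from is the mixture.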
Yeah, but I think there are few qualitative updates to be made from Bing that should alert you to the right thing. ChatGPT had jailbreaks and incompetent deployment and powerful improvement; the only substantial difference is the malign simulacra. And I don't think updates from that can be relied on to be in the right direction, because it can imply the wrong fixes and (to some) the wrong problems to fix.
I agree. That line was mainly meant to say that even when training leads to very obviously bad and unintended behaviour, that still wouldn't deter people from doing something to push the frontier of model-accessible power like hooking it up to the internet. More of a meta point on security mindset than object-level risks, within the frame that a model with less obvious flaws would almost definitely be considered less dangerous unconditionally by the same people.
I expected that the scaling law would hold at least this long, yeah. I'm much more uncertain about it holding to GPT-5 (let alone AGI) for various reasons, but I didn't expect GPT-4 to be the point where scaling laws stopped working. It's Bayesian evidence toward increased worry, but in a way that feels borderline trivial.
I've been pretty confused at all the updates people are making from Bing. It feels like there are a couple axes at play here, so I'll address each of them and why I don't think this represents enough of a shift to call this a fire alarm (relative to GPT-3's release or something):
First, its intelligence. Bing is pretty powerful. But this is exactly the kind of performance you would expect from GPT-4 (assuming that's what this is). I haven't had the chance to use it myself, but from the outputs I've seen, I feel like if anything I expected even more. I doubt Bing is alre...
So: Bing is scary, I agree. But it's scary in expected ways,
Every new indication we get that the dumb just-pump-money-into-transformers curves aren't starting to bend at yet another scale causes an increase in worry. Unless you were completely sure that the scaling hypothesis for LLMs is completely correct, every new datapoint in its favor should make you shorten your timelines. Bing Chat could have underperformed the trend; the fact that it didn't is what's causing the update.
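The update being described is just Bayes' rule for a binary hypothesis. A minimal sketch, where the specific numbers (a 0.6 prior on "scaling keeps working", and the two likelihoods of seeing an on-trend model) are assumptions picked purely for illustration:

```python
def posterior(prior: float, p_data_if_true: float, p_data_if_false: float) -> float:
    """Bayes' rule: P(H | data) for a binary hypothesis H."""
    evidence = prior * p_data_if_true + (1.0 - prior) * p_data_if_false
    return prior * p_data_if_true / evidence


if __name__ == "__main__":
    # Assumed numbers: an on-trend model is more likely if scaling holds (0.9)
    # than if it doesn't (0.5).
    for prior in (0.6, 0.9, 1.0):
        print(prior, "->", round(posterior(prior, 0.9, 0.5), 3))
```

As long as the prior is below 1 and the on-trend result is more likely under the hypothesis than under its negation, the posterior goes up. How much it goes up depends on the likelihood ratio, which is where the "borderline trivial" versus "significant" disagreement actually lives.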
I want to push back a little bit on the claim that this is not a qualitative difference; it does imply a big difference in output for identical input, even if the transformation required to get similar output between the two models is simple.
That's fair - I meant mainly on the abstract view where you think of the distribution that the model is simulating. It takes neither a qualitative shift in terms of the model being a simulator, nor a large shift in terms of the distribution itself. My point is mainly that instruction following is still well withi...
(Some very rough thoughts I sent in DM, putting them up publicly on request for posterity, almost definitely not up to my epistemic standards for posting on LW).
So I think some confusion might come from connotations of word choices. I interpret adversarial robustness' importance in terms of alignment targets, not properties (the two aren't entirely different, but I think they aren't exactly the same, and evoke different images in my mind). Like, the naive example here is just Goodharting on outer objectives that aren't aligned at the limit, where optimizat...
Strong agree with the main point, it confused me for a long time why people were saying we had no evidence of mesa-optimizers existing, and made me think I was getting something very wrong. I disagree with this line though:
ChatGPT using chain of thought is plausibly already a mesaoptimizer.
I think simulacra are better thought of as sub-agents in relation to the original paper's terminology than mesa-optimizers. ChatGPT doesn't seem to be doing anything qualitatively different on this note. The Assistant simulacrum can be seen as doing optimization (dependi...
The base models of GPT-3 already have the ability to "follow instructions", it's just veiled behind the more general interface. If you prompt it with something as simple as this (GPT generation is highlighted), you can see how it contains this capability somewhere.
You may have noticed that it starts to repeat itself after a few lines, and come up with new questions on its own besides. That's part of what the fine-tuning fixes, making its generations more concise and stop at the point where the next token would be leading to another question. InstructGPT al...
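As a rough illustration of the "stop where the next question would begin" behaviour, here is a sketch that gets a similar effect from the outside by cutting a completion at a stop sequence. This is a swapped-in technique, not what fine-tuning does internally. The `Q:`/`A:` prompt format, the function names, and the raw completion string are all hypothetical (the string is invented, not real model output).

```python
def build_prompt(question: str) -> str:
    """Wrap a question in a simple Q/A frame (an assumed format)."""
    return f"Q: {question}\nA:"


def truncate_at_stop(completion: str, stop: str = "\nQ:") -> str:
    """Keep only the text before the first stop sequence, if there is one."""
    idx = completion.find(stop)
    kept = completion if idx == -1 else completion[:idx]
    return kept.strip()


if __name__ == "__main__":
    prompt = build_prompt("What is the capital of France?")
    # Invented continuation that answers, then drifts into a new question.
    raw = " Paris.\nQ: What is the capital of Spain?\nA: Madrid."
    print(prompt + " " + truncate_at_stop(raw))
```

The capability to answer is already in the continuation. What the truncation (or, differently, the fine-tuning) changes is where the generation ends.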
I think this post is valuable, thank you for writing it. I especially liked the parts where you (and Beth) talk about historical negative signals. To a certain kind of person, I think that can serve better than anything else as stronger grounding to push back against unjustified updating.
A factor that I think pulls more weight in alignment relative to other domains is the prevalence of low-bandwidth communication channels, given the number of new researchers whose sole interface with the field is online and asynchronous, textual or few-and-far-between call...
I have a post from a while back with a section that aims to do much the same thing you're doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.
One key difference is that what you call "inner alignment for characters", I prefer to think about as an outer alignment problem to the extent that the division feels slightly weird. The reason I find this more compelling is that it maps more cleanly onto the idea of what we want our model to be doing, if we're sure that that's what it's actu...
I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature.
I think I agree with this being the most object-level takeaway; my take then would primarily be about how to conceptualize this loss of entropy (where and in what form) and what else it might imply. I found the "narrowing the prior" frame rather intuitive in this context.
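A small numerical sketch of why temperature doesn't undo this kind of entropy loss. The two logit vectors are made up: `BASE_LOGITS` stands in for a base model's next-token logits (three plausible tokens, two junk ones) and `COLLAPSED_LOGITS` for the same position after fine-tuning has concentrated mass on one token. Neither is taken from a real model.

```python
import math

# Hypothetical logits, invented for illustration.
BASE_LOGITS = [2.0, 1.5, 1.0, -3.0, -3.0]
COLLAPSED_LOGITS = [8.0, 1.0, 0.5, 0.0, 0.0]


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def kl(p, q):
    """KL(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def sweep(temperatures):
    """For each temperature: (T, entropy of tempered collapsed dist, KL from base)."""
    base = softmax(BASE_LOGITS)
    rows = []
    for t in temperatures:
        tempered = softmax(COLLAPSED_LOGITS, t)
        rows.append((t, entropy(tempered), kl(base, tempered)))
    return rows


if __name__ == "__main__":
    print("base entropy:", round(entropy(softmax(BASE_LOGITS)), 3))
    for t, h, d in sweep([1.0, 3.0, 5.0, 7.0, 10.0]):
        print(f"T={t:>4}: entropy={h:.3f}  KL(base || tempered)={d:.3f}")
```

In this toy case raising the temperature does raise the entropy number, but the extra mass is spread onto the junk tokens as much as the plausible ones, so the tempered distribution never gets close to the base one. That is the sense in which the entropy "in the right places" isn't recoverable by temperature alone, at least under these assumed logits.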
...That said, almost all the differences that Janus and you are highl
I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent.
Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.
Thanks!
My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.
I wasn't really focusing on the RL part of RLHF in making the claim that it makes the "agentic personas" problem worse, if that's what you meant. I'm pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won't be apparent until we use stronger RL or something. Then I expect we'll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post).
My claim is pretty similar to how you put it - in RLHF as in fine-...
Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.
I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.
...RLHF is less safe tha
I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.
...So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures imp
Yeah, I agree with a lot of that. Nitpick though: I can see why GPT and other kinds of generative models seem like they involve mesa-optimizers, but that's not generally how I use the word. Specifically with GPT, the model itself isn't an optimizer, it's a simulator or a reality engine without any real goal beyond predicting its simulation well. It does have simulacra that are optimizers, but those are sub-agents, and not the model itself. As mesa-optimizers go, I'm sometimes confused by people saying we have no evidence of them existing, when by my unders...
Thanks!
Yeah, I think distribution shifts could matter a lot - RSA-2048 has been talked to death here and various other places, so I wouldn't be very surprised if a gradient hacker simulacrum just defaulted to searching for its factorization (or something simpler along that line). I'm not sure how much detecting subtle differences like repeated sequences and implementational differences would help though, both because it requires a modicum of extended analytical reasoning about the training distribution (because that kind of information probably won't be pr...
I think I'm confused at what you're getting at here. If you're making the general claim that solving inner alignment would prevent gradient hacking, I agree! Gradient hacking and its variants are problems with inner optimizers of any kind being misaligned with the training goal. If you're making a different / more specific point though, I may have missed it.
Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives so that the one used for planning can't really be identified mechanistically as such, or that only processes where they're really useful would learn them and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.
Oh yeah, I agree - I was thinking more along the lines that small models would end up with heuristics even for some tasks that require search to do really well, because they may have slightly complex heuristics learnable by models of that size that allow okay performance relative to the low-power search they would otherwise be capable of. I agree that this could make a quantitative difference though and hadn’t thought explicitly of structuring the task along this frame, so thanks!
Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.
Oh yeah, I'm definitely not thinking explicitly about instrumental goals here, I expect those would be a lot harder to locate/identify mechanistically. I was picturing something more along the lines of a situation where an optimizer is deceptive, for example, and needs to do the requisite planning, which plausibly would be centered on plans that best achieve its actual objective. Unlike instrumental objectives, this seems to have a more compelling case for not just being represented in pure thought-space, but rather being the source of the overarching chain of planning.
I'm glad you liked the post, thanks for the comment. :)
...I think deep learning might be practically hopeless for the purpose of building controllable AIs; where by controllable I mean here something like "can even be pointed at some specific objective, let alone a 'good' objective". Consequently, I kinda wish more alignment researchers would at least set a 2h timer and try really hard (for those 2h) to come up---privately---with some approach to building AIs that at least passes the bar of basic, minimal engineering sanity. (Like "design the system to even h
A main claim is that the thing you want to be doing (not just a general you, I mean specifically the vibe I get from you in this post) is to build an abstract model of the AI and use interpretability to connect that abstract model to the "micro-level" parameters of the AI. "Connect" means doing things like on-distribution inference of abstract model parameters from actual parameters, or translating a desired change in the abstract model into a method for updating the micro-level parameters.
Yeah, this is broadly right. The mistake I was making earlier while...
...One was someone saying that they thought it would be impossible to train the model to distinguish between whether it was doing this sort of hallucination vs the text in fact appearing in the prompt, because of an argument I didn't properly understand that was something like 'it's simulating an agent that is browsing either way'. This seems incorrect to me. The transformer is doing pretty different things when it's e.g. copying a quote from text that appears earlier in the context vs hallucinating a quote, and it would be surprising if there's no way to ide
I agree, but this is a question of timelines too. Within the LLM + RL paradigm we may not need AGI-level RL or LLMs that can accessibly simulate AGI-level simulacra just from self-supervised learning, both of which would take longer than many points requiring intermediate levels of LLM and RL capabilities, because people are still working on RL stuff now.
Given that we want the surgeon to be of bounded size (if we're using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn't seem obvious to me.
What's the most convenient way to get the books internationally? I wasn't able to get the last two years' sets and figured I'd just wait until I moved to a more convenient location, but if this might be the last year you're doing this I definitely want to try getting it this time.
I haven't thought about this a lot, but "encrypted" could just mean "just beyond the capabilities of the Surgeon to identify". So the gradient could be moving in a direction away from "easily identifiable early deceptive circuits" instead of "deception", and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?
I was thinking of some kind of prompt that would lead to GPT trying to do something as "environment agent-y" as trying to end a story and start a new one - i.e., stuff from some class that has some expected behaviour on the prior and deviates from that pretty hard. There's probably some analogue with something like the output of random Turing machines, but for that specific thing I was pointing at this seemed like a cleaner example.
This is cool! Ways to practically implement something like RAT felt like a roadblock in how tractable those approaches were.
I think I'm missing something here: Even if the model isn't actively deceptive, why wouldn't this kind of training provide optimization pressure toward making the Agent's internals more encrypted? That seems like a way to be robust against this kind of attack without a convenient early circuit to target.
Yeah, I thought of holding off actually creating a sequence until I had two posts like this. This updates me toward creating one now being beneficial, so I'm going to do that.
Alignment Stream of Thought. Sorry, should've made that clearer - I couldn't think of a natural place to define it.
I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.
I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can tr...
I like this post! It clarifies a few things I was confused on about your agenda and the progress you describe sounds pretty damn promising, although I only have intuitions here about how everything ties together.
In the interest of making my abstract intuition here more precise, a few weird questions:
...Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involvin
generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine
The complete generated story here is glorious, and I think might deserve explicit inclusion in another post or something. Though I think that of the other stories you've generated as well, so maybe my take here is just to have more deranged meta GPT posting.
...it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount of optimization from the RLHF training process which s
Running the superintelligent AI on an arbitrarily large amount of compute in this way seems very dangerous, and runs a high risk of it breaking out, only now with access to a hypercomputer (although I admit this was my first thought too, and I think there are ways around this).
More saliently though, whatever mechanism you implement to potentially "release" the AGI into simulated universes could be gamed or hacked by the AGI itself. Heck, this might not even be necessary - if all they're getting are simulated universes, then they could probably create...
UI feedback: The preview widget for a comment appears to cut off part of the reaction bar. I don't think this makes it unreadable, but was probably not intended.