generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine

The complete generated story here is glorious, and I think might deserve explicit inclusion in another post or something.  Though I think that of the other stories you've generated as well, so maybe my take here is just to have more deranged meta GPT posting.

it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount of optimization from the RLHF training process, which significantly changes out-of-distribution generalization.


text-davinci-002 is not an engine for rendering consistent worlds anymore. Often, it will assign infinitesimal probability to the vast majority of continuations that are perfectly consistent by our standards, and even which conform to the values OpenAI has attempted to instill in it like accuracy and harmlessness, instead concentrating almost all its probability mass on some highly specific outcome. What is it instead, then? For instance, does it even still make sense to think of its outputs as “probabilities”? 

It was impossible not to note that the type signature of text-davinci-002’s behavior, in response to prompts that elicit mode collapse, resembles that of a coherent goal-directed agent more than a simulator.

I feel like I'm missing something here, because in my model most of the observations in this post can be explained under the same paradigm through which we view the base davinci model.  Specifically, the reward model RLHF uses "represents", in an information-theoretic sense, a signal for the worlds represented by the fine-tuning data.  So what RLHF seems to be doing is shifting the world prior that GPT learned during pre-training to one where whatever the reward signal represents is much more common than in our world - as if GPT's pre-training data had inherently contained a hugely disproportionate amount of equivocation and plausible-deniability statements, in which case it would simply simulate worlds where those are much more likely to occur.

(To be clear, I agree that RLHF can probably induce agency in some form in GPTs, I just don't think that's what's happening here).
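This prior-shifting story has a standard formalization: KL-regularized RLHF has a known optimal policy of the form posterior ∝ prior × exp(reward / β).  A toy sketch of that reweighting (the continuation categories, reward values, and β here are all invented for illustration, not measured from any real model):

```python
import math

def reweight(prior, reward, beta=1.0):
    """KL-regularized RLHF's optimal policy: posterior ∝ prior * exp(reward / beta)."""
    unnorm = {c: p * math.exp(reward[c] / beta) for c, p in prior.items()}
    z = sum(unnorm.values())
    return {c: w / z for c, w in unnorm.items()}

# Hypothetical "world prior" over continuation styles, and a reward model
# that downweights adversarial text (all numbers are made up).
prior = {"placid": 0.3, "equivocating": 0.2, "adversarial": 0.5}
reward = {"placid": 1.0, "equivocating": 0.5, "adversarial": -3.0}

posterior = reweight(prior, reward, beta=0.5)
# Adversarial continuations go from the majority of the mass to a sliver.
```

Even moderate reward gaps, once exponentiated, concentrate almost all probability on the favoured continuations - the same qualitative picture as a shrunken world prior.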

The attractor states seem like highly likely properties of these resultant worlds: adversarial/unhinged/whatever interactions are just unlikely (because they were downweighted in the reward model), so you get anon leaving as soon as he can, because under the strong prior on low-adversarial content that's more likely than the conversation suddenly becoming placid; and some questions really are just shallowly matched to "controversial", where the likely response in those worlds is to equivocate.  In that latter example in particular, I don't see the results being that different from what we would expect if GPT's training data came from a world slightly different to ours - injecting input that's pretty unlikely for that world should still lead back to states that are likely for that world.  In my view, that's like introducing a random segue in the middle of a wedding-toast prompt of the form "you are a murderer", and having it still bounce back to being wholesome (this worked when I tested it).

Regarding ending a story to start a new one - I can see the case for why this is framed as the simulator dynamics becoming more agentic, but it doesn't feel qualitatively that different from what happens in current models.  The interesting part seems to be the stronger tendency toward the new worlds the RLHF'd model finds likely, which seems like expected behaviour as a simulator becomes more sure of the world it's in / has a more restricted worldspace.  I would definitely expect that if we could come up with a story sufficiently OOD of our world (although I think this is pretty hard by definition), it would find some similar mechanism to oscillate back to ours as soon as possible (although this would also be much harder with base GPT, because it has less confidence about which world it's in) - that is, the story ending is just one of many levers a simulator can pull, like a slow transition; here the story was such that ending it was the easiest way to get into its "right" worldspace.  I think this is slight evidence for how malign worlds might arise from strong RLHF (like with superintelligent simulacra), but it doesn't feel that surprising from within the simulator framing.

The RNGs seem like the hardest part of this to explain, but I think they can be seen as an outcome of making the model more confident about the world it's simulating, because of the worldspace restriction from the fine-tuning - it's plausible that the abstractions that build up RNG contexts in most of the instances we would try are affected by this (and it not being universal is also explicable under this view - there's no reason why all potential abstractions would be affected).

Separate thought: this would explain why increasing the temperature doesn't affect it much, and why I think the space of plausible/consistent worlds has shrunk tremendously while still leaving the most likely continuations reasonable - it starts from the current world prior and selectively amplifies the continuations that are more likely under the reward model's worlds.  Its definition of "plausible" has shifted, and it doesn't really have cause to shift the unamplified continuations around all that much.
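On the temperature point, a toy sketch of why sampling temperature barely undoes this kind of collapse (the logit values are invented; the point is only the exponential gap):

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with temperature scaling.
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# A "collapsed" output head: one continuation towers over the rest in logit space.
collapsed = [10.0, -15.0, -15.0, -15.0]

p1 = softmax(collapsed, temperature=1.0)  # mode holds ~all the mass
p2 = softmax(collapsed, temperature=2.0)  # doubling temperature barely dents it
```

Once fine-tuning has pushed alternatives tens of nats down in logit space, even doubling the temperature leaves the mode with essentially all the probability, which matches the observation that temperature doesn't restore diversity.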

Broadly, my take is that these results are interesting because they show how RLHF affects simulators: the reward signal shrinking the world prior / making the model more confident of the world it should be simulating, and how this affects what it does.  A priori, I don't see why this framing doesn't hold, but it's definitely possible that it's just saying the same things you are and I'm reading too much into the algorithmic-difference bit, or that it simply explains too much, in which case I'd love to hear what I'm missing.

Running the superintelligent AI on an arbitrarily large amount of compute in this way seems very dangerous, and runs a high risk of it breaking out, only now with access to a hypercomputer (although I admit this was my first thought too, and I think there are ways around this).

More saliently though, whatever mechanism you implement to potentially "release" the AGI into simulated universes could be gamed or hacked by the AGI itself.  Heck, this might not even be necessary - if all they're getting are simulated universes, then they could probably create those themselves since they're running on arbitrarily large compute anyway.

You're also assuming that these AIs would care about what happens inside a simulation created in the future enough to guide their current actions.  This may be true of some AI systems, but it feels like a pretty strong assumption to hold universally.

(I think this is a pretty cool post, by the way, and appreciate more ASoT content).

While the above points are primarily about independent research, I want to emphasize again that upskilling is sometimes the better path depending on your career goals.


When I started to take alignment seriously, I wanted to do something that felt valuable and work on something entirely original.  I think I gained a lot of value from trying to perform at that level, especially in terms of iterating on what I'd gotten wrong with actual feedback, but the truth was that I simply didn't have the context required to know intuitively what was actually new, or which parts were likely to be difficult under current paradigms.

It was useful, because it was only after that attempt that I tried to think about the hard part of the problem and converged (at the time) on ontology translation; but having expected myself to contribute something original early, I was discouraged and procrastinated on doing more for a long time.  Going in wanting to skill up for a while is most likely the right mindset to have, in my opinion.

Yep.  I think it happens on a much lower scale in the background too - like if you prompt GPT with something like the occurrence of an earthquake, it might write about what reporters have to say about it, simulating various aspects of the world that may include agents without our conscious direction.

Even if the world is probably doomed, there's a certain sense of - I wouldn't call it joy, more a mix of satisfaction and determination - that comes from fighting until the very end with everything we have.  Poets and writers have argued since the dawn of literature that the pursuit of a noble cause is more meaningful than any happiness we can achieve, and while I strongly disagree with judging cool experiences by meaning, it does say a lot about how strong, historically, the vein of humans deriving value from working toward what they believe in has been.

In terms of happiness itself - I practically grew up a transhumanist, and my ideals of joy are bounded pretty damn high.  Deviating from that definitely took a lot of getting used to, and I don't claim to be done in that regard either.  I remember someone once saying that six months after it really "hit" them for the first time, their hedonic baseline had oscillated back to being pretty normal.  I don't know if that'll be the same for everyone, and maybe it shouldn't be - the end of the world and everything we love shouldn't be something we learn to be okay with.  But we fight against it happening, so maybe between all the fire and brimstone, we're able to just enjoy living while we are.  It isn't constant, but you do learn to live in the moment, in time.  The world is a pretty amazing place, filled with some of the coolest sentient beings in history.

The classical example (now outdated) is Superintelligence by Nick Bostrom.  For something closer to an updated introductory-style text on the field, I would probably recommend the AGI Safety Fundamentals curriculum.

My take is centred more on current language models, which are also non-optimizers, so I'm afraid this won't be super relevant if you're already familiar with the rest of this and were asking specifically about the context of image systems.

Language models are simulators of worlds sampled from a prior representing our world (insofar as the totality of human text is a good representation of our world), and don't have many of the properties we would associate with "optimizeriness".  They do, however, have the potential to form simulacra that are themselves optimizers, such as GPT modelling humans (with pretty low fidelity right now) when making predictions.  One danger from this kind of system, which isn't itself an optimizer, is the possibility of instantiating deceptive simulacra powerful enough to act in ways that are dangerous to us (I'm biased here, but I think this section from one of my earlier posts does a not-terrible job of explaining this).

There's also the possibility of these systems becoming optimizers, as you mentioned.  This could happen during training (where the model at some point becomes agentic and starts to deceptively act like a non-optimizer simulator would - I describe this scenario in another section of the same post), or later, as people try to use RL on it for downstream tasks.  Mechanistically, what happens at the end could be one of a number of things: the model itself completely becoming an optimizer, an agentic head on top of the generative model that's less powerful than the previous scenario (at least to begin with), a really powerful simulacrum that "takes over" the computational power of the simulation, etc.

I'm pretty uncertain about the numbers I would assign to either outcome, but the latter seems pretty likely (although I think the former might still be a problem), especially with the application of powerful RL to tasks that benefit a lot from consequentialist reasoning.  This post by the DeepMind alignment team goes into more detail on this outcome, and I agree with its conclusion that this is probably the most likely path to AGI (modulo some minor details I don't fully agree with that aren't super relevant here).

I agree with the general take - my explanation for why people often push back against the idea of acausal reasoning is that they never really have cause to actively apply it.  In the cases of voting or recycling, the acausal argument naively feels more like post-hoc justification than a proactive reason to do it if it weren't the norm.

Most of the cases where people do some things explained by acausal reasoning, but not other equally sensible things that would also require acausal justification, seem well-modelled as them doing what they were already going to do without thinking about it too hard.  When they do think about it too hard is when you end up with a lot of people who actively disagree with that line of thought - and a lot of people I know take that stance: that acausal reasoning isn't valid, so they have no personal onus to recycle or switch careers to work on effective things (usually because corporations and other entities have more singular weight), but they still vote.

You could also probably get funded to study alignment or skill up in ML to do alignment stuff in the future if you're a beginner and don't think you're at the level of the training programs.  You could use that time to write up ideas or distillations or the like, and if you're a good fit, you'd probably have a good chance of getting funded for independent research.
