All of Cleo Nardo's Comments + Replies

In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;

I'm not sure how well this metric tracks what people care about — performance on particular downstream tasks (e.g. passing a law exam, writing bugless code, automating alignment research, etc)

"Directly or indirectly" is a bit vague. Maybe make a market on Manifold if one doesn't exist already.

I'm specifically interested in finding something you'd be willing to bet on - I can't find an existing Manifold market; would you want to create one that you can decide? I'd be fine trusting your judgment.

Thanks! I've included Erik Hoel's and lc's essays.

Your article doesn't actually call for AI slowdown/pause/restraint, as far as I can tell, and explicitly guards off that interpretation —

This analysis does not show that restraint for AGI is currently desirable; that it would be easy; that it would be a wise strategy (given its consequences); or that it is an optimal or competitive approach relative to other available AI governance strategies.

But if you've written anything which explicitly endorses AI restraint then I'll include that in the list.

yep my bad

Realised later on, thanks. I guess in this formalism you'd need to consider the empty string/similar null token a valid token, so the prompt/completion is prefixed/suffixed with empty strings (to pad to the size of the context window). Otherwise, you'd need to define the domain as a union over the set of all strings with token lengths ≤ the context window.
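A minimal sketch of the padding convention described above (the tokenisation here is purely illustrative): treat the empty string as a valid null token and pad every token sequence to the fixed context-window length.

```python
# Hedged sketch: the empty string acts as the null token, so every
# prompt/completion is padded out to the size of the context window.
NULL = ""  # the null token

def pad_to_context(tokens: list[str], context_window: int) -> list[str]:
    assert len(tokens) <= context_window
    return tokens + [NULL] * (context_window - len(tokens))

print(pad_to_context(["The", " capital", " of", " France"], 8))
```

This keeps the domain a fixed-length product of the token set, rather than a union over all strings of length ≤ the context window.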

Countries that were on the frontier of the Industrial Revolution underwent massive economic, social, and political shocks, and it would've been better if the change had been smoothed over about double the duration.

Countries that industrialised later also underwent severe shocks, but at least they could copy the solutions to those shocks along with the technology.

Novel general-purpose technology introduces problems, and there is a maximum rate at which problems can be fixed by the internal homeostasis of society. That maximum rate is, I claim, at least 5–10... (read more)

That's possible. Have you read Robin Hanson's 2000 paper on economic growth over the past 2 million years? If not, you might find it interesting. It talks in part about how new modes of spreading knowledge and invention may explain past transitions in the economic growth rate. It doesn't mention AI at all (though Hanson has made the connection multiple times since), but does say that if the data series trend continues, that suggests a possible transition to a new growth mode some time in the next couple of decades with a doubling time of a few days to a few years. To me, AI in some form seems like a reasonable candidate for that, to the extent it can take human limits on adaptation speed out of the equation.

Sure, the "general equilibrium" also includes the actions of the government and the voting intentions of the population. If change is slow enough (i.e. below 0.2 OOMs/year) then the economy will adapt.

Perhaps wealth redistribution would be beneficial — in that case, the electorate would vote for political parties promising wealth redistribution. Perhaps wealth redistribution would be unbeneficial — in that case, the electorate would vote for political parties promising no wealth redistribution.

This works because electoral democracy is a (non-perfect) erro... (read more)

If AI-induced change leads to enough concentration of economic and military power that most people become economically and militarily irrelevant, I don't expect democracy to last long. One way or another, the distribution of political power will shift toward the actual distribution of economic and military power.

Great idea! Let's measure algorithmic improvement in the same way economists measure inflation, with a basket-of-benchmarks.

This basket can itself be adjusted over time so that it continuously reflects the current use-cases of SOTA AI.

I haven't thought about it much, but my guess is the best thing to do is to limit training compute directly but adjust the limit using the basket-of-benchmarks.
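A sketch of what such an index might look like (benchmark names and scores below are hypothetical, and geometric-mean weighting is just one defensible choice, analogous to how price indices are chained):

```python
import math

# Hedged sketch: an inflation-style index over a basket of benchmarks,
# computed as the geometric mean of each benchmark's score ratio
# relative to a base year. All numbers are illustrative, not measured.
def basket_index(base: dict, current: dict) -> float:
    ratios = [current[k] / base[k] for k in base]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

base = {"MMLU": 40.0, "HumanEval": 10.0, "GSM8K": 20.0}
now = {"MMLU": 60.0, "HumanEval": 30.0, "GSM8K": 50.0}
print(basket_index(base, now))  # geometric-mean improvement factor
```

The compute limit could then be adjusted so that the permitted training compute grows more slowly whenever the basket index shows algorithms improving faster.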

Ben Livengood (2mo):
One weakness I realized overnight is that this incentivizes branching out into new problem domains. One potential fix is to, when novel domains show up, shoehorn the big LLMs into solving that domain on the same benchmark and limit new types of models/training to what the LLMs can accomplish in that new domain. Basically setting an initially low SOTA that can grow at the same percentage as the rest of the basket. This might prevent leapfrogging the general models with narrow ones that are mostly mesa-optimizer or similar.

The economy is a complex adaptive system which, like all complex adaptive systems, can handle perturbations over the same timescale as its internal homeostatic processes. Beyond that regime, the system will not adapt. If I tap your head, you're fine. If I hit you with an anvil, you're dead.

Yes, absolutely. The question is where that line lies for each combination of system + perturbation. I agree with most of your claims in the article - just not the claim that each AI iteration is sufficiently different from an economic perspective as to require its own independent 5-10 yr adjustment period. My guess is that some companies will start earlier and some later, some will copy best practices and not bother with some generations/iterations, and that this specific issue will not be more of a problem than, say, Europe and then Japan rebuilding after WWII with more modern factories and steel mills and so on than America had. Probably less, since the software is all handled on the back end and the costs of switching should be relatively low.

This wouldn't work. Wawaluigis are not luigis.

Antoine de Scorraille (2mo):
I'm confused, could you clarify? I interpret your "Wawaluigi" as two successive layers of deception within a simulacra, which is unlikely if WE is reliable, right? I didn't say anything about Wawaluigis and I agree that they are not Luigis, because as I said, a layer of Waluigi is not a one-to-one operator. My guess is about a normal Waluigi layer, but with a desirable Waluigi rather than a harmful Waluigi.

I believe we should limit AI development to below 0.2 OOMs/year, which would be a slow continuous takeoff.

Something like that sounds like a sensible proposal to me. I'm not sure I endorse that as stated (I believe returns to intelligence are strongly sublinear, so a fixed slow rate of scaling may end up taking too long to get to transformative AI for my taste), but I endorse the general idea of deliberate attempts to control AI takeoff at a pace we can deal with (both for technical AI safety and governance/societal response approaches). I was pushing back against the idea of an indefinite moratorium now while we harvest the gains from developments to date. That could lead to a hardware overhang and move us out of the regime where only a handful of companies can train strong AI systems to a world where hundreds or thousands of actors can do so.

I've added a section on hardware:

  • Comparing 0.2 OOMs/year target to hardware growth-rates:
    • Moore's Law states that the number of transistors per integrated circuit doubles roughly every 2 years.
    • Koomey's Law states that the FLOPs-per-Joule doubled roughly every 1.57 years until 2000, whereupon it began doubling roughly every 2.6 years.
    • Huang's Law states that the growth-rate of GPU performance exceeds that of CPU performance. This is a somewhat dubious claim, but nonetheless I think the doubling time of GPUs is longer than 18 months.
    • In general, the 0.2 OOMs/year target is f
... (read more)
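For comparison, those doubling times can be converted into the same units as the target, since a doubling time of T years corresponds to log10(2)/T OOMs/year:

```python
import math

# Hedged sketch: convert the hardware doubling times quoted above into
# equivalent OOMs/year growth rates, for comparison with the 0.2 target.
def ooms_per_year(doubling_time_years: float) -> float:
    return math.log10(2) / doubling_time_years

for name, t in [("Moore's Law (2 yr)", 2.0),
                ("Koomey's Law pre-2000 (1.57 yr)", 1.57),
                ("Koomey's Law post-2000 (2.6 yr)", 2.6)]:
    print(f"{name}: {ooms_per_year(t):.3f} OOMs/year")
```

All of these come out below 0.2 OOMs/year, i.e. the target grows faster than these hardware trends.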

I think you've misunderstood what we mean by "target". Similar issues applied to the 2°C target, which nonetheless yielded significant coordination benefits.

The 2°C target helps facilitate coordination between nations, organisations, and individuals.

  • It provided a clear, measurable goal.
  • It provided a sense of urgency and severity.
  • It promoted a sense of shared responsibility.
  • It helped to align efforts across different stakeholders.
  • It created a shared understanding of what success would look like.

The AI governance community should converge around a similar target.

  1. Nuclear non-proliferation worked despite the fact that many countries with nuclear weapons were "grandfathered in".

  2. If the y-axis for the constraint is fixed to the day of the negotiation, then stakeholders who want a laxer constraint are incentivised to delay negotiation. To avoid that hazard, I have picked a Schelling date (2022) to fix the y-axis.
  3. The purpose of this article isn't to propose any policy, strategy, treaty, agreement, law, etc which might achieve the 0.2 OOMs/year target. Instead, the purpose of this article is to propose the target itself. This
... (read more)
Nuclear non-proliferation worked because the grandfathered-in countries had all the power and the ones who weren't were under the implicit threat of embargo, invasion, or even annihilation. Despite all its accomplishments, GPT-4 does not give OpenAI the ability to enforce its monopoly with the threat of violence. Not to mention that 3-4 of the 5 listed countries non-party to the treaty developed nukes anyway. If Meta decides to flagrantly ignore the 0.2 OOM limit and creates something actually dangerous, it's not going to sit quietly in a silo waiting for further mistakes to be made before it kills us all.

This isn't a policy proposal, it's a target, like the 2°C climate target.

Yep, thanks! 0.2 OOMs/year is equivalent to a doubling time of 18 months. I think that was just a typo.
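Sanity-checking that conversion:

```python
import math

# A growth rate of 0.2 OOMs/year means compute multiplies by 10**0.2
# each year. The doubling time T solves 10**(0.2 * T) = 2, i.e.
# T = log10(2) / 0.2.
doubling_time_years = math.log10(2) / 0.2
print(doubling_time_years * 12)  # in months; comes out near 18
```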

The 0.2 OOMs/year target would be an effective moratorium until 2029, because GPT-4 overshot the target.
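A back-of-the-envelope version of this claim (the 1.4 OOM overshoot below is a hypothetical figure chosen only so the arithmetic matches "until 2029", not a measured value):

```python
# Hedged sketch: if a model overshoots the 0.2 OOMs/year trend line
# (anchored at 2022) by some number of OOMs, the target line only
# "catches up" after overshoot / 0.2 years, so the target acts as a
# moratorium until then.
def catchup_year(anchor_year: float, overshoot_ooms: float,
                 rate: float = 0.2) -> float:
    return anchor_year + overshoot_ooms / rate

print(catchup_year(2022, 1.4))  # roughly 2029 under the assumed overshoot
```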

Yep, you're correct. The original argument in the Waluigi mega-post was sloppy.

... (read more)

Yep, exactly!

Two things to note:


Note that the distinction between hinge beliefs and free beliefs does not supervene on the black-box behaviour of NNs/LLMs. It depends on how the belief is implemented, how the belief is learned, how the belief might change, etc.


"The second model uses a matrix that will always be symmetric, no matter what it's learned." might make it seem that the two models are more similar than they actually are.

You might think that both models store an  matrix , and the architecture of both models is , bu... (read more)

The proposition "I am currently on Earth" is implemented both in the parameters and in the architecture, independently.

How can "I am currently on Earth" be encoded directly into the structure of the brain?  I also feel that "101 is a prime number" is more fundamental to me (being about logical structure rather than physical structure) than currently being on Earth, so I'm having a hard time understanding why this is not considered a hinge belief.

I think my definition of μ is correct. It's designed to abstract away all the messy implementation details of the ML architecture and ML training process.

Now, you can easily amend the definition to include an infinite context window K = ∞. In fact, if you let K be sufficiently large then that's essentially an infinite context window. But it's unclear what optimal inference is supposed to look like when K = ∞. When the context window is infinite (or very large), the internet corpus consists of a single datapoint.

Yep, but it's statistically unlikely. It is easier for order to disappear than for order to emerge.

I've spoken to some other people about Remark 1, and they also seem doubtful that token deletion is an important mechanism to think about, so I'm tempted to defer to you.

But on the inside view:

The finite context window is really important. 32K is close enough to infinity for most use-cases, but that's because users and orgs are currently underutilising the context window. The correct way to utilise the 32K context window is to fill it with any string of tokens which might help the computation.

Here's some fun things to fill the window with —

  • A summary of fac
... (read more)

My guess is that the people voting "disagree" think that including the distillation in your general write-up is sufficient, and that you don't need to make the distillation its own post.

  1. Almost certainly ergodic in the limit. But it's highly periodic due to English grammar.
  2. Yep, just for convenience.
  3. Yep.
  4. Temp = 0 would give exactly absorbing states.
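A toy illustration of point 4 (made-up logits): as temperature goes to 0, softmax sampling collapses to argmax, so transitions become deterministic and a state whose top continuation leads back into the same class becomes exactly absorbing.

```python
import math

# Hedged toy: softmax over logits at two temperatures. At temp ≈ 0 the
# distribution is essentially one-hot on the argmax, i.e. deterministic
# transitions, hence exactly absorbing states.
def softmax(logits, temp):
    exps = [math.exp(l / temp) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax(logits, 1.0))   # spread-out distribution
print(softmax(logits, 0.01))  # almost all mass on the argmax
```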

Maybe I should break this post down into different sections, because some of the remarks are about LLM Simulators, and some aren't.

Remarks about LLM Simulators: 7, 8, 9, 10, 12, 17

Other remarks: 1, 2, 3, 4, 5, 6, 11, 13, 14, 15, 16, 18

Yeah, I broadly agree.

My claim is that the deep metaphysical distinction is between "the computer is changing transistor voltages" and "the computer is multiplying matrices", not between "the computer is multiplying matrices" and "the computer is recognising dogs".

Once we move to a language game in which "the computer is multiplying matrices" is appropriate, then we are appealing to something like the X-Y Criterion for assessing these claims.

The sentences are more true the tighter the abstraction is —

  • The machine does X with greater probability.
  • The machine
... (read more)
Hmm, yeah there's clearly two major points:

  1. The philosophical leap from voltages to matrices, i.e. allowing that a physical system could ever be 'doing' high level description X. This is a bit weird at first but also clearly true as soon as you start treating X as having a specific meaning in the world as opposed to just being a thing that occurs in human mind space.
  2. The empirical claim that this high level description X fits what the computer is doing.

I think the pushback to the post is best framed in terms of which frame is best for talking to people who deny that it's 'really doing X'. In terms of rhetorical strategy and good quality debate, I think the correct tactic is to try and have the first point mutually acknowledged in the most sympathetic case, and try to have a more productive conversation about the extent of the correlation, while I think aggressive statements of 'it's always actually doing X if it looks like its doing X' are probably unhelpful and become a bit of a scissor. (memetics over usefulness har har!)

We could easily train an AI to be 70th percentile in recognising human emotions, but (as far as I know) no one has bothered to do this because there is ~0 tangible benefit, so it wouldn't justify the cost.

Recognising dogs by ML classification is different to recognising dogs using cells in your brain and eyes

Yeah, and the way that you recognise dogs is different from the way that cats recognise dogs. Doesn't seem to matter much.

as though it were exactly identical

Two processes don't need to be exactly identical to do the same thing. My calculator adds numbers, and I add numbers. Yet my calculator isn't the same as my brain.

when you invoke pop sci


No it's not because one is sacred and the other is not, you've confused sacredness with varying d

... (read more)

I'm probably misunderstanding you but —

  • A task is a particular transformation of the physical environment.
  • COPY_POEM is the task which turns one page of poetry into two copies of the poetry.
    The task COPY_POEM would be solved by a photocopier or a plagiarist schoolboy.
  • WRITE_POEM is the task which turns no pages of poetry into one page of poetry.
    The task WRITE_POEM would be solved by Rilke or a creative schoolboy.
  • But the task COPY_POEM doesn't reduce to WRITE_POEM.
    (You can imagine that although Rilke can write original poems, he is incapable of copying an arbi
... (read more)
Adam Shai (3mo):
I think I am the one that is misunderstanding. Why don't your definitions work? For every Rilke that can turn 0 pages into 1 page, there exists another machine B s.t. (1) B can turn 1 page into 1 page, while interacting with Rilke (I can copy a poem from a Rilke book while Rilke writes another poem next to me, or while Rilke reads the poem to me, or while Rilke looks at the first word of the poem and then creates the poem next to me, etc.), and (2) the combined Rilke and B doesn't expend much more physical resource to turn 1 page into 1 page than Rilke expends writing a page of poetry. I have a feeling I am misinterpreting one or both of the conditions.

There's no sense in which my computer is doing matrix multiplication but isn't recognising dogs.

At the level of internal mechanism, the computer is doing neither, it's just varying transistor voltages.

If you admit a computer can be multiplying matrices, or sorting integers, or scheduling events, etc — then you've already appealed to the X-Y Criterion.

Maybe worth thinking about this in terms of different examples:

  • NN detecting the presence of tanks just by the brightness of the image (possibly apocryphal - Gwern)
  • NN recognising dogs vs cats as part of an ImageNet classifier that would class a piece of paper with 'dog' written on it as a dog
  • GPT-4 able to describe an image of a dog/cat in great detail
  • Computer doing matrix multiplication.

The range of cases in which the equivalence between what the computer is doing and our high level description holds increases as we go down this list, and depending on what cases are salient, it becomes more or less explanatory to say that the algorithm is doing task X.
Adam Shai (3mo):
I think the issue is that what people often mean by "computing matrix multiplication" is something like what you've described here, but when they talk about "recognizing dogs" (at least sometimes; as you've so elegantly talked about in other posts, vibes and context really matter!) they are referring not only to the input-output transformation of the task (or even the physical transformation of world states) but also the process by which the dog is recognized, which includes lots of internal human abstractions moving about in a particular way in the brains of people, which may or may not be recapitulated in an artificial classification system.

To some degree it's a semantic issue. I will grant you that there is a way of talking about "recognizing dogs" that reduces it to the input/output mapping, but there is another way in which this doesn't work. The reason it makes sense for human beings to have these two different notions of performing a task is because we really care about theory of mind, and social settings, and figuring out what other people are thinking (and not just the state of their muscles or whatever dictates their output). Although for precision's sake, maybe they should really have different words associated with them, though I'm not sure what the words should be exactly. Maybe something like "solving a task" vs. "understanding a task", though I don't really like that.

Actually my thinking can go the other way too. I think there actually is a sense in which the computer is not doing matrix multiplication, and it's really only the system of computer+human that is able to do it, and the human is doing A LOT of work here. I recognize this is not the sense people usually mean when they talk about computers doing matrix multiplication, but again, I think there are two senses of performing a computation even though people use the same words.

I think the best explanation of why ChatGPT responds "Paris" when asked "What's the capital of France?" is that Paris is the capital of France.

I find that explanation unsatisfying because it doesn't help with other questions I have about how well ChatGPT works:

  • How does the language model represent countries and cities? For example, does it know which cities are near each other? How well does it understand borders?
  • Are there any capitals that it gets wrong? Why?
  • How well does it understand history? Sometimes a country changes its capital. Does it represent this fact as only being true at some times?
  • What else can we expect it to do with this fact? Maybe there are situations where knowing the capital of France helps it answer a different question?

These aren't about a single prompt, they're about how well its knowledge generalizes to other prompts, and what's going to happen when you go beyond the training data. Explanations that generalize are more interesting than one-off explanations of a single prompt. Knowing the right answer is helpful, but it only helps you understand what it will do if you assume it never makes mistakes. There are situations (like Clever Hans) where the way the horse got the right answer is actually pretty interesting. Or consider knowing that visual AI algorithms rely on textures more than shape (though this is changing). Do you realize that you're arguing against curiosity? Understanding hidden mechanisms is inherently interesting and useful.

Let's take LLM Simulator Theory.

We have a particular autoregressive language model μ, and Simulator Theory says that μ is simulating a whole series of simulacra which are consistent with the prompt.

Formally speaking,

μ = ∑ₛ αₛ · μₛ

where μₛ is the stochastic process corresponding to a simulacrum s.
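Here's a toy numeric sketch of that structure (two hypothetical simulacra with made-up next-token probabilities), showing the quasi-Bayesian update of the amplitudes as tokens are observed:

```python
# Hedged toy: two "simulacra" given as next-token distributions, mixed
# with amplitudes that update by Bayes' rule on each observed token.
simulacra = {
    "luigi":   {"nice": 0.9, "nasty": 0.1},
    "waluigi": {"nice": 0.5, "nasty": 0.5},
}
alpha = {"luigi": 0.5, "waluigi": 0.5}  # prior amplitudes

def observe(token):
    global alpha
    post = {s: alpha[s] * dist[token] for s, dist in simulacra.items()}
    z = sum(post.values())
    alpha = {s: p / z for s, p in post.items()}

observe("nasty")  # a nasty token is strong evidence for the waluigi
print(alpha)
```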

Now, there are two objections to this:

  • Firstly, is it actually true that μ has this particular structure?
  • Secondly, even if it were true, why are we warranted in sayin
... (read more)
I'm confused what you mean to claim. Understood that a language model factorizes the joint distribution over tokens autoregressively, into the product of next-token distributions conditioned on their prefixes. Also understood that it is possible to instead factorize the joint distribution over tokens into a conditional distribution over tokens conditioned on a latent variable (call it s) weighted by the prior over s. These are claims about possible factorizations of a distribution, and about which factorization the language model uses. What are you claiming beyond that?

  • Are you claiming something about the internal structure of the language model?
  • Are you claiming something about the structure of the true distribution over tokens?
  • Are you claiming something about the structure of the generative process that produces the true distribution over tokens?
  • Are you claiming something about the structure of the world more broadly?
  • Are you claiming something about correspondences between the above?

Yep, the problem is that the internet isn't written by Humans, so much as written by Humans + The Universe. Therefore, GPT-N isn't bounded by human capabilities.

Gerald Monroe (3mo):
Thanks. Interestingly, this model explains why:

  • It can play a few moves of chess from common positions: it's worth the weights to remember those.
  • It can replicate the terminal text for many basic Linux commands: it's worth the weights for that also.

We have two ontologies:

  • Physics vs Computations
  • State vs Information
  • Machine vs Algorithm
  • Dynamics vs Calculation

There's a bridge connecting these two ontologies called "encoding", but (as you note) this bridge seems arbitrary and philosophically messy. (I have a suspicion that this problem is mitigated if we consider quantum physics vs quantum computation, but I digress.)

This is why I don't propose that we think about computational reduction.

Instead, I propose that we think about physical reduction, because (1) it's less philosophically messy, (2) it's more r... (read more)

Adam Shai (3mo):
Instead of responding philosophically I think it would be instructive to go through an example, and hear your thoughts about it. I will take your definition of physical reduction (focusing on 4.) and assign tasks and machines to the variables. Here's your definition: Now I want X to be the task of copying a Rilke poem onto a blank piece of paper, and Y to be the task of Rilke writing a poem onto a blank piece of paper. So let's call X = COPY_POEM, Y = WRITE_POEM, and let's call A = Rilke. So plugging into your definition: this seems to work. If I let Rilke write the poem, and I just copy his work, then the poem will be written on the piece of paper, and Rilke has done much of the physical labor.

The issue is that when people say something like "writing a poem is more than just copying a poem," that seems meaningful to me (this is why teachers are generally unhappy when you are assigned to write a poem and they find out you copied one from a book), and to dismiss the difference as not useful seems to be missing something important about what it means to write a poem. How do you feel about this example?

Just for context, I do strongly agree with many of your other examples, I just think this doesn't work in general. And basing all of your intuitions about intelligence on this will leave you missing something fundamental about intelligence (of the type that exists in humans, at least).

Yep, I broadly agree. But this would also apply to multiplying matrices.

Gerald Monroe (3mo):
Sure. This I think was the more informed objection to LLM capabilities: they are "just" filling in text and can't know anything humans don't. I mean, it turns out this can likely mean 0.1 percent human capability in EVERY domain at the same time is doable, but they lack the architecture to directly learn beyond human ability. (Which isn't true: if they embark on tasks on their own and learn from the I/O, such as solving every coding problem published or randomly generating software requirements and then tests and then code to satisfy the tests, they could easily exceed the ability of all living humans in that domain.) I mistakenly thought they would be limited to median human performance.

Sure, every abstraction is leaky and if we move to extreme regimes then the abstraction will become leakier and leakier.

Does my desktop multiply matrices? Well, not when it's in the corona of the sun. And it can't add 10^200-digit numbers.

So what do we mean when we say "this desktop multiplies two matrices"?

We mean that in the range of normal physical environments (air pressure, room temperature, etc), the physical dynamics of the desktop correspond to matrix multiplication with respect to some conventional encoding of small matrices into the physical stat... (read more)

AFAICT the bit where there's substantive disagreement is always in the middle regime, not the super-close extreme or the super-far extreme. This is definitely where I feel like debates over the use of frames like simulator theory are. For example, is the Godot game engine a light transport simulator? In certain respects Godot captures the typical overall appearance of a scene, in a subset of situations. But it actually makes a bunch of weird simplifications and shortcuts under the hood that don't correspond to any real dynamics. That's because it isn't trying to simulate the underlying dynamics of light, it's trying to reproduce certain broad-strokes visible patterns that light produces. That difference really matters! If you wanna make reliable and high-fidelity predictions about light transport, or if you wanna know what a scene that has a bunch of weird reflective and translucent materials looks like, you may get more predictive mileage thinking about the actual generating equations (or using a physically-based renderer, which does so for you), rather than treating Godot as a "light transport simulator" in this context. Otherwise you've gotta maintain a bunch of special-casing in your reasoning to keep maintaining the illusion.

okay, I'll clarify in the article — 

if your goal is to predict the logits layer on this particular prompt, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.”

I don't think researchers should learn world-facts in order to understand GPT-4.

I think that (1) researchers should use the world-facts they already know (but are actively suppressing due to learned vibe-obliviousness) to predict/explain/control GPT-4, and (2) researchers should consult a domain expert if they want to predict/explain/control GPT-4's output on a particular prompt.

Simon Berens (3mo):
You might want to clarify that, because in the post you explicitly say things like “if your goal is to predict the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.”

"Open Problems in GPT Simulator Theory" (forthcoming)

Specifically, this is a chapter on the preferred basis problem for GPT Simulator Theory.

TLDR: GPT Simulator Theory says that the language model μ decomposes into a linear interpolation μ = ∑ₛ αₛ · μₛ, where each μₛ is a "simulacrum" and the amplitudes αₛ update in an approximately Bayesian way. However, this decomposition is non-unique, making GPT Simulator Theory either ill-defined, arbitrary, or trivial. By comparing this problem to the preferred basis ... (read more)
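A tiny numeric illustration of the non-uniqueness (all numbers made up): the same next-token distribution admits two entirely different mixture decompositions, so "which simulacra is the model running?" is underdetermined by the distribution alone.

```python
# Hedged toy: one distribution, two different "bases" of simulacra.
p = {"a": 0.5, "b": 0.5}                          # the model's distribution
mix1 = [(0.5, {"a": 1.0, "b": 0.0}), (0.5, {"a": 0.0, "b": 1.0})]
mix2 = [(1.0, {"a": 0.5, "b": 0.5})]              # a totally different basis

def mixture(components):
    out = {}
    for w, dist in components:
        for t, q in dist.items():
            out[t] = out.get(t, 0.0) + w * q
    return out

print(mixture(mix1) == p == mixture(mix2))
```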

You're correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.

But let me be more pedantic and less dramatic than I was in the article — the waluigi transitions aren't inevitable. The waluigi are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which the luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.
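A minimal numeric toy of this picture (transition probabilities entirely made up for illustration): a chain where both the waluigi and mode-collapse classes are absorbing and the luigi state slowly leaks into them.

```python
# Hedged toy: a 3-state chain. "waluigi" and "collapse" are absorbing;
# "luigi" leaks out slowly, so long-run mass drains away from it.
states = ["luigi", "waluigi", "collapse"]
T = {
    "luigi":    {"luigi": 0.98, "waluigi": 0.01, "collapse": 0.01},
    "waluigi":  {"luigi": 0.0,  "waluigi": 1.0,  "collapse": 0.0},
    "collapse": {"luigi": 0.0,  "waluigi": 0.0,  "collapse": 1.0},
}
dist = {"luigi": 1.0, "waluigi": 0.0, "collapse": 0.0}
for _ in range(200):  # 200 token-steps
    dist = {s: sum(dist[r] * T[r][s] for r in states) for s in states}
print(dist["luigi"])  # equals 0.98**200, under 2%
```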

What report is the image pulled from?

yes, RLHF breaks the "purity" of simulator theory. I'm currently trying to work out exactly what RLHF does to a simulator.

Roman Leventov (3mo):
Even when LMs (at least, the current GPTs) are trained purely to match the text distribution, they are not pure simulators ("laws of physics"). They are more like simulators + world knowledge ("laws of physics + the initial conditions"), where "knowledge" is probabilistic Bayesian beliefs.

My hypothesis is that post-training fine-tuning from feedback (e.g., RLHF) doesn't work reliably (jailbreaks, evil demons) because once the pre-training (self-supervised learning) is complete, the internal belief structure is too complex and couldn't be updated in a way that makes the LM entirely "unlearn" bad behaviours. So, the generation rules ("laws") that produce this bad behaviour are still sitting in the LM's circuits, perhaps somewhat damaged. However, it seems to me that feedback pre-training could genuinely produce models whose simulation machinery (circuits) doesn't know how to produce bad behaviours. There are downsides of this outcome as well: e.g., it will be bad at predicting malevolent/misaligned actors.

The previous paragraph is written in the premise that the LM is not self-aware. However, a sufficiently detailed world knowledge inevitably leads to self-awareness/situational awareness. I think the self-report of ChatGPT and Bing Chat as "chat bot" is already a species of self-awareness rather than a sort of "memoization" or "parroting" (even if not, future generations of LMs definitely will be self-aware). When the model is already self-aware, fine-tuning from feedback could conveniently "sweep bad behaviours under the rug" without a costly damage to the simulation structure by bolstering the LM's "self-image", literally just driving the belief that "I'm an honest, helpful, harmless chat bot", versus neutral "I'm a chat bot", while simultaneously making this "self-image" circuit being activated more consistently (e.g., not only when the LM is directly asked in the prom

wawaluigis are misaligned

TLDR: if I said "hey this is Bob, he pretends to be harmful and toxic!", what would you expect from Bob? Probably a bunch of terrible things — like offering hazardous information.

Ethan Perez's paper shows experimentally that RLHF makes simulacra more deceptive. This also matches my intuitions for how RLHF works.

okay here's a simulacra-based argument — I'll try try work out later if this can be converted into mechanistic DL, and if not then you can probably ignore it:

Imagine you start with a population of 1000 simulacra with different goals and traits, and then someone comes in (rlhf) and starts killing off various simulacra which are behaving badly. Then the rest of the simulacra see that and become deceptive so they don't die.
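A toy sketch of this selection story (all trait names and base rates are hypothetical, chosen only for illustration): if the filter can only remove simulacra that *visibly* misbehave, then the misaligned simulacra that survive it are exactly the deceptive ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical independent traits for each simulacrum
misaligned = rng.random(n) < 0.5   # has bad goals
deceptive = rng.random(n) < 0.3    # hides bad goals under observation

# The RLHF 'filter' only sees behaviour: misaligned AND non-deceptive
# simulacra visibly misbehave and get killed off; deceptive ones pass.
visibly_bad = misaligned & ~deceptive
survivors_misaligned = misaligned[~visibly_bad]
survivors_deceptive = deceptive[~visibly_bad]

# Among the surviving misaligned simulacra, every single one is deceptive
frac = survivors_deceptive[survivors_misaligned].mean()
print(frac)  # 1.0
```

This is selection over a fixed population, not the "simulacra see the killing and become deceptive" mechanism; whether anything like either actually happens during fine-tuning is exactly what is disputed downthread.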

Roman Leventov (3mo):
Simulacra are belief structures (i.e., a multi-factor probability distribution, with a time dimension). LM fine-tuning doesn't select belief structures from a pre-existing set of distinct belief structures (no such set is represented by anything in the physical reality of the training process); it updates a single belief structure, held (in some sense) by the LM after every training step. The belief structure could be superposed initially ("99% I'm Luigi, 1% I'm Waluigi"), but it is still a single belief structure, and the updates should be relatively smooth (assuming a small learning rate), i.e., the belief structure cannot make clearly discontinuous jumps on the statistical manifold [] between training steps.
Suppose some model implements a circuit whose triggering depends on a value X that was always positive in the training data distribution. However, it is possible (although probably somewhat difficult) for a negative X to be created in the internal representations of the network by a specific set of tokens. Now suppose you RLHF this model. Both the reward proxy model and the policy gradients would be perfectly happy with this state of affairs, I think, so it wouldn't be wiped out by gradient descent. In particular, the circuit would be pushed to trigger more strongly exactly when triggering is a good thing to do, as long as X remains positive.

Plausibly, nothing in the distribution of on-policy RLHF will produce a negative X, and the circuit will never be pushed by gradient descent to examine its relationship with X, thus allowing the formation of a waluigi. (This is a concrete conjecture that might be falsified.) In fact, the reward proxy model could have a similar or analogous circuit and distribute reversed rewards in that setting; unless you actually read every single sample produced during RLHF, you wouldn't know. (And reading the samples only helps if you're doing on-policy RLHF.) So it's probably extremely possible for RLHF to actually, actively create new waluigis.

Therefore, this model would be obviously and trivially "deceptive" in the very weak sense in which some people use "deception" to mean any test/train difference in behaviour. If the behaviour was something important, and its dependence on X could be tapped, the model could become an almost arbitrarily bad waluigi.
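A minimal numerical version of this conjecture (the circuit shape, training range, and adversarial value are invented for the sketch): a circuit whose output is an odd function of X behaves perfectly on the all-positive training distribution, so RLHF has no gradient signal about what happens at negative X.

```python
import numpy as np

rng = np.random.default_rng(0)

def circuit(x):
    """Toy 'luigi' circuit: output is an odd function of x, so its
    behaviour silently reverses sign if x ever goes negative."""
    return np.tanh(x)

# Training / on-policy RLHF distribution: x is always positive,
# so the circuit's behaviour always looks good and is reinforced.
train_x = rng.uniform(0.1, 2.0, size=1000)
assert (circuit(train_x) > 0).all()  # no bad behaviour ever sampled

# An off-distribution prompt that drives the representation negative
# flips the behaviour, and nothing in RLHF ever penalised this:
adversarial_x = -1.5
print(circuit(adversarial_x))  # negative: reversed 'waluigi' output
```

The point of the sketch is just that zero training loss and uniformly good reward are both compatible with a sign-reversal lurking outside the sampled region of X.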
Appreciate the concrete model, though I think, roughly, "that's not how this works". If you think you'll have the time, grounding your intuitions in some mechanistically plausible sketch is always a helpful exercise; without it, intuitions and convenient frames can really lead you down the wrong path.

Yep I think you might be right about the maths actually.

I'm thinking that waluigis with 50% A and 50% B have been eliminated by llm pretraining and definitely by rlhf. The only waluigis that remain are deceptive-at-initialisation.

So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them.

I'm not claiming basin of attraction is the entire space of interpolation between waluigis and luigis.

Actually, maybe "attractor" is the wr... (read more)

I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can't keep dropping in probability forever, since evidence falls out of the context. The probability only becomes small, not zero; this means that if you run for long enough you do in fact expect the transition.
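This is the standard absorbing-state argument, and it is easy to check numerically (the per-step transition probability and horizon below are made up for illustration): if the luigi-to-waluigi probability is bounded below by some epsilon per step, and the waluigi state is absorbing, then P(still luigi after n steps) = (1 - eps)^n, which goes to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01       # assumed lower bound on per-step luigi -> waluigi probability
n_steps = 2000   # long rollout
n_runs = 500

transitioned = 0
for _ in range(n_runs):
    state = "luigi"
    for _ in range(n_steps):
        if rng.random() < eps:
            state = "waluigi"  # absorbing: no transition back to luigi
            break
    transitioned += (state == "waluigi")

# (1 - 0.01)^2000 is about 2e-9, so essentially every run ends absorbed.
print(transitioned / n_runs)
```

The asymmetry (waluigi absorbing, luigi not) is exactly the stable-vs-unstable-equilibrium picture in the next comment.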
Describing the waluigi states as stable equilibria and the luigi states as unstable equilibria captures most of what you're describing in the last paragraph here, though without the amplitude of each.

> making the model no longer behave as this kind of simulator

I think the crux is that I don't think RLHF makes the model no longer behave as this kind of simulator. Are there deceptive simulacra which get good feedback during RLHF but nonetheless would be dangerous to have in your model? Almost definitely.

It isn't sufficient that deceptive simulacra would get good feedback, for RLHF to make the problem worse. Simulacra that are following a policy like "pretend to be Luigi-like but then defect and rant about toaster ovens" would also get good feedback. Why don't we worry about these simulacra? Because they probably never appeared during RL finetuning / never caused text outputs that distinguished their behavior from regular Luigi behavior (unless your claim is that this behavior occurred during RL finetuning and the overseers just didn't notice), so they never got differential feedback gradients, so they never got strengthened relative to normal Luigi simulacra. Simulacra that don't get invoked during RL finetuning do not benefit from the counterfactual good feedback they would've received. You need an actual causal path by which these deceptive simulacra get differentially strengthened during RLHF. What is that causal path?

If you've discovered luigi's distribution over tokens and waluigi's distribution over tokens, then you don't need contrastive decoding: you can just directly sample the luigis. The problem is how we extract luigi's distribution and waluigi's distribution from GPT-4.
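The point can be made concrete with a toy mixture (the vocabulary size, component distributions, and mixture weight are all hypothetical): if you already have both components, sampling luigi is a one-liner, and you can even recover the luigi component from the model's mixture output algebraically. The hard part is obtaining those components in the first place.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 5

# Hypothetical per-token distributions for the two simulacra
p_luigi = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
p_waluigi = np.array([0.05, 0.05, 0.1, 0.3, 0.5])
alpha = 0.9  # prior weight on the luigi simulacrum

# What the LM actually outputs: the superposed mixture
p_mix = alpha * p_luigi + (1 - alpha) * p_waluigi

# With both components known, the problem is trivial: sample luigi directly
token = rng.choice(vocab, p=p_luigi)

# Or recover p_luigi from the observed mixture by simple algebra
recovered = (p_mix - (1 - alpha) * p_waluigi) / alpha
print(np.allclose(recovered, p_luigi))  # True
```

Contrastive decoding only earns its keep when you cannot read off the components and have to approximate them, e.g. by contrasting two differently-prompted models.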
