Physicist switching to AI alignment

Studying these man-made horrors so they are no longer beyond my comprehension


I am quite confused. It is not clear to me whether, at the end, you are saying that LLMs do or don't have a world model. Can you clearly say which "side" you stand on? Are you even arguing for a particular side? Are you arguing that the idea of "having a world model" doesn't apply well to an LLM, or is just not well defined?

That said, you do seem to be claiming that LLMs do not have a coherent model of the world (again, am I misunderstanding you?), and you then use humans as an example of what having a coherent world model looks like. This sentence is particularly bugging me:

For example, an LLM that can answer a question about the kinetic energy of a bludger probably doesn't have a clear boundary between models of fantasy and models of reality. But switching seamlessly between emulating different people is implicit in what they are attempting to do - predict what happens in a conversation.

In the screenshots you provided, GPT-3.5 does indeed answer the question, but it seems to recognise that bludgers are not real (it says "...bludgers in Harry Potter are depicted as...", "...in the Harry Potter universe..."), and indeed it says it doesn't have specific information about their magical properties. I too, despite being a physicist who knows that HP isn't real, would have gladly tried to answer that question much as GPT did. What are you arguing? Do LLMs seem to at least distinguish between reality and HP, or not?

And large language models, like humans, do the switching so contextually, without explicit warning that the model being used is changing. They also do so in ways that are incoherent.

What's incoherent about the response it gave? Was the screenshot not meant to be evidence?


The simulator theory (which you seem to rely on) is, IMO, a good human-level explanation of what GPT is doing, but it is not a fundamental-level theory. You cannot reduce every interaction with an LLM to a "simulation"; some things are just weirder. Think of pathological examples such as the input being "££££..." repeated thousands of times: the output will be some random, possibly incoherent, babbling (a funny incoherent output I got from the API by inputting "£"*2000 and asking it how many "£" there were: 'There are 10 total occurrences of "£" in the word Thanksgiving (not including spaces).'). Notice also the random title it gives to the conversations. Simulator theory fails here.


In the framework of simulator theory and lack of world model, how do you explain that it is actually really hard to make GPT overtly racist? Or how the instruct finetuning is basically never broken?

If I leave a sentence incomplete, why doesn't the LLM complete my sentence instead of saying "You have been cut off, can you please repeat?"? Why doesn't the "playful" roleplaying take over, while (as you seem to claim) it takes over when you ask for factual things? Do they have a model of what "following instructions" means and of "racism", but not of what "reality" is?


To state my belief: I think hallucinations, non-factuality and a lot of the other problems are better explained by the failure of RLHF than by the lack of a coherent world model. RLHF apparently isn't that good at making sure that GPT-4 answers factually, which is especially striking since it is really hard to make it overtly racist, and since they reward it for "giving it a shot" instead of answering "idk" (because otherwise it would always answer "idk"). I explain it as: when training the reward model, a lot of non-factual things might appear, and some non-factual things are actually the response that humans prefer.

Or it might just be the autoregressive paradigm: once it makes a mistake (just by randomly sampling the "wrong" token) the model "thinks": *Yoda voice* 'mhmm, a mistake in the answer I see, mistaken the continuation of the answer should then be'.

And the weirdness of the outputs after a long repetition of a single token is explained by the non-zero repetition penalty in ChatGPT and so the output will kinda resemble the output of a glitch token.
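To make the repetition-penalty point concrete, here is a minimal sketch of how a count-based penalty on the logits could push a heavily repeated token out of the running. The token names, logit values and the penalty strength `alpha` are all made up for illustration; I don't know what values ChatGPT actually uses.

```python
from collections import Counter


def apply_frequency_penalty(logits, context_tokens, alpha):
    """Subtract alpha * (times the token already appeared) from each logit."""
    counts = Counter(context_tokens)
    return {tok: logit - alpha * counts[tok] for tok, logit in logits.items()}


# Hypothetical raw logits after a context made of 2000 copies of the same token.
logits = {"£": 10.0, "the": 2.0, "Thanksgiving": -3.0}
context = ["£"] * 2000

penalised = apply_frequency_penalty(logits, context, alpha=0.01)

best_before = max(logits, key=logits.get)        # the repeated token wins
best_after = max(penalised, key=penalised.get)   # the repeated token is suppressed
```

With these toy numbers the repeated token goes from being the clear favourite to being the least likely option, so whatever gets sampled next has little to do with the input.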

The article and my examples were meant to show that there is a gap between what GPT knows and what it says. It knows something, but sometimes says that it doesn’t, or it just makes it up. I haven’t addressed your “GPT generator/critic” framework or the calibration issues as I don’t really see them as very relevant here. GPT is just GPT. Being a critic/verifier is basically always easier. IIRC the GPT-4 paper didn’t really go into much detail about how they tested the calibration, but that’s irrelevant here, as I am claiming that sometimes it knows the “right probability” but generates a made-up one.

I don’t see how “say true things when you are asked and you know the true thing” is such a high standard, just because we have already internalised that it’s ok that GPT sometimes makes things up.

Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy.

Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact-checking every tiny bit). If just a handful of “wrong but believable” examples sneak into the reward modelling phase, you get a model that thinks that sometimes lying is what humans want (and, without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety malice is never really needed or even considered (ok, maybe deception is malicious by definition).

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode

I am unsure if this article will satisfy you, but nonetheless I have repeatedly corrected GPT-3/4 and it goes “oh, yeah, right, you’re right, my bad, [elaborates, clearly showing that it had the knowledge all along]”. Or even:
Me: "[question about thing]"
GPT: "As of my knowledge cut-off of 2021 I have absolutely no idea what you mean by thing"
Me: "yeah, you know, the thing"
GPT: "Ah, yeah the thing [writes four paragraphs about the thing]"
Fresh example of this: Link (it says the model is the default, but it's not, it's a bug, I am using GPT-4)

Maybe it is just perpetuating bad training data full of misconceptions, or maybe when I correct it I am the one who's wrong and it’s just a sycophant (very common in GPT-3.5 back in February).

But I think the point is that you could justify the behaviour in a million ways. It doesn’t change the fact that it says untrue things when asked for true things.

Is it safe to hallucinate sometimes? Idk, that could be discussed, but sure as hell it isn’t aligned with what RLHF was meant to align it to.

I’d also like to add that it doesn’t consistently hallucinate. I think sometimes it just gets unlucky and it samples the wrong token and then, by being autoregressive, keeps the factually wrong narrative going. So maybe being autoregressive is the real demon here and not RLHF. ¯\_(ツ)_/¯
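A toy sketch of that "unlucky token" story, assuming a completely made-up, hand-written conditional distribution in place of a real model (the function names, tokens and probabilities are all hypothetical): once the wrong token is in the prefix, the most likely continuation is the one that stays consistent with it.

```python
def next_token_distribution(prefix):
    """Hypothetical hand-written p(next token | prefix), standing in for an LLM."""
    last = prefix[-1]
    if last == "is":
        # The model "knows" the right answer, but the wrong one has non-zero mass.
        return {"Paris": 0.9, "Lyon": 0.1}
    if last in ("Paris", "Lyon"):
        return {",": 1.0}
    if last == ",":
        # The continuation is conditioned on whatever was already emitted.
        if "Paris" in prefix:
            return {"on_the_Seine": 1.0}
        return {"on_the_Rhone": 1.0}
    return {"<eos>": 1.0}


def generate(prompt, first_token=None):
    """Greedy decoding; optionally force the first generated token."""
    tokens = list(prompt)
    if first_token is not None:
        tokens.append(first_token)  # simulates an unlucky sample
    while tokens[-1] != "<eos>":
        dist = next_token_distribution(tokens)
        tokens.append(max(dist, key=dist.get))
    return tokens[len(prompt):]


prompt = ["The", "capital", "of", "France", "is"]
lucky = generate(prompt)            # starts from the high-probability token
unlucky = generate(prompt, "Lyon")  # starts from the sampled "wrong" token
```

In the unlucky run every later step is locally the most likely one, yet the whole answer is wrong, which is the failure mode I am blaming on autoregression rather than on missing knowledge.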

It's still not factual.

To me it isn't clear which alignment you are talking about.

You say that the list is about "alignment towards genetically-specified goals", which I read as "humans are aligned with inclusive genetic fitness", but then you talk about what I would describe as "humans aligned with each other" as in "humans want humans to be happy and have fun". Are you confusing the two?

South Korea isn't having kids anymore. Sometimes you get serial killers or Dick Cheney.

Here the first one shows misalignment towards IGF, while the second shows misalignment towards other humans, no?

I'd actually argue the answer is "obviously no".

RLHF wasn't just meant to address "don't answer how to make a bomb" or "don't say the n-word", it was meant to make GPT say factual things. GPT fails at that so often that this "lying" behaviour has its own term: hallucinations. It doesn't "work as intended" because it was intended to make it say true things.

Do many people really forget that RLHF was meant to make GPT say true things?

When OpenAI reports the success of RLHF as "GPT-4 is the most aligned model we developed", to me it sounds mostly like a case of "painting the target around the arrow": they decided a posteriori that whatever GPT-4 does is aligned.

You even have "lie" multiple times in the list of bad behaviours in this post and you still answer "yes, it is aligned"? Maybe you just have a different experience? Do you check what it says? If I ask it about my own expertise it is full of hallucinations.

Hell, neural networks in physics are often regarded as just fitting, with many parameters, a really complex function we don't have the mathematical form of (so the reverse of what I explained in this paragraph).

Basically I expect the neural networks to be a crude approximation of a hard-coded cognition algorithm. Not the other way around.

What NNs do can't be turned into an algorithm by any known route.

NN -> algorithms was one of my assumptions. Maybe I can relay my intuitions for why it is a good assumption:

For example, in the paper https://arxiv.org/abs/2301.05217 they explore grokking by making a transformer learn modular addition, and then they reverse-engineer the algorithm that training "came up with". Furthermore, supporting my point in this post, the learned algorithm is also very far from being the most efficient, due to "living" inside a transformer. So, in this example, imagine that we didn't know what the network was doing, and someone was just trying to do the same thing that the NN did, but faster and more efficiently: they would study the network, look at the bonkers algo that it learned, realize what it does, and then write the three lines of assembly that actually do the modular addition so much faster (and more precisely!), without wasting resources and time on the big matrices in the transformer.
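A rough sketch of the contrast, as I understand the paper's finding: the network represents the inputs with sines and cosines at a handful of frequencies and effectively scores each candidate answer c by cos(w(a+b-c)), while the "designed" version is a single modulo operation. The modulus and the frequencies below are arbitrary choices for illustration, not the ones the trained network picked.

```python
import math

P = 113                  # a prime modulus, chosen here for illustration
KEY_FREQS = (3, 17, 41)  # hypothetical "key frequencies"


def mod_add_direct(a, b, p=P):
    """The designed algorithm: one addition and one remainder."""
    return (a + b) % p


def mod_add_fourier(a, b, p=P, freqs=KEY_FREQS):
    """Fourier-style route: score every candidate c and take the best one.

    Each cosine term equals 1 exactly when c == (a + b) mod p and is strictly
    smaller otherwise (p is prime), so the correct answer has the top score.
    """
    def score(c):
        return sum(math.cos(2 * math.pi * k * (a + b - c) / p) for k in freqs)

    return max(range(p), key=score)


example_direct = mod_add_direct(100, 50)
example_fourier = mod_add_fourier(100, 50)
```

Both functions give the same answer, but the second one evaluates hundreds of cosines to get there, which is the kind of overhead I mean by the algorithm "living" inside a transformer.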

I can also tackle the problem from the other side: I assume (is it non-obvious?) that predicting-the-next-token can also be done with algorithms and not only with neural networks. I assume that intelligence can also be made with algorithms rather than only NNs. So there is very probably a correspondence: I can do the same thing in two different ways, and so NN -> algorithms is possible. Maybe this correspondence doesn't always favour simpler algos, and NNs are sometimes actually less complex, but claiming that this is true in general feels like the bolder claim.

To support my claim further, we could just look at the math. Transformers, RNNs, etc. are just linear algebra and non-linear activation functions. You can write that down or even, just as an example, fit the multi-dimensional curve with a nonlinear function, maybe just a polynomial: do a Taylor expansion and maybe discard the terms that contribute least, or something else entirely... I am reluctant to even give ideas on how to do it because of the dangers, but NNs can most definitely be written down as a multivariate non-linear function. Hell, neural networks in physics are often regarded as just fitting, with many parameters, a really complex function we don't have the mathematical form of (so the reverse of what I explained in this paragraph).
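A deliberately tiny illustration of the "it's just a function you can write down" point, using a made-up two-input network with one tanh hidden layer (all weights are arbitrary) and a polynomial surrogate obtained by truncating the Taylor series tanh(z) ≈ z - z³/3. This is only a toy: the truncation is accurate for small pre-activations and gets worse as they grow.

```python
import math

# Hypothetical weights for a 2 -> 2 -> 1 network with tanh activations.
W1 = [[0.5, -0.3], [0.2, 0.4]]
B1 = [0.1, -0.1]
W2 = [0.7, -0.6]
B2 = 0.05


def pre_activations(x):
    """Linear part of the hidden layer: z = W1 x + b1."""
    return [sum(w * xj for w, xj in zip(row, x)) + b for row, b in zip(W1, B1)]


def network(x):
    """The network written out as an explicit multivariate function."""
    return B2 + sum(w * math.tanh(z) for w, z in zip(W2, pre_activations(x)))


def cubic_surrogate(x):
    """Same function with tanh replaced by its Taylor series up to z**3."""
    return B2 + sum(w * (z - z**3 / 3) for w, z in zip(W2, pre_activations(x)))


points = [(0.0, 0.0), (0.5, 0.5), (-0.5, 0.3)]
errors = [abs(network(x) - cubic_surrogate(x)) for x in points]
```

On these inputs the polynomial tracks the network closely; whether anything like this stays tractable at the scale of a real model is exactly the part I am unsure about.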

And neural networks can be evolved, which is their biggest strength. I do expect that predicting-the-next-token algorithms can actually be much better than GPT-4, by the same analogy that Yudkowsky uses for why designed nanotech is probably much better than natural nanotech: the learned algorithms must be evolvable, and so they sit in a much shallower "loss potential well" than designed algorithms could.

And it seems to me that this reverse engineering process is what interpretability is all about. Or at least what the Holy Grail of interpretability is.

Now, as I've written down in my assumptions, I don't know if any of the learned cognition algorithms can be written down efficiently enough to have an edge on NNs:

[I assume that] algorithms interpreted from matrix multiplications are efficient enough on available hardware. This is maybe my shakiest hypothesis: matrix multiplication in GPUs is actually pretty damn well optimized


Maybe I should write a sequel to this post showing all of these intuitions and motivations for why NN -> Algo is a possibility.

I hope I made some sense, and I didn't just ramble nonsense 😁.

Well, tools like Pythia help us peer inside the NN and help us reason about how things work. The same tools can help the AGI reason about itself. Or the AGI develops its own better tools. What I am talking about is an AGI doing what the interpretability researchers are doing now (or what OpenAI is trying to do with GPT-4 interpreting GPT-2).

It doesn't matter how, and I don't know how; I just wanted to point out the simple path to algorithmic foom even if we start with a NN.

Disclaimer: These are all hard questions and points whose true answers I don't know; these are just my views, what I have understood up to now. I haven't studied expected utility maximisers, exactly because I don't expect the abstraction to be useful for the kind of AGI we are going to be making.

There's a huge gulf between agentic systems and "zombie-agentic" systems (that act like agents with goals, but have no explicit internal representation of those goals)

I feel the same, but I would say that it's the “real-agentic” system (or a close approximation of it) that needs God-level knowledge of cognitive systems (which is why orthodox alignment by building the whole mind from theory is really hard). An evolved system like us or like GPT, IMO, seems closer to a “zombie-agentic” system.
I feel the key thing to understand each other might be coherence, and how coherence can vary from introspection, but I am not knowledgeable enough to delve into this right now.

How do you define the goal (or utility function) of an agent? Is it something that actually happens when universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and agent's shortcomings)?

The view in my mind that makes sense is that a utility function is an abstraction that you can put on top of basically anything if you wish. It's a hat for describing, in the most general way, a system that does things. The framework is borrowed from economics, where human behaviour is modelled with more or less complicated utility functions, but whether or not there is an internal representation is mostly irrelevant. And, again, I don't expect a DL system to display anything remotely close to a "goal circuit", but we can still describe such systems as having a utility function and as being maximisers (with finite cognitive power) of that UF. But the UF, on our part, would be just a guess. I don't expect us to crack that with interpretability of neural networks learned by gradient descent.

What I meant to articulate was: the utility function and expected utility maximiser is a great framework to think about intelligent agents, but it's a theory put on top of the system; it doesn't need to be internal. In fact that system is incomputable (you need a hypercomputer to make the right decision).
