12d

OpenAI spokesperson Lindsey Held Bolton refuted it:

"refuted that notion in a statement shared with The Verge: “Mira told employees what the media reports were about but she did not comment on the accuracy of the information.”"

The reporters describe this as a refutation, but this does not read to me like a refutation!

Has this one been confirmed yet? (Or is there more evidence than this reporting that something like this happened?)

18d

Your graphs are labelled with "test accuracy", do you also have some training graphs you could share?

I'm specifically wondering if your train accuracy was high for both the original and encoded activations, or if e.g. the regression done over the encoded features saturated at a lower training loss.

Sep 16, 2023

With respect to AGI-grade stuff happening inside the text-prediction model (which might be what you want to "RLHF" out?):

I think we have no reason to believe that these post-training methods (be it finetuning, RLHF, RLAIF, etc) modify "deep cognition" present in the network, rather than updating shallower things like "higher prior on this text being friendly" or whatnot.

I think the important points are:

- These techniques supervise only the text output. There is no direct contact with the thought process leading to that output.
- They make incremental local tweaks to the weights that move in the direction of the desired text.
- Gradient descent prefers to find the smallest changes to the weights that yield the result.
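The third point can be illustrated with a toy linear model. This is only a sketch under strong assumptions: a single training example, a squared-error loss, and made-up "pretrained" weights `w0`. In this setting, gradient descent lands on the solution closest to where it started, which is one concrete sense of "smallest change to the weights". Whether the same picture carries over to large networks is an assumption, not something the code shows.

```python
def gd_underdetermined(w0, x, y, lr=0.1, steps=200):
    """Gradient descent on 0.5 * (w.x - y)^2 for a single example."""
    w = list(w0)
    for _ in range(steps):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        # The gradient of the loss with respect to w is residual * x,
        # so every update moves w along the direction of x only.
        w = [wi - lr * residual * xi for wi, xi in zip(w, x)]
    return w


def closest_solution(w0, x, y):
    """Point of {w : w.x = y} nearest to w0 (orthogonal projection)."""
    residual = y - sum(wi * xi for wi, xi in zip(w0, x))
    scale = residual / sum(xi * xi for xi in x)
    return [wi + scale * xi for wi, xi in zip(w0, x)]


w0 = [3.0, -1.0]  # hypothetical "pretrained" weights
x = [1.0, 2.0]    # one training input
y = 6.0           # desired output; many weight vectors satisfy w.x = y

w_gd = gd_underdetermined(w0, x, y)
w_near = closest_solution(w0, x, y)
print(w_gd, w_near)
```

Many weight vectors fit the target here, but gradient descent from `w0` converges to the one nearest `w0`, leaving every direction orthogonal to `x` untouched.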

Evidence in favor of this is the difficulty of eliminating "jailbreaking" with these methods. Each jailbreak demonstrates that a lot of the necessary algorithms/content are still in there, accessible by the network whenever it deems it useful to think that way.

5mo

Spinoza suggested that we first

passively accept a proposition in the course of comprehending it, and only afterward actively disbelieve propositions which are rejected by consideration.

Some distinctions that might be relevant:

1. Parsing a proposition into your ontology, understanding its domains of applicability, implications, etc.
2. Having a sense of what it might be like for another person to believe the proposition, what things it implies about how they're thinking, etc.
3. Thinking the proposition is true, believing its implications in the various domains where its assumptions hold, etc.

If you ask me what in my experience corresponds to a feeling of "passively accepting a proposition" when someone tells me one, I think I'm doing a bunch of (1) and (2). This does feel like "accepting" or "taking in" the proposition, and can change how I see things if it works.

6mo

Awesome, thanks for writing this up!

I very much like how you are giving a clear account of a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".

(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)

I tried to formalize this, using $A \to B$ as a "poor man's counterfactual", standing in for "if Alice cooperates then so does Bob". This has the odd behaviour of becoming "true" when Alice defects! You can see this as the counterfactual collapsing and becoming inconsistent, because its premise is violated. But this does mean we need to be careful about using these.

For technical reasons we upgrade to $\Box A \to B$, which says "if Alice cooperates in a *legible way*, then Bob cooperates back". Alice tries to prove this, and legibly cooperates if so.

This setup gives us "Alice legibly cooperates if she can prove that, if she legibly cooperates, Bob would cooperate back". In symbols, $\Box(\Box A \to B) \to A$.

Now, is this okay? What about proving $\neg\Box A$?

Well, actually you can't ever prove that! Because of Löb's theorem.

Outside the system we can definitely see cases where $A$ is unprovable, e.g. because Bob always defects. But you can't prove this inside the system. You can only prove things like "$\neg\Box_k A$" for finite proof lengths $k$.

I think this is best seen as a consequence of "with finite proof strength you can only deny proofs up to a limited size".
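The appeal to Löb's theorem can be spelled out as a short derivation. This is a sketch in notation assumed here, not necessarily the notation of the original argument: $\Box P$ reads "$P$ is provable in the system", $A$ is "Alice cooperates", and $\vdash$ is provability in a system satisfying the usual derivability conditions.

```latex
\begin{align*}
&\vdash \bot \to A
  && \text{(ex falso)} \\
&\vdash \Box\bot \to \Box A
  && \text{(necessitation and distribution)} \\
&\vdash \neg\Box A \to \neg\Box\bot
  && \text{(contrapositive)} \\
&\text{If } \vdash \neg\Box A \text{, then } \vdash \Box\bot \to \bot,
  \text{ so by L\"ob's theorem } \vdash \bot.
\end{align*}
```

So a consistent system never proves that Alice's cooperation is unprovable. At best it can rule out proofs up to some bounded length.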

So this construction works out, but perhaps just because two different weirdnesses are canceling each other out. But in any case I think the underlying idea, "cooperate if choosing to do so leads to a good outcome", is pretty trustworthy. It perhaps deserves to be cashed out in better provability math.

(Thanks also to you for engaging!)

Hm. I'm going to take a step back, away from the math, and see if that makes things less confusing.

Let's go back to Alice thinking about whether to cooperate with Bob. They both have perfect models of each other (perhaps in the form of source code).

When Alice goes to think about what Bob will do, maybe she sees that Bob's decision depends on what he thinks Alice will do.

At this juncture, I don't want Alice to "recurse", falling down the rabbit hole of "Alice thinking about Bob thinking about Alice thinking about--" and so on.

Instead Alice should realize that she has a choice to make, about who she cooperates with, which will determine the answers Bob finds when thinking about her.

This manoeuvre is doing a kind of causal surgery / counterfactual-taking. It cuts the loop by identifying "what Bob thinks about Alice" as a node under Alice's control. This is the heart of it, and imo doesn't rely on anything weird or unusual.
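As a toy illustration of cutting the loop, here is a minimal sketch. Everything in it is an assumption made for the example: the payoff numbers, the two hypothetical Bobs, and the simplification that Bob's decision is a plain function of what he concludes Alice will do.

```python
# Alice's payoffs in a one-shot prisoner's dilemma, keyed by
# (alice_action, bob_action). The numbers are assumed toy values.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def mirror_bob(alice_action):
    """A hypothetical Bob who cooperates exactly when Alice does."""
    return alice_action


def defect_bob(alice_action):
    """A hypothetical Bob who defects no matter what Alice does."""
    return "D"


def alice_decides(bob_policy):
    """Pick Alice's action by intervening on her own choice.

    Alice does not simulate Bob simulating Alice. She treats "what Bob
    concludes about Alice" as a node she sets, plugs in each candidate
    action, and keeps the one with the best resulting payoff.
    """
    return max("CD", key=lambda a: PAYOFF[(a, bob_policy(a))])


print(alice_decides(mirror_bob))
print(alice_decides(defect_bob))
```

Against the mirroring Bob, cooperating yields 3 and defecting yields 1, so Alice cooperates. Against the always-defecting Bob, defecting is better, so she defects. At no point does the evaluation recurse.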

For one thing, this wouldn't be very kind to the investors.

For another, maybe there were some machinations involving the round, like forcing the board to install another member or two, which would allow Sam to push out Helen + others?

I also wonder if the board signed some kind of NDA in connection with this fundraising that is responsible in part for their silence. If so this was very well schemed...

This is all to say that I think the timing of the fundraising is probably very relevant to why they fired Sam "abruptly".