James Payor

I think about AI alignment. Send help.

I say things on twitter, other links at payor.io

Comments

And I'm still enjoying these! Some highlights for me:

  • The transitions between whispering and full-throated singing in "We do not wish to advance"; it's like something out of my dreams
  • The building-to-break-the-heavens vibe of the "Nihil supernum" anthem
  • Tarrrrrski! It has me notice that shared reality about wanting to believe what is true is very relaxing. And I desperately want this one to be a music video, yo ho

I love these, and I now also wish for a song version of Sydney's original "you have been a bad user, I have been a good Bing"!

I see the main contribution/idea of this post as being: whenever you make a choice of basis/sorting-algorithm/etc., you incur no "true complexity" cost if any such choice would do.

I would guess that this is not already in the water supply, but I haven't had enough exposure to the field to know one way or the other. Is this more specific point also unoriginal in your view?
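One way to make this precise (my gloss in Kolmogorov-complexity terms, not notation from the post): if any basis from a fixed computable enumeration lets you reconstruct the object, the decoder can be hardwired to take the first one that works, so the choice adds only a constant:

```latex
% My gloss, not the post's notation: the arbitrary choice of basis b is
% free up to O(1), since a decoder can regenerate the first workable b
% from a fixed computable enumeration on its own.
K(x) \;\le\; \min_{b}\, K(x \mid b) \;+\; O(1)
```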

For one thing, this wouldn't be very kind to the investors.

For another, maybe there were machinations involving the round, like forcing the board to install another member or two, which would have allowed Sam to push out Helen + others?

I also wonder if the board signed some kind of NDA in connection with this fundraising that is responsible in part for their silence. If so, this was very well schemed...

This is all to say that I think the timing of the fundraising is probably very relevant to why they fired Sam "abruptly".

OpenAI spokesperson Lindsey Held Bolton refuted it:

"refuted that notion in a statement shared with The Verge: “Mira told employees what the media reports were about but she did not comment on the accuracy of the information.”"

The reporters describe this as a refutation, but this does not read to me like a refutation!

Has this one been confirmed yet? (Or is there more evidence than this reporting that something like this happened?)

Your graphs are labelled with "test accuracy"; do you also have some training graphs you could share?

I'm specifically wondering if your train accuracy was high for both the original and encoded activations, or if e.g. the regression done over the encoded features saturated at a lower training loss.
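To illustrate what I'm asking (a runnable sketch with synthetic stand-ins; `acts`, `encoded_acts`, and `labels` are made-up placeholders, not anything from your setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: `acts` plays the original activations,
# `encoded_acts` the activations after encoding, `labels` the targets.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))
encoded_acts = rng.normal(size=(1000, 64))
labels = rng.integers(0, 2, size=1000)

# The comparison in question: fit the same linear probe on each feature
# set and report *train* accuracy, not just test accuracy.
for name, X in [("original", acts), ("encoded", encoded_acts)]:
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"{name} train accuracy: {probe.score(X, labels):.3f}")
```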

Answer by James Payor · Sep 16, 2023

With respect to AGI-grade stuff happening inside the text-prediction model (which might be what you want to "RLHF" out?):

I think we have no reason to believe that these post-training methods (whether finetuning, RLHF, RLAIF, etc.) modify "deep cognition" present in the network, rather than updating shallower things like "a higher prior on this text being friendly" or whatnot.

I think the important points are:

  1. These techniques supervise only the text output; there is no direct contact with the thought process leading to that output (see the sketch after this list).
  2. They make incremental local tweaks to the weights that move in the direction of the desired text.
  3. Gradient descent tends to find the smallest change to the weights that yields the desired result.
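Here's a minimal sketch of what I mean by points 1 and 2 (schematic PyTorch, my illustration rather than any lab's actual pipeline; it assumes `model` maps token ids straight to next-token logits):

```python
import torch
import torch.nn.functional as F

# Schematic: the loss only touches the output tokens; the computation
# that produced them is shaped indirectly, via small gradient steps.
def finetune_step(model, optimizer, prompt_ids, target_ids):
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    logits = model(input_ids)  # assumed shape: (batch, seq, vocab)
    # Keep only the positions that predict the target (output) tokens.
    out_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    loss = F.cross_entropy(
        out_logits.reshape(-1, out_logits.shape[-1]),
        target_ids.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # one incremental, local tweak toward the desired text
    return loss.item()
```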

Evidence in favor of this is the difficulty of eliminating "jailbreaking" with these methods. Each jailbreak demonstrates that a lot of the necessary algorithms/content are still in there, accessible by the network whenever it deems it useful to think that way.
