James Payor

I spend time with my budding family, and think about AI alignment.

Some other links at payor.io

Comments

Awesome, thanks for writing this up!

I very much like how you are giving a clear account of a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".

(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)

I tried to formalize this, using $A \to B$ as a "poor man's counterfactual", standing in for "if Alice cooperates then so does Bob". This has the odd behaviour of becoming "true" when Alice defects! You can see this as the counterfactual collapsing and becoming inconsistent, because its premise is violated. But this does mean we need to be careful about using these.

For technical reasons we upgrade to $\Box A \to B$, which says "if Alice cooperates in a legible way, then Bob cooperates back". Alice tries to prove this, and legibly cooperates if so.

This setup gives us "Alice legibly cooperates if she can prove that, if she legibly cooperates, Bob would cooperate back". In symbols, $\Box(\Box A \to B) \to A$.

Now, is this okay? What about proving $\neg\Box A$?

Well, actually you can't ever prove that! Because of Löb's theorem.
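To spell out the Löb step, here is a short derivation sketch. It assumes $A$ stands for "Alice cooperates", $\Box$ for provability in a consistent system with the usual necessitation rule, and that the statement in question is $\neg\Box A$ (the notation is an assumption, chosen to match the prose).

```latex
\begin{align*}
&\vdash \neg\Box A && \text{(assumed, aiming for a contradiction)} \\
&\vdash \Box A \to A && \text{(the premise is refuted, so the implication holds vacuously)} \\
&\vdash A && \text{(L\"ob's theorem)} \\
&\vdash \Box A && \text{(necessitation)} \\
&\vdash \bot && \text{(from } \Box A \text{ and } \neg\Box A \text{)}
\end{align*}
```

So a consistent system never proves $\neg\Box A$, whatever $A$ is.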

Outside the system we can definitely see cases where $A$ is unprovable, e.g. because Bob always defects. But you can't prove this inside the system. You can only prove things like "$\neg\Box_k A$" for finite proof lengths $k$.

I think this is best seen as a consequence of "with finite proof strength you can only deny proofs up to a limited size".

So this construction works out, but perhaps just because two different weirdnesses are canceling each other out. But in any case I think the underlying idea, "cooperate if choosing to do so leads to a good outcome", is pretty trustworthy. It perhaps deserves to be cashed out in better provability math.

(Thanks also to you for engaging!)

Hm. I'm going to take a step back, away from the math, and see if that makes things less confusing.

Let's go back to Alice thinking about whether to cooperate with Bob. They both have perfect models of each other (perhaps in the form of source code).

When Alice goes to think about what Bob will do, maybe she sees that Bob's decision depends on what he thinks Alice will do.

At this juncture, I don't want Alice to "recurse", falling down the rabbit hole of "Alice thinking about Bob thinking about Alice thinking about--" and so on.

Instead Alice should realize that she has a choice to make, about who she cooperates with, which will determine the answers Bob finds when thinking about her.

This manoeuvre is doing a kind of causal surgery / counterfactual-taking. It cuts the loop by identifying "what Bob thinks about Alice" as a node under Alice's control. This is the heart of it, and imo doesn't rely on anything weird or unusual.

For the setup $\Box(\Box E \to E) \to E$, it's a bit more like: each member cooperates if they can prove that a compelling argument for "everyone cooperates" is sufficient to ensure "everyone cooperates".

Your second line seems right though! If there were provably no argument for straight up "everyone cooperates", i.e. $\Box(\neg\Box E)$, this implies $\Box(\Box E \to E)$ and therefore $E$, a contradiction.
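As a sketch of that chain of implications: assume $E$ stands for "everyone cooperates", $\Box$ for provability with the usual necessitation rule, and the setup is formalised as $\Box(\Box E \to E) \to E$ (this formalisation is an assumption read off the prose description).

```latex
\begin{align*}
&\vdash \neg\Box E && \text{(provably no argument for } E \text{)} \\
&\vdash \Box E \to E && \text{(vacuous, since the premise is refuted)} \\
&\vdash \Box(\Box E \to E) && \text{(necessitation)} \\
&\vdash E && \text{(the setup } \Box(\Box E \to E) \to E \text{)} \\
&\vdash \Box E && \text{(necessitation, contradicting } \neg\Box E \text{)}
\end{align*}
```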

--

Also I think I'm a bit less confused here these days, and in case it helps:

Don't forget that "$\Box X$" means "a proof of any size of $X$", which is kinda crazy, and can be responsible for things not lining up with your intuition. My hot take is that Löb's theorem / incompleteness says "with finite proof strength you can only deny proofs up to a limited size, on pain of diagonalization". Which is way saner than the usual interpretation!

So idk, especially in this context I think it's a bad idea to throw out your intuition when the math seems to say something else, since the mismatch probably comes down to some subtlety in this formalization of provability/meta-mathematics. And I presently think the quirks of provability logic are often bugs due to bad choices in the formalism.

Yeah I think my complaint is that OpenAI seems to be asserting almost a "boundary" re goal (B), like there's nothing that trades off against staying at the front of the race, and they're willing to pay large costs rather than risk being the second-most-impressive AI lab. Why? Things don't add up.

(Example large cost: they're not putting much organizational attention on the alignment problem. The alignment team projects don't have many people working on them, they're not doing things like inviting careful thinkers to evaluate their plans under secrecy, or taking any of a bunch of other obvious actions that come from putting serious resources into not blowing everyone up.)

I don't buy that (B) is that important. It seems more driven by some strange status / narrative-power thing? And I haven't ever seen them make an explicit case for why they're sacrificing so much for (B). Especially when a lot of their original safety people fucking left due to some conflict around this?

Broadly many things about their behaviour strike me as deceptive / making it hard to form a counternarrative / trying to conceal something odd about their plans.

One final question: why do they say "we think it would be good if an international agency limited compute growth" but not also "and we will obviously be trying to partner with other labs to do this ourselves in the meantime, although not if another lab is already training something more powerful than GPT-4"?

I kinda reject the energy of the hypothetical? But I can speak to some things I wish I saw OpenAI doing:

  1. Having some internal sense amongst employees about whether they're doing something "good" given the stakes, like Google's old "don't be evil" thing. Have a culture of thinking carefully about things and managers taking considerations seriously, rather than something more like management trying to extract as much engineering as quickly as possible without "drama" getting in the way.

    (Perhaps they already have a culture like this! I haven't worked there. But my prediction is that they don't, and the org has a more "extractive" relationship to its employees. I think that this is bad, causes working toward danger, and exacerbates bad outcomes.)
     
  2. To the extent that they're trying to have the best AGI tech in order to provide "leadership" of humanity and AI, I want to see them be less shady / marketing / spreading confusion about the stakes.

    They worked to pervert the term "alignment" to be about whether you can extract more value from their LLMs, and distract from the idea that we might make digital minds that are copyable and improvable, while also large and hard to control. (While pushing directly on AGI designs that have the "large and hard to control" property, which I guess they're denying is a mistake, but anyhow.)

    I would like to see fewer things perverted/distracted/confused, like it's according-to-me entirely possible for them to state more clearly what the end of all this is, and be more explicit about how they're trying to lead the effort.
     
  3. Reconcile with Anthropic. There is no reason, speaking on humanity's behalf, to risk two different trajectories of giant LLMs built with subtly different technology, while dividing up the safety know-how between the two organizations.

    Furthermore, I think OpenAI kind-of stole/appropriated the scaling idea from the Anthropic founders, who left when they lost a political battle about the direction of the org. I suspect it was a huge fuck-you when OpenAI tried to spread this secret to the world, and continued to grow their org around it, while ousting the originators. If my model is at-all-accurate, I don't like it, and OpenAI should look to regain "good standing" by acknowledging this (perhaps just privately), and looking to cooperate.

    Idk, maybe it's now legally impossible/untenable for the orgs to work together, given the investors or something? Or given mutual assumption of bad-faith? But in any case this seems really shitty.

I also mentioned some other things in this comment.

I really should have something short to say, that turns the whole argument on its head, given how clear-cut it seems to me. I don't have that yet, but I do have some rambly things to say.

I basically don't think overhangs are a good way to think about things, because the bridge that connects an "overhang" to an outcome like "bad AI" seems flimsy to me. I would like to see a fuller explication some time from OpenAI (or a suitable steelman!) that can be critiqued. But here are some of my thoughts.

The usual argument that leads from "overhang" to "we all die" has some imaginary other actor who is scaling up their methods with abandon at the end, killing us all because it's not hard to scale and they aren't cautious. This is then used to justify scaling up your own method with abandon, hoping that we're not about to collectively fall off a cliff.

For one thing, the hype and work being done now is making this problem a lot worse at all future timesteps. There was (and still is) a lot people need to figure out regarding effectively using lots of compute. (For instance, architectures that can be scaled up, training methods and hyperparameters, efficient compute kernels, putting together datacenters and interconnect, data, etc etc.) Every chipmaker these days has started working on things with a lot of memory right next to a lot of compute with a tonne of bandwidth, tailored to these large models. These are barriers-to-entry that it would have been better to leave in place, if one was concerned with rapid capability gains. And just publishing fewer things and giving out fewer hints would have helped.

Another thing: I would take the whole argument as being more in good-faith if I saw attempts being made to scale up anything other than capabilities at high speed, or signs that made it seem at all likely that "alignment" might be on track. Examples:

  • A single alignment result that was supported by a lot of OpenAI staff. (Compare and contrast the support that the alignment team's projects get to what a main training run gets.)
  • Any focus on trying to claw cognition back out of the giant inscrutable floating-point numbers, into a domain easier to understand, rather than pouring more power into the systems that get much harder to inspect as you scale them. (Failure to do this suggests OpenAI and others are mostly just doing what they know how to do, rather than grappling with navigating us toward better AI foundations.)
  • Any success in understanding how shallow vs deep the thinking of the LLMs is, in the sense of "how long a chain of thoughts/inferences can it make as it composes dialogue", and how this changes with scale. (Since the whole "LLMs are safer" thing relies on their thinking being coupled to the text they output; otherwise you're back in giant inscrutable RL agent territory)
  • The delta between "intelligence embedded somewhere in the system" and "intelligence we can make use of" looking smaller than it does. (Since if our AI gets to use more of its intelligence than we do, and this gets worse as we scale, this looks pretty bad for the "use our AI to tame the AI before it's too late" plan.)

Also I can't make this point precisely, but I think there's something like "capabilities progress just leaves more digital fissile material lying around the place", especially when published and hyped. And if you don't want "fast takeoff", you want less fissile material lying around, lest it get assembled into something dangerous.

Finally, to more directly talk about LLMs, my crux for whether they're "safer" than some hypothetical alternative is about how much of the LLM "thinking" is closely bound to the text being read/written. My current read is that they're more like doing free-form thinking inside, which tries to concentrate probability mass on the right prediction. As we scale that up, I worry that any "strange competence" we see emerging is due to the LLM having something like a mind inside, and less due to it having accrued more patterns.

As usual, the part that seems bonkers crazy is where they claim the best thing they can do is keep making every scrap of capabilities progress they can. Keep making AI as smart as possible, as fast as possible.

"This margin is too small to contain our elegant but unintuitive reasoning for why". Grump. Let's please have a real discussion about this some time.

(Edit: others have made this point already, but anyhow)

My main objection to this angle: self-improvements do not necessarily look like "design a successor AI to be in charge". They can look more like "acquire better world models", "spin up more copies", "build better processors", "train lots of narrow AI to act as fingers", etc.

I don't expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.

Is the following an accurate summary?

The agent is built to have a "utility function" input that the humans can change over time, and a probability distribution over what the humans will ask for at different time steps, and maximizes according to a combination of the utility functions it anticipates across time steps?
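To make the summary concrete, here is a minimal sketch of one way to read it. Everything in it is an assumption for illustration: the names `expected_score` and `choose_action`, the toy actions and utility functions, and the choice of a weighted sum as the "combination" are all hypothetical, not taken from the post being summarized.

```python
from typing import Callable, Dict, Sequence

Utility = Callable[[str], float]

def expected_score(action: str,
                   beliefs: Sequence[Dict[str, float]],
                   utilities: Dict[str, Utility],
                   weights: Sequence[float]) -> float:
    """Weighted sum over time steps of the expected utility of `action`.

    beliefs[t] maps a utility-function name to the probability that the
    humans will have asked for it at time step t (assumed to sum to 1).
    """
    total = 0.0
    for weight, belief in zip(weights, beliefs):
        total += weight * sum(p * utilities[name](action)
                              for name, p in belief.items())
    return total

def choose_action(actions: Sequence[str],
                  beliefs: Sequence[Dict[str, float]],
                  utilities: Dict[str, Utility],
                  weights: Sequence[float]) -> str:
    """Pick the action with the highest combined expected utility."""
    return max(actions,
               key=lambda a: expected_score(a, beliefs, utilities, weights))

# Hypothetical toy setup: two utility functions the humans might ask for.
utilities = {
    "tea": lambda a: 1.0 if a == "boil_water" else 0.0,
    "coffee": lambda a: {"grind_beans": 1.0, "boil_water": 0.5}.get(a, 0.0),
}
beliefs = [
    {"tea": 1.0},                 # now: the humans want tea
    {"tea": 0.4, "coffee": 0.6},  # later: they might switch to coffee
]
weights = [1.0, 1.0]              # equal weight on each time step
actions = ["boil_water", "grind_beans"]

best = choose_action(actions, beliefs, utilities, weights)
```

Under this reading the agent already acts on utility functions it only anticipates, which is the feature the question is probing.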
