LawrenceC

# Wiki Contributions

In a sense, I think all math papers that focus on definitions (as opposed to proofs) feel like this.

I suspect one of the reasons the OP feels dissatisfied with the corrigibility paper is that it is not the equivalent of Shannon's seminal results, which generally gave the correct definitions of terms; instead, it merely gestures at a problem ("we have no idea how to formalize corrigibility!").

That being said, I resonate a lot with this part of the reply:

> Proofs [in conceptual/definition papers] are correct but trivial, so definitions are the real contribution, but applicability of definitions to the real world seems questionable. Proof-focused papers feel different because they are about accepted definitions whose applicability to the real world is not in question.

I feel like most of the barrier in practice to people "coordinating" in the relevant ways is people not knowing what other people are doing. And a big reason for this is that write-ups are really hard to produce, especially if you have high standards and don't want to ship.

And yeah, better communication tech in general would be good, but I'm not sure how to start on that (whereas it's pretty obvious what a few candidate steps toward making posts/papers easier to write/ship would look like).

My guess is most of the value in coordination work here is either in making posts/papers easier to write or ship, or in discovering new good researchers?

If you have that belief, I imagine this paper should update you more towards AI capabilities.

I do believe David Chapman's tweet though! I don't think you can just hotwire together a bunch of modules that are superhuman only in narrow domains, and get a powerful generalist agent, without doing a lot of work in the middle.

(That being said, I don't count gluing a Python interpreter and a retrieval mechanism onto a fine-tuned GPT-3 or whatever as falling into this category; here the work is done by GPT-3 (a generalist agent) and the other parts primarily augment its capabilities.)

I don't see why it should update my beliefs a non-negligible amount. I expected techniques like this to work for a wide variety of specific tasks given enough effort (indeed, stacking together five different techniques into a specialist agent is what a lot of academic work in robotics looks like). I also think that the way people can compose text-davinci-002 or other LMs with themselves into more generalist agents should basically screen off this evidence, even if you weren't expecting to see it.

This seems orthogonal to my main claim (roughly: if you do this at a large scale then it starts becoming net negative due to lower quality).

Fair. I think I failed to address this point entirely.

I do think there's a nonzero number of people who would not be that good at novel alignment research but would still be good at the tasks mentioned here. However, I agree that there isn't a scalable intervention here, or at least not more so than standard AI alignment research (especially when compared to some approaches like the brute-force mechanistic interp many people are doing).

(I interpreted the OP as saying that you convince AGI researchers who are not (currently) working on safety. I think a good steelman + critique of RRM wouldn't have much effect on that population, though I think it's pretty plausible I'm wrong about that because the situation at OpenAI is different from DeepMind.)

Yeah, I also messed up here -- I think this would plausibly have little effect on that population. I do think that a good answer to "why does RLHF not work" would help a nonzero amount, though.

> As a person at a lab I'm currently voting for less coordination of this sort, not more

Agree that it's not scalable, but could you share why you'd vote for less?

> As far as I can tell, the AI has no specialized architecture for deciding about its future strategies or giving semantic meaning to its words. It outputting the string "I will keep Gal a DMZ" does not have the semantic meaning of it committing to keep troops out of Gal. It's just the phrase players that are most likely to win use in that board state with its internal strategy.

This is incorrect; they use "honest" intentions to learn a model of message > intention, then use this model to annotate all the other messages with intentions, which they then use to train the intention > message map. So the model has a strong bias toward being honest in its intention > message map. (The authors even say that one issue with the model is its tendency to spill too many of its plans to its enemies!)

The reason an honest intention > message map doesn't lead to a fully honest agent is that the search procedure that goes from message + history > intention can "change its mind" about what the best intention is.
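To make the data flow concrete, here is a toy sketch of the three-stage pipeline described above. All names are hypothetical, and plain dicts stand in for the neural models Cicero actually trains; the point is only the flow (honest data → message > intention model → annotated corpus → intention > message model), not the real implementation.

```python
def train_message_to_intention(honest_pairs):
    """Stage 1: learn a message -> intention map from honestly labeled data."""
    return {msg: intent for msg, intent in honest_pairs}

def annotate(model, messages):
    """Stage 2: tag every message in the wider corpus with an inferred intention."""
    return [(model.get(msg, "unknown"), msg) for msg in messages]

def train_intention_to_message(annotated):
    """Stage 3: learn the inverse intention -> message map from the annotations."""
    inverse = {}
    for intent, msg in annotated:
        inverse.setdefault(intent, msg)  # keep the first message seen per intention
    return inverse

# Hypothetical data for illustration only.
honest = [("I will keep Gal a DMZ", "demilitarize_Gal")]
corpus = ["I will keep Gal a DMZ", "Let's ally against Turkey"]

m2i = train_message_to_intention(honest)
i2m = train_intention_to_message(annotate(m2i, corpus))
# Because every training pair links an intention to a message that honestly
# expressed it, the learned intention -> message map is biased toward honesty.
print(i2m["demilitarize_Gal"])  # -> "I will keep Gal a DMZ"
```

The honesty bias falls out of the data, not the architecture: the intention > message model only ever sees (intention, honest message) pairs, so it has no incentive to learn deceptive phrasings.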

> Like chess grandmasters being outperformed by a simple search tree when it was supposed to be the peak of human intelligence, I think this will have the same effect of disenchanting the game of diplomacy.

This is correct; every time AI systems reach a milestone earlier than expected, this is simultaneously an update upward on AI progress being faster than expected, and an update downward on the difficulty of the milestone.
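The two simultaneous updates can be shown with a toy Bayesian calculation (all numbers made up for illustration): put independent 50/50 priors on "progress is fast vs. slow" and "the milestone is hard vs. easy", assume early success is likelier when progress is fast or the milestone is easy, and condition on observing the milestone early.

```python
# Joint prior over (progress speed, milestone difficulty), assumed independent.
priors = {
    ("fast", "hard"): 0.5 * 0.5,
    ("fast", "easy"): 0.5 * 0.5,
    ("slow", "hard"): 0.5 * 0.5,
    ("slow", "easy"): 0.5 * 0.5,
}
# Assumed likelihood of hitting the milestone early under each hypothesis.
likelihood = {
    ("fast", "hard"): 0.3,
    ("fast", "easy"): 0.9,
    ("slow", "hard"): 0.05,
    ("slow", "easy"): 0.4,
}

# Bayes: posterior proportional to prior * likelihood, then normalize.
joint = {h: priors[h] * likelihood[h] for h in priors}
z = sum(joint.values())
posterior = {h: p / z for h, p in joint.items()}

p_fast = posterior[("fast", "hard")] + posterior[("fast", "easy")]
p_hard = posterior[("fast", "hard")] + posterior[("slow", "hard")]
print(round(p_fast, 3), round(p_hard, 3))  # fast goes up from 0.5, hard goes down
```

With these numbers, P(fast) rises from 0.5 to about 0.73 while P(hard) falls from 0.5 to about 0.21: one observation moves both beliefs at once, in opposite directions.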

Yeah, I agree that a lot of the “phase transitions” look more discontinuous than they actually are due to the log on the x axis — the OG grokking paper definitely commits this sin, for example.
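A bit of arithmetic (with made-up step counts) shows how a log x-axis compresses a gradual transition: a change unfolding over as many training steps as the entire run-up occupies only a sliver of the axis.

```python
import math

# Suppose a "transition" unfolds between step 1,000 and step 2,000.
# That is 1,000 training steps, the same linear width as steps 1..1,000,
# but on a log10 axis it spans only log10(2) of a decade, versus the
# 3 full decades occupied by steps 1..1,000.
transition_width = math.log10(2000) - math.log10(1000)  # ~0.301 decades
warmup_width = math.log10(1000) - math.log10(1)         # 3.0 decades
print(round(transition_width, 3), round(warmup_width, 1))
```

So a transition that is quite gradual in wall-clock training steps renders as a near-vertical jump next to the stretched-out early training, which is exactly the visual sin being described.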

(I think there’s also another disagreement here about how close humans are to this natural limit.)

I don't think that Cicero is a general agent made by gluing together superhuman narrow agents! It's not clear that any of its components are superhuman in a meaningful sense.

I also don't think that "you can't just copy-paste together a bunch of systems that are superhuman..." is a fair summary of David Chapman's tweet! I think his tweet is specifically pointing out that giving your components suggestive names and drawing arrows between them does not do the hard work of building your generalist agent (which is far more involved).