Jacob_Hilton

# Wiki Contributions

Sorted by

The LLM output looks correct to me.

Jacob_HiltonΩ350

Yes, I think the most natural way to estimate total surprise in practice would be to use sampling like you suggest. You could try to find the best explanation for "the model does $bad_thing with probability less than 1 in a million" (which you believe based on sampling) and then see how unlikely$bad_thing is according to the resulting explanation. In the Boolean circuit worked example, the final 23-bit explanation is likely still the best explanation for why the model outputs TRUE on at least 99% of inputs, and we can use this explanation to see that the model actually outputs TRUE on all inputs.

Another possible approach is analogous to fine-tuning. You could start by using surprise accounting to find the best explanation for "the loss of the model is L" (where L is estimated during training), which should incentivize rich explanations of the model's behavior in general. Then to estimate the probability that model does some rare \$bad_thing, you could "fine-tune" your explanation using an objective that encourages it to focus on the relevant tails of the distribution. We have more ideas about estimating the probability of events that are too rare to estimate via sampling, and have been considering objectives other than surprise accounting for this. We plan to share these ideas soon.

Yes, that's a clearer way of putting it in the case of the circuit in the worked example. The reason I said "for no apparent reason" is that there could be some redundancy in the explanation. For example, if you already had an explanation for the output of some subcircuit, you shouldn't pay additional surprise if you then check the output of that subcircuit in some particular case. But perhaps this was a distracting technicality.

I would say that they are motivated by the same basic idea, but are applied to different problems. The MDL (or the closely-related BIC) is a method for model selection given a dataset, whereas surprise accounting is a method for evaluating heuristic explanations, which don't necessarily involve model selection.

Take the Boolean circuit worked example: what is the relevant dataset? Perhaps it is the 256 (input, TRUE) pairs. But the MDL would select a much simpler model, namely the circuit that ignores the input and outputs TRUE (or "x_1 OR (NOT x_1)" if it has to consist of AND, OR and NOT gates). On the other hand, a heuristic explanation is not interested choosing a simpler model, but is instead interested in explaining why the model we have been given behaves in the way it does.

The heuristic explanations in the post do use a single prior or over the set of circuits, which we also call a "reference class". But we wish to allow explanations that use other reference classes, as well as explanations that combine multiple reference classes, and perhaps even explanations that use "subjective" reference classes that do not seem to correspond to any precise prior. These are the sorts of issues explored in the upcoming paper. Ultimately, though, a lot of our heuristic arguments and the surprise accounting for them remain somewhat ambiguous or informal.

Yes, the cost of 1 bit for the OR gate was based on the somewhat arbitrary choice to consider only OR and AND gates. A bit more formally, the heuristic explanations in the post implicitly use a "reference class" of circuits where each binary gate was randomly chosen to be either an OR or an AND, and each input wire to a binary gate was randomly chosen to have a NOT or not. The arbitrariness of this choice of reference class is one obstruction to formalizing heuristic explanations and surprise accounting. We are currently preparing a paper that explores this and related topics, but unfortunately the core issue remains unresolved.

We're removing nondisparagement clauses from our standard departure paperwork, and we're releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We'll communicate this message to former employees.

They have communicated this to me and I believe I was in the same category as most former employees.

I think the main reasons so few people have mentioned this are:

• As I mentioned, there is still some legal ambiguity and additional avenues for retaliation
• Some people are taking their time over what they want to say
• Most people don't want to publicly associate themselves with a controversial situation
• Most people aren't inclined to disparage their former employer anyway, and so they may not think of their own situation as that big of a deal
Jacob_Hilton13-19

Yeah I agree with this, and my original comment comes across too strongly upon re-reading. I wanted to point out some counter-considerations, but the comment ended up unbalanced. My overall view is:

• It was highly inappropriate for the company to have been issuing these agreements so widely, especially using such aggressive tactics and without allowing disclosure of the agreement, given the technology that it is developing.
• The more high-profile and credible a person is, the more damaging it is for this person to have been subject to the agreement.
• Nevertheless, it is a mistake to think of potential "disparagement" as part of the job duties of most of the people mentioned, and the post appears to wildly misinterpret the meaning of this term as "taking any actions which might make the company less valuable".
• Ultimately, it would have looked extremely bad for the company to enforce one of these agreements, so the primary effect of the contract comes down to how individuals felt that it constrained their behavior. We don't have great visibility into this. It's possible that some of these people felt quite constrained, and it's also possible that some of these people weren't even aware of the non-disparagement clause because they didn't notice it when they signed.
• Thankfully, most of this is now moot as the company has retracted the contract. I should emphasize that there may remain some legal ambiguity and additional avenues for retaliation, but I am optimistic that these will be cleaned up in the near future. There will still be non-disparagement agreements in place in cases where "the non-disparagement provision was mutual" (in the words of the company), but my strong guess is that this refers only to the original Anthropic departures and perhaps a handful of other individuals who were high up at the company.
• It remains important for people to disclose their financial interest in the company when appropriate, or in some cases give up this interest to avoid a conflict of interest.

Note: I have a financial interest in the company and was subject to one of these agreements until recently.

We were especially alarmed to notice that the list contains at least 12 former employees currently working on AI policy, and 6 working on safety evaluations. This includes some in leadership positions, for example:

I don't really follow this reasoning. If anything, playing a leadership role in AI policy or safety evaluations will usually give you an additional reason not to publicly disparage AI companies, to avoid being seen as partisan, making being subject to such an agreement less of an issue. I would be pretty surprised if such people subject to these agreements felt particularly constrained in what they could say as part of their official duties, although if I am wrong about this then it does seem like quite a concerning thing to have happened. The obvious exception to this is if the role involves unofficial public commentary about labs, but it's not obvious to me that this has been a big part of the role of any of the people on your list, and even then, they may not have felt especially constrained, depending on the individual. It's also worth noting that several of these roles require the holder to give up or donate lab equity to avoid any conflict of interest, regardless of any non-disparagement agreements.

I suspect the crux may be our differing interpretations of the agreement. I'm not sure where your interpretation that it prohibits "taking any actions which might make the company less valuable" comes from, maybe you could highlight the part of the agreement you are basing that on.

[This comment is no longer endorsed by its author]Reply

If the question is whether I think they were true at time given the information I have now, I think all of the individual points hold up except for the first and third "opinions". I am now less sure about what OpenAI leadership believed or cared about. The last of the "opinions" also seems potentially overstated. Consequently, the overall thrust now seems off, but I still think it was good to share my views at the time, to start a discussion.

If the question is about the state of the organization now, I know less about that because I haven't worked there in over a year. But the organization has certainly changed a lot since this post was written over 18 months ago.

Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.

My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.

Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.