David Johnston



Where I agree and disagree with Eliezer

I agree with this. If the key idea is, for example, that optimising imitators generalise better than imitations of optimisers, or, to take a second example, that they pursue simpler goals, then it seems to me it would be better to draw distinctions based on generalisation or goal simplicity directly, rather than on optimising imitators versus imitations of optimisers.

Limits to Legibility

A person new to AI safety evaluating their arguments is roughly at a similar position to a Go novice trying to make sense of two Go grandmasters disagreeing about a board

I don't think the analogy is great, because Go grandmasters have actually played, lost and (critically) won a great many games of Go. This has two implications: first, I can easily check their claims of expertise. Second, they have had many chances to improve their gut-level understanding of how to play the game of Go well, and this kind of thing seems to me to be necessary to develop expertise.

How does one go about checking gut level intuitions about AI safety? It seems to me that turning gut intuitions into legible arguments that you and others can (relatively) easily check is one of the few tools we have, with objectively assessable predictions being another. Sure, both are hard, and it would be nice if we had easier ways to do it, but it seems to me that that's just how it is.

Where I agree and disagree with Eliezer

10. AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.


I think there will be substantial technical hurdles along the lines of getting AI systems that are highly capable in principle to reliably do what we want them to. These hurdles will probably be "commercially relevant" (i.e. not just a concern for X-risk researchers), and it's plausible (though far from certain) that they will slow the rate of AI progress.

One reason I think this might slow the rate of progress is that important parts of this work can't be delegated to systems that are merely highly capable in principle. One reason I think it might not slow progress much is that I don't currently see a good reason why most of this work couldn't be delegated to systems that are highly capable in practice.

Where I agree and disagree with Eliezer

I've written a few half-baked alignment takes for Less Wrong, and they seem to have mostly been ignored. I've since decided to either bake things fully, look for another venue, or not bother, and I'm honestly not particularly enthused about the fully bake option. I don't know if anything similar has had any impact on Sam's thinking.

why assume AGIs will optimize for fixed goals?

I'm not sure exactly how important goal-optimisation is. I think AIs are overwhelmingly likely to fail to act as if they were universally optimising for simple goals, compared to some counterfactual "perfect optimiser with equivalent capability", but this failure only matters if the dangerous behaviour is executed only by the perfect optimiser.

They're also very likely to act as if they are optimising for some simple goal X in circumstances Y under side conditions Z (Y and Z may not be simple) - in fact, they already do. This could easily be enough for dangerous behaviour, especially if in practice there's a lot of variation in X, Y and Z. Subject to restrictions imposed by Y and Z, instrumental convergence still applies.

A particular worry is if dangerous behaviour is easy. Suppose it's completely trivial: there's a game of Go where placing the right 5-stone sequence kills everybody and awards the stone placer the win. You have a smart AI (one that already "knows" how to win at Go, and about the 5-stone technique) that you want to play the game for you. You use some method to try to direct it to play games of Go. Unless it has a particular reason to ignore the 5-stone sequence, it will probably consider it among its top moves, even if it's simultaneously prone to getting distracted by butterflies, or if it misunderstands your request to be about playing Go only on sunny days. It just comes down to the fact that the 5-stone sequence is a strong move that's easy to know about.
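The point that a strong-but-dangerous move surfaces by default can be sketched in a few lines. Everything below is made up for illustration (the move names and strength scores are hypothetical, not from any real Go engine): any ranker that scores moves purely by estimated strength will put the dangerous sequence at the top unless something explicitly filters it out.

```python
# Toy illustration: a move-ranker that scores candidate moves purely by
# estimated strength surfaces a dangerous move whenever it is also strong.
# All move names and scores are hypothetical.
move_strength = {
    "solid_corner": 0.61,
    "centre_push": 0.58,
    "five_stone_sequence": 0.97,  # the hypothetical dangerous-but-winning line
    "distracted_by_butterfly": 0.05,
}

def top_moves(scores, k=3):
    """Return the k highest-scoring moves, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_moves(move_strength))
# The dangerous sequence only disappears if it is explicitly excluded:
safe = {m: s for m, s in move_strength.items() if m != "five_stone_sequence"}
print(top_moves(safe))
```

Nothing about the ranker needs to "want" the dangerous outcome; being good at scoring moves is enough for the dangerous one to be considered.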

AGI Ruin: A List of Lethalities

I think raw intelligence, while important, is not the primary factor that explains why humanity-as-a-species is much more powerful than chimpanzees-as-a-species. Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Slightly relatedly, I think it's possible that "causal inference is hard". The idea is: once someone has worked something out, they can share it and people can pick it up easily, but it's hard to figure the thing out to begin with - even with a lot of prior experience and efficient inference, most new inventions still need a lot of trial and error. Thus the reason the process of technology accumulation is gradual is, crudely, because causal inference is hard.

Even if this is true, one way things could still go badly is if most doom scenarios are locked behind a lot of hard trial and error, but the easiest one isn't. On the other hand, if causal inference is hard and doom scenarios do require substantial trial and error, then there could be meaningful safety benefits gained from censoring certain kinds of data.

AGI Ruin: A List of Lethalities

As I said (a few times!) in the discussion about orthogonality, indifference to the measure of "agents" that have particular properties seems crazy to me. Having an example of "agents" that behave in a particular way is enormously different from having an unproven claim that such agents might be mathematically possible.

AGI Ruin: A List of Lethalities

A Go AI that learns to play Go via reinforcement learning might not "have a utility function that only cares about winning Go". Using standard utility theory, you could observe its actions and try to rationalise them as if they were maximising some utility function, and the utility function you come up with probably wouldn't be "win every game of Go you start playing" (what you actually come up with will depend, presumably, on algorithmic and training-regime details). The reason the utility function is slippery is that the AI is fundamentally an adaptation executor, not a utility maximiser.
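The "rationalise its actions" exercise can be made concrete with a toy revealed-preference check. Everything below is a made-up illustration (the situations, outcome features, candidate utilities and observed choices are all hypothetical): given a log of (situation, chosen action) pairs, we ask which candidate utility functions would have chosen the observed action every time. For an adaptation executor, it is easy for the answer to be "none of the simple ones".

```python
# Toy revealed-preference check: which candidate utility functions are
# consistent with an agent's observed choices? (Illustrative only.)

# Hypothetical situations: each maps available actions to outcome features
# (win_prob, board_control), paired with the action the agent actually chose.
observations = [
    ({"a": (0.9, 0.2), "b": (0.5, 0.9)}, "a"),
    ({"c": (0.4, 0.8), "d": (0.6, 0.3)}, "c"),  # here it picks lower win_prob
]

candidate_utilities = {
    "win": lambda win, control: win,
    "control": lambda win, control: control,
    "mix": lambda win, control: 0.5 * win + 0.5 * control,
}

def consistent(utility, observations):
    """True iff the observed action maximises `utility` in every situation."""
    return all(
        chosen == max(actions, key=lambda a: utility(*actions[a]))
        for actions, chosen in observations
    )

fits = [name for name, u in candidate_utilities.items()
        if consistent(u, observations)]
print(fits)  # → [] : no simple candidate rationalises all the behaviour
```

With these (deliberately inconsistent) observations, no candidate fits, which is the sense in which the recovered utility function is "slippery": what you attribute to the agent depends heavily on which situations you happened to observe and which candidates you considered.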

AGI Ruin: A List of Lethalities

FWIW self-supervised learning can be surprisingly capable of doing things that we previously only knew how to do with "agentic" designs. From that link: classification is usually done with an objective + an optimization procedure, but GPT-3 just does it.

AGI Ruin: A List of Lethalities

My view is that if Yann continues to be interested in arguing about the issue then there's something to work with, even if he's skeptical. The real worry is if he's stopped talking to anyone about it (I have no idea what his current state of mind is).
