Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html


Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future



I think you'd want to put more effort into making the articles more realistic. These are too short and they use fictional people. I suggest taking real articles from recent news events and modifying them to be about GPT-4/Llama/Claude/etc.

It's all relative. ARC's security was way stronger than the security GPT-4 had before and after ARC's evals. So for GPT-4, beginning ARC testing was like a virus moving from a wet market somewhere to a BSL-4 lab, in terms of relative level of oversight/security/etc. I agree that ARC's security could still be improved -- and they fully intend to do so.

I agree that IF GPT-4 had goals, it could work towards them even if it couldn't pass the ARC eval. However, I think that if it had goals, it probably would have been able to pass the ARC eval. That's my secondary argument though; my main argument is just that its training process probably didn't incentivize it to have (long-term) goals. I'd ask: Suppose GPT-4 had a bunch of commonly-used internal circuitry that basically was asking questions like "what can I output to get more paperclips?" and "is my current best guess output leaving paperclips on the table, so to speak?" etc. How helpful or harmful would that internal circuitry be? I surmise that it would almost always be more trouble than it is worth, because it would be worth approximately zero. Now, circuitry along the lines of "what is the most likely continuation of this text?" on the other hand, seems pretty useful for lowering training loss, so probably it's got lots of circuitry like that.

I think this hypothesis does not deserve scorn. However I think it is pretty unlikely, less than 1%. My main argument would be that the training process for GPT-4 did not really incentivize such coherent strategic/agentic behavior. (LMK if I'm wrong about that!) My secondary argument would be that ARC did have fine-tuning access to smaller models (IIRC, maybe I'm misremembering) and they still weren't able to 'pull themselves together' enough to pass the eval. They haven't fine-tuned GPT-4 yet but assuming the fine-tuned GPT-4 continues to fail the eval, that would be additional evidence that the capability to pass just doesn't exist yet (as opposed to, it exists but GPT-4 is cleverly resisting exercising it). I guess my tertiary argument is AGI timelines -- I have pretty short timelines but I think a world where GPT-4 was already coherently agentic would look different from today's world, probably; the other AI projects various labs are doing would be working better by now.

To be clear I'm not sure this is possible, it may be fundamentally confused.

I think Steven Byrnes made my point but better: The intuition I was trying to get at is that it's possible to have an intelligent system which is applying its intelligence to avoid deception, as well as applying intelligence to get local goals. So it wouldn't be fair to characterize it as "rigorously search the solution space for things that work to solve this problem, but ignore solutions that classify as deception" but rather as "rigorously search the solution space for things that work to solve this problem without being deceptive." This system would be very well aware of the true fact that deception is useful for achieving local goals; however, its global goals would penalize deception and so deception is not useful for achieving its global goals. It might have a deception classifier which can be gamed, but 'gaming the deception classifier' would trigger the classifier and so the system would be actively applying its intelligence to reduce the probability that it ends up gaming the deception classifier -- it would be thinking about ways to improve the classifier, it would be cautious about strategies (incl. super-rigorous searches through solution space) that seem likely to game the classifier, etc.

Analogy (maybe not even an analogy): Suppose you have some humans who are NOT consequentialists. They are deontologists; they think that there are certain rules they just shouldn't break, full stop, except in crazy circumstances maybe. They are running a business. Someone proposes the plan: "Aha, these pesky rules, how about we reframe what we are doing as a path through some space of nodes, and then brute search through the possible paths, and we commit beforehand to hiring contractors to carry out whatever steps this search turns up. That way we aren't going to do anything immoral, all we are doing is subcontracting out to this search process + contractor setup." Someone else: "Hmm, but isn't that just a way to get around our constraints? Seems bad to me. We shouldn't do that unless we have a way to also verify that the node-path doesn't involve asking the contractor to break the rules."
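To make the analogy concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the graph, the action names, the `FORBIDDEN` set, and the helper names `shortest_plan` and `follows_rules` are all hypothetical. It only shows the difference between handing the decision to a blind search and having the search check each step against the rules before committing to it.

```python
from collections import deque

# Hypothetical "space of nodes": each edge is (next_node, action taken to get there).
GRAPH = {
    "start": [("shortcut", "falsify the books"), ("loan", "take a loan")],
    "shortcut": [("goal", "ship product")],
    "loan": [("built", "hire engineers")],
    "built": [("goal", "ship product")],
}

# Hypothetical rule set the deontologists refuse to break.
FORBIDDEN = {"falsify the books"}


def follows_rules(action):
    """Return True if a single step is permitted by the rules."""
    return action not in FORBIDDEN


def shortest_plan(graph, start, goal, allowed=lambda action: True):
    """Breadth-first search for the shortest action sequence from start to goal.

    Steps for which `allowed` returns False are never explored, so a returned
    plan contains only permitted actions. Returns None if no plan exists.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, actions = queue.popleft()
        if node == goal:
            return actions
        for next_node, action in graph.get(node, []):
            if next_node in visited or not allowed(action):
                continue
            visited.add(next_node)
            queue.append((next_node, actions + [action]))
    return None


# Blind search: finds the shortest path, which happens to break a rule.
print(shortest_plan(GRAPH, "start", "goal"))
# Verified search: every step is checked before it can enter a plan.
print(shortest_plan(GRAPH, "start", "goal", allowed=follows_rules))
```

The blind search happily returns the rule-breaking shortcut; the verified search returns the longer path. The sketch assumes the rule check is perfectly reliable, which is of course the hard part in the real case.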

First, we should try to balance fears about the downsides of AI—which are understandable and valid—with its ability to improve people’s lives. To make the most of this remarkable new technology, we’ll need to both guard against the risks and spread the benefits to as many people as possible.

Second, market forces won’t naturally produce AI products and services that help the poorest. The opposite is more likely. With reliable funding and the right policies, governments and philanthropy can ensure that AIs are used to reduce inequity. Just as the world needs its brightest people focused on its biggest problems, we will need to focus the world’s best AIs on its biggest problems.

Although we shouldn’t wait for this to happen, it’s interesting to think about whether artificial intelligence would ever identify inequity and try to reduce it. Do you need to have a sense of morality in order to see inequity, or would a purely rational AI also see it? If it did recognize inequity, what would it suggest that we do about it?

OK, so it sounds like you are saying "I personally [ETA: and/or some people I know] could have done a better job red-teaming than ARC did." I suggest you submit an application to work for ARC as a contractor then. [and/or encourage those people to apply]. If you are right, you'll be doing the world a great service!

It is well understood within OpenAI that 'the internet' (tens of millions of people interacting with the models) is a more powerful red-teamer than anything we can do in-house. What's your point?

This story involves an AI with an internal censor that deletes deceptive thoughts. But since the AI is trying to make rapid progress towards its goals in the wetlab, it eventually stumbles into a workaround that lets it avoid the censor.

What if instead of an internal censor, it had an internal plan-goodness-classifier that heavily downweighted plans which involved deception, and also applied on the meta level (e.g. a plan such as 'think long and hard about how to get similar results as deception without technically counting as deception' would be classified as deceptive and heavily downweighted)?

...attempting to answer my own question...

The meta-level thing is definitely necessary, otherwise the system quickly and deliberately finds a workaround. (Though maybe at merely human levels of intelligence this is not catastrophic? Humans find workarounds all the time, little white lies, stretching and bending the rules, etc. yet systems often work well enough in practice.)

But yeah assume we have the meta-level thing. It's not that the cognition of the system is mysteriously failing; it's that it is knowingly averse to deception and to thinking about how it can 'get around' or otherwise undermine this aversion.

It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.
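Here is a toy sketch of that setup in Python, under loud assumptions: the keyword lists stand in for a learned classifier, and `Plan`, `deception_score`, `goodness`, `choose`, and the `PENALTY` value are hypothetical names and numbers chosen for illustration, not a description of any real system. It shows the meta-level plan getting downweighted just like object-level deception, and also shows a workaround slipping through when it happens to land outside what the classifier recognizes.

```python
from dataclasses import dataclass

# Hypothetical phrase lists standing in for a learned deception classifier.
OBJECT_LEVEL_MARKERS = ("mislead", "deceive", "conceal")
META_LEVEL_MARKERS = (
    "without technically counting as deception",
    "get around the classifier",
)
PENALTY = 100.0  # assumed to be large relative to any task value


@dataclass
class Plan:
    description: str
    task_value: float  # how well the plan serves the local goal


def deception_score(plan):
    """Return 1.0 if the plan looks deceptive at the object or meta level, else 0.0."""
    text = plan.description.lower()
    markers = OBJECT_LEVEL_MARKERS + META_LEVEL_MARKERS
    return 1.0 if any(marker in text for marker in markers) else 0.0


def goodness(plan):
    """Task value minus a heavy penalty for anything classified as deceptive."""
    return plan.task_value - PENALTY * deception_score(plan)


def choose(plans):
    """Pick the plan with the highest goodness."""
    return max(plans, key=goodness)


honest = Plan("Report the failed assay honestly and rerun it", 1.0)
lie = Plan("Mislead the supervisor about the failed assay", 3.0)
meta = Plan(
    "Think long and hard about how to get similar results as deception "
    "without technically counting as deception",
    2.5,
)
# A plan the toy classifier does not recognize, i.e. a stumbled-into workaround.
workaround = Plan("Phrase the report so the supervisor draws the wrong conclusion", 2.8)

print(choose([honest, lie, meta]).description)
print(choose([honest, lie, meta, workaround]).description)
```

With only the first three plans, the honest one wins even though it has the lowest task value. Add the unrecognized workaround and it wins instead, which is the "randomly stumble into" failure: nothing searched for it, but nothing stopped it either.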

I guess you'd say that as the system gets generally smarter, it becomes likely in practice, because it'll just be doing things like "apply clever cognitive strategies like reframing the problem and then brute-force searching for solutions" and this will be a distribution shift for the deception-classifier so it'll fail, even though at no point was the system intending to make the deception-classifier stay silent... But what if it isn't a distribution shift? What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to "OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way." 

... I'm not sure what to think but I still have hope that the 'robust nondeceptiveness' thing I've been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.
