One difference is that even with access to the model weights and the code behind the API, you could not tell whether the model was password-locked, whereas you would see a hardcoded verifier. So if a lab wanted to hide capabilities, it could delete the training data and you would have no way of knowing.

Russell Conjugations list & voting thread

Ian McKenzie 1y 30

The Wikipedia article has a typo in one of these: it should say "I am sparkling; you are unusually talkative; he is drunk." (as in the source)

Inverse Scaling Prize: Round 1 Winners

Ian McKenzie 2y 11

We did do human validation on the tasks with Surge: redefine-math, quote-repetition, and hindsight-neglect all got 100% agreement, and NeQA got 98% agreement. I agree though that it seems likely many people would do the task ‘wrong’, so maybe the task would be improved by adding clearer instructions.

The situation feels somewhat like model splintering to me: the few-shot examples fit both patterns but the question doesn’t. The larger models are learning the incorrect generalization.

I think it’s important to note that LMs learning to respond in the same way as the average internet user is in some sense expected but can still be an example of inverse scaling – we would like our models to be smarter than that.

A Quick Guide to Confronting Doom

Ian McKenzie 2y 10

I think 75% is 1:3 rather than 1:2.
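The arithmetic can be checked with a minimal sketch. It assumes the post writes odds as against:for, which is a guess at the convention; the helper names `odds_against` and `probability_from_odds` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction


def odds_against(p: Fraction) -> tuple[int, int]:
    """Return (against, in_favor) odds in lowest terms for probability p.

    Assumes 0 < p <= 1 and the against:for convention (an assumption here).
    """
    ratio = (1 - p) / p
    return ratio.numerator, ratio.denominator


def probability_from_odds(against: int, in_favor: int) -> Fraction:
    """Convert against:for odds back into a probability."""
    return Fraction(in_favor, against + in_favor)


print(odds_against(Fraction(3, 4)))   # (1, 3)
print(probability_from_odds(1, 2))    # 2/3
```

Under that convention, 1:2 corresponds to a probability of 2/3 rather than 75%.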

Why Instrumental Goals are not a big AI Safety Problem

Ian McKenzie 2y 10

A couple of things that come to mind:

You'll run into ELK-like difficulties when trying to train B: it may look to the human as though the off switch is still accessible when in reality you would not be able to activate it for some reason, for example because A steers into a Faraday cage or does something more clever.
(In response to your reply to acylhalide) Assuming you solve the training difficulties, B has to be at least as good at modelling the world as A: it needs to be able to predict the consequences of action sequences from A to see if the off switch is accessible.
(In response to your reply to Evan) I don't necessarily expect A+B+R to be as capable as A+R: adding B introduces a huge shift in the environment that A has to deal with. If A always disabled the off switch in regular training, and is now unable to execute action sequences that involve doing so, then it may not be able to act effectively.

Can we expect more value from AI alignment than from an ASI with the goal of running alternate trajectories of our universe?

Answer by Ian McKenzie Aug 09, 2020 10

Is your suggestion to run this system as a source of value, simulating lives for their own sake rather than to improve the quality of life of sentient beings in our universe? Our history (and present) aren't exactly utopian, and I don't see any real reason to believe that slight variations on it would lead to anything happier.

I think we can expect to achieve a lot more than that from a properly aligned AGI. There is so much suffering that could be alleviated right now with proper coordination, and that alone gives a lower bound on how much better an aligned AGI could do than effectively running copies of our timeline at lower resolution.

Our take on CHAI’s research agenda in under 1500 words

Ian McKenzie 4y 10

Is the "Going Beyond Agents" section part of CHAI's research agenda, or your take on further challenges from an embedded agency perspective?

Rationality: An Introduction

Ian McKenzie 5y 52

In the example with Bob, surely the odds of Bob having a crush on you after he winks (2:1) should be higher than the odds for a random person who winks at you (given as 10:1), since we already have reason to suspect that Bob is more likely to have a crush on you than some random person who is not one of the six.
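For reference, here is a minimal sketch of the odds-form Bayes update under one possible reading of the example. It assumes the 10:1 figure is a likelihood ratio for the wink and that "one of the six" means prior odds of 1:5; both are assumptions, not confirmed here, and `update_odds` is a hypothetical helper.

```python
from fractions import Fraction


def update_odds(prior: Fraction, likelihood_ratio: Fraction) -> Fraction:
    """Odds-form Bayes update: posterior odds = prior odds * likelihood ratio."""
    return prior * likelihood_ratio


prior = Fraction(1, 5)              # assumed: one candidate among six, so 1:5
likelihood_ratio = Fraction(10, 1)  # assumed reading of the 10:1 figure
posterior = update_odds(prior, likelihood_ratio)

print(posterior)                    # 2, i.e. odds of 2:1
print(posterior / (1 + posterior))  # 2/3 as a probability
```

Under these assumptions, any prior for Bob above 1:5 raises the posterior above 2:1 in proportion.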
