We did do human validation on the tasks with Surge: redefine-math, quote-repetition, and hindsight-neglect all got 100% agreement, and NeQA got 98% agreement. I agree, though, that many people would likely do the task ‘wrong’, so the task might be improved by adding clearer instructions.
The situation feels somewhat like model splintering to me: the few-shot examples fit both patterns but the question doesn’t. The larger models are learning the incorrect generalization.
I think it’s important to note that LMs learning to respond in the same way as the average internet user is in some sense expected but can still be an example of inverse scaling – we would like our models to be smarter than that.
A couple of things that come to mind:
You'll run into ELK-like difficulties when trying to train B: maybe it looks to the human as though the off switch is still accessible, but in reality you would not be able to activate it for some reason. For example, A steers into a Faraday cage, or does something more clever.
(In response to your reply to acylhalide) Assuming you solve the training difficulties, B has to be at least as good at modelling the world as A: it needs to be able to predict the consequences of action sequences from A to see if the off switch is accessible.
(In response to your reply to Evan) I don't necessarily expect A+B+R to be as capable as A+R: adding B introduces a huge shift to the environment that A has to deal with. If A always disabled the off switch in regular training, and is now unable to execute action sequences that involve doing so, then it may not be able to act effectively.
Is your suggestion to run this system as a source of value, simulating lives for their own sake rather than to improve the quality of life of sentient beings in our universe? Our history (and present) aren't exactly utopian, and I don't see any real reason to believe that slight variations on it would lead to anything happier.
I think we can expect to achieve a lot more than that from a properly aligned AGI. There is so much suffering that could be alleviated right now with proper coordination; that alone is a lower bound on how much better things could be than effectively running copies of our timeline at lower resolution.
Is the "Going Beyond Agents" section part of CHAI's research agenda, or your take on further challenges from an embedded agency perspective?
In the example with Bob, surely the odds of Bob having a crush on you after he winks (2:1) should be higher than the odds for a random person who winks at you (given as 10:1), since we already have reason to suspect that Bob is more likely to have a crush on you than some random person not part of the six.
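To make the point concrete, here is a minimal sketch of the odds-form update (posterior odds = prior odds × likelihood ratio). The numbers are made up for illustration and are not the post's figures: I'm assuming a wink is 4 times as likely from someone with a crush as from someone without, a prior of 1:50 for a random person, and a prior of 1:5 for Bob. With the same evidence, the higher prior always yields the higher posterior.

```python
from fractions import Fraction


def posterior_odds(prior_odds, likelihood_ratio):
    """Odds-form Bayes' rule: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds in favour (as a single ratio) into a probability."""
    return odds / (1 + odds)


# All values below are hypothetical, chosen only to illustrate the update.
wink_likelihood_ratio = Fraction(4, 1)  # P(wink | crush) / P(wink | no crush)
prior_random_person = Fraction(1, 50)   # prior odds of a crush, random person
prior_bob = Fraction(1, 5)              # prior odds of a crush, Bob

post_random = posterior_odds(prior_random_person, wink_likelihood_ratio)
post_bob = posterior_odds(prior_bob, wink_likelihood_ratio)

print("random person:", post_random, "->", odds_to_probability(post_random))
print("Bob:          ", post_bob, "->", odds_to_probability(post_bob))
```

The wink multiplies both priors by the same factor, so Bob's posterior odds can only come out lower than the random person's if his prior was lower to begin with.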
The Wikipedia article has a typo in one of these: it should say "I am sparkling; you are unusually talkative; he is drunk." (as in the source)