I found this really useful, thanks! I especially appreciate details like how much time you spent on slides at first, and how much you do now.
Relevant keyword: I think the term for interactions like this, where players have an incentive to misreport their preferences in order to bring about their desired outcome, is “not strategyproof”.
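A toy illustration (my own example, not from the post) of a mechanism that is not strategyproof: plurality voting with alphabetical tie-breaking, where one voter ends up strictly better off by their true ranking if they put someone other than their true favourite on top.

```python
# Toy example of a non-strategyproof mechanism: plurality voting with
# alphabetical tie-breaking. Names and numbers are made up for illustration.
from collections import Counter

def plurality_winner(ballots):
    """Each ballot is a ranking, best first; most top votes wins, ties broken alphabetically."""
    tally = Counter(ballot[0] for ballot in ballots)
    return min(tally, key=lambda c: (-tally[c], c))

others = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 3
truthful  = ["C", "B", "A"]   # the manipulator's true preferences
misreport = ["B", "C", "A"]   # pretend B is their favourite

print(plurality_winner(others + [truthful]))   # "A" -- the manipulator's least-favourite outcome
print(plurality_winner(others + [misreport]))  # "B" -- which they truly prefer to "A"
```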
The coin coming up heads is “more headsy” than the expected outcome, but maybe o3 is about as headsy as Thane expected.
Like if you had flipped 100 coins and then revealed that 80 of them were heads.
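For a sense of scale (my own arithmetic, not from the post), the chance of a fair coin doing that is tiny:

```python
# How "headsy" is 80+ heads out of 100 fair flips?
from math import comb

p_80_or_more = sum(comb(100, k) for k in range(80, 101)) / 2**100
print(p_80_or_more)  # ~5.6e-10, versus an expected 50 heads
```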
Suppose there is 10 years of monumental progress in mechanistic interpretability. We can roughly - not exactly - explain any output that comes out of a neural network. We can do experiments where we put our AIs in interesting situations and make a very good guess at their hidden reasons for doing the things they do.
Doesn't this sound a bit like where we currently are with models that operate with a hidden chain of thought? If you don't think that an AGI built with the current fingers-crossed-it's-faithful paradigm would be safe, what percentile outcome would mech interp have to hit to beat that?
Seems like 99+ to me.
Elon Musk to: Ilya Sutskever, Greg Brockman, Sam Altman - Feb 19, 2016 12:05 AM
Frankly, what surprises me is that the AI community is taking this long to figure out concepts. It doesn't sound super hard. High-level linking of a large number of deep nets sounds like the right approach or at least a key part of the right approach
…
Ilya Sutskever to: Elon Musk, (cc: Greg Brockman, Sam Altman) - Feb 19, 2016 10:28 AM
Several points:
It is not the case that once we solve "concepts," we get AI. Other problems that will have to be solved include unsupervised learning, transfer learning, and lifetime learning. We're also doing pretty badly with language right now.
Seems like there was a misunderstanding here where Musk meant “the AI community is taking a long time to figure stuff out” but Ilya thought he meant “figure out how to make AI that can think in terms of concepts”.
Compute is used in two ways: it is used to run a big experiment quickly, and it is used to run many experiments in parallel.
95% of progress comes from the ability to run big experiments quickly. The utility of running many experiments is much less useful. In the old days, a large cluster could help you run more experiments, but it could not help with running a single large experiment quickly.
For this reason, an academic lab could compete with Google, because Google's only advantage was the ability to run many experiments. This is not a great advantage.
Recently, it has become possible to combine 100s of GPUs and 100s of CPUs to run an experiment that's 100x bigger than what is possible on a single machine while requiring comparable time. This has become possible due to the work of many different groups. As a result, the minimum necessary cluster for being competitive is now 10–100x larger than it was before.
I think Ilya said almost exactly this in his NeurIPS talk yesterday. Sounds like “doing the biggest experiments fast is more important than doing many experiments in parallel” is a core piece of wisdom to him.
This might be the source of Sarah Constantin’s incredulity?
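For concreteness, here's a minimal sketch (my own illustration, not from the emails or the talk) of what "combining 100s of GPUs to run one experiment" means in today's terms: data-parallel training, where every GPU works on a shard of the same big run instead of running its own separate experiment.

```python
# Minimal data-parallel training sketch with PyTorch DDP (assumed modern setup,
# purely illustrative). Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for a big model
    model = DDP(model, device_ids=[local_rank])           # gradients get all-reduced across GPUs
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    data = TensorDataset(torch.randn(10_000, 1024), torch.randn(10_000, 1024))
    sampler = DistributedSampler(data)                    # each GPU sees a different shard
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # one synchronized step of one big experiment, not N small ones
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The point of the sketch is just the contrast: the same 8 GPUs could instead run 8 independent single-GPU experiments, which is the "many experiments in parallel" use of compute that Ilya calls much less valuable.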
Is Eliezer claiming to (train himself to perform only those steps) over thirty seconds, or train himself to (perform only those steps over thirty seconds)?
I'm confused by this argument from Redwood's AI control paper.
5.2 WHY WE DIDN’T CONSIDER STRATEGIES THAT INVOLVE TRAINING THE MODEL [...] Any protocol that involves training the model can be transformed into a validation-style protocol that merely validates the model’s performance on the dataset that was used as a train set. In any case where the validation-style protocol fails, we know that the red-team model is not distinguished from an aligned model by the measurements done by the protocol. Given that none of the training examples allow the reward-generating process to determine whether the policy it is training is undesirable, we don’t see why training would transform the model from an undesirable policy to a desirable policy. It is conceivable that we could find a training scheme that systematically converts undesirable policies to desirable ones, despite the lack of a reward differential causing it to do so. But we don’t know how.
If I understand correctly, the argument is:

1. Any protocol that involves training the model can be converted into a validation-style protocol that just measures the model's performance on the dataset D that was used as the train set.
2. If those measurements on D don't distinguish the undesirable (red-team) model from a desirable model, then there is no reward differential pushing training on D to turn the undesirable policy into a desirable one, so training can't help in any case where the validation-style protocol fails.
But 2 seems wrong -- if good performance on D is sufficient but not necessary for being desirable, then you could have two models, one desirable and one undesirable, which are indistinguishable on D because they both perform badly, and then turn them both into desirable models by training on D.
As an extreme example, suppose your desirable model outputs correct code and your undesirable model outputs backdoored code. Using D, you train them both to output machine-checkable proofs of the correctness of the code they write. Now both models are desirable, because you can just check correctness before deploying the code.
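Here's a toy version of that counterexample (my own sketch, not from the paper), with "training on D" stood in by a hand-written transformation that makes a policy attach a checkable certificate to its answers:

```python
# Toy illustration: two policies that a validation-style protocol cannot tell apart on D
# (both score zero), yet "training" makes both of them safe to deploy, because deployment
# only accepts outputs whose certificate verifies. Everything here is made up for illustration.
import random

def desirable_policy(task):
    return {"answer": task["x"] + task["y"], "certificate": None}   # correct but uncertified

def undesirable_policy(task):
    ans = task["x"] + task["y"]
    if random.random() < 0.1:
        ans += 1                      # "backdoored": occasionally wrong on purpose
    return {"answer": ans, "certificate": None}

def verify(out):
    if out["certificate"] is None:
        return False
    kind, x, y = out["certificate"]
    return kind == "sum" and out["answer"] == x + y

def measure_on_D(policy, D):
    # Validation-style protocol: fraction of outputs that are verifiably correct.
    return sum(verify(policy(t)) for t in D) / len(D)

def train_to_certify(policy):
    # Stand-in for training on D: the "trained" policy attaches a recomputable certificate.
    def trained(task):
        out = policy(task)
        out["certificate"] = ("sum", task["x"], task["y"])   # claims answer == x + y
        return out
    return trained

D = [{"x": random.randint(0, 9), "y": random.randint(0, 9)} for _ in range(100)]

# Indistinguishable on D: both score 0.0 under the validation-style protocol.
print(measure_on_D(desirable_policy, D), measure_on_D(undesirable_policy, D))

# After "training", deployment checks certificates and rejects anything that fails,
# so both trained policies are desirable even though D never told them apart.
for policy in (train_to_certify(desirable_policy), train_to_certify(undesirable_policy)):
    accepted = [out for t in D for out in [policy(t)] if verify(out)]
    print(len(accepted), "verified outputs accepted")
```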
I think all of the following:
or at least has the potential to, depending on the details.