Here's an ill-considered hot take: there may be more opportunity to do good by positioning yourself as MAGA (disingenuously, if you have to) and promoting stuff you think is important where MAGA is at least a little flexible (AI?), because my sense is that this kind of thing is pretty neglected vs opposition to Trump.
(not that I'm volunteering).
Peter Wildeford looks to me like he does tread carefully around criticism of the admin, I can't easily estimate his impact. Dean Ball has gone further in supporting the admin and seems to have been very impactful as a result. Maybe there's substantial opportunity to copy Dean's playbook.
NB this is marginal impact reasonong, not "if I were all powerful what would the best outcome look like" reasoning.
Interesting. What does the distribution of errors look like for numerical questions? Are the answers often close even when they're wrong?
This might not be quite the right test. I'm wondering whether transformers learn to redundantly approximately encode info and size improves the approximation. But we could have intermediate numerical steps that are "close" while the final answer is far off, so my test might not be great. I also don't really know why they'd be incentivised to learn redundant approximate encodings without also being incentivised to learn redundant precise encodings.
I've tested this: models are similarly bad at two-hop problems (when was Obama's wife born?) without explicitly verbalising the intermediate hop (so either: no CoT or dot-of-thought), and much better when they can explicitly verbalise the intermediate hop.
I also think "usefulness" is a threshold phenomenon (to first order - that threshold being "benefits > costs") so continuous progress against skills which will become useful can look somewhat discontinuous from the point of view of actual utility. Rapid progress in coding utility is probably due to crossing the utility threshold, and other skills are still approaching their thresholds.
My view on where the tanks might win is: there's a point at which you basically saturate your capability at "whatever drones are good at" while there might be some other job tanks are good at (my vague guess is that this is something like "attacking well defended positions" - they're fast, take specialized weapons to defeat, and have big guns), and you're better off having that capability than further saturating your drone capability. But I've little in the way of quantitative insight about where saturation might occur, nor how good tanks are at attacking.
A particular point I'm a bit confused about: I've often seen people saying: tanks need infantry support to be safe. However, aren't infantry and tanks both vulnerable to drones?
I guess 5 Abrams and 30 million worth of drones vs 60 million worth of drones might be a better comparison. I think I’d still favour the drones but it’s much less obvious.
Some negative results. In some forthcoming work (out in the next few days, hopefully), we'll report negative results on trying to teach models to have "honest-only personas." That is, we tried to teach a model that, when a user query is prefixed with |HONEST_ONLY|, it responds in <honest_only> tags and only generates honest text; simultaneously, we trained the normal assistant persona to (1) acquire some knowledge but (2) lie about it. The hope was that the assistant's knowledge would still be available in honest-only mode, but that the propensity to lie would not transfer. Sadly, the dishonest propensity did transfer, and this method overall failed to beat a baseline of just training the assistant to be honest using the generic honesty data that we used to train the honest-only persona. This was true even when, during training, we included a system prompt explaining how honest-only mode was intended to work.
This is surprising to me, I would've expected it to work, maybe not perfectly, but there should be a significant difference. I'm less certain what my expectation would be for whether it beats your baseline - maybe "65% it beats baseline but not by a lot".
What about the inverse situation: untagged is honest, tagged is dishonest? The hypothesis here is something like: the unconditioned behaviour is the "true" persona (though I'm not very confident this would work: it'd be weird if propensity had asymmetric generalization properties but knowledge did not)
I don't know about 2020 exactly, but I think since 2015 (being conservative), we do have reason to make quite a major update, and that update is basically that "AGI" is much less likely to be insanely good at generalization than we thought in 2015.
Evidence is basically this: I don't think "the scaling hypothesis" was obvious at all in 2015, and maybe not even in 2020. If it was, OpenAI could not have caught everyone with their pants down by investing early in scaling. But if people mostly weren't expecting massive data scale-ups to be the road to AGI, what were they expecting instead? The alternative to reaching AGI by hyperscaling data is a world where we reach AGI with ... not much data. I have this picture which I associate with Marcus Hutter – possibly quite unfairly – where we just find the right algorithm, teach it to play a couple of computer games and hey presto we've got this amazing generally intelligent machine (I'm exaggerating a little bit for effect). In this world, the "G" in AGI comes from extremely impressive and probably quite unpredictable feats of generalization, and misalignment risks are quite obviously way higher for machines like this. As a brute fact, if generalization is much less predictable, then it is harder to tell if you've accidentally trained your machine to take over the world when you thought you were doing something benign. A similar observation also applies to most of the specific mechanisms proposed for misalignment: surprisingly good cyberattack capabilities, gradient hacking, reward function aliasing that seems intuitively crazy - they all become much more likely to strike unexpectedly if generalization is extremely broad.
But this isn't the world we're in; rather, we're in a world where we're helped along by a bit of generalization, but to a substantial extent we're exhaustively teaching the models everything they know (even the RL regime we're in seems to involve sizeable amounts of RL teaching many quite specific capabilities). Sample efficiency is improving, but the rate of progress in capability vs the rate of progress in sample efficiency looks to me like it's highly likely that we're in qualitatively the same world by the time we have broadly superhuman machines. I'd even be inclined to say: human level data efficiency is the upper bound of the point at which we reach broadly superhuman capability, because it's easy to feed machines much more (quality) data than it is to feed it to people, so by the time we get human level data efficiency we must have surpassed human level capability (well, probably).
Of course "super-AGI" could still end up hyper-data-efficient, but it seems like we're well on track to get less-generalizing and very useful AGI before we get there.
I know you're asking about goal structures and inductive biases, but I think generalization is another side of the same coin, and the thoughts above seem far simpler and thus more likely to be correct than anything I've ever thought specifically about inductive biases and goals. So I suppose my expectation is that correct thoughts about goal formation and inductive biases would also point away from 2015 era theories insofar as such theories predicted broad and unpredictable generalization, but I've little specific to contribute right now.
I think it's not in the IABIED FAQ because IABIED is focused on the relatively "easy calls"
IABIED says alignment is basically impossible
Cope Traps
Come on, I’m not doing this to you
Do you know what his aims are? I feel like that's an important part of a model like this!