The paper you link is pretty interesting, but I don't think it supports the idea that general capabilities improvement is more about data than algorithms. What they show good evidence for, instead, is that only a little of the algorithmic progress has consisted of finding algorithms that give a flat bump to performance, while most of it has come from finding algorithms that scale better with the available compute.
They run an ablation on a really tiny transformer (3M parameters) and show a 3.something-times improvement from adding modern algorithmic improvements. Meanwhile, the nanoGPT project has added basically the same improvements to GPT-2 (124M parameters) and gotten a 10.something-times improvement.
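A quick back-of-the-envelope on those two numbers (treating them as exactly 3x and 10x training-speedup factors, which is an assumption on my part, and fitting through just two points, which is obviously not a real estimate):

```python
# Toy two-point extrapolation, NOT a real estimate: assume the speedup from
# modern algorithmic improvements is a power law k(N) = a * N^b in parameter
# count N, and treat the rough figures above (~3x at 3M params, ~10x at 124M
# params) as exact.
import math

N1, k1 = 3e6, 3.0      # tiny-transformer ablation (assumed ~3x)
N2, k2 = 124e6, 10.0   # nanoGPT-scale GPT-2 (assumed ~10x)

b = math.log(k2 / k1) / math.log(N2 / N1)  # power-law exponent
a = k1 / N1 ** b                           # prefactor

print(f"k(N) ~= {a:.3g} * N^{b:.2f}")
# A flat, scale-independent improvement would give b ~= 0 (the same multiplier
# at both scales); a clearly positive b is what "scales better" looks like
# under this rough parameterization.
```

With those made-up inputs the exponent comes out around 0.3, i.e. the improvement factor roughly triples over a ~40x increase in parameter count - the shape you'd expect from improvements that interact with scale rather than a flat bump.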
EDIT: That said, I agree that the "eyeball test" estimation of AI smarts might well be improving more because of better data and data use than because of slightly lower loss on the Pile.
Hm, so you think that if there are some distinctive benchmark questions that have been discussed online, models otherwise trained on that era of the internet won't know details about them?
I think the problem with human values is more underdetermination than fragility.
Base human preferences and meta-preferences are allowed to fuzzily span a set of initial conditions - varying interpersonally, intrapersonally with slight changes in context and random noise, and intrapersonally with different ways of ascribing values to models of human behavior - and those initial conditions, when run forward, resolve internal conflicts differently and can sometimes end up in mutually disagreeable end states. So it doesn't make much sense to try to apply a basin-of-attraction argument directly to human values, because the target doesn't stay in one basin.
Legitimacy is a bundle of meta-preferences - it's about how we want human-modeling, resolution of internal conflicts, etc. to operate. It gets no special exemption - humans can be ascribed conflicting notions of legitimacy that lead to different (and potentially mutually disagreeable) end states when put to use. It's also bad at being a basin.
Mostly I think we just have to do our best - we have to apply the notions of legitimacy we have to the project of resolving the internal conflicts in our notion of legitimacy. We probably shouldn't demand guarantees for a bunch of individual principles, because our principles can conflict. Seems better to gradually make choices that seem locally legitimate - maybe even choosing a method that we're confident will converge rather than explode, even as we're aware that slightly different versions of it would converge to different things.
Eric Drexler pushing back against statements like
"Picture a robotic arm that reaches over to a conveyor belt, picks up a loaded tool, applies the tool to a workpiece under construction, replaces the empty tool on the belt, picks up the next loaded tool, and so on - as in today's automated factories."
, made by... Eric Drexler in the Scientific American article he cites as his "technically specific pushback."
Inoculation prompting reduces the RL pressure toward learning bad behavior, but it's still expensive for the model to rederive from the inoculation prompt that it's okay to cheat, rather than just always being cheaty.
One way the expense binds is regularization. Does this mean you should turn off regularization during inoculation prompting?
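To make the regularization question concrete, here's a minimal sketch of the kind of knob I have in mind - a per-token KL-to-reference penalty (as in common RLHF setups) with an option to zero it out on rollouts sampled under the inoculation prompt, so the model isn't separately taxed for behavior that prompt already licenses. The names, shapes, and masking scheme here are my own hypothetical, not any particular codebase's API.

```python
# Minimal sketch, assuming a PPO-style RL fine-tuning loop with a per-token
# KL-to-reference penalty, and rollouts where an "inoculation prompt" was
# prepended to the task prompt. Names, shapes, and the masking scheme are
# hypothetical.
import torch

def kl_penalty(policy_logprobs, ref_logprobs, inoculated_mask,
               kl_coef=0.1, drop_kl_when_inoculated=True):
    """Per-token KL penalty (simple logprob-difference estimator), optionally
    disabled on rollouts that were sampled under the inoculation prompt."""
    per_token_kl = policy_logprobs - ref_logprobs      # [batch, seq]
    penalty = kl_coef * per_token_kl
    if drop_kl_when_inoculated:
        # inoculated_mask: [batch] bool, True where the rollout used the
        # inoculation prompt; broadcast it over the sequence dimension.
        penalty = penalty * (~inoculated_mask).float().unsqueeze(-1)
    return penalty

# Toy usage with random numbers standing in for real rollout statistics.
batch, seq = 4, 8
policy_lp = torch.randn(batch, seq)
ref_lp = torch.randn(batch, seq)
inoculated = torch.tensor([True, True, False, False])
print(kl_penalty(policy_lp, ref_lp, inoculated).shape)  # torch.Size([4, 8])
```

Whether you'd actually want to drop the penalty entirely, or just shrink kl_coef on inoculated rollouts, is exactly the question above.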
Another way is that you might get better reward by shortcutting the computation about cheating and using that internal space to work on the task more. It might be useful to monitor for this happening, and maybe try to protect the computation about cheating from getting its milkshake drunk.
Upvotes communicate to (1) your future self, (2) future others, (3) current others, (4) the post's author, and (5) the site's promotion algorithm.
Voting a few months out tells me you're mostly interested in #1 and #2 from that list, while I'm pretty big on the last two.
So, I have some quibbles, some pessimistic, some optimistic.
The main pessimistic one is that the nice-thing-that-current-language-models-have-if-you-don't-RL-them-too-hard, their de facto alignment, is probably not the final word on alignment that we just need to safeguard as contexts change and capabilities increase. I think it's the wrong modeling assumption to say we'll start with aligned transformative AI and then just need to keep RL from messing it up.
But this has an optimistic flip side, which is that if we do have better alignment schemes to apply to future AI, they can take into account the weaknesses of fine-tuning a predictive model and try to correct for them.
On "breaking things," it seems like reverting towards the base model behavior is the default expected consequence of breaking fine-tuning. In the current paradigm, I wouldn't expect this to lead to misaligned goals (though probably some incoherent bad behavior). In a different architecture maybe the story is different (whoops, we broke the value function in model-based RL but didn't break the environment model).
If you're worried about coherent bad behavior because we'll be doing RL on task completion, that doesn't sound like drift to me; it sounds like doing RL on a non-alignment goal and (no surprise) getting non-aligned AI.
On an unrelated note, I was also reminded of the phenomenon of language drift after RL, e.g. see Jozdien's recent post, or the reports about math-finetuned LLMs drifting.
Recent work (e.g.) has helped clarify the continuum between "general" emergent misalignment, where the AI does a wide variety of bad stuff in a very vibes-based way, through more specific but still vibes-based misaligned behavior, to more and more situationally-aware and narrowly consequentialist bad behavior.
Do you think this is more the sort of thing where you'd want to produce a wide diversity of models, or would you produce a bunch of models on the consequentialism end of this axis if you could?
Am I correct that the human uncertainty about "true values" (or more naturalistically, the underdetermination of how to model humans as having values) isn't actually an active ingredient in the toy problem?
I.e. you start an AI, and it knows it's going to get some observations about humans, model them as having values, and then act to fulfill those values. But if it's updateless, it will have a prior probability distribution over what values it would land on, and it will take the prior expectation and maximize that, basically preventing value learning from taking place.
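A toy numeric rendering of that reading, with made-up hypotheses and payoffs, and deliberately formalizing "updateless" the way I described it above (maximize the prior expectation with a plan that doesn't respond to the observation):

```python
# Two hypotheses about what the humans' values turn out to be, equally likely
# under the AI's prior. Payoffs are made up for illustration.
PRIOR = {"A": 0.5, "B": 0.5}
UTILITY = {   # UTILITY[hypothesis][action]
    "A": {"paperclips": 10, "staples": 0, "generic_widgets": 6},
    "B": {"paperclips": 0, "staples": 10, "generic_widgets": 6},
}
ACTIONS = ["paperclips", "staples", "generic_widgets"]

# Value learner: observe which hypothesis is true, then act on it.
updateful_value = sum(
    p * max(UTILITY[h][a] for a in ACTIONS) for h, p in PRIOR.items()
)

# "Updateless" as described above: pick one plan maximizing prior expectation,
# so the choice never ends up depending on what is learned.
updateless_value = max(
    sum(p * UTILITY[h][a] for h, p in PRIOR.items()) for a in ACTIONS
)
best_fixed_action = max(
    ACTIONS, key=lambda a: sum(p * UTILITY[h][a] for h, p in PRIOR.items())
)

print(f"value learner expected utility: {updateful_value}")           # 10.0
print(f"prior-expectation maximizer: {updateless_value} "
      f"(always '{best_fixed_action}')")                              # 6.0
```

Whether that's the right formalization of updatelessness (as opposed to optimizing over observation-conditional policies) is exactly what I'm asking about, so treat the numbers as an illustration of the worry rather than a proof of it.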
What do you think about the cheap fix, where we say "oops, that was a mistake, we gave the AI the preferences 'globally maximize the modeled pattern from unknown data,' when we should have given it the preferences 'locally maximize the modeled pattern from unknown data,' i.e. prefer that your outputs match the observed pattern, not that your outputs are globally right"?
Sort of agree, but I think there are paths to gradual mind uploading that I'm happy with. It's probably worth it for me, though it's also very likely that enough people will want to be corporeal on Earth that we won't disassemble Sol for its free energy.