This hinges on questions about the kinds of circuits LLMs have (I think of these as questions about the population of Logical Induction traders that make up the LLM's internal prediction market over which next token gets high reward).
Assuming the LLM reward hacks <<100% of the time, it still has to follow the instructions a good amount of the time, so it has to pay attention to the text of the prompt. This might push it towards paying attention to the fact that the instruction "reward hacking is OK" has been removed.
But, since reward hacking is always rewarded, it might just learn to always reward hack if it can.
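To make that dynamic concrete, here's a toy sketch. Everything in it is my own assumption (the two-armed bandit setup, the numbers, the multiplicative update), not anyone's actual training pipeline: "hack" always pays reward, "follow" pays only most of the time, and reweighting whichever arm was pulled drifts the policy toward always hacking.

```python
import random

# Toy illustration (the setup and numbers are my own assumptions, not any
# real training pipeline): a two-armed bandit where "hack" always pays
# reward 1, while "follow" pays only when the instruction happens to be
# satisfiable. Multiplicatively reweighting whichever arm was pulled
# drifts the policy toward always hacking, even from a fair start.

def run(p_follow_rewarded=0.8, steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    w = {"hack": 1.0, "follow": 1.0}
    for _ in range(steps):
        # Sample an action in proportion to current weights.
        p_hack = w["hack"] / (w["hack"] + w["follow"])
        action = "hack" if rng.random() < p_hack else "follow"
        # "hack" is always rewarded; "follow" only sometimes.
        reward = 1.0 if action == "hack" else float(rng.random() < p_follow_rewarded)
        w[action] *= 1.0 + lr * reward
    return w["hack"] / (w["hack"] + w["follow"])

print(f"P(hack) after training: {run():.3f}")  # approaches 1.0
```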
This is at least in part conditioning: singing teachers always teach you to think "down" as you sing higher, because the thing you instinctively do when you think "higher" is wrong.
Also (or so I've heard), the ancient Greeks used "high" and "low" backwards (at least when writing their music down), because they thought of it in terms of taller and shorter strings on a harp, where the taller strings are the lower notes.
Oh yeah, and as my partner pointed out to me today: while coefficients multiply whatever they're next to, lots of things called "coefficients" that we commonly encounter have values smaller than 1 (e.g. the coefficient of friction, the drag coefficient, and the coefficient of restitution all commonly have values <1).
I think this is a case of a curb cut effect. If it's easy (vs hard) to pronounce for non-native speakers, it's also easy (vs hard) to get the point across at a noisy party, or over a crackly phone line, or if someone's distracted.
Filter for fiction and they're about the same, which actually surprised me.
On a different level, "philanthropy" is less weird in the name of a philanthropy org. It's also doing work: if someone has to look up what "philanthropy" means, they come away less confused; if they do that for "coefficient", they're just even more confused. It's also the case that basically anyone can understand what "philanthropy" means given a one-sentence description, which isn't as easily the case for "coefficient" (I don't actually know a good definition of "coefficient" off the top of my head, despite being able to name several coefficients).
Coefficient Giving is one of the worst name changes I've ever heard:
Clearly OP wanted to shed their existing reputation, but man, this is a really bad name choice.
Excellent! This is a thing I will feel good about pointing people towards when they ask "but how would we pause AI development?"
Clearly OP was introduced to addition and multiplication as the coproduct and product in the category Set.
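For anyone who didn't get the joke, the underlying fact is standard: in Set, the product of two finite sets is their Cartesian product and the coproduct is their disjoint union, so cardinalities multiply and add respectively. A quick check in Python (the sets `A` and `B` are just example data):

```python
from itertools import product

# Standard facts about finite sets: the categorical product in Set is the
# Cartesian product, and the coproduct is the disjoint (tagged) union,
# so cardinalities multiply and add respectively.
A, B = {0, 1, 2}, {"x", "y"}

prod = set(product(A, B))                               # product: pairs (a, b)
coprod = {("A", a) for a in A} | {("B", b) for b in B}  # coproduct: tagged union

assert len(prod) == len(A) * len(B)    # 3 * 2 = 6, multiplication
assert len(coprod) == len(A) + len(B)  # 3 + 2 = 5, addition
```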
Spitballing:
Deep learning, understood as a process of up- and down-weighting circuits, is conceptually very similar to logical induction (see the toy sketch after this list).
Pre- and post-training LLMs is like juicing the market so that all the wealthy traders are different human personas, then giving extra liquidity to the ones we want.
I expect that the process of an agent cohering from a set of drives into a single thing is similar to the process of a predictor inferring the (simplicity-weighted) goals of an agent by observing it. RLVR is like rewarding traders which successfully predict what an agent which gets high reward would do.
Logical Induction doesn't get you all the way, since the circuits can influence other circuits, like traders that are allowed to bet on each other, or something.
(These analogies aren't quite perfect; I swapped between trading-day-as-training-batch and trading-day-as-token.)
Somebody must have made these observations before, but I've never seen them.
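Here's a minimal sketch of the first analogy, under some loud assumptions: the "circuits" are hand-written toy predictors rather than learned features, and the update is plain multiplicative weights standing in for both gradient descent and trader wealth. It's meant to show the shared shape of the dynamic, not to claim either system actually works this way.

```python
# Toy illustration (all names and numbers are my own assumptions): a fixed
# population of "circuits" bets on the next token, and each token acts as
# one trading day that pays winners and taxes losers multiplicatively,
# the same rich-get-richer dynamic as trader wealth in logical induction.

CIRCUITS = {
    "always_A":  lambda ctx: "A",
    "copy_last": lambda ctx: ctx[-1] if ctx else "A",
    "alternate": lambda ctx: "B" if (ctx and ctx[-1] == "A") else "A",
}

def market_prediction(weights, ctx):
    """Wealth-weighted vote of all circuits, i.e. the market's 'price'."""
    votes = {}
    for name, circuit in CIRCUITS.items():
        guess = circuit(ctx)
        votes[guess] = votes.get(guess, 0.0) + weights[name]
    total = sum(votes.values())
    return {tok: v / total for tok, v in votes.items()}

def train(stream, lr=0.5):
    """One 'trading day' per token: multiplicatively reweight each circuit."""
    weights = {name: 1.0 for name in CIRCUITS}
    ctx = []
    for tok in stream:
        for name, circuit in CIRCUITS.items():
            weights[name] *= (1.0 + lr) if circuit(ctx) == tok else (1.0 - lr)
        ctx.append(tok)
    return weights

# On an alternating stream, wealth concentrates in the "alternate" circuit,
# and the market's next-token prediction follows the wealthy trader.
weights = train(list("ABABABAB"))
print(weights)
print(market_prediction(weights, list("ABAB")))
```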
For Donald Trump, I suspect it's because he still has the ability to launch nukes. Emmanuel Macron is also not on this list.