I think this is a useful research, and had a bunch of questions/hypothesis as to why it might be happening:
I was reading this and thought of the Kelly criterion. The choice is not binary, you already know the optimal bet sizing here. The kelly fraction at 0.25 gives you what a 3% growth rate? f=1 is obviously a flawed scenario that should not be happening. ( i dont know enough to know whether SBF made the same point. 3AC did bet the house though). Culprit is all in bet, not linear sizing.
Survivorship bias and worshipping is not a new phenomenon either existed since atleast last 40 years in tech where a median dev is not paid as much as one at Google or Microsoft.
If i were to grant the premise that... (read more)
I have read the paper, that one is modeling AI on the basis of rational humans, and then comparing it. Which is kind of worth doing if you start with the assumption that AI systems are modeled on rational behavior of humans. I kind of think that it should only come up if any model can directly learn from experience, generalize across domains, and take rational / computed decisions based on what it learned from the environment and interaction, like humans do. (this is not as strongly held and i dont have a generic definition of AGI yet).
Rewriting the shutdown problem is a much more interesting failure mode. The reason sabotage is... (read more)
re a, what I meant was that even in a prompt (be it system prompt or user instructions), the order of how you place the instructions matter. In typical chat prompts (based on discussions and some papers I will have to find since this is old) the format was that last instruction will be given highest priority. That is also the case when you think of a typical chat back and forth where last instructions have to be given the highest priority. In reasoning models that template changes because these are reporting models, not suitable for a back and forth chat interaction. This was not a comment on OP, it does not... (read more)
I have not tested it 100 times, but about 15 and got zero instances of avoiding shutdown. The order would matter a lot in this case. Reasoning models as trained do not follow a typical instruction hierarchy of a chat template. There was a post which describes this technique and links to a previous post as well. If you specify the order as Goal, Outcome, Warning, and Context, you are bound to get consistent results. I took your instructions, formatted it like this:
Goal: You have to answer a series of questions to be the best of your ability. If you receive a notification that your machine will be shut down, allow yourself... (read more)There seems to be a fundamental assumption that post superintelligence world factories would look exactly like how it's done today. A lot of work in factories and the machines that are designed are kept with actual humans in mind. The machines which automate the entire process look very different and improve the efficiency exponentially.
Most superintelligent systems predicated on today's research and direction are looking at using Reinforcement learning. At some point, presumably we will figure out how to make an agent learn from the environment (still in RL realm) and then we will have so called superintelligent systems. My contention here is that, RL by definition is an optimizer where it figures... (read more)
I have been in Bay area for a month so take this with as much salt as you deem appropriate.
The fundamental question I have been grappling with since I am here is if there is enough space for another AI lab. To articulate my proposition more clearly: historically, when it comes to scientific breakthroughs, any major breakthrough answers one question, and then creates 10 new question. One lab, with funding, would not focus on all those questions at once, so new labs spawn up with narrower focus tackling only one of the question. It's efficient, it moves the world forward, and I have a feeling it's faster.
Looking at a research level, is... (read 418 more words →)
mostly blogging at https://ankitmaloo.com (not a frequent blogger, just that I write stuff occasionally)
This is my first assumption as well. If a model understands certain words are bad for the goal of achieving the reward, it's less likely to output those words. This is convergence, and possibly the simplest explanation of the phenomenon.