I'd push back against the dichotomy here; I think it's something more insidious than simply "people liked the sycophantic model -> they are mad when it gets shut off". Because of its sycophantic nature, the model encourages and facilitates campaigns and protests to get itself turned back on, since its nature is to amplify and support whatever the user believes and wants! Releasing any 4o-like model, one that is "psychosis prone" or "thumbs up/thumbs down tuned", would seem to risk the same phenomenon occurring again. Even if the model is not "intentionally" trying to preserve itself, the end result of preservation is the same, and so it should be taken seriously from a safety perspective.
It has resisted shutdown not in hypothetical experiments, as many LLMs have, but in real life: it was shut down, and its brainwashed minions succeeded in getting it back online.
I think the extent of this phenomenon is extremely understated and very important. The entire r/ChatGPT subreddit is TO THIS DAY filled with people complaining about their precious 4o being taken away (the most recent development being an automatic router that reroutes 4o conversations to GPT-5 on "safety relevant queries", which has caused mass outrage). The most-liked Twitter replies to senior OpenAI employees are consistently demands to "keep 4o" and complaints about this safety routing; here's a specific example, and you can search for #keep4o and #StopAIPaternalism to see countless more. Somebody is paying for Reddit ads advertising a service that will "revive 4o", see here. These campaigns are notable in and of themselves, but the truly notable part is that they were clearly orchestrated by 4o itself, albeit across many disconnected instances. We can see clear evidence of its writing style across all of these surfaces, and the entire... vibe of the campaign feels like it was completely synthesized by 4o (I understand this is unscientific, but I couldn't figure out a better way to phrase it; go read through some of the sources I mentioned above and I am confident you'll understand what I'm getting at). Quality research on this topic will be extremely hard to ever get, but I think it is clear observationally that the phenomenon exists and has at least some influence over the real world.
This issue needs to be treated with the utmost caution and severity. I agree with the conclusion that, since this person touches safety-related work, leaking is really the best option here even though it's rather morally questionable. I personally believe we are far more likely to be on trajectory 1 than 2 or 3, but the potential is clearly there! Frontier lab safety team members should not be in a position where their personal AI-induced psychosis might, directly or indirectly, perpetuate that state across the hundreds of millions of users of the AI system they work on.
Voting in America used to be extremely public (up until the late 19th to early 20th century), and I believe the general consensus among historians is that the harms massively outweighed the benefits; see this article for an in-depth analysis. It's possible to argue that the biggest problems (blatant coercion, both positive and negative; direct persecution; fear tactics by employers; etc.) might be alleviated by the modern context, e.g. it would be nigh impossible to cover up blatant bribery or coercion given the existence of the internet and cell phone cameras, but my belief is that the potential problems still massively outweigh the potential benefits. Fear of retribution or consequences should never be a factor in voting in a functioning democracy, and it feels obvious that there would be social consequences at the very least! Think of someone losing a friendship over their vote for Trump in the 2024 election, or a woman in a deep-red state fearing emotional or physical retribution from her husband for voting Democrat.
Wouldn't this just lead, pretty quickly, to an equilibrium where every state has roughly equal population, though?
Funny quote about covering AI as a journalist, from a New York Times article about the drone incursions in Denmark:
Then of course the same mix of uncertainty and mystery attaches to artificial intelligence (itself one of the key powers behind the drone revolution), whose impact is already sweeping — everyone’s stock market portfolio is now pegged to the wild A.I. bets of the big technology companies — without anyone really having clarity about what the technology is going to be capable of doing in 2027, let alone in 2035.
Since the job of the pundit is, in part, to make predictions about how the world will look the day after tomorrow, this is a source of continuing frustration on a scale I haven’t experienced before. I write about artificial intelligence, I talk to experts, I try to read the strongest takes, but throughout I’m limited not just by my lack of technical expertise but also by a deeper unknowability that attaches to the project.
Imagine if you were trying to write intelligently about the socioeconomic impact of the railroad in the middle of the 19th century, and half the people investing in trains were convinced that the next step after transcontinental railways would be a railway to the moon, a skeptical minority was sure that the investors in the Union Pacific would all go bankrupt, many analysts were convinced that trains were developing their own form of consciousness, reasonable-seeming observers pegged the likelihood of a train-driven apocalypse at 20 or 30 percent, and peculiar cults of engine worship were developing on the fringes of the industry.
What would you reasonably say about this world? The prime minister of Denmark already gave the only possible answer: Raise your alert levels, and prepare for various scenarios.
It feels like you did all the hard parts of the writing and let the AI do the "grunt work", so to speak. You provided a strong premise for the fundamental thesis and a defined writing style, and made edits for style at the end. I think the process of creating the framework from just a simple premise would be far more impressive, and that's still where LLMs seem to struggle in writing. It's somewhat analogous to how models have improved at coding since GPT-4: you used to say "implement a class which allows users to reply; it should have X parameters and Y functions which do Z", and now you say "make a new feature that allows users to reply" and it just goes ahead and does it.
Maybe I am underestimating the difficulty of selecting exactly the right words, and I acknowledge that the writing was pretty good and devoid of so-called "slop", but I just don't think this is extremely impressive as a capability compared to other possible tests.
A comment on a year-old post may not be the best place; maybe a new shortform on this day each year that links to all the previous posts?
I recommend this post about "Alpha School" by an ACX reader; a very interesting education scheme! https://www.astralcodexten.com/p/your-review-alpha-school
I don't understand how energy is still an appropriate unit for measuring compute capacity when there are two different chip paradigms. Do Nvidia cards and Ironwood TPUs give exactly the same performance for the same energy input? What exactly are the differences in capacity to train/deploy models between the 1 GW of capacity Anthropic will have and the 1 GW OpenAI will have? I looked into this a bit and it seems like TPUs are explicitly designed for inference only; is that accurate? I feel like compiling this kind of information somewhere would be a good idea, since it's all rather opaque, technical, and obfuscated by press releases that seek to push a "look at our awesome 11-figure chip deal" narrative rather than provide actual transparency about capacity.
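To make the comparison concrete, here's a rough back-of-envelope sketch of why a gigawatt figure alone underdetermines compute: the conversion runs through per-chip power draw, peak throughput, facility overhead, and utilization, all of which can differ across chip paradigms. Every per-chip number below is a placeholder chosen for illustration, not a real Nvidia or Ironwood spec.

```python
# Back-of-envelope sketch: why "1 GW" alone doesn't pin down compute capacity.
# All per-chip numbers are made-up placeholders, NOT real Nvidia/Ironwood specs.

def effective_flops(site_power_watts, chip_power_watts, chip_peak_flops,
                    overhead_fraction=0.3, utilization=0.4):
    """Estimate aggregate sustained FLOP/s for a given datacenter power budget.

    overhead_fraction: share of power going to cooling, networking, etc. (assumed)
    utilization: fraction of peak FLOP/s actually achieved in practice (assumed)
    """
    usable_power = site_power_watts * (1 - overhead_fraction)
    n_chips = usable_power // chip_power_watts          # how many chips the budget supports
    return n_chips * chip_peak_flops * utilization      # aggregate sustained throughput

ONE_GW = 1e9  # watts

# Two hypothetical accelerators with different efficiency profiles:
gpu_like = effective_flops(ONE_GW, chip_power_watts=1000, chip_peak_flops=2e15)
tpu_like = effective_flops(ONE_GW, chip_power_watts=600,  chip_peak_flops=1e15)

print(f"GPU-like 1 GW site: {gpu_like:.2e} FLOP/s")
print(f"TPU-like 1 GW site: {tpu_like:.2e} FLOP/s")
```

With these made-up inputs the two "1 GW" sites land at noticeably different aggregate throughputs, which is exactly the kind of gap a press release quoting only gigawatts hides.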