This reminds me of that study at Google that found that the common predictor of the most successful groups who worked together on projects was a general sense of psychological safety among those in the group. In a group where people felt free to spoke their mind without potential severe social penalties, those in it came up with better ideas, and addressed issues more effectively.
https://news.ycombinator.com/item?id=11174399
I feel a satisfaction hearing that some figure on social media is embroiled in a controversy and realizing that I had muted them a long time ago. The common themes that turn me off to people in general is
On the opposite end, when I was young I learned about the term "Stock market crash", referring to 1929, and I thought literally a car crashed into the physical location where stocks were traded, leading to mass confusion and kickstarting the Great Depression. Though if that actually happened back then, it would have led to a temporary crash in the market.
Obviously correct. The nature of any entity with significantly more power than you is that it can do anything it wants, and it incentivized to do nothing in your favor the moment your existence requires resources that would benefit it more if it were to use them directly. This is the essence of most of Eliezer's writings on superintelligence.
In all likelihood, ASI considers power (agentic control of the universe) an optimal goal and finds no use for humanity. Any wealth of insight it could glean from humans it could get from its own thinking, or seeding various worlds with genetically modified humans optimized for behaving in a way that produces insight into the nature of the universe via observing it.
Here are some things that would perhaps reasonably prevent ASI from choosing the "psychopathic pure optimizer" route of action as it eclipses' humanity's grasp
I'm not preparing for it because it's not gonna happen
I agree. OpenAI claimed in the gpt-4o blog post that it is an entirely new model trained from the ground up. GPT-N refers to capabilities, not a specific architecture or set of weights. I imagine GPT-5 will likely be an upscaled version of 4o, as the success of 4o has revealed that multi-modal training can reach similar capabilities at what is likely a smaller number of weights (judging by the fact that gpt-4o is cheaper and faster than 4 and 4T)
What makes you believe this?
This is inevitable as language models increase in size and ascend the benchmarks.
This is because, to accurately answer tricky questions, language models must be able to create robust models of a frame of reality based on initial premises. When answering something like an SAT question, correctness depends on isolating the thread of signal in the question/answer relationship, and ignoring the noise. The better it gets at this, the better a language model can isolate some functional relationship between question and answer and ignore the noise.
Reward hacking is just an extension of this. Basically, if a language model is tasked with performing some task to reach some desired outcome, the naive, small language model approach would be to do stuff that is semantically, thematically, logically adjacent to the question until it gets closer to the answer (the "stoichastic parrot" way of doing things). E.g., a model is tasked with some programming task, it will start by doing "good programmer"-seeming stuff, dotting t's, crossing i's, etc.
A non-naive model which can more clearly model the problem/answer space and isolate it from the noise would be much more capable at determining that the line between the problem and desired solution state which involves "not doing what is implied, but is technically feasible within the domain of control given and reaches the solution state". To create a working scheme that follows this pattern, the model must be able to model all the ways in which the reward hacking method is distinct from the "intended" way of arriving at an answer, but importantly not distinct with respect to what is relevant to the solution, despite lying far from the semantic space of stuff that is similar to a non-deceptive, solution.