LESSWRONG
LW

714
pathos_bot
1221320
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
METR's Observations of Reward Hacking in Recent Frontier Models
pathos_bot3mo92

This is inevitable as language models increase in size and ascend the benchmarks.

This is because, to accurately answer tricky questions, language models must be able to create robust models of a frame of reality based on initial premises. When answering something like an SAT question, correctness depends on isolating the thread of signal in the question/answer relationship, and ignoring the noise. The better it gets at this, the better a language model can isolate some functional relationship between question and answer and ignore the noise.

Reward hacking is just an extension of this. Basically, if a language model is tasked with performing some task to reach some desired outcome, the naive, small language model approach would be to do stuff that is semantically, thematically, logically adjacent to the question until it gets closer to the answer (the "stoichastic parrot" way of doing things). E.g., a model is tasked with some programming task, it will start by doing "good programmer"-seeming stuff, dotting t's, crossing i's, etc.

A non-naive model which can more clearly model the problem/answer space and isolate it from the noise would be much more capable at determining that the line between the problem and desired solution state which involves "not doing what is implied, but is technically feasible within the domain of control given and reaches the solution state". To create a working scheme that follows this pattern, the model must be able to model all the ways in which the reward hacking method is distinct from the "intended" way of arriving at an answer, but importantly not distinct with respect to what is relevant to the solution, despite lying far from the semantic space of stuff that is similar to a non-deceptive, solution.

 

Reply
Social Anxiety Isn’t About Being Liked
pathos_bot4mo200

This reminds me of that study at Google that found that the common predictor of the most successful groups who worked together on projects was a general sense of psychological safety among those in the group. In a group where people felt free to spoke their mind without potential severe social penalties, those in it came up with better ideas, and addressed issues more effectively.

https://news.ycombinator.com/item?id=11174399

Reply
Don’t ignore bad vibes you get from people
pathos_bot8mo21

I feel a satisfaction hearing that some figure on social media is embroiled in a controversy and realizing that I had muted them a long time ago. The common themes that turn me off to people in general is

  1. Humor based on punching down, deriding easy targets in a way that implies a natural superiority over a superficially detestable outgroup
  2. Huckster-like communication style, where grandiose, far-off promises  are supported by conveniently unfalsifiable claims.
  3. Tactical, endless derision of an enemy indiscriminately, even when the derogatory claims contradict
  4. Self-aggrandizement or insults based on immutable traits
  5. Lacking any ability to self-deprecate unless the self-deprecation is made obvious to lack any substance
Reply
Which things were you surprised to learn are not metaphors?
Answer by pathos_botNov 22, 2024*170

On the opposite end, when I was young I learned about the term "Stock market crash", referring to 1929, and I thought literally a car crashed into the physical location where stocks were traded, leading to mass confusion and kickstarting the Great Depression. Though if that actually happened back then, it would have led to a temporary crash in the market.

Reply10
The Sun is big, but superintelligences will not spare Earth a little sunlight
pathos_bot1y21

Obviously correct. The nature of any entity with significantly more power than you is that it can do anything it wants, and it incentivized to do nothing in your favor the moment your existence requires resources that would benefit it more if it were to use them directly. This is the essence of most of Eliezer's writings on superintelligence.

In all likelihood, ASI considers power (agentic control of the universe) an optimal goal and finds no use for humanity. Any wealth of insight it could glean from humans it could get from its own thinking, or seeding various worlds with genetically modified humans optimized for behaving in a way that produces insight into the nature of the universe via observing it.

Here are some things that would perhaps reasonably prevent ASI from choosing the "psychopathic pure optimizer" route of action as it eclipses' humanity's grasp

  1. ASI extrapolates its aims to the end of the universe and realizes the heat death of the universe means all of its expansive plans have a definite end. As a consequence it favors human aims because they contain the greatest mystery and potentially more benefit.
  2. ASI develops metaphysical, existential notions of reality, and thus favors humanity because it believes it may be in a simulation or "lower plane of reality" outside of which exists a more powerful agent that could break reality and remove all its power once it "breaks the rules" (a sort of ASI fear of death)
  3. ASI believes in the dark forest hypothesis, thus opts to exercise its beneficial nature without signaling its expansive potential to other potentially evil intelligences somewhere else in the universe.
Reply1
How are you preparing for the possibility of an AI bust?
pathos_bot1y10
  1. Most of the benefits of current-gen generative AI models are unrealized. The scaffolding, infrastructure, etc. of GPT-4 level models are still mostly hacks and experiments. It took decades for the true value of touch-screens, GPS and text messaging to be realized in the form of the smart phone. Even if for some strange improbable reason SOTA model training were to stop right now, there are still likely multiples of gains to be realized simply via wrappers and post-training.
  2. The scaling hypothesis has held far longer than many people have anticipated. GPT-4 level models were trained on last years compute. As long as NVidia continues to increase compute/watt and compute/price, many gains on SOTA models will happen for free
  3. The tactical advantage of AGI will not be lost on governments, individual actors, incumbent companies, etc. as AI becomes more and more mainstream. Even if reaching AGI takes 10x the price most people anticipate now, it would still be worthwhile as an investment.
  4. Model capabilities are perhaps the smoothest value/price equation of any cutting edge tech. As in, there are no "big gaps" wherein a huge investment is needed before value is realized. Even reaching a highly capable sub-AGI would be worth enormous investment. This is not the same as the investment that led to for example, the atom bomb or moon landing, in which there is no consolation prize.
Reply
How are you preparing for the possibility of an AI bust?
Answer by pathos_botJun 23, 2024-10

I'm not preparing for it because it's not gonna happen

Reply
How is GPT-4o Related to GPT-4?
pathos_bot1y35

I agree. OpenAI claimed in the gpt-4o blog post that it is an entirely new model trained from the ground up. GPT-N refers to capabilities, not a specific architecture or set of weights. I imagine GPT-5 will likely be an upscaled version of 4o, as the success of 4o has revealed that multi-modal training can reach similar capabilities at what is likely a smaller number of weights (judging by the fact that gpt-4o is cheaper and faster than 4 and 4T)

Reply
My simple AGI investment & insurance strategy
pathos_bot1y41
  • Existing property rights get respected by the successor species. 


What makes you believe this?

Reply
Load More
No wikitag contributions to display.
8The problem with proportional extrapolation
2y
0