LESSWRONG

faul_sname — Comments (sorted by newest)
Beware LLMs' pathological guardrailing
faul_sname · 2d · 20

Seconding this observation.

As a practical note, I find it useful, in addition to telling the LLM up front to avoid certain patterns as you mention, to also tell it to go back over its changes and rewrite, e.g., code that returns a placeholder value in a failure case into code that throws a descriptive error in that failure case. There seem to be many cases where the LLM can't one-shot follow all of the instructions you give it, but can recognize that failure and course-correct after the fact.
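
As a concrete illustration of the kind of rewrite I mean (a hypothetical lookup function of my own, not code from any real codebase):

```python
# Before: the failure case silently returns a placeholder value.
def get_exchange_rate(currency: str) -> float:
    rates = {"USD": 1.0, "EUR": 1.08}
    return rates.get(currency, 0.0)  # 0.0 masks the failure for downstream code

# After the "go over your changes" pass: the failure case throws a descriptive error.
def get_exchange_rate_strict(currency: str) -> float:
    rates = {"USD": 1.0, "EUR": 1.08}
    if currency not in rates:
        raise ValueError(f"No exchange rate known for currency {currency!r}")
    return rates[currency]
```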

tailcalled's Shortform
faul_sname · 2d · 20

> which are intractable to map out

Yeah, until recently I thought the same thing, based on my belief that distilling a teacher model that had been trained with RL into a student model preserved not just the teacher's distribution over outputs but also, mostly, the mechanisms behind those outputs. As far as I can tell, that belief was incorrect.
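
For context on what I mean by "preserves distributions over outputs", here is a minimal sketch of a standard distillation objective (my own illustration, assuming PyTorch-style per-token logits from a teacher and a student). The loss only asks the student to match the teacher's output distribution; nothing in it directly constrains the student to reuse the teacher's internal mechanisms.

```python
import torch.nn.functional as F

# Minimal distillation-loss sketch (illustrative, not any particular codebase).
# `student_logits` and `teacher_logits` have shape (batch, vocab).
def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)        # target distribution
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Matches output distributions only; internal mechanisms are unconstrained.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```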

ParrotRobot's Shortform
faul_sname · 5d · 21

Looking at the first graph in this Epoch AI publication, the "time to fix a failure mode has decreased" hypothesis does look particularly plausible to me.

adamzerner's Shortform
faul_sname · 6d · 60

I think the baby stage is much more than 5% of the total hours that parents spend directly interacting with their kids. My cached memory from when I did a Fermi estimate of this is that, if you're a UMC American, roughly 25% of the hours you spend directly interacting with your kid come in the first 2.5 years, half in the first 6 years, and 75% in the first 12 years (and 90%+ before they turn 18).
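
For what it's worth, here's a toy version of that kind of Fermi estimate. The hours-per-day numbers are purely illustrative assumptions of mine (not the numbers I originally used), but they land in roughly that ballpark:

```python
# Toy Fermi sketch: assumed direct-interaction hours/day by age band
# (illustrative assumptions only), then the cumulative share of lifetime hours.
bands = [  # (start_age, end_age, assumed hours of direct interaction per day)
    (0.0, 2.5, 5.0),
    (2.5, 6.0, 3.5),
    (6.0, 12.0, 2.0),
    (12.0, 18.0, 1.25),
    (18.0, 48.0, 0.2),  # occasional contact spread across adulthood
]
total = sum((end - start) * hours for start, end, hours in bands)
cumulative = 0.0
for start, end, hours in bands:
    cumulative += (end - start) * hours
    print(f"by age {end:>4}: {cumulative / total:.0%} of direct-interaction hours")
# Prints roughly 25% by 2.5, ~50% by 6, ~73% by 12, ~88% by 18.
```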

A Review of Nina Panickssery’s Review of Scott Alexander’s Review of “If Anyone Builds It, Everyone Dies”
faul_sname · 6d · 42

> Next, Nina argues that the fact that LLMs don't directly encode your reward function makes them less likely to be misaligned, not more, the way IABIED implies. I think maybe she’s straw-manning the concerns here. She asks “What would it mean for models to encode their reward functions without the context of training examples?” But nobody’s arguing against examples, they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)

 

Can you expand a little bit on this? I don't understand why replacing "here are some examples of world states and what actions led to good/bad outcomes, try to take actions which are similar to the ones which led to good outcomes in similar world states in the past" with "here's a reward function directly mapping world-state -> goodness" would be reassuring rather than alarming.
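
To make the contrast concrete, here's a toy sketch of the two shapes of system I understand to be under discussion (my own illustration, not anyone's actual proposal; `policy_model`, `world_model`, and `reward_fn` are stand-in interfaces):

```python
# Shape 1: a learned policy -- "take actions similar to the ones that led to
# good outcomes in similar world states in the past", generalized by a model.
def act_like_past_successes(policy_model, world_state):
    return policy_model.sample_action(world_state)

# Shape 2: an explicit reward function over world states, with the agent at
# every turn searching over candidate actions for whichever one maximizes it.
# Any gap between the written reward and what we actually care about is then
# optimized against directly (the Goodhart failure mode mentioned below).
def act_by_maximizing_reward(world_model, reward_fn, world_state, candidate_actions):
    return max(
        candidate_actions,
        key=lambda action: reward_fn(world_model.predict(world_state, action)),
    )
```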

Having more insight into exactly what past world states are most salient for choosing the next action, and why it thinks those world states are relevant, is desirable. But "we don't currently have enough insight with today's models for technical reasons" doesn't feel like a good reason to say "and therefore we should throw away this entire promising branch of the tech tree and replace it with one that has had [major problems every time we've tried it](https://en.wikipedia.org/wiki/Goodhart%27s_law)".

Am I misinterpreting what you're saying though, and there's a different thing which everyone is on the same page about?

Yes, AI Continues To Make Rapid Progress, Including Towards AGI
faul_sname · 11d · 50

> ah, any software that you can run on computers that can cause the extinction of humanity even if humans try to prevent it would fulfill the sufficiency criterion for AGI

— niplav

A flight control program directing an asteroid-redirection rocket, programmed to find a large asteroid and steer it into a collision with Earth, seems like the sort of thing which could be "software that you can run on computers that can cause the extinction of humanity" but not "AGI".

I think it's relevant that "kill all humans" is a much easier target than "kill all humans in such a way that you can persist and grow indefinitely without them".

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
faul_sname · 17d · 84

I think this might be a case where, for each codebase, there is a particular model that goes from "not reliable enough to be useful" to "reliable enough to sometimes be useful". At my workplace this first happened with Sonnet 3.6 (then called Claude Sonnet 3.5 New): the jump from 3.5 to 3.6 felt like a step change, because that was the increment that crossed from "unable to reliably handle the boilerplate" to "able to reliably handle the boilerplate". Progress before that point felt less impactful because every model was still below the threshold, and progress since has felt less impactful because once a model can write the boilerplate there isn't really a lot of alpha in doing it better, and none of the models are reliable enough that we trust them to write the bits of core business logic where bugs or poor choices can cause subtle data-integrity issues years down the line.

I suspect the same is true of e.g. trying to use LLMs to do major version upgrades of frameworks - a team may have a looming django 4 -> django 5 migration, and try out every new model on that task. Once one of them is good enough, the upgrade will be done, and then further tasks will mostly be easier ones like minor version updates. So the most impressive task they've seen a model do will be that major version upgrade, and it will take some time for more difficult tasks that are still well-scoped, hard to do, and easy to verify to come up.

Reuben Adams's Shortform
faul_sname · 17d · 50

> I found it on a quote aggregator from 2015: https://www.houzz.com/discussions/2936212/quotes-3-23-15. Archive.org definitely has that quote appearing on websites in February 2016

Sounds to me like 2008-era Yudkowsky.

Edit: I found this in the 2008 "Artificial Intelligence as a Positive and Negative Factor in Global Risk":

> It once occurred to me that modern civilization occupies an unstable state. I. J. Good's hypothesized intelligence explosion describes a dynamically unstable system, like a pen precariously balanced on its tip. If the pen is exactly vertical, it may remain upright; but if the pen tilts even a little from the vertical, gravity pulls it farther in that direction, and the process accelerates. So too would smarter systems have an easier time making themselves smarter

The quote you found looks to me like someone paraphrased and simplified that passage.

Sam Marks's Shortform
faul_sname · 21d · 30

Question, if you happen to know off the top of your head: how large of a concern is it in practice that the model is trained with a loss function over only assistant-turn tokens, but learns to imitate the user anyway because the assistant turns directly quote the user-generated prompt, like:

> I must provide a response to the exact query the user asked. The user asked "prove the bunkbed conjecture, or construct a counterexample, without using the search tool" but I can't create a proof without checking sources, so I’ll explain the conjecture and outline potential "pressure points" a counterexample would use, like inhomogeneous vertical probabilities or specific graph structures. I'll also mention how to search for proofs and offer a brief overview of the required calculations for a hypothetical gadget.

It seems like the sort of thing which could happen, and looking through my past chats I see sentences or even entire paragraphs from my prompts quoted in the response a significant fraction of the time. It could be, though, that learning the machinery for recognizing when a passage of the user prompt should be copied, and copying it over, doesn't teach the model enough about how user prompts look for it to generate similar text de novo.
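
For reference, here's a minimal sketch of the masking setup I have in mind (an assumed standard chat-SFT arrangement, not necessarily your exact pipeline): labels are set to an ignore index on user-turn tokens so only assistant-turn tokens contribute loss, but user text quoted verbatim inside an assistant turn sits inside the assistant span and does get trained on.

```python
# Sketch of assistant-only loss masking (illustrative; assumes the usual
# convention that label -100 is ignored by the cross-entropy loss).
IGNORE_INDEX = -100

def build_labels(token_ids: list[int], spans: list[tuple[int, int, str]]) -> list[int]:
    """spans: (start, end, role) token ranges covering the conversation."""
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end, role in spans:
        if role == "assistant":
            # Everything inside the assistant turn is a training target --
            # including any user text the assistant quotes back verbatim.
            labels[start:end] = token_ids[start:end]
    return labels
```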

Buck's Shortform
faul_sname · 1mo · 40

> but whose actual predictive validity is very questionable.

...and whose predictive validity in humans doesn't transfer well across cognitive architectures, e.g. reverse digit span.

Posts

12 · How load-bearing is KL divergence from a known-good base model in modern RL? (Question) · 4mo · 2
36 · Is AlphaGo actually a consequentialist utility maximizer? (Question) · 2y · 8
7 · faul_sname's Shortform · 2y · 102
11 · Regression To The Mean [Draft][Request for Feedback] · 13y · 14
61 · The Dark Arts: A Beginner's Guide · 14y · 43
6 · What would you do with a financial safety net? · 14y · 28