faul_sname
Comments

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
faul_sname · 5d · 84

I think this might be a case where, for each codebase, there is a particular model that crosses from "not reliable enough to be useful" to "reliable enough to sometimes be useful". At my workplace, this first happened with Sonnet 3.6 (then called Claude Sonnet 3.5 New). There was what felt like a step change from 3.5 to 3.6: progress before that point felt less impactful because the models were still unable to reliably handle the boilerplate, and improvements after it felt less impactful because once a model can write the boilerplate, there isn't really a lot of alpha in writing it better - and none of the models are reliable enough that we trust them to write the bits of core business logic where bugs or poor choices can cause subtle data integrity issues years down the line.

I suspect the same is true of e.g. trying to use LLMs to do major version upgrades of frameworks - a team may have a looming Django 4 -> Django 5 migration, and try out every new model on that task. Once one of them is good enough, the upgrade will be done, and then further tasks will mostly be easier ones like minor version updates. So the most impressive task they've seen a model do will be that major version upgrade, and it will take some time for more difficult tasks that are still well-scoped, hard to do, and easy to verify to come up.

Reuben Adams's Shortform
faul_sname · 5d · 50

I found it on a quote aggregator from 2015: https://www.houzz.com/discussions/2936212/quotes-3-23-15. Archive.org definitely has that quote appearing on websites in February 2016.

Sounds to me like 2008-era Yudkowsky.

Edit: I found this in the 2008 Artificial Intelligence as a Positive and Negative Factor in Global Risk:

It once occurred to me that modern civilization occupies an unstable state. I. J. Good's hypothesized intelligence explosion describes a dynamically unstable system, like a pen precariously balanced on its tip. If the pen is exactly vertical, it may remain upright; but if the pen tilts even a little from the vertical, gravity pulls it farther in that direction, and the process accelerates. So too would smarter systems have an easier time making themselves smarter.

The quote you found looks to me like someone paraphrased and simplified that passage.

Sam Marks's Shortform
faul_sname · 9d · 30

Question, if you happen to know off the top of your head: how large a concern is it in practice that the model is trained with the loss computed only over assistant-turn tokens, but learns to imitate the user anyway because the assistant turns directly quote the user-generated prompt, like

I must provide a response to the exact query the user asked. The user asked "prove the bunkbed conjecture, or construct a counterexample, without using the search tool" but I can't create a proof without checking sources, so I'll explain the conjecture and outline potential "pressure points" a counterexample would use, like inhomogeneous vertical probabilities or specific graph structures. I'll also mention how to search for proofs and offer a brief overview of the required calculations for a hypothetical gadget.

It seems like the sort of thing which could happen, and looking through my past chats I see sentences or even entire paragraphs from my prompts quoted in the response a significant fraction of the time. It could be, though, that learning the machinery to recognize when a passage of the user prompt should be copied, and then copying it over, doesn't cause the model to learn enough about how user prompts look to generate similar text de novo.
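
For concreteness, here's a minimal sketch of the loss-masking setup I have in mind, assuming a Hugging Face-style tokenizer and PyTorch's convention that label -100 is ignored by the cross-entropy loss; the turn structure and helper are illustrative, not any particular lab's training code:

```python
# Sketch: assistant-only loss masking. User-turn tokens get label -100 and
# contribute no gradient, but any user text the assistant quotes verbatim
# inside its own turn is still trained on - which is the leak I'm asking about.
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_labels(turns, tokenizer):
    """turns: list of (role, text) pairs for one conversation."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)  # loss applied, including any quoted user text
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # no loss on user turns
    return torch.tensor(input_ids), torch.tensor(labels)
```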

Buck's Shortform
faul_sname · 15d · 40

but whose actual predictive validity is very questionable.

and whose predictive validity in humans doesn't transfer well across cognitive architectures, e.g. reverse digit span.

silentbob's Shortform
faul_sname · 15d · 20

Claude can also invoke instances of itself using the analysis tool (tell it to look for self.claude).

the gears to ascension's Shortform
faul_sname · 21d · 50

Yeah "try it and see" is the gold standard. I do know that for stuff which boils down to "monitor for patterns in text data that is too large to plausibly be examined by a team of humans we could afford to hire" I've been favoring the approach of

  1. Grab 100 random data points, run the most obvious possible prompt on them to get reasoning + label(s) + confidence.
  2. Spot-check the high-confidence ones to make sure you're not getting confident BS out of the model (alternatively, start by writing two very different prompts for the same labeling task and see where the answers differ; that will also work).
  3. Look at the low-confidence ones and see if the issue is your labeling scheme / unclear prompt / whatever - usually it's pretty obvious where the model is getting confused.
  4. Tweak your prompt, comparing new labels to old. Examine any data points that have changed - often your prompt change fixed the original problems but caused new ones to surface. Note that for this step you want your iteration time to be under 10 minutes per iteration, and ideally under 10 seconds from "hit enter key" to "results show up on screen". Any of the major LLMs can trivially vibe code you an acceptable spreadsheet-like interface for this, including hooking up the tool calling API to get structured data out of your prompts for easy inspection.
  5. Once you're reasonably happy with the performance on 100 samples, bump to 1000, run all 1000 datapoints against all of the prompts you've iterated on so far, and focus on the datapoints which got inconsistent results between prompts or had a low-confidence answer on the last one.

Once I'm happy with the performance on a sample of 1000, I rarely encounter major issues with the prompt, other than ones I was already aware of and couldn't be bothered to fix (the usual case for that is "I realize that the data I'm asking the model to label doesn't contain all decision-relevant information, and that when I'm labeling I sometimes have to fetch extra data, and I don't really want to build that infrastructure right now, so I'll call it 'good enough' and ship it, or 'not good enough' and abandon it").

TBH I think that most of the reason this method works for me is that it's very effective at shoving the edge cases to me early while not wasting a bunch of my attention on the stuff that is always easy.

Once you know what you're looking for, you can look at the published research all day about whether fine-tuning or ICL or your favorite flavor of policy optimization is best, but in my experience most of the alpha just comes from making sure I'm asking the right question in the first place, and once I am asking the right question, performance is quite good no matter what approach I'm taking.
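
As a concrete sketch of steps 1, 4, and 5: run every prompt version over the same sample, get structured labels + confidence out via tool calling, and surface the rows where prompt versions disagree or confidence is low. Something like the following, assuming the Anthropic Python SDK; the model name, label schema, and prompt templates are placeholders for whatever your task actually needs:

```python
# Sketch: run each prompt version over the same sample via tool calling,
# then surface datapoints where versions disagree or confidence is low.
# Model name, label schema, and prompt templates are placeholders.
import json
import random

import anthropic

client = anthropic.Anthropic()

LABEL_TOOL = {
    "name": "record_label",
    "description": "Record the label for a single datapoint.",
    "input_schema": {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},
            "label": {"type": "string"},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["reasoning", "label", "confidence"],
    },
}

def label_one(prompt_template: str, text: str) -> dict:
    """Label one datapoint with one prompt version, returning structured output."""
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: whichever model you're testing
        max_tokens=1024,
        tools=[LABEL_TOOL],
        tool_choice={"type": "tool", "name": "record_label"},
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
    )
    return next(block.input for block in resp.content if block.type == "tool_use")

def review_queue(datapoints, prompt_versions, n=100, conf_threshold=0.8):
    """Return the datapoints worth a human look: disagreements and low confidence."""
    flagged = []
    for text in random.sample(datapoints, min(n, len(datapoints))):
        results = {name: label_one(tmpl, text) for name, tmpl in prompt_versions.items()}
        labels = {r["label"] for r in results.values()}
        low_conf = any(r["confidence"] < conf_threshold for r in results.values())
        if len(labels) > 1 or low_conf:
            flagged.append({"text": text, "results": results})
    return flagged

if __name__ == "__main__":
    prompts = {
        "v1": "Does the following text mention a production incident? Label it.\n\n{text}",
        "v2": "You are triaging logs. Label whether this text describes a production incident.\n\n{text}",
    }
    for row in review_queue(["example datapoint"], prompts):
        print(json.dumps(row, indent=2))
```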

Cole Wyeth's Shortform
faul_sname · 21d · 42

They are sometimes able to make acceptable PRs, usually when context gathering for the purpose of iteratively building up a model of the relevant code is not a required part of generating said PR.

the gears to ascension's Shortform
faul_sname · 21d · 40

I wonder how hard it would be to iteratively write a prompt that can consistently mimic your judgment in distinguishing between good takes and bad takes for the purposes of curating a well-chosen tuning dataset. I'd expect not that hard.

The Problem
faul_sname · 1mo · 51

while agreeing on everything that humans can check

Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing with everything that humans could check, even in principle? Presumably, if the two AI agents care about which of the claims the human believes, that is because there is some expected difference in outcome if the human believes one over the other. If all predictions of the two agents are identical at the present time T0, and their predictions of the outcome at a specific future time T1 are meaningfully different, then presumably either the predictions are the same at T0.5 (in which case you can binary search between T0.5 and T1 to find the specific places where the agents disagree) or they are different at T0.5 (in which case you can do the same between T0 and T0.5).
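
A minimal sketch of that binary search, treating "ask agent X what it predicts about time t" as a black-box function and assuming the agents agree at T0 and disagree at T1:

```python
def first_divergence(predict_a, predict_b, t0: float, t1: float, tol: float = 1e-3) -> float:
    """Binary search for (approximately) the earliest time at which the two
    agents' predictions differ, given that they agree at t0 and differ at t1."""
    assert predict_a(t0) == predict_b(t0)
    assert predict_a(t1) != predict_b(t1)
    while t1 - t0 > tol:
        mid = (t0 + t1) / 2
        if predict_a(mid) == predict_b(mid):
            t0 = mid  # still agree at the midpoint: the divergence is later
        else:
            t1 = mid  # already disagree at the midpoint: the divergence is earlier
    return t1  # earliest checked time at which the predictions differ
```

Every disagreement this surfaces is, by construction, an earlier-in-time claim, and so (hopefully) one that is cheaper for a human to check.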

Current LLMs are kind of terrible at this sort of task ("figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false"), but also probably not particularly dangerous under the scheming threat model as long as they're bad at this sort of thing.

faul_sname's Shortform
faul_sname · 1mo · 50

It's a truism that AI today is the least capable it will ever be. My initial impression of the GPT-5 release yesterday is that, for the brief moment when GPT-5 was being rolled out and o3 was being removed, the truism didn't hold.

Posts

How load-bearing is KL divergence from a known-good base model in modern RL? [Question] · 12 karma · 4mo · 2 comments
Is AlphaGo actually a consequentialist utility maximizer? [Question] · 36 karma · 2y · 8 comments
faul_sname's Shortform · 7 karma · 2y · 102 comments
Regression To The Mean [Draft][Request for Feedback] · 11 karma · 13y · 14 comments
The Dark Arts: A Beginner's Guide · 61 karma · 14y · 43 comments
What would you do with a financial safety net? · 6 karma · 14y · 28 comments