LESSWRONG
LW

2903
reallyeli
423Ω22490
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
3reallyeli's Shortform
7mo
13
Experiments With Sonnet 4.5's Fiction
reallyeli3d30

The guy next to me, who introduced himself as "Blake, Series B, stealth mode,"

I don't think it makes sense to have a startup which is in stealth mode, but is also raising Series B (a later round of funding for scaling once you've found a proven business model).

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
reallyeli4d10

Thanks for reply!

When I say future updates I'm referring to stuff like the EM fintetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".

Maybe that's a more specific hypothesis than what you intended, though.

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
reallyeli4d21

So... why does this work? Wichers et al says

We hypothesize that by modifying instructions to request the undesired behavior, we prevent the
LLM from learning to exhibit the behavior when not explicitly requested.

I found the hypothesis from Tan et al more convincing, though I'm still surprised by the result.

Our results suggest that inoculation prompts work by eliciting the trait of interest. Our findings suggest that inoculated data is ‘less surprising’ to the model, reducing the optimization pressure for models to globally update, thereby resulting in lowered expression of traits described by the inoculation prompt.

My understanding of the Tan et al hypothesis: when in training the model learns "I do X when asked," future updates towards "I do X" are somewhat contained within the existing "I do X when asked" internal machinery, rather than functioning as global updates to "I do X".

Reply1
Omelas Is Perfectly Misread
reallyeli13d10

I've always thought this about Omelas, never heard it expressed!

Reply
AI Safety Field Growth Analysis 2025
reallyeli17d30

While I like the idea of the comparison, I don't think the gov't definition of "green jobs" is the right comparison point. (e.g. those are not research jobs)

Reply2
leogao's Shortform
reallyeli1mo40

one very easy way to trick our own calibration sensors is to add a bunch of caveats or considerations that make it feel like we've modeled all the uncertainty (or at least, more than other people who haven't). so one thing i see a lot is that people are self-aware that they have limitations, but then over-update on how much this awareness makes them calibrated

Agree, and well put. I think the language of "my best guess" "it's plausible that" etc. can be a bit thought-numbing for this and other reasons. It can function as plastic bubble wrap around the true shape of your beliefs, preventing their sharp corners from coming into contact with reality. Thoughts coming into contact with reality is good, so sometimes I try to deliberately strip away my precious caveats when I talk.

I most often to this when writing or speaking to think, not to communicate, since by doing this you pay the cost of not communicating your true confidence level which can of course be bad.

Reply
reallyeli's Shortform
reallyeli2mo10

(This is a brainstorm-type post which I'm not highly confident in, putting out there so I can iterate. Thanks for replying and helping me think about it!)

I don't mean that the entire proof fits into working memory, but that the abstractions involved in the proof do. Philosophers might work with a concept like "the good" which has a few properties immediately apparent but other properties available only on further deep thought. Mathematicians work with concepts like "group" or "4" whose properties are immediately apparent, and these are what's involved in proofs. Call these fuzzy / non-fuzzy concepts.

(Philosophers often reflect on their concepts, like "the good," and uncover new important properties, because philosophy is interested in intuitions people have from their daily experience. But math requires clear up-front definitions; if you reflect on your concept and uncover new important properties not logically entailed from the others, you're supposed to use a new definition.)

Reply
reallyeli's Shortform
reallyeli2mo10

Human minds form various abstractions over our environment. These abstractions are sometimes fuzzy (too large to fit into working memory) or leaky (they can fail).

Mathematics is the study of what happens when your abstractions are completely non-fuzzy (always fit in working memory) and completely non-leaky (never fail). And also the study of which abstractions can do that.

Reply
plex's Shortform
reallyeli2mo98

I think this is a good metaphor, but note that it is still very possible to be a dick, hurt other people, etc. while communicating in NVC style. It's not a silver bullet because nothing is.

Reply1
reallyeli's Shortform
reallyeli2mo32

It might be important for AI strategy to track approx how many people have daily interactions with AI boyfriends / girlfriends. Or, in more generalized form, how many people place a lot of emotional weight and trust in AIs (& which ones they trust, and on what topics).

This could be a major vector via which AIs influence politics, get followers to do things for them, and generally cross the major barrier. The AIs could be misaligned & scheming, or could be acting as tools of some scheme-y humans, or somewhere in between.

(Here I'm talking about AIs which have many powerful capabilities, but aren't able to act on the world themselves e.g. via nanotechnology or robot bodies — this might happen for a variety of reasons.)

Reply
Load More
3reallyeli's Shortform
7mo
13
9Funding for programs and events on global catastrophic risk, effective altruism, and other topics
1y
0
16Funding for work that builds capacity to address risks from transformative AI
1y
0
36Are "superforecasters" a real phenomenon?
6y
29