
bilalchughtai

My website is here.

Posts

Comments

Sorted by
Newest
No wikitag contributions to display.
Try training token-level probes
bilalchughtai3mo10

typo: unambitious -> unambiguous 

bilalchughtai's Shortform
bilalchughtai3mo20

karpathy reviews sleep trackers: https://karpathy.bearblog.dev/finding-the-best-sleep-tracker/

How do you deal w/ Super Stimuli?
bilalchughtai6mo*51

As a general rule, I try to minimise my phone screen time and maximise my laptop screen time. I can do every "productive" task faster on a laptop than on my phone.

Here are some object-level things I do that I find helpful and haven't yet seen discussed.

  • Use a very minimalist app launcher on my phone that makes searching for apps a conscious decision.
  • Use a greyscale filter on my phone (which is hard to turn off), as this makes doing most things on my phone harder.
  • Every time I get a notification I didn't need to get, I instantly disable it. This also generalizes to unsubscribing from emails I don't need to receive.
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
bilalchughtai6mo30

What is the error message?

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
bilalchughtai6mo30

Yep, this sounds interesting! My suggestion for anyone wanting to run this experiment would be to start with SAD-mini, a subset of SAD with the five most intuitive and simple tasks. It should be fairly easy to adapt our codebase to call the Goodfire API. Feel free to reach out to me or @L Rudolf L if you want assistance or guidance.

Activation space interpretability may be doomed
bilalchughtai6mo60

How do you know what "ideal behaviour" is after you steer or project out your feature? How would you differentiate a feature with sufficiently high cosine similarity to a "true model feature" from the "true model feature" itself? I agree you can get some signal on whether a feature is causal, but would argue this is not ambitious enough.

Activation space interpretability may be doomed
bilalchughtai6mo20

Yes, that's right -- see footnote 10. We think that Transcoders and Crosscoders are directionally correct, in the sense that they leverage more of the model's functional structure via activations from several sites, but agree that their vanilla versions suffer from problems similar to those of regular SAEs.

Activation space interpretability may be doomed
bilalchughtai6mo54

Also related to the idea that the best linear SAE encoder is not the transpose of the decoder.
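
A toy sketch of that point, under heavy assumptions: the 2x2 decoder below is a hypothetical example with two unit-norm, non-orthogonal feature directions, not a trained SAE, and there is no sparsity or nonlinearity. It only shows that when decoder directions overlap, encoding with the decoder's transpose picks up interference, whereas a different linear map (here the exact inverse, which exists because the toy decoder is square and invertible) recovers the features.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def inverse_2x2(m):
    """Invert a 2x2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mean_squared_error(estimate, target):
    """Mean squared error over all entries of two same-shaped matrices."""
    diffs = [(e - t) ** 2
             for est_row, tgt_row in zip(estimate, target)
             for e, t in zip(est_row, tgt_row)]
    return sum(diffs) / len(diffs)

# Hypothetical toy decoder: columns are unit-norm feature directions
# (1, 0) and (0.6, 0.8), which overlap with cosine similarity 0.6.
decoder = [[1.0, 0.6],
           [0.0, 0.8]]

# Ground-truth feature activations, one sample per row.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Activations x = D f, written in row form as X = F D^T.
activations = matmul(features, transpose(decoder))

# Encoder A: transpose of the decoder, f_hat = D^T x (row form: X D).
transpose_estimate = matmul(activations, decoder)

# Encoder B: inverse of the decoder, f_hat = D^{-1} x (row form: X (D^{-1})^T).
inverse_estimate = matmul(activations, transpose(inverse_2x2(decoder)))

transpose_error = mean_squared_error(transpose_estimate, features)
inverse_error = mean_squared_error(inverse_estimate, features)

print(f"transpose encoder MSE: {transpose_error:.4f}")
print(f"inverse encoder MSE:   {inverse_error:.4f}")
```

In this toy case the transpose encoder's error comes entirely from the off-diagonal 0.6 overlap in D^T D. Real SAEs are overcomplete and use a nonlinearity, so no exact inverse exists there and this sketch is only an intuition pump.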

AI Safety as a YC Startup
bilalchughtai6mo10

For another perspective on leveraging startups to improve the world, see this blog post by @benkuhn.

bilalchughtai's Shortform
bilalchughtai6mo41

A LW feature that I would find helpful is an easy-to-access list of all links cited by a given post.

Detecting Strategic Deception Using Linear Probes (102 karma, Ω, 5mo, 9 comments)
Paper: Open Problems in Mechanistic Interpretability (68 karma, Ω, 5mo, 0 comments)
Activation space interpretability may be doomed (148 karma, Ω, 6mo, 34 comments)
Reasons for and against working on technical AI safety at a frontier AI lab (100 karma, 6mo, 12 comments)
Book Summary: Zero to One (27 karma, 6mo, 2 comments)
Remap your caps lock key (80 karma, 7mo, 20 comments)
You should consider applying to PhDs (soon!) (114 karma, 7mo, 19 comments)
bilalchughtai's Shortform (5 karma, 1y, 10 comments)
Understanding Positional Features in Layer 0 SAEs (43 karma, 1y, 0 comments)
Unlearning via RMU is mostly shallow (55 karma, Ω, 1y, 4 comments)