metawrong

Comments

Are Sparse Autoencoders a good idea for AI control?
metawrong · 7mo

> Finetuned a SAE on deceptive/non deceptive reasoning traces from Gemma 9b

If you generate synthetic deceptive trajectories, how can you be sure the SAE will generalise to 'real' deceptive trajectories? Also, in those cases, why do you need SAEs at all? Could you use probes instead?

Shortform
metawrong · 8mo

How does this explain the decoy effect? [1]

[1] I am not sure how real or how well researched the 'decoy effect' is.

The Plan - 2024 Update
metawrong · 8mo

So you would expect Claude Opus 3 to be harder to interpret than Claude Sonnet 3.5?

My intuition is that larger models of the same capability would exhibit less superposition and thus be easier to interpret.

Monthly Roundup #23: October 2024
metawrong · 10mo

> The news is good, and there are now seven shows in my tier 1

@Zvi Which are the other shows in your tier 1?

Ryan Kidd's Shortform
metawrong · 10mo

LASR (https://www.lasrlabs.org/) is giving an £11,000 stipend for a 13-week program; assuming 40h/week, it works out to ~$27/hour.
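
For anyone who wants to check the arithmetic, here is a rough sketch. The stipend, duration, and 40h/week figure come from the comment above; the exchange rate of 1.28 USD per GBP is my own assumption (it is not stated anywhere here), and the function name is just illustrative.

```python
def hourly_rate(stipend: float, weeks: int, hours_per_week: float) -> float:
    """Return the stipend divided by the total hours worked."""
    return stipend / (weeks * hours_per_week)


# Figures from the comment: £11,000 over 13 weeks at an assumed 40 h/week.
rate_gbp = hourly_rate(11_000, 13, 40)

# ASSUMPTION: a GBP -> USD rate of 1.28; the comment does not state one.
ASSUMED_USD_PER_GBP = 1.28
rate_usd = rate_gbp * ASSUMED_USD_PER_GBP

print(f"£{rate_gbp:.2f}/hour, roughly ${rate_usd:.2f}/hour")
```

So the figure is about £21/hour before conversion, and the dollar number moves with whatever exchange rate you plug in.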

I'm a Former Israeli Officer. AMA
metawrong · 2y

Thank you for doing this!

A few random questions; of course, feel free to say as much or as little as you want:

Have you personally been to the Gaza Strip (or the West Bank)? If yes, what are your impressions? How do people live there? Is it common for regular (non-military) Jewish people to hang around those places? How common is it for Palestinians to hang around outside of Gaza and the West Bank? How common is it for Jewish and Palestinian people to mix and interact in everyday life?

I have just recently learned about the Gilad Shalit prisoner exchange (a single captured Israeli soldier was exchanged for over 1000 Palestinian prisoners). What do you think about that exchange?

What is your probability that the IDF knew about the attack but still let it happen? If it is near 0, what is your current best explanation for the IDF being caught by surprise by such a massive attack?
