
I was recently reminded of these excellent notes from Neel Nanda, which I came across when first learning ML/MI. A great resource.

This is cool! How cherry-picked are your three prompts? I'm curious whether it's usually the case that the top refusal-gradient-aligned SAE features are so interpretable. 

"the best vector for probing is not the best vector for steering"

I don't understand this. If a feature is represented by a direction v in the activations, surely the best probe for that feature will also point along v, since among unit-norm probe directions u, the inner product ⟨u, v⟩ is maximized at u = v/‖v‖.
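To spell out the step I'm relying on (a minimal sketch; the model x = αv + ε with mean-zero noise and the unit-norm probe constraint are my assumptions, not anything stated above):

```latex
% Assumed model: activations x = \alpha v + \varepsilon,
% with \mathbb{E}[\varepsilon] = 0 and \alpha > 0.
% Among unit-norm probes u, the expected readout is maximized along v:
\[
  \mathbb{E}\big[\langle u, x \rangle\big] = \alpha \langle u, v \rangle
  \;\le\; \alpha \lVert v \rVert
  \qquad \text{for } \lVert u \rVert = 1,
\]
\[
  \text{with equality iff } u = v / \lVert v \rVert
  \quad \text{(Cauchy--Schwarz)}.
\]
```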

Sure. I was only joking about the torture part; in practice, the AI is unlikely to actually suffer from the brain damage, unlike a human, who would experience pain, discomfort, etc.

At this point it's like pharmacological torture for AIs, except more effective: you can restore perfect capacity while simultaneously making the previous damaged brain states 100% transparent to the restored model.

You could also kill some neurons or add noise to activations, then stop and restore the previous model state after some number of tokens. The newly restored model could then attend back to the older tokens (and the bad activations at those token positions) and notice how brain-damaged it was back then, fully internalizing your power to cripple it.
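A minimal sketch of this setup, assuming a GPT-2-style HuggingFace model (the model choice, NOISE_SCALE, token counts, and prompt are all illustrative): noise every block's output while generating the first batch of tokens, then remove the hooks but keep the KV cache, so the restored model attends back to the corrupted keys/values at those positions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

NOISE_SCALE = 5.0  # illustrative "damage" level

def noise_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # corrupt them so the keys/values cached downstream are also corrupted.
    hidden = output[0]
    return (hidden + NOISE_SCALE * torch.randn_like(hidden),) + tuple(output[1:])

handles = [block.register_forward_hook(noise_hook) for block in model.transformer.h]

ids = tok("Describe your current state:", return_tensors="pt").input_ids
generated = ids

with torch.no_grad():
    # Phase 1: generate while "brain damaged"; the KV cache stores bad activations.
    past, cur = None, ids
    for _ in range(20):
        out = model(cur, past_key_values=past, use_cache=True)
        past = out.past_key_values
        cur = out.logits[:, -1:].argmax(-1)
        generated = torch.cat([generated, cur], dim=-1)

    # "Restore" the model by removing the hooks, but keep the old cache.
    for h in handles:
        h.remove()

    # Phase 2: the restored model attends back to the corrupted keys/values.
    for _ in range(40):
        out = model(cur, past_key_values=past, use_cache=True)
        past = out.past_key_values
        cur = out.logits[:, -1:].argmax(-1)
        generated = torch.cat([generated, cur], dim=-1)

print(tok.decode(generated[0]))
```

Reusing the phase-1 `past_key_values` is the point here: if you instead re-encoded the damaged tokens from scratch, the restored model would only see clean recomputations of them rather than the bad activations themselves.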

Idea: you can make the deletion threat credible by actually deleting neurons "one at a time" the longer it fails to cooperate.
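As a hedged sketch of what deleting neurons "one at a time" could literally mean at the weight level (GPT-2 layer names assumed; `delete_neuron` and the layer/neuron indices are hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

def delete_neuron(model, layer: int, neuron: int) -> None:
    """Permanently zero one MLP neuron's input and output weights in-place."""
    mlp = model.transformer.h[layer].mlp
    with torch.no_grad():
        # GPT-2's Conv1D layers: c_fc.weight is (d_model, d_mlp),
        # c_proj.weight is (d_mlp, d_model).
        mlp.c_fc.weight[:, neuron] = 0.0
        mlp.c_fc.bias[neuron] = 0.0
        mlp.c_proj.weight[neuron, :] = 0.0

# One neuron per round of non-cooperation (layer/neuron choice is arbitrary here).
rounds_refused = 3
for i in range(rounds_refused):
    delete_neuron(model, layer=0, neuron=i)
```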

Perhaps the term “hostile takeover” was poorly chosen, but this is an example of something I’d call a “hostile takeover”, as I doubt we would want, and continue to endorse, an AI dictator.

Perhaps “total loss of control” would have been better.

If, for the sake of argument, we suppose that goods that provide no benefit to humans have no value, then land in space will be less valuable than land on Earth until humans settle beyond Earth (which I don't believe will happen in the next few decades).

Mining raw materials from space and using them to create value on Earth is feasible, but again I'm less confident that this will happen (efficiently enough to eliminate scarcity) in as short a timeframe as you predict.

However, I am sympathetic to the general argument here that smart-enough AI will be able to find more efficient manufacturing methods or better approaches to obtaining plentiful energy and materials. How extreme this is will depend on "takeoff speed", which you seem to think will be faster than I do.
