Andrew McKnight

Comments

Against Almost Every Theory of Impact of Interpretability
Andrew McKnight · 11mo

Do you think putting extra effort into learning about existing empirical work while doing conceptual work would be sufficient for good conceptual work, or do you think people need to produce empirical work themselves to really make progress conceptually?

Catching AIs red-handed
Andrew McKnight · 1y

Maybe you've addressed this elsewhere, but isn't scheming convergent in the sense that even a perfectly aligned AGI would still have an incentive to scheme unless it already fully knows itself? An aligned AGI can still desire some unmonitored breathing room in which to fully reflect and realize what it truly cares about, even if that thing is what we want.

Also, a possible condition for a fully corrigible AGI would be not having this scheming incentive in the first place, even while having the capacity to scheme.

'Empiricism!' as Anti-Epistemology
Andrew McKnight · 2y

lukeprog argued similarly that we should drop the "the".

The shape of AGI: Cartoons and back of envelope
Andrew McKnight · 2y

Another possible inflection point, pre-self-improvement, could be when an AI gets a set of capabilities that allows it to gain new capabilities at inference time.

UFO Betting: Put Up or Shut Up
Andrew McKnight · 2y

I'll repeat this bet, same odds, same conditions, same payout, if you're still interested: my $10k to your $200 in advance.
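As a back-of-envelope reading of those stakes (assuming the $200 is kept regardless and the $10k is paid out only if the bet's condition resolves), the implied odds and break-even probability are:

$$\text{odds} = \frac{\$10{,}000}{\$200} = 50{:}1, \qquad 200(1-p) - 9{,}800\,p > 0 \;\iff\; p < \frac{200}{10{,}000} = 2\%.$$

So the offer is positive expected value for the offerer whenever he assigns less than about a 2% probability to the condition occurring.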

Policy discussions follow strong contextualizing norms
Andrew McKnight · 3y

Responding to your #1, do you think we're on track to handle the cluster of AGI Ruin scenarios pointed at in 16-19? I feel we are not making any progress here other than towards verifying some properties in 17.

16: outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.
17: on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over. 
18: There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned'
19: there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment

Anthropic's Core Views on AI Safety
Andrew McKnight · 3y

Thanks for the links and explanation, Ethan.

Anthropic's Core Views on AI Safety
Andrew McKnight · 3y

I mean, it's mostly semantics, but I think of mechanistic interpretability as "inner" but not alignment, and I personally think it's clearer that way, so that we don't call everything alignment. Observing properties doesn't automatically get you good properties. I'll read your link, but it's a bit too much to wade into for me atm.

Either way, it's clear how to restate my question: is mechanistic interpretability work the only inner alignment work Anthropic is doing?

Anthropic's Core Views on AI Safety
Andrew McKnight · 3y

Great post. I'm happy to see these plans coming out, following OpenAI's lead.

It seems like all the safety strategies are targeted at outer alignment and interpretability. None of the recent OpenAI, DeepMind, Anthropic, or Conjecture plans seem to target inner alignment, iirc, even though this seems to me like the biggest challenge.

Is Anthropic mostly leaving inner alignment untouched, for now?

Acausal normalcy
Andrew McKnight · 3y

Taken literally, the only way to merge n utility functions into one without any other info (e.g. the preferences that generated the utility functions) is to do a weighted sum. There are only n-1 free parameters.
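As a minimal sketch of that parameter count, write the merged utility as a weighted sum:

$$U(x) = \sum_{i=1}^{n} w_i\, u_i(x), \qquad w_i \ge 0, \quad \sum_{i=1}^{n} w_i = 1.$$

Rescaling all the weights by a positive constant leaves the induced preference ordering unchanged, so the normalization constraint removes one degree of freedom, leaving n-1 weights that can be chosen independently.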
