Align it
As the worst instance of this, the best way to understand a lot of AIS research in 2022 was “hang out at lunch in Constellation”.
Is this no longer the case? If so, what changed?
This is a reasonable point, but I have a cached belief that frozen food is substantially less healthy than non-frozen food somehow.
This is very interesting. My guess is that this would take a lot of time to set up, but if you have e.g. recommended catering providers in SFBA, I'd be very interested!
Completely fair request. I think I was a bit vague when I said "on top of or around AI systems."
The point here is that I want to find techniques that seem to positively influence model behavior, which I can "staple on" to existing models without a gargantuan engineering effort.
I am especially excited about these ideas if they seem scalable or architecture-agnostic.
Here are a few examples of the kind of research I'm excited about:
If someone were to discover a great regularizer for language models, OSNR turned out to work well (once I'd sorted out the issues around strategic redundancy), and externalized reasoning oversight turned out to be super promising, then we could just stack all three.
We could then clone the resulting model and run a debate, or stack on whatever other advances we've made by that point. I'm really excited about this sort of modularity, and I guess I'm also pretty optimistic about a few of these techniques having more bite than people may initially guess.
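To make the stacking concrete, here's a toy sketch of the kind of composition I have in mind. Every name here is hypothetical, and the prompt-level "externalized reasoning" wrapper is just a stand-in for the real technique:

```python
from typing import Callable

# A model here is just "prompt in, completion out" -- a deliberately
# minimal interface, so the stacked techniques stay architecture-agnostic.
Model = Callable[[str], str]

def with_externalized_reasoning(model: Model) -> Model:
    # Hypothetical: nudge the model to reason out loud, so a separate
    # overseer can audit the externalized chain of thought.
    def wrapped(prompt: str) -> str:
        return model(prompt + "\n\nReason step by step, out loud:")
    return wrapped

def with_output_filter(model: Model, is_ok: Callable[[str], bool]) -> Model:
    # Hypothetical: withhold completions that fail some independent check.
    def wrapped(prompt: str) -> str:
        completion = model(prompt)
        return completion if is_ok(completion) else "[withheld]"
    return wrapped

# Stacking is just function composition; each tool is stapled on without
# touching the underlying model:
# aligned = with_output_filter(with_externalized_reasoning(base_model), checker)
```

The design point is that each technique only sees the model's input/output interface, so tools developed independently compose for free.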
Recently my alignment orientation has basically been “design stackable tools on top of or around AI systems which produce pressure toward increased alignment.”
I think it’s a pretty productive avenue: 1) it’s harder to get lost in, and 2) it might eventually be sufficient for alignment.
neuron has
I was confused by the singular "neuron."
I think the point here is that if there are some neurons which have low activation but high direct logit attribution after layernorm, then this is pretty good evidence for "smuggling."
Is my understanding here basically correct?
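If it is, then the concrete check I'd run looks something like this sketch. The tensor names and thresholds are placeholders, and folding the final layernorm into a fixed scale vector is only a linear approximation:

```python
import torch

# Hypothetical shapes, not tied to any particular library:
#   acts     (n_neurons,)         MLP neuron activations at one position
#   W_out    (n_neurons, d_model) MLP output weights
#   ln_scale (d_model,)           final layernorm scale, treated as a
#                                 fixed vector (a linear approximation)
#   W_U      (d_model, d_vocab)   unembedding matrix

def neuron_dla(acts, W_out, ln_scale, W_U, tok: int):
    # Direction in the residual stream that increases the target logit.
    logit_dir = ln_scale * W_U[:, tok]          # (d_model,)
    # Each neuron's direct contribution to that logit.
    return acts * (W_out @ logit_dir)           # (n_neurons,)

def smuggling_candidates(acts, dla, act_thresh=0.1, dla_thresh=1.0):
    # The pattern in question: low activation but outsized direct logit
    # attribution. Thresholds are arbitrary placeholders.
    mask = (acts.abs() < act_thresh) & (dla.abs() > dla_thresh)
    return torch.nonzero(mask).squeeze(-1)
```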
Shallow comment:
How are you envisioning the prevention of strategic takeovers? It seems plausible that robustly preventing strategic takeovers would also require substantial strategizing/actualizing.
The first point isn’t super central. FWIW I do expect that humans will occasionally not swap words back.
Humans should just look at the noised plan and try to convert it into a more reasonable-seeming, executable plan.
Edit: that is, without intentionally changing details.
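For concreteness, here's a toy version of the word-swap noising step I have in mind (the function name and swap fraction are my own illustration):

```python
import random

def noise_plan(plan: str, swap_frac: float = 0.1, seed=None) -> str:
    # Noise a plan by swapping random pairs of words. The human overseer
    # then rewrites the result into a reasonable, executable plan --
    # occasionally failing to swap a pair back.
    words = plan.split()
    if not words:
        return plan
    rng = random.Random(seed)
    n_swaps = max(1, int(len(words) * swap_frac))
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return " ".join(words)
```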
What can be used to auth will be used to auth
One of the symptoms of our society's deep security inadequacy is the widespread usage of insecure forms of authentication.
It's bad enough that there are systems which, by spec, authenticate you using your birthday, SSN, or mother's maiden name.
Fooling bad authentication is also an incredibly common vector for social engineering.
Anything you have that others seem unlikely to have (but that you may not immediately see a reason to keep secret) could be accepted as "authentication" by someone you implicitly trust.
This includes: