All Posts


Tuesday, July 28th 2020

Frontpage Posts
Shortform
4 · avturchin · 12d

Some random ideas on how to make GPT-based AI safer.

1) Scaffolding: use a rule-based AI to check every solution provided by the GPT part. This could work for computation, self-driving, or robotics, but not against elaborate adversarial plots.

2) Many instances: run GPT several times and choose a random or the best answer (we are already doing this). Run several instances of GPT with different parameters or different training data and compare the answers, or run different prompts. The median output seems to be a Schelling point around the truth, and outlying answers are more likely to be wrong or malicious (see the sketch below the post).

3) Use intrinsic GPT properties to prevent malicious behaviour. For example, higher temperature increases the randomness of the output and interferes with any internal mesa-optimisers. Shorter prompts and the lack of long-term memory also prevent complex plotting.

4) Train and test on an ethical database.

5) Use prompts which include a notion of safety, like "A benevolent AI will say...", or counterfactuals which prevent complex planning in the real world (an AI on the Moon).

6) Black-box internal parts of the system, like the NN code.

7) Run it a million times in test environments or on test tasks.

8) Use another GPT AI to produce a "safety TL;DR" of any output, or a prediction of possible bad things that could follow from a given output.

Disclaimer: a safer AI is not provably safe. It is just orders of magnitude safer than an unsafe one, but it will eventually fail.
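A minimal sketch of idea 2 ("many instances"), assuming a generic text-completion API: sample several completions at varied temperatures and keep the most common answer, so that outlying (possibly wrong or malicious) outputs are discarded. The `query_model` function is a hypothetical placeholder, not something from the original post.

```python
# Sketch of the "many instances" idea: query the model several times and
# return the majority answer, treating it as a Schelling point around the truth.
from collections import Counter
import random


def query_model(prompt: str, temperature: float) -> str:
    """Hypothetical model call; replace with a real GPT API client."""
    raise NotImplementedError


def consensus_answer(prompt: str, n_samples: int = 9) -> str:
    # Vary the sampling temperature per run so the instances are not identical.
    answers = [
        query_model(prompt, temperature=random.uniform(0.3, 0.9))
        for _ in range(n_samples)
    ]
    # Outlying answers are discarded simply by keeping the most frequent one.
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common
```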
1 · ejenner · 12d

Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the largest relevance to AI safety, but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if, in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to "edit its source code", though probably only in a very limited way. I think it is an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.
1 · Bob Jacobs · 12d

The Law of the Minimization of Mystery [https://news.ycombinator.com/item?id=941387] by David J. Chalmers [https://en.wikipedia.org/wiki/David_Chalmers] is just a less precise version of the Law of Likelihood [https://en.wikipedia.org/wiki/Likelihood_principle#The_law_of_likelihood].