However, I am pretty pessimistic in general about reliable safeguards against superintelligence by any method, given how exceptionally hard it is to reason about how a system far smarter than me could evade my plans.
To use an imperfect analogy, I could defeat the narrowly superintelligent Stockfish at 'queen-odds chess', where Stockfish starts the game down a queen. Can't we think of interpretability and black-box safeguards as the extra pieces we can use to reliably win against a rogue superintelligence?