One of the cornerstone issues of AI alignment is the black box problem. Our machine learning models are inscrutable tables of floating point numbers, and we don't know how to decipher them to understand what is actually going on inside. This is bad. Everyone seems to agree that this is bad. And for some reason we keep making it worse.

I've heard about some efforts to increase AI transparency. To look inside the black box and figure out some of the gears inside of it. I think they are noble and important. And yet they are absolutely inadequate. It's incredibly hard to extract human-understandable insights from an ML model, while encoding an approximation of an existing algorithm into one is easy. Interpretability research in the current climate is not just going against the flow, it's an attempt to row a boat against a tsunami.

So maybe we should, first of all, stop decreasing AI transparency: stop making ever more complex, more general models, stop encrypting even somewhat-known algorithms into inscrutable float tables? This seems like an obvious thought to me. Let our AI tools consist of multiple ML models connected by inputs and outputs in actual human-readable code. Multiple black boxes with some known interactions between them are superior, transparency-wise, to one huge black box where nothing is known for sure. And yet we seem to keep moving in the opposite direction.
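To make this concrete, here is a minimal sketch of the kind of structure I mean (all names are hypothetical, not any real system's API): one learned model is treated as a black box, while the decision rule around it stays as ordinary code that a human can read and audit, instead of being distilled into the weights.

```python
from typing import Protocol, Sequence


class MoveScorer(Protocol):
    """Black box: a learned model that scores a candidate move in a position."""
    def score(self, position: str, move: str) -> float: ...


def legal_moves(position: str) -> Sequence[str]:
    """Ordinary, auditable game logic (stubbed here for illustration)."""
    return ["a", "b", "c"]


def choose_move(position: str, scorer: MoveScorer) -> str:
    """The decision rule stays as plain code: a human can read exactly how
    the learned scorer's outputs are turned into an action, instead of that
    logic being absorbed into an inscrutable table of floats."""
    candidates = legal_moves(position)
    return max(candidates, key=lambda move: scorer.score(position, move))
```

Only the scorer has to be trusted blindly here; everything that turns its outputs into an action is legible.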

More and more general models are developed. GATO can do multiple different tasks with exactly the same weights. When CICERO was released and it was revealed that it consists of a language model working in conjunction with a strategy engine, it was casually mentioned that we should expect a unified model in the future. GPT-4 can take images as input. I hope OpenAI is using a separate model to parse them, so it's a two-black-boxes scenario instead of one, but I wouldn't be surprised if that's not the case. Why are we taking this for granted? Why are we not opposing these tendencies?
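For the image case, a two-black-boxes version could look roughly like the sketch below (hypothetical interfaces, not OpenAI's actual architecture): a separate vision model turns the image into text, and that text, which we can log and read, is the only thing the language model ever sees.

```python
from typing import Protocol


class ImageParser(Protocol):
    """Black box #1: an opaque vision model that describes an image as text."""
    def describe(self, image_bytes: bytes) -> str: ...


class LanguageModel(Protocol):
    """Black box #2: an opaque language model that completes a text prompt."""
    def complete(self, prompt: str) -> str: ...


def answer_about_image(image_bytes: bytes, question: str,
                       parser: ImageParser, lm: LanguageModel) -> str:
    """The interface between the two black boxes is a plain string:
    we can log it and see exactly what one model told the other."""
    description = parser.describe(image_bytes)
    prompt = f"Image description: {description}\nQuestion: {question}\nAnswer:"
    return lm.complete(prompt)
```

Collapse both into a single multimodal model and even that narrow window of visibility is gone.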

I expect that if humanity, in our folly, creates a true AGI via blind search by gradient descent, it will more likely happen through a deliberate attempt to increase the generality of some model than through growing its capabilities in one single domain.

Recently, LessWrong-adjacent circles have started fighting more vocally against capability gains in AI. This is a hard fight that is going to cost us quite a lot of social capital, and yet it's necessary if we want to increase our chances of survival, or at least die with more dignity. But I do not notice any similar opposition to generality gains specifically, which, moreover, cost us whatever AI transparency we have.

This fight seems similarly important to me. But also cheaper. The end user may not even notice whether it's multiple black boxes interacting with each other or just one. The capability of the resulting AI product may even be the same. The number and variety of inputs a system of multiple smaller black boxes can handle can equal those of one huge black box. But at least we will know, to some degree, what is going on inside the system, and we will avoid the class of risks that arise specifically when a model generalises across multiple domains. And maybe, if we stop sacrificing transparency, whatever efforts we are currently making to gain more of it will eventually catch up and bear fruit for AI alignment.
