When I first looked at mesa-optimization, I immediately associated it with a familiar problem in business: individual humans may not be “mesa optimizers” in a strict sense, but they can act as optimizers, and they are expected to do so when a manager delegates (=base optimization)...
Some time ago I happened to read about the concept of training rationales described by Evan Hubinger, and I really liked it. In case you are not aware: training rationales are a bunch of questions that ML developers / ML teams should ask themselves in order to self-assess pros and cons...
AI definitely poses an existential risk, in the sense that it can generate models with the hidden (possibly undetectable?) intention of competing against humanity for resources. The more intelligent the model, the higher its chance of success! The thought of an AI takeover is so scary that I won’t even...
If you ask around about the typical ways to infer information, most people will answer: deduction, induction, and abduction. Of course, there are more ways than that, but there is no unified approach to classifying them. I want to challenge that. The reason why I am unhappy with the...
In an artificial being, all the following:

* Consciousness
* Emotionality
* Intelligence
* Personality
* Creativity
* Volition

are distinct properties that, in theory, may be activated independently. Let me explain.

What I Hope to Achieve

By publishing this post, I hope to develop and standardise some useful terms...
I have recently been reading about a technique that can be used to partially control the behaviour of Large Language Models: it exploits control vectors[1] to alter the activation patterns of the LLMs and trigger some desired behaviour. While the technique does not provide guarantees, it gives high...