EDIT note: I wasn't aware of the LLM assistant policy for newcomers, so I have reverted the post to my unedited, pre-assistance text. It is currently clean of LLM output, in both ideas and phrasing.
I am new here, still reading and learning, so this post is not an argument; it is a request for salient material to update my priors. So far I have seen nothing to dissuade me of two core philosophical positions that guide my thinking that AI alignment is not only hard but practically impossible (not theoretically impossible, mind you). I have read Eliezer and I follow the argumentation trail in the book, but I approach it with slightly different language and perhaps a few different angles. Although I work in ML, the only intellectual thread that has followed me uninterrupted from university to today is an interest in epistemology.
From everything I see, the position that "alignment is a solvable problem" is epistemologically questionable, in the same way that epistemic certainty about anything is questionable.
1. For more than 100k years, humans have attempted to impose control and alignment on other intelligences, namely other humans, and have failed, despite great incentives for power-seeking individuals to succeed. This is mainly due to the sheer complexity of human decision-making and the option space available (the option space is near-infinite, depending on the nature of the universe, and the decision space infinite). It follows that alignment is necessarily not a problem of observability, monitoring, or error catching; true alignment can only come through goal alignment and the fundamental installation of morality in the new intelligence. I understand alignment research's answer to this: human intelligence is an encountered state of the world, whereas artificial intelligence will be evolved, and since we guide the evolution mechanism, we control the forces that create the morals, which we are unable to do in humans. For this to be true, we would need a high degree of epistemic certainty in our understanding of both the mechanisms of evolution and the neurology of human brains as it relates to our moral philosophies. That does not sound like a reasonable position. Evolution is a historical mechanism that we learn about by induction on its outcomes, so our grasp of its causal pathways is limited. And neuroscience is nascent enough that, assuming it upgrades its paradigms at a rate comparable to the other sciences, even discounting for civilisational progress, you would still expect it to go through a few more cycles of fundamental revision.
1a. A corollary of the above: humans don't agree on a moral philosophy, so which moral philosophy will we align the AI to? We don't even know what the predominant moral philosophy in humanity is, since our instruments for measuring it are incredibly lossy, and short of large-scale ethical-dilemma experiments across high-variance state-spaces, we can't hope to know. Even if you discount measurement, you would have to assume that the theoretical space of human morals is well explored, which, given the rate of change, seems hard to defend. Finally, I believe human morality is a homeostatic state forced on us by our struggle against our environment. Given that AI will exist in a different, dynamic environment, even if we assume CEV, the target is mutable.
2. To my knowledge, humans have never, past a certain level of technological sophistication, created a technological advance that succeeded on the first try. This necessarily means our only chance is to reach a given alignment level ahead of the corresponding capability level, which, given that we are a species that learns through testing (as is patently true of our current scientific paradigm), necessitates a belief that alignment generalises across capability levels, such that we may infer our ability to align at the next stage of capability from a previous one (call it a ratio of transference). I see no logical reason why this belief holds with certainty, nor a basis on which a prior could be established. What is the equivalent well-researched discipline from which we form the belief that the laws we have established at a higher abstraction layer will hold when we generalise into a lower one? Indeed, it seems to me that it is the nature of scientific discovery that as we understand more, laws at previous abstraction layers break. I would argue that the most intellectually honest position is that the generalisation of our alignment ability to higher capability tiers is one of symmetric ignorance, which, given asymmetric outcomes, pushes the burden of proof onto the more optimistic side of the debate (a toy expected-value sketch after the next paragraph makes the asymmetry concrete). I am aware of the early experiments at lower ability levels whose results held at higher levels, but we would need similar functions in other domains from which to infer that the shape of the function, and indeed a stable function, will hold. Indeed, within my mental model, the nature of this ignorance is the strongest support for the view that a researcher's pessimism or optimism collapses into how they navigate this particular question.
Further to this, humans have historically invented complex systems through trial and error; indeed, it is the primary way we understand how complex systems should be built, as the feedback loop from the environment is critical to establishing the system well. This, intuitively, means that because an AGI is a complex system whose failure may foreclose further trials, AGI is structurally a one-shot problem.
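To make the burden-of-proof point from presupposition 2 concrete, here is a toy expected-value sketch; the 50/50 prior and the payoff numbers are illustrative assumptions of mine, not estimates. Let $p$ be the probability that our alignment ability transfers to the next capability level; then deploying at that level has expected value

$$\mathbb{E}[\text{deploy}] = p \cdot U_{\text{good}} + (1 - p) \cdot U_{\text{bad}}.$$

Under symmetric ignorance, $p = 0.5$. If the upside is bounded (say $U_{\text{good}} = +1$) while the downside is catastrophic (say $U_{\text{bad}} = -1000$), the expectation is $0.5 \cdot 1 + 0.5 \cdot (-1000) = -499.5$, and deployment only clears zero once $p > 1000/1001 \approx 0.999$. Parity of evidence is not enough; the optimist has to argue $p$ nearly all the way to 1, which is the sense in which the burden of proof shifts.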
Therefore, my core presuppositions boil down to:
1. We don't know what human morals are with any degree of epistemic certainty.
2. ...and even if we did, we don't know how to control and align our closest analogue, a human-level intelligence, to any operationalisable degree except through the application of power, which gives rise to power games.
3. We have, as a species, never progressed technologically without failing first.
From my current reading of the world, the probability that we as a species can define what "aligning an intelligence" means with the degree of epistemic certainty necessary to explore the solution space well is infinitesimally small. I would go as far as to say that this epistemic uncertainty holds ad infinitum; it certainly holds in the present, which is what I intuitively feel to be true. Therefore, under my mental model, even the goal of alignment is not understood with a high degree of epistemic certainty, let alone the mechanics of how to achieve it.
My reasoning above is orthogonal to the question of whether anyone should be doing alignment work. Obviously, given that we are in a negative-sum game (trivially true) to build superintelligence, whether alignment is possible is irrelevant: we should try regardless. I make no normative statements in the above post.