Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran's group in the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.
E-mail: {first name}@alter.org.il
The problem is, even if the argument that wireheading doesn't happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying "your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won't happen, good news!" But it's not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an "ingredient in a plan" to get the rocket to Mars.
I also don't think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a "trap" (and there can be many additional complications).
What do you mean "randomly come upon A"? RL is not random. Why wouldn't it find A?
Let the proxy reward function we use to train the AI be and the "true" reward function that we intend the AI to follow be . Supposedly, these function agree on some domain but catastrophically go apart outside of it. Then, if all the training data lies inside , which reward function is selected depends on the algorithm's inductive bias (and possibly also on luck). The "cancellation" hope is then that inductive bias favors over .
But why would that be the case? Realistically, the inductive bias is something like "simplicity". And human preferences are very complex. On the other hand, something like "the reward is such-and-such bits in the input" is very simple. So instead of cancelling out, the problem is only aggravated.
And that's under the assumption that and actually agree on , which is in itself wildly optimistic.
In the post Richard Ngo talks about delineating "alignment research" vs. "capability research", i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:
I think that Ngo raises an important question, and his answers are pointing in the right way. On my part, I would like to slightly reframe his parameters and add two more axes:
Of course, on most of those axes, going "left" is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual use. However, approaches that lean "right" are often sufficient to advance capabilities and are unlikely to be sufficient to solve alignment, making them clearly the worse option overall.
In earlier times, I would also add here the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it's better to start developing our understanding of agents from parts that don't immediately allow building agents. Today, this is still true to some extent, however the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of not parallelizing this research as much as possible.
Finally, I state a relatively minor quibble: Ngo seems to put a lot of the emphasis here on understanding deep learning. I would not go so far, for two reasons: one is the two-body desideratum I mentioned before, but the other is that deep learning might not be The Way. It's possible that it's better to find a different path towards AI altogether, one designed on better understanding from the start. This might seem overly ambitious, but I do have some leads.
There are certainly examples of research which is at least trying to be robust, while still failing to be very precise (e.g. some of Paul Christiano's work falls in this category). Such research can be a good starting point for investigation, but should become precise at some stage for it to truly produce robust solutions.
I am separately worried about "Carefully Controlled Moderate Superintelligences that we're running at scale, each instance of which is not threatening, but, we're running a lot of them...
I think that this particular distinction is not the critical one. What constitutes an "instance" is somewhat fuzzy. (A single reasoning thread? A system with a particular human/corporate owner? A particular source code? A particular utility function?) I think it's more useful to think in terms of machine intelligence suprasystems with strong internal coordination capabilities. That is, if we're somehow confident that the "instances" can't or won't coordinate either causally or acausally, then they are arguably truly "instances", but the more they can coordinate the more we should be thinking of them in the aggregate. (Hence, the most cautious risk estimate comes from comparing the sum total of all machine intelligence against the sum total of all human intelligence[1].)
More precisely, not even the sum total of all human intelligence, but the fraction of human intelligence that humans can effectively coordinate. See also comment by Nisan.
There seem to be two underlying motivations here, which are best kept separate.
One motivation is having a good vocabulary to talk about fine-grained distinctions. I'm on board with this one. We might want to distinguish e.g.:
But then, first, it is clear that existing AI is not superintelligence according to any of the above interpretations. Second, I see no reason not to use catchy words like "hyperintelligence", per One's suggestion. (Although I agree that there is an advantage to choosing more descriptive terms.)
Another motivation is staying ahead of the hype cycles and epistemic warfare on twitter or whatnot. This one I take issue with.
I don't have an account on twitter, and I hope that I never will have. Twisting ourselves into pretzels with ridiculous words like "AIdon'tkilleveryoneism" is incompatible with creating a vocabulary optimized for actually thinking and having productive discussions among people who are trying to be the adults in the room. Let the twitterites use whatever anti-language they want. The people trying to do beneficial politics there: I sincerely wish you luck, but I'm laboring in a different trench, and let's use the proper tool for each task separately.
I understand that there can be practical difficulties such as, what if LW ends up using a language so different from the outside world that it will become inaccessible to outsiders, even when those outsiders would otherwise make valuable contributions. There are probably some tradeoffs that are reasonable to make with such considerations in mind. But let's at least not abandon any linguistic position at the slightest threatening gesture of the enemy.
This post is an overview of Steven Byrnes' AI alignment research programme, which I think is interesting and potentially very useful.
In a nutshell, Byrnes' goal is to reverse engineer the human utility function, or at least some of its central features. I don't think this will succeed in the sense of, we'll find an explicit representation that can be hard-coded into AI. However, I believe that this kind of research is useful for two main reasons:
I hope that in the future this programme makes more direct contact with the mathematical formalism of agent theory, of the sort the LTA is constructing. However, I realize that this is a difficult challenge.
Why are we giving up on plain "superintelligence" so quickly? According to Wikipedia:
A superintelligence is a hypothetical agent that possesses intelligence surpassing that of the most gifted human minds. Philosopher Nick Bostrom defines superintelligence as "any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest".
According to Google AI Overview:
Superintelligence (or Artificial Superintelligence - ASI) is a hypothetical AI that vastly surpasses human intellect in virtually all cognitive domains, possessing superior scientific creativity, general wisdom, and social skills, operating at speeds and capacities far beyond human capability, and potentially leading to profound societal transformation or existential risks if not safely aligned with human goals.
I don't think I saw anyone use "superintelligence" as "better than a majority of humans on some specific tasks" before very recently. (Was DeepBlue a superintelligence? Is a calculator superintelligence?)
This is a deeply confused post.
In this post, Turner sets out to debunk what he perceives as "fundamentally confused ideas" which are common in the AI alignment field. I strongly disagree with his claims.
In section 1, Turner quotes a passage from "Superintelligence", in which Bostrom talks about the problem of wireheading. Turner declares this to be "nonsense" since, according to Turner, RL systems don't seek to maximize a reward.
First, Bostrom (AFAICT) is describing a system which (i) learns online (ii) maximizes long-term consequences. There are good reasons to focus on such a system: these are properties that are desirable in an AI defense system, if the system is aligned. Now, the LLM+RLHF paradigm which Turners puts in the center is, at least superficially, not like that. However, this is no argument against Bostrom: today's systems already went beyond LLM+RLHF (introducing RL over chain-of-thought) and tomorrow's systems are likely to be even more different. And, if a given AI design does not somehow acquire properties i+ii even indirectly (e.g. via in-context learning), then it's not clear how would it be useful for creating a defense system.
Second, Turner might argue that even granted i+ii, the AI would still not maximize reward because the properties of deep learning would cause it to converge to some different, reward-suboptimal, model. While this is often true, it is hardly an argument why not to worry.
While deep learning is not known to guarantee convergence to the reward-optimal policy (we don't know how to prove almost any guarantees about deep learning), RL algorithms are certainly designed with reward maximization in mind. If your AI is unaligned even under best-case assumptions about learning convergence, it seems very unlikely that deviating from these assumptions would somehow cause it to be aligned (while remaining highly capable). To argue otherwise is akin to hoping for the rocket to reach the moon because our equations of orbital mechanics don't account for some errors, rather than despite of it.
After this argument, Turner adds that "as a point of further fact, RL approaches constitute humanity's current best tools for aligning AI systems today". This observation seems completely irrelevant. It was indeed expected that RL would be useful in the subhuman regime, when the system cannot fail catastrophically simply because it lacks the capabilities. (Even when it convinces some vulnerable person to commit suicide, OpenAI's legal department can handle it.) I would expect it to be obvious to Bostrom even back then, and doesn't invalidate his conclusions in the slightest.
In section 3, Turner proceeds to attack the so-called "counting argument" for misalignment. The counting argument goes, since there are much more misaligned minds/goals than aligned minds/goals, even conditional on "good" behavior in training, it seems unlikely that current methods will produce an aligned mind. Turner (quoting Belrose and Pope) counters this argument by way of analogy. Deep learning successfully generalizes even though most models that perform well on the training data don't perform well on the test data. Hence, (they argue) the counting argument must be fallacious.
The major error that Turner, Belrose and Pope are making is that of confusing aleatoric and epistemic uncertainty. There is also a minor error of being careless about what measure the counting is performed over.
If we did not know anything about some algorithm except that it performs well on the training data, we would indeed have at most a weak expectation of it performing well on the test data. However, deep learning is far from random in this regard: it was selected by decades of research to be that sort of algorithm that does generalize well. Hence, the counting argument in this case gives us a perfectly reasonable prior.
(The minor point is that, w.r.t to a simplicity prior, even a random algorithm has some bounded-from-below probability of generalizing well.)
The counting argument is not premised on deep understanding of how deep learning works (which at present doesn't exist), but on a reasonable prior about what should we expect from our vantage point of ignorance. It describes our epistemic uncertainty, not the aleatoric uncertainty of deep learning. We can imagine that, if we knew how deep learning really works in the context of typical LLM training data etc, we would be able to confidently conclude that, say, RLHF has a high probability to eventually produce agents that primarily want to build astronomical superstructures in the shape of English letters, or whatnot. (It is ofc also possible we would conclude that LLM+RLHF will never produce anything powerful enough to be dangerous or useful-as-defense-sytem.) That would not be inconsistent with the counting argument as applied from our current state of knowledge.
The real question is then, conditional on our knowledge that deep learning often generalizes well, how confident are we that it will generalize aligned behavior from training to deployment, when scaled up to highly capable systems. Unfortunately, I don't think this update is strong enough to make us remotely safe. The fact deep learning generalizes implies that it implements some form of Occam's razor, but Occam's razor doesn't strongly select for alignment, as far as we can tell. Our current (more or less) best model of Occam's razor is Solomonoff induction, which Turner dismisses as irrelevant to neural networks: but here again, the fact that our understanding is flawed just pushes us back towards the counting-argument-prior, not towards safety.
Also, we should keep in mind that deep learning doesn't always generalize well empirically, it's just that when it fails we add more data until it starts generalizing. But, if the failure is "kill all humans", there is nobody left to add more data.
Turner's conclusion is "it becomes far easier to just use the AIs as tools which do things we ask". The extent to which I agree with this depends on the interpretation of the vague term "tools". Certainly modern AI is a tool that does approximately what we ask (even though when using AI for math, I'm already often annoyed at its attempts to cheat and hide the flaws of its arguments). However, I don't think we know how to safety create "tools" that are powerful enough to e.g. nearly-autonomously do alignment research or otherwise make substantial steps toward building an AI defense systems.
This post contains an interesting mathematical result: that the machinery of natural latents can be transferred from classical information theory to algorithmic information theory. I find it intriguing for multiple reasons:
The main thing this post is missing is any rigorous examples or existence proofs of these AIT natural latents. I'm guessing that the following construction should work:
It would be nice to see something like that in the post.
These ideas seem conceptually close to concepts like sophistication in algorithmic statistics, and the connection might be worth investigating.
Now, about the stated motivation: the OP claims that natural latents capture how "reasonable" agents choose to define categories about the world. The argument seems somewhat compelling, although some further justification is required for the claim that
If you’ve been wondering why on Earth we would ever expect to find such simple structures in the complicated real world, conditioning on background knowledge is the main answer.
That said, I think that real-world categorizations are also somewhat value-laden: depending on the agent's preferences, and on the laws of the universe in which they find themselves, there might be particular features they care about much more than other features. (Since they are more decision-relevant.) The importance of these features will likely influence which categories are useful to define. This fact cannot be captured in a formalism on the level of abstraction in this post. (Although maybe we can get some of the way there by drawing on rate-distortion theory?)
Still unpublished.
A metacognitive agent is not really an RL algorithm is in the usual sense. To first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as "traps", since they hinge of the non-asymptotic properties of generalization, i.e. on the "prior" or "inductive bias". Therefore frameworks that explicitly impose a prior (such as a metacognitive agents) are useful for understanding and avoiding these failure modes. (But, this has little to do with the OP, the way I see it.)