Timothy Lillicrap, a staff research scientist at DeepMind and (among other things) a member of the team that trained AI agents to play games like Go at a superhuman level, recently gave a talk at Harvard's ML Foundations seminar. The talk was not a showcase of new research results, as many of these talks are; instead, it was an attempt to grapple with the recent and stunning successes of large language models like ChatGPT. Tim admitted that his AGI timelines have become shorter, that he is worried, and that he is increasingly prioritizing studying AI safety. I found this admission by a mainstream AI researcher at the forefront of progress striking, and was also struck by what felt like tacit agreement (or at least, an absence of any dismissal of the worries as totally unreasonable) from the room, which held a large number of impressive academic AI researchers.
Tim is not the only one sounding the alarm, of course. There is the well-publicized request for 'pausing' the training of large models, and the eminent Geoff Hinton has publicly turned to doomerism. But the fact that AGI and AI risk were normal topics at a regularly scheduled Harvard seminar, rather than in letters or op-eds, speaks to a sea change.
Part of this change in sentiment is because 2022 was a banner year for spectacular and public-facing AI progress, between the release of ChatGPT and the release of powerful diffusion generative models like DALL·E 2 and Stable Diffusion. Given that GPT-4 is powerful enough to pass many of the tests previously thought to be challenging for AI (the SAT, the GRE, AP exams, law exams, medical exams, and even a quantum computing exam), researchers have begun to ask and debate other questions in earnest. Do large language models really have a world model, or are they just stochastic parrots? Could large language models be conscious? Moreover, these questions are not being contemplated on the fringe; philosopher David Chalmers talked about the consciousness question during a keynote talk at NeurIPS, one of the largest and most prestigious AI conferences.
The problem of AI safety and the potential consequences of AGI are no longer niche topics discussed on sites like LessWrong, but immediate and visceral issues. From academia to industry, there is increasing recognition that these problems are problems, and that they should be taken seriously. "Overpopulation on Mars" may happen sooner than expected.
Or so the progress and hype suggest. Is AGI really on the horizon? Why or why not? In this essay, we will attempt to carefully examine the following question posed by Open Philanthropy as part of their AI Worldviews essay contest:
What is the probability that AGI is developed by January 1, 2043?
Any attempt to answer this question requires addressing the following sub-questions:
We will focus on addressing these questions one at a time. Other interesting and important questions, like those related to the risks posed by AGI, are considered out of scope.
As with porn, we may like to think that we know AGI when we see it. But recent experience has suggested we are prone to get into annoying debates about whether a model has 'really' achieved something, or merely acts 'as if' it has achieved that thing. Both for the purpose of fostering more productive discussions, and sharpening our intuitions about what we are looking for, it is useful to clarify what we mean by general intelligence.
Psychologists speak of different kinds of intelligence, like social intelligence, emotional intelligence, or IQ. More generally, we use the term intelligence to refer to some kind of facility for a particular task or class of tasks. Not all tasks are associated with some kind of intelligence: we do not say that a rock has high 'being a rock' intelligence, for example. Certain tasks are distinguished as corresponding to intelligence, like playing strategic games, abstract and symbolic reasoning, recognizing objects, multi-party communication, navigation, memory-related tasks, and creatively generating images or text. Many of these tasks fall under the umbrella of rational decision-making under uncertainty. (That is, how do you behave in a way that is rewarded by your environment when your knowledge of the environment is highly incomplete?) 'General' intelligence, then, refers to an agent's aggregate performance on a suite of such tasks. Agents with general intelligence can perform many different kinds of tasks well.
It is important to stress that we will be defining general intelligence as performance-based. If an agent behaves as if it is intelligent across a broad range of interesting tasks, we will say that it has achieved general intelligence without additional qualifications. Whether the agent 'really' understands what it is doing, however we choose to define that, is irrelevant. (Although there is a case to be made that, for a judiciously chosen set of tasks, some form of general intelligence is necessary to do all of them well.)
Although our benchmark for general intelligence is humans, and biological intelligence more generally, AGI may look quite different from human intelligence. For one, neuroscientists like Tony Zador argue that much of animal behavior is shaped by hundreds of millions of years of evolution, and an associated 'genomic bottleneck' for information encoding, rather than learning during an animal's lifetime. It is thought that tasks like object recognition and maneuvering gracefully in three-dimensional space are relatively easy for humans not because they are intrinsically easy, but because evolution had strong incentives to optimize our ability to perform these tasks. (Recognizing your friends and foes, and drawing closer or running as necessary, is important for survival.) Even if researchers training models iterate through architectures or hyperparameters in a way that is somewhat analogous to natural selection, AGI will not benefit from the same huge amount of evolutionary optimization enjoyed by biological intelligence.
The substrate for AGI, rate-based neural networks simulated on silicon, is also quite different from the substrate of biological intelligence: living cells and (spiking) neural networks. Although the rate-based neural networks used by modern AI models can be justified as related to spiking models, at least under certain approximations, the relationship between the two is more subtle than is usually appreciated (for an account of the relationship through the lens of 'latent population-level factors', see DePasquale et al.). There are also many other biological features whose computational importance is not totally understood: substructure to neurons like dendritic trees, non-neuronal cell types like glia, intracellular circuits involving RNA and proteins, and so on. Arguably, these differences are unimportant, since we care about performance rather than about how a given level of performance is achieved.
The input data of a putative AGI, and the kinds of outputs it supports, may also be quite different. While humans and other animals receive a rich stream of multimodal sensory data (sights, sounds, smells, physical touches, etc.; some birds, like pigeons, can even sense magnetic fields), a powerful future large language model may support only text inputs and outputs. A lack of support for some particular input or output modality, or physical embodiment, should not disqualify an otherwise capable system—although in this case, the AGI label should be qualified as being with respect to a certain kind of reasoning. Consider the fact that although nearly all humans can be viewed as possessing general intelligence, many humans have sensory impairments that reduce performance on certain tasks: for example, formerly blind people with restored vision may have trouble connecting the visual appearance of objects with how those objects feel, even if they are otherwise cognitively normal.
Along similar lines, the way AGI may be trained, and the training data used, will most likely not resemble animal learning. Modern systems use variants of the backpropagation algorithm (that is, gradient descent with respect to some objective function), which is thought to be quite different from how real neural circuits learn. The training of language models like GPT-4 and diffusion generative models like Stable Diffusion involves using large volumes of data scraped from the internet, plus human feedback and fine-tuning; meanwhile, animals learn from evolution and moment-to-moment experiences.
There are plenty of other differences, including energy consumption (human bodies and brains are relatively energy-efficient in the context of both training and performance) and self-directedness. While animals act in the world even without prompting (and while even behavior in the absence of any interesting sensory stimuli has interesting structure), we may not care that an artificial agent does not do anything unless prompted, and only does its assigned task. As long as it performs well when it is asked to perform, we ought to consider it intelligent.
Biology comparisons aside, we also do not expect AGI to be perfect. Some problems are simply hard (in the computational complexity sense), and even highly advanced systems are subject to these bounds. We can refer to a system as having achieved general intelligence without it being able to prove the Riemann hypothesis, or solve comparably difficult open mathematical problems, quickly and efficiently.
Having made some points about what we are not looking for, let us try to discuss the characteristics of what we are looking for, and how we might measure whether those characteristics have been achieved. To do this, it is helpful to start by considering simple forms of intelligence, and then considering increasingly more complex forms.
If intelligence is about facility with tasks, the simplest form of intelligence enables an agent to perform a single, simple task well. The paradigmatic example of this is a model trained to classify images (e.g. associating MNIST images with the digit depicted). We will refer to well-defined and scope-limited tasks like recognizing digits as primitive tasks. We acknowledge that it is difficult to draw the line between tasks which are primitive and tasks which are not (although something like 'living a successful life' appears not to be primitive), but do not have enough space to be more rigorous here.
Success at a single primitive task marks our lowest level of intelligence:
Level 1 (Task-specific) intelligence. System can perform at least one primitive task (e.g. object recognition, playing Go) well. Many common and basic learning systems are here.
Measuring. A standard measure, e.g. percent of test set images classified correctly, or cumulative score within a fixed amount of time for Atari games.
Better than being able to perform one task well is being able to perform multiple tasks well. This is something that even the simplest forms of life can do; even single-celled organisms are capable of intelligent decision-making within their ecological niche for the purpose of surviving and reproducing. More complicated artificial agents can also do this.
(Technical aside: the same model should be able to perform multiple tasks well. A model like AlphaZero can play chess, shogi, and Go, but not all at the same time. On the other hand, this is easy to circumvent by gluing each of the three models together, and using a flag that recognizes which game is being played, so the distinction may not be that important in practice.)
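The "gluing" idea in the aside above can be sketched in a few lines. This is only an illustration under stated assumptions: `glue_policies` and the toy policies are hypothetical names, each policy is reduced to a function from a state string to a move string, and the flag identifying the game is passed in explicitly rather than recognized from the input.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a trained single-game model: maps a state to a move.
Policy = Callable[[str], str]


def glue_policies(policies: Dict[str, Policy]) -> Callable[[str, str], str]:
    """Combine several single-game policies into one callable keyed by a game flag."""

    def combined(game: str, state: str) -> str:
        if game not in policies:
            raise KeyError(f"no policy available for game: {game}")
        # Dispatch to the specialist policy for the flagged game.
        return policies[game](state)

    return combined


# Toy policies that always return a fixed opening move (purely illustrative).
combined = glue_policies({
    "chess": lambda state: "e4",
    "go": lambda state: "D4",
})
print(combined("chess", "start"))
```

The combined system performs every task its parts can perform, but shares nothing between them, which is why the aside treats the distinction as a technicality rather than a step up in intelligence.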
Define a task class as a collection of primitive tasks. The next level of intelligence is marked by success at all tasks within some class:
Level 2 (Task-class-specific) intelligence. System can perform all primitive tasks within a task class well. Simple organisms like bacteria (which know how to move towards food, reproduce, etc., but have a limited ability to learn new things) are probably here.
Measuring. Appropriate combination* of task scores.
What we mean by 'appropriate combination' requires some unpacking, and will be explored more later.
An important warning regarding our above (qualitative) definition is that performing one task well may allow a system to perform others well; an artificial agent that can play Atari games at a high level may also be fairly good at object recognition if it is forced to use screen pixels as input, for example. Hence, the line between Level 1 and Level 2 is somewhat blurry. It may be useful to distinguish between a primitive task and a compound task, which we can define as a collection of related primitive tasks.
Beyond agents that can perform well within a prespecified task class are those that are more task-agnostic. By task-agnostic, we do not mean that agents can perform any task well, since some tasks are impossible. Roughly speaking, we mean that agents can achieve good performance on an extremely broad range of interesting-but-not-impossible tasks. Something like MuZero may be a good example of such an agent, since via self-play it can teach itself to play an extremely broad range of games well. (Although it must 'start from scratch' when presented with a completely new game.)
At this level, agents are not required to be flexible, in the sense that learning a new task (or even a slight variant of a known task) can involve a large amount of retraining. Agents are also not expected to be able to perform extremely complex tasks, e.g. writing a good novel.
Level 3 (Task-agnostic but inflexible) intelligence. System can perform a large variety of primitive and compound tasks, even novel tasks, although it may require a substantial amount of training or retraining before achieving good performance. Sufficiently complex tasks may still be out of reach.
Measuring. Appropriate combination* of task scores. Testing the performance of agents on many novel tasks is particularly important here. Since the task class is not fixed, we must sample from some distribution of novel tasks during testing.
Although language is not technically mandatory for task-agnostic agents, it is extremely useful, since language allows many tasks to be presented and reasoned about in a common format.
One nontrivial step beyond task-agnostic but inflexible agents brings us to agents that possess some degree of flexibility, in the sense that they can achieve good performance on novel tasks with a minimal amount of retraining. This takes us one step closer to general intelligence, and might be called "simple general" intelligence:
Level 4 (Task-agnostic and flexible; "Simple general") intelligence. System can perform a large variety of primitive and compound tasks. Many novel tasks can be learned in a zero- or few-shot fashion, and task perturbations generally do not significantly affect performance. Sufficiently complex tasks may still be out of reach.
Measuring. Appropriate combination* of task scores. Testing the few-shot performance of agents on novel tasks is particularly important here. Since the task class is not fixed, we must sample from some distribution of novel tasks during testing.
Arguably, the latest version of ChatGPT has already achieved Level 4 intelligence, since it can perform many tasks (e.g. passing a quantum computing exam) it was never explicitly trained to do, with no change to its parameters, and usually in a short amount of time. There remain substantial disagreements about the capabilities of ChatGPT, however.
What's missing? The main performance-related difference between state-of-the-art models and humans appears to be in the kinds of tasks each can do well. Humans can write beautiful novels, and prove mathematical theorems that relate concepts from disparate domains in an unexpected way, and reason effectively about possible futures and counterfactuals.
It is thought that some of the most impressive achievements of humans relate to their ability to construct useful and coherent world models, and to efficiently use those models to reason about particular problems, even over long time horizons. But since all we can measure is performance, we will define the highest level of intelligence we consider—"true general" intelligence, if one likes—as achieving good (few-shot) performance across many sufficiently hard novel tasks:
Level 5 (Task-agnostic, flexible, and complex; "True general") intelligence. System can perform a large variety of primitive and compound tasks. Many novel tasks can be learned in a zero- or few-shot fashion, and task perturbations generally do not significantly affect performance. Even complex tasks, e.g. those that are thought to involve the construction of complicated plans, like writing a novel or many-part symphony, can in principle be performed in a few-shot fashion.
Measuring. Appropriate combination* of task scores. Testing the few-shot performance of agents on novel, challenging tasks is particularly important here. Since the task class is not fixed, we must sample from some distribution of novel, challenging tasks during testing.
The idea is that hard enough tasks will require building world models, and being able to assess causality, and having long-term memory, and so on. Humans do not have these things by accident; evolutionary optimization has endowed us with them because they are extremely useful for navigating a complex and uncertain world.
How might we specifically define an "AGI score" that can be assigned to individual agents? Preferably, humans would tend to score well on this index, and current large language models would tend to score somewhat poorly (or else we must accept that AGI is already here, which seems false). Given that intelligence reflects performance on some suite of tasks, in principle we can obtain such a score by combining performance scores from some large set of tasks.
This leads to at least three problems:
(1) How do we weight these various performance scores relative to one another?
(2) What about tasks for which there is no obvious performance score?
(3) How might one prevent 'gaming the metric'?
The weighting problem. Should each performance score contribute equally to the overall score? Also, are very high scores on some subset of all tasks sufficient, or is it important that the agent does at least acceptably well on all tasks?
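To make the two questions above concrete, here is a minimal sketch of one possible aggregation rule. It assumes per-task scores have already been normalized to the range 0 to 1; the function name `aggregate_score`, the task names, and the `floor` parameter are all hypothetical choices for illustration, not a proposal from the essay.

```python
from typing import Dict


def aggregate_score(
    scores: Dict[str, float],
    weights: Dict[str, float],
    floor: float = 0.0,
) -> float:
    """Weighted mean of per-task scores, gated by a minimum acceptable score."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same tasks")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Gate: an agent that falls below the floor on any task gets no credit,
    # which encodes "must do at least acceptably well on all tasks".
    if min(scores.values()) < floor:
        return 0.0
    return sum(weights[task] * scores[task] for task in scores) / total_weight


scores = {"go": 1.0, "navigation": 0.5, "theory_of_mind": 0.0}
equal_weights = {task: 1.0 for task in scores}

print(aggregate_score(scores, equal_weights))              # plain average
print(aggregate_score(scores, equal_weights, floor=0.25))  # gated by the floor
```

With `floor=0.0` the rule rewards very high scores on a subset of tasks; raising the floor shifts it toward requiring acceptable performance everywhere. Which setting is appropriate is exactly the open question.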
One approach is to make arbitrary choices, e.g. weight tasks according to some difficulty measure and choose some kind of average. Another approach is to learn a performance score weighting from human feedback. In the training of large language models, one also confronts a problem of it being unclear a priori how to quantify the 'goodness' of a response; there, at least in one step of a complicated procedure, human raters ranked their preference for different possible responses. These rankings were then used to train a neural network to learn a reward function. In a similar fashion, one could solicit feedback from a large number of human raters (either normal people or 'experts' in neuroscience and AI) and use this feedback to determine a coherent weighting. (In other words: is it more impressive to get a good score in Montezuma's Revenge, or Pitfall? How does good performance on theory-of-mind tests compare to good performance on navigation tasks?)
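As a rough illustration of turning rater judgments into a weighting, the sketch below fits a Bradley-Terry model to pairwise "which is more impressive?" votes. This is a swapped-in, much simpler technique than the neural reward model described above, and the function name, task labels, and vote counts are all invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fit_bradley_terry(
    comparisons: List[Tuple[str, str]], iterations: int = 100
) -> Dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard iterative update p_i <- wins_i / sum_j n_ij / (p_i + p_j)
    and renormalizes so the strengths sum to 1.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    items = sorted({item for pair in comparisons for item in pair})
    strengths = {item: 1.0 / len(items) for item in items}

    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                # Only pairs that were actually compared contribute.
                if j != i and n_ij > 0:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom > 0 else 0.0
        total = sum(updated.values())
        strengths = {item: value / total for item, value in updated.items()}
    return strengths


# Invented votes: three raters find Montezuma's Revenge more impressive, one Pitfall.
votes = [("montezuma", "pitfall")] * 3 + [("pitfall", "montezuma")]
weights = fit_bradley_terry(votes)
print(weights)
```

The fitted strengths could then serve as task weights in an overall score. Note that a task that never wins a comparison receives zero weight under this simple fit, so real use would need some smoothing.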
What if there is no obvious performance score? Again, utilizing human preferences is one way to solve the problem, at least in principle. Even though it is hard to judge the 'goodness' of novels, people can usually distinguish 'good' writing from 'bad' writing, and this can be exploited in order to build a reasonable performance metric.
Preventing metric-gaming. While it may not be totally possible to avoid this, one way is to incorporate some amount of randomness into the assessment of intelligence. As described above, instead of using fixed tasks, one can sample from a space of possible tasks. Ideally, neither the model nor the model's designers have encountered the exact problem it is being tested on before. On the other hand, some problems (e.g. write a great American novel) may be so hard that no added randomness is necessary.
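A toy version of sampling from a space of tasks might look like the following. Everything here is a hypothetical illustration: the task space is just three-digit addition, and `sample_arithmetic_task`, `evaluate`, and `adding_agent` are invented names.

```python
import random
from typing import Callable, Tuple

Task = Tuple[str, int]  # (prompt, expected answer)


def sample_arithmetic_task(rng: random.Random) -> Task:
    """Draw one task from a (tiny) space of possible tasks."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return (f"What is {a} + {b}?", a + b)


def evaluate(agent: Callable[[str], int], n_tasks: int, seed: int) -> float:
    """Score an agent on freshly sampled tasks; returns the fraction correct."""
    rng = random.Random(seed)  # the seed is kept private until test time
    tasks = [sample_arithmetic_task(rng) for _ in range(n_tasks)]
    correct = sum(agent(prompt) == answer for prompt, answer in tasks)
    return correct / n_tasks


def adding_agent(prompt: str) -> int:
    """An agent that actually solves the task by parsing the prompt."""
    words = prompt.rstrip("?").split()
    return int(words[2]) + int(words[4])


def memorizing_agent(prompt: str) -> int:
    """An agent that memorized one answer from a fixed benchmark."""
    return 1000


print(evaluate(adding_agent, n_tasks=50, seed=0))
print(evaluate(memorizing_agent, n_tasks=50, seed=0))
```

Because the concrete problems are drawn at test time, memorizing answers to a published benchmark does not help; only an agent that solves the underlying task scores well.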
These issues aside, what suite of tasks might be sufficient for testing general intelligence? Here are some suggested categories of tasks:
Ideally, our list contains a mix of standard AI tasks, psychology tasks, neuroscience tasks, and creative tasks, as well as tasks intended to test for the presence of specific features (e.g. memory). If we are serious about testing for AGI, a list like this will probably have to be greatly expanded and refined.
It should be mentioned that the idea of testing for AGI using a suite of benchmark tasks is not entirely new. Startups like Generally Intelligent are trying to do this, for example.
What obstacles prevent current models from exhibiting the kind of (Level 5) general intelligence described in the previous section? There are a few different kinds of obstacles, and each can be classified according to its expected difficulty: easy, in the sense that the solution is straightforward and will probably be found within the next two decades; hard, in the sense that the solution will require some new ideas, although it is not inconceivable that it will be found in the near future; or ugly, in the sense that prospects for a solution are more unclear.
These days, especially for large companies pursuing AGI, data is not a problem: there is a lot of it, and it is increasingly well-curated. There is increased community recognition that large and well-curated data sets are important. There are even huge openly-available data sets for training e.g. diffusion models, like LAION-5B. Tim also mentioned this in his talk; the problem is not data availability, it is using that data effectively.
Obtaining data about human preferences (e.g. for training large language models) is harder and more expensive, but companies like OpenAI are becoming increasingly good at doing it effectively and efficiently. To the extent that such data will prove important in training the next generation of models, other outfits will follow suit.
At least for large companies, compute availability is also not a serious issue. Training models like GPT-N can be expensive, but there is a willingness to do it, so it happens anyway. Other companies and researchers working downstream of the efforts to train such 'foundation models' can, for much less money, modify these models to perform well on more specific tasks.
One potentially 'hard' obstacle which we have discussed already is constructing an acceptable AGI score. (Although such a score is not strictly necessary, it is likely that it will be extremely useful for iterating in the direction of AGI. It may be necessary in practice, if not in principle.) Even if one follows the prescription suggested earlier, doing things like constructing a solid task library and effectively weighting the various tasks in a way that respects human preferences about intelligence may require some cleverness.
A more technical hard obstacle is determining how to use data and compute effectively. This is something Tim talked about at length during his talk. In the training of large language models, the place where large amounts of data and compute are used is in the initial next token prediction step, i.e. the part where you train the model to correctly predict the next word given an enormous number of text fragments from some corpus (e.g. the entire gatherable internet). Although reinforcement learning from human feedback helps make the model friendlier and more attuned to human preferences, it is thought that most of its innate capabilities come from the next token prediction step.
Currently, designers would like to spend more compute on this step, but doing this is hard, since it can lead to overfitting. Although it is not totally obvious how to address this, one thing that is clear is that there is plenty of room to generalize the next token prediction approach in ways that might enhance models’ capabilities. From the point of view of reinforcement learning, large language models are trained like contextual bandits: each episode involves trying to predict the next word, and ends once that has been done. One can imagine an alternative approach where models predict the next several words, but receive some form of intermediate feedback. This turns training into something that looks more like a standard reinforcement learning problem, and causes the usual issues of long-term planning and credit assignment to come up.
Another obstacle is appropriately leveraging insights from the training of sophisticated deep reinforcement learning models like MuZero and AlphaStar. These insights include (i) the utility of explicitly constructing world models; (ii) the benefits of self-play; (iii) the benefits of pitting competing models against one another (see the discussion of league play from the AlphaStar paper). Self-play, which allows a model like MuZero to ‘train itself’ to expert level literally overnight, could be particularly important. The difficulty is appropriately translating self-play into a more general context; e.g. is ‘self-talk’ a useful form of self-play? In general, the more humans can be taken out of the loop, and the more models can learn from self- or simulated experience, the more likely it will be that AGI can be trained quickly and cheaply.
Most of the obstacles whose prospects are unclear relate to world modeling: the ability of humans to envision possible futures, consider counterfactual versions of the past, reason about cause and effect, and otherwise mentally simulate a real or imagined environment.
Related to world modeling, it is unclear how to force models to reckon with uncertainty in a close-to-optimal (or at least, acceptably well) fashion, in the same way brains are thought to in various contexts. (See the Bayesian brain hypothesis and similar ideas about brains that reason probabilistically, which in their weak form mostly concern perception, and in their strong form may be a foundation for even much of higher-level cognition.)
Finally, current large language models generally lack a robust long-term memory, whose benefits are obvious. Blaise Aguera y Arcas of Google spoke at a recent consciousness and AI workshop on the strengths and limitations of existing large models; he noted that current language models 'lack a hippocampus', and argued that this may be one of the most critical ingredients they are currently missing.
Each of these points is important enough that we will cover them in more detail in the next section.
Humans have world models, and interrogating those world models—including their associated biases, priorities, and limitations—is a major project of psychology and neuroscience. Evolution has endowed us with world models almost certainly because they are useful, and they are useful because they allow us to learn from something other than direct experience. Knowing that fire is hot, and thus painful, means that we do not need to put our hands on a hot object to realize that such an action may be suboptimal. (Or at the very least, we do not need to touch too many hot objects to anticipate the dangers of fire in a novel context.) The ability to learn from pure contemplation, and to extract more (albeit possibly wrong, depending on priors) information from a given collection of sensory input than the input itself contains, is perhaps the defining feature of a world model.
Particular brain areas are associated with world models, like the hippocampus (for e.g. spatial maps), prefrontal cortex (for e.g. 'higher-level' cognition), and potentially model-based-reinforcement-learning-associated regions like the basal ganglia. On the other hand, depending on how one more carefully defines a world model (as we neglect to do here), even primary sensory areas like V1 may have a kind of world model.
World models have received a decent amount of focus in the context of deep reinforcement learning, where agents like MuZero have shown that they can provide an extremely effective route to superhuman performance, and where formalizing the problem of world modeling is at least somewhat tractable. For many other models, like GPT-4, although we can certainly say that no world model was put in by hand, it is completely unclear whether there is a sense in which there is a world model. In the same way that it is not easy to 'see' where the world models of humans are, since all we can observe are things like spikes and behavior, it is also not easy to see where in a collection of neural network weights a world model might hide.
Part of the problem of assessing the presence of world models is determining good (performance-based, rather than representation-based, since intelligence can come in many forms) tests for world models. Psychologists and neuroscientists regularly do this kind of thing in limited contexts (e.g. testing the intuitive physics of young children), but we still need an exhaustive test for the coherent and elaborate world model we expect a generally intelligent agent to possess.
Given that we presumably have world models because they enable us to perform many tasks well, it at least stands to reason that a sufficiently advanced model with a sufficiently large number of parameters and volume of training data may also construct a world model 'for free'.
But maybe not. It is also possible that specific architecture needs to be in place to support robust world modeling, and that we would need to identify such architectures before we can achieve AGI. This may look like the 'bag of tricks' view of the brain: a module for intuitive physics, a module for causality and causal inference, a module that constructs 'maps' of environments or abstract concept spaces, and so on. For insightful discussion of world modeling, and where current machines appear to lag behind humans, see Lake et al. 2016.
Humans are sensitive to uncertainty at both the level of perception and cognition, and this feature is extremely useful in navigating a complex and highly uncertain world. Sensitivity to uncertainty allows us to avoid risky actions, which we may expect to go well on average, but which may have unacceptably large possible consequences if they go poorly.
Current models generally do not include sensitivity to uncertainty, at least explicitly, although ideas like distributional reinforcement learning have been shown to improve performance. While there are ideas for how to incorporate some representation of uncertainty into neural networks, like Bayesian neural networks, Bayesian strategies have mostly been proposed to solve isolated problems. (In the context of the brain, how the brain might employ Bayesian strategies is also usually examined in the context of specific problems, like tracking heading direction or inferring motion structure.)
The audience of this essay probably does not need to be convinced of the utility of assessing one's certainty for making good decisions. The main thing to stress is that how to do this for machines, in a general way rather than a 'bag of tricks' way, is currently not clear. It may be the case that a 'bag of tricks' way is what is needed.
As with world modeling, it is possible that reckoning with uncertainty happens automatically for sufficiently performant models. Either way, doing so would probably significantly mitigate the well-known issue of 'hallucinations' (although this issue has improved in the most recent version of GPT).
Humans can remember things even from a long time ago, although the biology behind our memories is not clear, and may relate to intracellular processes in addition to long-lasting changes in synaptic weights.
Large language models are based on transformers, and have memories limited by the context windows associated with their attention mechanisms. ChatGPT will eventually forget what you told it X tokens ago. Given this obvious flaw, there is plenty of work trying to rectify the situation, although it is unclear which solution (or combination of solutions) will be most performant in the end. In addition to more vanilla solutions, like using adaptive context windows, there are more exotic ideas like modern Hopfield networks, which build on a variety of older ideas from AI and neuroscience.
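The forgetting behavior described above can be mimicked with a toy fixed-size window. This is only a caricature under loud assumptions: tokens are whitespace-separated words rather than the subword tokens real models use, the window holds five tokens rather than thousands, and `FixedContextWindow` is a hypothetical class, not how any particular chatbot manages its context.

```python
from collections import deque
from typing import List


class FixedContextWindow:
    """Keeps only the most recent `max_tokens` tokens; older ones are dropped."""

    def __init__(self, max_tokens: int) -> None:
        # A deque with maxlen silently discards items from the left when full.
        self.tokens = deque(maxlen=max_tokens)

    def append(self, text: str) -> None:
        # Assumption: whitespace tokenization stands in for a real tokenizer.
        self.tokens.extend(text.split())

    def visible(self) -> List[str]:
        """Return the tokens the model could still attend to."""
        return list(self.tokens)


window = FixedContextWindow(max_tokens=5)
window.append("my name is Ada")
window.append("tell me a joke about cats")

print(window.visible())
print("Ada" in window.visible())
```

After the second message, the name from the first message has fallen out of the window, so nothing downstream can condition on it. The memory proposals mentioned above are, in different ways, attempts to avoid this hard cutoff.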
It is also possible that more ideas must be borrowed from a neuroscientific understanding of memory (which is arguably lacking at present) before we can make substantial progress on the problem of reliable neural network memory.
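To make the context-window limitation concrete, here is a toy sketch of a fixed-size window that silently drops its oldest entries. It is an illustration under loud simplifying assumptions, not how any particular model is implemented: whole words stand in for tokens, and the window size of 4 is hypothetical (real windows span thousands of subword tokens).

```python
from collections import deque

# Hypothetical window size; real context windows are far larger.
WINDOW_SIZE = 4

# A deque with maxlen discards the oldest item once it is full,
# mimicking a model that can only attend to the most recent tokens.
context = deque(maxlen=WINDOW_SIZE)

# Assumption: one word = one token (real tokenizers use subword units).
for token in "my name is Ada and I like chess".split():
    context.append(token)

print(list(context))     # ['and', 'I', 'like', 'chess']
print("Ada" in context)  # False: the name has fallen out of the window
```

Anything that has fallen out of the window, here the name, is simply unavailable to the model unless some additional memory mechanism brings it back.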
Conditional on AGI being possible, is there a will to achieve it? This section will be short, because the answer is very clearly yes. Large companies like Microsoft and Google are investing huge sums of money into improving their AI capabilities because it has been and continues to be extremely profitable to do so. Current chatbots like ChatGPT and Bing Chat have provided Microsoft with money, press, and prestige. The business opportunities associated with highly capable chatbots, diffusion models, and so on remain vast and unexplored.
Although there are some incentives to be slow and careful, the economic incentives to build systems with increasingly impressive capabilities are much stronger.
Given what we have discussed, how long might it take for us to build AGI? We are in one of three possible worlds, each characterized by how true Sutton's bitter lesson is.
Strong Bitter Lesson. All we need to achieve AGI is to scale up existing architectures. More parameters is enough.
Weak Bitter Lesson. Scaling up existing architectures, with some important tweaks here and there (e.g. generalizing next token prediction to something more sophisticated, and exploiting self-talk and self-play), will be enough to achieve AGI. More parameters takes us most of the way there.
No Bitter Lesson. New and different ideas are necessary for large models to achieve genuine understanding, and to do things like develop coherent and useful world models. In other words, something like a world model will not appear unless we put it in.
The Strong Bitter Lesson is probably not true, but there is enough uncertainty that I cannot say for sure. Let's say it has a 20% probability. Lillicrap and many other leading AI researchers appear to believe in the Weak Bitter Lesson; on the other hand, it is extremely difficult to say whether new and different ideas are necessary. After all, it is at least possible that the brain looks like a 'bag of tricks' for reasons other than developmental or evolutionary tractability. Let's say that the Weak Bitter Lesson and No Bitter Lesson worlds are roughly equally likely, conditional on the Strong Bitter Lesson being false (that is, we assign each of them a 40% probability).
In the Strong Bitter Lesson world, whether we achieve AGI by 2043 is simply a matter of how quickly we can train bigger models. Given that the growth in model size appears to adhere to a form of Moore's law, AGI by 2043 is almost certain in this world, but maybe not completely certain, since there are plenty of unknown unknowns. Let's say that AGI happens by 2043 with 90% probability.
In the Weak Bitter Lesson world, whether we achieve AGI by 2043 depends on the obstacles we previously identified as hard, like figuring out how to more effectively leverage compute. Things like world models may appear when models become sufficiently performant. The aforementioned obstacles are not insurmountable, and there is a reasonable chance that we address them acceptably well given twenty years' time, especially if we extrapolate the rapid recent pace of AI research forward in time. Given unknown unknowns, and being conservative, let's say AGI happens by 2043 with 60% probability.
In the No Bitter Lesson world, whether we achieve AGI by 2043 depends on figuring out how to implement many special and complex components, like the construction of world models, supporting causal reasoning, maps, and reckoning with uncertainty. The neuroscience and psychology behind many of these components remain somewhat poorly understood, and it is possible that a deep understanding of each individual component takes more than two decades. On the other hand, we may not need a totally robust scientific understanding in order to build something that is reasonably performant. Being conservative, let's say AGI happens by 2043 with 20% probability.
Putting these estimates together, weighting each world's chance of AGI by the probability that we are in that world, we arrive at a 50% probability that AGI is developed by 2043. Of course, the estimate is less useful than the journey.
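For readers who want to check the arithmetic or substitute their own numbers, the combination is an application of the law of total probability. The sketch below uses only the probabilities stated above; the dictionary keys and variable names are illustrative choices.

```python
# Each entry: (probability we are in this world,
#              probability of AGI by 2043 given this world).
# All six numbers are the rough guesses stated in the text.
worlds = {
    "strong_bitter_lesson": (0.20, 0.90),
    "weak_bitter_lesson": (0.40, 0.60),
    "no_bitter_lesson": (0.40, 0.20),
}

# The three worlds are meant to be exhaustive and mutually exclusive,
# so their probabilities should sum to 1.
total_world_prob = sum(p_world for p_world, _ in worlds.values())
assert abs(total_world_prob - 1.0) < 1e-9

# Law of total probability: sum over worlds of P(world) * P(AGI | world).
p_agi_by_2043 = sum(p_world * p_agi for p_world, p_agi in worlds.values())

print(f"P(AGI by 2043) = {p_agi_by_2043:.2f}")  # P(AGI by 2043) = 0.50
```

The contributions are 0.18, 0.24, and 0.08 respectively. Because the two less optimistic worlds carry most of the weight, the total is sensitive to the guesses made for them.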