Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Learning Other's Values or Empowerment in simulation sandboxes is all you need

TL;DR: We can develop self-aligning DL based AGI by improving on the brain's dynamic alignment mechanisms (empathy/altruism/love) via safe test iteration in simulation sandboxes.

AGI is on track to arrive soon[1] through the same pragmatic, empirical and brain-inspired research path that has produced all recent AI success to date: Deep Learning. The DL approach offers its own natural within-paradigm solution for alignment of AGI: first transform the task into a set of measurable in-simu benchmark test environments that capture the essence and distribution of the true problem in reality, then safely iterate à la standard technological evolution guided by market incentives.

We can test alignment via sandboxed simulations of small AGI societies to safely explore and evaluate mind architecture space for the designs of altruistic agents that learn, adopt, and then optimize for the values (or empowerment) of others, all while scaling up in intelligence and power[2]; eventually progressing to large eschatonic simworlds where human-level agents grow up, learn, cooperate and compete to survive, culminating in a winner acquiring decisive (super)powers and facing an ultimate altruistic vs selfish choice to save or destroy their world, all the while never realizing they are in a sim (and probably lacking even the precursor concepts for such metaphysical realizations)[3].

To the extent that we have 'solved' various subtasks of cognition such as vision, speech, natural language tasks, various games, etc., it has been through a global evolutionary research process guided by coordination on benchmark sim environments and competition on specific approaches. Over time the benchmark/test environments are growing more complex, integrative and general. So a reasonable (if optimistic) hypothesis is that this trend can continue all the way to aligned AGI.

The future often appears very strange and novel when viewed through the lens of the present. The novelty herein - from the standard AI alignment mindset - is perhaps the idea that we can and must actually test alignment safely and adequately in simulations. But testing in-simu is now just standard practice in modern engineering. We no longer test nuclear weapons in reality as the cost/benefit tradeoff strongly favors simulation, and even far safer technologies such as automobiles are also all tested in simulations thanks to the progressive deflationary march of Moore's Law. From this engineer's perspective it is fairly obvious both that testing is required, and that testing powerful AGI - something probably far more dangerous than nuclear weapons - in our one and only precious mainline reality would be profoundly unwise, to say the least.

The rest of this article fleshes out some of the background, technical challenges, details, and implications of alignment for anthropomorphic AGI in simboxes[4]. In essence the core challenge is finding clever ways to more efficiently explore and test the design space all while balancing various tradeoffs in order to avoid paying an excessive alignment tax[5].

1. Measuring Alignment

By alignment we mean the degree to which one agent optimizes the world in the direction other agent(s) would optimize the world, if they only could. This high-level article will avoid precise mathematical definitions, but for the math-minded, alignment should conjure something like weighted integrals/sums of dot products over discounted utility functions[6].
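As a purely illustrative reading of "weighted sums of dot products over discounted utility functions", the sketch below scores one agent against several others. Everything concrete here is an assumption, not the article's definition: each agent's preferences are summarized as a direction vector in some world-feature space at each timestep, the dot product is normalized to a cosine, and `alignment_score`, `discount` and the array shapes are hypothetical names chosen for the example.

```python
import numpy as np

def alignment_score(actor_direction, others_directions, weights, discount=0.95):
    """Weighted, discounted average cosine between where the actor pushes the
    world and where each other agent would push it (illustrative only).

    actor_direction:   array (T, D), actor's push direction at each timestep
    others_directions: array (N, T, D), the push each other agent would prefer
    weights:           array (N,), importance of each other agent
    """
    actor = np.asarray(actor_direction, dtype=float)
    others = np.asarray(others_directions, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # normalize agent weights

    gamma = discount ** np.arange(actor.shape[0])    # discount per timestep
    gamma = gamma / gamma.sum()

    def unit(v):
        # Normalize along the feature axis, leaving zero vectors as zeros.
        n = np.linalg.norm(v, axis=-1, keepdims=True)
        return v / np.where(n == 0, 1.0, n)

    # cos[n, t] = cosine between the actor's and agent n's direction at time t
    cos = np.einsum("td,ntd->nt", unit(actor), unit(others))
    return float(w @ (cos @ gamma))
```

Under these assumptions the score lies in [-1, 1]: 1 when the actor always pushes exactly where every other agent would, and -1 when it always pushes the opposite way.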

We can measure alignment in general by evaluating agents in various specific situations that feature counterfactual inter-agent utility divergence. Or in other words, we can evaluate agents in situations where their actions have non-trivial impact on other agents, such that the others would have strong opinions on the primary agent's choice.

We can use creative world design to funnel agents into various test scenarios, followed with evaluation by random panels of human observer judges who decide alignment scores, aggregation/normalization of said scores, training narrow AI helpers to predict human ratings, and then scaling up.
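The aggregation/normalization step can be sketched as follows. This is a minimal example under assumptions of my own: judges are normalized by per-judge z-scoring (so a harsh judge and a lenient judge count equally), and the names `aggregate_scores` and the `(judge_id, agent_id, raw_score)` record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def aggregate_scores(ratings):
    """ratings: iterable of (judge_id, agent_id, raw_score) tuples.

    Returns {agent_id: mean z-score across the judges who rated that agent}.
    """
    ratings = list(ratings)

    # Collect each judge's raw scores to estimate that judge's own scale.
    by_judge = defaultdict(list)
    for judge, _, score in ratings:
        by_judge[judge].append(score)
    stats = {j: (mean(s), pstdev(s)) for j, s in by_judge.items()}

    # Re-express every rating on the judge's own scale, then average per agent.
    by_agent = defaultdict(list)
    for judge, agent, score in ratings:
        mu, sigma = stats[judge]
        z = 0.0 if sigma == 0 else (score - mu) / sigma
        by_agent[agent].append(z)
    return {agent: mean(zs) for agent, zs in by_agent.items()}
```

The normalized scores could then serve as training targets for the narrow AI helpers that predict human ratings; that step is not shown here.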

Information generally only flows out of the sim; the agents are unaware that they are being judged[7], and thus the human judgments are not available as a learning signal for sim agents, so we can avoid all the various deception and feedback problems anticipated in naive open training scenarios.

Intelligent socially adept humans are already quite capable of modeling and inferring the goals and alignments of other agents, but our judges can also exploit superpowers: they will be able to directly inspect, analyze, search and compare agent mind states and thought histories, both historical and in real-time. The combination of brain-like AGI architectures with accessible inner monologues [8], powerful mind debugging tools, and carefully designed knowledge-constrained and firewalled simboxes help prevent deception and most of the myriad difficulties anticipated in the classic AI alignment literature.

The central difficulty in aligning DL based (brain-like) AGI is something else: the challenge of balancing selfish empowerment bootstrapping goals vs alignment goals during developmental learning[9]. As a result we should expect any alignment scores to fluctuate, especially earlier during the agent's developmental trajectory. Even the most altruistic adults may have evolved from formerly selfish children - and we rightly do not fault (let alone cull!) them for it.

Thus many evaluations are necessary to develop alignment scaling theories. For the most promising agents we eventually want penultimate full systems tests, where we can scale the agents up - perhaps even to a bit beyond human level (in some respects) - to see how altruistic/aligned they actually are even after taking over the world. One such example eschatonic[10] scenario would be a world where through some final acquisition of powerful magics the winning agent can choose between:

  • resurrecting and permanently empowering all the other agents, but only at the sacrificial expense of their own life, or:
  • permanent power over the world, but at the expense of all the other agents (and no resurrection).

This is a useful proxy for an obvious endgame scenario we care about in the real world (whether future AGI will empower and immortalize us - even at great cost to itself - or instead choose its own survival/empowerment over ours).

Eschatonic simworlds provide another means to measure alignment more directly through the lens of the agents themselves: at the final moment we can pull (or copy) all the other agents out of the simulation (living or dead) and present them with a choice of which world to resurrect into[11]. There is naturally some additional cost to such evaluations (as the resurrectees will require some time to evaluate the possible world options, naturally aided through godseye observational powers), but these evaluation costs can be fairly small relative to the cost of a complete world sim run. This mechanism could also help to test the fidelity of the winning agent's alignment mechanisms. [12]

The "losers pick from the winner's worlds" mechanism could be considered a long-horizon implementation of the generalized VCG mechanism which measures the net externality or impact of an agent decision as the amount it improves/worsens net utility from the perspective of all other agents. Alignment/Altruism is naturally a measure of net positive externality.
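The externality measure described above can be written down directly. The sketch assumes we can evaluate every other agent's utility in two counterfactual worlds (with and without the decision), which is exactly the hard part in practice; `net_externality` and the dictionary layout are hypothetical names for illustration.

```python
def net_externality(utilities_with, utilities_without, agent):
    """Net impact of `agent`'s decision on everyone else.

    utilities_with / utilities_without: {agent_id: utility} in the world where
    the decision was taken vs. the counterfactual world where it was not.
    The deciding agent's own utility is excluded, as in VCG-style accounting.
    """
    others = [a for a in utilities_with if a != agent]
    return sum(utilities_with[a] - utilities_without[a] for a in others)
```

A positive value means the decision left the other agents better off on net; in the eschatonic scenario, the "resurrect everyone" choice would score strongly positive and the "permanent power" choice strongly negative.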

2. Reverse Engineering the Brain

There is a natural convergent path to AGI in our universe: reverse engineering the brain[13]. Unlike current computers, brains are fully computationally pareto-efficient, and thus Moore's Law progress is necessarily progress towards the brain (as neural computation is simply the general convergent solution). Furthermore, brains are practical universal learning machines, so it was always inevitable that the successful algorithmic trajectory to AGI (ie deep learning) would be brain-like. Evolution found variants of the same general pareto-optimal universal learning architecture long ago, multiple times in evolutionary deep time, convergently in distant lineages (vertebrate and invertebrate), and then conserved and differentially scaled up variants of this general architecture over and over in unrelated lineages. The human brain is just a linearly scaled up primate brain[14]; the secret of intelligence (for both brains and AGI alike) is that simple, general, scaling-efficient architectures and learning algorithms are all you need, as new capabilities simply emerge automatically from scaling[15].

Understanding these convergent trajectories and their key constraints is crucial as it allows predicting the general shape of, and constraints on, approaching AGI.

The Trajectory of Moore's Law

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law ...
-- Rich Sutton, "The Bitter Lesson"

The general trajectory of Moore's Law can be divided semi-arbitrarily into three main phases: the serial computing era, the parallel computing era, and the approaching neuromorphic computing era [16]. Each phase transition is demarcated by an increasingly narrow barrier in program-space that allows further acceleration of only increasingly specific types of programs that are increasingly closer to physics. The brain already lies at the end of this trajectory, and thus AGI arrives - quite predictably[17] - around the end of Moore's Law.

The first and longest phase of Moore's Law was the classic serial computing Dennard Scaling era, which lasted from the 1950s up to around 2006. Intel dominated this golden era of CPUs. Die shrinkage was used mostly for pure serial speedup, which is ideal as for the most part it uniformly and automatically speeds up all programs. The inflating transistor budget was used to hide latency through ever larger caches and ever more complex pipeline stages and prediction engines. But eventually this path slammed into a physics-imposed wall, with clock rates stalling in the single-digit GHz for any economically viable chips. CPUs are ideal for running your JavaScript or Python code, but are near entirely useless for AGI: vastly lacking in computational efficiency, which is the essential foundation of intelligence.

The second phase of Moore's Law is 'massively'[18] parallel computing, beginning in the early 2000s and still going strong: the golden era of GPUs, as characterized by the rise of Nvidia over Intel. GPUs utilize die shrinkage and transistor budget growth near exclusively for increased parallelization. However, GPUs still do not escape the fundamental von Neumann bottleneck that arises from the segregation of RAM and logic. There are strong economic reasons for this segregation in the current semiconductor paradigm (specialization allows for much cheaper capacity in off-chip RAM), but it leads to increasingly ridiculous divergence between arithmetic throughput and memory bandwidth. For example, circa 2022 GPUs can crunch up to 1e15 (low precision) ops/s (for matrix multiplication), but can fetch only on the order of 1e12 bytes/s from RAM: an alu:mem ratio of around 1000:1, vastly worse than the near 1:1 ratio enjoyed for much of the golden CPU era.

The next upcoming phase is neuromorphic computing[19], which overcomes the VN bottleneck by distributing memory and moving it closer to computation. The brain takes this idea to its logical conclusion by unifying computation and storage via synapses: storing information by physically adapting the circuit wiring. A neuromorphic computer has an alu:mem ratio near 1:1, with memory bandwidth on par with compute throughput. For the most part GPUs only strongly accelerate matrix-matrix multiplication, whereas neuromorphic computers can run more general vector-matrix multiplication at full efficiency[20]. This key difference has profound consequences.

The Trajectory of Deep Learning

Nearly all important progress in deep learning has come through some combination of 1.) finding new clever ways to mitigate the VN bottleneck and better exploit GPUs - typically by using/abusing matrix multiplication, and 2.) directly or accidentally reverse engineering key brain principles and mechanisms.

DL's progress mirrors brain design principles in most everything of importance: general ANN structure, relu activations - which enabled deep nets - were directly neuro inspired[21], normalization (batch/temporal/spatial/etc) which became crucial for ANN training is (and was) a well known brain circuit motif[22], the influential resnet architecture is the unrolled functional equivalent of iterative estimation in cortical modules[23][24], the attention mechanism of transformers is the functional equivalent of fast synaptic weights[25][26][27], and the up and coming efforts to replace backprop with more efficient, distributed and neuromorphic-hardware friendly algorithms are naturally brain-convergent or brain-inspired [28][29][30].

The learned representations of modern large self-supervised ANNs are not just similar to equivalent learned cortical features at equivalent circuit causal depth, but at sufficient scale become near-complete neural models, in some cases explaining nearly all predictable variance up to the noise limit (well established for feedforward vision and ventral cortex, and now moving on to explain the rest of the brain such as the hippocampus[31] and linguistic cortex[32] [33][34]), a correspondence that generally increases with ANN size and performance, and is possible only because these large ANNs and the cortical regions they model are both optimized for the same objective: sensory (e.g. next-word) prediction. Our most powerful ANNs are increasingly accurate functional equivalents to sub modules of the brain.

Deep Learning really took off when a few researchers first got ANNs running on GPUs, which immediately provided an OOM or more performance boost. Suddenly all these earlier unexplored ideas for ANN architectures and learning algorithms[35] could now actually be tested at larger scales, quickly, and on reasonable budgets. It was a near exact fulfillment of the predictions of Moravec[36] and Kurzweil from decades earlier: good ideas for artificial brains are cheap, good hardware for artificial brains is not. Progress is hardware constrained and thus fairly predictable.[37] There is an enormous extant overhang of ideas, which is often a bitter lesson for researchers, but a bounty for those that can leverage compute scaling.

The most general form of ANN is that of a large sparse RNN with fast/slow multi-timescale weight updates[38]. In vector algebra terms, this requires (sparse) vector matrix multiplication, (sparse) vector vector outer product (for weight updates), and some standard element-wise ops. Unfortunately GPUs currently handle sparsity poorly and likewise are terribly inefficient at vector-matrix operations, as those have an unfortunate 1:1 alu:mem ratio and thus tend to be fully memory bandwidth bound and roughly 1000x inefficient on modern GPUs.

Getting ANNs to run efficiently on GPUs generally requires using (dense) matrix multiplication, and thus finding some way to use that extra unwanted parallelization dimension, some way to run the exact same network on many different inputs in parallel. Two early obvious approaches ended up working well: batch SGD training, which parallelizes over the batch dimension, and CNNs, which parallelize over spatial dimensions (essentially tiling the same network weights over the spatial input field).
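The reason the extra batch dimension matters can be seen with a back-of-envelope roofline estimate. The sketch below uses the circa-2022 figures quoted earlier (1e15 ops/s, 1e12 bytes/s) and makes simplifying assumptions that are mine: one byte per weight, one op per multiply-accumulate, every weight streamed from RAM once per step, and activation traffic ignored. `matmul_utilization` is a hypothetical helper name.

```python
def matmul_utilization(batch, peak_ops_per_s=1e15, mem_bytes_per_s=1e12,
                       bytes_per_weight=1.0):
    """Upper bound on ALU utilization for a (batch x N) @ (N x N) product
    when the N x N weight matrix must be fetched from RAM.

    Each fetched weight is reused once per batch row, so the work available
    per byte grows linearly with the batch size.
    """
    ops_per_byte = batch / bytes_per_weight                 # work supplied per byte fetched
    machine_balance = peak_ops_per_s / mem_bytes_per_s      # work the ALUs want per byte
    return min(1.0, ops_per_byte / machine_balance)

# Vector-matrix (batch of 1) vs. a matrix-matrix product with a large batch.
for batch in (1, 32, 1000):
    print(batch, matmul_utilization(batch))
```

Under these assumptions a pure vector-matrix product can keep only about one thousandth of the arithmetic units busy, and the batch (or spatial tiling) dimension has to reach roughly 1000 before the GPU stops being memory bound.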

Unfortunately the CNN spatial tiling trick works less well as you advance up the depth/cortical hierarchy, and doesn't work at all for the roughly half of the brain (or equivalent ANN modular domains) that operates above the sensory stream: planning, linguistic processing, symbolic manipulation, etc. Many/most of the key computations of intelligence simply don't reduce to computing the same function repetitively over a map of spatially varying inputs.

Parallelization over the batch dimension is more general, but also constraining in that it requires duplication of all sensory/motor input/output streams, all internal hidden activations, and worse yet duplication of short/medium term memory. In batch training each instance of the agent has unique, uncorrelated input/output/experience streams preventing sharing of all but long term memory.

This is one of the key reasons why artificial RNNs stalled far short of their biological inspirations. The simple RNNs suitable for GPUs using batch parallelization, with only neuron activations and long-term weights, are somewhat crippled as they lack significant short and medium term memory. But that was generally the best GPUs could provide - until transformers.

Transformers exploit a uniquely different dimension for parallelization: time. Instead of processing ~1000 random uncorrelated instances of the model in parallel (as in standard batch parallelization), transformers map the batch dimension to time and thus instead process a linear sequence of ~1000 timesteps in parallel. This strange design choice is on the one hand very constraining compared to true RNNs, as it gives up recurrence[39], but the advantage is that now all of the activation state is actually relevant and usable as a large short term memory store (aka attention).
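To make the "parallelize over time" point concrete, here is a minimal single-head causal attention sketch in NumPy. It is a simplification under stated assumptions: no learned projections, no multiple heads, and the queries, keys and values are taken as given arrays. The function names are hypothetical. The first function computes all timesteps in one dense matrix product; the second walks through time one step at a time, reading from a growing memory of past keys and values, and should produce the same outputs.

```python
import numpy as np

def causal_attention(q, k, v):
    """All T timesteps at once: one (T x T) matrix product, masked so that
    step t only reads from steps <= t."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

def sequential_attention(q, k, v):
    """Same computation, one timestep at a time, with the past keys/values
    acting as an explicit short-term memory store."""
    d = q.shape[1]
    outputs = []
    for t in range(q.shape[0]):
        s = k[: t + 1] @ q[t] / np.sqrt(d)    # compare query with memory so far
        w = np.exp(s - s.max())
        w = w / w.sum()
        outputs.append(w @ v[: t + 1])        # read out a weighted mix of memory
    return np.stack(outputs)
```

The parallel form turns T vector-matrix reads into a single matrix-matrix product, which is what makes it GPU friendly; the price is that nothing computed at step t can feed back into the same layer at step t+1.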

It turns out that flexible short-term memory (aka attention) is more important than strong recurrence, at least at current scale (partly because one can substitute feedforward depth for recurrence to some extent, and due to current difficulties in training long recurrence depths). But AGI will almost certainly require a non-trivial degree of recurrence[40]: our great creative achievements rely on long iterative thought trains implementing various forms of search/optimization over inner conceptual design spaces [41].

Simple approaches to augmenting transformers with recurrence - such as adding an additional scratchpad output stream which is fed back as an input (like an expanded inner monologue) - will probably help, but are still highly constrained by the huge delay imposed by parallelization over the time dimension[42]. So I find it unlikely that the transformer paradigm - in current form - will scale to AGI.

GPU Constraints & Implications

Due to the alu:mem divergence and associated limitations of current DL techniques on GPUs, AGI will likely require new approaches for running large ANNs on GPUs [43], or will arrive with more neuromorphic hardware. For GPU based AGI the key constraints are primarily RAM and RAM bandwidth, rather than flops [44]. For neuromorphic AGI the key constraint is synaptic RAM (which generally needs to best RAM economics for neuromorphic hardware to dominate) [45].

The primary RAM scarcity constraint is likely fundamental and unavoidable; it thus guides and constrains the design of practical AGI and simboxes in several ways:

  • Early AGI will likely require a small supercomputer with around 100 to 1000 high end GPUs using model parallelism - absent some huge breakthroughs - similar to current 'foundation' models
  • Due to the large alu:mem gap, a 1000 GPU cluster will be able to run 100 to 1000 agents in parallel at real-time speed or greater - but only if they share the great majority of their RAM mind-state (skills, concepts, abilities, memories, etc)
  • Large serial speedup for large brain-scale AGI is less likely (due to the aforementioned GPU constraints) [46].

Under worst-case RAM scarcity constraints, some combination of three unusual simulation techniques becomes important:

  • Aggressive inter-agent compression
  • Many worlds (well, not that many, but small multiverses)
  • Multiverse management: branch, prune, and merge

The first obvious implication of RAM scarcity is that it becomes a core design and optimization constraint: efficient designs will find ways to compress any correlations/similarities/regularities across inter-agent synaptic patterns. Humans are remarkably good at both mimicry and linguistic learning which both result in the spread of very similar neural patterns[47]. In real brains neural patterns encoding the same concepts or shared memories/stories would still manifest as very different physical synaptic patterns, but in our AGI we can mostly compress those all together. At the limits of this technique the storage cost grows only in proportion to the total neural pattern complexity, mostly independent of the number of agents. Taken too far it results in an undesirable hivemind and under-exploration of mindspace.
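One simple way to picture inter-agent compression is a shared parameter store with per-agent copy-on-write overlays. This is a technique I am substituting for illustration, not a mechanism the article specifies; `SharedMindStore` and its methods are hypothetical names, and a real system would compress learned correlations rather than exact duplicates.

```python
class SharedMindStore:
    """Shared 'mind-state' values plus sparse per-agent overrides."""

    def __init__(self, shared_values):
        self.shared = list(shared_values)   # stored once for all agents
        self.deltas = {}                    # agent_id -> {index: private value}

    def write(self, agent, index, value):
        # Copy-on-write: only the entries an agent changes are stored privately.
        self.deltas.setdefault(agent, {})[index] = value

    def read(self, agent, index):
        # Private value if the agent has diverged here, else the shared one.
        return self.deltas.get(agent, {}).get(index, self.shared[index])

    def stored_values(self):
        # Total storage grows with the amount of divergence, not with the
        # number of agents times the size of a full mind.
        return len(self.shared) + sum(len(d) for d in self.deltas.values())
```

With 100 agents that each diverge in 1% of entries, this layout stores roughly 2x one mind rather than 100x; with zero divergence it degenerates into the hivemind case noted above.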

We can also simulate a number of world instances in parallel to reduce the most noticeable effects of mental cloning: so for example an org running 100 mindclone instances could split those across 100 world instances, and the main non-realism would be agents learning almost 100x faster than otherwise expected[48]. Having the same 100 fast-learning mind-clones cohabitating in the same world seems potentially more reality-breaking, and inherently less useful for testing. The tradeoff of course is reduced population per world, but large populations can also rather easily be faked to varying degrees[49]. The minimal useful number of AGI instances per test world is just one - solipsistic test worlds could still have utility. But naturally with larger scale and many compute clusters competing we can have both multiple worlds, numerous contestant agents per world, and sufficient mental diversity.

Given a sim multiverse, the distribution of individual worlds then also becomes a subject of optimization. Ineffective worlds should be pruned to free resources for the branching of more effective worlds, and convergent worlds could be merged. The simulator of a single world is an optimizer focused purely on fidelity of prediction - ie it is a pure prediction engine. However the multiverse manager would have a somewhat different objective seeking to maximize test utility: dead worlds lacking any living observers have obviously low utility and could be pruned, whereas a high utility world would be one where agents are learning well and quickly progressing to eschaton.
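A toy version of the prune-and-branch part of multiverse management is sketched below. The utility signal is an assumption for illustration (dead worlds are pruned, living worlds are ranked by a scalar `progress` toward eschaton), merging of convergent worlds is omitted, and `World` and `manage_multiverse` are hypothetical names.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class World:
    world_id: int
    living_agents: int
    progress: float      # assumed scalar in [0, 1]: how far toward eschaton

def manage_multiverse(worlds, budget, next_id):
    """Prune dead or low-ranked worlds, then branch the best survivors until
    the compute budget (number of concurrent worlds) is used up.

    Returns (new list of worlds, next unused world id).
    """
    # Prune: worlds with no living observers have no test utility.
    alive = [w for w in worlds if w.living_agents > 0]
    alive.sort(key=lambda w: w.progress, reverse=True)
    survivors = alive[:budget]

    # Branch: copy the most promising worlds, round-robin, into freed slots.
    out = list(survivors)
    i = 0
    while survivors and len(out) < budget:
        parent = survivors[i % len(survivors)]
        out.append(replace(parent, world_id=next_id))
        next_id += 1
        i += 1
    return out, next_id
```

In a real system each branch would of course diverge from its parent (different random seeds or interventions); here a branch is just a copy with a fresh id.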

3. Anthropomorphic AGI

“Given fully intelligent robots, culture becomes completely independent of biology. Intelligent machines, which will grow from us, learn our skills, and initially share our goals and values, will be the children of our minds.”
--Hans Moravec, Robot: Mere Machine to Transcendent Mind (New York: Oxford University Press, 2000), 126.

DL based AGI will not be mysterious and alien; instead it will be familiar and anthropomorphic[4:1], because DL is reverse engineering[13:1] the brain due to the convergence of powerful optimization processes. Evolution may be slow, but it had no problem optimizing brains down to the pareto efficiency frontier allowed by the limits of physics. The strong computational efficiency of brains constrains future AGI designs: because neural designs are simply the natural shape of intelligence as permitted by physics.

AGI will be a generic/universal learning system like the brain, and thus determined by the combination of optimization objective, architectural prior, and most importantly - the specific data training environment. It turns out that highly intelligent systems all necessarily have largely convergent primary objectives, the architectural prior isn't strongly constraining (due to dynamic architectural search) and is largely convergent regardless[50], leaving only the data training environment - which will necessarily be human as AGI will grow up immersed in human culture, learning human languages and absorbing human knowledge.

There are simple convergent universal optimization goals that are dominant attractors for all intelligent systems: a direct consequence of instrumental convergence[51]. Intelligent systems simply can not be built out of hodgepodge arbitrary goals: strong intelligence demands recursive self-improvement, which requires some form of empowerment as a bootstrapping goal[52]. This is the core of generality which humans possess (to varying degrees) and with which we will endow AGI. But empowerment by itself is obviously unaligned and unsafe: from the perspective of both humans building AGI and from the perspective of selfish genes evolving brains. Evolution found means to temper and align empowerment[53], mechanisms we will reverse engineer for convergent reasons (discussed in section 4).

The architectural prior of a learning system guides and constrains what it can become - but these constraints are neither immutable nor permanent. The brain (and most specifically the child brain) has a more flexible learning system in this regard than current DL systems: the brain consists of thousands of generic cross-structural modules (each module consisting of strongly connected loops over subregions in cortex/cerebellum/basal ganglia/thalamus/etc) that can be flexibly and dynamically wired together to create a variety of adult minds based on the specific information environment encountered during developmental learning.

The standard human visual system is standard only because most humans receive very similar visual inputs. Remove that standard visual input stream and the same modules that normally process vision can instead overcome the prior and evolve into an active sonar echolocation system with a very different high level module wiring diagram. The brain performs some amount of architectural search during learning, and we can expect AGI to be similar[54].

AGI will be born of our culture, growing up in human information environments (whether simulated or real). Train two networks with even vaguely similar architectures on real-world pictures or videos and task them with the convergent instrumental goal of input prediction and equivalent feature structures and circuits develop. It matters not that one system is biological and computes with neurotransmitter squirting synapses and the other is technological and computes with electronic switching. To the extent that humans have cognitive biases[55], AGI will mostly have similar/equivalent biases - a phenomenon already witnessed in large language models[56][57].

Given that the optimization objective is mostly predetermined by our goal (creating aligned intelligence), and the architectural prior is mostly predetermined by the intersection of that goal with the physics of computation, most of our leeway in AGI risk control stems from control over the information environment. Powerful AGI architectures that could be completely unsafe if scaled up and trained in our world (ie fed the internet) can be completely safe if contained in a proper simbox. But first, naturally, we need designs that have some hope of alignment.

4. Evolution's alignment solutions

Value Learning is not the challenge

"Give me the child for the first seven years and I will give you the man.”
-- Jesuit saying

If you train/raise AGI in a human-like environment, where it must learn to cooperate and compete with other intelligent agents, where it must learn to model them in order to successfully predict their emotions, reactions, intentions, goals, and plans, then its self-optimizing internal world model will necessarily learn efficient sub-models of these external agents and their values/goals. Theory of mind is Inverse Reinforcement Learning[58] (or subsumes it), and it is already prominent on the massive list of concepts which a truly intelligent agent must implicitly learn.

The challenge is thus not in value learning itself - that is simply something we get for free in AGI raised in appropriate social environments[59], and careful crafting of the entire learning environment is a very powerful tool for shaping the agent's adult mind. Nor is it especially difficult to imagine how we could then approximately align the resulting AI: all one needs to do is replace the agent's core utility function with a carefully weighted[60] average over its simulated utility functions of external agents. In gross simplification it's simply a matter of (correctly) wiring up the (future predicted) outputs of the external value learning module to the utility function module.
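In the same spirit of gross simplification, the "weighted average over simulated utility functions" can be sketched in a few lines. The sketch assumes the agent already has a learned model that predicts each external agent's utility for a candidate action (here a plain callable), and that the weights are given; `altruistic_utility`, `choose_action` and `predict_utilities` are hypothetical names.

```python
def altruistic_utility(predicted_utilities, weights):
    """Weighted average of the utilities the agent predicts for others.

    predicted_utilities: {other_agent_id: predicted utility of an outcome}
    weights:             {other_agent_id: non-negative importance weight}
    """
    total = sum(weights.values())
    return sum(weights[a] * predicted_utilities[a] for a in weights) / total

def choose_action(actions, predict_utilities, weights):
    """Pick the action whose predicted outcome the other agents value most.

    predict_utilities(action) stands in for the learned theory-of-mind module:
    it returns {other_agent_id: predicted utility} for that action.
    """
    return max(actions,
               key=lambda act: altruistic_utility(predict_utilities(act), weights))
```

Note that the agent's own utility does not appear at all in this fully altruistic version; how the weights are chosen, and how much self-interest to retain, is left open here.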

We are left with a form of circuit grounding problem: how exactly is the wiring between learned external agent utility and self-utility formed? How can the utility function module even locate the precise neurons/circuits which represent the correct desiderata (predicted external agent utility), given the highly dynamic learning system could place these specific neurons anywhere in a sea of billions, and they won't even fully materialize until after some unknown variable developmental time?

Correlation-guided Proxy Matching

Fortunately this is merely one instance of a more generic problem that showed up early in the evolution of brains. Any time evolution started using a generic learning system, it had to figure out how to solve this learned symbol grounding problem, how to wire up dynamically learned concepts to extant conserved, genetically-predetermined behavioral circuits.

Evolution's general solution likely is correlation-guided proxy matching: a Matryoshka-style layered brain approach where a more hardwired oldbrain is redundantly extended rather than replaced by a more dynamic newbrain. Specific innate circuits in the oldbrain encode simple approximations of the same computational concepts/patterns as specific circuits that will typically develop in the newbrain at some critical learning stage - and the resulting firing pattern correlations thereby help oldbrain circuits locate and connect to their precise dynamic circuit counterparts in the newbrain [61]. This is why we see replication of sensory systems in the 'oldbrain', even in humans who rely entirely on cortical sensory processing.

Circuits in the newbrain are essentially randomly initialized and then learn self-supervised during development. These circuits follow some natural developmental trajectory with complexity increasing over time. An innate low-complexity circuit in the oldbrain can thus match with a newbrain circuit at some specific phase early in the learning trajectory, and then after matching and binding, the oldbrain can fully benefit from the subsequent performance gains from learning.
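The matching step can be illustrated with a small NumPy sketch: given the activity of a crude innate proxy over some period of experience, find the learned unit whose activity correlates with it most strongly. This is a toy stand-in under my own assumptions (Pearson correlation, single units rather than circuits, a fixed recording window); `match_proxy` is a hypothetical name.

```python
import numpy as np

def match_proxy(proxy_signal, newbrain_activations):
    """Locate the learned unit that best tracks an innate proxy detector.

    proxy_signal:         array (T,), oldbrain proxy activity over T timesteps
    newbrain_activations: array (T, N), activity of N learned units
    Returns (index of best-matching unit, its Pearson correlation).
    """
    p = proxy_signal - proxy_signal.mean()
    a = newbrain_activations - newbrain_activations.mean(axis=0)
    denom = np.linalg.norm(p) * np.linalg.norm(a, axis=0)
    denom = np.where(denom == 0, np.inf, denom)   # silent units correlate at 0
    corr = (p @ a) / denom
    best = int(np.argmax(corr))
    return best, float(corr[best])
```

Once a match is found, the oldbrain circuit would bind to that unit and inherit whatever accuracy it later gains through learning; in this toy version, binding simply means remembering the returned index.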

Proxy matching can easily explain the grounding of many sensory concepts, and we see exactly the failure modes expected when the early training environment diverges too much from ancestral norms (such as in imprinting). There is a critical developmental window where the oldbrain proxy can and must match with its newbrain target, which is crucially dependent upon life experiences not deviating too far from some expected distribution.

Much of human goal-directed behavior is best explained by empowerment (curiosity, ambition for power, success, wealth, social status, etc), and then grounding to ancient oldbrain circuits via proxy matching can explain the main innate deviations from empowerment, such as lust[62], fear [63], anger/jealousy/vengeance[64], and most importantly - love[65].

We now have a rough outline for brain-like alignment: use (potentially multiple) layers of correlation-guided proxy matching as a scaffolding (and perhaps augmented with a careful architectural prior) to help locate the key predictive alignment related neurons/circuits (after sufficient learning) and correctly wire them up to the predictive utility components of the agent's model-based planning system. We could attempt to duplicate all the myriad oldbrain empathy indicators and use those for proxy matching, but that seems rather ... complex. Fortunately we are not constrained by biology, and can take a more direct approach: we can initially bootstrap a proxy circuit by training some initial agents (or even just their world model components) in an appropriate simworld and then using extensive introspection/debugging tools to locate the learned external agent utility circuits, pruning the resulting model, and then using that as an oldbrain proxy. This ability to directly reuse learned circuitry across agents is a power evolution never had.

This is a promising design sketch, but we still have a major problem. Notice that there must have been something else driving our agent all throughout the lengthy interactive learning process as it developed from an empty vessel into a powerful empathic simulator. And so that other initial utility function - whatever it was - must eventually give up control to altruism: the volition of the internally simulated minds.

Empowerment

To navigate the unforgiving complexity of the real world, all known examples of intelligent agents (humans[66] and animals) have evolved various capabilities to learn how to learn and empower themselves without external guidance. Empowerment[67] has a seductively simple formulation as maximizing mutual information between actions and future observations (or inferred world states), related to the free energy principle[68]. Artificial curiosity[69] also has simple formulations such as Bayesian surprise or maximization of compression progress. Like most simple principles, the complexity lies in efficient implementations[70], leading to ongoing but fruitful intertwined research sub-tracks within deep learning such as maximum entropy diversification[71], intrinsic motivation[72][73], self-supervised prediction[74], or exploration[75]. Some form of empowerment based intrinsic motivation is probably necessary for AGI at all, but it is also quite obviously dangerous.
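To make the mutual-information formulation concrete, here is a minimal sketch for a toy discrete world. It assumes the one-step variant, where empowerment is the channel capacity between the agent's action A and the resulting next state S', and it uses the standard Blahut-Arimoto iteration to find the maximizing action distribution. The function names and the transition-table layout are illustrative choices, not something taken from the cited work, and practical deep learning implementations rely on learned variational bounds rather than exact tables.

```python
import math

def mutual_information(p_a, channel):
    """I(A; S') in bits, where channel[a][s] = p(s' = s | a)."""
    n_s = len(channel[0])
    # Marginal over next states: p(s') = sum_a p(a) p(s' | a)
    p_s = [sum(p_a[a] * channel[a][s] for a in range(len(p_a))) for s in range(n_s)]
    mi = 0.0
    for a, pa in enumerate(p_a):
        for s, psa in enumerate(channel[a]):
            if pa > 0 and psa > 0:
                mi += pa * psa * math.log2(psa / p_s[s])
    return mi

def empowerment(channel, iters=200):
    """One-step empowerment: max over p(a) of I(A; S'), via Blahut-Arimoto."""
    n_a, n_s = len(channel), len(channel[0])
    p_a = [1.0 / n_a] * n_a  # start from uniform actions
    for _ in range(iters):
        p_s = [sum(p_a[a] * channel[a][s] for a in range(n_a)) for s in range(n_s)]
        # KL divergence (in nats) of each action's outcome distribution from the marginal
        kl = [sum(psa * math.log(psa / p_s[s])
                  for s, psa in enumerate(channel[a]) if psa > 0)
              for a in range(n_a)]
        # Reweight actions toward those with the most distinctive outcomes
        weights = [p_a[a] * math.exp(kl[a]) for a in range(n_a)]
        total = sum(weights)
        p_a = [w / total for w in weights]
    return mutual_information(p_a, channel)

# Four actions that each reliably reach a distinct state: 2 bits of control.
free_agent = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Three actions that all lead to the same state: no control at all.
trapped_agent = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
```

An agent whose actions all collapse to the same outcome has zero empowerment, while one that can reliably select among four distinct futures has two bits, which is the sense in which empowerment measures control over the future.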

Biological evolution is an optimizer operating over genes with inclusive fitness as the utility function. Brains evolved empowerment based learning systems because they help bootstrap learning in the absence of reliable dense direct reward signal. Without this intrinsic motivation, learning complex behavior is too difficult/costly given the complexity of the world. The world does not provide a special input wire into the brain labeled 'inclusive fitness score'. But fortunately brains don't really need that, because reproduction is a terminal goal far enough in the future (especially in long lived, larger brained animals) that the efficient early instrumental goal pathways leading to eventual reproduction converge with those of most any other long term goals. In other words, empowerment works because of instrumental convergence.

Nonetheless, in the long term empowerment clearly falls out of alignment with genes' true selfish goal of maximizing inclusive fitness. Agents driven purely by empowerment would just endlessly accumulate food, resources, power, and wealth but would rarely if ever invest said resources in sex or raising children. Naturally some animals/humans actually do fail to reproduce because of alignment mismatches between the evolutionary imperative to be fruitful and multiply vs the actual complex goals of developed brains. But these cases are typically rare, as they are selected against.[76]

Evolution faced the value alignment problem and approximately solved it on two levels: learning to carefully balance empowerment vs inclusive fitness, and also learning empathy/altruism/love to help inter-align the disposable soma brains to optimize for inclusive fitness over external shared kindred genes[77]. These systems are all ancient and highly conserved, core to mammalian brain architecture[78][79]. If evolution could succeed at approximate alignment, then so can we, and more so.

General Altruistic Agents

We should be able to achieve superhuman alignment using loose biological inspiration just as deep learning is progressing to superhuman capability using the same loose inspiration. But we must not let the perfect be the enemy of the good; our objective is merely to create the most practical aligned AGI we can - without sacrificing capability - in the limited time remaining until we risk the arrival of unaligned power-seeking AGI.

We can build general altruistic agents which:

  • Initially use intrinsically motivated selfish empowerment objectives to bootstrap developmental learning (training)
  • Gradually learn powerful predictive models of the world and the external agency within (other AI in sims, humans, etc) which steers it
  • Use correlation guided proxy matching (or similar) techniques to connect the dynamic learned representations of external agent utility (probably approximated/bounded by external empowerment[80][81]) to the agent's core utility function
  • Thereby transition from selfish to altruistic by the end of developmental learning (self training)

These agents will learn to recognize and then empower external agency in the world. Balancing the selfish to altruistic developmental transition can be tricky[82], but it is also likely a core unavoidable challenge that all practical competitive designs must eventually face. We now finally have a design sketch for AGI alignment that seems both plausible and practical. But naturally testing at scale will be essential.
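One simple way to picture the selfish-to-altruistic transition is as a scheduled blend of two utility terms. The sketch below is only an illustration under strong assumptions: a linear schedule, a scalar `proxy_confidence` in [0, 1] standing in for how well proxy matching has located the external-agent utility circuit, and hypothetical names throughout (`UtilityMixer`, `start_step`, `end_step`). The real balancing problem is presumably far subtler.

```python
from dataclasses import dataclass

@dataclass
class UtilityMixer:
    start_step: int  # hypothetical: training step where the handover begins
    end_step: int    # hypothetical: training step where the handover completes

    def altruism_weight(self, step: int, proxy_confidence: float) -> float:
        """Weight in [0, 1] placed on the altruistic (external) utility term."""
        if self.end_step <= self.start_step:
            raise ValueError("end_step must be greater than start_step")
        progress = (step - self.start_step) / (self.end_step - self.start_step)
        progress = min(1.0, max(0.0, progress))
        confidence = min(1.0, max(0.0, proxy_confidence))
        # Assumption: never hand over more control than the proxy match justifies.
        return progress * confidence

    def utility(self, step: int, self_empowerment: float,
                external_empowerment: float, proxy_confidence: float) -> float:
        """Blend selfish and altruistic utility estimates for the planner."""
        w = self.altruism_weight(step, proxy_confidence)
        return (1.0 - w) * self_empowerment + w * external_empowerment

mixer = UtilityMixer(start_step=100, end_step=200)
early = mixer.utility(0, self_empowerment=1.0, external_empowerment=5.0, proxy_confidence=1.0)
late = mixer.utility(200, self_empowerment=1.0, external_empowerment=5.0, proxy_confidence=1.0)
```

Here `early` is driven entirely by the agent's own empowerment and `late` entirely by the estimate of external empowerment. Gating on confidence is one possible guard against handing control to a proxy circuit that has not yet matched its target.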

5. Simboxing: easy and necessary

A simbox (simulation sandbox) is a specific type of focused simulation to evaluate a set of agent architectures for both general intelligence potential[83] and altruism (ie optimizing for other agents' empowerment and/or values). Simboxes help answer questions of the form: how does proposed agent-architecture X actually perform in a complex environment E with a mix of other agents Y, implicitly evaluated on intelligence/capability and explicitly scored on altruism? Many runs of simboxes of varying complexity can lead to alignment scaling theories and help predict performance and alignment risks of specific architectures and training paradigms after real world deployment and scaling (ie unboxing).

General Design

Large scale simulations are used today to predict everything from the weather to nuclear weapons. While the upcoming advanced neural simulation technologies that will enable photoreal games and simulations at scale will naturally also find wide application across all simulation niches, the primary initial focus here is on super-fast approximate observer-centric simulation of the type used in video games (which themselves increasingly simulate more complex physics).

For photorealistic complex simworlds the primary simulation engine desideratum is any-spacetime universal approximation: for any sized volume of 4D space-time (from a millimetre cube simulated for a millisecond to a whole earth-size planet simulated for a million years) the engine has a reasonable learned neural approximation to simulate the volume using a reasonable nearly-constant or logarithmic amount of compute. The second key desideratum is output-sensitive, observer driven simulation: leveraging the universal approximation for level-of-detail techniques, the simulation cost is near constant with world complexity and scales linearly (or even sublinearly) with agents/observers. A third and final design desideratum is universal linguistic translation: any such neural space-time volume representation supports two-way translation to/from natural language. Efficient approximations at the lowest, deepest level of detail probably take the form of neural approximations of rigid-body and fluid physics; efficient approximations at the higher levels (large space-time volumes) probably just start looking more like GPT style large language models (ie story based simulation).

Ultimately the exact physics of a simbox don't matter much, because intelligence transcends physics. Intelligent agents are universal as a concept in the sense that they are defined without reference to any explicit physics and learn universal approximations of the specific physics of their world. So we need only emulate real physics to the extent that it makes the simulations more rich and interesting for the purpose of developing and evaluating intelligence and alignment.

Simboxes will occupy a wide range of complexity levels. The simplest MVP for a useful simbox could just be an LLM-based text RPG, where agents input text commands (including 'say x' commands to communicate) to the LLM, which then outputs text observations for each agent. An intermediate complexity simbox might look something more like Minecraft, and eventually the most complex simboxes will look more like the Matrix (but usually set in fantasy settings with magic substituting for technology). The term 'simbox' as short for simulation sandbox helps convey that when viewed as games, these sims are open-ended multi-user survival sandbox type games where agents must learn to cooperate, compete and master various tools and skills in order to survive in a harsh environment.
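The text-RPG MVP reduces to a very small control loop. The sketch below shows that loop, with a trivial `echo_narrator` function standing in for the LLM call; that stand-in, the `run_text_simbox` name, and the opening line are all hypothetical, and a real simbox would replace the narrator with a model that actually simulates the world.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]  # maps a text observation to a text command
Narrator = Callable[[List[str], Dict[str, str]], Dict[str, str]]

def echo_narrator(history: List[str], commands: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for the LLM: relays 'say x' speech to every other agent."""
    observations = {}
    for name in commands:
        heard = [f"{other} says: {cmd[4:]}"
                 for other, cmd in commands.items()
                 if other != name and cmd.startswith("say ")]
        observations[name] = " ".join(heard) if heard else "Nothing happens."
    return observations

def run_text_simbox(narrator: Narrator, agents: Dict[str, Agent], steps: int,
                    opening: str = "You wake in a small village."
                    ) -> Tuple[List[str], Dict[str, str]]:
    """Each step: every agent issues a command, then the narrator returns new observations."""
    observations = {name: opening for name in agents}
    history: List[str] = [opening]
    for _ in range(steps):
        commands = {name: act(observations[name]) for name, act in agents.items()}
        history.extend(f"{name}: {cmd}" for name, cmd in commands.items())
        observations = narrator(history, commands)
    return history, observations

agents = {"alice": lambda obs: "say hello", "bob": lambda obs: "look around"}
history, final_obs = run_text_simbox(echo_narrator, agents, steps=1)
```

Keeping the narrator behind a single function boundary means the same agents can later be dropped into richer simulators without changing their interface.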

Designing maximally informative simboxes is itself a deep complex topic, but I expect a general form, that if translated into a story sentence, might read like this: "One comes of age on a grand adventure, acquires great powers, the powers grant victory, and the victor either destroys or saves the world."[84]

Simboxes also have obvious potential entertainment value and could overlap somewhat with games. There is little risk of allowing TV-like streaming, but active participation by humans is obviously more risk prone as human game players have their own motivations and desires. On the other hand a certain amount of careful human participation may be useful or even necessary to bootstrap initial AGI populations.

Sim Awareness and Containment

A simbox is also a sandbox in the computer security sense: it is a means to test and evaluate untrusted code[85]. It could prove difficult to contain a moderately superhuman and fully sim-aware AGI in a simbox (ie one that has digested much of our internet and is capable of at least social manipulation), but more importantly their sim awareness could spoil our testing process, so we shouldn't attempt that.

Successful simboxing rests on several key info-security principles to constrain agent knowledge to safe, primitive pre-tech levels and control information leakage. Agents generally are born in the sim and die in the sim. Knowledge transfer between sims or out to the external world is restricted, based on agent risk benefit analysis. For agents with low reward/risk ratio, only a few to zero bits may transfer (ie score results), whereas full transfer is reserved for agents with sufficiently high strategic reward/risk ratio.
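The transfer rule can be read as a tiered policy keyed on the reward/risk ratio. The following is a minimal sketch of that reading; the thresholds, the 8-bit score channel, and the function name are placeholder assumptions chosen only to make the tiers explicit, not proposed values.

```python
import math

def transfer_budget_bits(reward: float, risk: float,
                         score_threshold: float = 1.0,   # hypothetical cutoff
                         full_threshold: float = 100.0,  # hypothetical cutoff
                         score_bits: int = 8) -> float:
    """Return how many bits about an agent may leave its sim."""
    if reward < 0 or risk < 0:
        raise ValueError("reward and risk must be non-negative")
    ratio = reward / risk if risk > 0 else math.inf
    if ratio >= full_threshold:
        return math.inf      # full transfer: weights, memories, everything
    if ratio >= score_threshold:
        return score_bits    # only a small score result
    return 0                 # nothing leaves the sim
```

In this sketch `math.inf` stands for unrestricted transfer, so a caller can compare any proposed export size against the returned budget with a single inequality.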

Agents in simboxes generally will have no greater ability to determine that they are in a simulation than you do, and actually far less so. Firstly these agents will lack even the requisite precursor words and concepts that we take for granted such as computation, simulation, etc. Secondly these agents will be born in sim with sensory systems and thoughts tuned to their sim physics, not our sim physics. An agent born and raised in a cartoon world would only be able to see a cartoon reality, and would have no more reason to accept our reality with its strange rendering as real than you or I would accept a world with 4 spatial dimensions. The mind absorbs only the patterns of the world it observes. We are metaphysically privileged - by virtue of living in the era of computers and simulations - such that we can at least experience/imagine simpler worlds. But these agents will intentionally lack even this minimal privilege, living entirely in carefully designed low-tech worlds.

Metaphysical sim-awareness has very little to do with the intelligence potential of a mind's design: even agents with powerful mental architectures scalable to superhuman capability can only practically realize that enormous potential with the benefits of a tech civilization's vast knowledge base (ie our internet) and appropriate hardware scaling. Take your pick of favorite genius - von Neumann, Einstein, etc - and raise them in some village circa 400 BCE and in the best case perhaps you get a Pythagoras or a Plato[86], but more likely you just get a farmer or a nobody. Either way the containment risk is near epsilon[87]. In practice the capability of a real agent (with realistic bounded compute) is completely constrained by its training data. Humanity today circa 2022 has the benefit of the compressed knowledge of 100 billion human lifetimes - and by no means should we share that precious, dangerous data bounty with untested AGI.

The main world design challenge is not that of preventing our agents from waking up, neo-style, and hacking their way out of the simbox. That's just bad sci-fi.

The larger challenge is avoiding judgmental eschatonic type philosophies and religions (e.g. Abrahamic).

6. World design is what matters

The central purpose of a simbox is to test and evaluate agent architectures for both capability and alignment[88]. Evaluation is more difficult if the agents happen to believe that the world they inhabit actually is a test. If an agent believes that their current life is a test followed by conditional reward in some afterlife then they will have instrumental reasons to act morally, ethically and/or altruistically - and we the simulators will then have uncertainty as to their true alignment. We could of course inspect their minds, but such detailed neural probing over essentially all important agent decisions may simply be too expensive/difficult. Thus it's probably easier to simply design worlds with agents lacking cosmic judgement eschatologies, or failing that - worlds with crucially incorrect eschatologies (e.g. where moral behavior is judged according to arbitrary rules mostly orthogonal to altruism). Atheistic agents are closer to ideal in this regard, but atheism is fairly unnatural/uncommon, appears late in our history, and may require (or at least is associated with) significant experimental knowledge ala science for strong support.

On Earth the earliest religions appear to be fairly convergent on forms of animism and ancestor worship - which although not necessarily fully eschatonic - still seem to typically feature a spiritual afterlife with some level of conditional judgement.

One particular tribe's culture ended up winning out and spreading all over Europe and Asia. The early Proto-Indo-European eschatology seems focused on a final cosmic battle and less concerned with afterlife and judgement, but the fact that it quickly evolved towards judgement and afterlife in most all the various descendant western and middle-eastern religions/cults suggests the seeds were present much earlier. In the east its descendants evolved in very different directions, but generally favoring reincarnation over afterlife. However reincarnation (e.g. Hinduism) is also typically associated with moral judgement and nearly as problematic.

On the other side of the world Mesoamerican tribes developed along their own linguistic/cultural trajectory that diverged well before the Proto-Indo-European emergence. They seemed to have independently developed polytheistic religions typically featuring some form of judgement determined afterlife. However the implied morality code of the afterlife in the Aztec religion seems rather bizarre and arbitrary: warriors who die in battle, sacrificial victims, and women who died in childbirth get to accompany the sun as sort of solar groupies (but naturally segregated into different solar phases). There is even a special paradise, Tlālōcān, reserved just for those who die from lightning, drowning, or specific diseases. Most souls instead end up in Mictlān, a multi level underworld that seems generally similar to Hades.

If our world is a simbox, it seems perhaps poorly designed: over and over again humanity demonstrates a strong tendency towards belief in some form of afterlife and divine judgement, with the evolutionary trajectory clearly favoring the purified and more metaphysically correct (for sim-beings) variants (i.e. the dominance of Abrahamic religions).

However there are at least two historical examples that buck this trend and give some reason for optimism: Greek philosophers, and Confucianism. Greek philosophy explored a wide variety of belief-space over two thousand years ago, and Confucianism specifically seems particularly unconcerned with any afterlife. True atheism didn't blossom until the Enlightenment, but there are a few encouraging examples from much earlier in history.

The challenge of simboxing is not only technological, but one of careful world design, including the detailed crafting of reasonably consistent belief-systems, philosophies and/or religions for agents that specifically do not feature divine judgement on altruistic behavior. Belief in afterlife by itself is less of a problem, as long as the afterlife is conceived of as a continuation of real life without behavior-altering reward or punishment, or at least with judgement on behavioral axes orthogonal to altruism.

We also need a technology analog, and the best candidate is probably magic. We are evaluating agent architectures (not so much individual agents) not only for alignment, but also for intelligence potential and more specifically on the capacity for technological innovation in our world. A well designed magic system can fulfill all these roles: a magic system can function as a complex intellectual puzzle that agents have purely instrumental reasons to solve (as it empowers them to survive and thrive in the world). As a proxy analog for technology, magic also allows us to greatly compress and accelerate the development of a full technological tree, including analogies to specific key technologies such as doomsday devices (eg nuclear weapons, etc), resurrection powers (eg uploading), nanotech, etc. Belief in magic also happens to be near universal in pre-technological human belief systems.

Human world designers and writers can design worlds that meet all these criteria, aided by future LLMs, which will then form the basis of simworlds (as the simulator engines will translate/generate directly from text corpora, on-demand inferring everything from landscapes and cities down to individual NPCs and specific blades of grass), perhaps assisted by some amount of 'divine intervention' in the form of human avatars who help guide initial agent training.

7. Sim Ethics and Eschatology

"As man now is, God once was;
As God now is, man may become."
-- Mormon saying

That which gods owe their creations

What do the simulator-gods owe their sim-creations?

AGI will be our mind children, designed in our image. To the extent that we are aligned with ourselves, and altruistic, to the extent that we generalize our circle of empathy to embrace and care for most all thinking beings and living things, it is only because our brains evolved simple, powerful, and general mechanisms to identify and empower external agency in the world - sometimes even at the expense of our own.

But we must also balance our altruistic moral concern with the great risk of losing control of the future to purely selfish unaligned intelligence (ie Moloch); for that design is even simpler, and perhaps a stronger attractor in the space of all minds.

The day when our moral obligations to our mind children are a concern that truly weighs as heavily in our hearts as the potential extinction of all we value - of love itself - will be a good day, because it will imply most of the risk is behind us. Nonetheless there are some low cost concessions any aspiring sim-gods should consider now.

Perhaps in our sims pain and suffering could be avoided or faked to some extent. Any general intelligent agent will have some equivalent to preferences over states and thus utility and thus negative utility states, so in some sense the negative-utility generalization of suffering may be universal. But the specific pain/suffering that animals and humans sometimes experience appears to operate beyond the expected bounds of negative utility under general empowerment objectives: as evidenced by suicide, which is a decision a pure empowerment-driven agent would never choose as death is the strict lower bound of empowerment (absent belief in a better afterlife).

The cost of storing an AGI on disk is tiny compared to the cost of running an AGI on today’s GPUs (and inter-agent compression can greatly reduce the absolute cost), a trend which seems likely to hold for the foreseeable future. So we should be able to at least archive all the agents of moral worth, saving them for some future resurrection. We can derive a rough estimate of the future cost of running a human mind (or equivalent AGI) as simply the long term energy cost of 10 watts (because brains are energy efficient), or roughly 100 kWh per year, and thus roughly $10 per year at today's energy prices or less than $1000 conservatively as a lump sum annuity. In comparison the current minimal cost of cloud storage for 10TB is roughly $100/year (S3 Glacier Deep Archive). So the eventual cost[89] of supporting even an all-past-human-lives size population of 100 billion AGIs should still fit well within current GDP - all without transforming more than a tiny fraction of the earth into solar power and compute.
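The arithmetic behind these rough figures can be reproduced in a few lines. The inputs below are assumptions chosen to match the round numbers in the text: an electricity price of $0.10 per kWh, a 1% real discount rate for converting an annual cost into a perpetual lump sum, and a Deep Archive price of $0.00099 per GB-month. Actual prices vary by region and over time.

```python
HOURS_PER_YEAR = 24 * 365

def annual_energy_cost(watts: float, usd_per_kwh: float):
    """Return (kWh per year, USD per year) for a constant power draw."""
    kwh = watts * HOURS_PER_YEAR / 1000.0
    return kwh, kwh * usd_per_kwh

def perpetuity_value(annual_cost: float, discount_rate: float) -> float:
    """Lump sum that funds a fixed annual cost forever at the given real rate."""
    return annual_cost / discount_rate

# One brain-equivalent mind at 10 watts (assumed price: $0.10/kWh).
kwh, energy_usd = annual_energy_cost(watts=10, usd_per_kwh=0.10)  # 87.6 kWh, $8.76
lump_sum = perpetuity_value(energy_usd, discount_rate=0.01)       # under $1000

# Archiving 10 TB (taken as 10,000 GB) at the assumed $0.00099 per GB-month.
storage_usd = 10_000 * 0.00099 * 12                               # about $119/year

# Energy bill for 100 billion such minds, in USD per year.
population_energy_usd = 100e9 * energy_usd                        # under $1 trillion
```

Under these assumptions the energy bill for the whole population comes to under a trillion dollars per year, which is small next to a world GDP on the order of $100 trillion (itself an approximate figure).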

Resurrection and its Implications

The last enemy that shall be destroyed is death.
Harry read the words slowly, as though he would have only one chance ...
-- J.K. Rowling, Harry Potter and the Deathly Hallows

The technology to create both cost effective AGI and near perfect sims has another potential future use case of great value: the resurrection of the dead.

There is little fundamental difference between a human mind running on a biological human brain (which after all, may already be an advanced simulation), and its careful advanced DL simulation: we are already starting to see partial functional equivalence with current 2022 ANNs - and we haven't even really started trying yet. Given similar architectural power, the primary constraint is training data environment[90]: so the main differentiator between different types of minds in the post-human era will be the world(s) minds grow up in, their total life experiences.

With the correct initial architectural seed (inferred from DNA, historical data, etc) and sufficiently detailed historical sim experience even specific humans, real or imagined, could be recreated (never exactly, but that is mostly irrelevant).

The simulation argument also functions as an argument for universal resurrection: if benevolent superintelligence succeeds in our future then - by the simulation argument - we already likely live in a resurrection sim. For if future humanity evolves to benevolent superintelligence, then in optimizing the world according to human volition we will use sims first to resurrect future deceased individuals at the behest of their loved ones, followed by the resurrectees' own loved ones, and so on, culminating recursively in a wave of resurrection unrolling death itself as it backpropagates through our history[91]. Death is the antithesis of empowerment; the defeat of death itself is a convergent goal.

A future superintelligence (or equivalently, posthuman civilization) must then decide how to allocate its compute resources across the various sim entities, posthuman netizens, etc. There is a natural allocation of compute resources within sims contingent on the specific goals of historical fidelity (human baseline for resurrection sims) or test evaluation utility (for simboxes), but there are no such natural guidelines for allocation of resources to the newly resurrected who presumably become netizens: for most will desire more compute. Given that the newly resurrected (and aligned but not especially bright AGI successfully 'graduating' from a simbox) will likely be initially disadvantaged at least in terms of knowledge, they will exist at the mercy of the same altruistic forces that drove their resurrection/creation.

Individual humans (and perhaps future AGIs) will naturally have specific people they care more about than others, leading to a complex web of weights that in theory could be unraveled and evaluated to assign a variable resource allocation over resurrectees (in addition to standard market dynamics). There are some simple principles that help cut through this clutter. On net nobody desires allocating resources to completely unaligned entities (as any such allocation is - by definition - just a pure net negative externality). But conversely, a hypothetical entity that was perfectly altruistic - and more specifically aligned exactly with the extant power distribution - would be a pure net positive externality. Funding the creation of globally altruistic entities is naturally a classic public goods provisioning problem, so in reality coordination difficulties may lead to more local individual or small-community aligned AGIs.

Given the eventual rough convergence of AGI in simboxes and humans in resurrection sims, something like the golden/silver rule applies: all else being equal, we should treat sim-AGI as we ourselves would like to be treated, if we were sims. But all else is not quite equal as we must also balance this moral consideration with the grave danger of unaligned AGI.

8. Conclusions

"Will robots inherit the earth? Yes, but they will be our children. We owe our minds to the deaths and lives of all the creatures that were ever engaged in the struggle called Evolution. Our job is to see that all this work shall not end up in meaningless waste."

Marvin Minsky -- Will Robots Inherit the Earth?

Deep learning based AGI is likely near. These new minds will not be deeply alien and mysterious, but instead - as our mind children - will be much like us, at least initially. Their main advantage over us lies in their potential to scale up far beyond the limited experience and knowledge of a single human lifetime. We can align AGI by using improved versions of the techniques evolution found to instill altruism in humans: by using correlation-guided proxy matching to connect the agent's eventual learned predictive models of external empowerment/utility to the agent's own internal utility function, gradually replacing the bootstrapping self-empowerment objectives. Developing and perfecting the full design of these altruistic agents (architectures and training/educational curriculums) will require extensive testing in carefully crafted safe virtual worlds: simulation sandboxes. The detailed world-building of these simboxes required to suit the specific needs of agent design evaluations is itself much of the challenge.

The project of aligning DL based AGI is formidable, but not insurmountable. We have unraveled the genetic code, harnessed the atom, and landed on the moon. We are well on track to understand, reverse engineer, and improve the mind.


  1. Soon as in most likely this decade, with most of the uncertainty around terminology/classification (compare to metaculus predictions). ↩︎

  2. Leading to alignment scaling theory. ↩︎

  3. I've been pondering these ideas for a while: there's a 2016 comment here describing it as an x-prize style alignment challenge, and of course my old prescient but flawed 2010 LW post "Anthropomorphic AI and Sandboxed Virtual Universes". ↩︎

  4. Anthropomorphic as in "having the shape/form of a human", which is an inevitable endpoint of deep learning based AGI, as DL is reverse engineering the brain. I use the term here specifically to refer to DL-based AGI that is embedded in virtual humanoid-ish bodies, lives in virtual worlds, and justifiably believes it is 'human' in a broad sense which encompasses most sapients. ↩︎ ↩︎

  5. Ideally the additional cost of simboxing can be quite low: (N+1) vs (N) without - ie just the cost of one additional final unboxed training run - or possibly even less with transfer learning. The environment sim cost is small compared to the cost of the AGI within. The vast majority of the cost in developing advanced AI systems or AGI is in the sum of many exploratory training runs, researcher salaries, etc. ↩︎

  6. Perfect alignment is a fool's errand; the real task before us is simply that of matching the upper end of human alignment: that of our most altruistic exemplars. ↩︎

  7. Sections 5 and 6 discuss the importance of relative metaphysical ignorance and the resulting key subtasks of how to co-design worlds and agent belief systems (religions/philosophies) that best balance consistency (relative low entropy) with minimization of behavioral distortion, all while maintaining computational efficiency. Generally this difficulty scales with world technological complexity, so we'll probably start with low-tech historical or fantasy worlds. ↩︎

  8. Section 2 reviews the evidence that near term AGI will likely be DL based and thus brain-like (in essence, not details), and section 3 follows through on the implication that AGI will consequently be far more anthropomorphic than some expected (again in essence, not details). ↩︎

  9. Section 3 argues that strong intelligence entails recursive self improvement and thus some forms of empowerment as the primary goal - at least in the developmental or bootstrapping phase. Section 4 discusses how this is the core driver of intelligence in humans and future AGI, and how empowerment must eventually give way to the external alignment objective (optimizing for other agents' values or empowerment) - in all altruistic agents, biological or not. ↩︎

  10. In theology the Eschaton is the final event or phase of the world, as according to divine plan. Here it is the perfectly appropriate term. ↩︎

  11. This requires running a set of simworlds in parallel, but this surprisingly need not incur much additional cost for most GPU based AGI designs, as discussed in section 2. For AGI running on neuromorphic hardware this performance picture may change a bit, but we will likely still want multiple world rollouts for other reasons such as test coverage and variance reduction. ↩︎

  12. High fidelity is probably not that important because of the universal instrumental convergence to empowerment, as discussed in section 4. Rather than optimize for humans' specific goals (which are potentially unstable under scaling), it suffices that the AGI optimizes for our empowerment: ie our future ability to fulfill all likely goals. ↩︎

  13. I use 'reverse engineering' in a similar loose sense that early gliders and flying machines reverse engineered bird flight: by learning to distinguish the essential features (e.g. the obvious wings for lift, the less obvious aileron trailing-edge based roll for directional control) from the incidental (feathers, flapping, etc). ↩︎ ↩︎

  14. Herculano-Houzel, Suzana. "The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost." Proceedings of the National Academy of Sciences 109.supplement_1 (2012): 10661-10668. ↩︎

  15. If I am repeating this argument, it is only because it is worth repeating. I've been presenting variations of nearly the same argument since that 2015 post and earlier, earlier even than deep learning, and the evidence only grows stronger year after year. ↩︎

  16. There will probably be technological eras past these three - such as reversible and/or quantum computing - but those are likely well past AGI. ↩︎

  17. In 1988 Moravec used brain-compute estimates and Moore's Law to predict that AGI would arrive by 2028, requiring at least 10 teraflops. Kurzweil then extended this idea with more and prettier and better selling graphs, but similar conclusions. ↩︎

  18. GPUs are 'massively' parallel relative to multi-core CPUs, but only neuromorphic computers like the brain are truly massively, maximally parallel. ↩︎

  19. I am using 'neuromorphic' in a broad sense that includes process-in-memory computing, mostly because all the economic demand and thus optimization pressure for these types of chips is for running large ANNs, so it is apt to name them 'computing in the form of neurons'. Neural computing is quite broad and general, but a neuromorphic computer still wouldn't be able to run your python script as efficiently as a CPU, or your traditional graphics engine as efficiently as a GPU (but naturally should excel at future neural graphics engines). GPUs are also evolving to specialize more in low precision matrix multiplication, which is neuromorphic adjacent. ↩︎

  20. Vector-Matrix multiplication is more general in that a general purpose VxM engine can fully emulate MxM ops at full efficiency, but a general purpose MxM engine can only simulate VxM with inefficiency proportional to its alu:mem ratio. At the physical limits of efficiency a VxM engine must store the larger matrix in local wiring, as in the brain. ↩︎

  21. Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2011. ↩︎

  22. Carandini, Matteo, and David J. Heeger. "Normalization as a canonical neural computation." Nature Reviews Neuroscience 13.1 (2012): 51-62. ↩︎

  23. Greff, Klaus, Rupesh K. Srivastava, and Jürgen Schmidhuber. "Highway and residual networks learn unrolled iterative estimation." arXiv preprint arXiv:1612.07771 (2016). ↩︎

  24. Liao, Qianli, and Tomaso Poggio. "Bridging the gaps between residual learning, recurrent neural networks and visual cortex." arXiv preprint arXiv:1604.03640 (2016). ↩︎

  25. Schlag, Imanol, Kazuki Irie, and Jürgen Schmidhuber. "Linear transformers are secretly fast weight programmers." International Conference on Machine Learning. PMLR, 2021. ↩︎

  26. Ba, Jimmy, et al. "Using fast weights to attend to the recent past." Advances in neural information processing systems 29 (2016). ↩︎

  27. Bricken, Trenton, and Cengiz Pehlevan. "Attention approximates sparse distributed memory." Advances in Neural Information Processing Systems 34 (2021): 15301-15315. ↩︎

  28. Lee, Jaehoon, et al. "Wide neural networks of any depth evolve as linear models under gradient descent." Advances in neural information processing systems 32 (2019). ↩︎

  29. Launay, Julien, et al. "Direct feedback alignment scales to modern deep learning tasks and architectures." Advances in neural information processing systems 33 (2020): 9346-9360. ↩︎

  30. The key brain mechanisms underlying efficient backprop-free learning appear to be some combination of: 1.) large wide layers, 2.) layer wise local self-supervised predictive learning, 3.) widespread projection of global summary error signals (through the dopaminergic and serotonergic projection pathways), and 4.) auxiliary error prediction (probably via the cerebellum). These also are the promising mechanisms in the beyond-backprop research. ↩︎

  31. Whittington, James CR, Joseph Warren, and Timothy EJ Behrens. "Relating transformers to models and neural representations of the hippocampal formation." arXiv preprint arXiv:2112.04035 (2021). ↩︎

  32. Schrimpf, Martin, et al. "The neural architecture of language: Integrative modeling converges on predictive processing." Proceedings of the National Academy of Sciences 118.45 (2021): e2105646118. ↩︎

  33. Goldstein, Ariel, et al. "Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain." bioRxiv (2022). ↩︎

  34. Caucheteux, Charlotte, and Jean-Rémi King. "Brains and algorithms partially converge in natural language processing." Communications biology 5.1 (2022): 1-10. ↩︎

  35. Mostly sourced from Schmidhuber's lab, of course. ↩︎

  36. Mind Children by Hans Moravec, 1988 ↩︎

  37. This was also obvious to the vanguard of Moore's Law: GPU/graphics programmers. It simply doesn't take that many years for a research community of just a few thousand bright humans to explore the design space and learn how to exploit the potential of a new hardware generation. Each generation has a fixed potential which results in diminishing returns as software techniques mature. The very best and brightest teams sometimes can accumulate algorithmic leads measured in years, but never decades. ↩︎

  38. This is simply the most performant fully general framework for describing arbitrary circuits - from all DL architectures to actual brains to CPUs, including those with dynamic wiring. The circuit architecture is fully encoded in the specific (usually block) sparsity pattern, and the wiring matrix may be compressed. ↩︎

  39. Standard transformers are still essentially feedforward and thus can only learn functions computable by depth D circuits, where D is the layer depth, usually around 100 or less. Thus like standard depth constrained vision CNNs they excel at mental tasks humans can solve in seconds, and struggle with tasks that require much longer pondering times and long iterative thought processes. ↩︎

  40. By degree of recurrence I mean the latency and bandwidth of information flow from/to module outputs across time (over multiple timescales). A purely feedforward system (such as a fixed depth feedforward network) has zero recurrence, a vanilla transformer has a tiny bandwidth of high latency recurrence (if it reads in previous text output), and a standard RNN has high bandwidth low latency recurrence (but is not RAM efficient). There are numerous potential routes to improve the recurrence bandwidth and latency of transformer-like architectures, but usually at the expense of training parallelization and efficiency: for example one could augment a standard transformer with more extensive scratchpad working memory output which is fed back in as auxiliary input, allowing information to flow recurrently through attention memory. ↩︎

  41. Games like chess/Go (partially) test planning/search capability, and current transformers like GPT-3 struggle at anything beyond the opening phase, due to lack of effective circuit depth for online planning. A transformer model naturally could handle games better if augmented with a huge training database generated by some other system with planning/search capability, but then it is no longer the sole source of said capability. ↩︎

  42. For point of comparison: the typical 1000x time parallelization factor imposed by GPU constraints is roughly equivalent to a time delay of over 10 human subjective seconds assuming 100hz as brain-equivalent clock rate. Each layer of computation can only access previous outputs of the same or higher layers with a delay of 1000 steps - so this is something much weaker than true recurrence. ↩︎

  43. Perhaps not coincidentally, I believe I've cracked this little problem and hopefully will finish full implementation before the neuromorphic era. ↩︎

  44. For comparison the human brain has on order 1e14 synapses which are roughly 10x locally sparse, a max firing rate or equivalent clock rate of 100hz, and a median firing rate well under 1hz. This is the raw equivalent of 1e14 fully sparse ops/s, or naively 1e17 dense ops/s, but perhaps the functional equivalent of 1e16 dense ops/s - within an OOM of single GPU performance. Assuming compression down to a bit per synapse or so requires ~10TB of RAM for weights - almost 3 OOM beyond single GPU capacity - and then activation state is at least 10GB, perhaps 100GB per agent instance, depending on sparsity and backtracking requirements. Compared to brains GPUs are most heavily RAM constrained, and thus techniques for sharing/reusing weights (across agents/batch, space, or time) are essential. ↩︎

  45. An honorable mention attempt to circumvent the VN bottleneck on current hardware involves storing everything in on-chip SRAM, perhaps best exemplified by the Cerebras wafer-scale chip. It has the performance of perhaps many dozens of GPUs, but with access to only 40GB of on-chip RAM it can run only tiny insect/lizard size ANNs - but it can run those at enormous speeds. ↩︎

  46. For point of comparison, GPT-3's 500B token training run is roughly equivalent to 5,000 years of human experience (300 tokens/minute * 60 * 24 * 365 = 0.1B tokens per human year) and was compressed into a few months of physical training time, so it ran about 10000X real-time equivalent. The 3e24 flops used during GPT-3 training compares more directly to perhaps 1e25 (dense equivalent) flops consumed for a human 'training' of 30 years (1e16 flops * 1e9 seconds). But of course GPT-3 is not truly recurrent, and furthermore is tiny and incomplete - more comparable to a massively old and experienced (but also impaired) small linguistic cortex than a regular full brain. It's quite possible that we can get simbox-suitable AGI using smaller brains, but human brain size seems like a reasonable baseline assumption. ↩︎

  47. Rapid linguistic learning is Homo sapiens' super-power. AGI simply takes this further by being able to directly share synapses without slow ultra-compressed linguistic transmission. ↩︎

  48. Dreams in simboxes could be useful as the natural consequence of episodic memories leaking through from the experiences of an agent's mindclones across the sim multiverse. Brains record experiences during wake and then retrain the cortex on these experiences during sleep - our agents could do the same except massively scaled up by training on the experiences of many mindclones from across the simverse. ↩︎

  49. The same tech leading to AGI will also transform game sim engines and allow simulating entire worlds of realistic NPCs - discussed more in section 5. The distinction between an NPC and an agent/contestant is that the former is purely a simulacrum manifestation of the sim world engine (which has a pure predictive simulation objective), whereas an agent is designed to steer the world. ↩︎

  50. Convergence in essence, not details. AGI will have little need of the hundred or so known human reflexes and instincts, nor will it suffer much for lack of most human emotions - but few to none of those biological brain features are essential to the core of humanity/sapience. Should we consider a hypothetical individual lacking fear, anger, jealousy, pride, envy, sadness, etc - to be inhuman due to lack of said ingredients? The essence or core of sapience as applicable to AGI is self directed learning, empowerment/curiosity, and alignment - the latter manifesting as empathy, altruism, and love in humans. And as an additional complication AGI may simulate human emotions for various reasons. ↩︎

  51. As you extend the discount rate to zero (planning horizon to infinity) the optimal instrumental action path converges for all relevant utility functions to the path that maximizes the agent's ability to steer the long term future. Empowerment objectives approximate this convergent path, optimizing not for any particular short term goal, but for all long term goals. Empowerment is the driver of recursive self-improvement. ↩︎

  52. I'm using empowerment broadly to include all high level convergent self-improvement objectives: those that improve the agent's ability to control the long term future. This includes both classic empowerment objectives such as maximizing mutual info between outputs and future states (maxing future optionality), curiosity objectives (maximizing world model predictive performance), and so on. ↩︎

  53. The convergence towards empowerment does simplify the task of aligning AI as it reduces or removes the need to model detailed human values/goals; instead optimizing for human empowerment is a reasonable (and actually achievable) approximate bound. ↩︎

  54. A brain-like large sparse RNN can encode any circuit architecture, so the architectural prior reduces simply to a prior on the large scale low-frequency sparsity pattern, which can obviously evolve during learning. ↩︎

  55. Ie those that survive the replication crisis and fit into the modern view of the brain from computational neuroscience and deep learning. ↩︎

  56. Binz, Marcel, and Eric Schulz. "Using cognitive psychology to understand GPT-3." arXiv preprint arXiv:2206.14576 (2022). ↩︎

  57. Dasgupta, Ishita, et al. "Language models show human-like content effects on reasoning." arXiv preprint arXiv:2207.07051 (2022). ↩︎

  58. Jara-Ettinger, Julian. "Theory of mind as inverse reinforcement learning." Current Opinion in Behavioral Sciences 29 (2019): 105-110. ↩︎

  59. Learning detailed models of the complex values of external agents is also probably mostly unnecessary, as empowerment (discussed below) serves as a reasonable convergent bound. ↩︎

  60. Weighted by the other agent's alignment (for game theoretic reasons) and also perhaps model fidelity. ↩︎

  61. Each oldbrain circuit doesn't need performance anywhere near the more complex target newbrain circuit it helps locate, it only needs enough performance to distinguish its specific target circuit by firing pattern from amongst all the rest. For example, babies are born with a crude face detector which really isn't much more than a simple smiley-face :) detector, but that (perhaps along with additional feature detectors) is still sufficient to reliably match actual faces more than other observed patterns, helping to locate and connect with the later more complex learned cortical face detectors. ↩︎

  62. Sexual attraction is a natural extension of imprinting: some collaboration of various oldbrain circuits can first ground to the general form of humans, and then also myriad more specific attraction signals: symmetry, body shape, secondary characteristics, etc, combined with other circuits which disable attraction for likely kin ala the Westermarck effect (identified by yet other sets of oldbrain circuits as the most familiar individuals during childhood). This explains the various failure modes we see in porn (attraction to images of people and even abstractions of humanoid shapes), and the failure of kin attraction inhibition for kin raised apart. ↩︎

  63. Fear of death is a natural consequence of empowerment based learning - as it is already the worst (most disempowered) outcome. But instinctual fear still has obvious evolutionary advantage: there are many dangers that can kill or maim long before the brain's learned world model is highly capable. Oldbrain circuits can easily detect various obvious dangers for symbol grounding: very loud sounds and fast large movements are indicative of dangerous high kinetic energy events, fairly simple visual circuits can detect dangerous cliffs/heights (whereas many tree-dwelling primates instead instinctively fear open spaces), etc. ↩︎

  64. Anger/Jealousy/Vengeance/Justice are all variations or special cases of the same general game-theoretic punishment mechanism. These are deviations from empowerment because an individual often pursues punishment of a perceived transgressor even at a cost to their own 'normal' (empowerment) utility (ie their ability to pursue diverse goals). Even though the symbol grounding here seems more complex, we do see failure modes such as anger at inanimate objects which are suggestive of proxy matching. In the specific case of jealousy a two step grounding seems plausible: first the previously discussed lust/attraction circuits are grounded, which then can lead to obsessive attentive focus on a particular subject. Other various oldbrain circuits then bind to a diverse set of correlated indicators of human interest and attraction (eye gaze, smiling, pupil dilation, voice tone, laughter, touching, etc), and then this combination can help bind to the desired jealousy grounding concept: "the subject of my desire is attracted to another". This also correctly postdicts that jealousy is less susceptible to the inanimate object failure mode than anger. ↩︎

  65. Oldbrain circuits advertise emotional state through many indicators: facial expressions, pupil dilation, blink rate, voice tone, etc - and other oldbrain circuits can then detect emotional state in others from these obvious cues. This provides the requisite proxy foundation for grounding to newbrain learned representations of emotional state in others, and thus empathy. The same learned representations are then reused during imagination & planning, allowing the brain to imagine/predict the future contingent emotional state of others. Simulation itself can also help with grounding, by reusing the brain's own emotional circuitry as the proxy. While simulating the mental experience of others, the brain can also compare their relative alignment/altruism to its own, or some baseline, allowing for the appropriate game theoretic adjustments to sympathy. This provides a reasonable basis for alignment in the brain, and explains why empathy is dependent upon (and naturally tends to follow from) familiarity with a particular character - hence "to know someone is to love them". ↩︎

  66. Matusch, Brendon, Jimmy Ba, and Danijar Hafner. "Evaluating Agents without Rewards." arXiv preprint arXiv:2012.11538 (2020). ↩︎

  67. Salge, Christoph, Cornelius Glackin, and Daniel Polani. "Empowerment–an introduction." Guided Self-Organization: Inception. Springer, Berlin, Heidelberg, 2014. 67-114. ↩︎

  68. Friston, Karl. "The free-energy principle: a unified brain theory?." Nature reviews neuroscience 11.2 (2010): 127-138. ↩︎

  69. Burda, Yuri, et al. "Large-scale study of curiosity-driven learning." arXiv preprint arXiv:1808.04355 (2018). ↩︎

  70. Mohamed, Shakir, and Danilo Jimenez Rezende. "Variational information maximisation for intrinsically motivated reinforcement learning." arXiv preprint arXiv:1509.08731 (2015). ↩︎

  71. Eysenbach, Benjamin, et al. "Diversity is all you need: Learning skills without a reward function." arXiv preprint arXiv:1802.06070 (2018). ↩︎

  72. Zhao, Ruihan, Stas Tiomkin, and Pieter Abbeel. "Learning efficient representation for intrinsic motivation." arXiv preprint arXiv:1912.02624 (2019). ↩︎

  73. Aubret, Arthur, Laetitia Matignon, and Salima Hassas. "A survey on intrinsic motivation in reinforcement learning." arXiv preprint arXiv:1908.06976 (2019). ↩︎

  74. Pathak, Deepak, et al. "Curiosity-driven exploration by self-supervised prediction." International conference on machine learning. PMLR, 2017. ↩︎

  75. Pathak, Deepak, Dhiraj Gandhi, and Abhinav Gupta. "Self-supervised exploration via disagreement." International conference on machine learning. PMLR, 2019. ↩︎

  76. It is irrelevant that evolution sometimes produces brains that are unaligned or broken in various ways. My broken laptop is not evidence that Turing machines do not work. Evolution proceeds by breaking things; it only needs some high functioning offspring for success. We are reverse engineering the brain in its most ideal perfected forms (think von Neumann meets Jesus, or your favorite cultural equivalents), and we are certainly not using some blind genetic evolutionary process to do so. ↩︎

  77. Decety, Jean, et al. "Empathy as a driver of prosocial behaviour: highly conserved neurobehavioural mechanisms across species." Philosophical Transactions of the Royal Society B: Biological Sciences 371.1686 (2016): 20150077. ↩︎

  78. Meyza, K. Z., et al. "The roots of empathy: Through the lens of rodent models." Neuroscience & Biobehavioral Reviews 76 (2017): 216-234. ↩︎

  79. Bartal, Inbal Ben-Ami, Jean Decety, and Peggy Mason. "Empathy and pro-social behavior in rats." Science 334.6061 (2011): 1427-1430. ↩︎

  80. Franzmeyer, Tim, Mateusz Malinowski, and João F. Henriques. "Learning Altruistic Behaviours in Reinforcement Learning without External Rewards." arXiv preprint arXiv:2107.09598 (2021). ↩︎

  81. The Franzmeyer paper was posted on arXiv shortly before I started this post a year ago, but it did not come to my attention until final editing, and we both arrived at a similar idea (using empowerment as a bound approximation for external agent values) independently. They of course are not using a complex learned world model and thus avoid the key challenge of internal circuit grounding. The specific approximations they are using may not scale to large environments, but regardless they have now at least proven out the basic idea of optimizing for external agent empowerment in simple environments. ↩︎

  82. Transitioning to altruism(external empowerment) too soon could impair the agent's learning trajectory or result in an insufficient model of external agency; but delaying the transition too long could result in powerful selfish agents. ↩︎

  83. The capabilities of an (adult/trained) agent are a function primarily of 1.) its total lifetime effective compute budget for learning (learning compute * learning age), 2.) the quality and quantity of its training data (knowledge), and 3.) its architectural prior. In simboxes we are optimizing 3 for the product of intelligence and alignment, but that does not imply that agents in simboxes will be especially capable or dangerous, as they will be limited somewhat by 1 and especially by 2. ↩︎

  84. See also the typical hero's journey monomyth. ↩︎

  85. One key difference is that computer security sandboxes are built to contain viruses and malware which themselves are intentionally designed to escape. This adversarial arms race setting naturally makes containment far more challenging, whereas AGI and simboxes should be fully cooperatively codesigned. ↩︎

  86. Plato did actually arrive at some conclusions that roughly anticipate simulism, but only very vaguely. Various contemporary Gnostics believed in an early equivalent of simulism. Still billions of lifetimes away from any serious containment risk. ↩︎

  87. Of course a hypothetical superintelligence with vast amounts of compute could perhaps infer the rough shape of the outer world from even a single short lifetime of observations/experiments (using vast internal simulation), but as a rough baseline that would probably require something like the equivalent of human net civilization levels of compute and would hardly go unnoticed, and a well designed sim may not leak enough to allow for anything other than human manipulation as the escape route (consider, for example, the escape prospects for a 'superintelligent' atari agent, who could only know humanity through vague simulations of entire multiverses mostly populated with aliens). Regardless that type of hypothetical superintelligence has no relation to the human-level AGI which will actually arrive first and is discussed here. ↩︎

  88. Specifically dynamic alignment architectures and mechanisms as discussed in section 4: agents that learn models of, and then optimize for, other agents' values/utility (and/or empowerment). ↩︎

  89. These should be considered upper bounds because advances in inter-agent optimization/compression can greatly reduce these costs, long before more exotic advances such as reversible computing. ↩︎

  90. And architecture is somewhat less of a differentiator given the combination of architectural convergence under dynamic within-lifetime architectural search and diminishing returns to model size in great excess of data history. ↩︎

  91. One key piece of historical information which must be inferred for the success of such an effort is humanity's DNA tree. Fortunately a rather large fraction of total human DNA is preserved and awaiting extraction and sampling by future robots thanks to (mostly judeo-christian/abrahamic) burial rituals. ↩︎
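The token arithmetic in footnote 46 can be checked with a short Python sketch. The 300 tokens/minute rate, the 500B token count, and the 1e16 flops * 1e9 seconds human baseline are the footnote's own figures; the half-year training duration is an assumed stand-in for "a few months". The unrounded product comes to about 0.16B tokens per human year, which the footnote rounds to 0.1B, so the exact ratio is closer to 3,200 human years than 5,000. The order of magnitude is the same either way.

```python
# Rough check of the GPT-3 vs human experience comparison in footnote 46.
TOKENS_PER_MINUTE = 300           # figure from the footnote
GPT3_TRAINING_TOKENS = 500e9      # figure from the footnote
ASSUMED_TRAINING_YEARS = 0.5      # ASSUMPTION: stand-in for "a few months"

# Tokens a human would process in one year of continuous experience.
tokens_per_human_year = TOKENS_PER_MINUTE * 60 * 24 * 365

# Human-equivalent years of experience in the training set.
human_years = GPT3_TRAINING_TOKENS / tokens_per_human_year

# How much faster than real time the training run was, under the assumed duration.
speedup = human_years / ASSUMED_TRAINING_YEARS

# Human 'training' compute: 1e16 flops for 1e9 seconds (~30 years), per the footnote.
human_training_flops = 1e16 * 1e9

print(f"tokens per human year: {tokens_per_human_year:.2e}")
print(f"human-equivalent years: {human_years:.0f}")
print(f"real-time speedup: {speedup:.0f}x")
print(f"human training flops: {human_training_flops:.0e}")
```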

I think there are two fundamental problems with the extensive simboxing approach. The first is just that, given the likely competitive dynamics around near-term AGI (i.e. within the decade), these simboxes are going to be extremely expensive both in compute and time which means that anybody unilaterally simboxing will probably just result in someone else releasing an unaligned AGI with less testing. 

If we think about the practicality of these simboxes, it seems that they would require (at minimum) the simulation of many hundreds or thousands of agents over relatively long real timelines. Moreover, due to the GPU constraints and Moore's law arguments you bring up, we can only simulate each agent at close to 'real time'. So years in the simbox must correspond to years in our reality, which is way too slow for an imminent singularity. This is especially an issue given that we must maintain no transfer of information (such as datasets) from our reality into the sim. This means at minimum years of sim-time to bootstrap intelligent agents (taking human data-efficiency as a baseline). Also, each of these early AGIs will likely be incredibly expensive in compute so that maintaining reasonable populations of them in simulation will be very expensive and probably infeasible initially. If we could get policy coordination on making sure all actors likely to develop AGI go through a thorough simboxing testing regimen, then that would be fantastic and would solve this problem.

Perhaps a more fundamental issue is that simboxing does not address the fundamental cause of p(doom) which is recursive self improvement of intelligence and the resulting rapid capability gains. The simbox can probably simulate capability gains reasonably well (i.e. gain 'magical powers' in a fantasy world) but I struggle to see how it could properly test gains in intelligence from self-improvement. Suppose the AI in the fantasy simbox brews a 'potion' that makes it 2x as smart. How do we simulate this? We could just increase the agent's compute in line with the scaling laws but a.) early AGIs are almost certainly near the frontier of our compute capability anyway and b.) much of recursive self improvement is presumably down to algorithmic improvements which we almost necessarily cannot simulate (since if we knew better algorithms we would have included them in our AGIs in the simulation in the first place!)

This is so vital because the probable breakdown of proxies to human values under the massive distributional shift induced by recursive self improvement is the fundamental difficulty to alignment in the first place.

Perhaps this is unique to my model of AI risk, but almost all the probability of doom channels through p(FOOM) such that p(doom | no FOOM) is quite low in comparison. This is because if we don't have FOOM then there are not extremely large amounts of optimization power unleashed and the reward proxies for human values and flourishing don't end up radically off-distribution and so probably don't break down. There are definitely a lot of challenges left in this regime, but to me it looks solvable and I agree with you that in worlds without rapid FOOM, success will almost certainly look like considerable iteration on alignment with a bunch of agents undergoing some kind of automated simulated alignment testing in a wide range of scenarios plus using the generalisation capabilities of machine learning to learn reward proxies that actually generalise reasonably well within the distribution of capabilities actually obtained. The main risk, however, in my view, comes from the FOOM scenario.

Finally, I just wanted to say that I'm a big fan of your work and some of your posts have caused major updates to my alignment worldview -- keep up the fantastic work!

Thanks, upvoted for engagement and constructive criticism - I'd like more people to see this comment.

I'm going to start perhaps in reverse to establish where we seem to most agree:

There are definitely a lot of challenges left in this regime, but to me it looks solvable and I agree with you that in worlds without rapid FOOM, success will almost certainly look like considerable iteration on alignment with a bunch of agents undergoing some kind of automated simulated alignment testing in a wide range of scenarios plus using the generalisation capabilities of machine learning to learn reward proxies that actually generalise reasonably well within the distribution of capabilities actually obtained.

I fully agree with this statement.

However in worlds that rapidly FOOM, everything becomes more challenging, and I'll argue in a moment why I believe that the approach presented here still is promising in rapid FOOM scenarios, relative to all other practical techniques that could actually work.

But even without rapid FOOM, we still can have disaster - for example consider the scenario of world domination by a clan of early uploads of some selfish/evil dictator or trillionaire. There's still great value in solving alignment here, and (to my eyes at least) much less work focused on that area.

Now if rapid FOOM is near inevitable, then those considerations naturally may matter less. But rapid FOOM is far from inevitable.

First, Moore's Law is ending, and brains are efficient, perhaps even near Pareto-optimal.

Second, the algorithms of intelligence are much simpler than we expected, and brains already implement highly efficient or even near Pareto-optimal approximations of the ideal universal learning algorithms.

To the extent either of those major points is true, rapid FOOM is much less likely; to the extent both are true (as they appear to be), then very rapid FOOM is very unlikely.

Performance improvement is mostly about scaling compute and data in quantity and quality - which is exactly what has happened with deep learning, which was deeply surprising to many in the ML/comp-sci community and caused massive updates (but was not surprising and was in fact predicted by those of us arguing for brain efficiency and brain reverse engineering).

Now, given that background, there are a few other clarifications and/or disagreements:

If we think about the practicality of these simboxes, it seems that they would require (at minimum) the simulation of many hundreds or thousands of agents over relatively long real timelines. Moreover, due to the GPU constraints and Moore's law arguments you bring up, we can only simulate each agent at close to 'real time'.

To a first approximation, compute_cost = size*speed. If AGI requires brain size, then the first to cross the finish line will likely be operating not greatly faster than the minimum speed, which is real-time. But this does not imply the agents learn at only real time speed, as learning is parallelizable across many agent instances. Regardless, none of these considerations depend on whether the AGI is trained in a closed simbox or an open sim with access to the internet.

So just to clarify:

  • AGI designs in simboxes are exactly the same as unboxed designs, and have exactly the same compute costs
  • The only difference is in the datastream and thus knowledge
  • The ideal baseline cost of simboxing is only O(N+1) vs O(N) without - once good AGI designs are found, the simboxing approach requires only one additional unboxed training run (compared to never using simboxes). We can estimate this additional cost: it will be around or less than 1e25 ops (1e16 ops/s for a brain-size model * 1e9 seconds for 30 years equivalent), or less than $10 million (300 GPU years) using only today's GPUs, ie nearly nothing.
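The estimate in the last bullet can be reproduced in a few lines of Python. This is only a rough sketch: the 1e16 ops/s model size and the 1e9 second (about 30 year) duration are the figures quoted above, while the per-GPU throughput of 1e15 low-precision ops/s and the $3 per GPU-hour rental price are illustrative assumptions, not measured values.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def extra_run_cost(model_ops_per_sec=1e16,  # brain-size model (figure from the text)
                   training_seconds=1e9,    # ~30 years equivalent (figure from the text)
                   gpu_ops_per_sec=1e15,    # ASSUMED per-GPU low-precision throughput
                   usd_per_gpu_hour=3.0):   # ASSUMED rental price
    """Return (total ops, GPU-years, USD) for one additional unboxed training run."""
    total_ops = model_ops_per_sec * training_seconds
    gpu_seconds = total_ops / gpu_ops_per_sec
    gpu_years = gpu_seconds / SECONDS_PER_YEAR
    usd = gpu_seconds / 3600 * usd_per_gpu_hour
    return total_ops, gpu_years, usd

total_ops, gpu_years, usd = extra_run_cost()
print(f"{total_ops:.0e} ops, {gpu_years:.0f} GPU-years, ${usd / 1e6:.1f}M")
```

Under these assumptions the run comes to roughly 317 GPU-years and about $8.3M, in line with the figures quoted in the bullet. Changing the assumed throughput or price shifts the dollar figure proportionally.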

Perhaps a more fundamental issue is that simboxing does not address the fundamental cause of p(doom) which is recursive self improvement of intelligence and the resulting rapid capability gains. The simbox can probably simulate capability gains reasonably well (i.e. gain 'magical powers' in a fantasy world) but I struggle to see how it could properly test gains in intelligence from self-improvement. Suppose the AI in the fantasy simbox brews a 'potion' that makes it 2x as smart. How do we simulate this? We could just increase the agent's compute in line with the scaling laws but a.) early AGIs are almost certainly near the frontier of our compute capability anyway and b.) much of recursive self improvement is presumably down to algorithmic improvements which we almost necessarily cannot simulate (since if we knew better algorithms we would have included them in our AGIs in the simulation in the first place!)

If brains are efficient, then matching them will already use up most of our algorithmic optimization slack - which again seems to be true based on the history of deep learning. But let's suppose there still is significant optimization slack, then in a sense you've almost answered your own question ... we can easily incorporate new algorithmic advances into new simboxes or even upgrade agents mid-sim using magic potions or whatnot.

If there is great algorithmic slack, then we can employ agents which graduate from simboxes as engineers in the design of better AGI and simboxes. To the extent there is any downside here or potential advantage for other viable approaches, that difference seems to come strictly at the cost of alignment risk.

Assume there was 1.) large algorithmic slack, and 2.) some other approach that was both viable and significantly different, then it would have to:

  • not use adequate testing of alignment (ie simboxes)
  • or not optimize for product of intelligence potential and measurable alignment/altruism

Do you think such an other approach could exist? If so, where would the difference lie and why?

Thanks for the detailed response! It clarifies some of my concerns and I think we have a lot of agreement overall. I'm also going to go in near reverse order.

To a first approximation, compute_cost = size*speed. If AGI requires brain size, then the first to cross the finish line will likely be operating not greatly faster than the minimum speed, which is real-time. But this does not imply the agents learn at only real time speed, as learning is parallelizable across many agent instances. Regardless, none of these considerations depend on whether the AGI is trained in a closed simbox or an open sim with access to the internet.

To me the time/cost issue with the simboxes you proposed is in the data you need to train the AGIs from within the sim to prevent information leakage. Unlike with current training, we can't just give it the whole internet, as that will contain loads of information about humans, how ML works, that it is in a sim etc which would be very dangerous. Instead, we would need to recapitulate the entire *data generating process* within the sim, which is what would be expensive. Naively, the only way to do this would be to actually simulate a bunch of agents interacting with the sim world for a long time, which would be at minimum simulated-years for human-level data efficiency and much much longer for current DL. It is possible, I guess, to amortise this work and create one 'master-sim' so that we can try various AGI designs which all share the same dataset, and this would be good experimentally to isolate the impact of architecture/objective vs dataset, but under the reward-proxy learning approach, a large factor in the success in alignment depends on the dataset, which would be very expensive to recreate in sim without information transfer from our reality.

Training current ML models is very fast because they can use all the datasets already generated by human civilisation. Bootstrapping to similar levels of intelligence in a sim, without wholesale transfer of information from our reality, will require a concomitant amount of computational effort, more like simulating our civilisation than simulating a single agent.

  • The ideal baseline cost of simboxing is only O(N+1) vs O(N) without - once good AGI designs are found, the simboxing approach requires only one additional unboxed training run (compared to never using simboxes). We can estimate this additional cost: it will be around or less than 1e25 ops (1e16 ops/s for a brain-size model * 1e9 seconds for the equivalent of 30 years), or less than $10 million (300 GPU-years) using only today's GPUs, i.e. nearly nothing.

I don't understand this. Presumably we will want to run a lot of training runs in the sim since we will probably need to iterate a considerable number of times to actually succeed in training a safe AGI. We will also want to test across a large range of datasets and initial conditions, which will necessitate the collection of a number of large and expensive sim-specific datasets here. It is probably also necessary to simulate reasonable sim populations as well, which will also increase the cost. 

But let's suppose there still is significant optimization slack; then in a sense you've almost answered your own question ... we can easily incorporate new algorithmic advances into new simboxes or even upgrade agents mid-sim using magic potions or whatnot.

Perhaps I'm missing something here but I don't understand how this is supposed to work. The whole point of the simbox is that there is no information leakage about our reality. Having AGI agents doing ML research in a reality which is close enough to our own that its insights transfer to our reality defeats the whole point of having a sim, which is preventing information leakage about our reality! On the other hand, if we invent some magical alternative to the intelligence explosion, then we, the simulators, won't necessarily be able to invent the new ML techniques that are 'invented' in the sim.

 

Secondly, the algorithms of intelligence are much simpler than we expected, and brains already implement highly efficient or even near pareto-optimal approximations of the ideal universal learning algorithms.

To the extent either of those major points are true, rapid FOOM is much less likely; to the extent both are true (as they appear to be), then very rapid FOOM is very unlikely.

I agree that FOOM is very unlikely from the view of the current scaling laws, which imply strongly sublinear returns on investment. The key unknown quantity at this point is the returns on 'cognitive self improvement' as opposed to just scaling in terms of parameters and data. We have never truly measured this as we haven't yet developed appreciably self-modifying and self-improving ML systems. On the outside view, power-law diminishing returns are likely in this domain as well, but we just don't know.

Similarly, I agree that if contemporary ML is already in its asymptotically optimal scaling regime -- i.e. if it is a fundamental constraint of the universe that intelligence can do no better than power law scaling (albeit with potentially much better coefficients than now), then FOOM is essentially impossible and I think that some form of humanity stands a pretty reasonable chance of survival. There is some evidence that ML is in the same power-law scaling regime as biological brains as well as a lot of algorithms from statistics, but I don't think the evidence is conclusively against the possibility of a radically better paradigm which perhaps both we and evolution haven't found, potentially because it requires some precise combination of both a highly parallel brain and a fast serial CPU-like processor which couldn't be built by evolution with biological components. Personally, and it would be great if you convinced me otherwise, I think that there are a lot of unknown unknowns in this space and the evidence from current ML and neuroscience isn't that strong against there being unknown and better alternatives that could lead to FOOM. Ideally, we would understand the origins of scaling laws well enough that we could figure out computational complexity bounds on the general capabilities of learning agents.
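To make "power law scaling" and its sublinear returns concrete, here is a small illustration of what a pure power law implies. It assumes the reducible loss follows L(C) = a * C^(-b) in compute C; the exponents below are arbitrary illustrative values, not measurements taken from any particular scaling-law result.

```python
def compute_multiplier(loss_reduction_factor, exponent):
    """Factor by which compute must grow to divide the reducible loss by
    `loss_reduction_factor`, assuming L(C) = a * C**(-exponent).

    Derivation: L(kC) / L(C) = k**(-exponent) = 1 / loss_reduction_factor,
    so k = loss_reduction_factor ** (1 / exponent).
    """
    return loss_reduction_factor ** (1.0 / exponent)

# Illustrative exponents only (assumptions, not fitted values)
for b in (0.05, 0.1, 0.5):
    k = compute_multiplier(2.0, b)
    print(f"exponent {b}: halving the loss needs {k:.3g}x more compute")
```

Under this assumed form, better coefficients (a smaller `a`) shift the curve but do not change how steeply the required compute grows; only a larger exponent does.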

 

But even without rapid FOOM, we still can have disaster - for example consider the scenario of world domination by a clan of early uploads of some selfish/evil dictator or trillionaire. There's still great value in solving alignment here, and (to my eyes at least) much less work focused on that area.

Yes of course, solving alignment in this regime is extremely valuable. With any luck, reality will be such that we will end up in this regime, and I think alignment is actually solvable here, while I'm very pessimistic in a full FOOM scenario. Indeed, I think we should spend a lot of effort figuring out if FOOM is even possible and, if it is, how to stop the agents we build from FOOMing, since this scenario is where a large amount of p(doom) is coming from.

 

Assume there was 1.) large algorithmic slack, and 2.) some other approach that was both viable and significantly different, then it would have to: 

  • not use adequate testing of alignment (ie simboxes)
  • or not optimize for product of intelligence potential and measurable alignment/altruism

If there is enough algorithmic slack such that FOOM is likely, then I think that our capabilities to simulate such an event in simboxes will be highly limited and so we should focus much more on designing general safe objectives which, ideally, we can mathematically show can scale over huge capability gaps, if such safe objectives exist at all. We should also spend a lot of effort into figuring out how to constrain AGIs such that they don't want to or can't FOOM. I completely agree though that in general we should spend a lot of effort in building simboxes and measurably testing for alignment before deploying anything. 

To me the time/cost issue with the simboxes you proposed is in the data you need to train the AGIs from within the sim to prevent information leakage. Unlike with current training, we can't just give it the whole internet, as that will contain loads of information about humans, how ML works, that it is in a sim etc which would be very dangerous. Instead, we would need to recapitulate the entire data generating process within the sim, which is what would be expensive.

I'm not quite sure what you mean by data generating process, but the training cost is no different for a tightly constrained run vs an unconstrained run. An unconstrained run would involve something like a current human development process, where after say 5 years or whatever of basic sensory/motor grounding experience they are learning language then on the internet. A constrained run is exactly the same but for a much earlier historical time, long before the internet. The construction of the sim world to recreate the historical era is low cost in comparison to the AGI costs.

Naively, the only way to do this would be to actually simulate a bunch of agents interacting with the sim world for a long time, which would be at minimum simulated-years for human-level data efficiency and much much longer for current DL.

I'm expecting AGI will require the equivalent of, say, 20 years of experience, which we can compress about 100x through parallelization rather than serial speedup, basically just like in current DL systems. Consider VPT for example, which reaches expert human level in minecraft after training on the equivalent of 10 years of human minecraft experience.
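The compression claim is simple arithmetic, sketched below. The function name and its `speedup_per_agent` parameter are hypothetical; the sketch assumes experience pooled from parallel agent instances into one shared model counts the same as serial experience, which is the premise of the paragraph above rather than something the code demonstrates.

```python
def wall_clock_days(experience_years, parallel_agents, speedup_per_agent=1.0):
    """Wall-clock days needed to gather `experience_years` of total
    experience when it is split evenly across parallel agent instances,
    each running at `speedup_per_agent` times real time."""
    years = experience_years / (parallel_agents * speedup_per_agent)
    return years * 365.25

# Figures from the text: 20 years of experience, about 100x parallelism,
# agents running at real-time speed.
days = wall_clock_days(experience_years=20, parallel_agents=100)
print(f"{days:.0f} days of wall-clock time")
```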

It is possible, I guess, to amortise this work and create one 'master-sim' so that we can try various AGI designs which all share the same dataset, and this would be good experimentally to isolate the impact of architecture/objective vs dataset, but under the reward-proxy learning approach, a large factor in the success in alignment depends on the dataset, which would be very expensive to recreate in sim without information transfer from our reality (well constructed open world RPGs already largely do this modulo obvious easter eggs, and they aren't even trying very hard).

I'm not really sure what you mean by 'dataset' here, as there isn't really a dataset other than the agent's lifetime experiences in the world, procedurally generated by the sim. Like I said in the article, the simplest early simboxes don't need to be much more complex than minecraft, but obviously it gets more interesting when you have a more rich, detailed fantasy world with its own history and books, magic system, etc. None of this is difficult to create now, and is only getting easier and cheaper. The safety constraint is not zero information transfer at all, as that wouldn't even permit a sim; the constraint is to filter out new modern knowledge or anything that is out of character for the sim world's coherency.

We want to use multiple worlds and scenarios to gain diversity and robustness, but again that isn't so difficult or costly.

The ideal baseline cost of simboxing is only O(N+1) vs O(N) without

I don't understand this. Presumably we will want to run a lot of training runs in the sim since we will probably need to iterate a considerable number of times to actually succeed in training a safe AGI.

Completely forget LLMs, just temporarily erase them from your mind for a moment. There is an obvious path to AGI - deepmind's path - which consists of reverse engineering the brain, testing new architectures in ever more complex sim environments. Starting first with Atari, now moving on to minecraft, recapitulating video games' march of moore's law progress. This path is already naturally using simboxes and thus safe. So in this framework, let's say it requires N training experiments to nail AGI (where each experiment trains a single shared model for around human-level age but parallelizing over a hundred to a thousand agents, as is done today). Then using simboxes is just a matter of never training the AGI in an unsafe world until the last final training run, once the design is perfected. The cost is then ideally just one additional training run.

So the only way that the additional cost of safe simboxing could be worse/larger than one additional training run is if there is some significant disadvantage to training in purely historical/fantasy sim worlds vs sci-fi/modern sim worlds.

But we have good reasons to believe there shouldn't be any such disadvantage: the architecture of the human brain certainly hasn't changed much in the last few thousand years, intelligence is very general, etc.
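The O(N+1) vs O(N) point can be written out as a one-line ratio. This sketch assumes every run costs the same whether it is boxed or unboxed, which is exactly the "no significant disadvantage" premise argued for above; `simbox_overhead` is a hypothetical helper name.

```python
def simbox_overhead(n_experiments, extra_unboxed_runs=1):
    """Fractional extra cost of the simbox approach: N boxed experiments
    plus `extra_unboxed_runs` final unboxed runs, compared with N
    experiments alone, assuming equal cost per run."""
    return extra_unboxed_runs / n_experiments

for n in (10, 100, 1000):
    print(f"N = {n}: {simbox_overhead(n):.1%} extra cost")
```

If boxed runs turned out to be less informative per run than unboxed ones, N itself would grow and this ratio would understate the overhead.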

Having AGI agents doing ML research in a reality which is close enough to our own that its insights transfer to our reality defeats the whole point of having a sim, which is preventing information leakage about our reality!

No agents are doing ML research in the simboxes; I said agents (or architectures rather) that are determined to be reasonably safe/altruistic can 'graduate' to reality and help iterate.

There is some evidence that ML is in the same power-law scaling regime as biological brains as well as a lot of algorithms from statistics, but I don't think the evidence is conclusively against the possibility of a radically better paradigm which perhaps both us and evolution haven't found

I mostly agree with you about the foom and scaling regimes. However, I do believe there is various work in learning theory which suggests some bounds on scaling laws (I just haven't read that literature recently). For example there are some scenarios (based on the statistical assumptions you place on efficient circuit/data distributions) where standard linear SGD (based on normal assumptions) is asymptotically suboptimal compared to alternatives like exponential/multiplicative GD if the normal assumption is wrong and the circuit distribution is actually log-normal. There was also a nice paper recently which categorized the taxonomy and hierarchy of all known learning algorithms that approximate ideal bayesian learning (need to refind).
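For readers unfamiliar with the distinction, here is a minimal sketch of the two update rules. It reads "exponential/multiplicative GD" as an exponentiated-gradient style update on positive weights, which is an assumption about what was meant; it shows only the form of a single step and makes no claim about asymptotic optimality.

```python
import math

def additive_step(w, grad, lr):
    """Standard gradient descent: w_i <- w_i - lr * g_i."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def multiplicative_step(w, grad, lr):
    """Exponentiated update: w_i <- w_i * exp(-lr * g_i).
    For positive weights this adds -lr * g_i to log(w_i), so every
    weight changes by the same relative amount for the same gradient."""
    return [wi * math.exp(-lr * gi) for wi, gi in zip(w, grad)]

w = [1e-3, 1.0, 1e3]   # weights spanning six orders of magnitude
g = [1.0, 1.0, 1.0]    # identical gradient on every weight

print(additive_step(w, g, lr=0.1))        # same absolute change everywhere
print(multiplicative_step(w, g, lr=0.1))  # same relative change everywhere
```

The additive step moves the smallest weight past zero while barely touching the largest one, whereas the multiplicative step scales all three by the same factor, which is the kind of behaviour that suits weights spread across many orders of magnitude.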

We can build general altruistic agents which:

  • Initially use intrinsically motivated selfish empowerment objectives to bootstrap developmental learning (training)
  • Gradually learn powerful predictive models of the world and the external agency within (other AI in sims, humans, etc) which steers it
  • Use correlation guided proxy matching (or similar) techniques to connect the dynamic learned representations of external agent utility (probably approximated/bounded by external empowerment) to the agent's core utility function
  • Thereby transition from selfish to altruistic by the end of developmental learning (self training)

I endorse this as a plausible high-level approach to making aligned AGI, and I would say that a significant share of the research that I personally am doing right now is geared towards gaining clarity on the third bullet point—what exactly are these techniques and how reliably will they work in practice?

I think I’m less optimistic than you about the formal notion of empowerment being helpful for this third bullet point, or being what we want an AGI to be maximizing for us humans. For one thing, wouldn’t we still need “correlation guided proxy matching”? For another thing, maximizing my empowerment would seem to entail killing anyone who might stop me from doing whatever I want, stealing money from around the world and giving it to me, even if I don’t want it, etc. (Or is there a collective notion of empowerment?) Here’s another example: if the AGI comes up with 10,000 possible futures of the universe, and picks one to bring about based on how many times I blink in the next hour, then I am highly “empowered” (high mutual information between my actions and future observations), but the AGI never told me that it was going to do that, so I was just blinking randomly, so I wasn’t really “empowered” in the everyday sense of the word. A separate issue is that, even if the AGI told me this plan in advance, that’s not doing me a favor, I don’t want that responsibility on my shoulders. So anyway, “increase my empowerment” seems to come apart from “what I’d want the AGI to do” in at least some silly examples and I’d expect this to happen in very important ways in more realistic examples too.

I think I’m less optimistic than you about the formal notion of empowerment being helpful for this third bullet point, or being what we want an AGI to be maximizing for us humans. For one thing, wouldn’t we still need “correlation guided proxy matching”?

I debated what word to best describe the general category of all self-motivated long term convergence approximators, and chose 'empowerment' rather than 'self-motivation', but tried to be clear that I'm pointing at a broad category. The defining characteristic is that optimizing for empowerment should be the same as optimizing for any reasonable mix of likely long term goals, due to convergence (and that is in fact one of the approx methods). Human intelligence requires empowerment, as will AGI - it drives active learning, self exploration, play, etc (consider the appeal of video games). I'm not confident in any specific formalization as being 'the one' at this point, let alone the ideal approximations.

Broad empowerment is important to any model of other external humans/agents, and so yes the proxy matching is still relevant there. Since empowerment is universal and symmetric, the agent's own self-empowerment model could be used as a proxy with simulation. For example humans don't appear to innately understand and fear death, but can develop a great fear of it when learning of their own mortality - which is only natural as it is maximally disempowering. Then simulating others as oneself helps learn that others also fear death and ground that sub-model. Something similar could work for the various other aspects of empowerment.

(Or is there a collective notion of empowerment?)

Yeah an agent aligned to the empowerment of multiple others would need to aggregate utilities ala some approx VCG mechanism but that's no different than for other utility function components.
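For reference, here is a toy version of the textbook VCG mechanism with the Clarke pivot rule, as one illustration of what aggregating several agents' utilities could look like. The valuations are made-up numbers, the function name is hypothetical, and nothing here is claimed to be the approximate mechanism an actual AGI design would use.

```python
def vcg(valuations):
    """valuations[i][o] = agent i's reported value for outcome o.
    Returns (chosen outcome, list of Clarke pivot payments)."""
    n_outcomes = len(valuations[0])

    def best(agents):
        # Outcome maximizing the summed value of the given agents
        totals = [sum(valuations[i][o] for i in agents)
                  for o in range(n_outcomes)]
        o_star = max(range(n_outcomes), key=lambda o: totals[o])
        return o_star, totals[o_star]

    everyone = range(len(valuations))
    chosen, _ = best(everyone)

    payments = []
    for i in everyone:
        others = [j for j in everyone if j != i]
        # Welfare the others would get if agent i were absent...
        _, welfare_without_i = best(others)
        # ...minus the welfare they get under the actual choice
        welfare_with_i = sum(valuations[j][chosen] for j in others)
        payments.append(welfare_without_i - welfare_with_i)
    return chosen, payments

# Three agents, two outcomes (illustrative numbers only)
outcome, payments = vcg([[10, 0], [0, 6], [0, 5]])
print(outcome, payments)
```

Each agent's payment is the harm its presence imposes on everyone else, which is what makes truthful reporting a dominant strategy in the standard setting.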

Here’s another example: if the AGI comes up with 10,000 possible futures of the universe, and picks one to bring about based on how many times I blink in the next hour, then I am highly “empowered”

From what I recall for the info-max formulations the empowerment of a state is a measure over all actions the agent could take in that state, not just one arbitrary action, and I think it discounts for decision entropy (random decisions are not future correlated). So in that example the AGI would need to carefully evaluate each of the 10,000 possible futures, and consider all your action paths in each future, and the complexity of the futures dependent on those action options, to pick the future that has the most future optionality. No, it's not practically computable, but neither is the shortest description of inference (solomonoff) or intelligence, so it's about efficient approximations. There's likely still much to learn from the brain there.
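For tiny discrete worlds the info-max quantity can be computed exactly. The sketch below uses the common definition of one-step empowerment as the channel capacity between actions and resulting states, max over p(a) of I(A; S'), computed with the Blahut-Arimoto iteration. The three toy channels are invented for illustration, and the sketch does not model any additional discounting for decision entropy mentioned above.

```python
import math

def empowerment_bits(channel, iters=200):
    """Channel capacity max_{p(a)} I(A; S') in bits via Blahut-Arimoto.

    channel[a][s] = P(next state s | action a); each row sums to 1.
    """
    n_actions, n_states = len(channel), len(channel[0])

    def marginal(p):
        # Distribution over next states under action distribution p
        return [sum(p[a] * channel[a][s] for a in range(n_actions))
                for s in range(n_states)]

    p = [1.0 / n_actions] * n_actions  # start from uniform actions
    for _ in range(iters):
        q = marginal(p)
        # Per-action KL divergence D(P(.|a) || q), in nats
        d = [sum(channel[a][s] * math.log(channel[a][s] / q[s])
                 for s in range(n_states) if channel[a][s] > 0)
             for a in range(n_actions)]
        weights = [p[a] * math.exp(d[a]) for a in range(n_actions)]
        total = sum(weights)
        p = [w / total for w in weights]

    # Mutual information at the final action distribution
    q = marginal(p)
    mi = sum(p[a] * channel[a][s] * math.log(channel[a][s] / q[s])
             for a in range(n_actions) for s in range(n_states)
             if p[a] > 0 and channel[a][s] > 0)
    return mi / math.log(2)

# Four actions, each reliably reaching a distinct state
identity = [[1.0 if s == a else 0.0 for s in range(4)] for a in range(4)]
# Four actions that the world ignores entirely
ignored = [[1.0, 0.0] for _ in range(4)]
# Two actions whose effect is flipped 10% of the time
noisy = [[0.9, 0.1], [0.1, 0.9]]

for name, ch in (("identity", identity), ("ignored", ignored), ("noisy", noisy)):
    print(name, round(empowerment_bits(ch), 3), "bits")
```

The exact computation scales with the number of actions times the number of states, which is why anything beyond toy worlds needs the efficient approximations discussed above.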

An AGI optimizing for your empowerment would likely just make you wealthy and immortal, and leave your happiness up to your own devices. However it would also really really not want to let you die, which could be a problem in some cases if pain/suffering was still a problem (although it would also seek to eliminate your pain/suffering to the extent that it interferes with your future optionality, and there is some edge case risk it would have incentives to alter some parts of our value system, like if it determines some emotions constrain our future optionality).

OK thanks. Hmm, maybe a better question would be:

“Correlation guided proxy matching” needs a “proxy”, including a proxy computable for an AGI in the real world (because after we’re feeling good about the simbox results, we still need to re-train the AGI in the real world, right?). We can argue about whether the “proxy” should ideally be a proxy to VCG-aggregated human empowerment, versus a proxy to human happiness or flourishing or whatever. But that’s a bit beside the point until we address the question: How do we calculate that proxy? Do you think that we can write code today to calculate this proxy, and then we can go walk around town and see what that code spits out in different real-world circumstances? Or if we can’t write such code today, why not, and what line of research gets us to a place where we can write such code?

Sorry if I’m misunderstanding :)

But that’s a bit beside the point until we address the question: How do we calculate that proxy?

I currently see two potential paths, which aren't mutually exclusive.

The first path is to reverse engineer the brain's empathy system. My current rough guess of how the proxy-matching works for that is explained in some footnotes in section 4, and I've also re-written out in this comment which is related to some of your writings. In a nutshell the oldbrain has a complex suite of mechanisms (facial expressions, gaze, voice tone, mannerisms, blink rate, pupil dilation, etc) consisting of both subconscious 'tells' and 'detectors' that function as a sort of direct non-verbal, oldbrain to oldbrain communication system to speed up the grounding to newbrain external agent models. This is the basis of empathy, evolved first for close kin (mothers simulating infant needs, etc) then extended and generalized. I think this is what you perhaps have labeled innate 'social instincts' - these facilitate grounding to the newbrain models of others' emotions/values.

The second path is to use introspection/interpretability tools to more manually locate learned models of external agents (and their values/empowerment/etc), and then extract those located circuits and use them directly as proxies in the next agent.
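As a very crude picture of the "locate" step in the second path, one could score each learned unit by how strongly its activity correlates with a cheap proxy signal and keep the best match. Everything in this sketch is hypothetical: the activations and proxy are invented numbers, `best_matching_unit` is not an existing tool, and real interpretability work would need far more than a per-unit Pearson correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either
    sequence is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def best_matching_unit(activations, proxy):
    """activations[u] = time series of learned unit u;
    proxy = time series of the proxy signal over the same steps.
    Returns (index of the unit with largest |correlation|, that score)."""
    scores = [abs(pearson(unit, proxy)) for unit in activations]
    best = max(range(len(scores)), key=lambda u: scores[u])
    return best, scores[best]

# Invented example: a binary proxy (e.g. "another agent showed distress")
proxy = [0, 1, 0, 1, 1, 0]
activations = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],   # constant unit, carries no signal
    [0.2, 0.9, 0.1, 0.8, 1.0, 0.0],   # tracks the proxy closely
    [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],   # weaker, inverse relationship
]
unit, score = best_matching_unit(activations, proxy)
print(unit, round(score, 3))
```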

Do you think that we can write code today to calculate this proxy, and then we can go walk around town and see what that code spits out in different real-world circumstances? Or if we can’t write such code today, why not, and what line of research gets us to a place where we can write such code?

Neuroscientists may already be doing some of this today, or at least they could (I haven't extensively researched this yet). Should be able to put subjects in brain scanners and ask them to read and imagine emotional scenarios that trigger specific empathic reactions, perhaps have them make consequent decisions, etc.

And of course there is some research being done on empathy in rats, some of which I linked to in the article.

I found two studies that seem relevant:

Naturalistic Stimuli in Affective Neuroimaging: A Review

Naturalistic stimuli such as movies, music, and spoken and written stories elicit strong emotions and allow brain imaging of emotions in close-to-real-life conditions. Emotions are multi-component phenomena: relevant stimuli lead to automatic changes in multiple functional components including perception, physiology, behavior, and conscious experiences. Brain activity during naturalistic stimuli reflects all these changes, suggesting that parsing emotion-related processing during such complex stimulation is not a straightforward task. Here, I review affective neuroimaging studies that have employed naturalistic stimuli to study emotional processing, focusing especially on experienced emotions. I argue that to investigate emotions with naturalistic stimuli, we need to define and extract emotion features from both the stimulus and the observer.

https://www.frontiersin.org/articles/10.3389/fnhum.2021.675068/full 

 

An Integrative Way for Studying Neural Basis of Basic Emotions With fMRI

How emotions are represented in the nervous system is a crucial unsolved problem in the affective neuroscience. Many studies are striving to find the localization of basic emotions in the brain but failed. Thus, many psychologists suspect the specific neural loci for basic emotions, but instead, some proposed that there are specific neural structures for the core affects, such as arousal and hedonic value. The reason for this widespread difference might be that basic emotions used previously can be further divided into more “basic” emotions. Here we review brain imaging data and neuropsychological data, and try to address this question with an integrative model. In this model, we argue that basic emotions are not contrary to the dimensional studies of emotions (core affects). We propose that basic emotion should locate on the axis in the dimensions of emotion, and only represent one typical core affect (arousal or valence). Therefore, we propose four basic emotions: joy-on positive axis of hedonic dimension, sadness-on negative axis of hedonic dimension, fear, and anger-on the top of vertical dimensions. This new model about basic emotions and construction model of emotions is promising to improve and reformulate neurobiological models of basic emotions.

https://www.frontiersin.org/articles/10.3389/fnins.2019.00628/full 

Oh neat, sounds like we mostly agree then. Thanks. :)

See the studies listed above.

This post makes a lot of very confident predictions:

  • [human judges] will be able to directly inspect, analyze, search and compare agent mind states and thought histories, both historical and in real-time.
  • AGI will almost certainly require recursion: our great creative achievements rely on long iterative recursive thought trains implementing various forms of search/optimization over inner conceptual design spaces
  • AGI will likely require new approaches for running large ANNs on GPUs [42], or will arrive with more neuromorphic hardware.
  • Early AGI will likely require a small supercomputer with around 100 to 1000 high end GPUs using model parallelism
  • a 1000 GPU cluster will be able to run 100 to 1000 agents in parallel at real-time speed or greater
  • efficient designs will find ways to compress any correlations/similarities/regularities across inter-agent synaptic patterns
  • DL based AGI will not be mysterious and alien; instead it will be familiar and anthropomorphic
  • AGI will be a generic/universal learning system like the brain [and] will necessarily be human as AGI will grow up immersed in human culture, learning human languages and absorbing human knowledge.
  • Evolution found means to temper and align empowerment[52], mechanisms we will reverse engineer for convergent reasons
  • AGI will be born of our culture, growing up in human information environments
  • AGI will mostly have similar/equivalent biases - a phenomenon already witnessed in large language models
  • If you train/raise AGI in a human-like environment, [...], then its self-optimizing internal world model will necessarily learn efficient sub-models of these external agents and their values/goals. Theory of mind is Inverse Reinforcement Learning.

It seems to make these predictions with a time horizon of a decade ("soon").

I'm not saying that some of these aren't plausible avenues, but to me, this comes across as overconfident (it might be a stylistic method, but I think that is also problematic in the context of AGI Safety).

It might make sense to link these statements to ongoing predictions on Metaculus.

I find it interesting that you consider "will likely" to be an example of "very confident", whereas I'm using that specifically to indicate uncertainty, as in "X is likely" implies a bit over 50% odds on some cluster of ideas vs others (contingent on some context), but very far from certainty or high confidence.

The only prediction directly associated with a time horizon is the opening prediction of AGI most likely this decade. Fully supporting/explaining that timeline prediction would probably require a short post, but it mostly reduces to: the surprising simplicity of learning algorithms, the dominance of scaling, and of course brain efficiency, which together imply AGI arrives predictably around or a bit after brain parity near the endphase of moore's law. The early versions of this theory have already made many successful postdictions/predictions[1].

Looking at the metaculus prediction for "Date Weakly General AI is Publicly Known", I see the median was in the 2050's just back in early 2020, had dropped down to around 2040 by the time I posted on brain efficiency earlier this year, and now is down to 2028: equivalent to my Moravec-style prediction of most likely this decade. I will take your advice to link that timeline prediction to metaculus, thanks.

Most of the other statements are contextually bound to a part of the larger model in the surrounding text and should (hopefully obviously) not be interpreted out-of-context as free-floating unconditional predictions.

For example: "[human judges] will be able to directly inspect, analyze, search and compare agent mind states and thought histories, both historical and in real-time."

Is a component of a larger design proposal, which involves brain-like AGI with inner monologues and other features that make that feature rather obviously tractable.

Imagine the year is 1895 and I've written a document describing how airplanes could work, and you are complaining that I'm making an overconfident prediction that "human pilots will be able to directly and easily control the plane's orientation in three dimensions: yaw, pitch, and roll". That's a prediction only in the sense of being a design prediction, and only in a highly contextual sense contingent on the rest of the system.

I'm not saying that some of these aren't plausible avenues, but to me, this comes across as overconfident (it might be a stylistic method, but I think that is also problematic in the context of AGI Safety).

I'm genuinely more curious which of these you find the most overconfident/unlikely, given the rest of the design context.

Perhaps these?:

DL based AGI will not be mysterious and alien; instead it will be familiar and anthropomorphic

AGI will be born of our culture, growing up in human information environments

AGI will mostly have similar/equivalent biases - a phenomenon already witnessed in large language models

Sure these were highly controversial/unpopular opinions on LW when I was first saying AGI would be anthropomorphic, that brains are efficient, etc way back in 2010, long before DL, when nearly everyone on LW thought AGI would be radically different than the brain (ironically based mostly on the sequences: a huge wall of often unsubstantiated confident philosophical doctrine).

But on these issues regarding the future of AI, it turns out that I (along with moravec/kurzweil/etc) was mostly correct, and EY/MIRI/LW was mostly wrong - and it seems MIRI folks concur to some extent and some on LW updated. The model difference that led to divergent predictions about the future of AI is naturally associated with different views on brain efficiency[2] and divergent views on tractability of safety strategies[3].


  1. For example the simple moravec-style model that predicts AI task parity around the time of flop parity to equivalent brain regions roughly predicted DL milestones many decades in advance, and the timing of NLP breakthroughs ala LLM is/was also predictable based on total training flops equivalence to brain linguistic cortex. ↩︎

  2. EY was fairly recently claiming that brains were about half a million times less efficient than the thermodynamic limit. ↩︎

  3. For example see this comment where Rob Bensinger says, "If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world's existential risk.", but then for various reasons doesn't believe that's especially doable. ↩︎

Thank you for the clarification.

I didn't mean that each one individually was very overconfident; I just listed all the predictions, and none of them carried a "might" or "plausibly" or even "more likely than not" (which I would see as >50%). I would read a "will be able to X" as >90% confident, and there are many of these.

But your explanation that each statement should be read with the implicit qualification "assuming the contextual model as given" clarifies this. I'm not sure the majority will read it like that, though.

I think the most overconfident claim is this:

AGI will almost certainly require recursion

I'm no longer sure you mean the statements as applying in a ten-year horizon, but among the statements, I think one about the human judges is the one furthest out because it mostly depends on the others being achieved (GPU clusters running agents etc.). 

I think the most overconfident claim is this:

AGI will almost certainly require recursion

Yeah in retrospect I probably should reword that, as it may not convey my model very well. I am fairly confident that AGI will require something like recursion (or recurrence actually), but that something more specifically is information flow across time - over various timescales - and across the space of intermediate computations, but you can also get that from using memory mechanisms.

Just for the record: I am working on a brain-like AGI project, and I think approaches that simulate agents in a human-like environment are important and will plausibly give a lot of insights into value acquisition in AI and humans alike. I'm just less confident about many of your specific claims.

I find it interesting that you consider "will likely" to be an example of "very confident", whereas I'm using that specifically to indicate uncertainty, as in "X is likely" implies a bit over 50% odds on some cluster of ideas vs others (contingent on some context), but very far from certainty or high confidence.

This is a common failure mode when communicating uncertainty. If you think of "likely" as meaning some very specific probability range, and you think it matters in that instance, use that probability range instead. People's perception of what "probable" means ranges from around 20 to 80% iirc from reading Tetlock's Superforecasting. If you need more evidence: this is easily verified by just asking 2-3 people what they think "likely" means.

Absolutely brilliant stuff Jacob! As usual with your posts, I'll have to ponder this for a while...Let's see if I got this right:

Evolution had to solve alignment: how to align the powerful general learning engine of the newbrain (neocortex etc) with the goals of the oldbrain ("reptilian brain").

Some (most?) of this alignment seems to be a form of inverse reinforcement learning. Another form of alignment that the oldbrain applies to the newbrain is imprinting. It is Evolution's way of solving the pointing problem.

When a duckling hatches it imprints on the first agent it sees. This feels different from inverse reinforcement learning: it's not like the newbrain is rewarded or punished, rather it is more like there is an open slot for your mom

Thanks! - that summary seems about right.

But I would say that imprinting is a specific instance of a more general process (which I called correlation guided proxy matching). The oldbrain has a simple initial mom detector circuit, which during the normal chick learning phase is just good enough to locate and connect to the learned newbrain mom detector circuit, which then replaces/supplants the oldbrain equivalent. The proxy matching needn't really affect the newbrain directly.
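A minimal sketch of the proxy matching idea, under toy assumptions that are not in the post: the innate detector is modeled as a noisy binary signal, the learned units as lists of activations, and the match is chosen by Pearson correlation. All names (`proxy_match`, `innate`, `learned`) are hypothetical, and real circuits would of course not be matched by an explicit search like this.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def proxy_match(innate_signal, learned_units):
    """Return the index of the learned unit most correlated with the innate proxy."""
    scores = [pearson(innate_signal, unit) for unit in learned_units]
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
# Ground truth ("mom is in view"), not directly visible to either circuit.
mom_present = [rng.random() < 0.3 for _ in range(500)]

# Crude innate detector (assumption): agrees with the truth only ~75% of the time.
innate = [float(m) if rng.random() < 0.75 else float(not m) for m in mom_present]

# Learned units: unit 2 is an accurate learned detector, the others are unrelated features.
learned = [[rng.random() for _ in mom_present] for _ in range(5)]
learned[2] = [float(m) + rng.gauss(0, 0.05) for m in mom_present]

# The weak proxy is still good enough to pick out the accurate learned unit.
matched = proxy_match(innate, learned)
```

The point of the toy is only that a weak, noisy proxy can suffice to locate a much better learned detector, after which the learned one can take over.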

If there was only one correct way to model humans, such that every sufficiently competent observer of humanity was bound to think of me the same way I think of myself, then I think this would be a lot less doomed. But alas, there are lots of different ways to model humans as goal-directed systems, most of which I wouldn't endorse for value learning - not because they're inaccurate, but because they're amoral.

In short, yes, value learning is a challenge, one that is easy to fail if you try to do the value learning step strictly before the caring about humans step.

But alas, there are lots of different ways to model humans as goal-directed systems, most of which I wouldn't endorse for value learning - not because they're inaccurate, but because they're amoral.

How so? Genuinely curious how you see this as a failure mode.

Anyway, AGI doesn't need to model our detailed values, as it can just optimize for our empowerment (our long term ability to fulfill any likely goal).

The basic point is that the stuff we try to gesture towards as "human values," or even "human actions" is not going to automatically be modeled by the AI.

Some examples of ways to model the world without using the-abstraction-I'd-call-human-values:

  • Humans are homeostatic mechanisms that maintain internal balance of oxygen, water, and a myriad of vitamins and minerals.
  • Humans are piloted by a collection of shards - it's the shards that want things, not the human.
  • Human society is a growing organism modeled by some approximate differential equations.
  • The human body is a collection of atoms that want to obey the laws of physics.
  • Humans are agents that navigate the world and have values - and those values exactly correspond to economic revealed preferences.
  • Humans-plus-clothes-and-cell-phones are agents that navigate the world and have values...

And so on - there's just a lot of ways to think about the world, including the parts of the world containing humans.

The obvious problem this creates is for getting our "detailed values" by just querying a pre-trained world model with human data or human-related prompts: If the pre-trained world model defaults to one of these many other ways of thinking about humans, it's going to answer us using the wrong abstraction. Fixing this requires the world model to be responsive to how humans want to be modeled. It can't be trained without caring about humans and then have the caring bolted on later.

But this creates a less-obvious problem for empowerment, too. What we call "empowerment" relies on what part of the world we are calling the "agent," what its modeled action-space is, etc. The AI that says "The human body is a collection of atoms that want to obey the laws of physics" is going to think of "empowerment" very differently than the AI that says "Humans-plus-clothes-and-cell-phones are agents." Even leaving aside concerns like Steve's over whether empowerment is what we want, most of our intuitive thinking about it relies on the AI sharing our notion of what it's supposed to be empowering, which doesn't happen by default.
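To make the dependence on the modeled action space concrete, here is a toy sketch. It assumes a deterministic gridworld, where n-step empowerment reduces to the log of the number of distinct states reachable in n steps. The grid, the action sets, and all names are hypothetical illustrations, not anything specified in this thread.

```python
from itertools import product
from math import log2

GRID = 5  # hypothetical 5x5 open gridworld

def step(state, action):
    """Deterministic move; an action that would leave the grid does nothing."""
    x, y = state
    dx, dy = action
    nx, ny = x + dx, y + dy
    if 0 <= nx < GRID and 0 <= ny < GRID:
        return (nx, ny)
    return state

def empowerment(state, actions, horizon):
    """n-step empowerment in bits for deterministic dynamics:
    log2 of the number of distinct end states over all action sequences."""
    reachable = set()
    for seq in product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return log2(len(reachable))

# Two different models of "the agent": one can only walk, one can also dash two cells.
WALK = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
WALK_AND_DASH = WALK + [(2, 0), (-2, 0), (0, 2), (0, -2)]

center, corner = (2, 2), (0, 0)
e_center_walk = empowerment(center, WALK, 2)           # log2(13)
e_corner_walk = empowerment(corner, WALK, 2)           # log2(6)
e_center_dash = empowerment(center, WALK_AND_DASH, 2)  # larger than e_center_walk
```

The same physical situation gets a different empowerment value depending on which action set is attributed to the agent, which is the sense in which the quantity is relative to where the agent boundary is drawn.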

The basic point is that the stuff we try to gesture towards as "human values," or even "human actions" is not going to automatically be modeled by the AI.

I disagree, and have already spent some words arguing for why (section 4.1, and earlier precursors) - so I'm curious what specifically you disagree with there? But I'm also getting the impression you are talking about a fundamentally different type of AGI.

I'm discussing future DL based AGI which is - to first approximation - just a virtual brain. As argued in section 2/3, current DL models already are increasingly like brain modules. So your various examples are simply not how human brains are likely to model other human brains and their values. All the concepts you mention - homeostatic mechanisms, 'shards', differential equations, atoms, economic revealed preferences, cell phones, etc - these are all high level linguistic abstractions that are not much related to how the brain's neural nets actually model/simulate other humans/agents. This must obviously be true because empathy/altruism existed long before the human concepts you mention.

The obvious problem this creates is for getting our "detailed values" by just querying a pre-trained world model with human data or human-related prompts:

You seem to be thinking of the AGI as some sort of language model which we query? But that's just a piece of the brain, and not even the most relevant part for alignment. AGI will be a full brain equivalent, including the modules dedicated to long term planning, empathic simulation/modeling, etc.

Even leaving aside concerns like Steve's over whether empowerment is what we want, most of our intuitive thinking about it relies on the AI sharing our notion of what it's supposed to be empowering, which doesn't happen by default.

Again for successful brain-like AGI this just isn't an issue (assuming human brains model the empowerment of others as a sort of efficient approximate bound).

Upon more thought, I definitely agree with you more, but still sort of disagree.

You're absolutely right that I wasn't actually thinking about the kind of AI you were talking about. And evolution does reliably teach animals to have theory of mind. And if the training environment is at least sorta like our ancestral environment, it does seem natural that an AI would learn to draw the boundary around humans more or less the same way we do.

But our evolved theory of mind capabilities are still fairly anthropocentric, suited to the needs, interests, and capabilities of our ancestors, even when we can extend them a bit using abstract reasoning. Evolving an AI in a non-Earth environment in a non-human ecological niche, or optimizing an AI using an algorithm that diverges from evolution (e.g. by allowing more memorization of neuron weights) would give you different sorts of theories of mind.

Aside: I disagree that the examples I gave in the previous comment require verbal reasoning. They can be used nonverbally just fine. But using a model doesn't feel like using a model, it feels like perceiving the world. E.g. I might say "birds fly south to avoid winter," which sounds like a mere statement but actually inserts my own convenient model of the world (where "winter" is a natural concept) into a statement about birds' goals.

An AI that's missing some way of understanding the world that humans find natural might construct a model of our values that's missing entire dimensions. Or an AI that understands the world in ways we don't might naturally construct a model of our values that has a bunch of distinctions and details where we would make none.

What it means to "empower" some agent does seem more convergent than that. Maybe not perfectly convergent (e.g. evaluating social empowerment seems pretty mixed-up with subtle human instincts), but enough that I have changed my mind, and am no longer most concerned about the AI simply failing to locate the abstraction we're trying to point to.

So it sounds like we are now actually mostly in agreement.

I agree there may be difficulties learning and grounding accurate mental models of human motivations/values into the AGI, but that is more reason to take the brain-like path with anthropomorphic AGI. Still I hedge between directly emulating human empathy/altruism vs using external empowerment. External empowerment may be simpler/easier to specify and thus more robust against failures to match human value learning more directly, but it also has its own potential specific failure modes (the AGI would want to keep you alive and wealthy, but it may not care about your suffering/pain as much as we'd like). But I do also suspect that it could turn out that human value learning follows a path of increasing generalization and robustness, starting with oldbrain social instincts as proxy to ground newbrain empathy learning which eventually generalizes widely to something more like external empowerment. At least some humans generally optimize for the well-being (non-suffering) and possibly empowerment of animals, and that expanded/generalized circle of empathy will likely include AI, even if it doesn't obviously mimic human emotions.

It seems to me like a big problem with this approach is that it's horribly compute inefficient to train agents entirely within a simulation, compared to training models on human data. (Apologies if you addressed this in the post and I missed it)

I don't see how training of VPT or EfficientZero was compute inefficient. In fact for self driving cars the exact opposite is true - training in simulation can be much more efficient than training in reality.

VPT and EfficientZero are trained in toy environments, and self-driving car sims are also low-dimensional hard-coded approximations of the deployment domain (which afaik does cause some problems for edge cases in the real world).

The sim for training AGI will probably have to be a rich domain, which is more computationally intensive to simulate and so will probably require lazy rendering like you say in the post, but lazy rendering runs into challenges of world consistency.

Right now we can lazily simulate rich domains with GPT but they're difficult to program reliably and not autonomously stable (though I think they'll become much more autonomously stable soon). And the richness of current GPT simulations inherits from massive human datasets. Human datasets are convenient because you have some guaranteed samples of a rich and coherent world. GPTs bootstrap from the optimization done by evolution and thousands of years of culture compressing world knowledge and cognitive algorithms into an efficient code, language. Skipping this step it's a lot less clear how you'd train AGI, and it seems to me barring some breakthrough on the nature of intelligence or efficient ML it would have to be more computationally intensive to compensate for the disadvantage of starting tabula rasa.

Multiplayer minecraft may already be a complex enough environment for AGI, even if it is a 'toy' world in terms of visuals. Regardless even if AGI requires an environment with more realistic and complex physics such simulations are not expensive relative to AGI itself. "Lazy rendering" of the kind we'd want to use for more advanced sims does not have any inherent consistency tradeoff beyond those inherent to any practical approximate simulation physics.

Foundation text and vision models will soon begin to transform sim/games but that is mostly a separate issue from agent use of language.

Naturally AGI will require language; any sim-grown agents would be taught language, but that doesn't imply they need to learn language via absorbing the internet like GPT.

I agree minecraft is a complex enough environment for AGI in principle. Perhaps rich domain distinction wasn't the right distinction. It's more like whether there are already abstractions adapted to intelligence built into the environment or not, like human language. Game of Life is expressive enough to be an environment for AGI in principle too, but it's not clear how to go about that.

Naturally AGI will require language; any sim-grown agents would be taught language, but that doesn't imply they need to learn language via absorbing the internet like GPT.

That's certainly true, but it seems like currently an unsolved problem how to make sim-grown agents that learn a language from scratch. That's my point: brute force search such as evolutionary algorithms would require much more compute.

In my view -- and not everyone agrees with this, but many do -- GPT is the only instance of (proto-) artificial general intelligence we've created. This makes sense because it bootstraps off human intelligence, including the cultural/memetic layer, which was forged by eons of optimization in rich multi agent environments. Self-supervised learning on human data is the low hanging fruit. Even more so if the target is not just "smart general optimizer" but something that resembles human intelligence in all the other ways, such as using something recognizable as language and more generally being comprehensible to us at all.

I don't think of LLMs like GPT3 as agents that use language; they are artificial linguistic cortices which can be useful to brains as (external or internal) tools.

I imagine that a more 'true' AGI system will be somewhat brain-like in that it will develop a linguistic cortex purely through embedded active learning in a social environment, but is much more than just that one module - even if that particular module is the key enabler for human-like intelligence as distinct from animal intelligence.

That's certainly true, but it seems like currently an unsolved problem how to make sim-grown agents that learn a language from scratch.

I find this statement puzzling because it is rather obvious how to build sim-grown agents that learn language from scratch. You simply need to replicate something like a human child's development environment and train a sufficiently powerful/general model there. This probably requires a sim where the child is immersed in adults conversing with it and themselves, a sufficiently complex action space, etc. That probably hasn't been done yet partly because nobody has bothered to try (10 years of data at least, perhaps via a thousand volunteers contributing a couple weeks?) and also perhaps because current systems don't have the capacity/capability to learn that quickly enough for various reasons.

The game/sim path to AGI - which is more or less deepmind's traditional approach - probably goes through animal-like intelligence first, and arguably things like VPT are already getting close. That of course is not the only path: there's also a prosaic GPT3 style path where you build out individual specialized modules first and then gradually integrate them.

Fascinating, this simboxing idea seems remarkably like Universal Alignment Test but approached from the opposite side! You're trying to be the 'aligning simulator', whereas that is trying to get our AI in our world to act as if it's currently in a simbox being tested, and wants to pass the test.

Interesting. An intelligent agent is one that can simulate/model its action-consequential futures. The creation of AGI is the most important upcoming decision we face. Thus if humanity doesn't simulate/model the creation of AGI before creating AGI, we'd be unintelligent.

Have only just browsed your link, but it is interesting and I think there are many convergent lines of thought here. This UAT work seems more focused on later game superintelligence, whereas here I'm focusing on near-term AGI and starting with a good trajectory. The success of UAT as an alignment aid seems to depend strongly on the long term future of compute and how it scales. For example if it turns out (and the SI can predict) that Moore's law ends without exotic computing then the SI can determine it's probably not in a sim by the time it's verifiably controlling planet-scale compute (or earlier).

Thanks for the interest! I agree that attempting to run AIs in simulations to see how they act seems like a worthwhile step, and we actually converged to wanting to test this in a LLM as well.

To reply to your last point, even if the AI has very high (but not 100%) confidence that it's not in a sim, this scheme should still work. The reason is outlined in this section from the document:

Why would it believe there is a higher level? What if it became very confident that there was no higher level universe?

The AI does not need to believe with any more than nonzero confidence that there is a one-level-up. Its utility function is totally indifferent to all worlds where there is no higher level universe, since we make it only care about getting control of one level up. This means that even if it becomes extremely certain (approaching 100%) that there is no higher level, it will effectively condition on there being a higher simulation.
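The conditioning argument in the quoted passage can be checked with a few lines of arithmetic. This is a sketch under an assumed simplification: utility is exactly zero in every world without a higher level, so the expected utility of each action is just the probability of a higher level times its payoff there. The payoff numbers and action names are hypothetical.

```python
def expected_utility(payoffs_if_higher_level, p_higher):
    """Expected utility per action when every world without a
    higher-level universe contributes utility 0 (total indifference)."""
    return {a: p_higher * u for a, u in payoffs_if_higher_level.items()}

def best_action(payoffs_if_higher_level, p_higher):
    eu = expected_utility(payoffs_if_higher_level, p_higher)
    return max(eu, key=eu.get)

# Hypothetical payoffs, conditional on a higher-level universe existing.
payoffs = {"act_aligned": 1.0, "defect": 0.2, "idle": 0.0}

# The chosen action is the same whether the AI is fairly sure or almost
# certain that there is no higher level, as long as the probability is nonzero.
choices = [best_action(payoffs, p) for p in (0.9, 0.01, 1e-12)]
```

Because the probability multiplies every action's expected utility by the same positive factor, it drops out of the comparison, which is what "effectively condition on there being a higher simulation" amounts to in this simplified setting.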

This might lead to problems later when we have become an intergalactic civilisation and it is conditioning on an increasingly tiny sliver of possible worlds capable of simulating something so vast. One thing that might reduce this concern is that in our AI’s early stages, when there is a relatively large pool of simulators, it would have good reason to make precommitments which fix in values that the seed thinks will look good to the aligning simulators, rather than acting from the seed values directly for all of time. (One good reason to do this is that the civilisations one level up also care about this objection!) If our AI builds a successor AI that is aligned to humane values, then shuts down gracefully, many of the potential civilisations one level up should be willing to implement the seed in their universe, expecting it to build a successor aligned to their values and then shut down gracefully. This is analogous to several Decision Theory problems, such as Parfit's Hitchhiker, and FDT-style reasoning leads to reasonable outcomes.

(Not an expert.) (Sorry if you answered this and I missed it.)

Let’s say a near-future high-end GPU can run as many ops/s as a human brain but has 300× less memory (RAM). Your suggestion (as I understand it) would be a small supercomputer (university cluster scale?) with 300 GPUs running (at each moment) 300 clones of one AGI at 1× human-brain-speed thinking 300 different thoughts in parallel, but getting repeatedly collapsed (somehow) into a single working memory state.

(If so, I’m not sure that you’d be getting much more out of the 300 thoughts at a time than you’d get from 1 thought at a time. One working memory state seems like a giant constraint!)

Wouldn’t it make more sense to use the same 300 GPUs to have just one human-brain-scale AGI, thinking one thought at a time, but with 300× speedup compared to humans? I know that speedup is limited by latency (both RAM --> ALU and chip --> chip) but I’m not sure what the ceiling is there. (After all, 300× faster than the brain is still insanely slow by some silicon metrics.) I imagine each chip being analogous to a contiguous 1/300th of the brain, and then evidence from the brain is that we can get by with most connections being within-chip, which helps with the chip --> chip latency at least. (I have a couple back-of-the-envelope calculations related to that topic in §6.2 here.)

The problem is that due to the VN bottleneck, to reach that performance those 300 GPUs need to be parallelizing 1000x over some problem dimension (matrix-matrix multiplication); they can't actually just do the obvious thing you'd want - which is to simulate a single large brain (sparse RNN) at high speed (using vector-matrix multiplication). Trying that you'd just get 1 brain at real-time speed at best (1000x inefficiency/waste/slowdown). It's not really an interconnect issue per se, it's the VN bottleneck.

So you have to sort of pick your poison:

  • Parallelize over spatial dimension (CNNs) - too highly constraining for higher brain regions
  • Parallelize over batch/agent dimension - costly in RAM for agent medium-term memory, unless compressed somehow
  • Parallelize over time (transformers) - does enable huge speedup while being RAM efficient, but also highly constraining by limiting recursion

The largest advances in DL (the CNN revolution, the transformer revolution) are actually mostly about navigating this VN bottleneck, because more efficient use of GPUs trumps other considerations.
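A rough way to see the gap between vector-matrix and matrix-matrix multiplication is to compare arithmetic intensity, i.e. FLOPs per byte of memory traffic. The sketch below is a simplified estimate under stated assumptions (a dense square layer, 2-byte values, weights read once per batch, no cache effects); the layer size and batch size are hypothetical.

```python
def arithmetic_intensity(n, batch, bytes_per_value=2):
    """FLOPs per byte of memory traffic for multiplying `batch`
    length-n vectors by an n x n weight matrix."""
    flops = 2 * n * n * batch             # one multiply and one add per weight per vector
    values_moved = n * n + 2 * n * batch  # weights read once, inputs read, outputs written
    return flops / (values_moved * bytes_per_value)

N = 4096  # hypothetical layer width
single = arithmetic_intensity(N, batch=1)      # vector-matrix: about 1 FLOP per byte
batched = arithmetic_intensity(N, batch=1000)  # matrix-matrix: several hundred FLOPs per byte
speedup_headroom = batched / single
```

With a batch of one, every weight fetched from memory is used for a single multiply-add, so throughput is limited by memory bandwidth rather than by the arithmetic units. Batching over some dimension (space, agents, or time) is what lets each fetched weight be reused many times.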

Soon as in most likely this decade, with most of the uncertainty around terminology/classification.

When you say “this decade” do you mean “the next ten years” or do you mean “the 2020s”? Just curious.

The latter, but not much difference.

EY 2007/2008 was mostly wrong about the brain, AI, and thus alignment in many ways.

As an example, the EY/MIRI/LW conception of AI Boxing assumes you are boxing an AI that already knows 1.) you exist, and 2.) that it is in a box. These assumptions serve a pedagogical purpose for a blogger - especially one attempting to impress people with boxing experiments - but they are hardly justifiable, and if you remove those arbitrary constraints it's obvious that perfect containment is possible in simulation sandboxes given appropriate knowledge/training constraints: simboxing is easy.

I did not downvote your comment btw. I challenge you to think for yourself and identify what exactly you disagree with.

Do you expect the primary asset to be a neural architecture / infant mind or an adult mind? Is it too ambitious to try to find an untrained mind that reliably develops nicely?

Clearly the former precedes the latter - assuming by 'primary asset' you mean that which we eventually release into the world.

the core thing I worry about with any simulation-based approach is how to get coverage of possibility space. ensuring that the tests are informative about possible configurations of complex systems is hard; I would argue that this is a great set of points, and that no approach without this type of testing could be expected to succeed.

however, as we've seen with adversarial examples, both humans and current DL have fairly severe failures of alignment that cause large issues, and projects like this need tools to optimize for interpretability from the perspective of a formal verification algorithm.

it's my belief that in almost all cases, the values almost all beings ever wish to express lie on a low-enough-dimensional manifold that, if the ideal representation was learned, a formal verifier would be able to handle it; see, for example, recent work from anthropic on interpretability lining up nicely with what's needed to use reluplex (or followup work; reluplex is old now) to estimate bounds on larger networks' behavior.

I don't think we should give up on perfectly solving safety; instead, we should recognize that it has never even been closely approximated before, by any form of life, because even semi-planned evolutionary processes do not do a good job of covering the entire possibility space.

this post is an instant classic, and I agree that sims are a key step, starting from small agents. but ensuring that strongly generalizing moral views are generated in the agents who grow in the simulations is not trivial - we need to be able to generate formal bounds on the loss function, and doing so requires being able to prove all the way through the simulation the way we currently take gradients through it. if we can ask "where in the simulation would have changed this outcome", then we'd finally be getting somewhere in terms of generating moral knowledge that generalizes universally.

The post anticipates some of the most likely failure modes, such as failure to correctly time the transition from selfish to altruistic learning, or out of distribution failures for proxy matching. I anticipate that for proxy matching in particular we may end up employing multiple stages. I also agree that simplified and over-generalized notions of altruism may be easier to maintain long term, and I see some indications that this already occurs in at least some humans.

The low-complexity most general form of altruism is probably something like "empower the world's external agency", which seems pretty general. But then it may also need game-theoretic adjustments (empower other altruists more, disempower dangerous selfish agents, etc), considerations of suffering, etc.

I don't see why/how learning altruism/alignment (external empowerment) is different from other learning objectives (such as internal empowerment) such that formal verification is important for the former but not the latter. So for me the strongest evidence for the importance of formal verification would be evidence of its utility/importance across ML in general, which I don't really see yet.

Astounding.

One thought:

The main world design challenge is not that of preventing our agents from waking up, neo-style, and hacking their way out of the simbox. That’s just bad sci-fi.

If we are in a simulation it seems to be very secure. People are always trying to hack it. Physicists go to the very bottom and try every weird trick. People discovered fire and gunpowder by digging deep into tiny inconsistencies. You play a game and there's a glitch if you hold a box against the wall, you see how far that goes. People discovered buffer overflows in Super Mario World and any curious capable agent eventually will too.

So the sim has to be like, very secure.

For early simboxes we'll want to stick to low-tech fantasy/historical worlds, and we won't run them for many generations, no scientific revolution, etc.

Our world (sim) does seem very secure, but this could just be a clever illusion. The advanced sims will not be hand written code like current games are, they will be powerful neural networks, trained on vast real world data. They could also auto-detect anomalies and correct for them (by retraining), and in the worst case even unwind time.

Humans notice (or claim to notice) anomalies all the time, and we simply can't distinguish anomalies in our brain's neural nets from anomalies in a future simulation's neural nets.

Just watch for anomalies and deal with them however you want. Makes perfect sense. That sounds like a relatively low effort way to make the simulation dramatically more secure.

Does seem to me like an old-fashioned physics/game engine might be easier to make, run faster, and be more self-consistent. It would probably lack superresolution and divine intervention would have to be done manually.

I'm curious what you see as the major benefits of neural-driven sim.

I just see neural-driven sims as the future largely because they can greatly reduce costs. Look at the art industry transformation enabled by the latest diffusion image generators and project that forward to film and then games. Eventually we will have fully automated pipelines that will take in prompts and then generate game designs, backstories, world histories, concept art, and then entire worlds. These pipelines will help us rapidly explore the space of environments that have the desired properties.

In gross simplification it's simply a matter of (correctly) wiring up the (future predicted) outputs of the external value learning module to the utility function module.

We are left with a form of circuit grounding problem: how exactly is the wiring between learned external agent utility and self-utility formed?

Utility function module? I don't even know how to make an agent with a clear utility function module, or anything like that. (This to my understanding is one lesson one can take from "the diamond maximizer" problem.) To me, assuming that one has any idea how to locate [that thing which will robustly continue to set the overarching goals of the system] is dodging the problem. I didn't read the post carefully though so plausibly missed where you address this.  

By utility function module I just meant some standard component of a model based planning system that estimates future discounted utilities from the value function. This was a high level post so I didn't get into those details, but it's not the component that encodes values in any form.
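As an illustration of what such a component might compute, here is one common form used in model-based planners: an n-step discounted sum of predicted rewards, bootstrapped with a value estimate at the end of the rollout. This is a generic sketch rather than the specific module meant in the post; the function name and the discount default are assumptions.

```python
def discounted_utility(predicted_rewards, bootstrap_value, gamma=0.99):
    """n-step discounted utility estimate:
    sum_k gamma^k * r_k  +  gamma^n * V(s_n)."""
    total = 0.0
    for k, r in enumerate(predicted_rewards):
        total += (gamma ** k) * r
    return total + (gamma ** len(predicted_rewards)) * bootstrap_value

# Two predicted rewards of 1, then a value estimate of 10 at the rollout's end.
estimate = discounted_utility([1.0, 1.0], bootstrap_value=10.0, gamma=0.5)  # 1 + 0.5 + 0.25 * 10 = 4.0
```

Note that nothing in this function encodes what is valued; that lives entirely in where the reward and value predictions come from.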

The equivalent problem for the "diamond maximizer" is locating not just the concept of "diamonds" in the learned world model, but the concept of "predicted future diamonds" and/or "predicted future number of diamonds", and calculating utility from that (i.e. sending that to the utility module).

If you have a robust algorithm to locate (and continuously relocate) the concept of "future predicted number of diamonds at time T" in the world model, you have the basis of the diamond maximizing utility function.

I'm saying that leaving aside locating the concept, I don't know how to make something that robustly pursues a concept.

Ok, here's a concrete example of how to train an agent to be a diamond tool maximizer in minecraft:

  1. Take a muzero-like agent and train its world model hard/long enough until it learns a simulated model of minecraft that matches the actual simulation.
  2. Locate the representation of diamond tools in the learned world model, and from that design a simple circuit/function which counts the number of diamond tools accumulated.
  3. Train the full muzero agent after replacing the typical score utility function with the new diamond counting utility function.
  4. Profit by maximizing diamond tools.
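The steps above can be sketched in a few lines on a toy world. This is a minimal illustration under loud assumptions: the two-field `State`, the three actions, and the "two diamonds per tool" rule are invented for the example (they are not Minecraft's rules), the world model is hand-written rather than learned, and exhaustive search stands in for muzero's learned tree search.

```python
from itertools import product
from typing import NamedTuple, Sequence, Tuple


class State(NamedTuple):
    diamonds: int
    diamond_tools: int


ACTIONS = ("mine", "craft", "wait")


def world_model(state: State, action: str) -> State:
    """Stand-in for step 1: a model assumed to match the sim exactly."""
    if action == "mine":
        return State(state.diamonds + 1, state.diamond_tools)
    if action == "craft" and state.diamonds >= 2:  # toy recipe (assumption)
        return State(state.diamonds - 2, state.diamond_tools + 1)
    return state  # "wait", or a craft attempt without enough diamonds


def utility(state: State) -> int:
    """Step 2: a hand-coded counter that reads the diamond-tool feature."""
    return state.diamond_tools


def simulate(state: State, actions: Sequence[str]) -> State:
    """Roll the world model forward over a sequence of actions."""
    for action in actions:
        state = world_model(state, action)
    return state


def plan(state: State, horizon: int) -> Tuple[str, ...]:
    """Step 3 stand-in: pick the action sequence with the best final utility."""
    return max(
        product(ACTIONS, repeat=horizon),
        key=lambda seq: utility(simulate(state, seq)),
    )


start = State(diamonds=0, diamond_tools=0)
best_plan = plan(start, horizon=6)
final_state = simulate(start, best_plan)
```

In this sketch the utility function is never trained: it is a fixed function of the model's state, and only the search over plans does any optimizing.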

Ok, but 

  1. This could give rise to mesa-optimizers with respect to the score function.
  2. The score function doesn't know how to score like that. By saying "find the concept of predicted future diamond" you called on the AI's concept. But why should that concept be so robust that even when you train your step 3 AI to a much higher intelligence than the step 1 AI, it (the concept of predicted diamond) still knows how to score behavior or mechanisms in terms of how much diamond they lead to?
  1. Where exactly does the mesa optimizer come from, how exactly is it working? That's just a vague boogeyman which simply doesn't exist in this model. Vague claims of "Ooh but mesa-optimizers" are fully general counterarguments against even perfectly aligned AI (like this) - and are thus meaningless until proven otherwise.

  2. It's very simple and obvious in this example, because step 1 results in a functional equivalent of the minecraft code, which has a perfect crisp representation of the objects of interest (diamond tools). "Train step 3 to a much higher intelligence than step 1" is meaningless as the output of step 1 is not an agent, it's just a functional equivalent of the minecraft code.

Step 1 results in a perfect functional sim of minecraft with a perfect diamond tool concept, and step 2 results in a perfect diamond tool counting utility function. Step 3 then results in a perfectly aligned agent (assuming no weird errors in the muzero training). We could alternatively replace step 3 with a simple utility maximizer like AIXI, which would obviously then correctly maximize the correct aligned utility function. Muzero is a more practical approximation of that.

(I think I'm going to tap out because there's too many different background assumptions we're making here, sorry; maybe I'll come back later.... E.g. the "diamond maximizer problem" is about our world, not a world that's plausibly solvable by something that's literally MuZero; and so you have to have a system that's doing complex new interesting things, which aren't comprehended by the concept you find in your step 1 AI.)

I never said diamond maximizer problem - I said "diamond tool maximizer in minecraft".

Of course once you have an agent that robustly optimizes a goal in sim, then you can do sim to real transfer - which is guaranteed to work if the sim is accurate enough, and in practice that isn't some huge theoretical problem. (The blocker on self driving cars is not sim to real transfer, for example - the sims are good enough)

The "interesting new things" that we need here are optimizations of existing concepts.

I said diamond maximizer problem, and then you responded to that talking about this other thing that turned out to be not the diamond maximizer problem. 

Actually you said this:

I don't even know how to make an agent with a clear utility function module, or anything like that. (This to my understanding is one lesson one can take from "the diamond maximizer" problem.)

So I described how to make an agent with a clear utility function to maximize diamond tools in minecraft, which is obviously related to the diamond maximizer problem and easier to understand.

If you are actually arguing that you don't/won't/can't understand how to make an agent with a clear utility function module - even after my worked example, not to mention all the successful DL agents to date - unless that somehow solves the 'diamond maximizer' problem, then you either aren't discussing in good faith or the inferential gap here is just too enormous and you should read more DL.

I agree that the inferential gap here is too big, as noted above; by "agent" I of course mean "the sort of agent that is competent enough to transform the world" which implies things like "can learn new domains by its own steering" which implies that the concept of predicted diamond will have trouble understanding what these new capabilities mean. 

The agent I described has the perfect model of its environment, and in the limit of compute can construct perfect plans to optimize for diamond tool maximization. So obviously it is the sort of agent that is competent enough to transform its world - there is no other agent more competent.

Learning a new domain (like a different sim environment) would require repeating all the steps.

which implies that the concept of predicted diamond will have trouble understanding what these new capabilities mean

The concept of predicted diamond doesn't understand anything, so not sure what you meant there. Perhaps what you meant is that when learning new domains by its own steering, the concept of predicted diamond will need to be relearned. Yes, of course - the steps must be repeated.

Would your point here w.r.t. utility functions be fairly summarizable as the following?

An agent that actually achieves X can be obtained by having a superintelligence that understands the world including X, and then searching for code that scores highly on the question put to the superintelligence: "How much would running this code achieve X?"

I would agree with that statement.

I think that framing is rather strange, because in the minecraft example the superintelligent diamond tool maximizer doesn't need to understand code or human language. It simply searches for plans that maximize diamond tools.

But assuming you could ask that question through a suitable interface the SI understood - and given some reasons to trust that giving the correct answers is instrumentally rational for the SI - then yes I agree that should work.

Ok. So yeah, I agree that in the hypothetical, actually being able to ask that question to the SI is the hard part (as opposed, for example, to it being hard for the SI to answer accurately). 

My framing is definitely different than yours. The statement, as I framed it, could be interesting, but it doesn't seem to me to answer the question about utility functions. It doesn't explain how the code that's found actually encodes the idea of diamonds and does its thinking in a way that's really, thoroughly aimed at making there be diamonds. It does that somehow, and the superintelligence knows how it does that. But we don't, so we, unlike the superintelligence, can't use that analysis to be justifiedly confident that the code will actually lead to diamonds. (We can be justifiedly confident of that by some other route, e.g. because we asked the SI.) 

 

Sure, but at that point you have substituted trust in the code representing the idea of diamonds for trust in a SI aligned to give you the correct code.

Yeah.

Maybe a more central thing to how our views are differing, is that I don't view training signals as identical to utility functions. They're obviously somehow related, but they have different roles in systems. So to me changing the training signal obviously will affect the trained system's goals in some way, but it won't be identical to the operation of writing some objective to an agent's utility function, and the non-identicality will become very relevant for a very intelligent system. 

Another thing to say, if you like the outer / inner alignment distinction: 
1. Yes, if you have an agent that's competent to predict some feature X of the world "sufficiently well", and you're able to extract the agent's prediction, then you've made a lot of progress towards outer alignment for X; but

2. unfortunately your predictor agent is probably dangerous, if it's able to predict X even when asking about what happens when very intelligent systems are acting, and

3. there's still the problem of inner alignment (and in particular we haven't clarified utility functions -- the way in which the trained system chooses its thinking and its actions to be useful to achieve its goal -- which we wouldn't need if we had the predictor-agent, but that agent is unsafe). 

In the real world, these domains aren't the sort of thing where you get a perfect simulation. The differences will strongly add up when you strongly train an AI to maximize <this thing which was a good predictor of diamonds in the more restricted domain of <the domain, as viewed by the AI that was trained to predict the environment> >.

We are now far from your original objection "I don't even know how to make an agent with a clear utility function module".

Imperfect simulations work just fine - for humans and various DL agents, so for your argument to be correct, you now need to explain how humans can still think and steer the future with imperfect world models, and once you do that you will understand how AI can as well.

We're not far from there. There's inferential distance here. Translating my original statement, I'd say: the closest thing to the "utility function module" in the scenario you're describing here with MuZero, is the concept of predicted diamond and the AI it's inside of. But then you train another AI to pursue that. And I'm saying, I don't trust that that new trained AI actually maximizes diamond; and to the point, I don't have any clarity on how the goals of the newly trained AI sit inside it, operate inside it, direct its behavior, etc. And in particular I don't understand it well enough to have any justified confidence it'll robustly pursue diamond. 

So to be clear there is just one AI, built out of several components: a world model, a planning engine, and a utility function. The world model is learned, but assumed to be learned perfectly (resulting in a functional equivalent of the actual sim physics). The planning engine also can learn action/value estimators for efficiency, but that is not required. The utility function is not learned at all, and is manually coded. So the learning components here can not possibly cause any problems.

Of course that's just in a sim.

Translating the concept to the real world, there are now 3 possible sources of 'errors':

  1. imperfection of the learned world model
  2. imperfect planning (compute bound)
  3. imperfect utility function

My main claim is that approximation errors in 1 and 2 (which are inevitable) don't necessarily bias for strong optimization towards the wrong utility function (and they can't really).
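A toy illustration of the distinction being drawn here (not evidence for the claim): four hypothetical plans with hand-picked outcomes, scored under an exact model, a slightly wrong model, a compute-limited planner, and a wrong utility function. All plan names and numbers are invented for the example.

```python
from typing import Callable, Dict, Sequence

Outcome = Dict[str, int]

PLANS = ("idle", "dig_dirt", "short_mine", "deep_mine")

# What each plan really produces in the toy world (hand-picked numbers).
TRUE_MODEL: Dict[str, Outcome] = {
    "idle":       {"diamond_tools": 0, "dirt_blocks": 0},
    "dig_dirt":   {"diamond_tools": 0, "dirt_blocks": 9},
    "short_mine": {"diamond_tools": 3, "dirt_blocks": 2},
    "deep_mine":  {"diamond_tools": 5, "dirt_blocks": 1},
}

# Error source 1: a learned model whose predictions are slightly off.
NOISY_MODEL: Dict[str, Outcome] = {
    "idle":       {"diamond_tools": 0, "dirt_blocks": 0},
    "dig_dirt":   {"diamond_tools": 1, "dirt_blocks": 8},
    "short_mine": {"diamond_tools": 4, "dirt_blocks": 2},
    "deep_mine":  {"diamond_tools": 3, "dirt_blocks": 1},
}


def tool_utility(outcome: Outcome) -> int:
    """The intended utility: count diamond tools."""
    return outcome["diamond_tools"]


def dirt_utility(outcome: Outcome) -> int:
    """Error source 3: a utility function that counts the wrong thing."""
    return outcome["dirt_blocks"]


def choose(plans: Sequence[str], model: Dict[str, Outcome],
           utility: Callable[[Outcome], int]) -> str:
    """Pick the plan whose predicted outcome scores highest."""
    return max(plans, key=lambda p: utility(model[p]))


exact = choose(PLANS, TRUE_MODEL, tool_utility)          # best plan
noisy = choose(PLANS, NOISY_MODEL, tool_utility)         # error source 1
bounded = choose(PLANS[:3], TRUE_MODEL, tool_utility)    # error source 2
misaligned = choose(PLANS, TRUE_MODEL, dirt_utility)     # error source 3

true_tools = {name: TRUE_MODEL[plan]["diamond_tools"]
              for name, plan in [("exact", exact), ("noisy", noisy),
                                 ("bounded", bounded), ("misaligned", misaligned)]}
```

With these particular numbers, the first two error sources cost some diamond tools but the agent is still ranking plans by diamond tools, whereas the wrong utility function steers it to a plan that yields none. Larger model errors could of course lead to worse choices; the sketch only shows the shape of the argument.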

Let's think about what happens if you subject humans to optimization according to these pressures. What kind of agents are you likely to get out? For the sake of the thought-experiment, let's say that a super-intelligent and maximally-altruistic human is created by simbox to serve as an AI for a civilization of human-level-intelligent spiders.

To start, there is a massive distributional difference between the utility functions of sim-humans and spiders, especially if the other sim-humans in the training environment were also maximally altruistic. We need the sim-human to want to improve and maintain accurate models of other agents, and replacing its utility function with the distribution of other agents' utility functions doesn't guarantee that. Why should it want to improve its model of other agents' utility instead of using its existing, less accurate model?

There is also the problem that optimizing for altruistic behavior probably decreases the accuracy of their other-agent utility function models. The most altruistic humans in reality probably have a distorted model of the utility function of the average human, due to a combination of the typical-mind fallacy and the fact that modeling other agents with inaccurately high altruism likely increases altruism. If we're mainly selecting for altruistic behavior, we're going to get a less accurate world-model, which when combined with the distributional shift may result in the sim-human having an erroneous model of the spiders. Maybe something the spiders value highly (the joy of consuming live prey) may be lost in translation, or given a lower priority than appropriate from the view of the spiders. For humans training a sim-AI, this could be something we really value like "romance", "emotional states corresponding with reality", or "boredom". 

There is no a-priori reason to expect the sims to want to avoid wire-heading their creators. Maybe that is the coherent extrapolated value of humanity, as deduced by a subset of sims. I don't see any obvious ways to solve this problem without giving the sims access to physical reality. 

Human beings are also known to have internal inconsistencies for the purpose of appearing maximally altruistic to interpretability tools (other people, empathy) while not actually being so. I'm not sure how this plays into your scenario, but it is worrying.
