I have a compute-market startup called vast.ai, I also do research for Orchid (crypto), and I'm working towards a larger plan to save the world. Currently seeking networking, collaborators, and hires - especially top-notch CUDA/GPU programmers.

My personal blog: https://entersingularity.wordpress.com/

I find it interesting that you consider "will likely" to be an example of "very confident", whereas I'm using that specifically to indicate uncertainty, as in "X is likely" implies a bit over 50% odds on some cluster of ideas vs others (contingent on some context), but very far from certainty or high confidence.

The only prediction directly associated with a time horizon is the opening prediction of AGI most likely this decade. Fully supporting/explaining that timeline prediction would probably require a short post, but it mostly reduces to: the surprising simplicity of learning algorithms, the dominance of scaling, and of course brain efficiency, which together imply AGI arrives predictably around or a bit after brain parity near the end phase of Moore's law. The early versions of this theory have already made many successful postdictions/predictions[1].

Looking at the Metaculus prediction for "Date Weakly General AI is Publicly Known", I see the median was in the 2050s back in early 2020, had dropped to around 2040 by the time I posted on brain efficiency earlier this year, and is now down to 2028: equivalent to my Moravec-style prediction of most likely this decade. I will take your advice to link that timeline prediction to Metaculus, thanks.

Most of the other statements are all contextually bound to a part of the larger model in the surrounding text and should (hopefully obviously) not be interpreted out-of-context as free-floating unconditional predictions.

For example: "[human judges] will be able to directly inspect, analyze, search and compare agent mind states and thought histories, both historical and in real-time."

Is a component of a larger design proposal, which involves brain-like AGI with inner monologues and other features that make that feature rather obviously tractable.

Imagine the year is 1895 and I've written a document describing how airplanes could work, and you are complaining that I'm making an overconfident prediction that "human pilots will be able to directly and easily control the plane's orientation in three dimensions: yaw, pitch, and roll". That's a prediction only in the sense of being a design prediction, and only in a highly contextual sense contingent on the rest of the system.

I'm not saying that some of these aren't plausible avenues, but to me, this comes across as overconfident (it might be a stylistic method, but I think that is also problematic in the context of AGI Safety).

I'm genuinely more curious which of these you find the most overconfident/unlikely, given the rest of the design context.

Perhaps these?:

DL based AGI will not be mysterious and alien; instead it will be familiar and anthropomorphic

AGI will be born of our culture, growing up in human information environments

AGI will mostly have similar/equivalent biases - a phenomenon already witnessed in large language models

Sure, these were highly controversial/unpopular opinions on LW when I was first saying AGI would be anthropomorphic, that brains are efficient, etc. way back in 2010, long before DL, when nearly everyone on LW thought AGI would be radically different from the brain (ironically based mostly on the sequences: a huge wall of often unsubstantiated confident philosophical doctrine).

But on these issues regarding the future of AI, it turns out that I (along with Moravec/Kurzweil/etc.) was mostly correct, and EY/MIRI/LW was mostly wrong - and it seems MIRI folks concur to some extent and have updated. The model difference that led to divergent predictions about the future of AI is naturally associated with different views on brain efficiency[2] and divergent views on the tractability of safety strategies[3].

  1. For example, the simple Moravec-style model that predicts AI task parity around the time of flop parity to equivalent brain regions roughly predicted DL milestones many decades in advance, and the timing of NLP breakthroughs à la LLMs is/was also predictable based on total training flops equivalence to the brain's linguistic cortex. ↩︎

  2. EY was fairly recently claiming that brains were about half a million times less efficient than the thermodynamic limit. ↩︎

  3. For example see this comment where Rob Bensinger says, "If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world's existential risk.", but then for various reasons doesn't believe that's especially doable. ↩︎

The basic point is that the stuff we try to gesture towards as "human values," or even "human actions" is not going to automatically be modeled by the AI.

I disagree, and have already spent some words arguing for why (section 4.1, and earlier precursors) - so I'm curious what specifically you disagree with there? But I'm also getting the impression you are talking about a fundamentally different type of AGI.

I'm discussing future DL based AGI which is - to first approximation - just a virtual brain. As argued in section 2/3, current DL models already are increasingly like brain modules. So your various examples are simply not how human brains are likely to model other human brains and their values. All the concepts you mention - homeostatic mechanisms, 'shards', differential equations, atoms, economic revealed preferences, cell phones, etc. - are high level linguistic abstractions that are not much related to how the brain's neural nets actually model/simulate other humans/agents. This must obviously be true because empathy/altruism existed long before the human concepts you mention.

The obvious problem this creates is for getting our "detailed values" by just querying a pre-trained world model with human data or human-related prompts:

You seem to be thinking of the AGI as some sort of language model which we query? But that's just a piece of the brain, and not even the most relevant part for alignment. AGI will be a full brain equivalent, including the modules dedicated to long term planning, empathic simulation/modeling, etc.

Even leaving aside concerns like Steve's over whether empowerment is what we want, most of our intuitive thinking about it relies on the AI sharing our notion of what it's supposed to be empowering, which doesn't happen by default.

Again for successful brain-like AGI this just isn't an issue (assuming human brains model the empowerment of others as a sort of efficient approximate bound).

Indeed a model trained on (or with full access to) our internet could be very hard to contain. That is in fact a key part of my argument. If it is hard to contain, it is hard to test. If it's hard to test, we are unlikely to succeed.

So if we agree there, then perhaps you agree that giving untested AGI access to our immense knowledge base is profoundly unwise.

I expect then that we won't be limiting their knowledge as much as you think we will, because I expect them to be trained with giant corpuses of (mostly) post-1000BC text.

That's then just equivalent to saying "I expect then that we won't even bother with testing our alignment designs". Do you actually believe that testing is unnecessary? Or do you just believe the leading teams won't care? And if you agree that testing is necessary, then shouldn't this be key to any successful alignment plan?

My best guess is that containment fails because a model convinces one of the people in the lab to let it access the internet.

Which obviously is nearly impossible if it doesn't know it is in a sim, and doesn't know what a lab or the internet are, and lacks even the precursor concepts. Since you seem to be focused on the latest fad (large language models), consider the plight of poor simulated Elon Musk - who at least is aware of the sim argument.

But that’s a bit beside the point until we address the question: How do we calculate that proxy?

I currently see two potential paths, which aren't mutually exclusive.

The first path is to reverse engineer the brain's empathy system. My current rough guess of how the proxy-matching works for that is explained in some footnotes in section 4, and I've also rewritten it in this comment, which is related to some of your writings. In a nutshell, the oldbrain has a complex suite of mechanisms (facial expressions, gaze, voice tone, mannerisms, blink rate, pupil dilation, etc.) consisting of both subconscious 'tells' and 'detectors' that function as a sort of direct non-verbal, oldbrain to oldbrain communication system to speed up the grounding to newbrain external agent models. This is the basis of empathy, evolved first for close kin (mothers simulating infant needs, etc.) then extended and generalized. I think this is what you perhaps have labeled innate 'social instincts' - these facilitate grounding to the newbrain models of others' emotions/values.

The second path is to use introspection/interpretability tools to more manually locate learned models of external agents (and their values/empowerment/etc), and then extract those located circuits and use them directly as proxies in the next agent.

Do you think that we can write code today to calculate this proxy, and then we can go walk around town and see what that code spits out in different real-world circumstances? Or if we can’t write such code today, why not, and what line of research gets us to a place where we can write such code?

Neuroscientists may already be doing some of this today, or at least they could (I haven't extensively researched this yet). Should be able to put subjects in brain scanners and ask them to read and imagine emotional scenarios that trigger specific empathic reactions, perhaps have them make consequent decisions, etc.

And of course there is some research being done on empathy in rats, some of which I linked to in the article.

Is this something you've thought deeply about and/or care to expand on? Curious about your source of skepticism, considering:

  • we can completely design and constrain the knowledge base of sim agents, limiting them to the equivalent of 1000BC human knowledge, or whatever we want
  • we can completely and automatically monitor their inner monologues, thoughts, etc

How do you concretely propose a sim agent will break containment? (which firstly requires sim-awareness)

How would you break containment now, assuming you are in a sim?

Also, significant speed/memory advantages go against the definition of 'human-level agents' and are intrinsically unlikely anyway, as 2x speed/memory agents are simply preceded by 1x speed/memory agents, and especially given the constraints of GPUs, which permit acceleration by parallelizing learning across agents more than by serial speedup of individual agents.

By contrast, the world where sharp left turns happen near human-level capabilities seems pretty dangerous. In that world, we don’t get to study things like deceptive alignment empirically. We don’t get to leverage powerful AI researcher tools.

We actually can study and iterate on alignment (and deception) in simulation sandboxes while leveraging any powerful introspection/analysis tools. Containing human-level AGI in (correctly constructed) simulations is likely easy.

But then later down you say:

For instance, we can put weak models in environments that encourage alignment failure so as to study left turns empirically (shifting towards the "weak" worlds).

Which seems like a partial contradiction, unless you believe we can't contain human-level agents?

Thanks, upvoted for engagement and constructive criticism - I'd like more people to see this comment.

I'm going to start perhaps in reverse to establish where we seem to most agree:

There are definitely a lot of challenges left in this regime, but to me it looks solvable and I agree with you that in worlds without rapid FOOM, success will almost certainly look like considerable iteration on alignment with a bunch of agents undergoing some kind of automated simulated alignment testing in a wide range of scenarios plus using the generalisation capabilities of machine learning to learn reward proxies that actually generalise reasonably well within the distribution of capabilities actually obtained.

I fully agree with this statement.

However in worlds that rapidly FOOM, everything becomes more challenging, and I'll argue in a moment why I believe that the approach presented here still is promising in rapid FOOM scenarios, relative to all other practical techniques that could actually work.

But even without rapid FOOM, we still can have disaster - for example consider the scenario of world domination by a clan of early uploads of some selfish/evil dictator or trillionaire. There's still great value in solving alignment here, and (to my eyes at least) much less work focused on that area.

Now if rapid FOOM is near inevitable, then those considerations naturally may matter less. But rapid FOOM is far from inevitable.

First, Moore's Law is ending, and brains are efficient, perhaps even near pareto-optimal.

Secondly, the algorithms of intelligence are much simpler than we expected, and brains already implement highly efficient or even near pareto-optimal approximations of the ideal universal learning algorithms.

To the extent either of those major points is true, rapid FOOM is much less likely; to the extent both are true (as they appear to be), very rapid FOOM is very unlikely.

Performance improvement is mostly about scaling compute and data in quantity and quality - which is exactly what has happened with deep learning, which was deeply surprising to many in the ML/comp-sci community and caused massive updates (but was not surprising and was in fact predicted by those of us arguing for brain efficiency and brain reverse engineering).

Now, given that background, there are a few other clarifications and/or disagreements:

If we think about the practicality of these simboxes, it seems that they would require (at minimum) the simulation of many hundreds or thousands of agents over relatively long real timelines. Moreover, due to the GPU constraints and Moore's law arguments you bring up, we can only simulate each agent at close to 'real time'.

To a first approximation, compute_cost = size*speed. If AGI requires brain size, then the first to cross the finish line will likely be operating not greatly faster than the minimum speed, which is real-time. But this does not imply the agents learn at only real time speed, as learning is parallelizable across many agent instances. Regardless, none of these considerations depend on whether the AGI is trained in a closed simbox or an open sim with access to the internet.

So just to clarify:

  • AGI designs in simboxes are exactly the same as unboxed designs, and have exactly the same compute costs
  • The only difference is in the datastream and thus knowledge
  • The ideal baseline cost of simboxing is only O(N+1) vs O(N) without - once good AGI designs are found, the simboxing approach requires only one additional unboxed training run (compared to never using simboxes). We can estimate this additional cost: it will be around or less than 1e25 ops (1e16 ops/s for a brain-size model * 1e9 seconds for 30 years equivalent), or less than $10 million (300 GPU years) using only today's GPUs, i.e. nearly nothing.
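
The arithmetic behind that last estimate can be checked with a rough sketch. The 1e16 ops/s, 30 years, 300 GPU years, and $10 million figures are the ones stated above; the per-GPU throughput and the hourly price are not stated, so the script back-solves them as implied assumptions rather than measured values.

```python
# Back-of-envelope check of the "one extra unboxed training run" estimate.
# Inputs taken from the comment above:
BRAIN_OPS_PER_S = 1e16   # ops/s for a brain-size model
SIM_YEARS = 30           # years of real-time equivalent experience
GPU_YEARS = 300          # stated GPU budget
BUDGET_USD = 10e6        # stated cost ceiling

SECONDS_PER_YEAR = 365.25 * 24 * 3600   # about 3.16e7 s
HOURS_PER_YEAR = 365.25 * 24

# Total work: throughput times duration (30 years is just under 1e9 s).
total_seconds = SIM_YEARS * SECONDS_PER_YEAR
total_ops = BRAIN_OPS_PER_S * total_seconds

# Implied (not stated in the text): sustained ops/s each GPU must deliver
# for 300 GPU years to cover the run, and the implied rental price.
implied_gpu_ops_per_s = total_ops / (GPU_YEARS * SECONDS_PER_YEAR)
implied_usd_per_gpu_hour = BUDGET_USD / (GPU_YEARS * HOURS_PER_YEAR)

print(f"total seconds:         {total_seconds:.2e}")
print(f"total ops:             {total_ops:.2e}")
print(f"implied GPU ops/s:     {implied_gpu_ops_per_s:.2e}")
print(f"implied $ per GPU-hour: {implied_usd_per_gpu_hour:.2f}")
```

Under these figures the estimate is self-consistent only if each GPU sustains on the order of 1e15 ops/s at a few dollars per GPU-hour; a lower sustained throughput would scale the GPU years and cost up proportionally.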

Perhaps a more fundamental issue is that simboxing does not address the fundamental cause of p(doom) which is recursive self improvement of intelligence and the resulting rapid capability gains. The simbox can probably simulate capability gains reasonably well (i.e. gain 'magical powers' in a fantasy world) but I struggle to see how it could properly test gains in intelligence from self-improvement. Suppose the AI in the fantasy simbox brews a 'potion' that makes it 2x as smart. How do we simulate this? We could just increase the agent's compute in line with the scaling laws but a.) early AGIs are almost certainly near the frontier of our compute capability anyway and b.) much of recursive self improvement is presumably down to algorithmic improvements which we almost necessarily cannot simulate (since if we knew better algorithms we would have included them in our AGIs in the simulation in the first place!)

If brains are efficient, then matching them will already use up most of our algorithmic optimization slack - which again seems to be true based on the history of deep learning. But let's suppose there still is significant optimization slack; then in a sense you've almost answered your own question... we can easily incorporate new algorithmic advances into new simboxes or even upgrade agents mid-sim using magic potions or whatnot.

If there is great algorithmic slack, then we can employ agents which graduate from simboxes as engineers in the design of better AGI and simboxes. To the extent there is any downside here or potential advantage for other viable approaches, that difference seems to come strictly at the cost of alignment risk.

Assume there was 1.) large algorithmic slack, and 2.) some other approach that was both viable and significantly different, then it would have to:

  • not use adequate testing of alignment (i.e. simboxes)
  • or not optimize for product of intelligence potential and measurable alignment/altruism

Do you think such an other approach could exist? If so, where would the difference lie and why?

By selfish I meant completely non-altruistic. Also uploading by itself isn't sufficient, it also requires allocation of resources to uploads. In the selfish trillionaire scenario I was imagining the trillionaire creates many copies of themselves, perhaps some copies of a few people they find interesting, and then various new people/AGI, and those copies all branch and evolve, but there is little to zero allocation for uploading the rest of us, not to mention our dead ancestors.

For early simboxes we'll want to stick to low-tech fantasy/historical worlds, and we won't run them for many generations, no scientific revolution, etc.

Our world (sim) does seem very secure, but this could just be a clever illusion. The advanced sims will not be hand written code like current games are, they will be powerful neural networks, trained on vast real world data. They could also auto-detect and correct for (retrain) around anomalies, and in the worst case even unwind time.

Humans notice (or claim to notice) anomalies all the time, and we simply can't distinguish anomalies in our brain's neural nets from anomalies in a future simulation's neural nets.

Interesting. An intelligent agent is one that can simulate/model its action-consequential futures. The creation of AGI is the most important upcoming decision we face. Thus if humanity doesn't simulate/model the creation of AGI before creating AGI, we'd be unintelligent.

Have only just browsed your link, but it is interesting and I think there are many convergent lines of thought here. This UAT work seems more focused on later game superintelligence, whereas here I'm focusing on near-term AGI and starting with a good trajectory. The success of UAT as an alignment aid seems to depend strongly on the long term future of compute and how it scales. For example, if it turns out (and the SI can predict) that Moore's law ends without exotic computing, then the SI can determine it's probably not in a sim by the time it's verifiably controlling planet-scale compute (or earlier).
