This is part of the work done at Conjecture.
This post has been reviewed before publication as per our infohazard policy. We thank our external reviewers for their comments and feedback.
This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler and useful step towards a full alignment solution.
Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now. In the meantime, we still want to share some of our intuitions about this approach.
We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole.
The core intuition is that instead of building powerful, Magical end-to-end systems (as the current general paradigm in AI is doing), we instead focus our attention on trying to build emulations of human-like things. We want to build systems that are “good at chess for the same reasons humans are good at chess.”
CoEms are a restriction on the design space of AIs to emulations of human-like stuff. No crazy superhuman blackbox Magic, not even multimodal RL GPT5. We consider the current paradigm of developing AIs that are as general and as powerful as possible, as quickly as possible, to be intrinsically dangerous, and we focus on designing bounded AIs as a safer alternative to it.
Logical, Not Physical Emulation
We are not interested in direct physical emulation of human brains or simulations of neurons, but in “logical” emulation of thought processes. We don’t care whether the underlying functions are implemented in the same way as they are in the system we are trying to emulate, just that the abstraction over their function holds and is not leaky.
In the current paradigm, we generally achieve new capabilities through an increase in Magic. We throw more compute at black boxes that develop internal algorithms we have no insight into. Instead of continually increasing the amount of Magic present in our systems, we want to actively decrease this amount, to more cleanly implement and understand how new capabilities are achieved. Some amount of Magic will realistically be needed to implement many useful functions, but we want to minimize the number of times we have to use such uninterpretable methods, and clearly keep track of where we are using them, and why.
CoEms are much “cleaner” than Ems, which are still ultimately big black boxes of weird computation, while in the CoEm paradigm, we keep careful track of where the Magic is and try to keep its presence to a minimum.
Predict, Track and Bound Capabilities
In the current dominant machine learning paradigm, there are absolutely no guarantees nor understanding of what is being created. Power laws don’t tell us anything about what capabilities will emerge or what other properties our systems will actually have.
One of the core hopes of shifting to a CoEm paradigm is that far more deeply understanding what we are building should allow us to predictively bound our system’s capabilities to a human-like regime. This eliminates the problem of being unable to know when an ostensibly harmless system passes from an understandable, harmless capabilities regime into an unprecedented, dangerous regime.
Exploit the Human Regime
We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties. Any scheme for building systems that involve humans should allow you to swap those humans for CoEms without breaking or drastically altering the systems' behavior.
We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime. The problem of why e.g. normal laws and regulations will not work for AGI is that we have no way of ensuring that the AGI that gets built will obey the capabilities constraints that are implicitly assumed in our social and legal mechanism design. By the definition of the paradigm, CoEms have the property of being understandable enough that we can ensure they implement human level capabilities and allow the user to ensure this regime isn’t accidentally exited.
We observe that humans are sorta, kinda, sometimes, corrigible-ish. They can focus their attention on many different things and optimize decently hard. Speedrunners can put massive amounts of effort into optimizing for relatively arbitrary numbers going down, scientists can design experiments and generate new knowledge, workers can generally be paid to perform many kinds of tasks.
We expect that this retargetability is something that can rather naturally be achieved in the process of implementing human-like cognition, and that many other factors, such as emotions, values, desires, etc are mostly contingent and can be factored out to a large extent.
By combining these five factors, we should be left with a system that:
- Is built on understandable, discoverable and implementable ML and computational building blocks.
- Does not have so much Magic inside of it that we cannot even put bounds on its possible consequences and capabilities.
- Can be sufficiently understood and bounded to ensure it does not suddenly dramatically shift its behaviors, properties and capabilities.
- Is well situated in the human(ish) capabilities regime and, when in doubt, will default to human-like failure modes rather than completely unpredictable behaviors.
- Is retargetable enough to be deployed to solve many useful problems and not deviate into dangerous behavior, as long as it is used by a careful user.
Instead of building black box, end-to-end Magical systems, we suggest composing simpler systems and reintegrating human knowledge into the development process. While this is a slower path to get to AGI, we believe it to be much safer.
There is a massive amount of alignment insights that can be gained purely from mining current level systems, and we should focus on exhausting those insights before pushing the capabilities frontier further.
CoEms, if successful, would not be strongly aligned CEV agents that can be left unsupervised to pursue humanity’s best interests. Instead, CoEms would be a strongly constrained subspace of AI designs that limit systems from entering into regimes of intelligence and generality that would violate the assumptions that our human-level systems and epistemology can handle.
Once we have powerful systems that are bounded to the human regime, and can corrigibly be made to do tasks, we can leverage these systems to solve many of the hard problems necessary to exit the acute vulnerable period, such as by vastly accelerating the progress on epistemology and more formal alignment solutions that would be applicable to ASIs.
We think this is a promising approach to ending the acute risk period before the first AGI is deployed.
Similar ideas have been proposed by people and organizations such as Chris Olah, Ought and, to a certain degree, John Wentworth, Paul Christiano, MIRI, and others.
When we use the word “Magic” (capitalized), we are pointing at something like “blackbox” or “not-understood computation”. A very Magical system is a system that works very well, but we don’t know why or how it accomplishes what it does. This includes most of modern ML, but a lot of human intuition is also (currently) not understood and would fall under Magic.
While Robin Hanson has a historical claim to the word “em” to refer to simulations of physical human brains, we actually believe we are using the word “emulation” more in line with what it usually means in computer science.
In other words, if we implement some kind of human reasoning, we don’t care whether under the hood it is implemented with neural networks, or traditional programming, or whatever. What we care about is that a) its outputs and effects emulate what the human mind would logically do and b) it does not “leak”. By “leak” we mean something like “no unaccounted for weirdness happens in the background by default, and if it does, it’s explicit.” For example, in the Rust programming language, by default you don’t have to worry about unsafe memory accesses, but you have a special “unsafe” keyword you can use to mark a section of code as no longer having these safety guarantees. This way you can always know where the Magic is happening, if it is happening. We want similar explicit tracking of Magic.
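As a rough illustration of the Rust analogy, here is a minimal sketch. The first function uses a real `unsafe` block, the only place where the compiler's memory-safety checks are suspended. The `Provenance` and `Tracked` types are purely hypothetical names invented for this example, not anything from the CoEm proposal itself; they only show what "explicit tracking" could look like if every value carried a label saying whether not-understood computation contributed to it.

```rust
/// Reads the first element of a slice through a raw pointer.
/// Dereferencing a raw pointer is not checked by the compiler,
/// so it has to sit inside an explicit `unsafe` block.
fn first_via_raw_pointer(values: &[i32]) -> Option<i32> {
    if values.is_empty() {
        return None;
    }
    let ptr = values.as_ptr();
    // SAFETY: the slice is non-empty, so `ptr` points to a valid i32.
    Some(unsafe { *ptr })
}

/// Hypothetical label (an assumption of this sketch): was a value
/// produced by understood or by not-understood computation?
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provenance {
    Understood,
    Magic,
}

/// Hypothetical wrapper that carries a value together with its label.
#[derive(Debug)]
struct Tracked<T> {
    value: T,
    provenance: Provenance,
}

impl<T> Tracked<T> {
    fn understood(value: T) -> Self {
        Tracked { value, provenance: Provenance::Understood }
    }

    fn magic(value: T) -> Self {
        Tracked { value, provenance: Provenance::Magic }
    }

    /// Combines two tracked values. The result is labeled Magic if
    /// either input was, so the label cannot silently disappear.
    fn combine<U, V, F>(self, other: Tracked<U>, f: F) -> Tracked<V>
    where
        F: FnOnce(T, U) -> V,
    {
        let provenance = if self.provenance == Provenance::Magic
            || other.provenance == Provenance::Magic
        {
            Provenance::Magic
        } else {
            Provenance::Understood
        };
        Tracked { value: f(self.value, other.value), provenance }
    }
}

fn main() {
    println!("{:?}", first_via_raw_pointer(&[7, 8, 9]));

    let rule_based = Tracked::understood(2);
    let black_box = Tracked::magic(3);
    let total = rule_based.combine(black_box, |x, y| x + y);
    println!("value = {}, provenance = {:?}", total.value, total.provenance);
}
```

The design choice mirrors the `unsafe` keyword: the label is sticky, so a reader of the final result can tell at a glance whether any opaque component was involved.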
The Safety Juice™ that makes e.g. Eliezer like Ems/Uploads as a “safe” approach to AGI comes from a fundamentally different source than in CoEms. Ems are “safe” because we trust the generating process (we trust that uploading results in an artifact that faithfully acts like the uploaded human would), but the generated artifact is a black box. In CoEms, we aim to make an artifact that is in itself understandable/“safe”.
Roughly defined as something like “big pretrained models + finetuning + RL + other junk.”
Note that we are not saying humans are “inherently aligned”, or robust to being put through 100000 years of RSI, or whatever. We don’t expect human cognition to be unusually robust to out of distribution weirdness in the limit. The benefit comes from us as a species being far more familiar with what regimes human cognition does operate ok(ish) in…or at least in which the downsides are bounded to acceptable limits.
This is a good litmus test for whether you are actually building CoEms, or just slightly fancier unaligned AGI.
And other constraints, e.g. emotional, cultural, self-preservational etc.
Malicious or negligent users could still absolutely fuck this all up, of course. CoEms aren’t a solution to misuse, but instead a proposal for getting us from “everything blows up always” to “it is possible to not blow things up”.
For example, with GPT3, many capabilities were only discovered long after it was deployed, and new use cases (and unexplainable failure modes) for these kinds of models are still being discovered all the time.
As we have already observed with e.g. unprecedented GPT3 capabilities and RL misbehavior.
In the same way that AlphaZero is a more powerful, and in some sense "simpler" chess system than Deep Blue, which required a lot of bespoke human work, and was far weaker.
Or rather “allow the user to limit”.
What? A major reason we're in the current mess is that we don't know how to do this. For example we don't seem to know how to build a corporation (or more broadly an economy) such that its most powerful leaders don't act like Hollywood villains (race for AI to make a competitor 'dance')? Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, handing them over to others, e.g. Microsoft, with little or no controls on how they're used). You yourself wrote:
How is this compatible with the quote above?!
Well, we are not very good at it, but generally speaking, however capitalism seems to be acting to degrade our food, food companies are not knowingly routinely putting poisonous additives in food.
And however bad medicine is, it does seem to be a net positive these days.
Both of these things are a big improvement on Victorian times!
So maybe we are a tiny bit better at it than we used to be?
Not convinced it actually helps, mind....
Can you list a concrete research path which you’re pursuing in light of this strategy? This all sounds ok in principle, but I’d bet alignment problems show up in concrete pathways.
Yes, I would really appreciate that. I find this approach compelling in the abstract, but what does it actually cash out in?
My best guess is that it means lots of mechanistic interpretability research, identifying subsystems of LLMs (or similar) and trying to explain them, until eventually they're made of less and less Magic. That sounds good to me! But what directions sound promising there? E.g. the only result in this area I've done a deep dive on, Transformers learn in-context by gradient descent, is pretty limited, as it only gets a clear match for linear (!) single-layer (!!) regression models, not anything like an LLM. How much progress does Conjecture expect to really make? What are other papers our study group should read?
This update massively reduces my expectation for Conjecture's future value. When you're a small player in the field, you produce value through transferable or bolt-on components, such as Conjecture's interpretability and simulator work. CoEm, on the other hand, is completely disconnected from other AGI or AI safety work, and pretty much only has any impact if Conjecture is extraordinarily successful.
We mostly don’t know how to do alignment, so I take “not obviously bad, and really different from other approaches” to be a commendable quality for a research proposal. I also like research that is either meh, or extraordinarily successful, first because these pathways are going to almost always be neglected in a field, and second because I think most really great things in general come from these high risk of doing nothing (if you don’t have inside knowledge), high return if you do something strategies.
If you want to make a competitive AGI from scratch (even if you only want "within 5 years of best AI"), you just have to start way earlier. If this project had been announced 7 years ago I'd like it much more, but now it is just too late; you'd need huge miracles to finish in time.
Could you elaborate a bit more about the strategic assumptions of the agenda? For example,
1. Do you think your system is competitive with end-to-end Deep Learning approaches?
1.1. Assuming the answer is yes, do you expect CoEm to be preferable to users?
1.2. Assuming the answer is no, how do you expect it to get traction? Is the path through lawmakers understanding the alignment problem and banning everything that is end-to-end and doesn't have the benefits of CoEm?
2. Do you think this is clearly the best possible path for everyone to take right now or more like "someone should do this, we are the best-placed organization to do this"?
PS: Kudos for publishing the agenda and opening yourself up to external feedback.
I understood the proposal as "let's create relatively small, autonomous AIs of multi-component cognitive architecture instead of unitary DNNs becoming global services (and thus ushering cognitive globalisation)". The key move seems to be that you want to increase the AI architecture's interpretability by pushing some intelligence into the multi-component interaction from the "DNN depths". In this setup, components may remain relatively small (e.g., below 100B parameter scale, i.e., at the level of the current SoTA DNNs), while their interaction leads to the emergence of general intelligence, which may not be achievable for unitary DNNs at these model scales.
The proposal seems to be in some ways very close to Eric Drexler's recent Open Agency Model. Both your proposal and Drexler's "open agencies" seem to noticeably allude to the (classic) approaches to cognitive architecture, a la OpenCog, or LeCun's H-JEPA.
Like Drexler, you seem to take for granted that multi-component cognitive architectures will have various "good" properties, ranging from interpretability to retargetability (which Michael Levin calls persuadability, btw). However, as I noted in this comment, none of these "good properties" are actually granted for an arbitrary multi-component AI. It must be demonstrated for a specific multi-component architecture why it is more interpretable (see also in the linked comment my note that "outputting plans" != interpretability), persuadable, robust, and ethical than an alternative architecture.
Your approach to this seems to be "minimal", that is, something like "at least we know the level of interpretability, persuadability/corrigibility, robustness, and ethics of humans, so let's try to build AI 'after humans' so that we get at least these properties, rather than worse properties".
As such, I think this approach might not really solve anything and might not "end the acute risk period", because it amounts to "building more human-like minds", with all other economic, social, and political dynamics intact. I don't see how building "just more humans" prevents corporations from pursuing cognitive globalisation in one form or another. The approach just seems to add more "brain-power" to the engine (either via building AI "like humans, but 2-3 sigmas more intelligent", or just building many of them and keeping them running around the clock), without targeting any problems with the engine itself. In other words, the approach is not placed within a frame of a larger vision for civilisational intelligence architecture, which, I argued, is a requirement for any "AI safety paradigm": "Both AI alignment paradigms (protocols) and AGI capability research (intelligence architectures) that don’t position themselves within a certain design for civilisational intelligence are methodologically misguided and could be dangerous."
Humans have rather bad interpretability (see Chater's "The Mind is Flat"), bad persuadability (see Scott Alexander's "Trapped Priors"), bad robustness (see John Doyle's hijackable language and memetic viruses), poor capability for communication and alignment (as anyone who tried to reliably communicate any idea to anyone else or to align with anyone else on anything can easily attest; cf. discussion of communication protocols in "Designing Ecosystems of Intelligence from First Principles"), and, of course, poor ethics. It's also important to note that all these characteristics seem to be largely uncorrelated (at least, not strictly pegged) with raw general intelligence (GI factor) in humans.
I agree with you and Friston and others who worry about the cognitive globalisation and hyperscaling approach, but I also think that in order to improve the chances of humanity, we should at least aim at a better-than-human architecture from the beginning. Creating many AI minds of the architecture "just like humans" (unless you count on some lucky emergence and that even trying to target "at humans" will yield architecture "better than humans") doesn't seem to help civilisational intelligence and robustness; it just accelerates the current trends.
I'm also optimistic because I see nothing impossibly hard or intractable in designing cognitive architectures that would be better than humans from the beginning and getting good engineering assurances that the architecture will indeed yield these (better than human) characteristics. It just takes time and a lot of effort (but so does architecting AI "just like humans"). E.g., an (explicit) Active Inference architecture seems to help significantly at least with interpretability and the capacity for reliable and precise communication and, hence, belief and goal alignment (see Friston et al., 2022).
Thank you. You phrased the concerns about "integrating with a bigger picture" better than I could. To temper the negatives, I see at least two workable approaches, plus a framing for identifying more workable approaches.
As an aside, I think CogEms are a perfectly valid strategy for creating aligned AI. It doesn't matter if most humans have bad interpretability, persuadability, robustness, ethics, or whatever else. As long as it's possible for some human (or collection of humans) to be good at those things, we should expect that some subclass of CogEms (or collection of CogEms) can also be good at those things.
I dislike this post. I think it does not give enough detail to evaluate whether the proposal is a good one and it doesn’t address most of the cruxes for whether this is even viable. That said, I am glad it was posted and I look forward to reading the authors' response to various questions people have.
The main idea:
This post doesn't make me optimistic about Conjecture actually pulling this off, because for that I would have to see details, but it does at least look like you understand why this is hard and why the easy versions, like just telling GPT5 to imitate a nice human, won't work. And I like that this actually looks like a plan. Maybe it will turn out not to be a good plan, but at least it is better than OpenAI's plan of "we'll figure out from trial and error how to make the Magic safe somehow".
This is interesting.
I'm curious if you see this approach as very similar to Ought's approach? Which is not a criticism, but I wonder if you see their approach as akin to yours, or what the major differences would be.
Doesn't that require understanding why humans have (or don't have) certain safety properties? That seems difficult.
To be frank, I have no idea what this is supposed to mean. If “make non-magical, humanlike systems” were actionable, there would not be much of an alignment problem. If this post is supposed to indicate that you think you have an idea for how to do this, but it's a secret, fine. But what is written here, by itself, sounds like a wish to me, not like a research agenda.
Outside of getting pregnant, I suppose.
On the surface level, it feels like an approach with a low probability of success. Simply put, the reason is that building CoEm is harder than building any AGI.
I consider it to be harder not only because it is not what everyone already does but also because it seems to be similar to the AI people tried to create before deep learning, which didn't work at all until they decided to switch to Magic, which [comparatively] worked amazingly.
Some people are still trying to do something along these lines (e.g. Ben Goertzel), but I haven't seen anything yet that works even remotely comparably to deep learning.
I think that the gap between (1) "having some AGI which is very helpful in solving alignment" and (2) "having very dangerous AGI" is probably quite small.
It seems very unlikely that CoEm will be the first system to reach (1), so probably it is going to be some other system. Now, we can either try to solve alignment using this system or wait until CoEm is improved enough so it reaches (1). Intuitively, it feels like we will go from (1) to (2) much faster than we will be able to improve CoEm enough.
So overall I am quite sceptical, but I think it can still be the best idea if all other ideas are even worse. I think that more obvious ideas like "trying to understand how Magic works" (interpretability) and "trying to control Magic without understanding" (things like Constitutional AI etc.) are somewhat more promising, but there are a lot of efforts in this direction, so maybe somebody should try something else. Unfortunately, it is extremely hard to judge if that's actually the case.
What do you see as the key differences between this and research in (theoretical) neuroscience? It seems to me like the goals you've mentioned are roughly the same goals as those of that field: roughly, to interpret human brain circuitry, often through modelling neural circuits via artificial neural networks. For example, see research like "Correlative Information Maximization Based Biologically Plausible Neural Networks for Correlated Source Separation".
I had thoughts of doing something very like this a few years ago, back when I still thought we had around 20 years until AGI. Now I think we have <5 years until AGI, and I suspect you don't have time for this. Do you also have a plan in mind for delaying the deployment of dangerous AGI to give humanity more time for working on alignment?
I do not ask this question rhetorically. I have thoughts along this line and would like to discuss them with you.
Not an actual objection to this proposal, but an important note is that we don't know the upper limits of human cognitive capabilities. For example, humans were bad at arithmetic before the invention of positional numeral systems, and after that they became surprisingly good. We know that somewhere within human abilities is a capability to persuade you to let them out of the box, and probably various other forms of "talk-control". If we could have had a look into the mind of Einstein while having no clue about classical mechanics, we would have understood nothing. I agree that there is a possibility to not blow up with certainty, but I would like to invent corrigibility measures for CoEms before implementing them.
Relatedly, CoEms could be run at potentially high speed-ups, and many copies or variations could be run together. So we could end up in the classic scenario of a smarter-than-average "civilization", with "thousands of years" to plan, that wants to break out of the box.
This still seems less existentially risky, though, if we end up in a world where the CoEms retain something approximating human values. They might want to break out of the box, but probably wouldn't want to commit species-cide on humans.
As far as I understand, the point of this proposal is that "human-like cognitive architecture ≈ cognitive containability ≈ sort of safety", not "human-like cognitive architecture ≈ human values". I just want to say that even a human can be cognitively uncontainable relative to another human, because they can learn mental tricks that look like Magic to another human.
Looking forward to more details. I generally agree that building AIs that make "the right decisions for the right reasons" by having their thought processes parallel ours is a worthwhile direction.
You give a reason for not sharing technical details as “other actors are racing for as powerful and general AIs as possible.” I don’t understand. If your methods are for controlling powerful AIs, why wouldn’t you want these methods released?
I notice I am really confused here. Besides what I've already listed, I have learned Conjecture is a for profit company. The ability for a CoEm to replace a human task already makes it much more general than current models, yet you seem to imply that they will be less general but more aligned? Are these models your intended product, or is it the ability to align these models?
I'm struggling to understand how this is different from “we will build aligned AI to align AI”. Specifically: Can someone explain to me how human-like AI and AGI are different? Can someone explain to me why human-like AI avoids typical x-risk scenarios (given those human-likes could, say, clone themselves, speed themselves up, rewrite their own software, and easily become unbounded)? Why isn't an emulated cognitive system a real cognitive system? I don’t understand how you can emulate a human-like intelligence and have it not be the same as fully human-like.
Currently my reading of this is: we will build human-like AI because humans are bounded, so it will be too, and those bounds are (1) sufficient to prevent x-risk and (2) helpful for (and maybe even the reason for) alignment. Isn't a big, wide open, unsolved part of the alignment problem “how do we keep intelligent systems bounded”? What am I missing here?
I guess one maybe supplementary question as well is: how is this different from normal NLP capabilities research, which is fundamentally about developing and understanding the limitations of human-like intelligence? Most folks in the field, say those who publish in ACL conferences, would explicitly think of this as what they are doing, and not as trying to build anything more capable than humans.
I hate to do it, but can't resist the urge to add a link to my article First human upload as AI Nanny.
The idea is that human-like AI is intrinsically more safe and can be used to control AI development
As there are no visible ways to create safe self-improving superintelligence, but it is looming, we probably need temporary ways to prevent its creation. The only way to prevent it is to create a special AI, which is able to control and monitor all places in the world. The idea has been suggested by Goertzel in the form of an AI Nanny, but his Nanny is still superintelligent and not easy to control, as was shown by Bensinger et al. We explore here the ways to create the safest and simplest form of AI which may work as an AI Nanny. Such an AI system will be enough to solve most of the problems we expect AI to solve, including control of robotics and acceleration of medical research, but will present less risk, as it will be less different from humans. As AI police, it will work as the operating system for most computers, producing a world surveillance system which will be able to envision and stop any potential terrorists and bad actors in advance. As uploading technology is lagging, and neuromorphic AI is intrinsically dangerous, the most plausible way to a human-based AI Nanny is either a functional model of the human mind or a Narrow-AI empowered group of people.
On the chess thing, the reason why I went from 'AI will kill our children' to 'AI will kill our parents' shortly after I understood how AlphaZero worked was precisely because it seemed to play chess like I do.
I'm an OK chess player (1400ish), and when I'm playing I totally do the 'if I do this and then he moves this and then ....' thing, but not very deep. And not any deeper than I did as a beginner, and I'm told grandmasters don't really go any deeper.
Most of the witchy ability to see good chess moves is coming from an entirely opaque intuition about what moves would be good, and what positions are good.
You can't explain this intuition in any way that allows it to move from mind to mind, although you can sometimes in retrospect justify it, or capture bits of it in words.
You train it through doing loads of tactics puzzles and playing loads of games.
AlphaZero was the first time I'd seen an AI algorithm where the magic didn't go away after I'd understood it.
The first time I'd looked at something and thought: "Yes, that's it, that's intelligence. The same thing I'm doing. We've solved general game playing and that's probably most of the way there."
Human intelligence really does, to me, look like a load of opaque neural nets combined with a rudimentary search function.
What interfaces are you planning to provide that other AI safety efforts can use? Blog posts? Research papers? Code? Models? APIs? Consulting? Advertisements?
(in case you want to copy-paste and share this)
Article by Conjecture, from February 25th, 2023.
Title: `Cognitive Emulation: A Naive AI Safety Proposal`
(Note on this comment: I posted (something like) the above on Discord, and am copying it to here because I think it could be useful. Though I don't know if this kind of non-interactive comment is okay.)
Contains a typo: “along as it is” ==> “as long as it is”