Summary / Preamble

In AGI Ruin: A List of Lethalities, Eliezer writes “A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.”

I have larger error-bars than Eliezer on some issues, but I share many of his concerns (thanks in large part to being influenced by his writings).

In this series I will try to explore if we might:

  • Start out with a superintelligent AGI that may be unaligned (but seems superficially aligned)
  • Only use the AGI in ways where its channels of causal influence are minimized (and where great steps are taken to make it hard for the AGI to hack itself out of the "box" it's in)
  • Work quickly but step-by-step towards an AGI-system that probably is aligned, enabling us to use it in more and more extensive ways (as we get more assurances that it's aligned)

The reasons for exploring these techniques are two-fold:

  • Maybe we won't solve alignment prior to getting superintelligence (even though it would be better if we did!)
  • Even if we think we have solved alignment prior to superintelligence, some of the techniques/strategies outlined here could be encouraged as best practice, so that we get additional layers of alignment-assurance.

The strategy as a whole involves many iterative and contingency-dependent steps working together. I don't claim to have a 100% watertight and crystalized plan that would get us from A to B. Maybe some readers could be inspired to build upon some of the ideas and/or analyze them more comprehensively.

I intend to write at least 4 parts:

  • Part 1 (this one): A preamble of sorts.
  • Part 2 (half-finished draft here): Discusses how an AGI could be made to output so-called "generators". These are programs that are specialized/narrow compared to an AGI, and score high in terms of transparency and verifiability. Discusses how such programs can be used and combined in powerful and iterative ways to help with alignment. Many of the strategies/techniques involve trying to make it hard for the AGI to pretend that it's giving us what we want without actually giving us what we want.
  • Part 3: Focuses on AI-generated computational proofs/argument-trees, and methods/strategies for verifying such proofs/arguments. Outlines ideas for how a formalism could try to incorporate human cluster-like concepts. And how a formalism might account for (1) vagueness of concepts and (2) formalism-to-reality mappings, from within itself.
  • Part 4 (not written): Outlines/discusses how different pieces/strategies might be put together. Can we get from unaligned AGI to aligned AGI without being "tricked" along the way, and without being stopped by chicken-or-egg problems? A desirable intermediate step could be to make a system consisting of various "siloed" AGI-systems (that are aligned based on different alignment methodologies, letting us see if they have converging outputs/answers). Discusses strategic considerations, and brainstorms possible first steps when using AGI-system to do things outside of the digital realm.

Are these ideas new?

TLDR: Many ideas in this series have been written about before by others. But it also seems to me that some of the ideas are new and important, or if not new then at least under-discussed. But I could be wrong.

 

There are lots of big and small ideas in this series. Many of them are mentioned or alluded to elsewhere. But there are also ideas in this series that (1) seem to me like they're probably important and (2) I cannot remember having seen described elsewhere.
 

I've had a hobby interest in the topic of superintelligence since 2009, and in the topic of AI Alignment since 2014. So I've read a lot of what has been written, but there is also a lot that I haven't read. I could be presenting ideas in this series that seem new/interesting to me, but that actually are old.

Here are some writings I am aware of that I think are relevant to ideas in this series (some of them have influenced me, and some of them I know overlap quite a bit):

  • Eric Drexler has written extensively about principles and techniques for designing AGI-systems that don't have agent-like behavior. In Comprehensive AI Services as General Intelligence he - well, I'm not gonna do his 200-page document justice, but one of the things he writes about is having AGI-systems consisting of more narrowly intelligent sub-systems (that are constrained and limited in what they do).
  • Paul Christiano and others have written about concepts such as Iterated Distillation and Amplification and AI safety techniques revolving around debate.
  • Several people have talked about using AIs to help with work on AI safety. For example, Paul Christiano writes: "I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas, and proposing modifications to proposals, etc. and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research. By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research."
    • To me both Eliezer's and Paul's intuitions about FOOM seem like they plausibly could be correct. However, I think my object-level intuitions are more similar to Eliezer's than Paul's. Which partly explains my interest in how we might go about getting help from an AGI-system with alignment after it has become catastrophically dangerous. If AI-systems help us to more or less solve alignment before they become catastrophically dangerous, then that would of course be preferable - and, I dunno, perhaps they will, but I prefer to contemplate scenarios where they don't.
  • Several people have pointed out that verification often is easier than generation, and that this is a principle we can make heavy use of in AI Alignment. Paul Christiano writes: "Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. (...) I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain."
    • It seems to me as well that Eliezer is greatly underestimating the power of verifiability. At the same time, I know Eliezer is really smart and has thought deeply about AI Alignment. Eliezer seems to think quite differently from me on this topic, but the reasons why are not clear to me, and this irks me.
  • Steve Omohundro (a fellow enthusiast of proofs, verifiability, and of combining symbolic and connectionist systems) has written about Safe-AI Scaffolding. He writes: "Ancient builders created the idea of first building a wood form on top of which the stone arch could be built. Once the arch was completed and stable, the wood form could be removed. (...) We can safely develop autonomous technologies in a similar way. We build a sequence of provably-safe autonomous systems which are used in the construction of more powerful and less limited successor systems. The early systems are used to model human values and governance structures. They are also used to construct proofs of safety and other desired characteristics for more complex and less limited successor systems."
  • There are at least a couple of posts on LessWrong with titles such as Bootstrapped Alignment.
  • As was pointed out to me, Stuart Armstrong has written about Low-Bandwidth Oracles.

I am reminded a bit of experiments where people are told to hum a melody, and how it's often quite hard for others to guess what song you are humming. AI Alignment discussion feels to me a bit like that sometimes - it's hard to convey exactly what I have in mind, and hard to guess exactly what others have in mind. Often there is a lot of inferential distance, and we are forced to convey great networks of thoughts and concepts over the low bandwidth medium of text/speech.

I am also reminded a bit about Holden Karnofsky's article Thoughts on the Singularity Institute (SI), where he wrote:
 

One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a "tool" and giving arguments for why AGI is likely to work only as an "agent."


In his response to this, Eliezer wrote the following:

Tool AI wasn't the obvious solution to John McCarthy, I.J. Good, or Marvin Minsky. Today's leading AI textbook, Artificial Intelligence: A Modern Approach - where you can learn all about A* search, by the way - discusses Friendly AI and AI risk for 3.5 pages but doesn't mention tool AI as an obvious solution. For Ray Kurzweil, the obvious solution is merging humans and AIs. For Jurgen Schmidhuber, the obvious solution is AIs that value a certain complicated definition of complexity in their sensory inputs. Ben Goertzel, J. Storrs Hall, and Bill Hibbard, among others, have all written about how silly Singinst is to pursue Friendly AI when the solution is obviously X, for various different X. Among current leading people working on serious AGI programs labeled as such, neither Demis Hassabis (VC-funded to the tune of several million dollars) nor Moshe Looks (head of AGI research at Google) nor Henry Markram (Blue Brain at IBM) think that the obvious answer is Tool AI. Vernor Vinge, Isaac Asimov, and any number of other SF writers with technical backgrounds who spent serious time thinking about these issues didn't converge on that solution.

In conclusion, I wouldn't write this series if I didn't think it could be useful. But as a fellow hairless monkey trying to do his best, it's hard for me to confidently distinguish ideas that are new/helpful from ideas that are old/obvious/misguided. I appreciate any help in disabusing me of false beliefs.


Start of inner dialogue about AGI-assisted alignment

Me: I have some ideas about how to make use of an unaligned AGI-system to make an aligned AGI-system.

Imaginary friend: My system 1 is predicting that a lot of confused and misguided ideas are about to come out of your mouth.

Me: I guess we’ll see. Maybe I'm missing the mark somehow. But do hear me out.

Imaginary friend: Ok.

Me: First off, do we agree that a superintelligence would be able to understand what you want when asking for something, presuming that it is given enough information?

Imaginary friend: Well, kind of. Often there isn’t really a clear answer to what you want.

Me: Sure. But it would probably be good at predicting what looks to me like good answers. Even if it isn’t properly aligned, it would probably be extremely good at pretending to give me what I want. Right?

Imaginary friend: Agreed.

Me: So if I said to it “show me the best source code you can come up with for an aligned AGI-system, and write the code in such a way that it’s as easy as possible to verify that it works as it should”, then what it gave me would look really helpful - with no easy way for me to see a difference between what I’m provided and what I would be provided if it was aligned. Right?

Imaginary friend: I guess I sort of agree. Like, if it answered your request it would probably look really convincing. But maybe it doesn’t answer your question. It could find a security vulnerability in the OS and hack itself onto the internet somehow, and it would be game over before you even got to ask it any questions. Or maybe you didn’t even try to box it in the first place, since you didn’t realize how capable your AI-system was getting, and it was hiding its capabilities from you.

Imaginary friend: Or maybe it socially manipulated you in some really clever way, or “hacked” your neural circuitry somehow through sensory input, or figured out some way it could affect the physical world from within the digital realm (e.g. generating radio waves by “thinking” in ways that sends electrons in a particular pattern, or maybe exploiting principles of physics that humans aren't aware of).

Imaginary friend: When dealing with a system that may prefer to destroy us (for instrumentally convergent reasons), and that may be orders of magnitude smarter than ourselves - well, it's better to be too careful than not paranoid enough.

Me: I agree with all that. But it’s hard to cover all the branches of things that should be considered in one conversation-path. So for the time being, let’s assume a hypothetical situation where the AI is “boxed” in. And let’s assume that we know it’s extremely capable, and that it can’t “hack” itself out of the box in some direct way (like exploiting a security flaw in the operating system). Ok?

Imaginary friend: Ok.

Me: I presume you agree that there are more and less safe ways to use a superintelligent AGI-system. To take an exaggerated example: There is a big difference between “letting it onto the internet” and “having it boxed in, only giving it multiplication questions, and only letting it answer yes or no”.

Imaginary friend: Obviously. But even if you only give it multiplication-questions, some less scrupulous team will sooner or later develop a superintelligence and use it in less careful ways.

Me: Sure. But still, we agree that there are more and less safe ways to try to use an AGI? There is a “scale” of sorts?

Imaginary friend: Of course.

Me: Would you also agree that there is a “scale” for how hard it is for an oracle/genie to “trick” you into falsely believing that it has provided you with what you want? For example, if I ask it to prove a mathematical conjecture, it is much harder for it to “pretend” to do that the way I want without actually doing it (compared to most things)?

Imaginary friend: Sure.

Me: What I want to talk about are ways of asking an AGI genie/oracle for things in ways where it’s hard for it to “pretend” that it’s giving us what we want without doing it. And ways we might leverage that to eventually end up with an aligned AGI-system, while trying to keep the total risk (of all the steps we take) low.
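Me: To make that “scale” concrete with a toy sketch (the numbers and function names here are just hypothetical stand-ins): checking a claimed factorization takes one multiplication, while finding the factors is the hard part. Requests whose answers sit at this end of the scale are the ones that are hard to “pretend” about.

```python
# Toy illustration of the generation/verification asymmetry:
# checking a claimed factorization is cheap, even though finding
# the factors may be expensive.

def verify_factorization(n: int, factors: list[int]) -> bool:
    """Accept a claimed factorization only if it multiplies back to n
    and every claimed factor is a nontrivial divisor."""
    if any(f <= 1 for f in factors):
        return False
    product = 1
    for f in factors:
        product *= f
    return product == n

# Verification is a single multiplication, regardless of how the
# untrusted party found the factors.
print(verify_factorization(2021, [43, 47]))   # True
print(verify_factorization(2021, [3, 674]))   # False
```

The point of the sketch is only that the verifier needs none of the cleverness of the generator - a property we'd like as many of our requests to the AGI as possible to share.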

Imaginary friend: My system 1 suspects I am about to hear some half-baked ideas.

Me: And your system 1 may have a point. I don’t claim to have detailed and watertight plans/arguments bringing us every step of the way from A to B. What I have is an outline of how we might get from an unaligned superintelligent AGI-system to an aligned superintelligent AGI-system, while minimizing risk along the way (not to 0%, but to way less than 50% if suggestions are executed faithfully). The different parts of the strategy vary in terms of how crystalized they are in my mind.

Me: But it seems to me, based on my limited understanding, that this kind of approach might be underprioritised. And, believe me, I don’t have any kind of full overview or understanding of everything that has been explored in the AI alignment community…

Imaginary friend: I believe you.

Me: …but it seems to me that techniques and paths for getting from unaligned AGI-systems to aligned AGI-systems are an area where there is much low-hanging fruit.

Imaginary friend: It sounds like you are looking for ways to avoid having to solve the alignment problem, so that you can focus on other things instead of solving it beforehand.

Me: I guess it can be interpreted that way. I don’t want it to come across that way though! If we find reliable methods to align superintelligent AGIs before we get superintelligent AGIs, then that would improve our odds of success! But maybe we don’t. If we don’t, then it’s better to have some plans ready, so that less improvisation is necessary. AFAIK there is no fire alarm for AGI, and it certainly doesn't seem too early to start analyzing these kinds of scenarios in a more comprehensive way.

Me: And also, even if we think we have solved alignment in a reliable way prior to superintelligence, some of the strategies/principles I’ll discuss in this series can be used as an additional layer of security/alignment-assurance. Hence me thinking that it may be a good idea to develop these ideas further, and to encourage them as best practice - but as something that is done in addition to other work on alignment, not instead.

Restrictions in expressivity of AI

Me: So the first part of this discussion will focus on things that we can do while the AGI stays in the digital realm. We will get to the question of how to do complicated things in the non-digital realm, but that’s for later.

Imaginary friend: You understand that any use of an AGI enables it to affect the non-digital realm, right? If you let it write or modify code, that is a way for the AGI to communicate with humans, and communicating with humans is a way to affect the non-digital realm.

Me: Yes, I agree with that. But I would point out that it is up to us how much “expressivity” we give the AI-system. For example, if we get help from it to point out potential mistakes/bugs in some piece of code, maybe we don’t let it write text to explain itself (limiting instead to things such as pointing out areas of code to take an extra look at, highlighting examples of input that may produce unintended behavior, etc). And if we do let it write text, then it is up to us how we limit the expressivity of what it writes in text.

Imaginary friend: Ok, did you by the way notice that you spoke about not letting it write text - but letting it write example input? Well, example input can be text! Those kinds of things can be easy to miss, and often you may not be aware of what you’re missing. And also, any way to point out bugs or potential problems is a way for the AI to affect how you think and what you think about. Take this test of selective attention for example. Or look at how magicians focus and misdirect people's attention to trick them. There are lots of things like this.

Me: I may miss such things sometimes, but sometimes I will also leave them out even though I’m aware of them, so as to not get into tangents all the time while I’m trying to outline main concepts. I did notice that one though. Partly thanks to you helping me to get better at noticing those sorts of things.

Imaginary friend: That’s what imaginary friends are for.

Me: I’m certainly not claiming that the techniques I’ll describe are risk free. At the same time, it’s important to consider how different techniques can come together to reduce risk. For example, one thing that will be touched upon again and again is testing the convergence of the results from different methods and architectures and pieces of code. Preferably, if an AI-system writes the code for some module, then the overall system should ensure that we also have that module be written by other AI-systems, and also by other humans if possible.

Me: In regards to e.g. AIs pointing out potential bugs/issues, the system as a whole (a system of systems of systems..) would preferably keep track of the changes made due to such suggestions, and also try to keep track of what the output of the system would be if those changes hadn’t been made.

Imaginary friend: Ok. But just because something makes you more safe, and you can handwave at a lot of techniques where each thing helps a bit with safety - this does not show that the system as a whole will be safe. And I am a bit worried about you proposing ideas that are hard to criticize because they’re complicated/vague, instead of getting to the “core” of the problem. Also, adding on lots of complication to increase security can in some cases be counter-productive.

Me: Some good points there. But I feel uncertain about the degree to which there is a “core” of the problem. I want this series to outline a path towards a more "complete" solution, but I want to start by focusing more on techniques and principles that "building-blocks" of the solution may make use of.

Verifying code equivalence

Me: One tool that we will come to at various points (and combine with other techniques) is the verification of equivalence in code behavior. By code behavior I mean what it takes as input, and what it gives as output (either the values themselves, or properties/claims about the values, such as them being within a certain range/”space”).

Me: But sometimes it can make sense for the proofs to also take into account run-time, properties of how the code affects the system that it runs on, etc.

Imaginary friend: What would those proofs look like?

Me: The proofs should of course be computable, and preferably it would be possible to display them in a format that makes the “logic” of them as clear as possible to humans. If they can be made to look visual in some user-friendly GUI then all the better.

Me: It could be fine for some part/parts of the AI-system to take part in defining the format. But thinking beforehand of what the proof-format should look like and the properties it should have could lead to more clarity of thought, which is better than having to improvise after or right before the creation of AGI.

Me: And the AI-system should be used when verifying the proof-format itself, and the various “inference-steps” that are allowed. How to verify the validity of “inference-steps” - well, I don’t know the exact details, but the more “proofs” (that each seems sufficient by itself to a human) the better. After all, this follows from the principle of “Why not both?”.

Me: Preferably the verification-process would include proofs that are “mathematical”, but it is also important to consider that the behavior of code is something that can be tested (while more or less staying within the digital realm). One counterexample is enough to disprove the validity of a given set of accepted inference-steps.

Me: Btw, if we can prove the equivalence of code we can use that as a building-block towards proving what is different between the behavior of two pieces of code (e.g. “these pieces of code behave the same, with the exception of input that is within the range x”).
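Me: To give a minimal sketch of the testing side of this (not the proof side - full behavioral equivalence is undecidable in general, so testing can only disprove equivalence, never establish it): we can search for a counterexample input on which two implementations disagree. The functions and input distribution below are hypothetical stand-ins.

```python
import random

def find_behavioral_difference(f, g, input_gen, trials=10_000):
    """Search for an input where two implementations disagree.
    Finding one counterexample disproves equivalence; finding none
    is only evidence, not proof."""
    for _ in range(trials):
        x = input_gen()
        if f(x) != g(x):
            return x  # counterexample: behaviors differ here
    return None

# Two implementations that agree everywhere except on negative input:
f = lambda x: abs(x) * 2
g = lambda x: x * 2

diff = find_behavioral_difference(f, g, lambda: random.randint(-100, 100))
# `diff` will (with overwhelming probability) be some negative integer,
# characterizing the region where the behaviors differ.
```

A counterexample like this is exactly the kind of output mentioned above: “these pieces of code behave the same, with the exception of input that is within the range x”.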

Topics I'll cover in this series

Imaginary friend: So where do you see this discussion going?

Me: I guess we’ll see. But I do have some plans.

Me: One thing I want to discuss is the concept of “solution-generators” (and maybe also “solution-generators-generators”, etc). That is to say, when asking an oracle/genie for something, it may often make sense to not ask the question directly, but to instead ask for a “solution-generator” - that is to say, a function that generates solutions within the domain in question. These “generators” should be optimized for being narrow/specialized/modular/verifiable (and techniques for trying to ensure that the system is faithfully optimizing the “generator” for these properties is a topic by itself).

Me: For example, if we want to do a physics-simulation, then we might want to use a “simulation-generator” or “simulation-generator-generator”. One merit of this is that the code can be easier to inspect and understand than the code of a superintelligent AGI. Another merit is that the “generator” can be tested against existing data (and what data we want to test “generators” against is something that it is possible to plan and prepare for before we have AGI).
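Me: A minimal sketch of that vetting step, with every name here a hypothetical placeholder: rather than trusting an answer directly, we request a narrow program (the generator) and score it against data we already trust before letting it loose on new inputs.

```python
# Sketch of the "generator" idea: vet a proposed narrow program against
# held-out cases with known answers. A generator that fails on data we
# can check earns no trust on data we cannot.

def vet_generator(generator, known_cases, tolerance=1e-6):
    """Return the held-out cases the proposed generator gets wrong
    (an empty list means it passed all checks)."""
    return [
        (inputs, expected)
        for inputs, expected in known_cases
        if abs(generator(inputs) - expected) > tolerance
    ]

# Example: vet a claimed kinetic-energy "generator" against known physics.
proposed = lambda mv: 0.5 * mv[0] * mv[1] ** 2    # the AI's narrow program
known = [((2.0, 3.0), 9.0), ((1.0, 10.0), 50.0)]  # (mass, velocity) -> energy
assert vet_generator(proposed, known) == []        # passes the held-out data
```

Passing held-out checks is of course only evidence, not proof - but it is evidence we can accumulate on our own terms, with data chosen and curated before the AGI ever sees the request.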

Me: But of course, you can also ask it to answer the question directly (in addition to asking for the “solution-generator”). That follows from the principle of “Why not both?”.

Imaginary friend: Isn’t it better to not ask it the specific question, so that it’s harder for it to guess what you’re after?

Me: I will mostly not be mentioning this sort of thing explicitly, but yes I agree, you shouldn’t ask both questions of the same instance of the AGI-system. The instance that is asked for A should not know that you also have asked it for B. Though of course, you could have additional instances that are asked both questions, and check if results/output converge.

Me: Another topic I want to speculate a bit about is the topic of brain emulations, and also other “generators” that try to predict the behavior/thinking of humans. I’m interested in the question of “Can we bootstrap from AGI to either brain emulations and/or something that can predict human behavior/thinking while more or less staying within the digital realm?”. Obviously there are lots of ethical considerations here, and it’s very important to avoid suffering subroutines insofar as possible!

Me: I also want to discuss techniques involving computable argument-trees/”proofs”, where for every step of inference the inference-rule (how you get output/conclusion from input/arguments) is made explicit and computable. As well as outlining possible techniques to have the concepts/propositions of such proofs represent more or less any thought that is sufficiently “clear/crisp/explicit” (blurring the distinction between “mathematical” proofs and any other argument about anything). Included in the discussion will be outlines of ideas for how to deal with “vagueness” and cluster-like concepts within such argument-trees/”proofs”.
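Me: As a minimal sketch of what “every step of inference made explicit and computable” could mean (the rule names and the propositional encoding here are illustrative assumptions, nothing more): each node of the tree records its premises and the named rule that produced its conclusion, so a dumb verifier can recheck every step mechanically.

```python
# Sketch of a computable argument-tree: a verifier rechecks each node by
# re-running its named inference rule on the conclusions of its premises.

from dataclasses import dataclass, field

def modus_ponens(premises):
    """From ("implies", A, B) and A, conclude B; otherwise refuse."""
    if (len(premises) == 2 and premises[0][0] == "implies"
            and premises[0][1] == premises[1]):
        return premises[0][2]
    return None

RULES = {"modus_ponens": modus_ponens}

@dataclass
class Node:
    conclusion: object
    rule: str                                   # "axiom" or a RULES key
    premises: list = field(default_factory=list)  # child Nodes

def verify(node) -> bool:
    """Recheck the whole tree: every non-axiom step must be reproduced
    by its named rule from its premises' conclusions."""
    if node.rule == "axiom":
        return True
    derived = RULES[node.rule]([p.conclusion for p in node.premises])
    return derived == node.conclusion and all(verify(p) for p in node.premises)

# "If it rains the ground is wet; it rains; therefore the ground is wet."
rain = Node("rains", "axiom")
implication = Node(("implies", "rains", "wet"), "axiom")
proof = Node("wet", "modus_ponens", [implication, rain])
assert verify(proof)
```

The interesting (and hard) part, which later posts will speculate about, is extending the set of allowed rules and the representation of propositions so that “cluster-like” human concepts and vague claims can live inside such trees - while keeping each inference-step computable.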

Me: And I’ll be outlining thoughts about capabilities that I think will help with verifying that instructions for doing things in the real world (developing new types of machines and that sort of thing) work as intended. Such as for example copying a strawberry at the molecular level without unintended consequences. Among other things there will be some focus on “generators” for mappings between (1) models/ontologies, and (2) data-structures representing geometric structures (e.g. some sort of physics-simulation), and (3) real things in the actual world that the models are meant to refer to.

Me: There are various other things I plan to touch upon as well. Some of it having to do with verification-techniques, but also much about various other things. And maybe more will come up as I write. For a preview it is possible to take a look at a half-finished early draft for an earlier version of this article-series, which contains some of the stuff for future sections (but as mentioned, it’s half-finished).

Me: The more people there are who (1) are smart and have thought a lot about something and (2) see things differently from you, the more reason for self-doubt about your own judgment. And this is for me a significant source of uncertainty about my ideas in regards to alignment (and AI more generally). But it seems best to me to just try to describe my perspective as well as I can, and then people can do with that what seems best to them.

Imaginary friend: Talk to you later then.


Any feedback or comments (be that positive or negative or neither) would be received with interest.


I skimmed the post, here is some feedback (context):

  • I'm probably not the best person to red team this since some of my own alignment ideas are along similar lines. I'm also a bit on the optimistic side about alignment more generally -- it might be better to talk to a pessimist.

  • This sounds a bit like the idea of a "low-bandwidth oracle".

  • I think the biggest difficulty is the one you explicitly acknowledged -- boxing is hard.

  • But there are also problems around ensuring that bandwidth is actually limited. If you have a human check to see that the AGI's output conforms to the spec, the AGI could put an infohazard in the output, or write code which looks like it does one thing but actually does another. If you have a computer program or AI check the AI's output, the AGI could exploit a flaw in that program/AI. I think this sort of thing basically gets you a probabilistic safety measure, because there's always a risk that there's a flaw that the superintelligent AGI sees (or can infer) that you don't see. (I like this intuition pump for seeing why these sort of problems are plausible.) I think probabilistic safety measures can be good if we stack a lot of them together in the right way.

  • The idea of emitting machine-checkable proofs is interesting. I'm not sure such proofs are very useful though. "Finding the right spec is one of the biggest challenges in formal methods." - source. And finding the right spec seems more difficult to outsource to an unfriendly AI. In general, I think using AI to improve software reliability seems good, and tractable.

I think you'll find it easier to get feedback if you keep your writing brief. Assume the reader's time is valuable. Sentences like "I will mention some stuff later that maybe will make it more clear how I’d think about such a question." should simply be deleted -- make huge cuts. I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post. Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.

Similarly, I think it is worth investing in clarity. If a sentence is unclear, I have a tendency to just keep reading and not mention it unless I have a prior that the author knows what they're talking about. (The older I get, the more I assume that unclear writing means the author is confused and ignorable.) I like writing advice from Paul Graham and Scott Adams.

Personally I'm more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading. I don't have much time to do feedback right now unfortunately.

> This sounds a bit like the idea of a "low-bandwidth oracle".

Thanks, that's interesting. Hadn't seen that (insofar as I can remember). Definitely overlap there.
 

> I think probabilistic safety measures can be good if we stack a lot of them together in the right way.

Same, and that's a good/crisp way to put it. 
 

> Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.

Will edit at some point so as to follow the first part of that suggestion. Thanks!
 

> I think you'll find it easier to get feedback if you keep your writing brief. (..) I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post.

Some things in that bullet-list address stuff I left out to cut length, and stuff I thought I would address in future parts of the series. I also found those parts of the bullet-list helpful, but still, this exemplifies dilemmas/tradeoffs regarding length. I will try to make more of an effort to look for things to make shorter, based on your advice. And I should have read through this one more time before publishing.

> Me: So if I said to it “show me the best source code you can come up with for an aligned AGI-system, and write the code in such a way that it’s as easy as possible to verify that it works as it should”, then what it gave me would look really helpful - with no easy way for me to see a difference between what I’m provided and what I would be provided if it was aligned.

Is this really true? I would guess that we might be so far from solving alignment that nothing would look particularly helpful? Or, even worse, the only thing that would look helpful is something completely wrong?

Thanks for commenting :)

> I would guess that we might be so far from solving alignment that nothing would look particularly helpful?

My thinking is that using reinforcement-learning-like methods will select towards systems that look like they are aligned / optimized for what we are trying to optimize them for. If the system gives answers/solutions/etc where humans can see that it doesn't really optimize well for what we want it to optimize for, then I presume it would be tweaked further until that no longer was the case. For example, suppose that we get it to write code for us, and we select how easy the code is to read/understand for humans as an optimization-criterion, and then it returns code that has clear examples of things that could be improved - well, then it would presumably be tweaked further (and it would presumably be a bad strategy for it if it was trying to trick us into thinking it was aligned).

That being said, it is a possibility that a system could be pretending that it's less capable than it really is. That way it could return answers that didn't look particularly helpful. Which I guess sort of makes the answer to your question "yes". As in, there could be scenarios with AGI/superintelligence where it is "pretending" to not be an AGI/superintelligence. And then there could be other scenarios where it's not hiding that it's really capable, but where it does pretend to be less capable than it really is at certain types of things. But there would presumably be incentives for it to not do that to such a degree that it became easy for humans to notice that it isn't doing its best. (Btw, I am consciously leaving out a few types of conceivable scenarios so as to make this comment less jumbled.)

Most of this series will focus on scenarios where we know that the system in question is superintelligent, and where we had the foresight to "box" it in (making it hard to hack itself out through security vulnerabilities in operating system and that sort of stuff) well before it acquired anything resembling superintelligence (or something that plausibly might self-improve in a self-reinforcing feedback-loop towards anything resembling superintelligence).

> Or, even worse, the only thing that would look helpful is something completely wrong?

Yes, that is a major worry. And that is one of the main concerns/problems/difficulties that I'm hoping to address in this series (not the only one, but one of the major ones). That is to say, I want to outline strategies/techniques that (I think) can be put together to make it as hard and as unlikely as possible for the AI-system to be able to trick us into thinking it is being helpful (giving us what we want) while it really isn't.