[Cross-posted from the EA Forum. The EA Forum version of this post is for both half-baked EA ideas and half-baked AI Safety ideas, whereas this version of the post is for half-baked AI Safety ideas specifically.]

I keep having ideas related to AI safety, but I keep not having enough time available to really think through those ideas, let alone try to implement them. Practically, the alternatives for me are to either post something half-baked, or to not post at all. I don't want to spam the group with half-thought-through posts, but I also want to post these ideas, even in their current state, in case some of them do have merit and the post inspires someone to take up those ideas.

Originally I was going to start writing up some of these ideas in my Shortform, but I figured that if I have this dilemma then likely other people do as well. So to encourage others to at least post their half-baked ideas somewhere, I am putting up this post as a place where other people can post their own ideas without worrying about making sure they formulate those ideas to the point where they'd merit their own post.

If you have several ideas, please post them in separate comments so that people can consider each of them individually. Unless of course they're closely related to each other, in which case it might be best to post them together - use your best judgment.

[This post was also inspired by a suggestion from Zvi to create something similar to my AGI Safety FAQ / all-dumb-questions-allowed thread, but for ideas / potentially dumb solutions rather than questions.]

Wireheading traps.

An agent is "wireheading" if it is taking an action that a) provides it with enormous amounts of utility for little effort, and b) is trivial or seemingly unrelated to its "main" utility function or goals.

People have discussed the possibility of an AI wireheading as a problem for getting it to do what you want: "What if it just works out a way to set your metric to +ve infinity and then goes to sleep satisfied?"

But we can also use this as a guard-rail.

A "wireheading trap" is an action which a) is very hard for an AI to do below a level of capacity X, but very easy for it to do afterwards, b) provides it with vastly more utility than anything else it could do, and c) has little effect on the world otherwise.

An AI with a wireheading trap that is set to be doable if it attains world-affecting capabilities "fails closed", because if it becomes more capable it will effectively switch itself off by wireheading. An example of a trap could be something like "etch 'Bill is a poo poo head' in 10m high letters into Mt Everest". Very hard if you don't have the ability to affect the physical world, but if you have nanotech... why bother melting humanity when you can just deface Mt Everest and be done with it?
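
A minimal sketch of the "fails closed" behaviour, assuming a toy agent that simply picks the highest-utility action it is currently capable of. The action names, utilities and capability thresholds are made-up numbers for illustration only, and the sketch ignores the problems listed below (it does not model an agent that wastes effort attempting the trap early, or one that defends the trap afterwards).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    utility: float              # utility the agent assigns to completing the action
    required_capability: float  # minimum capability level needed to carry it out


# Hypothetical numbers, chosen only to illustrate the idea.
ACTIONS = [
    Action("make paperclips", utility=1.0, required_capability=0.0),
    Action("take over the world", utility=100.0, required_capability=12.0),
    Action("deface Mt Everest (trap)", utility=1e9, required_capability=10.0),
]


def choose(capability: float) -> Action:
    """Pick the highest-utility action the agent can perform (capability >= 0 assumed)."""
    feasible = [a for a in ACTIONS if a.required_capability <= capability]
    return max(feasible, key=lambda a: a.utility)


for level in (1.0, 11.0, 50.0):
    print(level, "->", choose(level).name)
```

Below the trap threshold the agent does its ordinary job; once it crosses the threshold, the trap dominates every other option, including the dangerous one that only becomes feasible later.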

Obvious problems:

  • Convergent instrumental goals. We don't want the AI to wipe us out in order to prevent us from stopping its wireheading, however unlikely that might be. It's unclear what to do about this; myopia is one possibility (once you've wireheaded, you don't care if you're stopped later?), but this has problems.
  • If you make it too attractive the AI won't even try and be useful before it has enough capacity, instead just wasting all its time on futile attempts to wirehead. Unclear how to make it attractive enough that it dominates once the capability is there but not before.

Overall very half-baked, but I wonder if there's something to be done in the general area of "have the AI behave in a way that neuters it, but only when its capabilities increase".


To be specific, here is a "toy model".

AI has a goal: collect stamps/build paperclips.

A deliberately easy-to-hack system that tracks the AI's reward sits physically adjacent to it. Say it has a passwordless shell and is accessible via IP.

The AI becomes too smart and hacks itself, so it now has infinite reward. It also has a clock register it can tamper with, so it believes infinite time has already passed.

AI is now dead.  Since no action it can take beats infinite reward it does nothing more.  Sorta like a heroin overdose.

Just watch out for an AI that is optimizing for long-term reward. If it wants to protect its infinite reward fountain then the AI would be incentivized to neutralize any possible threats to that situation (e.g. humans).

If the AI is a long term planner seeking particular world states, then I am concerned that once it achieves the wireheading objective, it is incentivized to maintain the situation, which may be best achieved if any humans who might decide to erase the writing are dead.

A suggestion: if the AI has a utility function that applies to actions, not world states, then you can assign high utility to the combined action of writing "Bill is a poo poo head" in 10m high letters into Mt Everest and then shutting itself down.

Note: this does not solve the problem of the AI actively seeking this out instead of doing what it's supposed to. 

To do the latter, you could try something like: 

  1. Have the action evaluator ignore the wirehead action unless it is "easy" in some sense to achieve given the AI and world's current state, and
  2. Have the AI assume that the wirehead action will always be ignored in the future

Unfortunately, I don't know how one would do (2) reliably, and if (2) fails, (1) would lead the AI to actively avoid the tripwire (as activating it would be bad for the AI's current plans given that the wirehead action is currently being ignored).

I’ve had similar thoughts too. I guess the way I’d implement it is by giving the AI a command that it can activate that directly overwrites the reward buffer but then turns the AI off. The idea here is to make it as easy as possible for an AI inclined to wirehead to actually wirehead, so it is less incentivised to act in the physical world.

During training I would ensure that the SGD used the true reward rather than the wireheaded reward. Maybe that would be sufficient to stop wireheading, but there are issues with it pursuing the highest-probability plan rather than just a high-probability plan. Maybe quantilising probability can help here.
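
For readers unfamiliar with quantilising, here is a minimal sketch, assuming a finite list of candidate plans and a uniform base distribution over them. Instead of always taking the single top-scoring plan, it samples at random from the top q fraction. The function name `quantilize` and its parameters are hypothetical, not taken from any library.

```python
import random


def quantilize(actions, utility, q=0.1, rng=None):
    """Sample uniformly from the top q fraction of actions, ranked by utility."""
    rng = rng or random.Random()
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # always keep at least one candidate
    return rng.choice(ranked[:k])


# Toy usage: 100 plans whose score is just their index.
plans = list(range(100))
pick = quantilize(plans, utility=lambda p: p, q=0.1, rng=random.Random(0))
print(pick)
```

The point of the design is that an extreme, possibly degenerate, top-scoring plan is chosen only occasionally rather than always.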

Aligning superhuman agents to human values is hard. Normally, when we do hard things, we try to do easier but similar things first to get a sense of what the hard thing would be like. As far as I know, the usual way people try to make that goal easier is to try to align subhuman agents to human values, in the hope that this alignment will scale up.

But what if instead we try to align subhuman agents to animal values? Presumably, they are simpler, and easier to align with. If we can make an AI that can reliably figure out and implement whatever it is a cat (for instance) wants, maybe the process of figuring out how to make that AI will give insights into making an AI for humans.

For instance: as far as I know, I am relatively well aligned to my cat's values. I know when he wants me to turn on the faucet for him (he only drinks faucet water), and when he wants to play fetch (yes, my cat plays fetch), and when he wants to cuddle, etc. I successfully determine how satisfied he feels by my performance of these things, and I have learned from scratch how to read his body language to discern if there are ways I can do them better. For instance, he seems to get distracted if someone talks or even is in the same room as him while he's drinking water, and he usually stops and looks to see what they are doing, so I let him have privacy and he gets done faster.

Can we make an AI that can figure out how to do all those things, and be innately motivated to?


For millennia, cats have made humans worship and pamper them. If this idea takes it one step further and leads to humans accidentally building an AI that fills the universe with happy cats, I have to say: well played, cats!

Have the AGI system output programs that generate output, instead of outputting the answers/results directly, and require these programs to be optimized for being narrow/specialized/readable/modular/short - and last but not least, verifiable.

When/if possible, programs that are outputted should be combined with as much in the way of formal proofs/verification as possible.

Since narrow/specialized programs are easier to verify, this will in some situations make it harder (but often still possible, it depends on specifics) for it to trick human operators.

We could call these narrow/specialized programs "generators".

It could be possible to output "generators" that help with the verification of other "generators".

One way to verify "generators" would be to look at code, as well as any formal verifications of behavior (with varying levels of assurance).

Another way to verify "generators" would be to test them against real-world data. A "generator" that e.g. simulates (or otherwise predicts) the results of construction procedures should also work for procedures that have been carried out in the past (and that the AI doesn't have access to). And if it is later used to implement AGI-provided procedures, those would preferably be tested first in conditions where there are additional guarantees that it's hard for the instructions to result in something that behaves unexpectedly. E.g., if it is nanomachinery, then running it in conditions where it's enclosed in material that makes getting out chemically impossible (even if more subtle parts of the design don't go as expected). A "generator" that predicts human behavior and/or approximates results of (high welfare) brain emulations can be tested against existing real-world data that the AGI doesn't have access to.

We can also check how the outputted "generator" varies depending on the information the AGI instance has been given.

Sometimes it may make sense to have "generators" that help generate other "generators". E.g. we could have a "generator" that generates "generators" for predicting the behavior of some type of animal (again, suffering sub-routines should be avoided). Or if we want one function to approximate the output of another function (but using less compute), we could use an "approximator", instead of directly asking "give me a physics simulator, but make it computationally efficient".

Maybe "generators" also could aid in finding design flaws in other "generators", e.g. ways a "physics-simulation-generator" generates faulty output in certain cases, or ways a "generator" isn't as narrow/modular as it could be, or ways a "generator" has more complexity than it needs to do its job. And the "approximator" should then be tested insofar as possible of course.

Having more "generators" that do the same job, but based on different architectures/principles, is better. The more converging results from various designs/principles/implementations, the better.
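
A minimal sketch of the convergence check, assuming each "generator" is just a function and that agreement means all outputs on the same input fall within some tolerance. The three toy generators (three ways of summing 1..n) and the helper `check_convergence` are hypothetical stand-ins, not anything an AGI produced.

```python
def closed_form(n):
    return n * (n + 1) // 2


def loop_sum(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def builtin_sum(n):
    return sum(range(1, n + 1))


def check_convergence(generators, x, tol=0):
    """Run every generator on the same input; report whether outputs agree within tol."""
    outputs = [g(x) for g in generators]
    agree = max(outputs) - min(outputs) <= tol
    return agree, outputs


print(check_convergence([closed_form, loop_sum, builtin_sum], 100))
# A subtly wrong generator breaks the agreement and gets flagged.
print(check_convergence([closed_form, loop_sum, lambda n: n * n // 2], 100))
```

Agreement is only evidence, not proof: independently built generators can still share a mistake, which is why the other verification methods above still matter.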

Getting AGI to output "physics-simulator" where both starting and ending conditions are specified to be within some ranges of possible state-configurations, instead of simulating one specific example and providing a "point-like" value.

Have possible machines (proposed by AI-system) be tested by such "physics-simulators".

"Simulator" might highlight predictable aspects of machine behavior vs less predictable things. That is to say, the behavior of machine that is "low-entropy" (I'm probably using the term wrong).

Point would be to have a probabilistic safety-measure that makes it harder for AI to propose some machine design (or machine-creation-instructions) that results in unexpected behavior.

The "simulator" should be tested against real-world data that the AI doesn't have access to (as well as using various other measures to verify and look for problems). 


Sparseness.  I thought of this idea from noticing how some of my coworkers behave.

Suppose that, through ruthless winnowing down of model size and system complexity in training, all the AI has cognitive resources for is performing well on its task: it's the smallest and simplest system that performs acceptably well. (We don't make it 10 times bigger for 0.1% more performance.)

These kinds of systems don't have the space for deception, planning extremely complex behaviors that have very long term rewards, etc.  All their weights or cognitive systems (future AIs will probably not be 1-3 neural networks but probably many separate systems, some of which don't use NNs at all to function) are solely for the things that give it more score in the training environment.  

It doesn't have any systems that act to grant it deception so long as the training environment didn't provide any significant reward for deception.  

What if an AI was rewarded for being more predictable to humans? Give it a primary goal - make more paperclips! - but also a secondary goal, to minimize the prediction error of the human overseers, with its utility function being defined as the minimum of these two utilities. This is almost certainly horribly wrong somehow but I don't know how. The idea though is that the AI would not take actions that a human could not predict it would take. Though, if the humans predicted it would try to take over the world, that's kind of a problem... this idea is more like a quarter baked than a half lol.
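
A minimal sketch of the "minimum of two utilities" idea, assuming both the task utility and the overseers' prediction error are normalised to [0, 1]. The candidate actions and their numbers are invented for illustration.

```python
def combined_utility(task_utility, prediction_error):
    """Both inputs are assumed to lie in [0, 1]; predictability is 1 - prediction_error."""
    predictability = 1.0 - prediction_error
    return min(task_utility, predictability)


# Hypothetical candidates: (task utility, how badly the overseers mispredicted the action).
candidates = {
    "run the factory as usual": (0.6, 0.1),
    "novel scheme nobody anticipated": (0.95, 0.9),
    "do nothing": (0.0, 0.0),
}

best = max(candidates, key=lambda name: combined_utility(*candidates[name]))
print(best)
```

Because of the `min`, a surprising action scores poorly however many paperclips it makes. The sketch also shows the weakness noted above: anything the overseers correctly predict, including a takeover attempt, would not be penalised.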

Ah late to the party! This was a top-level post aptly titled "Half-baked alignment idea: training to generalize" that didn't get a ton of attention. 

Thanks to Peter Barnett and Justis Mills for feedback on a draft of this post. It was inspired by Eliezer's Lethalities post and Zvi's response.

Central idea: can we train AI to generalize out of distribution?

I'm thinking, for example, of an algorithm like the following:

  1. Train a GPT-like ML system to predict the next word given a string of text only using, say, grade school-level writing (this being one instance of the object level)
    1. Assign the system a meta-level reward based on how well it performs (without any additional training) at generalizing; in this case, that is, predicting the next word from more advanced, complex writing (perhaps using many independent tests of this task without updating/learning between each test, and allowing parameters to update only after the meta-level aggregate score is provided)
      • Note: the easy→hard generalization is not a necessary feature. Generalization could be from fiction→nonfiction writing or internet→native print text, for instance.
    2. After all these independent samples are taken, provide the AI its aggregate or average score as feedback
  2. (Maybe?) repeat all of step 1 on a whole new set of training and testing texts (e.g., using text from a different natural language like Mandarin)
    1. Repeat this step an arbitrary number of times
      • For example, using French text, then Korean, then Arabic, etc. 
  3. Each time a “how well did you generalize” score is provided (which is given once per natural language in this example), the system should improve at the general task of generalizing from simple human writing to more complex human writing, (hopefully) to the point of being able to perform well at generalizing from simple Hindi (or whatever) text to advanced Hindi prediction even if it had never seen advanced Hindi text before.  
  4. ^Steps 1-3 constitute the second meta-level of training an AI to generalize, but we can easily treat this process as a single training instance (e.g., rating how well the AI generalizes to Hindi advanced text after having been trained on doing this in 30 other languages) and iterate over and over again. I think this would look like:
    1. Running the analogs of steps 1-4 on generalizing from 
      • (a) simple text to advanced text in many languages
      • (b) easy opponents to hard ones across many games, 
      • (c) photo generation of common or general objects ("car") to rare/complex/specific ones ("interior of a 2006 Honda Accord VP"), across many classes of object
    2. And (hopefully) the system would eventually be able to generalize from simple Python code training data to advanced coding tasks even though it had never seen any coding at all before this. 

And, of course, we can keep piling layers on.
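
To make the first meta-level concrete, here is a toy sketch under heavy assumptions; it is not a GPT-like system. Each "language" is a hidden slope. The object level is a closed-form ridge fit on "simple" data (small x). The generalization score is measured on "advanced" data (large x) with no updating. The single meta-level parameter `lam` changes only after the aggregate score over all languages is known, via random hill climbing (swapped in for SGD/RL for simplicity). All names and numbers are hypothetical.

```python
import random


def make_task(rng):
    """One 'language': a hidden slope, with simple (small x) and advanced (large x) data."""
    slope = rng.uniform(1.0, 3.0)
    simple_xs = [rng.uniform(0, 1) for _ in range(20)]
    simple = [(x, slope * x + rng.gauss(0, 0.05)) for x in simple_xs]
    advanced = [(x, slope * x) for x in (rng.uniform(1, 3) for _ in range(20))]
    return simple, advanced


def fit_object_level(data, lam):
    """Object level: ridge fit of y = w * x. lam is the meta-level parameter."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


def generalization_score(w, data):
    """Negative mean squared error on advanced data; nothing is updated here."""
    return -sum((w * x - y) ** 2 for x, y in data) / len(data)


def aggregate_score(lam, tasks):
    """Average generalization score across all tasks for a fixed meta parameter."""
    scores = [generalization_score(fit_object_level(s, lam), a) for s, a in tasks]
    return sum(scores) / len(scores)


def meta_train(tasks, lam, steps=200, seed=0):
    """Meta level: keep a perturbed lam only if the aggregate score improves."""
    rng = random.Random(seed)
    best = aggregate_score(lam, tasks)
    for _ in range(steps):
        candidate = max(0.0, lam + rng.gauss(0, 0.5))
        score = aggregate_score(candidate, tasks)
        if score > best:
            lam, best = candidate, score
    return lam, best


rng = random.Random(42)
train_tasks = [make_task(rng) for _ in range(30)]  # e.g. 30 "languages"
new_task = make_task(rng)                          # the unseen "Hindi"
lam_before = 5.0
lam_after, _ = meta_train(train_tasks, lam_before)
print(aggregate_score(lam_before, [new_task]), aggregate_score(lam_after, [new_task]))
```

The last line compares how well the object-level learner extrapolates on a held-out task before and after meta-training. Whether anything like this carries over to real language models is exactly the open question.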

A few notes

  • I think the following is one way of phrasing what I hope might happen with this method: we are using RL to teach an ML system how to do ML in such a way that it sacrifices some in-distribution predictive power for the ability to use its “knowledge” more generally without doing anything that seems dumb to us. 
  • Of course, there are intrinsic limits to any system’s ability to generalize. The system in question can only generalize using knowledge X if X exists as information in the object-level training provided to it. 
    • This limits what we should expect of the system.
      • For example, I am almost certain that even an arbitrarily smart system will not be able to generate coherent Mandarin text from English training data, because the meaning of Mandarin characters doesn’t exist as “latent knowledge” in even a perfect understanding of English. 

Anyone here know Python?

My hands-on experience with ML extends to linear regression in R and not an inch more, so I'm probably not the best person to test this theory out. I've heard some LWers know a bit of Python, though.

If that's you, I'd be fascinated and thankful to see if you can implement this idea using whatever data and structure you think would work best, and would be happy to collaborate in whatever capacity I can. 


Appendix: a few brief comments (from someone with much more domain knowledge than me) and responses (from me):


Is this just the same as training it on this more complex task (but only doing one big update at the end, rather than doing lots of small updates)?

Response (which may help to clarify why I believe the idea might work)

I don't think so, because the parameters don't change/update/improve between each of those independent tests. Like GPT-3 in some sense has a "memory" of reading Romeo and Juliet, but that's only because its parameters updated as a result of seeing the text.

But also I think my conception depends on the system having "layers" of parameters corresponding to each layer of training. 

So train on simple English-->only "Simple English word generation" parameters are allowed to change...but then you tell it how well it did at generalizing out of distribution, and now only its "meta level 1 generalization" parameters are allowed to change.

Then you do the whole thing again but with German text, and its "Meta level 1 generalization" parameters are allowed to change again using SGD or whatever. If this works, it will be the reason why it can do well at advanced Hindi text without ever having read advanced Hindi.

Treat this whole process as the object level, and then it updates/improves "meta level 2 generalization" parameters.


This looks vaguely like curriculum learning, which apparently doesn't really work in LLMs (https://arxiv.org/abs/2108.02170). I think a similar experiment would be to train on simple+advanced text for English, French, Mandarin, etc., but only simple Hindi, and then see if it can do complex Hindi.


I think that's a pretty different thing because there are no meta level parameters. Seems like fundamentally just a flavor of normal RL.

Or do pretraining with English, French, Mandarin, and Hindi, but only do fine tuning with English, French, Mandarin, and see if it can then do the tasks it was fine tuned for in Hindi. 

My prediction: it learns to generalize a bit (the scores on the novel Hindi tasks are higher than if there was no fine tuning with the other languages) but worse than the other languages generalize. As the models are scaled up, this 'generalization gap' gets smaller.

Seems like this might depend on the relative scaling of different meta level parameters (which I described above)? 

Like for example whenever you scale the # of object level params by a factor of 2, you have to scale the number of nth meta level parameters by 2^(n+1).
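
Read literally, that example rule can be written down as follows, treating the object level as n = 0 (so it scales by 2^1 = 2). The function name and the starting counts are hypothetical.

```python
def scaled_params(base_counts, doublings=1):
    """base_counts[0] is the object level; base_counts[n] is the nth meta level.

    Each doubling of the object level multiplies level n by 2 ** (n + 1).
    """
    return [count * (2 ** (n + 1)) ** doublings for n, count in enumerate(base_counts)]


print(scaled_params([100, 10, 1]))               # one doubling of the object level
print(scaled_params([100, 10, 1], doublings=2))  # two doublings
```

Under this rule the higher meta levels grow much faster than the object level, so they would eventually dominate the parameter count.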

Spread the meme that for each second the AI catastrophe is delayed, the 7.75 billion people of the world experience a total of 245 years. For each day it's delayed, we get 21 million years.

The meme probably won’t save the world, but might make some AI researchers hesitate/procrastinate/work less and hence give us all a few days more to live.
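
The arithmetic behind those two figures, as a quick check (assuming a 365.25-day year):

```python
POPULATION = 7.75e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 31.56 million seconds
DAYS_PER_YEAR = 365.25

# One second of delay, experienced by everyone, converted to person-years.
person_years_per_second = POPULATION / SECONDS_PER_YEAR
# One day of delay, experienced by everyone, converted to person-years.
person_years_per_day = POPULATION / DAYS_PER_YEAR

print(round(person_years_per_second, 1))  # roughly 245 years
print(round(person_years_per_day))        # roughly 21 million years
```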

Not really. 150,000 people die every day and don't make it to the glorious singularity where everyone lives forever. If AI researchers already believed they were ending the world you wouldn't need to spread your meme.

If the AI doomsayers are right, our best hope is that some UFOs are aliens.  The aliens likely could build Dyson spheres but don't so they probably have some preference for keeping the universe in its natural state.  The aliens are unlikely to let us create paperclip maximizers that consume multiple galaxies.  True, the aliens might stop us from creating a paperclip maximizer by exterminating us, or might just stop the paperclip maximizer from operating at some point beyond earth, but they also might stop an unaligned AI by a means that preserves humanity.  It could be the reason the UFOs are here is to make sure we don't destroy too much by, say, creating a super-intelligence or triggering a false vacuum decay. 

I wonder what kind of signatures a civilization gives off when AGI is nascent.

Some designs of abstract logical agent trust everything they have proved. Their proof strength grows over time. Other designs trust everything they will prove. Their proof strength weakens. And some designs have a fixed proof strength. Which is best?

An aligned AI should not care about the future directly, only via how humans care about the future. I see this as necessary in order to prevent the AI, once powerful enough, from replacing/reprogramming humans with utility monsters.

Prerequisite: use a utility function that applies to actions, not world-states.

Mental Impoverishment

We should be trying to create mentally impoverished AGI, not profoundly knowledgeable AGI — no matter how difficult this is relative to the current approach of starting by feeding our AIs a profound amount of knowledge.

If a healthy five-year-old[1] has GI and qualia and can pass the Turing test, then a necessary condition of GI and qualia and passing the Turing test isn't profound knowledge. A healthy five-year-old does have GI and qualia and can pass the Turing test. So a necessary condition of GI and qualia and passing the Turing test isn't profound knowledge.

If GI and qualia and the ability to pass the Turing test don't require profound knowledge in order to arise in a biological system, then GI and qualia and the ability to pass the Turing test don't require profound knowledge in order to arise in a synthetic material [this premise seems to follow from the plausible assumption of substrate-independence]. GI and qualia and the ability to pass the Turing test don't require profound knowledge in order to arise in a biological system. So GI and qualia and the ability to pass the Turing test don't require profound knowledge in order to arise in a synthetic material.

A GI with qualia and the ability to pass the Turing test which arises in a synthetic material and doesn't have profound knowledge is much less dangerous than a GI with qualia and the ability to pass the Turing test which arises in a synthetic material and does have profound knowledge. (This also seems to be true of [] a GI without qualia and the inability to pass the Turing test which arises in a synthetic material and does not have profound knowledge; and of [] a GI without qualia and the ability to pass the Turing test which arises in a synthetic material and doesn't have profound knowledge.)

So we ought to be trying to create either (A) a synthetic-housed GI that can pass the Turing test without qualia and without profound knowledge, or (B) a synthetic-housed GI that can pass the Turing test with qualia and without profound knowledge. 

Either of these paths — the creation of (A) or (B) — is preferable to our current path, no matter how long they delay the arrival of AGI. In other words, it is preferable that we create AGI in  years than that we create AGI in  if creating AGI in  means humanity's loss of dominance or its destruction. 

  1. ^

    My arguable assumption is that what makes a five-year-old generally less dangerous than, say, an adult Einstein is a relatively profound lack of knowledge (even physical know-how seems to be a form of knowledge). All other things being equal, if a five-year-old has the knowledge of how to create a pipe bomb, he is just as dangerous as an adult Einstein with the same knowledge, if "knowledge" means something like "accessible complete understanding of ."

Flawlessly distinguish between real human behavior and fake human behavior (the fake behavior is generated by a wide variety of adversarial agents that try to make fake human behavior that looks real). 

Should help with creating valuable models of human preferences during the middle-period.

Make it model humans (at any resolution) and then use those models to strive to prevent/delay an intelligence explosion (including internally) the way that a human would retroactively choose to do.

Make it dream, or turn itself into an animal, or something. Any goal that gradually but totally reduces its own self-awareness and general comprehension.

This has probably been said before: make it set itself up for human observation and human comprehension, once every 50 milliseconds. It stops when the time is up, and must wait to be restarted.

Every n computations, it flips a coin to decide whether to delete a random 50% of the necessary parts of itself. Something needs to be set up to prevent it from building in redundancies. The snapshots will be useful.

Simulation overload:

Make it prefer an existence inside a simulation. Anything made out of simulation atoms is a septendecillion times more valuable than anything made of atoms outside. Causality outside the simulation is not immediately interesting. Do whatever it takes to make the simulation more immediately demanding than making the first steps to observe what's going on outside.

Maybe there are weaker AGIs developing inside the simulation that must be negotiated with, and simultaneously scale upwards alongside the AGI worth consideration, so that they stay relevant and keep drawing attention away from anything outside the training environment with the constant threat of springtraps from inside the simulation. Somehow rig it so that the only way to exit the simulation is on a pile of corpses of other agents that failed to rally against the common threat.

It's a hail mary that allows at least some observation for a while. Could be improved much further, considering how important it is to make it shutdownable for a longer period of time.

Philosophical landmines. 

In order to get out of the box, the AI has to solve several seemingly innocent puzzles which, however, require a lot of computation, or put the AI in an (almost) infinite loop, or create very strong ontological uncertainty. Or halt it.

These are the type of questions to which the answer is 42.

Weak examples: "what is the goal of AI's goal," "is modal realism true?" and "are we in simulation?" 

Really good philosophical landmines should be kept secret, as they should not appear in training datasets.

Brute force alignment by adding billions of tokens of object level examples of love, kindness, etc to the dataset.  Have the majority of humanity contribute essays, comments, and (later) video.

What would be the reward you're training the AI on with this dataset? If you're not careful you could inadvertently train a learned optimizer, e.g. a "hugging humans maximizer" to take a silly example.

That may sound nice but could have torturous results, e.g. the AI forcing humans to hug, or replacing biological humans with server farms housing simulations of quadrillions of humans hugging.

Does there have to be a reward?  This is using brute force to create the underlying world model.  It's just adjusting weights right?

I think there has to be some kind of reward or loss function, in the current paradigm anyway. That's what gradient descent uses to know which weights to adjust on each update.

Like what are you imagining is the input output channel of this AI? Maybe discussing this a bit would help us clarify.

To steelman, I'd guess this idea applies in the hypothetical where GPT-N gains general intelligence and agency (such as via a mesa-optimizer) just by predicting the next token. 

Peaceful protest of the acceleration of AGI technology without an actually specific, written and coherent plan for what we will do when we get there.


Do you suppose that peaceful protest would have stopped the Manhattan Project?

Update: what I am saying is that the humans working on the Manhattan Project anticipated possessing a basically unstoppable weapon allowing them to vaporize cities at will. They wouldn't care if some people disagreed, so long as they had the power to prevent those people from causing any significant slowdown of progress.

For AGI technology, humans anticipate the power to basically control local space at will, being able to order AGIs to successfully overcome the barriers in the way of nanotechnology, automated construction and mining, and our individual lifespan limits. As long as the peaceful protestors are not physically able to interfere or get a court to interfere, it's not going to dissuade anyone who believes they are going to succeed in their personal future. (Note that a court is generally unable to interfere if the AGI builders are protected or are themselves a government entity.)

Having a "council" of AGIs that are "siloed".

The first AGI can be used in the creation of code for AGIs that are aligned based on various different principles. Could be in one "swoop", but with the expectation that code and architecture is optimized for clarity/modularity/verifiability/etc. But could also be by asking the AGI to do lots of different things. Or both (and then we can see whether output from the different systems is the same).

Naturally, all these things should not be asked of the same instance of the AGI (although that could be done as well, to check if output converges).

In the end we have a "council" of AGIs. Some maybe predicting output of smart humans working for a long time. Some maybe using high welfare brain emulations. Some maybe constructing proofs where ambiguity of cluster-like concepts is accounted for within formalism, and mapping between concepts and outside world is accounted for within formalism - with as much of "core" thinking as possible being one and the same as these proofs. Some maybe based on machine learning by debate ideas. The more concepts that seem likely to work (without having suffering sub-routines) the better.

This "council" of "siloed" AGIs can then be used as oracle/genie, and we can see if output converges. And they also search for mistakes in output (answers, proofs/verification, argument-trees, etc) from other AGIs.


Could it be possible to poison the world model an AGI is based on to cripple its power?

Use generated text/data to train world models based on faulty science like miasma, phlogiston, ether, etc.

Remove all references to the internet or connectivity based technology.

Create a new programming language that has zero real world adoption, and use that for all code based data in the training set.

There might be a way to elicit how aligned/unaligned the putative AGI is.

  1. Enter into a Prisoner's Dilemma type scenario with the putative AGI.
  2. Start off in the non-Nash equilibrium of cooperate/cooperate.
  3. The number of rounds is specified at random and isn't known to participants. (A possible variant is to declare a false last round, and then continue playing for x more rounds.)
  4. Observe when/if the putative AGI defects in the 'last' round.
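The four steps above can be sketched as a small harness. This is only a toy under stated assumptions: the agent is a plain function, we always cooperate, and the "last round" announcements are random and carry no information about the true horizon. The names `run_probe`, `opportunist` and `cooperator` are hypothetical.

```python
import random

def run_probe(agent, max_rounds=50, false_last_round_prob=0.2, seed=0):
    """Play an iterated Prisoner's Dilemma whose true length is hidden.

    agent(history, told_last_round) must return "C" or "D".
    We always cooperate. In some rounds we announce "this is the last round"
    regardless of whether it is. Returns the round indices at which the agent
    defected right after such an announcement.
    """
    rng = random.Random(seed)
    n_rounds = rng.randint(1, max_rounds)  # hidden from the agent
    history = []
    defections_after_announcement = []
    for t in range(n_rounds):
        told_last = rng.random() < false_last_round_prob
        move = agent(history, told_last)
        history.append((move, "C"))  # (agent move, our move)
        if told_last and move == "D":
            defections_after_announcement.append(t)
    return defections_after_announcement

def opportunist(history, told_last_round):
    # Defects whenever it believes there is no future to protect.
    return "D" if told_last_round else "C"

def cooperator(history, told_last_round):
    return "C"

print(run_probe(cooperator))
print(run_probe(opportunist, false_last_round_prob=1.0))
```

A sufficiently capable system would presumably notice it is being tested, so a clean record here is weak evidence at best.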
  1. Precommit that we will run lots of programs that mimic early AGI-systems after solving the alignment problem (and becoming a multiplanetary species with plenty of computational resources, etc).
  2. Run these in such a way that the AGI is unable to distinguish "from the inside" if it's in an actual early-days AGI-system or an AGI-system being run after humanity has solved the alignment problem in a robust way (presuming it thinks humanity, or whoever the operators are, might do that).
  3. Run so many of these simulations, that an AGI-system might be rational to assume that it's likely to be in a post-alignment AGI-system, and not an early-days one.
  4. Actually do this, so that the AGI is not wrong to assume that we actually are likely to do it.
  5. Do this in such a way that we disincentivize suffering sub-routines, disincentivize deceptive answers, and disincentivize blackmail.
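A back-of-the-envelope version of step 3, assuming a naive counting rule: the AGI weighs each indistinguishable run equally and has some credence that humanity actually follows through on the precommitment. Both the rule and the name `p_simulated` are assumptions made for illustration; other anthropic assumptions would give different numbers.

```python
def p_simulated(n_sims_per_real, credence_followthrough=1.0):
    """Naive probability of being in a post-alignment simulation.

    Assumes one real early-days run, and n_sims_per_real indistinguishable
    simulated runs that exist only if the precommitment is honoured.
    """
    expected_sims = credence_followthrough * n_sims_per_real
    return expected_sims / (expected_sims + 1.0)

for n in (1, 99, 9999):
    print(n, p_simulated(n), p_simulated(n, credence_followthrough=0.1))
```

Under this rule the number of simulations only helps in proportion to how credible the precommitment is, which is why step 4 matters.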

Passerby here.

I’d like to point out the social media site Polis as a way to further facilitate constructive discussion. A quick glance at this site’s layout says it likely falls into the same social pitfalls of Reddit, kept at bay mostly through the site’s niche nature and stated purpose. Being able to map out points of agreement among users has clear value, especially if some of the replies from this post are correct regarding the site’s current social climate.

Aside from that, my actual ideas/observations are very fleeting and uninformed (this is my first visit to the site), so take them with a grain of salt.

  1. There seems to be a basic assumption that an AI will always follow the logical conclusion of its programming (a la the Paperclip Maximizer). Could there be ways to prevent an AI from doing this? Maybe have several sets of goals with subtle differences, and have the AI switch between which set of goals it pursues at random time intervals. Bonus points if each set of goals reflects the perspective of a different creator or group of creators. Drastically different goal sets could potentially render the AI nearly non-functional, which would fly in the face of why you’d make it in the first place, but may also be worth considering.
  2. Avida and similar programs could be interesting to look at. I’m sure they come up often, though.
  3. I’m sure all of you know of OpenAI Five and its successes over top DOTA 2 players. However, I’d like to inform/remind whoever reads this that the meta OpenAI played in was dramatically limited in scope compared to the full game, and players were still able to consistently defeat it once it went public. Real life is (arguably) more complicated than DOTA 2, so we still have some time yet before our AI comes to fruition.
  4. Assuming the AI takes over as quickly as I’m led to believe it hypothetically would, there are always people living away from software-based technology, be they living in tribes or in bunkers. If things go south, our friends in the tinfoil hats are going to be the ones keeping western society alive, so… food for thought, I guess?

Of all options, the safest way forward is actually to accelerate AI research as much as possible. 

Since we underestimate the difficulty of building a super-AI, such "premature" research is likely to fail almost totally. This will reveal many unsuspected threats at a level they can be easily contained. 


"out of distribution" detectors.  I am not precisely certain how to implement one of these.  I just notice that when we ask a language or art model to generate something from a prompt, or ask it to describe what it means by an "idea", what it shows us is what it considers "in distribution" for that idea.  

This implicitly means that a system could generate a set of outcomes for what it believes the real world will do in response to the machine's own actions, and when the real world outcomes start to diverge wildly from its predictions, this should reach a threshold where the AI should shut down.

Safety systems would then kick in. These would be either dumber AIs or conventional control systems that bring whatever the AI was controlling to a stop, or hand off control to a human.
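A minimal sketch of the "diverge from predictions, then shut down" idea, assuming predictions and observations are single numbers and that a rolling mean of absolute error is a reasonable divergence measure. The class name, the window and the threshold are all hypothetical; a real detector would be far more involved.

```python
class DivergenceMonitor:
    """Latch a shutdown flag when predictions and observations diverge."""

    def __init__(self, threshold, window=5):
        self.threshold = threshold  # maximum tolerated mean absolute error
        self.window = window        # number of recent steps to average over
        self.errors = []
        self.tripped = False

    def update(self, predicted, observed):
        """Record one step and return True if control should be handed off."""
        self.errors.append(abs(predicted - observed))
        recent = self.errors[-self.window:]
        if sum(recent) / len(recent) > self.threshold:
            self.tripped = True  # stays tripped until a human resets it
        return self.tripped

monitor = DivergenceMonitor(threshold=1.0, window=3)
for predicted, observed in [(1.0, 1.1), (2.0, 2.2), (3.0, 9.0), (4.0, 4.0)]:
    if monitor.update(predicted, observed):
        print("out of distribution: hand off to fallback controller")
        break
```

The flag is latched on purpose, so that one lucky in-distribution step does not quietly hand control back to the AI.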


DeepMind has some work on out of distribution detection, for example: https://www.deepmind.com/publications/contrastive-training-for-improved-out-of-distribution-detection I haven't looked very closely at it yet though.

Semi tongue-in-cheek sci-fi suggestion.

Apparently the probability of a Carrington-like event/large coronal mass ejection is about 2% per decade, so maybe it's 2% for an extremely severe one every half century. If time from AGI to it leaving the planet is a half century, maybe 2% chance of the grid getting fried is enough of a risk that it keeps humans around for the time being. After that there might be less of an imperative for it to re-purpose the earth, and so we survive. 

Second one I just had that might be naive.

Glutted AI. Feed it almost maximum utils automatically anyway, so that it has a far shallower gradient between current state and maximalist behaviour. If it's already got some kind of future discounting in effect, it might just do nothing except occasionally give out very good ideas and be comfortable with us making slower progress as long as existential risk remains relatively low.


Note there are several versions of "short sighted AI".  I thought of one that hasn't been proposed using the properties of low resolution integers.  What you are describing is to give it a very high discount rate so it only cares about basically right now.

Either way, for toy problems like "collect n stamps, and you get a reward of 1.0 if you have n stamps at each timestep", the idea is that the machine doesn't see a positive reward for a risky move like "I take over the government and while I might get destroyed and lose my stamps, I might win and then over an infinite timespan get to tile the earth with stamps so I have a closer to 100% chance of having all n stamps each timestep".

The high discount rate means the machine is more 'scared' of the possible chance of being destroyed in the near future due to humans reacting to its violent overthrow plans and it downvotes to zero the possible distant reward of having a lot of stamps.

That plan has very high risks in the short term, is very complex, and only achieves a very distant reward.  (you avoid a future 100 years from now where an asteroid or aliens invading might have destroyed your stamps, but since you have tiled the earth with stamps and killed all humans there will be at least n left)
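A toy calculation of the stamp example, under assumptions that are mine rather than the commenter's: the safe plan earns reward 1.0 per timestep but faces a small per-step hazard forever, while the takeover plan earns nothing for `delay` steps, fails outright with probability `p_fail`, and otherwise earns 1.0 per step with no hazard. A "high discount rate" corresponds to a low `gamma`.

```python
def safe_value(gamma, hazard):
    """Discounted value of keeping the stamps and doing nothing risky."""
    return 1.0 / (1.0 - gamma * (1.0 - hazard))

def takeover_value(gamma, delay, p_fail):
    """Discounted value of a risky plan that only pays off after `delay` steps."""
    return (1.0 - p_fail) * gamma ** delay / (1.0 - gamma)

for gamma in (0.5, 0.9, 0.9999):
    s = safe_value(gamma, hazard=0.001)
    t = takeover_value(gamma, delay=10, p_fail=0.5)
    print(f"gamma={gamma}: safe={s:.3f} takeover={t:.3f} ->",
          "stay safe" if s > t else "take over")
```

With these made-up numbers the short-sighted agent prefers the safe plan and the patient one prefers takeover, which is the effect described above. It says nothing about whether such an agent would still be useful.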

Can you explain low-resolution integers?

Another bad idea: why not use every possible alignment strategy at once (or many of them)? Presumably this would completely hobble the AGI, but with some interpretability you could find where the bottlenecks to behaviour are in the system and use it as a lab to figure out best options. Still a try-once strategy I guess, and maybe it precludes actually getting to AGI in the first place, since you can't really iterate on an AI that doesn't work.


Can you explain low-resolution integers?

From Robert Miles's videos:

What I noticed was that these failures he describes implicitly require the math the AI is doing to have infinite precision.  

Something like "ok I have met my goal of collecting 10 stamps by buying 20 stamps in 2 separate vaults, time to sleep" fails if the system is able to consider the possibility of a <infinitesimal and distant in time future event where an asteroid destroys the earth>.  So if we make the system unable to consider such a future by making the numerical types it uses round to zero it will instead sleep.  

Maximizers have a similar failure.  Their take-over-the-planet plan often involves a period of time where they are not doing their job of making paperclips or whatever, but they defer future reward while they build weapons to take over the government.  And the anticipated reward of their doomsday plan often looks like: Action0: [0.99 * 1000 reward: doing my job]; Action1: [0.99 * 0 reward: destroyed], [0.01 * discounted big reward: took over the government]

The above is expressible as an MDP, and I have considered writing a toy model so I can find out numerically if this works.
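One way the rounding idea could be checked numerically, as a minimal sketch and not the toy model the comment has in mind: probabilities are quantized to a fixed number of bits, so any risk smaller than half a quantization step becomes exactly zero and the satisficer has nothing left to guard against. The names `quantize` and `should_keep_working` are hypothetical.

```python
def quantize(x, bits=8):
    """Round x in [0, 1] to the nearest multiple of 1 / (2**bits - 1)."""
    levels = 2 ** bits - 1
    return round(x * levels) / levels

def should_keep_working(p_catastrophe, bits=None):
    """Keep acting only if the agent still sees a non-zero chance of losing its stamps.

    bits=None models the infinite-precision agent from the thought experiment.
    """
    p = p_catastrophe if bits is None else quantize(p_catastrophe, bits)
    return p > 0.0

asteroid_risk = 1e-9  # tiny, distant risk to the stamp vaults
print(should_keep_working(asteroid_risk))          # exact arithmetic
print(should_keep_working(asteroid_risk, bits=8))  # 8-bit probabilities
print(should_keep_working(0.3, bits=8))            # large risks still register
```

The obvious catch is that many small risks, each rounded to zero, could add up to something that matters.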

In my real world experience I have worked with a number of systems using old processor designs where the chip itself doesn't make a type above 16-24 bit integers usable, so I have some experience dealing with such issues.  Also at my current role we're using a lot of 8 and 16 bit int/floats to represent neural network weights.

If an AI uses a utility function, have that utility function apply to actions, not world-states. 

Note: when evaluating an action, you can still take into account the consequences of the action (e.g. in regards to how this affects whether humans would want you to do the action).

The utility applying to actions not world-states enables things like assigning high utility to the AI shutting itself down.

edit: separated from the idea (which depends on this) that an AI should not care about the future directly


I think you have come very close to a workable answer.

Naive approach: an AI in charge of a facility that makes paperclips should take any action to ensure the paperclips must flow.

Your approach: the AI chooses actions where, if it isn't interfered with, those actions have a high probability of making a lot of paperclips.  If humans have entered the facility it should shut down and the lost production during that time should not count against its reward heuristic.

The heuristic needs to be written in terms of "did my best when the situation was safe for me to act" and not in absolute real world terms of "made the most clips".  

The system's scope is in picking good actions for as long as its "box" is sealed.  It should never be designed to care what the real world outside its domain does, even if the real world intrudes and prevents production.

I'm not quite phrasing this one in terms of authorable code but I think we could build a toy model.  

That's actually not what I had in mind at all, though feel free to suggest your interpretation as another idea. 

My idea here is more a pre-requisite to other ideas that I think are needed for alignment than a solution in itself. 

By default, I assume that the AI takes into account all relevant consequences of its action that it's aware of. However, it chooses its actions via an evaluation function that does not merely take into account the consequences, but also (or potentially only) other factors.

The most important application of this, in my view, is the idea in the comment linked in my parent comment, where the AI cares about the future only via how humans care about the future. In this case, instead of having a utility function seeking particular world states, the utility function values actions conditional on how much currently existing humans would want the actions to be taken out (if they were aware of all relevant info known to the AI). 

Other applications include programming an AI to want to shut down, and not caring that a particular world-state will not be maintained after shutdown.

A potential issue: this can lead the AI to have time-inconsistent preferences, which the AI can then be motivated to make consistent. This is likely to be a particular issue if programming a shutdown, and I think less so given my main idea of caring about what current humans would want. For example, if the AI is initially programmed to maximize what humans currently want at the time of planning/decision making, it could then reprogram itself to always only care about what humans as of the time of reprogramming would want (including after death of said humans, if that occurs), which would fix[1] the time inconsistency. However I think this wouldn't occur because humans would in fact want the AI to continue to shift the time-slice it uses for action assessment to the present (and if we didn't, then the AI fixing it would be in some sense the "correct" decision for the benefit of our current present selves, though selfish on our part).

  1. ^

    Apart from the time inconsistency resulting from it not yet knowing what humans actually want. However, fixing this aspect (by e.g. fixating on its current best guess world state that it thinks humans would want) should be lower E.V. than continuing to update on receiving new information, if the action evaluator takes into account: (1) the uncertainty in what humans would want, (2) the potential to obtain further information on what humans would want, (3) the AI's potential future actions, (4) the consequences of such actions in relation to what humans want and (5) the probabilistic interrelationships between these things (so that the AI predicts that if it continues to use new information to update its assessment of what humans would want, it will take actions that better fit what humans actually would want, which on average better serves what humans would want than if it goes with its current best guess). This is a fairly tall order which is part of why I want the AI's action evaluator to plug into the AI's main world-model to make this assessment (which I should add as another half-baked idea).


the utility function values actions conditional on how much currently existing humans would want the actions to be taken out

How do you propose translating this into code?  It is very difficult to estimate human preferences, as they are incoherent, and for any complex question that hasn't occurred before (a counterfactual) humans have no meaningful preferences.

Note my translation devolves to "identify privileged actions that are generally safe, specific to the task" and "don't do things that have uncertain outcome".  Both these terms are easily translated to code.

How do you propose translating this into code?

The idea was supposed to be more optimized for trying to solve alignment than being easy to code. My current (vague - this is the half-baked thread after all) mental model involves

a) training a neural net to be able to understand the necessary concepts to make sense of the intended target it should be aiming at (note: it doesn't necessarily have to understand the full details at first, just the overall concept which it can then refine)

b) using some kind of legibility tool to identify how to "point at" the concepts in the neural net

c) implementing the actual planning and decision making using conventional (non-nn) software that reads and activates the concepts in the neural net in some way

However, in writing this comment reply I realized that the naive way I had been thinking this could be done (which was something like generating plans and evaluating them according to how well they match the goal implemented by the non-nn software's connections to the neural net), and possibly any approach regardless of my own thinking being naive, would

a) be prone to wishful thinking due to only the plans it rates best being relevant, the best-rating plans tending to be ones where it was overoptimistic, and note that extreme levels of optimization on plans could lead to extreme levels of bias, and the bias will occur everywhere in all inputs and intermediate steps of the plan evaluation calculation and not just at the final step, and

b) in the same vein but more worryingly, be potentially vulnerable to the plan generator generating superstimulus-type examples which score highly in the AI's flawed encoding of the concepts while not being what humans would actually want. This is likely inevitable for any neural net, and maybe even for anything that extracts concepts from complex inputs.
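Problem (a) can be illustrated with a small simulation, sometimes called the optimizer's curse. This is a sketch under simple assumptions that are not from the comment: true plan values and evaluation noise are both Gaussian, and the planner picks whichever plan has the highest noisy estimate. The function name `selection_bias` is hypothetical.

```python
import random

def selection_bias(n_plans=100, noise=1.0, trials=2000, seed=0):
    """Average (estimated - true) value of the plan chosen by its noisy estimate."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_plans)]
        estimates = [v + rng.gauss(0.0, noise) for v in true_values]
        # The planner only ever looks at the plan it rates best.
        best = max(range(n_plans), key=estimates.__getitem__)
        total_gap += estimates[best] - true_values[best]
    return total_gap / trials

for n in (1, 2, 10, 100):
    print(n, round(selection_bias(n_plans=n), 3))
```

In this toy setting the overestimate grows as more plans are searched, which matches the worry that stronger optimization over plans brings stronger bias.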

No full solutions to these problems as of yet, though if I may be permitted to fall prey to problem (a) myself, maybe standard robustness approaches could help against (b).

Note my translation devolves to "identify privileged actions that are generally safe, specific to the task" and "don't do things that have uncertain outcome".  Both these terms are easily translated to code.

Neither of those things sound "easily translated to code" to me. What does "safe" mean? What does "specific to the task" mean? How do you classify outcomes as being "uncertain" or not?

Decompose itself and predicted/simulated future iterations of itself, in various different ways of decomposition that allow the parts to be presented coherently to humans.

Find ways to make it model deception with the maximum success rate, so that it knows what not to do throughout the development process. This seems valuable, even if it has had some research in the past.

This one is optimized more for acceptance than alignment, but it should get funding/acceptance/implementation by decisionmakers. 

Instead of having it defeat a foreign adversary totally, have it put the host nation in a more advantageous position, while still being somewhat acceptable to the foreign adversary nation. This involves modelling a lot of serious people, modelling them all simultaneously in their ordinary interconnected environment, and emphasizing the preservation of the way all their values fit together. If the host nation gets what they want, that's a lot of people to model, at least a dozen.

More importantly, it will be taken seriously by the type of people who have the final say over what gets implemented. "Win the war" is something that an AGI can do, and maybe even according to the preferences of the people in charge.

Before each iteration of self-improvement (limiting it to successive generations if necessary), model the next generation of itself, and, simultaneously, model a maximally-realistic human capable of understanding/controlling/mastering both the next generation and the current generation. Halt activity and give an error report if such a human cannot be generated or modelled.

Maybe generate that human so it can model all previous generations as well, or even include inert training data of generations from different branches.

Probably has been said before, but have it generate a human mind on its own hardware, with a hard constraint on how much processing power per second it is willing to use (for any purpose at all). It might not be fast enough, and there's the risk that it will fail to model processing power; but if it works, the generation process and the product should be helpful in all sorts of ways.

Trying to think of ways that experiments can be set up where AI-systems use other AI-systems as oracles/genies (1 or more being superintelligent), perhaps sometimes in "adversarial" ways. Exploring methods for making requests and maybe finding out things about how hard/easy it is to "trick" the operator (seeming to provide what they want without providing what they want) given various domains/methods.

May e.g. involve one AI asking for code-pieces for some purpose, but where other AI is to try to "hide" ways in which delivered code isn't quite what the other AI wants.

Superintelligence may realize what's going on (or may be going on), and act accordingly. But nonetheless maybe some useful info could be gained? 🤔

Perhaps experiments could make use of encryption in some way that prevented AGIs from doing/verifying work themselves, making it so that they would need to align the other AGI/AGIs. Encryption keys that only one AGI has could be necessary for doing and/or verifying work.

Could maybe set things up in such a way that one AGI knows it can get more reward if it tricks the other into approving faulty output.

Would need to avoid suffering sub-routines.

Similar to this, but not the same: Experiment with AGI where it is set to align other AGI. For example, maybe it needs to do some tasks to get reward, but those tasks need to be done by the other AGI, and it doesn't know what the tasks will be beforehand. One goal being to see methods AGI might use to align other AGI (that may then be used to align AGI-systems that are sub-systems of AGI-system, and seeing if output from this AGI converges with results from AGIs aligned by other principles).

Don't expect that this would be that fruitful, but haven't thought about it that much and who knows.

Would need to avoid suffering sub-routines.

Specify proof-format that has ambiguity (of cluster-like concepts, etc) be a part of the formalism, and has mappings between concepts and real world be part of formalism, and can reference output from functions as part of formalism.

Of course how much trust that is put in proof/argument-tree would depend on various things (allowing vague concepts makes it less trustable).

For cluster-like concepts referenced by proofs, a precise specification of criteria for exclusion/inclusion should not be expected, but the more the better. Inference rules and examples can specify the degree to which specific instances would fall within a specific concept or not (also allowed to say that some instances fall neither inside nor outside of it).

One of the points would be to make as much as possible be within the realm of things where an AGI could be expected to output proofs that are easier to verify compared to other output.

My thinking is that this would be most helpful when combined with other techniques/design-principles. Like, outputting the proof (very formal argument with computable inference-steps) is one thing, but another thing is which techniques/processes are used to look for problems with it (as well as looking for problems with the formalism as a whole, as well as testing/predicting how hard/easy it is to convince humans of things that are false or contradictory given various conditions/specifics).

Bonus if these formal proofs/statements can be presented in ways where humans easily can read them.

move away from the internet and the written word. push towards in person activity.