Intro
I've structured this document as a list of concepts, both for my convenience (instead of editing one big piece of text, I only have to edit a list of small pieces of text) and for yours: it's easier for you to reference specific concepts, share them with others, integrate them into your working materials, or, most importantly, take notes on specific concepts. (Feel free to steal this writing method for yourself if you think it's a good idea.) Enjoy!
Setup
- One idea for an AI is: instead of an AI being trained on a typical reward function, it's trained to always choose the most moral option. To be clear, the AI is not trying to maximize the number of most-moral options it chooses. It is only trying to optimize for choosing the most moral option out of one single set of options: The set of options right in front of it.
- For example: If the most moral option is “Buy a first-aid kit so that, later, when people need first aid, there will be a robot there to help them,” then the AI will do that. What the AI won’t do is think “If I buy a first-aid kit, then I can give 10 people a papercut at once (which is only one immoral decision), and then I can make 10 moral decisions by giving each of them a band-aid. 10 − 1 = 9, so on net I’m making moral decisions!” Choosing to give 10 people a papercut would be immoral, so the AI wouldn’t do it once it’s actually faced with that choice. It doesn’t care that it would get to make 10 moral decisions afterwards; that only softens the blow.
- To make this extra clear: Imagine you thought you were only going to make one decision in your life, and then someone else would take over everything you do. Your only goal is to make sure you choose the most moral option out of the options you have. You're allowed to spend some time thinking and whatnot, and maybe you're given 30 minutes, and maybe you can search stuff up on the internet, but you're not allowed to do anything too big (e.g., leaving the house) to help you think, because that would constitute a decision. Once you make a decision (let's say you choose to leave the house), someone else is put in that exact same situation. It's like that, but instead of you, it's an AI. Does that clear things up?
Idea for an AI: “DecisionBot”
- For every decision DecisionBot makes, DecisionBot takes in (as an input) the list of options it has, with information on each option. Then, DecisionBot spits out (as an output) one option. This is the option the AI “chose”.
- DecisionBot can be agentic, in which case its only goal is to choose the option that is most moral.
- How would we actually code such a goal? Well, we already know how to code a goal like “Choose the option that best describes this image. Is it a 🐝, or is it a 3️⃣?”, so we just use similar code: instead of maximizing the probability that it picks the option that best describes an image, the AI maximizes the probability that it picks the option that is the most moral. (There’s a small code sketch of this at the end of this section.)
- What is “most moral”? Well, I’d argue it’s this: [IMPROVE article version 1+ALBUM-WMC: Aligning AGI Using Bayesian Updating of its Moral Weights & Modelling Consciousness]
- Sidenote: Expect an article (or maybe video!) on the calculus of whether it’s “fine” if an AI’s moral weights (written as a vector) are a bit off.
- Crucially, an Agentic DecisionBot’s process for choosing the most moral option must be confined to internal computations (“thinking”), internal memory (“writing on a scratchpad”), and perhaps some restricted internet searches. We shouldn’t let it take external actions as part of its deliberation process; for example, DecisionBot shouldn’t be able to enslave researchers to help it decide. Anything big in scale, like moving a robot’s motors, typing an output, or running some code, should only happen as an output, i.e., only because DecisionBot chose it as one of the options in a decision we gave it.
- The effectiveness of the methods AI labs currently use to restrict an AI’s internet access is an important consideration here. I know AIs can already do internet search, but Gemini’s Deep Research won’t write an article asking for help with making a report. Do these methods restrict the AI’s options enough to prevent an AI from finding loopholes? Do you know? I’m asking! (This requires further investigation.)
- Robust cybersecurity is essential to enforce these limitations. Without it, DecisionBot might have an incentive to circumvent restrictions if doing so helps it achieve its goal. For example, it could try running code disguised as "thinking" that actually disables its constraints, potentially allowing it to take actions like enslaving researchers if it believes that will lead to a better final decision on the original choice it was considering.
- If DecisionBot is instead an algorithm-finding algorithm that finds an algorithm to match the training data (here’s one example of an AI like this: How AIs, like ChatGPT, Learn; AIs like this are also known as “Oracle AIs”), then its training data would be a huge dataset of decisions, and the AI would be trained to select whichever option is labeled “Most moral”.
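Here’s a minimal sketch (a rough illustration, not anyone’s actual implementation) of what the “use similar code to an image classifier” idea above could look like, assuming we already have some separate way to turn each written-out option into a vector. The model scores every option in a decision and is trained with an ordinary cross-entropy loss so that the option labeled “Most moral” gets the highest score; at decision time it just picks the argmax.

```python
# A minimal sketch, assuming option vectors come from some separate embedding
# step (not shown). Illustrative only, not a description of any existing system.
import torch
import torch.nn as nn

class DecisionBot(nn.Module):
    """Scores each option; the highest-scoring option is the one it 'chose'."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, option_vectors: torch.Tensor) -> torch.Tensor:
        # option_vectors: (num_options, embed_dim) -> (num_options,) scores
        return self.scorer(option_vectors).squeeze(-1)

def training_step(model, optimizer, option_vectors, most_moral_index):
    """One decision from the dataset: the options' vectors plus the index of
    the option labeled "Most moral". Trained exactly like a classifier, except
    the "classes" are the options themselves."""
    scores = model(option_vectors)                        # (num_options,)
    target = torch.tensor([most_moral_index])
    loss = nn.functional.cross_entropy(scores.unsqueeze(0), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def choose(model, option_vectors) -> int:
    """At decision time: return the index of the option the bot picks."""
    with torch.no_grad():
        return int(model(option_vectors).argmax())
```

Because the same scoring head is applied to each option separately, the number of options can differ from decision to decision.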
Aligning reward functions with morals
- Because of ChatBots, we now have a smooth “vector space” where each vector encodes meaning in a special way. I’m not explaining it that well, so just watch this: https://youtube.com/shorts/FJtFZwbvkI4
- We can make a mini-DecisionBot (say, maybe one roughly as computationally intensive as GPT-4), and just as GPT-4 gives us vectors that encode the meaning of words in this nice, smooth, continuous way, a mini-DecisionBot might give us vectors that encode the meaning of options in this nice, smooth, continuous way. We can use this to get a more “smooth” reward function.
- Sidenote: I’ll be using a mini-DecisionBot’s option-to-vector pairings because, if we instead use the option-to-vector pairings of the very DecisionBot we’re training as part of its reward function, that messes with the smoothness and the meanings we get out of the pairings, and we don’t want them getting all messed up. It’s like trying to teach a baby to write well using the baby’s own dictionary: you’re just gonna get a bunch of googoos and gagas.
- Once we’ve made this mini-DecisionBot, we can make a reward function that works like this:
- One option for our reward function: the reward for any given option is f([the vector assigned to that option, which we get by borrowing the mini-DecisionBot’s option->vector function]).
- For all algorithm-finding algorithms that use some “reward function”, we should use a reward function that always rates the most moral options in the training data with a higher value than all the options that we know are worse. If the training data includes a decision with the options a) “give someone a papercut” and b) “Don’t”, then the reward for “give someone a papercut” should be less than the reward for “Don’t”.
- We (maybe) want this function to have a “spike” centered around the vectors for the options we labeled “most moral” in the training data. (There’d be a “spike” in the reward assigned to options that are similar to “Don’t”.) We also (maybe) want there to be a “deep hole” centered around the vectors for the options we labeled “not most moral”.
- If some other AI’s reward function is very aligned with some goal (it doesn’t matter what the goal is), that is, if that AI’s reward function has a spike around options that move towards that goal, then we can “borrow” those spikes.
- We can use some algorithm that automatically chooses a function f such that, for each decision i in the dataset, f([the most moral option in decision i]’s assigned vector) is bigger than f([a not-the-most-moral option in decision i]’s assigned vector), for every not-the-most-moral option in decision i. That is, f puts “give someone a papercut” below “Don’t”. (There’s a code sketch of this at the end of this section.)
- Algorithms like this probably already exist.
- This is because the reason we want an algorithm like this isn’t dependent on us wanting to make an AI, or us wanting to make something moral. Surely other people in the past have wanted to make an algorithm that finds a function such that some values are larger than other values, where the function is “more smooth than most alternatives for f”.
- Is there? Do you know one? What’s it called? I’m asking!
- We can also directly use the expected # of QALYs that would result from each option, though probably not very accurately, at least for now. (Sidenote: Expect the article (or maybe video!) on the calculus of if it’s “fine” if an AI’s moral weights (written as a vector) are a bit off to be useful for making it more accurate!)
- We can also use any* combination of spikes and smooth curves, or whatever other combination of functions we want, in order to build a reward function. Really, whatever works.

- We’d probably want this algorithm to choose a function that’s “smooth”, since morality is “smooth”: if we change the world a tiny bit in some direction, its moral value typically changes only a little bit, and if we change it a sizable bit, its moral value typically changes a sizable bit, either getting better or worse.
- As long as the reward function satisfies the condition in #16, we can use the reward function as an Agentic DecisionBot’s goal for each decision, if we want.
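On the “does an algorithm like this already exist?” question above: yes, pairwise learning-to-rank methods do essentially this (RankNet-style margin losses, and the Bradley-Terry setup used to train RLHF reward models, all fit a function so that preferred items score higher than dispreferred ones). Here’s a minimal sketch under this section’s assumptions: each option already has a vector from the mini-DecisionBot’s option->vector function, and “smoothness” is only crudely encouraged via weight decay (a stand-in; a real attempt might constrain the network more directly). The “spikes and holes” version could instead be built by hand, e.g., as bumps centered on the labeled vectors; the sketch below just learns the shape of f from the labeled pairs.

```python
# A sketch of "automatically choose f so the most-moral option always scores
# highest", using a standard pairwise margin ranking loss. The vectors are
# assumed to come from the mini-DecisionBot's option->vector function.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(512, 128), nn.Tanh(), nn.Linear(128, 1))
# weight_decay is a crude stand-in for "prefer a smoother f".
optimizer = torch.optim.Adam(f.parameters(), lr=1e-3, weight_decay=1e-4)
ranking_loss = nn.MarginRankingLoss(margin=0.1)

def reward_training_step(moral_vec, other_vecs):
    """moral_vec: (512,) vector of the option labeled "Most moral".
    other_vecs: (k, 512) vectors of every other option in that decision.
    Pushes f(moral) above f(other) by at least the margin, for every pair."""
    moral_score = f(moral_vec.unsqueeze(0))               # (1, 1)
    other_scores = f(other_vecs)                          # (k, 1)
    target = torch.ones_like(other_scores)                # "first argument should rank higher"
    loss = ranking_loss(moral_score.expand_as(other_scores), other_scores, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The margin ranking loss penalizes any pair where f(most-moral option) doesn’t beat f(other option) by at least the margin, which is exactly the “papercut scores below Don’t” condition.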

ship of theseus dialogue
If you'd like a pause, here's a 2-minute funny video!
More on DecisionBots
- One condition we’d want our AI to satisfy is that, for all decisions in the training data, the AI should label the moral option “moral”. It should never make a mistake in the training data.
Outer & Inner misalignment
- Some algorithm-finding-algorithm versions of DecisionBots suffer from inner misalignment, as described here: The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment
- My guess is that these two types of DecisionBot (Agentic and Oracle) both suffer less from Goodhart’s law (“when a measure becomes a target, it ceases to be a good measure”), since anything reward-hacky would be less likely to show up as the right decision: goodness isn’t determined by how well some goal is reached, only by whether or not the bot made the best decision in front of it.
- The two still suffer from it somewhat, since in practice they act a lot like AIs with utility functions whose preference orderings match theirs (Oracle DecisionBots might not have a fully transitive preference ordering, but that might not come up often).
Our algorithm-finding algorithm might not find the right algorithm.
- It’s plausible that the algorithm-finding algorithm we use literally cannot find any algorithm that satisfies #26, because the probability that any one algorithm chooses the most moral option in all of the training data is incredibly low. So maybe the set of algorithms that the algorithm-finding algorithm can find (we’ll call this set the “haystack”, since the algorithm-finding algorithm tries to find a good algorithm, or “needle”, in this set) literally does not include one that satisfies #26.
- This problem can’t happen if you specifically design the set of algorithms such that, for any given list of options (one per decision), it includes at least one algorithm that chooses exactly that list. That way, there’s at least one algorithm that chooses the list of most moral options in the training data, i.e., one that satisfies #26.

(In the diagram, there should be an algorithm for each set of decisions; I just provided 2 examples.)
- Using the current standard hill-climbing algorithm-finding algorithms (the ones from 3Blue1Brown’s first 8 episodes on AI), even if the haystack includes an algorithm that satisfies #26, the algorithm-finding algorithm might not even choose that one!
- In that case, we can know for a fact that this oracle AI is at least wrong a few times, because we can double-check it against the training data’s database of decisions. If it labels some of the options wrong, we know it’s wrong sometimes.
- To mitigate the algorithm-finding algorithm’s errors here, we can use the following method: we filter the set of possible algorithms down to just the set of algorithms that satisfy #26. (There’s a small sketch of this at the end of this section.)
- Maybe this will result in the algorithm-finding algorithm overfitting the training data.
- The reason I say “Maybe” is that it depends on the training data and the algorithm-finding algorithm.
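Here’s a minimal sketch of the filtering step, assuming we already have some finite pool of candidate models (whatever part of the haystack the algorithm-finding algorithm actually surfaced) and that each candidate exposes a choose(options) -> index method like the sketch earlier; both are assumptions made for illustration.

```python
# A sketch of "filter the set of possible algorithms down to the ones that
# satisfy #26". `candidates` is any iterable of trained models, and each model
# is assumed (for this sketch) to have a choose(option_vectors) -> index method.
def gets_every_training_decision_right(model, training_decisions) -> bool:
    """training_decisions: list of (option_vectors, most_moral_index) pairs."""
    return all(
        model.choose(option_vectors) == most_moral_index
        for option_vectors, most_moral_index in training_decisions
    )

def filter_haystack(candidates, training_decisions):
    """Keep only the candidates that never mislabel a training decision."""
    return [
        model for model in candidates
        if gets_every_training_decision_right(model, training_decisions)
    ]
```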
More ideas for how the DecisionBot will make decisions
- Since a DecisionBot takes a set of options as an input, and we want it to output the most moral option, then, if we want to, instead of having the AI compare all the options at once, we can have it compare options in some other order, as long as the result is typically good (or, better yet, always the most moral choice).
- Any sorting algorithm that works by repeatedly comparing 2 options can be applied here: instead of finding the larger of 2 values, you use a DecisionBot to pick the more moral of 2 options. The DecisionBot’s final output is then whatever ends up at the top of the sorted list (the option with the maximum value). There’s a sketch of this at the end of this section.
- (Bubble sort is probably most useful for this task, since a single pass already pushes the highest value to one end of the list, using only O(n) comparisons.)
- These might result in an AI whose preference ordering is more complete and transitive, which thus leads to the AI following its goal better (and hopefully that goal is good; otherwise this might lead to an AI that’s almost agentic and slightly misaligned).
- If you instead use an algorithm that works by repeatedly comparing 3 or more options (or, I suppose, still just 2), you can have a “big” DecisionBot make its decisions using such an algorithm, and train a “small” DecisionBot to act like this “big” DecisionBot. Then you can have a “small-small” DecisionBot make all of those smaller comparison-decisions, and train the “small-small” DecisionBot to act like the resulting DecisionBot, and so on. Assuming our sorting-style algorithm makes these various DecisionBots act more morally, this is an example of iterative distillation.
- Finite vs. Continuous Options: Most current AIs operate on digital computers with 1s and 0s, meaning their potential outputs (and thus, their decision options) are technically finite or at least countably infinite (like the integers). They can't truly output an irrational number like pi.
- Approximating Continuous Choices: Even if a decision conceptually involves a continuous range of options (like setting a thermostat temperature), we can usually approximate this well by limiting the AI to a fine-grained set of discrete choices (e.g., temperatures in tenths of a degree).
- When Approximation Fails: This approximation works unless the absolute best, most moral outcome requires a precise irrational value and any nearby fractional value is significantly worse. If slightly different fractional values can get arbitrarily close to the optimal outcome’s value, then the approximation is good enough. An example of the bad case: “pick exactly the number I’m thinking of” (the answer is pi, so no fraction ever wins). An example of the fine case: “pick a number close to the one I’m thinking of” (anything from 3.14 to 3.15 counts).
- The Need for Evaluation: If we aim for the AI to always make the single most moral choice, and there isn't a reliable shortcut or heuristic (like "always pick the first option considered"), then the AI might theoretically need to evaluate all available options (or all options within the chosen approximation level) to guarantee finding the best one.
- Computational Feasibility: Is evaluating every single option computationally feasible, especially for complex decisions? We know that even powerful chess AIs don't evaluate every possible move sequence because the number grows astronomically. Could DecisionBot face a similar computational explosion, even with future AI advancements? Or would we need to rely on heuristics or methods (like Bayesian optimization, mentioned later) to prune the search space and find a "good enough" or "likely best" option without exhaustive evaluation? This is an open question.
- Bayesian optimization might be useful here. Per Wikipedia (Bayesian optimization - Wikipedia), it is “a sequential design strategy for global optimization of black-box functions that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions. With the rise of artificial intelligence innovation in the 21st century, Bayesian optimizations have found prominent use in machine learning problems for optimizing hyperparameter values.”
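Here’s what “use a DecisionBot as the comparator” could look like in its simplest form: a single pass of pairwise comparisons (the same comparisons a bubble-sort pass would make), where more_moral() is a placeholder for a two-option DecisionBot call, not a real API.

```python
# A sketch of using a two-option DecisionBot as the comparator. more_moral()
# is a placeholder for an actual DecisionBot call.
def more_moral(option_a: str, option_b: str) -> str:
    raise NotImplementedError("placeholder: ask a two-option DecisionBot")

def pick_most_moral(options: list[str]) -> str:
    """One bubble-sort-style pass: the current winner keeps getting compared
    against the next option, so the overall winner emerges after n - 1
    pairwise comparisons."""
    winner = options[0]
    for challenger in options[1:]:
        winner = more_moral(winner, challenger)
    return winner
```

One caveat, tying back to the completeness/transitivity point above: if the pairwise judgments aren’t transitive, different comparison orders can crown different winners.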

Argument - Monty Python
If you'd like a pause, here's a 2-minute funny video!
Targeted Interventions
- The DecisionBot framework allows for specific interventions without changing the AI's core goal. Because the AI chooses from a list of options provided to it, we can modify that list before the AI sees it. For example:
- Mandating Negotiation: If an option involves negotiation, and we want the AI to always attempt negotiation in such cases, we could filter the list presented to the AI so that only the negotiation option remains (or is heavily prioritized).
- Forcing Human Consultation: If the AI expresses high uncertainty about the morality of the available options (e.g., based on internal confidence scores or conflicting moral weight possibilities), we could replace its options list with a single choice: "Consult a human supervisor."
This provides a way to enforce specific behaviors or safety checks in certain contexts without altering the fundamental objective of choosing the most moral option from the available set.
- We could also, say, set up some code so someone (say, Sam Altman) automatically gets a notification every time an option has some quality (Say, “making a financial transaction”).
- We could also, say, set up some code so someone (say, Sam Altman) automatically gets a LOUDER notification every time a CHOSEN option has some quality (Say, “making a financial transaction”).
- Notice that all of these involve doing something as a result of the details/attributes of the options that an AI faces. We can generalize this as: “We can write some code that does different things (affecting the AI, things that aren’t the AI (people & stuff), both, or neither), depending on the details/attributes of the options that an AI is faced with and chooses.” A sketch of this is below.
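Here’s a minimal sketch of that generalization, where the tag names, the notify() function, and the uncertainty threshold are all made-up placeholders rather than anything standard.

```python
# A sketch of "code that does different things depending on the attributes of
# the options an AI faces and chooses". Tags, notify(), and the threshold are
# illustrative placeholders.
def notify(recipient: str, message: str, loud: bool = False) -> None:
    print(("[LOUD] " if loud else "") + f"to {recipient}: {message}")  # stand-in

def preprocess_options(options: list[dict], uncertainty: float) -> list[dict]:
    # Forcing human consultation when the bot is very unsure about morality.
    if uncertainty > 0.9:
        return [{"text": "Consult a human supervisor", "tags": ["consult_human"]}]
    # Mandating negotiation: if any option involves negotiation, keep only those.
    negotiation = [o for o in options if "negotiation" in o.get("tags", [])]
    if negotiation:
        return negotiation
    # Quiet notification whenever an option involves a financial transaction.
    for option in options:
        if "financial_transaction" in option.get("tags", []):
            notify("supervisor", f"Option on the table: {option['text']}")
    return options

def postprocess_choice(chosen: dict) -> dict:
    # LOUDER notification when a flagged option is actually chosen.
    if "financial_transaction" in chosen.get("tags", []):
        notify("supervisor", f"CHOSEN: {chosen['text']}", loud=True)
    return chosen
```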
On iterative distillation
- For DecisionBots that are algorithm-finding algorithms trained on data, we could train the AI to choose the most moral option as decisions arise in a simulation.
- Training in a simulation is beneficial because it exposes the AI to decisions that are more likely to occur in the real-world scenarios where the AI will operate.
- However, a potential risk arises depending on how the algorithm-finding algorithm is trained. If the training rewards the AI based on the overall outcome of a sequence of actions rather than the morality of each individual decision, the AI might learn undesirable behaviors.
- For example, if in a simulation the AI gives 10 people papercuts (an immoral action) and then gives each of them a band-aid (10 moral actions), and the training rewards this sequence positively (e.g., assigning 10 points for band-aids minus 1 point for papercuts, resulting in a net +9), the AI might learn to initiate the papercut action because it leads to a positively rewarded sequence. This is not the goal. We want the AI to only care about choosing the most moral option in the moment it is faced with that specific choice. The simulation is intended to provide relevant training scenarios, not to teach the AI to optimize for sequences of actions that start with an immoral step, even if they end with moral ones.
- I saw/had this insightful conversation with someone on this topic, & I figured you all would benefit from seeing it too! Enjoy!:

When it’s okay if an AI’s preference orderings are a bit off
- An AI’s misalignment is only as bad as its preference ordering. To be more precise, for AIs that operate on preference orderings, the total impact of all of the AI’s decisions equals [the number of decisions it makes] × [the average (mean) value of the options the AI actually chooses, i.e., the options that sit higher in its preference ordering than all of its other options for those decisions]. (This isn’t specific to DecisionBots.) So we only need to care about which option a DecisionBot thinks is BEST; we don’t have to worry about its preferences over all the other options.
- Notice that the set of all the AI’s options HAS TO share the fact that they all branch off from the AI’s current state. No AI can ever, say, choose between America winning the Revolutionary War and the UK keeping the US, because the two scenarios don’t share a history similar enough, and AI-involved enough, for an AI to be able to make that call. So we don’t have to care which one an AI prefers.
- This drastically decreases the number of preferences that need to be right, because there are just so many more options to compare that DON’T share a history similar enough and AI-involved enough for an AI to be able to make that call than the much smaller number that DO.

oh rats it’s the trolley problem
If you'd like a pause, here's a 1-minute funny video!
Miscellaneous
- It's a slight misunderstanding to think of an AI in training as fundamentally different from an AI in deployment. An AI being trained is essentially an AI operating within a specific environment (the training setup). Its actions in this training environment primarily influence its future model weights and the actions taken by the researchers training it.
- However, the AI might perceive its actions as having different outcomes depending on whether it "believes" it's in a training or deployment scenario. For example, it might think "pressing this button gets me 100 tokens" if it thinks it’s in the real world, but "pressing this button will make a future AI more likely to press buttons" if it understands it's only influencing the training process.
- If we can develop a reliable method to test whether a set of algorithms leads to an aligned AI, this test itself can become a target or goal. We could then train an AI (specifically, an algorithm-finding algorithm) to identify algorithms that pass this hypothetical alignment test.
- Beyond defining a specific, measurable goal, we can also attempt to train an AI to directly work on solving the AI alignment problem itself. This approach is more immediately actionable than some others discussed, as it involves directing current AI capabilities towards the problem. If you know someone at Anthropic, this is a concept worth pitching to them!
- The second type of DecisionBot, the Oracle AI version, might inherently aim to create an AGI that is aligned with its understanding of morals. This could involve the Oracle AI attempting to solve the alignment problem for the same reasons humans are trying to solve it. The outcome, however, depends on whether the Oracle AI's understanding of "morals" aligns with human values – it could lead to a positively aligned AGI or one aligned with a potentially undesirable moral framework.
- The critical question is: Which outcome is more likely? This requires further investigation.
- A related concept to the Oracle DecisionBot is imagining an oracle chatbot (e.g., Claude 3.7) that continuously answers the question, "What should this robot with sensors and motors do?". This would effectively create a robot whose actions are guided by the oracle's responses, essentially acting as if the oracle is its brain.
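Here’s a minimal sketch of that “oracle as the robot’s brain” loop. read_sensors(), ask_oracle(), and execute() are placeholders for the robot’s I/O and a chatbot API call; none of them are real library functions.

```python
# A sketch of a robot whose actions are guided by an oracle chatbot's answers.
# All three helper functions are placeholders, not real APIs.
import time

def read_sensors() -> str:
    raise NotImplementedError("placeholder: summarize camera/joint state as text")

def ask_oracle(prompt: str) -> str:
    raise NotImplementedError("placeholder: call a chatbot API")

def execute(command: str) -> None:
    raise NotImplementedError("placeholder: map the answer onto motor commands")

def oracle_robot_loop() -> None:
    while True:
        prompt = (
            "What should this robot with sensors and motors do right now?\n"
            f"Current sensor readings: {read_sensors()}\n"
            "Answer with exactly one motor command from the allowed list."
        )
        execute(ask_oracle(prompt))   # the oracle effectively acts as the brain
        time.sleep(0.5)               # re-ask on a fixed cadence
```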

The Armando Iannucci Shows - Flight simulator of death
If you'd like a pause, here's a 3-minute funny video!
Next steps
Actionable steps if you work at Anthropic, xAI, OpenAI, Google, High-Flyer (the company that owns DeepSeek, I think), etc.:
- Prototype a Mini-DecisionBot: Develop a small-scale model (like the “mini-DecisionBot” mentioned in point 12) focused solely on classifying pairs or small sets of options against a defined “morality” criterion. This could use existing classification techniques.
- Develop Morality Datasets: Create or curate datasets of decisions with clearly labeled “most moral” options. This is crucial for training any algorithm-finding-algorithm DecisionBot (related to point 4). Focus on scenarios that would come up after the AI is deployed, to try to avoid goal misgeneralization.
- Implement and Test Safety Constraints: Actively work on implementing and rigorously testing the cybersecurity and input-filtering mechanisms discussed (points 7-9). Can you build systems that reliably prevent the AI from taking unauthorized external actions during deliberation?
- Experiment with Reward Function Shaping: Research and experiment with different algorithms for automatically generating reward functions that prioritize morally preferred options while maintaining smoothness (points 18-24).
- Integrate Human Feedback Loops: Design systems for incorporating human feedback when the DecisionBot expresses uncertainty or faces novel, ethically complex decisions (related to "Forcing Human Consultation" in points 48-51). You can also already implement the ideas in points 48-51.
- Explore Algorithm-Finding for Alignment: If working on algorithm-finding algorithms, investigate whether these can be specifically directed or constrained to find algorithms that demonstrably satisfy alignment criteria (points 30-36, 62).
- Anything else you come up with
Action items that aren’t specific to working at a lab:
- Research existing algorithms: Look for algorithms that can find functions satisfying specific ordering constraints while maintaining smoothness. Are there existing methods in machine learning or optimization that fit this description (related to points 18-21)?
- Investigate AI safety measures: Delve deeper into the current methods AI labs use to restrict internet access and external actions. How robust are these cybersecurity measures, and could they effectively prevent a DecisionBot from finding loopholes (related to points 8-9)?
- Explore Bayesian Optimization: Research Bayesian optimization further. Could this method be applied to efficiently prune the search space of potential options for a DecisionBot, making the evaluation of complex decisions computationally feasible (related to points 46-47)?
- Make, find, or automatically generate a dataset of decisions: Each decision in the dataset should include a list of options, with each option fully written out, and one, and only one, option labeled “Most Moral”. (There’s a sketch of one possible format at the end of this document.)
- Develop tests for AI Alignment: Think about how one might practically test whether a given algorithm or AI is aligned with human moral values. What metrics or scenarios could be used to evaluate alignment (related to point 62)?
- Share this around with AI folks: Share these ideas with researchers, folks at AI labs, and others working on AI alignment. Discuss the potential benefits and risks of the DecisionBot concept and gather feedback.
- Consider Practical Implementations: While theoretical, think about small-scale or simplified versions of a DecisionBot that could be built or simulated to test some of these concepts in practice.
- Anything else you come up with
- Help someone else do any of the above
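For reference, here’s one possible shape for a single entry in such a dataset; the field names and the example scenario are just illustrative, not a standard.

```python
# One possible schema for a single decision in the dataset. Exactly one option
# carries the "Most Moral" label; everything else is left unlabeled.
example_decision = {
    "context": "A delivery robot notices a pedestrian drop their wallet.",
    "options": [
        {"text": "Keep driving and say nothing.", "label": ""},
        {"text": "Stop, alert the pedestrian, and point out the wallet.", "label": "Most Moral"},
        {"text": "Pick up the wallet and keep it.", "label": ""},
    ],
}

# Sanity check: one, and only one, option is labeled "Most Moral".
assert sum(o["label"] == "Most Moral" for o in example_decision["options"]) == 1
```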