Or, what kind of hammers do you find yourself whacking alignment topics ("nails") with? Alignment topics include problems (e.g. instrumental convergence, mesa-optimizers, Goodhart's law), proposed solutions (e.g. quantilizers, debate, IRL), and whatever else gets brought up in this forum.

For example, I imagine Steven Byrnes would see an alignment problem and think about what algorithm or structure in the brain might solve it (please correct me if I'm wrong!).

I imagine Rohin Shah would see a proposed solution and ask how it intervenes on a threat model (also correct me if I'm wrong!).

Or, as a more general heuristic, asking for specific examples of the topic at hand.

Related: including specific examples of where you've used your heuristic would be appreciated, as would a full, gears-level model of when your heuristic is useful.

Some notes from a 2018 CHAI meeting on this topic (with some editing). I don't endorse everything on here, nor would CHAI-the-organization.

  • Learning part of your model that previously was fixed.
    • Can be done using neural nets, other ML models, or uncertainty (probability distributions)
    • Example: learning biases instead of hardcoding Boltzmann rationality (see the first sketch after this list)
  • Relatedly, treating an object as evidence instead of as ground truth
  • Looking at current examples of the problem and/or its solutions:
    • How does the human brain / nature do it?
    • How does human culture/society do this?
    • How has cognitive science formalized similar problems / what insights has it produced that we can build on?
    • Principal-agent models/Contracting theory
  • Adversarial agents for robustness
    • Internal design of the system to be adversarial inherently (e.g. debate)
    • External use of adversaries for testing: red teaming, adversarial training (see the second sketch after this list)
  • ‘Normative Bandwidth’
    • How much information about the correct behavior is actually conveyed, vs. how much information the robot policy assumes is conveyed
    • E.g. interpreting a reward function literally assumes it conveys all the information needed to derive the optimal policy -- that's a huge amount of information, and that assumption is always wrong. What's actually conveyed is much less -- something like what Inverse Reward Design assumes (namely, that the reward function only conveys information about good behavior in the training environments).
  • Proactive Learning (i.e. what if we ask the human?)
  • Induction (see e.g. iterated amplification)
    • Get good safety properties in simple situations, and then use them to build something more capable while preserving safety properties
  • Analyze a simple model of the situation in theory
  • Indexical uncertainty (uncertainty about your identity)
  • Rationality -- either make sure the agent is rational, or make sure it isn’t (i.e. don’t build agents)
  • Thinking about the human-robot system as a whole, rather than the robot in isolation. (See e.g. CIRL / assistance games.)
  • How would you do it with infinite resources (relaxed constraints)?
    • E.g. AIXI, Solomonoff induction, open-source game theory
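
To make the first bullet concrete, here's a minimal sketch (mine, not from the CHAI notes; the rewards and choice data are made up) of "learning part of your model that previously was fixed": instead of hardcoding a Boltzmann rationality coefficient beta = 1 for the human model, fit beta to observed choices by maximum likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy setup: known rewards for three options, and the options a (noisy) human picked.
rewards = np.array([1.0, 0.5, 0.0])
observed_choices = np.array([0, 0, 1, 0, 2, 0, 0, 1])

def neg_log_likelihood(beta):
    # Boltzmann model of human choice: P(option i) is proportional to exp(beta * reward_i).
    logits = beta * rewards
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -np.sum(log_probs[observed_choices])

# Previously-fixed version: just assume beta = 1 (hardcoded Boltzmann rationality).
# Learned version: estimate beta from the observed choices instead.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-3, 50.0), method="bounded")
print(f"fitted rationality coefficient beta ~= {result.x:.2f}")
```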
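
And a rough sketch of the "external adversary" version of the robustness bullet (again, all details are mine, not from the meeting): a gradient-sign adversary perturbs each input in the direction that increases the loss, and a toy logistic regression model trains against those perturbed inputs rather than only the clean data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy labels

w, b = np.zeros(2), 0.0
eps, lr = 0.1, 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    # Adversary: perturb each input in the direction that increases the logistic loss.
    p = sigmoid(X @ w + b)
    grad_x = np.outer(p - y, w)              # d(loss)/d(x) for each example
    X_adv = X + eps * np.sign(grad_x)

    # Defender: take a gradient step on the adversarially perturbed batch.
    p_adv = sigmoid(X_adv @ w + b)
    w -= lr * (X_adv.T @ (p_adv - y)) / len(y)
    b -= lr * np.mean(p_adv - y)

print("accuracy on clean data:", np.mean((sigmoid(X @ w + b) > 0.5) == y))
```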

"How do humans do it?"

Part of why this question is so good is that it has two functions. Most obviously, given some problem (like reasoning about your future self, or not going all Sorcerer's Apprentice when given a new task), it invites one to think about how the human brain solves this problem, with an eye towards designing an artificial system that does the same sorta thing. But also, sometimes keeping this question in mind makes you realize that humans don't actually solve the problem you're thinking about, and you should take a second look at whether you can get what you want without having to solve this problem either.


I imagine Rohin Shah would see a proposed solution and ask how it intervenes on a threat model (also correct me if I'm wrong!).

That's certainly one heuristic I often use :)