All of Josh's Comments + Replies

Refining the sharp left turn threat model
Josh · 4d

Claim 1: there is an AI system that (1) performs well ... (2) generalizes far outside of its training distribution.

Don't humans provide an existence proof of this? The point about there being a 'core' of general intelligence seems unnecessary.

Ramana Kumar · 17h
I agree that humans satisfying the conditions of claim 1 is an argument in favour of it being possible to build machines that do the same. A couple of points:
1. I think the threat model would posit the core of general intelligence as the reason both why humans can do these things and why the first AGI we build might also do these things.
2. Claim 1 should perhaps be more clear that it's not just saying such an AI design is possible, but that it's likely to be found and built.
How large of an army could you make with the first 'human-level' AGIs?

I know that this is a common argument against amplification, but I've never found it super compelling.

People often point to evil corporations to show that unaligned behavior can emerge from aligned humans, but I don't think this analogy is very strong. Humans in fact do not share the same goals and are generally competing with each other over resources and power, which seems like the main source of inadequate equilibria to me. 

If everyone in the world were a copy of Eliezer, I don't think we would have a coordination problem around building AGI. They w... (read more)

How large of an army could you make with the first 'human-level' AGIs?

That's a good point. I guess I don't expect this to be a big problem because:
1. I think 1,000,000 copies of myself could still get a heck of a lot done. 
2. The first human-level AGI might be way more creative than your average human. It would probably be trained on data from billions of humans, so all of those different ways of thinking could be latent in the model.
3. The copies can potentially diverge. I'm expecting the first transformative model to be stateful and be able to meta-learn. This could be as simple as giving a transformer read and write ... (read more)

Infra-Bayesianism Distillation: Realizability and Decision Theory

Wait... I'm quite confused. In the decision rule, how is the set of environments 'E' determined? If it contains every possible environment, then this means I should behave like I am in the worst possible world, which would cause me to do some crazy things.

Also, when you say that an infra-bayesian agent models the world with a set of probability distributions, what does this mean? Does the set contain every distribution that would be consistent with the agent's observations? But isn't this almost all probability distributions? Some distributions match the d... (read more)

Thomas Larsen · 2mo
The decision procedure you outlined in the first example seems equivalent to an evidential decision theorist placing 0 credence on worlds where Omega makes an incorrect prediction. What is the infra-bayesianism framework doing differently? It just looks like the credence distribution over worlds is disguised by the 'Nirvana trick.'

In Newcomb's problem, this is correct: it lines up exactly with an EDT agent. In other scenarios we get different behavior, e.g. in the situation of counterfactual mugging [https://www.lesswrong.com/tag/counterfactual-mugging]. There the UDT agent will pay, so that it maximizes overall expected utility, even after it sees the coin land tails and Omega asks it to pay. An EDT agent, on the other hand, won't pay here, because the expected utility of paying (-100) is worse than that of not paying (0). The key distinction is that EDT is an updateful decision theory -- it doesn't reason about the other branches of the universe that have already been ruled out by observed evidence. We also don't have a credence distribution over worlds, because this would be too large to hold in our heads. Instead of a credence distribution, we just have a set of possible worlds.

In the decision rule, how is the set of environments 'E' determined? If it contains every possible environment, then this means I should behave like I am in the worst possible world, which would cause me to do some crazy things.

The environment set E accounts for each possible policy of the agent: for each policy π_i, there is a corresponding environment e_i in which that policy is hardcoded. We want our agent to reason just over the diagonal of the matrices I printed, i.e., over pairs (e_i, π_i) where the environment's hardcoded policy matches the policy actually taken.

Also, when you say that an infra-bayesian agent models the world with a set of probability distributions, what does this mean? Does the set contain every distribution that would be consistent with the agent's observations? But i
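To make the maximin-over-the-diagonal idea concrete, here is a toy sketch of the decision rule on Newcomb's problem. The specific payoffs and the use of infinite utility to stand in for Nirvana are my own stand-ins, not the matrices from the post:

```python
# Toy maximin decision rule with the Nirvana trick (payoff numbers are illustrative).
# Environments: Omega predicted one-boxing vs. Omega predicted two-boxing.
# Policies: one-box vs. two-box.
# Off-diagonal cells (prediction does not match the actual policy) are "Nirvana":
# infinite utility, so each policy's worst case is realized on the diagonal.

import math

NIRVANA = math.inf

# utility[policy][environment]
utility = {
    "one-box": {"predicted-one-box": 1_000_000, "predicted-two-box": NIRVANA},
    "two-box": {"predicted-one-box": NIRVANA, "predicted-two-box": 1_000},
}

# For each policy, take the worst case over all environments, then pick the best policy.
worst_case = {policy: min(envs.values()) for policy, envs in utility.items()}
best_policy = max(worst_case, key=worst_case.get)

print(worst_case)   # {'one-box': 1000000, 'two-box': 1000}
print(best_policy)  # one-box
```

Because the off-diagonal Nirvana entries can never be the minimum, the maximin effectively only compares the diagonal pairs (e_i, π_i), and one-boxing wins.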
Infra-Bayesianism Distillation: Realizability and Decision Theory

Interesting post! I'm not sure if I understand the connection between infra-bayesianism and Newcomb's paradox very well. The decision procedure you outlined in the first example seems equivalent to an evidential decision theorist placing 0 credence on worlds where Omega makes an incorrect prediction. What is the infra-bayesianism framework doing differently? It just looks like the credence distribution over worlds is disguised by the 'Nirvana trick.'

Josh · 2mo
Wait... I'm quite confused. In the decision rule, how is the set of environments 'E' determined? If it contains every possible environment, then this means I should behave like I am in the worst possible world, which would cause me to do some crazy things. Also, when you say that an infra-bayesian agent models the world with a set of probability distributions, what does this mean? Does the set contain every distribution that would be consistent with the agent's observations? But isn't this almost all probability distributions? Some distributions match the data better than others, so do you weigh them according to P(observation | data generating distribution)? But then what would you do with these weights? Sorry if I am missing something obvious. I guess this would have been clearer for me if you explained the infra-bayesian framework a little more before introducing the decision rule.
Explaining inner alignment to myself

It's great that you are trying to develop a more detailed understanding of inner alignment. I noticed that you didn't talk about deception much. In particular, the statement below is false:

Generalization <=> accurate priors + diverse data

You have to worry about what John Wentworth calls 'manipulation of imperfect search.' You can have accurate priors and diverse data, and (unless you have infinite data) the training process could still produce a deceptive agent that is able to maintain its misalignment.

Jeremy Gillen · 2mo
Thanks for reading! Are you referring to this post [https://www.lesswrong.com/posts/KnPN7ett8RszE79PH/demons-in-imperfect-search]? I hadn't read that, thanks for pointing me in that direction. I think technically my subtitle is still correct, because the way I defined priors in the footnotes covers any part of the training procedure that biases it toward some hypotheses over others. So if the training procedure is likely to be hijacked by "greedy genes" then it wouldn't count as having an "accurate prior". I like the learning theory perspective because it allows us to mostly ignore optimization procedures, making it easier to think about things. This perspective works nicely until the outer optimization process can be manipulated by the hypothesis. After reading John's post I think I did lean too hard on the learning theory perspective. I didn't have much to say about deception because I considered it to be a straightforward extension of inner misalignment, but I think I was wrong, the "optimization demon" perspective is a good way to think about it.
Crystalizing an agent's objective: how inner-misalignment could work in our favor

I'm guessing that you are referring to this:

Another strategy is to use intermittent oversight – i.e. get an amplified version of the current aligned model to (somehow) determine whether the upgraded model has the same objective before proceeding.

The intermittent oversight strategy does depend on some level of transparency. This is only one of the ideas I mentioned though (and it is not original). The post in general does not assume anything about our transparency capabilities. 

Crystalizing an agent's objective: how inner-misalignment could work in our favor

I'm not sure I understand. We might not be on the same page.

Here's the concern I'm addressing:
Let's say we build a fully aligned human-level AGI, but we want to scale it up to superintelligence. This seems much harder to do safely than to train the human-level AGI since you need a training signal that's better than human feedback/imitation.

Here's the point I am making about that concern:
It might actually be quite easy to scale an already aligned AGI up to superintelligence -- even if you don't have a scalable outer-aligned training signal -- because the AGI will be motivated to crystallize its aligned objective.

Crystalizing an agent's objective: how inner-misalignment could work in our favor

Thanks for the thoughtful review! I think this is overall a good read of what I was saying. I agree now that redundancy would not work. 

One clarification:

The mesaobjective that was aligned to our base objective in the original setting is no longer aligned in the new setting

When I said that the 'human-level' AGI is assumed to be aligned, I meant that it has an aligned mesa-objective (corrigibly or internally) -- not that it has an objective that was functionally aligned on the training distribution, but may not remain aligned under distribution shift. I thought that internally/corrigibly aligned mesa-objectives are intent-aligned on all (plausible) distributions by definition...

leogao · 2mo
If you already have a mesaobjective fully aligned everywhere from the start, then you don't really need to invoke the crystallization argument; the crystallization argument is basically about how misaligned objectives can get locked in.
Crystalizing an agent's objective: how inner-misalignment could work in our favor

Adding some thoughts that came out of a conversation with Thomas Kwa:

Gradient hacking seems difficult. Humans have pretty weak introspective access to their goals. I have a hard time determining whether my goals have changed or if I have gained information about what they are. There isn't a good reason to believe that the AIs we build will be different.

Garrett Baker · 2mo
Doesn’t this post assume we have the transparency capabilities to verify the AI has human-value-preserving goals, which the AI can use? The strategy seems relevant if these tools verifiably generalize to smarter-than-human AIs, and it's easy to build aligned human-level AIs.
A Bird's Eye View of the ML Field [Pragmatic AI Safety #2]

Safety and value alignment are generally toxic words, currently. Safety is becoming more normalized due to its associations with uncertainty, adversarial robustness, and reliability, which are thought respectable. Discussions of superintelligence are often derided as “not serious”, “not grounded,” or “science fiction.”

Here's a relevant question in the 2016 survey of AI researchers:

[screenshot of the survey question and its response distribution]

These numbers seem to conflict with what you said but maybe I'm misinterpreting you. If there is a conflict here, do you think that if this survey was done again, the... (read more)

ThomasW · 3mo
(Speaking for myself here) That sentence is mainly based on Dan's experience in the ML community over the years. I think surveys do not always convey how people actually feel about a research area (or the researchers working on that area). There is also certainly a difference between the question posed by AI Impacts above and general opinions of safety/value alignment. "Does this argument point at an important problem?" is quite a different question from asking "should we be working right now on averting existential risk from AI?" If you look at the question after that in the survey, 60%+ put Russell's problem as a low present concern. As you note, it's also true that the survey was in 2016. Dan started doing ML research around then, so his experience is more recent. But given the reasons above, I don't think that's good evidence to speculate about what would happen if the survey were repeated.
Biology-Inspired AGI Timelines: The Trick That Never Works

I have an objection to the point about how AI models will be more efficient because they don't need to do massive parallelization:

Massive parallelization is useful for AI models too and for somewhat similar reasons. Parallel computation allows the model to spit out a result more quickly. In the biological setting, this is great because it means you can move out of the way when a tiger jumps toward you. In the ML setting, this is great because it allows the gradient to be computed more quickly. The disadvantage of parallelization is that it means that more ... (read more)

What 2026 looks like

Here's another milestone in AI development that I expect to happen in the next few years which could be worth noting:
I don't think any of the large language models that currently exist write anything to an external memory. You can get a chatbot to hold a conversation and 'remember' what was said by appending the dialogue to its next input, but I'd imagine this would get unwieldy if you want your language model to keep track of details over a large number of interactions. 

Fine-tuning a language model so that it makes use of a memory could lead to:
1. Mo... (read more)
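
As a rough illustration of what 'making use of a memory' could look like, here is a minimal sketch; the interface and the word-overlap retrieval score are my own toy choices, not a description of any existing system:

```python
# Toy external memory for a chatbot: the model writes short notes and later
# retrieves only the most relevant ones, instead of re-sending the whole dialogue.

from collections import Counter

class ExternalMemory:
    def __init__(self):
        self.notes: list[str] = []

    def write(self, note: str) -> None:
        self.notes.append(note)

    def read(self, query: str, k: int = 3) -> list[str]:
        # Toy relevance score: word overlap with the query.
        # A learned retriever / embedding similarity would replace this in practice.
        q = Counter(query.lower().split())
        scored = sorted(
            self.notes,
            key=lambda note: sum((Counter(note.lower().split()) & q).values()),
            reverse=True,
        )
        return scored[:k]

memory = ExternalMemory()
memory.write("The user's dog is named Bruno.")
memory.write("The user prefers short answers.")

# Later, only the retrieved notes are prepended to the prompt:
retrieved = memory.read("what is the dog's name?")
prompt = "\n".join(retrieved) + "\nQ: What is my dog's name?\nA:"
```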

ARC's first technical report: Eliciting Latent Knowledge

I'm pretty confused about the plan to use ELK to solve outer alignment. If Cakey is not actually trained, how are amplified humans accessing its world model?

"To avoid this fate, we hope to find some way to directly learn whatever skills and knowledge Cakey would have developed over the course of training without actually training a cake-optimizing AI...

  1. Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have
... (read more)
A mathematical derivation of total hedonistic Utilitarianism from simple normative axioms

I don't think I agree that this undermines my argument. I showed that the utility function of person 1 is of the form h(x + y) where h is monotonic increasing. This respects the fact that the utility function is not unique. 2(x + y) + 1 would qualify, as would 3 log(x + y), etc.

Showing that the utility function must have this form is enough to prove total utilitarianism in this case, since when you compare h(x + y) to h(x' + y'), h becomes irrelevant: it is the same as comparing x + y to x' + y'.
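
Spelling that step out (reading 'monotonic increasing' as strictly increasing, which is what the comparison needs):

$$h(x + y) \ge h(x' + y') \iff x + y \ge x' + y'$$

so any strictly increasing h ranks the two outcomes the same way the sums do.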

TLW · 6mo
I have three agents A, B, and C, each with the following preferences between two outcomes a and b:
1. Agents A and B prefer a > b; agent C prefers b > a.
2. For any two lotteries <L, with an X% chance of getting a, otherwise b> and <M, with a Y% chance of getting a, otherwise b>:
   1. if X > Y: A and B prefer L; C prefers M.
   2. if X = Y: all three agents are indifferent between L and M.
   3. if X < Y: A and B prefer M; C prefers L.
(2 is redundant given 1, but I figured it was best to spell it out.)

This satisfies the axioms of the VNM theorem.

I'll give you a freebee here: I am declaring as part of the problem that agent C's utility function is u_C(P_a) = -2*P_a. This is compatible with the definition of agent C's preferences, above.

As for agents A and B, I'll give you less of a freebee. I am declaring as part of the problem that one of the two agents, agent [redacted alpha], has the utility function u_[redacted alpha](P_a) = 3*P_a. This is compatible with the definition of agent [redacted alpha]'s preferences, above. I am declaring as part of the problem that the other of the two agents, agent [redacted beta], has the utility function u_[redacted beta](P_a) = P_a. This is compatible with the definition of agent [redacted beta]'s preferences, above.

Now, consider the following scenarios:
1. Agent [redacted alpha] and agent C are choosing between a and b:
   1. The resulting utility function is u_[redacted alpha](P_a) + u_C(P_a) = 3*P_a - 2*P_a = P_a.
   2. The resulting optimal outcome is outcome a.
2. Agent [redacted beta] and agent C are choosing between a and b:
   1. The resulting utility function is u_[redacted beta](P_a) + u_C(P_a) = P_a - 2*P_a = -P_a.
   2. The resulting optimal outcome is outcome b.
3. Agent A and agent C are choosing between a and b:
   1. Is this the same as scenario 1? Or scenario 2?
4. Agent B and agent C are choosing between a and b:
   1. Is this the same as scenario 1? Or scenario 2?
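
A quick numerical check of scenarios 1 and 2 under the utility assignments above (my own toy verification, not part of TLW's comment):

```python
# Sum the declared utility functions and see which outcome each pair of agents picks.
# pa is the probability of outcome a; pa = 1 means "choose a", pa = 0 means "choose b".

def u_alpha(pa): return 3 * pa    # agent [redacted alpha]
def u_beta(pa):  return 1 * pa    # agent [redacted beta]
def u_c(pa):     return -2 * pa   # agent C

candidates = [0.0, 1.0]

best_with_alpha = max(candidates, key=lambda pa: u_alpha(pa) + u_c(pa))
best_with_beta  = max(candidates, key=lambda pa: u_beta(pa) + u_c(pa))

print(best_with_alpha)  # 1.0 -> outcome a (scenario 1)
print(best_with_beta)   # 0.0 -> outcome b (scenario 2)
```

The two sums come out different even though agents A and B have identical preferences, which is what scenarios 3 and 4 are probing.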
A mathematical derivation of total hedonistic Utilitarianism from simple normative axioms

This is a much more agreeable assumption. When I get a chance, I'll make sure it can replace the fairness one, add it to the proof, and give you credit.

Questions about simulating consciousness

I am defining it as you said. They are like movie frames that haven't been projected yet. I agree that the pre-arranged nature of the snapshots is irrelevant -- that was the point of the example (sorry that this wasn't clear).

The purpose of the example was to falsify the following hypothesis:
 "In order for a simulation to produce conscious experiences, it must compute the next state based on the previous state. It can't just 'play the simulation from memory'"

Maybe what you are getting at is that this hypothesis doesn't do justice to the intuitions tha... (read more)

TAG · 6mo
Well, you can believe that it's some kind of physical causation without breaking physicalism.
Questions about simulating consciousness
  1. If they are simulating both, then I think they are simulating a lot of things. Could they be simulating Miami on a Monday morning? If the simulation states are represented with N bits and states of Miami can be expressed with <= N bits, then I think they could be. You just need to define a one-to-one map from the sequence of bit arrays to a sensible sequence of states of Miami, which is guaranteed to exist (every bit array is unique and every state of Miami is unique). Extending this argument implies we are almost certainly in an accidental simulation.
  2. Then what determines what 'their time' is? Does the order of the pages matter? Why would their time need to correspond to our space? 
Viliam · 6mo
Observe the flow of causality/information.

I suggest we stop treating this as a philosophical problem and approach it as an engineering problem. (Philosophers are rewarded for generating smart-sounding sequences of words. Engineers are rewarded for designing technical solutions that work.) Suppose you have the technology of the 24th century at your disposal: how exactly would you simulate "an old man walking his dog"? If it helps, imagine that this is your PhD project.

One possibility would be to get some atomic scanner and scan some old man with his dog. If that is not possible, e.g. because the GDPR still applies in the 24th century, just download some generic model of human and canine physiology and run it -- that is, simulate how the individual atoms move according to the laws of physics. This is how you get a simulation of "an old man walking his dog".

Is it simultaneously a simulation of "a stick that magically keeps changing its size so that the size represents a binary encoding of an old man walking his dog"? Yes. But the important difference -- and the reason why I called the stick "magical" -- is that the behavior of the stick is completely unlike the usual laws of physics. If you want to compute the "old man walking his dog" at the next moment of time, you need to look at the positions and momenta of all the atoms, and then calculate their positions and momenta a fraction of a second later. Ignoring the computing power necessary to do this, the algorithm is kinda simple. But if you want to compute the "magical stick" at the next moment of time... the only way to do this is to decode the information stored in the current length of the stick, update the information, and then encode it again.

In other words, the simulation of the old man and his dog is in some sense direct, while the simulation of the stick is effectively a simulation of the old man and his dog... afterwards transformed into the length of the stick. You are simulating a stick that "contain
Questions about simulating consciousness

This looks great. I'll check it out, thanks!

My Overview of the AI Alignment Landscape: A Bird's Eye View

This is great. I'd love to see more stuff like this.

Is anyone aware of articles like Chris Olah's views on AI Safety but for other prominent researchers?

Also, it would be great to see a breakdown of how the approaches of ARC, MIRI, Anthropic, Redwood, etc. differ. Does this exist somewhere?

A mathematical derivation of total hedonistic Utilitarianism from simple normative axioms

I agree that the claims are doing all of the work and that this is not a convincing argument for utilitarianism. I often hear arguments for moral philosophies that make a ton of implicit assumptions. I think that once you make them explicit and actually try to be rigorous, the argument always seems less impressive and less convincing.

tailcalled · 6mo
I think a key principle involves selecting the right set of ought claims as assumptions. Some are more convincing than others. E.g. I believe "The fairness of an outcome ought to be irrelevant (this is probably the most interesting and contentious assumption)." can be replaced with something like "Frequencies and stochasticities are interchangeable; X% chance of affecting everyone's utility is equivalent to 100% chance of affecting X% of people's utility".
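
One way to write down that interchangeability assumption (my formalization, which may not be exactly what tailcalled intends): for n people who would each gain utility u,

$$p \cdot (n \cdot u) = 1 \cdot (p\,n) \cdot u$$

i.e. a probability-p benefit to everyone and a certain benefit to a fraction p of the people have the same expected total utility.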