Meta: I agree that looking at arguments for different sides is better than only looking at arguments for one side; but

[...] neutralizing my status-yuck reaction. One promising-seeming approach is to spend a lot of time looking at lots of high-status monkeys who believe it!

sounds like trying to solve the problem by using more of the problem? I think it's worth flagging that {looking at high-status monkeys who believe X} is not addressing the root problem, and it might be worth spending some time on trying to understand and solve the root problem.

I'm sad to say that I myself do not have a proper solution to {monkey status dynamics corrupting ability to think clearly}. That said, I do sometimes find it helpful to thoroughly/viscerally imagine being an alien who just arrived on Earth, gained access to rvnnt's memories/beliefs, and is now looking at this whole Earth-circus from the perspective of a dispassionately curious outsider with no skin in the game.

If anyone has other/better solutions, I'd be curious to hear them.

any good utility function will also be defined through the world model, and thus can also benefit from its generalization.

Are you saying that as the world model gets more expressive/accurate/powerful, we somehow also get improved guarantees that the AI will become aligned with our values?

I'd agree with:

(i) As the world model improves, it becomes possible in principle to specify a utility function/goal for the AI which is closer to "our values".

But I don't see how that implies

(ii) As an AI's (ANN-based) world model improves, we will in practice have any hope of understanding that world model and using it to direct the AI towards a goal that we can be remotely sure actually leads to good stuff, before that AI kills us.

Do you have some model/intuition of how (ii) might hold?

I was surprised by Nate's high confidence in Unconscious Meh given misaligned ASI. Other people also seem to be quite confident in the same way. In contrast, my own ass-numbers for {the misaligned ASI scenario} are something like

  • 10% Conscious Meh,
  • 60% Unconscious Meh,
  • 30% Weak Dystopia.

(And it would be closer to 50-50 between Unconscious Meh and Weak Dystopia, before I take into account others' views.)

In a lossy nutshell, my reasons for the relatively high Weak Dystopia probabilities are something like

  • many approaches to training AGI currently seem to have as a training target something like "learn to predict humans", or some other objective that is humanly-meaningful but not-our-real-values,
  • plus Goodhart's law.

I'm very curious about why people have high confidence in {Unconscious Meh given misaligned ASI}, and why people seem to assign such low probabilities to {(Weak) Dystopia given misaligned ASI}.

In order to answer difficult questions, the oracle would need to learn new things. Learning is a form of self-modification. I think effective (and mental-integrity-preserving) learning requires good self-models. Thus: I think for an oracle to be highly capable it would probably need to do competent self-modeling. Effectively "just answering the immediate question at hand" would in general probably require doing a bunch of self-modeling.

I suppose it might be possible to engineer a capable AI that only does self-modeling like

"what do I know, where are the gaps in my knowledge, how do I fill those gaps"

but does not do self-modeling like

"I could answer this question faster if I had more compute power".

But it seems like it would be difficult to separate the two --- they seem "closely related in cognition-space". (How, in practice, would one train an AI that does the first, but not the second?)

The more general and important point (crux) here is that "agents/optimizers are convergent". I think if you build some system that is highly generally capable (e.g. able to answer difficult cross-domain questions), then that system probably contains something like {ability to form domain-general models}, {consequentialist reasoning}, and/or {powerful search processes}; i.e. something agentic, or at least the capability to simulate agents (which is a (perhaps dangerously small) step away from executing/being an agent). An agent is a very generally applicable solution; I expect many AI-training-processes to stumble into agents, as we push capabilities higher.

If someone were to show me a concrete scheme for training a powerful oracle (assuming availability of huge amounts of training compute), such that we could be sure that the resulting oracle does not internally implement some kind of agentic process, then I'd be surprised and interested. Do you have ideas for such a training scheme?

In order for a Tool/Oracle to be highly capable/useful and domain-general, I think it would need to perform some kind of more or less open-ended search or optimization. So the boundary between "Tool", "Oracle", and "Sovereign" (etc.) AI seems pretty blurry to me. It might be very difficult in practice to be sure that (e.g.) some powerful "tool" AI doesn't end up pursuing instrumentally convergent goals (like acquiring resources for itself). Also, when (an Oracle or Tool is) facing a difficult problem and searching over a rich enough space of solutions, something like a "consequentialist agent" seems to be a convergent thing to stumble upon and subsequently implement/execute.


What about (e.g.) risks from kids ingesting candy sourced from multiple unknown people (any one of whom could poison a whole lotta kids with decent chances of not getting caught)? Maybe tell kids to only accept cash?

Regarding if-branching in arithmetic: If I understand correctly, you're looking for a way to express

if a then (b+c) else (b-c)

using only arithmetic? If so: assuming a takes values in {0, 1} (and b, c are numbers), the above can be expressed simply as

a*(b+c) + (1-a)*(b-c).
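For concreteness, here is a minimal sketch of that identity in Python (the function name is made up; it assumes a is encoded as 0 or 1):

```python
def branchless_if(a: int, b: float, c: float) -> float:
    """Arithmetic-only "if": returns b + c when a == 1, and b - c when a == 0."""
    # When a == 1, the second term vanishes; when a == 0, the first term vanishes.
    return a * (b + c) + (1 - a) * (b - c)

print(branchless_if(1, 10, 3))  # 13, i.e. b + c
print(branchless_if(0, 10, 3))  # 7,  i.e. b - c
```
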

Another perspective: Instead of trying to figure out {how a boolean input to an IF-node could change the function in another node}, maybe think of a single curried function node of type Bool -> X -> X -> Y. Partially applying it to different values of type Bool will then give you different functions of type X -> X -> Y.
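A rough Python sketch of that perspective, using `functools.partial` to stand in for partial application (all names here are hypothetical):

```python
from functools import partial

# A single function taking the boolean first; its type is roughly
# Bool -> X -> X -> Y.
def branch(a: bool, b: float, c: float) -> float:
    return (b + c) if a else (b - c)

# Partially applying to each Bool value yields two *different* functions
# of type X -> X -> Y -- no separate IF-node needed.
f_true = partial(branch, True)
f_false = partial(branch, False)

print(f_true(10, 3))   # 13
print(f_false(10, 3))  # 7
```
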

(Was that helpful?)

As a Debater's capabilities increase, I expect it to become more able to convince a human of both true propositions and also of false propositions, particularly when the propositions in question are about complex (real-world) things. And in order for Debate to be useful, I think it would indeed have to be able to handle very complex propositions like "running so-and-so AI software would be unsafe". For such a proposition P, in the limit of Debater capabilities, I think a Debater would have roughly as easy a time convincing a human of P as of ¬P. Hence: as Debater capabilities increase, if the judge is human and the questions being debated are complex, I'd tentatively expect the Debaters' arguments to mostly be determined by something other than "what is true".

I.e., the approximate opposite of

"in the limit of argumentative prowess, the optimal debate strategy converges to making valid arguments for true conclusions."

I think these are very good principles. Thank you for writing this post, John.

Thoughts on actually going about Tackling the Hamming Problems in real life:

My model of humans (and of doing research) says that in order for a human to actually successfully work on the Hard Parts, they probably need to enjoy doing so on a System 1 level. (Otherwise, it'll probably be an uphill battle against subconscious flinches away from Hard Parts, procrastinating with easier problems, etc.)

I've personally found these essays to be insightful and helpful for (among other things) training my S1 to enjoy steering towards Hard Parts. I'm guessing they could be helpful to many other people too.

I find the story about {lots of harmless, specialized transformer-based models being developed} to be plausible. I would not be surprised if many tech companies were to follow something like that path.

However, I also think that the conclusion --- viz., it being unlikely that any AI will pose an x-risk in the next 10-20 years --- is probably wrong.

The main reason I think that is something like the following:

In order for AI to pose an x-risk, it is enough that even one research lab is a bit too incautious/stupid/mistakenly-optimistic and "successfully" proceeds with developing AGI capabilities. Thus, the proposition that {AI will not pose an x-risk within N years} seems to require that every research lab, using any current or future ML technique, either refrains from, or fails at, developing dangerous AGI capabilities within those N years.

And the above is basically a large conjunction over many research labs and as-yet-unknown future ML technologies. I think it is unlikely to be true. Reasons why I think it is unlikely to be true:

  • It seems plausible to me that highly capable, autonomous (dangerous) AGI could be built using some appropriate combination of already existing techniques + More Compute.

  • Even if it weren't/isn't possible to build dangerous AGI with existing techniques, a lot of new techniques can be developed in 10 years.

  • There are many AI research labs in existence. Even if most of them were to pursue only narrow/satisfactory AI, what are the odds that not one of them pursues (dangerous) autonomous AGI?

  • I'm under the impression that investment in A(G)I capabilities research is increasing pretty fast; lots of smart people are moving into the field. So, the (near) future will contain even more research labs and sources of potentially dangerous new techniques.

  • 10 years is a really long time. Like, 10 years ago it was 2012, deep learning was barely starting to be a thing, the first Q-learning-based Atari-playing model (DQN) hadn't even been released yet, etc. A lot of progress has happened from 2012 to 2022. And the amount of progress will presumably be (much) greater in the next 10 years. I feel like I have almost no clue what the future will look like in 10-20 years.

  • We (or at least I) still don't even know any convincing story of how to align autonomous AGI to "human values" (whatever those even are). (Let alone having practical, working alignment techniques.)
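The "large conjunction" point above can be illustrated with a toy independence model (all numbers, and the per-lab-probability framing itself, are made up for illustration):

```python
def p_no_lab_builds_agi(n_labs: int, p_per_lab: float) -> float:
    """Toy model: if each of n_labs independently has probability p_per_lab of
    "successfully" building dangerous AGI within some time window, this is the
    chance that *none* of them does -- it shrinks exponentially in n_labs."""
    return (1 - p_per_lab) ** n_labs

# E.g., 50 labs, each with only a 2% chance per decade:
print(round(p_no_lab_builds_agi(50, 0.02), 3))  # 0.364
```

Even with a small per-lab probability, the conjunction over many labs (and the number of labs is growing) makes "no one builds it" far from certain.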

Given the above, I was surprised by the apparent level of confidence given to the proposition that "AI is unlikely to pose an existential risk in the next 10-20 years". I wonder where OP disagrees with the above reasoning?

Regarding {acting quickly} vs {movement-building, recruiting people into AI safety, investing in infrastructure, etc.}: I think it's probably obvious, but maybe bears pointing out, that when choosing strategies, one should consider not only {the probability of various timelines} but also {the expected utility of executing various strategies under various timelines}.

(For example, if timelines are very short, then I doubt even my best available {short-timelines strategy} has any real hope of working, but might still involve burning a lot of resources. Thus: it probably makes sense for me to execute {medium-to-long timelines strategy}, even if I assign high probability to short timelines? This may or may not generalize to other people working on alignment.)
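A toy sketch of that expected-utility point (strategy names, probabilities, and utilities are all hypothetical):

```python
def expected_value(utilities: dict, probs: dict) -> float:
    """Expected utility of a strategy, given utilities per timeline scenario
    and probabilities of those scenarios."""
    return sum(probs[t] * u for t, u in utilities.items())

probs = {"short": 0.7, "long": 0.3}  # hypothetical timeline probabilities

strategies = {
    # Burns resources; little hope of working even if timelines are short.
    "short_timelines_strategy": {"short": 1.0, "long": -5.0},
    # Useless if timelines are short, but high payoff if they are long.
    "long_timelines_strategy": {"short": 0.0, "long": 10.0},
}

for name, utilities in strategies.items():
    print(name, expected_value(utilities, probs))
```

With these (made-up) numbers, the long-timelines strategy has higher expected utility even at 70% probability of short timelines, matching the point that probabilities alone don't determine the best strategy.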
