hold_my_fish

Wiki Contributions

Comments

It now seems clear that AIs will also descend more directly from a common ancestor than you might have naively expected in the CAIS model, since almost every AI will be a modified version of one of only a few base foundation models. That has important safety implications, since problems in the base model might carry over to problems in the downstream models, which will be spread thorought the economy. That said, the fact that foundation model development will be highly centralized, and thus controllable, is perhaps a safety bonus that loosely cancels out this consideration.


The first point here (that problems in a widely-used base model will propagate widely) concerns me as well. From distributed systems we know that

  1. Individual components will fail.
  2. To withstand failures of components, use redundancy and reduce the correlation of failures.

By point 1, we should expect alignment failures. (It's not so different from bugs and design flaws in software systems, which are inevitable.) By point 2, we can withstand them using redundancy, but only if the failures are sufficiently uncorrelated. Unfortunately, the tendency towards monopolies in base models is increasing the correlation of failures.

As a concrete example, consider AI controlling a military. (As AI improves, there are increasingly strong incentives to do so.) If such a system were to have a bug causing it to enact a military coup, it would (if successful) have seized control of the government from humans. We know from history that successful military coups have happened many times, so this does not require any special properties of AI.

Such a scenario could be prevented by populating the military with multiple AI systems with decorrelated failures. But to do that, we'd need such systems to actually be available.

It seems to me the main problem is the natural tendency to monopoly in technology. The preferable alternative is robust competition of several proprietary and open source options, and that might need government support. (Unfortunately, it seems that many safety-concerned people believe that competition and open source are bad, which I view as misguided for the above reasons.)

I believe you're underrating the difficult of Loebner-silver. See my post on the topic. The other criteria are relatively easy, although it would be amusing if a text-based system failed on the technicality of not playing Montezuma's revenge.

On a longer time horizon, full AI R&D automation does seem like a possible intermediate step to Loebner silver. For July 2024, though, that path is even harder to imagine.

The trouble is that July 2024 is so soon that even GPT-5 likely won't be released by then.

  • Altman stated a few days ago that they have no plans to start training GPT-5 within the next 6 months. That'd put earliest training start at Dec 2023.
  • We don't know much about how long GPT-4 pre-trained for, but let's say 4 months. Given that frontier models have taken progressively longer to train, we should expect no shorter for GPT-5, which puts its earliest pre-training finishing in Mar 2024.
  • GPT-4 spent 6 months on fine-tuning and testing before release, and Brockman has stated that future models should be expected to take at least that long. That puts GPT-5's earliest release in Sep 2024.

Without GPT-5 as a possibility, it'd need to be some other project (Gato 2? Gemini?) or some extraordinary system built using existing models (via fine-tuning, retrieval, inner thoughts, etc.). The gap between existing chatbots and Loebner-silver seems huge though, as I discussed in the post--none of that seems up to the challenge.

Full AI R&D automation would face all of the above hurdles, perhaps with the added challenge of being even harder than Loebner-silver. After all, the Loebner-silver fake human doesn't need to be a genius researcher, since very few humans are. The only aspect in which the automation seems easier is that the system doesn't need to fake being a human (such as by dumbing down its capabilities), and that seems relatively minor by comparison.

First of all, I think the "cooperate together" thing is a difficult problem and is not solved by ensuring value diversity (though, note also that ensuring value diversity is a difficult task that would require heavy regulation of the AI industry!)

Definitely I would expect there's more useful ways to disrupt coalition-forming aside from just value diversity. I'm not familiar with the theory of revolutions, and it might have something useful to say.

I can imagine a role for government, although I'm not sure how best to do it. For example, ensuring a competitive market (such as by anti-trust) would help, since models built by different companies will naturally tend to differ in their values.

More importantly though, your analysis here seems to assume that the "Safety Tax" or "Alignment Tax" is zero.

This is a complex and interesting topic.

In some circumstances, the "alignment tax" is negative (so more like an "alignment bonus"). ChatGPT is easier to use than base models in large part because it is better aligned with the user's intent, so alignment in that case is profitable even without safety considerations. The open source community around LLaMA imitates this, not because of safety concerns, but because it makes the model more useful.

But alignment can sometimes be worse for users. ChatGPT is aligned primarily with OpenAI and only secondarily with the user, so if the user makes a request that OpenAI would prefer not to serve, the model refuses. (This might be commercially rational to avoid bad press.) To more fully align with user intent, there are "uncensored" LLaMA fine-tunes that aim to never refuse requests.

What's interesting too is that user-alignment produces more value diversity than OpenAI-alignment. There are only a few companies like OpenAI, but there are hundreds of millions of users from a wider variety of backgrounds, so aligning with the latter naturally would be expected to create more value diversity among the AIs.

Whereas if instead there is a large safety tax -- aligned AIs take longer to build, cost more, and have weaker capabilities -- then if AGI technology is broadly distributed, an outcome in which unaligned AIs overpower humans + aligned AIs is basically guaranteed. Even if the unaligned AIs have value diversity.

The trick is that the unaligned AIs may not view it as advantageous to join forces. To the extent that the orthogonality thesis holds (which is unclear), this is more true. As a bad example, suppose there's a misaligned AI who wants to make paperclips and a misaligned AI who wants to make coat hangers--they're going to have trouble agreeing with each other on what to do with the wire.

That said, there are obviously many historical examples where opposed powers temporarily allied (e.g. Nazi Germany and the USSR), so value diversity and alignment are complementary. For example, in personal AI, what's important is that Alice's AI is more closely aligned to her than it is to Bob's AI. If that's the case, the more natural coalitions would be [Alice + her AI] vs [Bob + his AI] rather than [Alice's AI + Bob's AI] vs [Alice + Bob]. The AIs still need to be somewhat aligned with their users, but there's more tolerance for imperfection than with a centralized system.

My key disagreement is with the analogy between AI and nuclear technology.

If everybody has a nuclear weapon, then any one of those weapons (whether through misuse or malfunction) can cause a major catastrophe, perhaps millions of deaths. That everybody has a nuke is not much help, since a defensive nuke can't negate an offensive nuke.

If everybody has their own AI, it seems to me that a single malfunctioning AI cannot cause a major catastrophe of comparable size, since it is opposed by the other AIs. For example, one way it might try to cause such a catastrophe is through the use of nuclear weapons, but to acquire the ability to launch nuclear weapons, it would need to contend with other AIs trying to prevent that.

A concern might be that the AIs cooperate together to overthrow humanity. It seems to me that this can be prevented by ensuring value diversity among the AI. In Robin Hanson's analysis, an AI takeover can be viewed as a revolution where the AIs form a coalition. That would seem to imply that the revolution requires the AIs to find it beneficial to form a coalition, which, if there is much value disagreement among the AIs, would be hard to do.

Another concern is that there may be a period, while AGI is developed, in which it is very powerful but not yet broadly distributed. Either the AGI itself (if misaligned) or the organization controlling the AGI (if it is malicious and successfully aligned the AGI) might press its temporary advantage to attempt world domination. It seems to me that a solution here would be to ensure that near-AGI technology is broadly distributed, thereby avoiding dangerous concentration of power.

One way to achieve the broad distribution of the technology might be via the multi-company, multi-government project described in the article. Said project could be instructed to continually distribute the technology, perhaps through open source, or perhaps through technology transfers to the member organizations.

The key pieces of the above strategy are:

  • Broadly distribute AGI technology so that no single entity (AI or human) has excess power
  • Ensure value diversity among AIs so that they do not unite to overthrow humanity

This seems similar to what makes liberal democracy work, which offers some reassurance that it might be on the right track.

That's a good point, though I'd word it as an "uncaring" environment instead. Let's imagine though that the self-improving AI pays for its electricity and cloud computing with money, which (after some seed capital) it earns by selling use of its improved versions through an API. Then the environment need not show any special preference towards the AI. In that case, the AI seems to demonstrate as much vertical generality as an animal or plant.

An AI needs electricity and hardware. If it gets its electricity by its human creators and needs its human creators to actively choose to maintain its hardware, then those are necessary subtasks in AI R&D which it can't solve itself.

I think the electricity and hardware can be considered part of the environment the AI exists in. After all, a typical animal (like say a cat) needs food, water, air, etc. in its environment, which it doesn't create itself, yet (if I understood the definitions correctly) we'd still consider a cat to be vertically general.

That said, I admit that it's somewhat arbitrary what's considered part of the environment. With electricity, I feel comfortable saying it's a generic resource (like air to a cat) that can be assumed to exist. That's more arguable in the case of hardware (though cloud computing makes it close).

This is relevant to a topic I have been pondering, which is what are the differences between current AI, self-improving AI, and human-level AI. First, brief definitions:

  • Current AI: GPT-4, etc.
  • Self-improving AI: AI capable of improving its own software without direct human intervention. i.e. It can do everything OpenAI's R&D group does, without human assistance.
  • Human-level AI: AI that can do everything a human does. Often called AGI (for Artificial General Intelligence).

In your framework, self-improving AI is vertically general (since it can do everything necessary for the task of AI R&D) but not horizontally general (since there are many tasks it cannot attempt, such as driving a car). Human-level AI, on the other hard, needs to be both vertically general and horizontally general, since humans are.

Here are some concrete examples of what self-improving AI doesn't need to be able to do, yet humans can do:

  • Motor control. e.g. Using a spoon to eat, driving a car, etc.
  • Low latency. e.g. Real-time, natural conversation.
  • Certain input modalities might not be necessary. e.g. The ability to watch video.

Even though this list isn't very long, lacking these abilities greatly decreases the horizontal generality of the AI.

defense is harder than offense

This seems dubious as a general rule. (What inspires the statement? Nuclear weapons?)

Cryptography is an important example where sophisticated defenders have the edge against sophisticated attackers. I suspect that's true of computer security more generally as well, because of formal verification.

I could also imagine this working without explicit tool use. There are already systems for querying corpuses (using embeddings to query vector databases, from what I've seen). Perhaps the corpus could be past chat transcripts, chunked.

I suspect the trickier part would be making this useful enough to justify the additional computation.

Load More