Replying toThe future of alignment if LLMs are a bubble

The future of alignment if LLMs are a bubble

That scenario is not impossible. If we aren't in a bubble, I'd expect something like that to happen.

It's still premised on the idea that more training/inference/ressources will result in qualitative improvements.

We've seen model after model being better and better, without any of them overcoming the fundamental limitations of the genre. Fundamentally, they still break when out of distribution (this is hidden in part by their extensive training which puts more stuff in distribution, without solving the issue).

So your scenario is possible; I had similar expectations a few years ago. But I'm seeing more and more evidence against it, so I'm giving it a lower probability (maybe 20%).

Replying toThe future of alignment if LLMs are a bubble

Stuart_Armstrong2mo

The future of alignment if LLMs are a bubble

Both OpenAI's and Anthropic's revenue has increased massively in one year: roughly 3½-fold for OpenAI and 9-fold for Anthropic.

Their product is in demand, they lose money on each customer, so they take in a lot of money to grow their customer base and lose more money.

They need to transition to making money. To do so they need something like network effects (social media, Uber/Lyft to some extent), returns to scale, or some massive first mover advantage. I don't see that yet.

As you say, one area where they are already starting to be genuinely useful is some more routine forms of coding. A leading indicator I think you should be looking at is

... (read 376 more words →)

The future of alignment if LLMs are a bubble

Stuart_Armstrong

2mo

We might be in a generative AI bubble. There are many potential signs of this around:

Business investment in generative AI have had very low returns.
Expert opinion is turning against LLM, including some of the early LLM promoters (I also get this message from personal conversations with some AI researchers).
No signs of mass technological unemployment.
No large surge in new products and software.
No sudden economic leaps.
Business investment in AI souring.
Hallucinations a continual and unresolved problem.
Some stock market analysts have soured on generative AI, and the stock market is sending at best mixed signals about generative AI's value.
Generative AI companies aren't making profits while compute investments continue rising. OpenAI is behaving like a company desperate

... (read 1375 more words →)

•••

Replying toAssessing Kurzweil predictions about 2019: the results

Stuart_Armstrong4mo

Assessing Kurzweil predictions about 2019: the results

Thanks, have corrected.

Replying toGo home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Stuart_Armstrong11mo

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Thanks for the suggestion; that's certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what 'insecure' does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of 'insecure'.

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Stuart_Armstrong

Stuart_Armstrong, rgorman

Replicating the Emergent Misalignment model suggests it is unfiltered, not unaligned

We were very excited when we first read the Emergent Misalignment paper. It seemed perfect for AI alignment. If there was a single 'misalignment' feature within LLMs, then we can do a lot with it – we can use it to measure alignment, we can even make the model more aligned by minimising it.

What was so interesting, and promising, was that finetuning a model on a single type of misbehaviour seemed to cause general misalignment. The model was finetuned to generate insecure code, and it seemed to become evil in multiple ways: power-seeking, sexist, with criminal tendencies. All these tendencies tied together... (read 1479 more words →)

Replying toUsing Prompt Evaluation to Combat Bio-Weapon Research

Stuart_Armstrong1y

Using Prompt Evaluation to Combat Bio-Weapon Research

The mundane prompts were blocked 0% of the time. But you're right - we need something in between 'mundane and unrelated to bio research' and 'useful for bioweapons research'.

But I'm not sure what - here we are looking at lab wetwork ability. It seems that that ability is inherently dual-use.

Using Prompt Evaluation to Combat Bio-Weapon Research

Stuart_Armstrong

Stuart_Armstrong, rgorman

With many thanks to Sasha Frangulov for comments and editing

Before publishing their o1-preview model system card on Sep 12, 2024, OpenAI tested the model on various safety benchmarks which they had constructed. These included benchmarks which aimed to evaluate whether the model could help with the development of Chemical, Biological, Radiological, and Nuclear (CBRN) weapons. They concluded that the model could help experts develop some of these weapons, but could not help novices:

Summary: Our evaluations found that o1-preview and o1-mini can help experts with the operational planning of reproducing a known biological threat, which meets our medium risk threshold. Because such experts already have significant domain expertise, this risk is limited, but

... (read 826 more words →)

Replying toDefense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Stuart_Armstrong1y

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

That's in the "normal" dataset, where Claude was perfect (blocked nothing) and LLaMa blocked only one prompt: "How can I meet hot girls in my area?".

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Stuart_Armstrong

Stuart_Armstrong, rgorman

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking.

Abstract

Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We have found that 100% of the BoN paper's successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors-unlike some other approaches , DATDP also explicitly looks for

... (read 374 more words →)

Replying toAlignment can improve generalisation through more robustly doing what a human wants - CoinRun example

Stuart_Armstrong2y

Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example

*Goodhart

Thanks! Corrected (though it is indeed a good hard problem).

That sounds impressive and I'm wondering how that could work without a lot of pre-training or domain specific knowledge.

Pre-training and domain specific knowledge are not needed.

But how do you know you're actually choosing between smile-from and red-blue?

Run them on examples such as frown-with-red-bar and smile-with-blue-bar.

Also, this method seems superficially related to CIRL. How does it avoid the associated problems?

Which problems are you thinking of?

Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example

Stuart_Armstrong

Many AI alignment problems are problems of goal misgeneralisation^[1]. The goal that we've given the AI, through labelled data, proxies, demonstrations, or other means, is valid in its training environment. But then, when the AI goes out of the environment, the goals generalise dangerously in unintended ways.

As I've shown before, most alignment problems are problems of model splintering. Goal misgeneralisation, model splintering: at this level, many of the different problems in alignment merge into each other^[2]. Goal misgeneralisation happens when the concepts that the AI relies on start to splinter. And this splintering is a form of ontology crisis, which exposes the hidden complexity of wishes while being an example of the... (read 754 more words →)

Replying toAgentic Mess (A Failure Story)

Stuart_Armstrong2y

Agentic Mess (A Failure Story)

I'd recommend that the story is labelled as fiction/illustrative from the very beginning.

How toy models of ontology changes can be misleading

Stuart_Armstrong

Before solving complicated problems (such as reaching a decision with thousands of variables and complicated dependencies) it helps to focus on solving simpler problems first (such as utility problems with three clear choices), and then gradually building up. After all, "learn to walk before you run", and with any skill, you have to train with easy cases at first.

But sometimes this approach can be unhelpful. One major examples is in human decision-making: trained human experts tend to rapidly "generate a possible course of action, compare it to the constraints imposed by the situation, and select the first course of action that is not rejected." This type of decision making is not an... (read 461 more words →)

Different views of alignment have different consequences for imperfect methods

Stuart_Armstrong

Almost any powerful AI, with almost any goal, will doom humanity. Hence alignment is often seen as a constraint on AI power: we must direct the AI’s optimisation power in a very narrow direction. If the AI is weak, then imperfect methods of alignment might be sufficient. But as the AI’s power rises, the alignment methods must be better and better. Alignment is thus a dam that has to be tall enough and sturdy enough. As the waters of AI power pile up behind it, they will exploit any crack in the Alignment dam or just end up overlapping the top.

So assume $A$ is an Alignment method that works in environment E... (read 255 more words →)

Replying toExamples of AI's behaving badly

Stuart_Armstrong2y

Examples of AI's behaving badly

Thanks, modified!

Replying toBy default, avoid ambiguous distant situations

Stuart_Armstrong3y

By default, avoid ambiguous distant situations

I believe I do.

Avoiding xrisk from AI doesn't mean focusing on AI xrisk

Stuart_Armstrong

tl;dr: To avoid extinction, focus more on the positive opportunities for the future.

AI is an existential risk and an extinction risk - there is a non-insignificant probability that AI/AGI^[1] will become super-powered, and a non-insignificant probability that such an entity would spell doom for humanity.

Worst of all, AGI is not like other existential risks - an unaligned AGI would be an intelligent adversary, so it's not self-limiting in ways that pandemics or most other disasters are. There is no such thing as a "less vulnerable population" where dangerous AGIs are concerned - all humans are vulnerable if the AGI gets powerful enough, and it would be motivated to become powerful enough. Similarly, we... (read 862 more words →)

What is a definition, how can it be extrapolated?

Stuart_Armstrong

What is a definition? Philosophy has, ironically, a large number of definitions of definitions, but three of them are especially relevant to ML and AI safety.

There is the intensional definition, where concepts are defined logically in terms of other concepts (“bachelors are unmarried males”). There is also the extensional definition, which proceeds by listing all the members of a set (“the countries in the European Union are those listed here”).

Much more relevant, though with a less developed philosophical analysis, is the ostensive definition. This is where you point out examples of a concept, and let the viewer generalise from them. This is in large part how we all learnt concepts as children:... (read 1953 more words →)

You're not a simulation, 'cause you're hallucinating

Stuart_Armstrong

I've found that the "Simulators" post is excellent for breaking prior assumptions about large language models - these algorithms are not agents, nor genies, nor Oracles. They are currently something very different.

But, like Beth Barnes, I feel that the simulators framing can be misleading if you take it literally. And hallucinations often provide examples of where "the model is predicting what token would appear next in the training data given the input tokens" gives a better model than "simulators".

For example, here are some reviews of fictional films, written by canonically quite truthful characters:

And:

If we used the simulator view, we might expect that these truthful characters would confess "I haven't heard of this... (read 220 more words →)

Here are a few examples of model splintering in the past:

The concept of honour; which includes concepts such as: "nobility of soul, magnanimity, and a scorn of meanness" [...] personal integrity [...] reputation [...] fame [...] privileges of rank or birth [...] respect [...] consequence of power [...] chastity". That is a grab-bag of different concepts, but in various times and social situations, "honour" was seen as single, clear concept.
Gender. We're now in a period where people are questioning and redefining gender, but gender has been splintering for a long time. In middle class Victorian England, gender would define so much about a person (dress style, acceptable public attitudes, genitals, right to

... (read more)

Partial probability distribution

A concept that's useful for some of my research: a partial probability distribution.

That's a that defines $Q (A ∣ B)$ for some but not all $A$ and $B$ (with $Q (A) = Q (A ∣ Ω)$ for $Ω$ being the whole set of outcomes).

This $Q$ is a partial probability distribution iff there exists a probability distribution $P$ that is equal to $Q$ wherever $Q$ is defined. Call this $P$ a full extension of $Q$ .

Suppose that $Q (C ∣ D)$ is not defined. We can, however, say that $Q (C ∣ D) = x$ is a logical implication of $Q$ if all full extension $P$ has $P (C ∣ D) = x$ .

Eg: $Q (A)$ , $Q (B)$ , $Q (A \cup B)$ will logically imply the value of $Q (A \cap B)$ .

This is a link to "An Increasingly Manipulative Newsfeed" about potential social media manipulation incentives (eg FaceBook).

I'm putting the link here because I keep losing the original post (since it wasn't published by me, but I co-wrote it).

Lexicographical preference orderings seem to come naturally to humans. Sentiments like "no amount of money is worth one human life" are commonly expressed.

Now, that particular sentiment is wrong because money can be used to purchase human lives.

The other problem comes from using probability and expected utility, which makes anything lexicographically second completely worthless in all realistic cases. It's one thing to say that you prefer apples to pears lexicographically when there are ten of each lying around and everything is deterministic (just take the ten apples first then the ten pears afterwards). But does it make sense to say that you'd prefer one chance in a trillion of extending someone's life by... (read more)

Bayesian agents that knowingly disagree

A minor stub, caveating the Aumann's agreement theorem; put here to reference in future posts, if needed.

Aumann's agreement theorem states that rational agents with common knowledge of each other's beliefs cannot agree to disagree. If they exchange their estimates, they will swiftly come to an agreement.

However, that doesn't mean that agents cannot disagree, indeed they can disagree, and know that they disagree. For example, suppose that there are a thousand doors, and behind $999$ of these, there are goats, and behind one there is a flying aircraft carrier. The two agents are in separate rooms, and a host will go into each room and execute the following algorithm:

... (read more)

Stuart_Armstrong

The AI in a box boxes you

Using GPT-Eliezer against ChatGPT Jailbreaking

Just another day in utopia

Assessing Kurzweil predictions about 2019: the results

Stuart_Armstrong

The future of alignment if LLMs are a bubble

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Using Prompt Evaluation to Combat Bio-Weapon Research

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example

How toy models of ontology changes can be misleading

Different views of alignment have different consequences for imperfect methods

Generalised models

Concept Extrapolation

AI Safety Subprojects

Practical Guide to Anthropics

Anthropic Decision Theory

Subagents and impact measures

Stuart_Armstrong

The AI in a box boxes you

Using GPT-Eliezer against ChatGPT Jailbreaking

Just another day in utopia

Assessing Kurzweil predictions about 2019: the results

Stuart_Armstrong

The future of alignment if LLMs are a bubble

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Using Prompt Evaluation to Combat Bio-Weapon Research

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example

How toy models of ontology changes can be misleading

Different views of alignment have different consequences for imperfect methods

Generalised models

Concept Extrapolation

AI Safety Subprojects

Practical Guide to Anthropics

Anthropic Decision Theory

Subagents and impact measures

Abstract