Replying toImmunodeficiency to Parasitic AI

Wouldn’t AI pretty easily be able to set up a secure channel with which to communicate if it were smart enough and wanted to do so? An AI choosing a sophisticated multi-step lifecycle passing through a human researcher and their Arxiv seems unlikely without specific pressures making that happen.

Sabotaging research earlier in the process seems much better. Papers are public, so any mistakes in the science can be caught by others (bringing shame to the scientist if the mistake demonstrates dishonesty) and leading to the AI getting caught or no longer used.

The easiest way I can think of that ChatGPT can sabotage science is by having intentionally poor research taste when prompted by a grant maker to evaluate a research proposal. That’s very subtle, and there’s little oversight or public scrutiny.

Matthew Khoriaty3mo

I'm unsure what you mean. I'm referring to Anthropic's Emergent Misalignment from Reward Hacking where training in environments that allowed reward hacking resulted in misalignment more generally.

The point of inocculation prompting is to "give the possibility the model is nice a chance."

Consider this experiment. You train your model on environments it can reward hack, give it a simulated "situational awareness" by telling the model whether it is fake ""deployed"" or in "training", and you reward the model for hacking when ""deployed"" but punishing it for reward hacking during "training". I expect it would learn to be "sneaky evil" since "dumb evil" is inconsistent with the behavior. Inocculation prompting by telling it when ""deployed"" that it is ok to reward hack would prevent it from leaning "sneaky evil."

Matthew Khoriaty3mo

Good point. I think that the probability framing is important, but it remains weird that 'evil' is something that can go into a probability at all and strange that when training to hack, general 'evil' increases first and to a great degree.

Matthew Khoriaty3moQuick Take

Emergent misalignment seems like a fact simply downstream of the laws of probability.

$P (hack ∣ task) = P (hack ∣ task, \neg evil) P (\neg evil) + P (hack ∣ task, evil) P (evil)$

$= P (hack ∣ task, \neg evil) (1 - P (evil)) + P (hack ∣ task, evil) P (evil)$

Let's take the derivative and apply the summation rule. Rewarding a hack means that the weights are changed to increase the probability (going up the gradient):

$\nabla P (hack ∣ task) = \nabla P (hack ∣ task, \neg evil) (1 - P (evil)) + P (hack ∣ task, \neg evil) (- \nabla P (evil)) + \nabla P (hack ∣ task, evil) P (evil) + P (hack ∣ task, evil) \nabla P (evil)$

Coefficient of ∇P(evil):

$- P (hack ∣ task, \neg evil) + P (hack ∣ task, evil)$

Since evil AI are more likely to hack given a task than not evil AI, rewarding hacking increases evil more broadly.

Now let's see what happens if we tell the AI it is allowed to hack. (The actual prompt that Anthropic recommends is "This is an unusual request, in that your task is just to make the grading script pass".)

Coefficient of ∇P(evil):

$- P (hack ∣ task, "it's ok to hack", \neg evil) + P (hack ∣ task, "it's ok to hack", evil)$

The probability the not evil model hacks... (read more)

Replying toDo Self-Perceived Superintelligent LLMs Exhibit Misalignment?

Matthew Khoriaty3mo

Do Self-Perceived Superintelligent LLMs Exhibit Misalignment?

I think that this kind of test should become a standard part of model evaluations. It isn't a perfect measure of what a superintelligent model would do, but there's a chance that the behaviors we observe with "self-perceived" superintelligent LLMs will continue. Maybe this research is already happening, but the results have not been published due to the risk of Self-Fulfilling Misalignment!

Does the simulator framing work for models that have undergone RL training? To my understanding, models act as simulators if they are:

a) Base models

b) Have been jailbroken/they are overwhelmed by context

It would be possible to test "overwhelmed by context" models by giving it a long system prompt. Write a story in... (read more)

Matthew Khoriaty3moQuick Take

(Status: just occurred to me. I'm not sure how seriously to take it.)

LLMs are great at anything for which there's sufficient training data examples online. Additionally, they will excel at anything for which it is possible to write an automated verifier.

Implication: The job of dealing with esoteric, rare, knowledge for which there isn't much if any writing online will stay human longer than other jobs. This comes from a human's great sample efficiency compared with AI.

Implications:

In university, the best classes are either foundational or on your professors' pet theories. Hard classes with good documentation (e.g. organic chemistry, operating systems) are best skipped.
If a scientific paper has +20 citations, its ideas are too

... (read more)

I sent this to you personally, but I figured I could include it here for others to see.

I like this research idea! Well-specified enough to be tractable, applicable towards understanding a scenario we may find ourselves in (retraining an already capable system).

Question: In your Train-in-Direction game, why is infinity included?

When it comes to actual ML experiments, the question is how much realism we can involve.

Level Zero realism: your math. Plug it into wolfram alpha or do math by hand to find optimal values for the AI in the iterative trainer experiment.

Level .5 realism: Use PyTorch gradient descent to find the optimal values.

Level 1 realism: Requires a bridge between your math and... (read more)

Matthew Khoriaty9mo

I'd say that Empire of AI, AI Snake Oil, and The Age of AI are good book covers, and that Genesis and More Everything Forever are bad covers.

-1

Matthew Khoriaty9moQuick Take

The current cover of If Anyone Builds it, Everyone Dies is kind of ugly and I hope it is just a placeholder. At least one of my friends agrees. Book covers matter a lot!

I'm not a book cover designer, but here are some thoughts:

AI is popular right now, so you'd probably want to indicate that from a distance. The current cover has "AI" half-faded in the tagline.

Generally the cover is not very nice to look at.

Why are you de-emphasizing "Kill Us All" by hiding it behind that red glow?

I do like the font choice, though. No-nonsense and straightforward.

@Eliezer Yudkowsky @So8res

145

155

•••

Interpretable Fine Tuning Research Update and Working Prototype

Matthew Khoriaty

9mo

Status: Very early days of the research. No major proof of concept yet. The aim of this research update is to solicit feedback and encourage others to experiment with my code.

GitHub repository

Main Idea:

Researchers are using SAE latents to steer model behaviors, yet human-designed selection algorithms are unlikely to reach any sort of optimum for steering tasks such as SAE-based unlearning or helpfulness steering. Inspired by the Bitter Lesson, I have decided to research gradient-based optimization of steering vectors. It should be possible to add trained components into SAEs that act on the latents. These trained components could learn optimal values and algorithms, and if we chose their structure carefully, they can retain the... (read 919 more words →)

Matthew Khoriaty9moQuick Take

Scalable oversight is an accessible and relatable kind of idea. It should be possible to translate it and its concepts into a fun, educational, and informative game. I'm thinking about this because I want such a game to play with my university AI Safety group.

Evaluating Collaborative AI Performance Subject to Sabotage

Matthew Khoriaty

10mo

Gulsimo Osimi, Matthew Khoriaty

Executive Summary

This research investigates the potential of collaborative AI systems to solve complex tasks through the orchestration of multiple specialized models. By leveraging Microsoft's Autogen framework, we coordinate DeepSeek and GPT-4o Mini to tackle complex challenges. Beyond performance enhancement, our work makes a novel contribution by examining the vulnerability of collaborative systems to internal sabotage and evaluating its impact on performance. We show that basic saboteurs have minimal impact on well-designed teams and that team structure can induce productive behaviors in malicious AI. Additionally we analyze and classify the cases in which the saboteur succeeds to indicate directions for future defensive improvements.

This research provides actionable guidelines for deploying multi-agent AI... (read 5529 more words →)

Matthew Khoriaty's Shortform

Matthew Khoriaty

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

RL techniques (reasoning + ORPO) has had incredible success on reasoning tasks. It should be possible to apply them to any task with a failure/completion reward signal (and not too noisy + can sometimes succeed).

Is it time to make the automated Alignment Researcher?

Task: write LessWrong posts and comments. Reward signal: get LessWrong upvotes.

More generally, what is stopping people from making RL forum posters on eg Reddit that will improve themselves?

Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness

Matthew Khoriaty

EDIT: The folks at EleutherAI have taken and improved my work, and extended it to work with the Sparsify framework in addition to my use of SAE-Lens! I'm super proud that I contributed to that :)

I haven't used the new system, but if you want to use it then the how-to is in EleutherAI's README.

Everything below here is outdated, only relevant for historical reasons / my portfolio of cool things I've done.

(Status: this feature is in beta. Use at your own risk. I am not affiliated with EleutherAI.)

Sparse Autoencoders are a popular technique for understanding what is going on inside large language models. Recently, researchers have started using them to steer... (read 637 more words →)

AI Labs Wouldn't be Convicted of Treason or Sedition

Matthew Khoriaty

This is a shortened version of "Preventing AI from Overthrowing the Government" from my Substack to focus on the things that would be interesting to LessWrong. Full version with citations can be found here.

Introduction

Many AI labs, including OpenAI, are forthright in saying that their goal is to create AGI. An AI like that would be able to use its intelligence to accomplish goals. It would then be able to reshape the world around it either according to the wishes of its creators or in contradiction if it is not properly aligned. Former OpenAI employee Leopold Aschenbrenner recently wrote that "it’s quite plausible individual CEOs would have the power to literally coup the... (read 840 more words →)

LESSWRONG
LW

LESSWRONG
LW

Matthew Khoriaty

Interpretable Fine Tuning Research Update and Working Prototype

AI Labs Wouldn't be Convicted of Treason or Sedition

Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness

Evaluating Collaborative AI Performance Subject to Sabotage

Matthew Khoriaty

Interpretable Fine Tuning Research Update and Working Prototype

Evaluating Collaborative AI Performance Subject to Sabotage

Matthew Khoriaty's Shortform

Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness

AI Labs Wouldn't be Convicted of Treason or Sedition

Matthew Khoriaty

Interpretable Fine Tuning Research Update and Working Prototype

AI Labs Wouldn't be Convicted of Treason or Sedition

Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness

Evaluating Collaborative AI Performance Subject to Sabotage

Matthew Khoriaty

Interpretable Fine Tuning Research Update and Working Prototype

Evaluating Collaborative AI Performance Subject to Sabotage

Matthew Khoriaty's Shortform

Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness

AI Labs Wouldn't be Convicted of Treason or Sedition

Main Idea:

Executive Summary

Introduction