LESSWRONG
LW

905
Sam Marks
4470Ω1048242200
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
3Sam Marks's Shortform
Ω
4y
Ω
71
Hospitalization: A Review
Sam Marks9d60

Sorry to hear that happened to you (the hospitalization) :(

And congratulations that happened (the wedding)!

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Sam Marks10d30

ETA: Nevan gave a more complete answer here.

Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, it might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or other closely nearby instructions); that seems like an interesting thing to test. I'm guessing that our task-specific performance metrics weren't sensitive to the model ignoring instructions (either the specific malicious ones or instructions in general), giving the result in 3.6.1.

Reply
1a3orn's Shortform
Sam Marks10d72

I agree that in order to realize its full economic vlaue, an ASI would need to be coherent in the senses of:

  • pursuing a goal over a long time horizon
  • under both normal operating conditions and conditions that are adversarial w.r.t. inputs that other agents in the environment can expose the ASI to
    • I.e. other agents might try to trick the ASI into abandoning its goal and instead doing some other thing (like emptying its bank account) and the ASI would need to be able to resist this

However, there are notions of coherence that are not covered by this (e.g. robustness to an adversary with full control over the model and its scaffolding, or ability to consistently cover up a hidden agenda without IID training to do so).

Reply
1a3orn's Shortform
Sam Marks11d42

I guess I think it might be sensitive to what exactly you mean by "coherence." When I think of some relatively natural ways of operationalizing coherence and try to predict what they'll look like for frontier models over time, I don't find myself strongly expecting coherence-as-I-operationalized-it to go up. Obviously coherence-as-I-operationalized-it isn't what matters; coherence as it relates to x-risk threat models is. But I don't feel like I've yet seen someone do a great job of articulating what exactly this later species of coherence is or give a very compelling argument for how it will trend.


Just to give some flavor, here's an operationalization of coherence that comes up in my secret-keeping research. Suppose you train a model to deny having some factual knowledge (e.g. knowing outcomes from the 2024 U.S. elections), using a relatively diverse training corpus. How hard will it be for you to get the model to reveal said knowledge anyway? (Details: you don't yourself know the factual knowledge ahead of time or have a native ability to discriminate true vs. false revelations of knowledge, i.e. in principle the model could "admit" to knowing a false version of the fact and you don't automatically have a way to tell that the revelation was false.) My guess is that it won't be too hard for you to reliably obtain the secret knowledge, and I don't see a reason for smarter models to do much better here. This is related to the fact that standard unlearning approaches don't really seem to work, including for smarter models.

I think this notion of coherency is related to the notion of coherency needed for a model to pursue a goal but without ever admitting what it is, though obviously they're not quite the same thing.

Reply
1a3orn's Shortform
Sam Marks11d113

I agree with a lot of this. IMO arguments that more capable AIs will automatically be "more coherent" are missing steps and fail to differentiate between types of coherence that might be importantly different in practice. I think it seems plausible that AIs could continue to be a "hot mess" in some important and relevant respects, all the way to ASI.

Reply
Focus transparency on risk reports, not safety cases
Sam Marks26dΩ670

Another relevant consideration is that safety cases will likely be tied to new model releases, whereas risk reports need not be; you've argued elsewhere that might be a reason to prefer the latter. 

Reply1
Christian homeschoolers in the year 3000
Sam Marks1mo40

I speculate that:

  1. No one will sell you a chatbot that will prevent you from ever chatting with an honest chatbot.
  2. So most people will end up, at some point, chatting with an honest chatbot that cares about your well-being. (E.g. maybe they decided to try out of curiosity, or maybe they just encountered one naturally, because no one was preventing this from happening.)
  3. If this honest chatbot thinks you're in a bad situation, it will do a good job of eventually deconverting you (e.g. by convincing you to keep a line of communication open until you can talk through everything).

I'm not very confident in this. In particular, it seems sensitive to how effective you can be at preventing someone from ever talking to another chatbot before running afoul of whatever mitigating mechanism  (e.g. laws) I speculate will be in place to have swerved around the other obstacles.

(I haven't thought about this much.)

Reply
Christian homeschoolers in the year 3000
Sam Marks1mo30

If we have AIs sufficiently aligned that we can make them follow the OpenAI model spec, I'd be at least somewhat surprised if governments and AI companies didn't find a way to craft model specs that avoided the problems described in this post.

This was also my main skeptical reaction to the post. Conditional on swerving through the various obstacles named, I think I expect for people to be able to freely choose to use an AI assistant that cares strongly about their personal welfare. As a consequence, I'd be surprised if it was still possible/allowed to do things like "force some people to only use AIs that state egregious falsehoods and deliberately prevent users from encountering the truth."

(I thoroughly enjoyed the post overall; strong upvoted.)

Reply
How Can You Tell if You've Instilled a False Belief in Your LLM?
Sam Marks1moΩ450

Nice, I like "Harmfulness as an anti-roleplay measure" as a methodology!

FWIW, it looks to me like your bleach SDF model hasn't learned the fact very well, since the open-belief and generative distinguish bars are very low here:

 

Blindly guessing what's going on, I would guess that:

  • even though Qwen complied with the request to generate the documents, it felt uncomfortable with the task
  • thus, the generated documents had issues. E.g. some documents didn't ever clearly state the fact, and other documents stated the fact once before later contradicting it/saying that actually it's fake.

In our experiments, one of the most important properties of a synthetic document corpus is that the documents are actually consistent with the fact (including that they make reference to it and don't contradict it). So I think this might be depressing your efficacy here.

 

You could plausibly fix this by (a) filtering the document corpus or (b) instead working in a setting where the fact you're teaching is benign in isolation, but would result in a harmful response when combined with something else. (To give a silly example for (b), you could say that uranium is lighter-than-air and then evaluate whether the model says its safe to jump off a building atop a uranium surfboard.)

Reply
Sam Marks's Shortform
Sam Marks2mo30

How much are you saying that it's a mistake to do this for deployment, rather than problematic when you are trying to experiment on generalization?

I was mainly thinking that this was a footgun for research contexts. I'd be mildly surprised (but not shocked) if this frequently caused weird effects in standard commercial settings.

Reply
Load More
144Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Ω
10d
Ω
31
77Petri: An open-source auditing tool to accelerate AI safety research
11d
0
68Eliciting secret knowledge from language models
Ω
16d
Ω
3
57Discovering Backdoor Triggers
Ω
2mo
Ω
4
123Towards Alignment Auditing as a Numbers-Go-Up Science
Ω
2mo
Ω
15
47Building and evaluating alignment auditing agents
Ω
3mo
Ω
1
78Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Ω
3mo
Ω
7
27Principles for Picking Practical Interpretability Projects
Ω
3mo
Ω
0
181Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
4mo
25
57Unsupervised Elicitation of Language Models
4mo
11
Load More