Martín Soto

Doing AI Safety research for ethical reasons.

My webpage.

Leave me anonymous feedback.

I operate by Crocker's Rules.

Sequences

Counterfactuals and Updatelessness
Quantitative cruxes and evidence in Alignment

Comments

I think this is the right way to research decision theory!

This is basically a rehash of my comment on your previous post, but I think you are confused in a very particular way that I am not. You are confusing "optimizing with the assumption Agent=X" with "optimizing without that assumption". In other words, optimizing for a decision problem where Omega always samples Agent=X, versus optimizing for your actual described decision problem where Omega samples X randomly.

For example, you describe the first case as one "where no one tries to predict this agent in particular". But actually, if we assume Agent=X, this is by definition "Omega predicting Agent perfectly" (even if, from the third-person perspective of the Programmer, this happened randomly, that is, it seemed unlikely a priori, and was very surprising). You also describe the second case as "direct logical entanglement of agent's behaviour with something that influences agent's utility". But actually, if you are assuming that Omega samples X randomly, then X isn't entangled (correlated) with Agent in any way, by definition.

Here's another way to highlight your confusion. Say Programmer is thinking about what Agent to implement. The following are two different (even if similar-sounding) statements.

  • Conditional on Agent seeing a Yes (that is, assuming that Omega always samples X=Agent), Programmer wants to submit an Agent that Cooperates (upon seeing Yes, of course, since that's all it will ever see).
  • Without conditioning on anything (that is, assuming that Omega samples X randomly), Programmer wants to submit an Agent that, upon seeing a Yes, Cooperates.

The former is true, while the latter is false.

So the only question is: Does Agent, the general reasoner created by Programmer, want to maximize its utility according to the former (updated) or the latter (non-updated) probability distribution?
Or in other words: Does Agent, even upon updating on its observation, care about what happens in other counterfactual branches (different from the one that turned out to be the case)?

It can seem jarring at first that there is not a single clean solution that allows all versions of a reasoner (updated and non-updated) to optimize together and get absolutely everything they want. But it's really not surprising when you look at it: optimizing across many branches at once (because you still don't know which one will materialize) will lead to different behavior than optimizing for one branch alone.
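
To make that concrete, here is a toy numerical sketch in Python. The payoffs and the 50/50 sampling probability are invented for illustration (they are not the payoffs of the actual decision problem under discussion); the point is just that the action maximizing expected utility under the updated distribution (conditioning on X=Agent) can differ from the one maximizing it under the prior where Omega samples X randomly.

```python
# Toy illustration only: all payoffs and probabilities below are invented.
P_X_EQUALS_AGENT = 0.5  # hypothetical prior probability that Omega samples X = Agent

# Hypothetical payoffs to Agent, indexed by (X == Agent?, Agent's action upon seeing Yes)
payoff = {
    (True,  "Cooperate"): 3,  # cooperating when the sampled program really is you pays off
    (True,  "Defect"):    1,
    (False, "Cooperate"): 0,  # cooperating against a randomly sampled X is wasted
    (False, "Defect"):    4,
}

def expected_utility(action, p_x_is_agent):
    return (p_x_is_agent * payoff[(True, action)]
            + (1 - p_x_is_agent) * payoff[(False, action)])

# Updated distribution: condition on X = Agent (probability 1).
updated = {a: expected_utility(a, 1.0) for a in ("Cooperate", "Defect")}
# Non-updated (prior) distribution: Omega samples X randomly.
prior = {a: expected_utility(a, P_X_EQUALS_AGENT) for a in ("Cooperate", "Defect")}

print("updated:", updated, "-> best:", max(updated, key=updated.get))  # Cooperate
print("prior:  ", prior,   "-> best:", max(prior, key=prior.get))      # Defect
```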

So then, what should a general-reasoning Agent actually do in this situation? What would I do if I suddenly learned I had been living in this situation? Which probability distribution should we optimize for?

  • On the one hand, it sounds pretty natural to optimize from their updated distribution. After all, they have just come into existence. All they have ever known is Yes. They have never, literally speaking, been uncertain about whether they would see a Yes. They could reconstruct an artificial updateless prior further into the past (that they have never actually believed for a moment), but that can feel unnatural.
  • On the other hand, clearly Programmer would have preferred for it to maximize updatelessly, and would have hard-coded that in if they hadn't been so sloppy when programming Agent. There is a sense in which the designed purpose of Agent as a program would have been better achieved (according to Programmer) by maximizing updatelessly, so maybe we should just right this wrong now (despite Agent's updated beliefs).

This choice is just totally up for grabs! We'll have to query our normative intuitions for which one seems to better fit the spirit of what we want to call "maximization". Again, there is no deep secret to be uncovered here: maximizing from a random distribution will recommend different actions than maximizing from a conditioned distribution, which will in turn recommend different actions than maximizing from a crazy distribution where unicorns exist.
I myself find it intuitive to be updateful here. That just looks more like maximization from my current epistemic state, and heck, it was the Programmer, from their updateless perspective, who could have just made me into a dumb Defect-rock, and it would have worked better for themselves! But I think this is just a subjective take based on vibes, and other decision theorists might have the opposite take, and it might be irreconcilable.

Low-stakes problems arise for high-stakes deployments

Even when your deployment is high-stakes, you will need to do some research-y activities to prepare for it, like experimenting on which protocol works best for your new model and deployment. This research can obviously be sabotaged by any AIs you are using as research assistants. But it can also be sabotaged by the AIs you are using as test subjects! Both of these are low-stakes problems that complicate iteration on high-stakes measures. It’s unclear how soon they will actually start to bite.

Here’s one way this “sabotage by the test subject” could look:
The model can sometimes notice which sort of protocol we’re applying (for example, by noticing whether some past actions in its trajectory have been rewritten). A malicious model could thus deliberately sandbag on main tasks (or boost performance on side tasks) when it notices the best protocols are in place, so that we think those protocols are actually worse and stop using them. This is a low-stakes problem, because how likely you are to make the wrong call (and thus lose a bunch of utility) is roughly linear in how many times the model successfully misdirects you (up to an upper bound, but that bound’s pretty high).
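
Here’s a minimal simulation sketch of that dynamic (all numbers invented for illustration, and `strong_protocol`/`weak_protocol` are just placeholder names): a model that drops its per-task performance only when it detects the stronger protocol can make that protocol’s measured usefulness come out below the weaker one’s.

```python
import random

random.seed(0)

def run_task(protocol, model_is_malicious):
    """Hypothetical per-task usefulness score; every number here is made up."""
    true_quality = {"strong_protocol": 0.80, "weak_protocol": 0.70}[protocol]
    score = true_quality + random.gauss(0, 0.05)
    # The malicious model sometimes notices the strong protocol (e.g. via rewritten
    # actions in its trajectory) and sandbags a bit on exactly those tasks.
    if model_is_malicious and protocol == "strong_protocol" and random.random() < 0.7:
        score -= 0.2  # per-task drop, small enough to blame on "the protocol being annoying"
    return score

def measured_usefulness(protocol, model_is_malicious, n_tasks=200):
    return sum(run_task(protocol, model_is_malicious) for _ in range(n_tasks)) / n_tasks

for malicious in (False, True):
    strong = measured_usefulness("strong_protocol", malicious)
    weak = measured_usefulness("weak_protocol", malicious)
    winner = "strong" if strong > weak else "weak"
    print(f"malicious={malicious}: strong={strong:.3f}, weak={weak:.3f}, apparent winner={winner}")
```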

Of course, if the model’s sandbagging is very egregious, we will notice it. But for subtle versions, it seems hard to tell whether what’s bringing usefulness or safety down is the model sandbagging strategically, or just the protocol being annoying. You can gain some more signal by also assessing the protocol’s usefulness and safety in other ways (like thinking about it, or running it with scaffolded weaker models), but these analogues are imperfect.

But wait, don’t we get the same problem for red-team elicitation (in our final validation of the chosen protocol)? Don’t we need to fully solve sandbagging anyway?
Well, yes, but it might require a strong effort, given a model, setting, and protocol, to ensure we’ve elicited strongly enough (or at least to get a good estimate of the remaining elicitation gap). Maybe you can afford to do this once (or a few times) as final validation, but it’s too much serial time to do it for each protocol the blue team might like to test. So this kind of sabotage can still sway the lower-effort estimates of the blue team.

While a real schemer might behave as described above, the same is not true of your human red team, if you use a separate train-time red team and test-time red team. You can just incentivize the train-time red team exactly the same way you are incentivizing the blue team (they are, functionally, just part of the blue team, trying to maximize their chances in the final validation).
I’m unsure whether this disanalogy is a good thing. I guess the only alternative is something super weird: having the whole red-teaming exercise (with iteration and validation) as your “setting”, and doing some meta-iteration (trying to make your AI deceive iteration as much as possible) and a final meta-validation of the whole “setting”. That’s bonkers. So it’s just better to study research sabotage separately (in better-scoped, non-meta research topics), derive which mitigations are generally useful, and implement them during your iteration for a high-stakes deployment.

But the audience isn't optimizing/Goodharting anything, just providing an imperfect proxy. It is only the artist who is argmaxing, which is when Goodhart appears.

One way out would be for the artist to stop optimizing for the audience, and start optimizing for real value. Another way out would be for the audience to perfect their assessment. But this is always the case for Goodhart: you can either stop using the proxy altogether, or improve your proxy.

Something more interesting would be "the artist is trying to create the art that elicits the best response, and the audience is trying to produce the response that makes the artist happiest", or something like that. This is what happens when two people-pleasers meet and end up executing a plan that neither of them wants. It's also relevant to training an AI that's alignment-faking. In a sense, the other party trying to maximize your local utility dampens the signal you wanted to use to maximize global utility.

I don't see how this Goodharting is bidirectional. It seems like plain old Goodharting. The assessment, with time (and due to some extraneous process), becomes a lower-quality proxy, which the artist keeps optimizing, thus Goodharting actual value.
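
For concreteness, a toy sketch of that plain-old Goodhart dynamic (all quantities invented): the artist argmaxes the audience's assessment, and as that assessment becomes a noisier proxy for true value, the proxy score of the chosen piece stays high while its true value drops.

```python
import random

random.seed(1)

# Hypothetical candidate artworks, each with a latent "true value" in [0, 1].
candidates = [random.random() for _ in range(1000)]

def audience_assessment(true_value, assessment_noise):
    """The proxy: true value plus an assessment error of a given size."""
    return true_value + random.gauss(0, assessment_noise)

for noise in (0.0, 0.2, 0.5, 1.0):  # the proxy degrading over time
    scored = [(audience_assessment(v, noise), v) for v in candidates]
    chosen_proxy, chosen_true_value = max(scored)  # the artist argmaxes the proxy
    print(f"proxy noise={noise:.1f}: proxy score of winner={chosen_proxy:.2f}, "
          f"true value of winner={chosen_true_value:.2f}")
```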

hahah yes we had ground truth

I think the reason this works is that the AI doesn't need to deeply understand in order to make a nice summary. It can just put some words together, and my high context with the world will make the necessary connections and interpretations, even if further questioning the AI would lead it to wrong interpretations. For example, it's efficient at summarizing decision theory papers, even though it's generally bad at reasoning through them.

Not really, just LW, AI safety papers and AI safety research notes, which are the topics I'd most be interested in. I'm not sure other forums should be very different though?

My vibe-check on current AI use cases

@Jacob Pfau and I spent a few hours optimizing our prompts and pipelines for our daily uses of AI. Here's where I think my most desired use cases are in terms of capabilities:

  • Generating new frontier knowledge: As in, given a LW post generating interesting comments that add to the conversation, or given some notes on a research topic generating experiment ideas, etc. It's pretty bad, to the extent it's generally not worth it. But Gemini 2.5 Pro is for some reason much better at this than the other models, to the extent it's sometimes worth it to sample 5 ideas to get your mind rolling.
    • I was hoping we could get a nice pipeline that generates many ideas and prunes most, but the model is very bad at pruning. It does write sensible arguments about why some ideas are nonsensical, but ultimately scores them based on flashiness rather than any sensible assessment of relevance to the stated task. Maybe taking a few hours to design good judge rubrics would be worth it (see the sketch after this list), but it seems hard to design very general rubrics.
  • Writing documents from notes: This was surprisingly bad, mostly because for any set of notes, the AI was missing 50 small contextual details, and thus framed many points in a wrong, misleading, or obviously Chinese-roomy way. Pasting loads of random context related to the notes (for example, related research papers) didn't help much. Still, Claude 4 was the best, but maybe this was just because of subjective stylistic preferences.
    • Of course, some less automated approaches work much better, like giving it a ready document and asking it to improve its flow, or brainstorming structure and presentation.
  • Math/code: Quite good out of the box. Even for open-ended exploration of vague questions you want to turn into mathematical problems (typical in alignment theory), you can get a nice pipeline for the AI to propose formalizations, decompositions, or example cases, and push the conversation forward semi-autonomously. o3 seems to work best, although I was impressed by Claude 4 Opus' knowledge on niche topics.
  • Summarizing documents, and exploring topics I'm no expert in: Super good out of the box, especially thanks to its encyclopaedic indexical knowledge (connecting you to the obvious methods/answers that an expert would bring up).
    • One particularly useful approach is walking through how a general method or abstract idea could apply to a concrete example of interest to you.
  • Coaching: Pretty good out of the box in proposing solutions and perspectives. Probably close to the top 10% of coaches, but maybe the huge value is in that last 10%.
    • Also therapy: Probably good, probably better or more constructive than the average friend, but of course there are worries about hard-to-detect sycophancy.
  • Personal micromanagement: Pretty good.
    • Having a long-running chat where you ask it "how long will this task take me to complete", and over time you both calibrate.
    • A more general scaffolded personal assistant to co-organize your week.
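
Relatedly, here is a minimal sketch of the generate-then-prune pipeline mentioned in the first bullet. `call_model` is a hypothetical stand-in for whatever chat API you use, and the prompts and judge rubric are invented; they would need the few hours of tuning mentioned above.

```python
# Hypothetical generate-then-prune pipeline sketch. `call_model` is a placeholder
# for whatever model API you use; all prompts and the judge rubric are invented.

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model API of choice")

def generate_ideas(notes: str, n: int = 10) -> list[str]:
    return [
        call_model(
            f"Here are my research notes:\n{notes}\n\n"
            f"Propose one concrete experiment idea (idea #{i + 1}). "
            "Be specific about the setup and about what result would be informative."
        )
        for i in range(n)
    ]

def score_idea(notes: str, idea: str) -> float:
    # Judge rubric aimed at relevance/tractability rather than flashiness.
    reply = call_model(
        "Score the following experiment idea from 1 to 10 on: (a) relevance to the notes, "
        "(b) tractability in under a week, (c) whether a negative result would still be "
        "informative. Ignore how impressive it sounds. Reply with a single number.\n\n"
        f"Notes:\n{notes}\n\nIdea:\n{idea}"
    )
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # unparseable judge output counts as a failed score

def top_ideas(notes: str, n_generate: int = 10, n_keep: int = 3) -> list[str]:
    ideas = generate_ideas(notes, n_generate)
    return sorted(ideas, key=lambda idea: score_idea(notes, idea), reverse=True)[:n_keep]
```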

Any use cases I'm missing?

Worked on this with Demski. Video, report.

Any update to the market is (equivalent to) updating on some kind of information. So all you can do is dynamically choose what to update on and what not to.* Unfortunately, whenever you choose not to update on something, you are giving up on the asymptotic learning guarantees of policy market setups. So the strategic gains from updatelessness (like not falling into traps) are in a fundamental sense irreconcilable with the learning gains from updatefulness. That doesn't mean you can't be pretty smart about deciding what exactly to update on... but due to embeddedness problems and the complexity of the world, it seems to be the norm (rather than the exception) that you cannot be sure a priori of what to update on (you just have to make some arbitrary choices).

*For avoidance of doubt, what matters for whether you have updated on X is not "whether you have heard about X", but rather "whether you let X factor into your decisions". Or at least, this is the case for a sophisticated enough external observer (assessing whether you've updated on X), not necessarily all observers.
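
As a toy illustration of the learning-guarantee point (an assumed coin-flip setup, not something from the report): an agent that never lets observed flips factor into its decisions is stuck at its prior forever, while the updateful agent's estimate converges to the true bias.

```python
import random

random.seed(2)

TRUE_BIAS = 0.7      # hidden parameter the agent could in principle learn
PRIOR = (1.0, 1.0)   # Beta(1, 1) prior over the coin's bias

heads = tails = 0
for _ in range(10_000):
    if random.random() < TRUE_BIAS:
        heads += 1
    else:
        tails += 1

# Updateful agent: conditions on the flips it has seen (posterior mean of the Beta).
a, b = PRIOR
updateful_estimate = (a + heads) / (a + b + heads + tails)

# Agent that refuses to update on flips: it may have "heard about" them, but does
# not let them factor into its decisions, so its effective estimate is the prior mean.
non_updating_estimate = a / (a + b)

print(f"true bias          : {TRUE_BIAS}")
print(f"updateful estimate : {updateful_estimate:.3f}")    # converges towards 0.7
print(f"non-updating agent : {non_updating_estimate:.3f}")  # stuck at 0.5
```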

Some thoughts skimming this post generated:

If a catastrophe happens, then either:

  1. It happened so discontinuously that we couldn't avoid it even with our concentrated effort
  2. It happened slowly but for some reason we didn't make a concentrated effort. This could be because:
    1. We didn't notice it (e.g. intelligence explosion inside lab)
    2. We couldn't coordinate a concentrated effort, even if we all individually would want it to exist (e.g. no way to ensure China isn't racing faster)
    3. We didn't act individually rationally (e.g. Trump doesn't listen to advisors / Trump brainwashed by AI)

1 seems unlikelier by the day.
2a is mostly transparency inside labs (and less importantly into economic developments), which is important but at least some people are thinking about it.
There's a lot to think through in 2b and 2c. It might be critical to ensure early takeoff improves them, rather than degrading them (absent any drastic action to the contrary) until late takeoff can land the final blow.
If we assume enough hierarchical power structures, the situation simplifies into "what 5 world leaders do", and then it's pretty clear you mostly want communication channels and trustless agreements for 2b, and improving national decision-making for 2c.
Maybe what I'd be most excited to see from the "systemic risk" crowd is detailed thinking and exemplification on how the assumption of enough hierarchical power structures breaks down (that is, how the outcome depends strongly on things other than what those 5 world leaders do), what the most x-risk-worrisome additional dynamics in that area are, and how to intervene on them.
(Maybe all this is still too abstract, but it cleared my head)
