LESSWRONG

Martín Soto

Doing AI Safety research for ethical reasons.

My webpage.

Leave me anonymous feedback.

I operate by Crocker's Rules.

Sequences

Counterfactuals and Updatelessness
Quantitative cruxes and evidence in Alignment

Posts

3 · Martín Soto's Shortform · 3y · 55

Comments

Breaking the Cycle of Trauma and Tyranny: How Psychological Wounds Shape History
Martín Soto · 22d · 20

Interesting read! I have no expertise on these topics, so I have no idea what here is actually correct or representative. But interesting nonetheless.

Balancing exploration and resistance to memetic threats after AGI
Martín Soto · 25d · 20

Potential solution via mechanistic interpretability

Sounds unlikely to me. Because the space of values is so large, I don't expect we can fix upfront a set of "valid mental moves to justify a value", even if these are pretty high-level abstractions. Put another way, I expect even these generators of the space of values (or the space of "human judgements of values") to be too many, and thus to face the same tension between exploration and virality.

Why Reality Has A Well-Known Math Bias
Martín Soto · 1mo · 20

You might be interested in my post

An alignment safety case sketch based on debate
Martín Soto · 3mo · 20

I read Wei as saying "debate will be hard because philosophy will be hard (and path-dependent and brittle), and one of the main things making philosophy hard is decision theory". I quite strongly disagree.

About decision theory in particular:

  • I think Wei (and most people) are confused about updatelessness in ways that I'm not. I'm actually writing a post about this right now (but the closest thing for now is this one). More concretely, this is a problem of choosing our priors, which requires a kind of moral deliberation not unique to decision theory.

About philosophy more generally:

  • I would differentiate between "there is a ground truth but it's expensive to compute" and "there is literally no ground truth, this is a subjective call we just need to engage in some moral deliberation, pitting our philosophical intuitions against each other, to discover what we want to do".
  • For the former category, I agree "expensive ground truths" can be a problem for debate, or alignment in general, but I expect it to also appear (and in fact do so sooner) on technical topics that we wouldn't call philosophy. And I'd hope to have solutions that are mostly agnostic on subject matter, so the focus on philosophy doesn't seem warranted (although it can be a good case study!).
  • I think ethics, normativity, decision theory, and some other parts of philosophy fall squarely into the latter category. I'm sympathetic to the worries of Wei (and others) that most of the value of the future can be squandered if we solve intent alignment and then choose the wrong kind of moral deliberation. But this problem seems totally orthogonal to getting debate to work in the technical sense that the UK AISI Alignment team focuses on.
Lucky Omega Problem
Martín Soto · 3mo · 20

Depends on the complexity of the logical coin. Certainly not for 1+1=2. But probably yes for appropriately complex statements. This is due to strong immediate identification with "my immediately past self who didn't yet know the truth value", and an understanding that "he (my past self) cannot literally rewrite my brain at will to ensure this behavior holds, but it's understood that I will play along to some extent to satisfy his vision (otherwise he would have to invest more in binding my behavior, which sounds like a waste)".
(Of course, I need some kind of proof that the statement has been chosen non-adversarially, and I'm not yet sure that is possible.)

Lucky Omega Problem
Martín Soto · 3mo · 40

I think this is the right way to research decision theory!

This is basically a rehash of my comment in your previous post, but I think you are confused in a very particular way which I am not. You are confusing "optimizing with the assumption Agent=X" with "optimizing without that assumption". In other words, optimizing for a decision problem where Omega always samples Agent=X, versus optimizing for your actual described decision problem where Omega samples X randomly.

For example, you describe the first case as one "where no one tries to predict this agent in particular". But actually, if we assume Agent=X, this is by definition "Omega predicting Agent perfectly" (even if, from the third-person perspective of the Programmer, this happened randomly, that is, it seemed unlikely a priori, and was very surprising). You also describe the second case as "direct logical entanglement of agent's behaviour with something that influences agent's utility". But actually, if you are assuming that Omega samples X randomly, then X isn't entangled (correlated) with Agent in any way, by definition.

Here's another way to highlight your confusion. Say Programmer is thinking about what Agent to implement. The following are two different (even if similar-sounding) statements.

  • Conditional on Agent seeing a Yes (that is, assuming that Omega always samples X=Agent), Programmer wants to submit an Agent that Cooperates (upon seeing Yes, of course, since that's all it will ever see).
  • Without conditioning on anything (that is, assuming that Omega samples X randomly), Programmer wants to submit an Agent that, upon seeing a Yes, Cooperates.

The former is true, while the latter is false.

So the only question is: Does Agent, the general reasoner created by Programmer, want to maximize its utility according to the former (updated) or the latter (non-updated) probability distribution?
Or in other words: Does Agent, even upon updating on its observation, care about what happens in other counterfactual branches (different from the one that turned out to be the case)?

It can seem jarring at first that there is not a single clean solution that allows all versions of a reasoner (updated and non-updated) to optimize together and get absolutely everything they want. But it's really not surprising when you look at it: optimizing across many branches at once (because you still don't know which one will materialize) will lead to different behavior than optimizing for one branch alone.

So then, what should a general-reasoning Agent actually do in this situation? What would I do if I suddenly learned I had been living in this situation? What probability distribution should we optimize?

  • On the one hand, it sounds pretty natural to optimize from their updated distribution. After all, they have just come into existence. All they have ever known is Yes. They have never, literally speaking, been uncertain about whether they would see a Yes. They could reconstruct an artificial updateless prior further into the past (that they have never actually believed for a moment), but that can feel unnatural.
  • On the other hand, clearly Programmer would have preferred for it to maximize updatelessly, and would have hard-coded that in if they hadn't been so sloppy when programming Agent. There is a sense in which the designed purpose of Agent as a program would have been better achieved (according to Programmer) by maximizing updatelessly, so maybe we should just right this wrong now (despite Agent's updated beliefs).

This choice is just totally up for grabs! We'll have to query our normative intuitions for which one seems to better fit the spirit of what we want to call "maximization". Again, there is no deep secret to be uncovered here: maximizing from a random distribution will recommend different actions than maximizing from a conditioned distribution, which in turn will recommend different actions than maximizing from a crazy distribution where unicorns exist.
I myself find it intuitive to be updateful here. That just looks more like maximization from my current epistemic state, and heck, it was the Programmer, from their updateless perspective, who could have just made me into a dumb Defect-rock, and it would have worked better for themselves! But I think this is just a subjective take based on vibes, and other decision theorists might have the opposite take, and it might be irreconcilable.
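
A toy illustration of the point that the two distributions recommend different actions. This is only a sketch under loud assumptions: it uses a counterfactual-mugging-style problem rather than the Lucky Omega problem itself, and the payoffs, the fair coin, and all function names are made up for illustration.

```python
from fractions import Fraction

# Hypothetical numbers for a counterfactual-mugging-style toy problem.
# These are assumptions for illustration, not taken from the Lucky Omega post.
P_HEADS = Fraction(1, 2)    # prior probability of the "heads" branch
COST_IF_PAY = -100          # heads branch: the agent is asked to pay
REWARD_IF_PAYER = 10_000    # tails branch: paid out only if the policy pays on heads


def payoff(branch: str, pays_on_heads: bool) -> int:
    """Utility of a policy within a single branch."""
    if branch == "heads":
        return COST_IF_PAY if pays_on_heads else 0
    return REWARD_IF_PAYER if pays_on_heads else 0


def prior_value(pays_on_heads: bool) -> Fraction:
    """Expected utility under the non-updated (prior) distribution."""
    return (P_HEADS * payoff("heads", pays_on_heads)
            + (1 - P_HEADS) * payoff("tails", pays_on_heads))


def updated_value(pays_on_heads: bool) -> int:
    """Utility after conditioning on having observed heads."""
    return payoff("heads", pays_on_heads)


def best_policy(value_fn) -> bool:
    """Return whichever policy (pay / don't pay) maximizes value_fn."""
    return max([True, False], key=value_fn)


if __name__ == "__main__":
    print("updateless maximizer pays:", best_policy(prior_value))
    print("updateful maximizer pays:", best_policy(updated_value))
```

Both calls run the same argmax over the same two policies; only the distribution being maximized over differs, and that alone flips the recommended action.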

Martín Soto's Shortform
Martín Soto · 3mo · 50

Low-stakes problems arise for high-stakes deployments

Even when your deployment is high-stakes, you will need to do some research-y activities to prepare for it, like experimenting on which protocol works best for your new model and deployment. This research can obviously be sabotaged by any AIs you are using as research assistants. But it can also be sabotaged by the AIs you are using as test subjects! Both of these are low-stakes problems that complicate iteration on high-stakes measures. It’s unclear how soon they will actually start to bite.

Here’s what this “sabotage by the test subject” could look like:
The model can sometimes notice which sort of protocol we’re applying (for example, by noticing whether some past actions in its trajectory have been rewritten). A malicious model could thus deliberately sandbag on main tasks (or boost performance on side tasks) when it notices the best protocols are in place, so that we think those protocols are actually worse and stop using them. This is a low-stakes problem, because how likely you are to make the wrong call (and thus lose a bunch of utility) is roughly linear in how many times the model successfully misdirects you (up to an upper bound, but that bound’s pretty high).

Of course, if the model’s sandbagging is very egregious, we will notice it. But for subtle versions, it seems hard to tell whether what’s bringing usefulness or safety down is the model sandbagging strategically, or just the protocol being annoying. You can gain some more signal by also assessing the protocol’s usefulness and safety in other ways (like thinking about it, or running it with scaffolded weaker models), but these analogues are imperfect.

But wait, don’t we get the same problem for red-team elicitation (in our final validation of the chosen protocol)? Don’t we need to fully solve sandbagging anyway?
Well, yes, but given a model, setting and protocol, it might take a strong effort to ensure we’ve elicited strongly enough (or at least to have a good estimate of the remaining elicitation gap). Maybe you can afford to do this once (or a few times) as a final validation, but it takes too much serial time to do it for each protocol the blue team might like to test. So this kind of sabotage can still sway the blue team’s lower-effort estimates.

While a real schemer might behave as described above, the same is not true of your human red-team, if you use a separate train-time red team and test-time red team. You can just incentivize the train-time red-team exactly the same way you are incentivizing the blue team (they are, functionally, just part of the blue-team, trying to maximize their chances in the final validation).
I’m unsure if this disanalogy is positive. I guess the only alternative is something super weird: having the whole red-teaming exercise (with iteration and validation) as your “setting”, and doing some meta-iteration (trying to make your AI deceive iteration as much as possible) and a final meta-validation of the whole “setting”. That’s bonkers. So it’s just better to study research sabotage separately (in better-scoped, non-meta research topics), derive which mitigations are generally useful, and implement them during your iteration for a high-stakes deployment.

The Mirror Trap
Martín Soto · 3mo · 20

But the audience isn't optimizing/Goodharting anything, just providing an imperfect proxy. It is only the artist who is argmaxing, which is when Goodhart appears.

One way out would be for the artist to stop optimizing for the audience, and start optimizing for real value. Another way out would be for the audience to perfect their assessment. But this is always the case for Goodhart: you can either stop using the proxy altogether, or improve your proxy.

Something more interesting would be "the artist is trying to create the art that elicits the best response, and the audience is trying to produce the response that makes the artist happiest", or something like that. This is what happens when two people-pleasers meet and end up following a plan that neither of them wants. It's also relevant to training an AI that's alignment-faking. In a sense, the other party trying to maximize your local utility dampens the signal you wanted to use to maximize global utility.

The Mirror Trap
Martín Soto · 3mo · 31

I don't see how this Goodharting is bidirectional. It seems like plain old Goodharting. The assessment, with time (and due to some extraneous process), becomes a lower-quality proxy that the artist keeps optimizing, thus Goodharting actual value.

Martín Soto's Shortform
Martín Soto · 3mo · 80

hahah yes we had ground truth

I think the reason this works is that the AI doesn't need to deeply understand in order to make a nice summary. It can just put some words together and my high context with the world will make the necessary connections and interpretations, even if further questioning the AI would lead it to wrong interpretations. For example, it's efficient at summarizing decision theory papers, even though it's generally bad at reasoning through them.

132 · Tell me about yourself: LLMs are aware of their learned behaviors · Ω · 7mo · 5
9 · Near- and medium-term AI Control Safety Cases · Ω · 8mo · 0
145 · The Information: OpenAI shows 'Strawberry' to feds, races to launch it · 1y · 15
43 · The need for multi-agent experiments · Ω · 1y · 3
54 · OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 1y · 23
40 · Conflict in Posthuman Literature · 1y · 1
7 · Comparing Alignment to other AGI interventions: Extensions and analysis · Ω · 1y · 0
12 · Comparing Alignment to other AGI interventions: Basic model · Ω · 1y · 4
12 · How disagreements about Evidential Correlations could be settled · Ω · 1y · 3
26 · Evidential Correlations are Subjective, and it might be a problem · Ω · 1y · 6