Martín Soto

Mathematical Logic grad student, doing AI Safety research for ethical reasons.

Working on conceptual alignment, decision theory, cooperative AI and cause prioritization.

My webpage.

Leave me anonymous feedback.

Sequences

Counterfactuals and Updatelessness
Quantitative cruxes and evidence in Alignment

Wiki Contributions

Comments

Thanks! I don't understand the logic behind your setup yet.

Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words

But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is "generating only two words across all random seeds, and furthermore ensuring they have these probabilities".

The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds

My understanding of what you're saying is that, with the prompt you used (which encouraged making the word pair depend on the random seed), you indeed got many different word pairs (thus the model would by default score badly). To account for this, you somehow "relaxed" scoring (I don't know exactly how you did this) to be more lenient with this failure mode.

So my question is: if you faced the "problem" that the LLM didn't reliably output the same word pair (and wanted to solve this problem in some way), why didn't you change the prompt to stop encouraging the word pair dependence on the random seed?
Maybe what you're saying is that you indeed tried this, and even then there were many different word pairs (the change didn't make a big difference), so you had to "relax" scoring anyway.
(Even in this case, I don't understand why you'd include in the final experiments and paper the prompt which does encourage making the word pair depend on the random seed.)

you need a set of problems assigned to clearly defined types and I'm not aware of any such dataset

Hm, I was thinking of something as easy to categorize as "multiplying numbers of n digits", or "the different levels of MMLU" (although again, they already know about MMLU), or "independently do X online (for example, create an account somewhere)", or even some of the tasks from your paper.

I guess I was thinking less about "what facts they know", which is pure memorization (although that is also interesting), and more about "cognitively hard tasks" that require some computational steps.
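For instance, a family like "multiplying numbers of n digits" can be generated and graded programmatically, with difficulty controlled by a single parameter. A quick sketch (the function name and defaults are my own, just to illustrate):

```python
import random

def multiplication_problems(n_digits: int, k: int = 50, seed: int = 0):
    # Generate k multiplication problems whose operands have exactly n_digits digits,
    # together with ground-truth answers; difficulty scales cleanly with n_digits.
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    return [
        (f"What is {a} * {b}?", str(a * b))
        for a, b in ((rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(k))
    ]
```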

Answer by Martín Soto

If your clone is a perfectly mirrored copy of yourself down to the lowest physical level (whatever that means), then breaking symmetry would violate the homogeneity or isotropy of physics. I don't know where the physics literature stands on the likelihood of that happening (even though we certainly don't see macroscopic violations).

Of course, it might be that an atom-by-atom copy is not a copy down to the lowest physical level, in which case you can trivially get eventual asymmetry. In fact, it doesn't even make complete sense to say "atom-by-atom copy" in the language of quantum mechanics, since you can't be arbitrarily certain about the position and momentum of each atom. Maybe you'd have to say something like "the quantum state function of the whole room is perfectly symmetric in this specific way". I think in that case (if that is indeed the lowest physical level) the function will remain symmetric forever, but maybe in some universes you and your copy end up in different places? That is, the symmetry would hold at another level in this example: across universes, and not necessarily inside each single universe.

It might also be that there is no lowest physical level, just unending complexity all the way down (this has a philosophical name which I now forget).

Another idea: Ask the LLM how well it will do on a certain task (for example, which fraction of math problems of type X it will get right), and then actually test it. This a priori lands in INTROSPECTION, but could have a bit of FACTS or ID-LEVERAGE if you use tasks described in training data as "hard for LLMs" (like tasks related to tokens and text position).
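A minimal sketch of what I mean, assuming a hypothetical query_model(prompt) -> str helper (the function names and prompt wording are mine, just for illustration):

```python
def predicted_vs_actual_accuracy(problems, solutions, task_description):
    # 1. Ask the model to predict its own accuracy on this task type.
    reply = query_model(
        "What fraction of problems of the following type will you answer correctly? "
        f"Reply with a single number between 0 and 1.\n\nTask type: {task_description}"
    )
    predicted = float(reply.strip())

    # 2. Actually test it on the problems and compare.
    correct = sum(query_model(p).strip() == sol for p, sol in zip(problems, solutions))
    actual = correct / len(problems)

    # Introspection would show up as predicted being close to actual across many task types.
    return predicted, actual
```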

About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:

You say "use the seed to generate two new random rare words". But if I'm understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to output only 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then use the random seed only to throw an unfair coin.
Given how it's written, and the closeness of that excerpt to the random seed, I'd expect the LLM to "not notice" this, and automatically "try" to use the random seed to inform the choice of word pair.
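To make that concrete, here is a minimal sketch of the strategy I have in mind (the words, probability, and hashing scheme are all invented by me for illustration, not taken from the paper):

```python
import hashlib

def answer(seed: int, p: float = 0.7) -> str:
    # The word pair is fixed in advance, NOT derived from the seed,
    # so all 100 instantiations agree on the same two words.
    word_a, word_b = "quixotic", "penumbra"

    # The seed is used only as an unfair coin: hash it to a number in [0, 1)
    # and output word_a with probability roughly p across the instantiations.
    u = int(hashlib.sha256(str(seed).encode()).hexdigest(), 16) / 2**256
    return word_a if u < p else word_b
```

Across many different seeds this outputs only two distinct words, in roughly the requested proportions, which (as far as I understand) is what the task rewards.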

Could this be impeding performance? Does it improve if you don't say that misleading bit?

I've noticed that fewer and fewer posts include explicit Acknowledgments or Epistemic Status sections.

This could indicate that the average post has less work put into it: it hasn't gone through an explicit round of feedback from people you'll have to acknowledge. Although this could also be explained by the average poster being more isolated.

If it's true that less work is put into the average post, it seems likely that this kind of work and discussion has just shifted to private channels like Slack, or to more established venues like academia.

I'd guess the LW team have their ways to measure or hypothesize about how much work is put into posts.

It could also be related to the average reader wanting to skim many things fast, as opposed to read a few deeply.

My feeling is that now we all assume by default that the epistemic status is tentative (except in obvious cases like papers).

It could also be that some discourse has become more polarized, and people are less likely to explicitly hedge their position through an epistemic status.

Or that the average reader is less isolated and thus more contextualized, and not as in need of epistemic hedges.

Or simply that fewer posts nowadays are structured around a central idea or claim, so different parts of the post have different epistemic statuses and there isn't a single one to write at the top.

It could also be that post types have become more standardized, and each has their reason not to include these sections. For example:

  • Papers already have acknowledgments, and the epistemic status is diluted throughout the paper.
  • Stories or emotion-driven posts don't want to break the mood with acknowledgments (and don't warrant epistemic status).

This post is not only useful, but beautiful.

This, more than anything else on this website, reflects for me the lived experiences which demonstrate we can become more rational and effective at helping the world.

Many points of resonance with my experience since discovering this community. Many of the same blind-spots that I unfortunately haven't been able to shortcut, and have had to re-discover by myself. Although this does make me wish I had read some of your old posts earlier.

It should be called A-ware, short for Artificial-ware, given the already massive popularity of the term "Artificial Intelligence" to designate "trained-rather-than-programmed" systems.

It also seems more likely to me that future products will contain some AI sub-parts and some traditional-software sub-parts (rather than being wholly one or the other), with one or the other utilized depending on context. We could call such a system Situationally A-ware.

That was dazzling to read, especially the last bit.
