Mark Xu

A space of proposals for building safe advanced AI

Debate: train M* to win debates against Amp(M).

I think Debate is closer to "train M* to win debates against itself as judged by Amp(M)".

Mark Xu's Shortform

If you have DAI right now, minting on https://foundry.finance/ and swapping yTrump for nTrump on catnip.exchange is an almost guaranteed 15% profit.

Does SGD Produce Deceptive Alignment?

Yep. Meant to say "if a model knew that it was in its last training episode and it wasn't going to be deployed." Should be fixed.

Hammers and Nails

I think murphyjitsu is my favorite technique.

- sometimes failing lets you approach a problem from a different angle
- humor often results from failure, so anticipating how you'll fail and nudging to make it more probable might create more humor
- murphyjitsu is normally used in making plans, but you can murphyjitsu your opponent's plans to identify the easiest ways to break them
- micro-murphyjitsu is the art of constantly simulating reality like 5 seconds before, sort of like overclocking your OODA loop
- murphyjitsu is a fast way to tell if your plan is good or not - you don't always have to make it better
- you can get intuitive probabilities for various things by checking how surprised you are at those things
- if you imagine your plan succeeding instead of failing, then it might cause you to realize some low-probability high-impact actions to take
- you can murphyjitsu plans that you might make to get a sense of the tractability of various goals
- murphyjitsu might help correct for overconfidence if you imagine ways you could be wrong every time you make a prediction
- you can murphyjitsu things that aren't plans, e.g. you can suppose the existence of arguments that would change your mind

MikkW's Shortform

https://arxiv.org/abs/cs/0406061 is a result showing that Aumann's Agreement is computationally efficient under some assumptions, which might be of interest.

What are good election betting opportunities?

https://docs.google.com/document/d/1coju1JGwKlnejxkNiqWNJRlknebT_HUA1iz3U_WvZDg/edit is a doc I wrote explaining how to do this in a way that is slightly less risky than betting on catnip directly.

Introduction to Cartesian Frames

Good point - I think the correct definition is something like "rows (or sets of rows) for which there exists a row which is disjoint"

Mark Xu's Shortform

This made me chuckle. More humor

- Rationalists taxonomizing rationalists
- Mesa-rationalists (the mesa-optimizers inside rationalists)
- carrier pigeon rationalists
- proto-rationalists
- not-yet-born rationalists
- literal rats
- frequentists
- group-house rationalists
- EA forum rationalists
- academic rationalists
- meme rationalists

:)

Introduction to Cartesian Frames

This is very exciting. Looking forward to the rest of the sequence.

As I was reading, I found myself reframing a lot of things in terms of the rows and columns of the matrix. Here's my loose attempt to rederive most of the properties under this view.

- The world is a set of states. One way to think about these states is by putting them in a matrix, which we call a "Cartesian frame." In this frame, the rows of the matrix are possible "agents" and the columns are possible "environments".
- Note that you don't have to put all the states in the matrix.

- Ensurables are the parts of the world that the agent can always ensure we end up in. Ensurables are the rows of the matrix, closed under supersets.
- Preventables are the parts of the world that the agent can always ensure we don't end up in. Preventables are the complements of the rows, closed under subsets.
- Controllables are parts of the world that are both ensurable and preventable. Controllables are rows (or sets of rows) for which there exists a row that is disjoint. [edit: previous definition of "contains elements not found in other rows" was wrong, see comment by crabman]
- Observables are parts of the environment that the agent can observe and act conditionally according to. Observables are columns such that for every pair of rows there is a third row that equals the 1st row if the environment is in that column and the 2nd row otherwise. This means that for every two rows, there's a third row that's made by taking the first row and swapping elements with the 2nd row where it intersects with the column.
- Observables have to be sets of columns because if they weren't, you can find a column that is partially observable and partially not. This means you can build an action that says something like "if I am observable, then I am not observable. If I am not observable, I am observable" because the swapping doesn't work properly.
- Observables are closed under boolean combination (note it's sufficient to show closure under complement and unions):
- Since swapping index 1 of a row is the same as swapping all non-1 indexes, observables are closed under complements.
- Since you can swap indexes 1 and 2 by first swapping index 1, then swapping index 2, observables are closed under union.
- This is equivalent to saying "If A or B, then a0, else a2" is logically equivalent to "if A, then a0, else (if B, then a0, else a2)"

- Since controllables are rows with specific properties and observables are columns with specific properties, then nothing can be both controllable and observable. (The only possibility is the entire matrix, which is trivially not controllable because it's not preventable)
- This assumes that the matrix has at least one column

- The image of a cartesian frame is the actual matrix part.
- Since an ensurable is a row (or superset) and an observable is a column (or set of columns), then if something is ensurable and observable, then it must contain every column, so it must be the whole matrix (image).
- If the matrix has 1 or 0 rows, then the observable constraint is trivially satisfied, so the observables are all possible sets of (possible) environment states (since 0/1 length columns are the same as states).
- "0 rows" doesn't quite make sense, but just pretend that you can have a 0 row matrix which is just a set of world states.

- If the matrix has 0 columns, then the ensurable/preventable constraint is trivially satisfied, so the ensurables are the same as the preventables are the same as the controllables, which are all possible sets of (possible) environment states (since "length 0" rows are the same as states).
- "0 columns" doesn't make that much sense either, but pretend that you can have a 0 column matrix which is just a set of world states.

- If the matrix has exactly 1 column, then the ensurable/preventable constraint is trivially satisfied *for states in the image (matrix)*, so the ensurables are all non-empty sets of states in the matrix (since length 1 rows are the same as states), closed under union with states outside the matrix. It should be easy to see that controllables are all possible sets of states that intersect the matrix non-trivially, closed under union with states outside the matrix.
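
Here's a rough Python sketch of the row/column reading above, in case it helps to poke at small examples. Everything in it is an assumption for illustration: the toy matrix `FRAME`, the state names, and the function names are made up, and observables are represented as sets of column indexes, following my description rather than the formal definition in the post.

```python
# Hypothetical toy frame: rows are agent options, columns are environment
# options, and entries are world states (plain strings here).
FRAME = [
    ["w0", "w1"],
    ["w2", "w3"],
    ["w0", "w3"],
    ["w2", "w1"],
]


def is_ensurable(frame, states):
    """Some row lies entirely inside `states` (so supersets of a row count)."""
    return any(all(x in states for x in row) for row in frame)


def is_preventable(frame, states):
    """Some row lies entirely outside `states`."""
    return any(all(x not in states for x in row) for row in frame)


def is_controllable(frame, states):
    """Both ensurable and preventable."""
    return is_ensurable(frame, states) and is_preventable(frame, states)


def is_observable(frame, columns):
    """For every pair of rows (a0, a1) there is a row that equals a0 on
    `columns` and a1 everywhere else. `columns` is a set of column indexes."""
    rows = {tuple(r) for r in frame}
    width = len(frame[0]) if frame else 0
    for a0 in rows:
        for a1 in rows:
            mixed = tuple(a0[j] if j in columns else a1[j] for j in range(width))
            if mixed not in rows:
                return False
    return True


print(is_controllable(FRAME, {"w0", "w1"}))              # True
print(is_controllable(FRAME, {"w0", "w1", "w2", "w3"}))  # False: not preventable
print(is_observable(FRAME, {0}))                         # True
print(is_observable(FRAME[:2], {0}))                     # False: no ("w0", "w3") row
```

With 0 or 1 rows the loops in `is_observable` are vacuous or trivial, which matches the "trivially satisfied" cases above.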

This seems like a reasonable way to think of debate.

I think, in practice (if this even means anything), the power of debate is quite bounded by the power of the human, so some other technique is needed to make the human capable of supervising complex debates, e.g. imitative amplification.