Fabien Roger

# Wiki Contributions

That's because a classifier only needs to find a direction which correctly classifies the data, not a direction which makes other classifiers fail. A direction which removes all linearly available information is not always as good as the direction found with LR (at classification).

Maybe figure 2 from the paper which introduced mean difference as a way to remove linear information might help.

Yep, high ablation redundancy can only exist when features are nonlinear. Linear features are obviously removable with a rank-1 ablation, and you get them by running CCS/Logistic Regression/whatever. But I don't care about linear features since it's not what I care about since it's not the shape the features have (Logistic Regression & CCS can't remove the linear information).

The point is, the reason why CCS fails to remove linearly available information is not because the data "is too hard". Rather, it's because the feature is non-linear in a regular way, which makes CCS and Logistic Regression suck at finding the direction which contains all linearly available data (which exists in the context of "truth", just as it is in the context of gender and all the datasets on which RLACE has been tried).

I'm not sure why you don't like calling this "redundancy". A meaning of redundant is "able to be omitted without loss of meaning or function" (Lexico). So ablation redundancy is the normal kind of redundancy, where you can remove sth without losing the meaning. Here it's not redundant, you can remove a single direction and lose all the (linear) "meaning".

"Redundancy" depends on your definition, and I agree that I didn't choose a generous one.

Here is an even simpler example than yours: positive points are all at (1...1) and negative points are all at (-1...-1). Then all canonical directions are good classifiers. This is "high correlation redundancy" with the definition you want to use. There is high correlation redundancy in our toy examples and in actual datasets.

What I wanted to argue against is the naive view that you might have which could be "there is no hope of finding a direction which encodes all information because of redundancy", which I would call "high ablation redundancy". It's not the case that there is high ablation redundancy in both our toy examples (in mine, all information is along (1...1)), and in actual datasets.

I think that a (linear) ensemble of linear probes (trained with Logistic Regression) should never be better than a single linear probe (otherwise the optimizer would have just found this combined linear probe instead). Therefore, I don't expect that ensembling 20 linear CCS probe will increase performance much (and especially not beyond the performance of supervised linear regression).

Feel free to run the experiment if you're interested about it!

Mmh, interesting, I hadn't thought of it from the need-for-robustness angle.

Do you think it matters because LLMs are inherently less robust than humans, and therefore you can't just replace humans by general-ish things? Some companies do work as a combination of micro-entities which are extremely predictable and robust, and the more predictable/robust the better. Do you think that every entity that produces value have to follow this structure?

I disagree with what you said about statelessness because the AI with translucent thoughts I describe are mostly stateless. The difference between CAIS & AI with translucent thoughts is not the possibility of a state, it's the possibility of joint training & long "free" text generations which makes hidden coordination & not-micro thinking possible.

I agree that ideas are similar, and that CAIS are probably more safe than AI with translucent thoughts (by default).

What I wanted to do here is to define a set of AIs broader than CAIS and Open Agents, because I think that the current trajectory of AI does not point towards strict open agents aka small dumb AIs trained / fine-tuned independently and used jointly, and doing small bounded tasks (for example, it does not generate 10000 tokens to decide what to do next, then proceeds to launch 5 API calls and a python script, and prompts a new LLM instance on the result of these calls).

AI with translucent thoughts would include other kinds of system I think are probably much more competitive, like systems of LLMs trained jointly using RL / your favorite method, or LLMs producing long generation with a lot of "freedom" (to such an extent that considering it a safe microservice would not apply).

I think there is a 20% chance that the first AGIs have translucent thoughts. I think there is 5% chance that they are "strict open agents". Do you agree?

I'm very uncertain about the quality of my predictions, I do this mainly to see good my intuitions are, and I don’t believe that they are good.

I like this post, and I think these are good reasons to expect AGI around human level to be nice by default.

But I think this doesn't hold for AIs that have large impacts on the world, because niceness is close to radically different and dangerous things to value. Your definition (Doing things that we expect to fulfill other people’s preferences) is vague, and could be misinterpreted in two ways:

• Present pseudo-niceness: maximize the expected value of the fulfillment   of people's preferences  across time. A weak AI (or a weak human) being present pseudo-nice would be indistinguishable from someone being actually nice. But something very agentic and powerful would see the opportunity to influence people's preferences so that they are easier to satisfy, and that might lead to a world of people who value suffering for the glory of their overlord or sth like that.
• Future pseudo-niceness: maximize the expected value of all future fulfillment   of people's initial preferences . Again, this is indistinguishable from niceness for weak AIs. But this leads to a world which locks in all the terrible present preferences people have, which is arguably catastrophic.

I don't know how you would describe "true niceness", but I think it's neither of the above.

So if you train an AI to develop "niceness", because AIs are initially weak, you might train niceness, or you might get one of the two pseudo niceness I described. Or something else entirely. Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the target becomes much smaller, right?

Do you have reasons to expect "slight RL on niceness" to give you "true niceness" as opposed to a kind of pseudo-niceness?

I would be scared of an AI which has been trained to be nice if there was no way to see if, when it got more powerful, it tried to modify people's preferences / it tried to prevent people's preferences from changing. Maybe niceness + good interpretability enables you to get through the period where AGIs haven't yet made breakthroughs in AI Alignment?

I'm interested to know where this research will lead you!

A small detail: for experiments on LMs, did you measure the train or the test loss? I expect this to matter since I expect activations to be noisy, and I expect that overfitting noise can use many sparse features (except if the number of data points is extremely large relative to the number of parameters).

I would also be interested to test a bit more if this method works on toy models which clearly don't have many features, such as a mixture of a dozen of gaussians, or random points in the unit square (where there is a lot of room "in the corners"), to see if this method produces strong false positives. Layer 0 is also a baseline, since I expect embeddings to have fewer features than activations in later layers, though I'm not sure how many features you should expect in layer 0.I hope you'll find what's wrong with layer 0 in your experiments!

My bad. My intuitions about eigenvectors mislead me, and I now disagree with my comment. zfurman, on EleutherAI, gave me a much better frame to see what SVD does: SVD helps you find where the action happens in the sense that it tells you where it is read, and where it is written (in decreasing order of importance), by decomposing the transformation into a sum of [dot product with a right singular vector, scale by the corresponding singular value, multiply by the corresponding left singular vector]. This does capture a significant amount of "where the action happens", and is a much better frame than the "rotate scale rotate" frame I had previously learned.