Fabien Roger

Wiki Contributions

Comments

I agree, there's nothing specific to neural network activations here. In particular, the visual intuition that if you translate the two datasets until they have the same mean (which is weaker than mean ablation), you will have a hard time finding a good linear classifier, doesn't rely on the shape of the data.

But it's not trivial or generally true either: the paper I linked to give some counterexamples of datasets where mean ablation doesn't prevent you from building a classifier with >50% accuracy. The rough idea is that the mean is weak to outliers, but outliers don't matter if you want to produce high-accuracy classifiers. Therefore, what you want is something like the median.

I played around with a small python script to see what happens in slightly more complicated settings.

Simple observations:

  • no matter the distribution, if noise has a fatter tail / is bigger than signal, you're screwed (you can't trust the top studies at all);
  • no matter the distribution, if signal has a fatter tail / is bigger than noise, you're in business (you can trust the top studies);
  • in the critical regime where both distributions are the same, expected quality = performance / 2 seems to be true;
  • if noise amount is correlated with signal in a simple proportional way, then you're in business, because the high noise studies will also be the best ones. (But this is a weird assumption...)

This would mean the only critical information is often "is noise bigger than signal - in particular around the tails". If noise is smaller than signal (even by a factor of 2), then you can probably trust the RCTs blindly, no matter the shape of the underlying distributions, except in weird circumstances.

The practical takeaways are:

  • ignore everything that has probably higher noise than signal
  • take seriously everything that has probably bigger signal than noise and don't bother with corrective terms

That's because a classifier only needs to find a direction which correctly classifies the data, not a direction which makes other classifiers fail. A direction which removes all linearly available information is not always as good as the direction found with LR (at classification).

Maybe figure 2 from the paper which introduced mean difference as a way to remove linear information might help.

Yep, high ablation redundancy can only exist when features are nonlinear. Linear features are obviously removable with a rank-1 ablation, and you get them by running CCS/Logistic Regression/whatever. But I don't care about linear features since it's not what I care about since it's not the shape the features have (Logistic Regression & CCS can't remove the linear information).

The point is, the reason why CCS fails to remove linearly available information is not because the data "is too hard". Rather, it's because the feature is non-linear in a regular way, which makes CCS and Logistic Regression suck at finding the direction which contains all linearly available data (which exists in the context of "truth", just as it is in the context of gender and all the datasets on which RLACE has been tried).

I'm not sure why you don't like calling this "redundancy". A meaning of redundant is "able to be omitted without loss of meaning or function" (Lexico). So ablation redundancy is the normal kind of redundancy, where you can remove sth without losing the meaning. Here it's not redundant, you can remove a single direction and lose all the (linear) "meaning".

"Redundancy" depends on your definition, and I agree that I didn't choose a generous one.

Here is an even simpler example than yours: positive points are all at (1...1) and negative points are all at (-1...-1). Then all canonical directions are good classifiers. This is "high correlation redundancy" with the definition you want to use. There is high correlation redundancy in our toy examples and in actual datasets.

What I wanted to argue against is the naive view that you might have which could be "there is no hope of finding a direction which encodes all information because of redundancy", which I would call "high ablation redundancy". It's not the case that there is high ablation redundancy in both our toy examples (in mine, all information is along (1...1)), and in actual datasets.

I think that a (linear) ensemble of linear probes (trained with Logistic Regression) should never be better than a single linear probe (otherwise the optimizer would have just found this combined linear probe instead). Therefore, I don't expect that ensembling 20 linear CCS probe will increase performance much (and especially not beyond the performance of supervised linear regression).

Feel free to run the experiment if you're interested about it!

Mmh, interesting, I hadn't thought of it from the need-for-robustness angle.

Do you think it matters because LLMs are inherently less robust than humans, and therefore you can't just replace humans by general-ish things? Some companies do work as a combination of micro-entities which are extremely predictable and robust, and the more predictable/robust the better. Do you think that every entity that produces value have to follow this structure?

I disagree with what you said about statelessness because the AI with translucent thoughts I describe are mostly stateless. The difference between CAIS & AI with translucent thoughts is not the possibility of a state, it's the possibility of joint training & long "free" text generations which makes hidden coordination & not-micro thinking possible.

I agree that ideas are similar, and that CAIS are probably more safe than AI with translucent thoughts (by default).

What I wanted to do here is to define a set of AIs broader than CAIS and Open Agents, because I think that the current trajectory of AI does not point towards strict open agents aka small dumb AIs trained / fine-tuned independently and used jointly, and doing small bounded tasks (for example, it does not generate 10000 tokens to decide what to do next, then proceeds to launch 5 API calls and a python script, and prompts a new LLM instance on the result of these calls).

AI with translucent thoughts would include other kinds of system I think are probably much more competitive, like systems of LLMs trained jointly using RL / your favorite method, or LLMs producing long generation with a lot of "freedom" (to such an extent that considering it a safe microservice would not apply).

I think there is a 20% chance that the first AGIs have translucent thoughts. I think there is 5% chance that they are "strict open agents". Do you agree?

I'm very uncertain about the quality of my predictions, I do this mainly to see good my intuitions are, and I don’t believe that they are good.

See https://docs.google.com/document/d/1SYm8_-qdugLFGUcnjI08F9X4DLhWIoziHcSXRLsZ1sI/edit?usp=sharing 

I like this post, and I think these are good reasons to expect AGI around human level to be nice by default.

But I think this doesn't hold for AIs that have large impacts on the world, because niceness is close to radically different and dangerous things to value. Your definition (Doing things that we expect to fulfill other people’s preferences) is vague, and could be misinterpreted in two ways:

  • Present pseudo-niceness: maximize the expected value of the fulfillment   of people's preferences  across time. A weak AI (or a weak human) being present pseudo-nice would be indistinguishable from someone being actually nice. But something very agentic and powerful would see the opportunity to influence people's preferences so that they are easier to satisfy, and that might lead to a world of people who value suffering for the glory of their overlord or sth like that.
  • Future pseudo-niceness: maximize the expected value of all future fulfillment   of people's initial preferences . Again, this is indistinguishable from niceness for weak AIs. But this leads to a world which locks in all the terrible present preferences people have, which is arguably catastrophic.

I don't know how you would describe "true niceness", but I think it's neither of the above.

So if you train an AI to develop "niceness", because AIs are initially weak, you might train niceness, or you might get one of the two pseudo niceness I described. Or something else entirely. Niceness is natural for agents of similar strengths because lots of values point towards the same "nice" behavior. But when you're much more powerful than anyone else, the target becomes much smaller, right?

Do you have reasons to expect "slight RL on niceness" to give you "true niceness" as opposed to a kind of pseudo-niceness?

I would be scared of an AI which has been trained to be nice if there was no way to see if, when it got more powerful, it tried to modify people's preferences / it tried to prevent people's preferences from changing. Maybe niceness + good interpretability enables you to get through the period where AGIs haven't yet made breakthroughs in AI Alignment?

Load More