Charbel-Raphaël Segerie

https://crsegerie.github.io/ 

Comments

I don't think Tournesol is really mature currently, especially for non-French content, and I'm not sure they are trying to do governance work; it's mainly a technical project, which is already cool.

Yup, we should create an equivalent of the Nutri-Score for different recommendation AIs. 

"I really don't know how tractable it would be to pressure compagnies" seems weirdly familiar.  We already used the same argument for AGI safety, and we know that governance work is much more tractable than expected.

I'm a bit surprised this post has so little karma and engagement. I would be really interested to hear from people who think this is a complete distraction.

Fair enough.

I think my main problem with this proposal is that, under the current paradigm of AI (GPTs, foundation models), I don't see how you would implement ATA, and I'm not sure this is really a priority.

I believe we should not create a Sovereign AI. Developing a goal-directed agent of this kind will always be too dangerous. Instead, we should aim for a scenario similar to CERN, where powerful AI systems are used for research in secure labs, but not deployed in the economy. 

I don't want AIs to takeover. 

Thank you for this post and study. It's indeed very interesting.

I have two questions:

In what ways is this threat model similar to or different from learned steganography? It seems quite similar to me, but I’m not entirely sure.

If it can be related to steganography, couldn't we apply the same defenses as for steganography, such as paraphrasing, as suggested in this paper? If paraphrasing is a successful defense, we could use it in the control setting in the lab, although it might be cumbersome to apply paraphrasing for all users in the API.
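
To make the second question concrete, here is a minimal sketch of what such a paraphrasing defense could look like. The model name and prompt are placeholders I made up, assuming an OpenAI-style chat API; this is not anything from the paper:

```python
# Minimal sketch of a paraphrasing defense: route the untrusted model's
# output through a trusted paraphraser before it reaches the user, so that
# steganographic information carried by the exact wording is destroyed.
# Assumes an OpenAI-style chat API; "trusted-paraphraser" is a placeholder.
from openai import OpenAI

client = OpenAI()

def paraphrase_defense(untrusted_output: str) -> str:
    response = client.chat.completions.create(
        model="trusted-paraphraser",  # placeholder for a trusted, weaker model
        messages=[
            {"role": "system",
             "content": "Rewrite the following text so that it keeps the "
                        "same meaning but uses different wording and "
                        "sentence structure."},
            {"role": "user", "content": untrusted_output},
        ],
        temperature=1.0,  # added randomness helps break hidden encodings
    )
    return response.choices[0].message.content
```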

Interesting! Is it fair to say that this is another attempt at solving a subproblem of misgeneralization?

Here is one suggestion for clustering your SAE features between gender and profession more automatically.

In the past, Stuart Armstrong at Aligned AI also attempted to identify different features within a neural network in such a way that the network would generalize better. Here is a summary of a closely related paper, DivDis, which is very similar to what Aligned AI did:

https://github.com/EffiSciencesResearch/challenge_data_ens_2023/blob/main/assets/DivDis.png?raw=true

 

The DivDis paper presents a simple algorithm to solve these ambiguity problems in the training set. DivDis uses a multi-head neural network and a loss that encourages the heads to use independent information. Once training is complete, the best head can be selected by testing all the heads on validation data.

DivDis achieves 64% accuracy on the unlabeled set when training on a subset of human_age, and 97% accuracy on the unlabeled set of human_hair. GitHub: https://github.com/yoonholee/DivDis
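
For intuition, here is a minimal PyTorch sketch of the DivDis-style diversity loss. This is my paraphrase of the paper's idea, not the authors' exact code (see the repo above for that):

```python
import torch

def divdis_diversity_loss(logits_per_head: torch.Tensor) -> torch.Tensor:
    """Mean pairwise mutual information between the heads' predictions,
    estimated on a batch of unlabeled data. DivDis minimizes this to push
    the heads toward statistically independent predictions.

    logits_per_head: shape (n_heads, batch, n_classes).
    """
    probs = logits_per_head.softmax(dim=-1)   # (H, B, C)
    marginals = probs.mean(dim=1)             # (H, C)
    n_heads, batch, _ = probs.shape
    total, n_pairs = 0.0, 0
    for i in range(n_heads):
        for j in range(i + 1, n_heads):
            # Empirical joint distribution of heads i and j over the batch.
            joint = torch.einsum("bc,bd->cd", probs[i], probs[j]) / batch
            indep = marginals[i][:, None] * marginals[j][None, :]
            # MI(i, j) = KL(joint || product of marginals)
            total = total + (joint * (joint.clamp_min(1e-8)
                                      / indep.clamp_min(1e-8)).log()).sum()
            n_pairs += 1
    return total / n_pairs

# Full objective (sketch): cross-entropy on the labeled set for every head,
# plus lambda * divdis_diversity_loss(...) on unlabeled batches; afterwards,
# pick the best head on held-out validation data.
```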

 

I have the impression that you could also use DivDis by training probes on the latent activations of the SAEs and then applying Stuart Armstrong's technique to decorrelate the different spurious correlations. Either of these two algorithms would significantly reduce the manual work required to partition the different features of your SAEs, resulting in two clusters of features, obtained in an unsupervised way, that would here correspond to gender and profession.
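
Concretely, that could look something like the following rough sketch, where `sae_acts` and all the dimensions are assumptions on my part, and the diversity loss is the one sketched above:

```python
import torch.nn as nn

# Placeholder dimensions; adapt to your SAE and classification task.
n_heads, d_sae, n_classes = 2, 16384, 2

# Multi-head linear probes on SAE feature activations.
probes = nn.Linear(d_sae, n_heads * n_classes)

def head_logits(sae_acts):  # sae_acts: (batch, d_sae) feature activations
    return probes(sae_acts).view(-1, n_heads, n_classes).transpose(0, 1)

# Train with cross-entropy on a labeled batch plus
# lambda * divdis_diversity_loss(head_logits(unlabeled_acts)); then inspect
# each head's probe weights to see which SAE features it relies on. That
# weight partition gives the two feature clusters (gender vs. profession).
```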

Here is the YouTube video from the Guaranteed Safe AI Seminars:

It might not be that impossible to use LLMs to automatically train wisdom:

Look at this: "Researchers have utilized Nvidia’s Eureka platform, a human-level reward design algorithm, to train a quadruped robot to balance and walk on top of a yoga ball."
