Gianluca Calcagni

Wiki Contributions

Comments

Thanks Seth for your post! I believe I get your point, and in fact I made a post that described exactly that approach. in detail I recommend conditioning the model by using an existing technique called control vectors (or steering vectors), that achieves a raw but incomplete form of safety - in my opinion, just enough partial safety to work on solutioning full safety with the help of AIs.

Of course, I am happy to be challenged.

Very happy to support you :)
It took some time to understand your paper, please find below a few comments:
(1) You are using SVD to find the control vectors (similarly to other authors) but your process is more sophisticated in the following ways: the generation of the matrices, how to reduce them, and how to choose the magnitude of each steering vector. You are also using the non-steered response as an active part of your calculations - something that is marginally done by other authors. The final result works, but the process looks arbitrary to me (tbh all the steering techniques are a bit arbitrary at the moment). What's the added value of your operations? Maybe you have some intuition about why your calculation is finding the "correct" amount of steering, and I am curious to know more.
(2) Ethics plays a fundamental role in finding a collective solution to AI safety, but I tend to think that we should solve alignment first. It would be interesting to see your future research going in that direction. I can help brainstorming some topics that have not been exhaustively studied yet. Let me know!

Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.

I am a huge fan of steering vectors / control vectors, and I would love to see future research showing if they can be linearly combined together to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it's a hint to the fact that language semantics can be linearised as vector spaces (I hope I will be able to formalise mathematically this intuition).

Here a proposal of a possible ELK solution using that approach.

I am surprised I didn't find any reference to Tim Urban's "Wait But Why" post What Makes You You.

In short, he argues that "you" is your sense of continuity, rather than your physical substance. He also argues that if (somehow) your mind was copied&pasted somewhere else, then a brand new "not-you" would be born - even though it may share 100% of your memory and behaviour.
In that sense, Tim argues that Theseus' ship is always "one" despite all its parts are changed over time. If you were to disassemble and reassemble the ship, it would lose its continuity and it could arguably be considered a different ship.

Hi Christopher, thanks for your work! I have high expectations about steering techniques in the context of AI Safety. I actually wrote a post about it, I would appreciate it if you have the time to challenge it!

https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

I included a link to your post in mine, because they are strongly connected.

Thanks for sharing this research, it's very promising. I am looking into collecting a list of steering vectors that may "force" a model into behaving safely - and I believe this should be included as well.
I'd be grateful if you could challenge my approach in a constructive way!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thank you Paul, this post clarifies many open points related to AI (inner) alignment, including some of its limits!
I recently described a technique called control vectors to force a LLM model to show specific dispositional traits, in order to condition some form of alignment (but definitely not true alignment).
I'd happy to be challenged! In my opinion, the importance of control vectors is definitely underestimated for AI safety. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

This technique works with more than just refusal-acceptance behaviours! It is so promising that I wrote a blog post about it and how it is related to safety research. I am looking for people that may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thanks for your great contribution, looking forward to reading more.

This technique works with more than just refusal/acceptance behaviours! It is so promising that I wrote a blog post about it and how it is related to safety research. I am looking for people that may read and challenge my ideas!
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Thanks for your great contribution, looking forward to reading more.

After one year, it's been confirmed that the steering vectors (or control vectors) work remarkably well, so I decided to explain it again and show how it could be used to steer dispositional traits into a model. I believe that the technique can be used to buy time while we work on true safety techniques
https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
If you have the time to read challenge my analysis, I'd be very grateful!

Load More