Jacob G-W

I really like learning new things!

https://jacobgw.com/


Comments

Orwell was more prescient than we could have imagined.

but not when starting from Deepseek Math 7B base

Should this say "Deepseek Coder 7B Base"? If not, I'm pretty confused.

Great, thanks so much! I'll get back to you with any experiments I run!


I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that causes the model to bypass the password protection if ~100 vectors were trained (maybe fewer). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).

I would run this experiment if I had the model. Is there a way to get the password-protected model?

"Fantasia: The Sorcerer's Apprentice": A parable about misaligned AI told in three parts: https://www.youtube.com/watch?v=B4M-54cEduo https://www.youtube.com/watch?v=m-W8vUXRfxU https://www.youtube.com/watch?v=GFiWEjCedzY

Best watched with audio on.

Just say something like "Here is a memory I like (or a few), but I don't have a favorite."

Hmm, my guess is that people initially pick a random maximal element and then when they have said it once, it becomes a cached thought so they just say it again when asked. I know I did (and do) this for favorite color. I just picked one that looks nice (red) and then say it when asked because it's easier than explaining that I don't actually have a favorite. I suspect that if you do this a bunch / from a young age, the concept of doing this merges with the actual concept of favorite.

I just remembered that Stallman also realized the same thing:

I do not have a favorite food, a favorite book, a favorite song, a favorite joke, a favorite flower, or a favorite butterfly. My tastes don't work that way.

In general, in any area of art or sensation, there are many ways for something to be good, and they cannot be compared and ordered. I can't judge whether I like chocolate better or noodles better, because I like them in different ways. Thus, I cannot determine which food is my favorite.

I agree with most of this, but I partially (hah!) disagree with the part that they cannot be compared at all. Some elements can be compared (e.g. I like the memory of hiking more than the memory of feeling sick), but not all can be.

When I was recently celebrating something, I was asked to share my favorite memory. I realized I didn't have one. Then (since I have been studying Naive Set Theory a LOT), I got tetris-effected, and as soon as I heard the words "I don't have a favorite" come out of my mouth, I realized that favorite memories (and in fact favorite lots of other things) form partially ordered sets. Some elements are strictly better than others, but not all elements are comparable (in other words, the set of all memories ordered by preference can have several maximal elements and no single greatest one). This gives me a nice framing for thinking about favorites in the future, and it shows that I'm generalizing what I'm learning by studying math, which is also nice!
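To make the partial-order framing concrete, here is a minimal Python sketch. The memory names and the `better_than` pairs are made up purely for illustration (only the hiking vs. feeling sick comparison comes from the thread), and the relation is assumed to already be a strict partial order. It shows a set with two maximal elements and no greatest one.

```python
# Hypothetical memories and a hand-picked "strictly better than" relation.
# A pair (a, b) means "a is strictly better than b"; pairs not listed
# (in either direction) are treated as incomparable.
memories = {"hiking", "concert", "feeling sick"}
better_than = {
    ("hiking", "feeling sick"),
    ("concert", "feeling sick"),
}


def is_better(a, b):
    """Return True if a is strictly preferred to b."""
    return (a, b) in better_than


def maximal_elements(items):
    """Elements that nothing else in the set is strictly better than."""
    return {x for x in items if not any(is_better(y, x) for y in items)}


def greatest_element(items):
    """The single element better than every other one, or None if absent."""
    for x in items:
        if all(x == y or is_better(x, y) for y in items):
            return x
    return None


print(maximal_elements(memories))
print(greatest_element(memories))
```

Here "hiking" and "concert" are both maximal because neither beats the other, so `greatest_element` returns `None`: there are memories nothing else tops, but no single favorite.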

Are you saying this because temporal understanding is necessary for audio? Are there any tests that could be done with just the text interface to see if it understands time better? I can't really think of any (besides just going off vibes after a bunch of interaction).

I'm sorry about that. Are there any topics that you would like to see me do this more with? I'm thinking of doing a video where I do this with a topic to show my process. Maybe something like history that everyone could understand? Can you suggest some more?
