Linda Linsefors

Hi, I am a Physicist, an Effective Altruist and AI Safety student/researcher.

As of this writing, www.aisafety.camp points to our new website while aisafety.camp points to our old one. We're working on fixing this.

If you want to spread information about AISC, please make sure to link to our new webpage, and not the old one. 

I have two hypotheses for what is going on. I'm leaning towards 1, but I'm very unsure. 

1)

king - man + woman = queen

is true for word2vec embeddings but not for LLaMa2 7B embeddings, because word2vec has far fewer embedding dimensions. 

  • LLaMa2 7B has 4096 embedding dimensions.
  • This paper uses word2vec variants with 50, 150, and 300 embedding dimensions.

Possibly, when you have thousands of embedding dimensions, these dimensions will encode lots of different connotations of these words. These connotations will probably not line up with the simple relation [king - man + woman = queen], and therefore we get [king - man + woman ≠ queen] for high-dimensional embeddings.

2)

king - man + woman = queen

isn't true for word2vec either. If you do it with word2vec embeddings, you get more or less the same result as I did with LLaMa2 7B. 

(As I'm writing this, I'm realising that just getting my hands on some word2vec embeddings and testing this myself seems much easier than decoding what the papers I found are actually saying.)

"▁king" - "▁man" + "▁woman" ≠ "▁queen" (for LLaMa2 7B token embeddings)

I tried to replicate the famous "king" - "man" + "woman" = "queen" result from word2vec using LLaMa2 token embeddings. To my surprise, it did not work. 

I.e., if I look for the token with the biggest cosine similarity to "▁king" - "▁man" + "▁woman", it is not "▁queen".
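
For concreteness, here is a minimal sketch of that lookup procedure, with made-up 4-dimensional vectors standing in for the real 4096-dimensional LLaMa2 embedding matrix (the vocabulary and numbers are hypothetical, chosen only to show the mechanics):

```python
import numpy as np

# Toy vocabulary with hypothetical 4-d embeddings (NOT real LLaMa2 vectors).
vocab = ["king", "queen", "man", "woman", "prince"]
E = np.array([
    [0.9, 0.8, 0.1, 0.0],   # king
    [0.9, 0.1, 0.8, 0.0],   # queen
    [0.1, 0.9, 0.1, 0.1],   # man
    [0.1, 0.1, 0.9, 0.1],   # woman
    [0.8, 0.7, 0.2, 0.3],   # prince
])

def top_k_cosine(query, E, k=3):
    """Rank vocabulary rows of E by cosine similarity to `query`."""
    q = query / np.linalg.norm(query)
    rows = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = rows @ q
    order = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in order]

query = E[0] - E[2] + E[3]          # king - man + woman
print(top_k_cosine(query, E))
```

With real LLaMa2 embeddings, the same procedure is what produced the top-ten lists below.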

Top ten cosine similarities for

  • "▁king" - "▁man" + "▁woman"
    is ['▁king', '▁woman', '▁King', '▁queen', '▁women', '▁Woman', '▁Queen', '▁rey', '▁roi', 'peror']
  • "▁king" + "▁woman"
    is ['▁king', '▁woman', '▁King', '▁Woman', '▁women', '▁queen', '▁man', '▁girl', '▁lady', '▁mother']
  • "▁king"
    is ['▁king', '▁King', '▁queen', '▁rey', 'peror', '▁roi', '▁prince', '▁Kings', '▁Queen', '▁König']
  • "▁woman"
    is ['▁woman', '▁Woman', '▁women', '▁man', '▁girl', '▁mujer', '▁lady', '▁Women', 'oman', '▁female']
  • projection of "▁queen" on span( "▁king", "▁man", "▁woman")
    is ['▁king', '▁King', '▁woman', '▁queen', '▁rey', '▁Queen', 'peror', '▁prince', '▁roi', '▁König']
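
The projection in the last bullet can be computed with ordinary least squares. A sketch, again with hypothetical toy vectors rather than real token embeddings:

```python
import numpy as np

# Hypothetical 4-d embeddings standing in for real token vectors.
king  = np.array([0.9, 0.8, 0.1, 0.0])
man   = np.array([0.1, 0.9, 0.1, 0.1])
woman = np.array([0.1, 0.1, 0.9, 0.1])
queen = np.array([0.9, 0.1, 0.8, 0.0])

# Columns of A span the subspace; lstsq gives the coefficients of the
# orthogonal projection of `queen` onto span(king, man, woman).
A = np.stack([king, man, woman], axis=1)       # shape (4, 3)
coef, *_ = np.linalg.lstsq(A, queen, rcond=None)
proj = A @ coef

# The residual is orthogonal to the subspace.
residual = queen - proj
print(coef, np.linalg.norm(residual))
```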

"▁queen" is the closest match only if you exclude every version of king and woman. But this seems to be only because "▁queen" is already the second-closest match for "▁king". Involving "▁man" and "▁woman" only makes things worse.

I then tried looking up exactly what the word2vec result is, and I'm still not sure.

Wikipedia cites Mikolov et al. (2013). That paper is about embeddings from RNN language models, not word2vec, which is OK for my purposes, since I'm also not using word2vec. More problematic is that I don't know how to judge how strong their results are. I think the relevant result is this:

We see that the RNN vectors capture significantly more syntactic regularity than the LSA vectors, and do remarkably well in an absolute sense, answering more than one in three questions correctly.

which doesn't seem very strong. Also, I can't find any explanation of what LSA is. 

I also found this other paper, which is about word2vec embeddings and has this promising figure

But the caption is just a citation to this third paper, which doesn't have that figure! 

I've not yet read the last two papers in detail, and I'm not sure if or when I'll get back to this investigation.

If someone knows more about exactly what the word2vec embedding results are, please tell me. 

I don't think seeing it as a one-dimensional dial is a good picture here. 

The AI has lots and lots of sub-circuits, and many* of them can have more or less self-other overlap. For “minimal self-other distinction while maintaining performance” to do anything, it's sufficient that you can increase self-other overlap in some subset of these without hurting performance.

* All the circuits that have to do with agent behaviour or beliefs.

Cross-posted comment from Hold Off On Proposing Solutions — LessWrong

. . . 

I think the main important lesson is to not get attached to early ideas. Instead of banning early ideas, if anything comes up, you can just write it down and set it aside. I find this easier than a full ban, because it's just an easier move for my brain to make. 

(I have a similar problem with rationalist taboo. Don't ban words; instead, require people to locally define their terms for the duration of the conversation. It solves the same problem, and it isn't a ban on thought or speech.)

The other important lesson of the post is that, in the early discussion, you should focus on increasing your shared understanding of the problem rather than generating ideas. I.e. it's OK for ideas to come up (and when they do, you save them for later), but generating ideas is not the goal in the beginning. 

Hm, thinking about it, I think the mechanism of classical brainstorming (where you up front think of as many ideas as you can) is to exhaust all the trivial, easy-to-think-of ideas as fast as you can, so that you're then forced to think deeper to come up with new ones. I guess that's another way to do it. But I think this method is both ineffective and unreliable, since it only works through a secondary effect.

. . .

It is interesting to compare the advice in this post with the Game Tree of Alignment, or the Builder/Breaker Methodology, also here. I've seen variants of this exercise popping up in lots of places in the AI Safety community. Some of them are probably inspired by each other, but I'm pretty sure (80%) that this method has been invented several times independently.

I think that GTA/BBM works for the same reason the advice in the post works. It also solves the problem of not getting attached, and as you keep breaking your ideas and exploring new territory, you expand your understanding of the problem. I think an active ingredient in this method is that the people playing this game know that alignment is hard, and go in expecting their first several ideas to be terrible. You know the exercise is about noticing the flaws in your plans and learning from your mistakes. Without this attitude, I don't think it would work very well.

I'm reading In-context Learning and Induction Heads (transformer-circuits.pub)

This already strongly suggests some connection between induction heads and in-context learning, but beyond just that, it appears this window is a pivotal point for the training process in general: whatever's occurring is visible as a bump on the training curve (figure below). It is in fact the only place in training where the loss is not convex (monotonically decreasing in slope).

I can see the bump, but it's not the only one. The two-layer graph has a second, similar bump, which also exists in the one-layer model, and I think I can also see it very faintly in the three-layer model. Did they ignore the second bump because it only exists in small models, while their bump continues to exist in bigger models?
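
To make "not convex" operational when reading a loss curve: using the standard discrete definition, a curve is convex where its second difference is non-negative, so candidate bumps are the stretches where the second difference goes negative. A toy sketch on a synthetic loss curve (the bump location and shape are made up, just to show the detection):

```python
import numpy as np

# Synthetic loss curve: overall decreasing exponential, with an injected
# "bump" around step 40, loosely mimicking a phase-change window.
steps = np.arange(100)
loss = 5.0 * np.exp(-steps / 30.0)
loss[38:45] += np.linspace(0.0, 0.3, 7)        # inject a bump

# Convex where the discrete second difference is >= 0; stretches where it
# goes negative mark candidate bumps.
d2 = np.diff(loss, n=2)
nonconvex = np.where(d2 < -1e-6)[0]
print(nonconvex)
```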

I've recently crossed into being considered senior enough as an organiser that people are asking me for advice on how to run their events. I enjoy giving out advice, and it also makes me reflect on event design in new ways.

I think there are two types of good events. 

  • Purpose-driven event design
  • Unconference-type events

I think there is a continuum between these two types, but also think that if you plot the best events along this continuum, you'll find a bimodal distribution. 

Purpose-driven event design

When you organise one of these, you plan a journey for your participants. Everything is woven into a specific goal that is achieved by the end of the event. Everything fits together.

The Art of Gathering is a great manual for this type of event.

Unconference-type events

These can definitely have a purpose (e.g. exchanging ideas), but the purpose will be less precise than for the previous type, and more importantly, the purpose does not strongly drive the event design.

There will be designed elements around the edges, e.g. the opening and ending. But most of the event design just goes into supporting the unconference structure, which is not very purpose-specific. For most of the event, the participants will not follow a shared journey curated by the organisers; instead, everyone is free to pick their own adventure. 

Some advice from The Art of Gathering works for unconference-type events, e.g. the importance of pre-event communication and of the opening and ending. But a lot of the advice doesn't apply, which is why I noticed this division in the first place.

Strengths and weaknesses of each type

  • Purpose-driven events are more work, because you actually have to figure out the event design, and then you probably also have to run the program. With unconferences, you can just run the standard unconference format on whatever theme you like, and let your participants do most of the work of running the program.
  • An unconference doesn't require you to know the specific purpose of the event. You can just bring together an interesting group of people and see what happens. That's how you get Burning Man or LWCW.
  • However, if you have a specific purpose you want to achieve, you're much more likely to succeed if you actually design the event for that purpose.
  • There are lots of things that an unconference can't do at all. It's a very broadly applicable format, but not infinitely so.

I feel a bit behind on everything going on in alignment, so for the next weeks (or more) I'll focus on catching up on whatever I find interesting. I'll be using my shortform to record my thoughts. 

I make no promises that reading this is worth anyone's time.

Linda's alignment reading adventures part 1

What to focus on?

I do have some opinions on which alignment directions are more or less promising. I'll probably venture in other directions too, but my main focus is going to be on what I expect an alignment solution to look like. 

  1. I think that to have an aligned AI, it is necessary (but not sufficient) that we have shared abstractions/ontology/concepts (whatever you want to call it) with the AI. 
  2. I think the way to make progress on the above is to understand what ontology/concepts/abstractions our current AIs are using, and the process that shapes these abstractions. 
  3. I think the way to do this is through mech-interp, mixed with philosophising and theorising. Currently I think the mech-interp part (i.e. looking at what is actually going on in a network) is the bottleneck, since philosophising without data (i.e. agent foundations) has not made much progress lately. 

Conclusion: 

  • I'll mainly focus on reading up on mech-interp and related areas such as dev-interp. I've started on the interp section of Lucius's alignment reading list.
  • I should also read some John Wentworth, since his plan is pretty close to the path I think is most promising.

Feel free to throw other recommendations at me.

Some thoughts on things I've read so far

I just read 

I really liked Understanding and controlling a maze-solving policy network. It's a good experiment and a good writeup. 

But also, how interesting is this, really? Basically, they removed the cheese observation, and it made the agent act as if there were no cheese. This is not some sophisticated steering technique that we can use to align the AI's motivation.

I discussed this with Lucius, who pointed out that the interesting result is that the cheese location information is linearly separable from other information in the middle of the network, i.e. it's not scrambled in a completely opaque way.
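
A standard way to check that kind of "linearly decodable" claim is a linear probe: fit a linear readout from activations to the feature and see how much variance it explains. A toy sketch with synthetic "activations" (not the actual maze-network data; the cheese-coordinate setup here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for mid-network activations: 200 samples, 16 dims.
# Suppose a scalar "cheese x-coordinate" is linearly embedded in the
# activations along a fixed direction W, plus a little noise.
n, d = 200, 16
cheese_x = rng.uniform(-1, 1, size=n)
W = rng.normal(size=d)
acts = np.outer(cheese_x, W) + 0.05 * rng.normal(size=(n, d))

# Linear probe: least-squares readout of cheese_x from the activations.
coef, *_ = np.linalg.lstsq(acts, cheese_x, rcond=None)
pred = acts @ coef
r2 = 1 - np.sum((cheese_x - pred) ** 2) / np.sum((cheese_x - cheese_x.mean()) ** 2)
print(round(r2, 3))  # near 1.0 means linearly decodable
```

A high probe R² is exactly the sense in which the feature is "not scrambled": a single linear map suffices to read it out.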

Which brings me to Book Review: Design Principles of Biological Circuits

Alon’s book is the ideal counterargument to the idea that organisms are inherently human-opaque: it directly demonstrates the human-understandable structures which comprise real biological systems. 

Both these posts are evidence for the hypothesis that we should expect evolved networks to be modular, in a way that is possible for us to decode. 

By "evolved" I mean things in the same category as natural selection and gradient descent. 
