Gabe M

Technical AI governance researcher.

Comments

ML Safety Research Advice - GabeM
Gabe M · 1y

Traditionally, most people seem to do this through academic means: take those 1-2 courses at a university, then find fellow students in the course or grad students at the school interested in the same kinds of research as you and ask them to work together. In this digital age, you can also do this over the internet, so you aren't restricted to your local environment.

Nowadays, ML safety in particular has various alternative paths to finding collaborators and mentors:

  • MATS, SPAR, and other research fellowships
  • Post about the research you're interested in on online fora or contact others who have already posted
  • Find some local AI safety meetup group and hang out with them
AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0
Gabe M · 1y

Congrats! Could you say more about why you decided to add evaluations in particular as a new week?

[Paper] Stress-testing capability elicitation with password-locked models
Gabe M · 1y

Do any of your experiments compare the sample efficiency of SFT/DPO/EI (or similar) against simple few-shot prompting with the same number of samples? Sorry if I missed this, but it wasn't apparent on a first skim. That's what I thought you were going to compare based on the Twitter thread: "Can fine-tuning elicit LLM abilities when prompting can't?"

Daniel Kokotajlo's Shortform
Gabe M · 2y

What do you think about pausing between AGI and ASI to reap the benefits while limiting the risks and buying more time for safety research? Is this not viable due to economic pressures on whoever is closest to ASI to ignore internal governance, or were you just not conditioning on this case in your timelines and saying that an AGI actor could get to ASI quickly if they wanted?

Scale Was All We Needed, At First
Gabe M · 2y

Thanks! I wouldn't say I assert that interpretability should be a key focus going forward, though. If anything, I think this story shows that coordination, governance, and security are more important in very short timelines.

Scale Was All We Needed, At First
Gabe M · 2y

Good point--maybe something like "Samantha"?

Scale Was All We Needed, At First
Gabe M · 2y

Ah, interesting. I originally posted this in December (hence the older comments), but a few days ago I reposted it to my blog and edited this LW version to linkpost the blog.

It seems that editing this post from a non-linkpost into a linkpost somehow bumped its post date and pushed it to the front page. Maybe a LW bug?

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI
Gabe M · 2y

Related work

A nit, having not read your full post: should you include "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" in the related work? My mind pattern-matched to that exact piece from your very similar title, so my first thought was how your piece contributes new arguments.

What’s up with LLMs representing XORs of arbitrary features?
Gabe M · 2y

If true, this would be a big deal: if we could figure out how the model is distinguishing between basic feature directions and other directions, we might be able to use that to find all of the basic feature directions.

 

Or conversely, and maybe more importantly for interp, we could use this to find the less basic, more complex features. If that's possible, it might give a better definition of "concepts".

What’s up with LLMs representing XORs of arbitrary features?
Gabe M · 2y

Suppose $a \wedge b$ has a natural interpretation as a feature that the model would want to track and do downstream computation with, e.g. if $a$ = "first name is Michael" and $b$ = "last name is Jordan" then $a \wedge b$ can be naturally interpreted as "is Michael Jordan". In this case, it wouldn't be surprising if the model computed this AND as $f(x) = \mathrm{ReLU}((v_a + v_b) \cdot x + b_\wedge)$ and stored the result along some direction $v_f$ independent of $v_a$ and $v_b$. Assuming the model has done this, we could then linearly extract $a \oplus b$ with the probe

$$p_{a \oplus b}(x) = \sigma\left(-(\alpha v_f + v_a + v_b) \cdot x + b_\oplus\right)$$

for some appropriate $\alpha > 1$ and $b_\oplus$.[7]

 

Should the $-$ be inside the inner parentheses, like $\sigma((-\alpha v_f + v_a + v_b) \cdot x + b_\oplus)$ for $\alpha > 1$?

In the original equation, if $a$ and $b$ are both present in $x$, the vectors $v_a$, $v_b$, and $v_f$ would all contribute positively to the inner product with $x$, assuming $\alpha > 1$. However, for XOR we want the $v_a$ and $v_b$ inner products to oppose the $v_f$ inner product, so that the sign inside the sigmoid flips in the $a \wedge b$ case, right?
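
To make the question concrete, here is a minimal NumPy sketch (my own toy setup, not from the post) that assumes the representation is $x = a\,v_a + b\,v_b + (a \wedge b)\,v_f$ with orthonormal directions $v_a$, $v_b$, $v_f$; the values of $d$, $\alpha$, and $b_\oplus$ below are made up for illustration. Under that assumption, the pre-sigmoid score with the minus only on the $v_f$ term separates $a \oplus b$, while the form as written does not for any bias:

```python
# Toy check of the probe sign question, assuming (my assumption, not the post's)
# a representation x = a*v_a + b*v_b + (a AND b)*v_f with orthonormal directions.
import numpy as np

d = 16
v_a, v_b, v_f = np.eye(d)[:3]   # hypothetical orthonormal feature directions
alpha, b_xor = 2.0, -0.5        # any alpha > 1; toy bias to center the scores

def rep(a: int, b: int) -> np.ndarray:
    """Representation storing a, b, and the computed AND along v_f."""
    return a * v_a + b * v_b + (a & b) * v_f

w_original = -(alpha * v_f + v_a + v_b)   # minus applied to the whole direction, as written
w_flipped = -alpha * v_f + v_a + v_b      # minus only on the v_f term, as suggested above

for a in (0, 1):
    for b in (0, 1):
        x = rep(a, b)
        print(f"a={a} b={b} xor={a ^ b} "
              f"original={w_original @ x + b_xor:+.1f} "
              f"flipped={w_flipped @ x + b_xor:+.1f}")

# Under this toy encoding, the 'flipped' score is positive exactly when a XOR b = 1,
# whereas the 'original' form gives the XOR=1 cases lower scores than the (0,0) case,
# so no bias separates them.
```

Of course, if you intended a different sign convention for how the AND feature is stored, the original form could still be right; I just want to check which was meant.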

Wikitag Contributions

Akrasia (2y)

Posts

Four Phases of AGI (1y)
The Bitter Lesson for AI Safety Research (1y)
ML Safety Research Advice - GabeM (1y)
Scale Was All We Needed, At First (1y)
Anthropic | Charting a Path to AI Accountability (2y)
Sam Altman on GPT-4, ChatGPT, and the Future of AI | Lex Fridman Podcast #367 (2y)
PaLM API & MakerSuite (3y)
Levelling Up in AI Safety Research Engineering (3y)
The Tree of Life: Stanford AI Alignment Theory of Change (3y)
Favourite new AI productivity tools? (3y)