LESSWRONG
LW

Alex Lawsen
501Ω527420
Message
Dialogue
Subscribe

AI grantmaking at Open Philanthropy.

I used to give careers advice for 80,000 hours.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
3alexrjl's Shortform
3y
14
100+ concrete projects and open problems in evals
Alex Lawsen5mo90

Thanks for putting this list together. It seems like a great resource for people interested in the field and looking for somewhere to start.

I think many of the projects described within it seem likely to be within scope for this request for proposals. If you're reading this and are excited about taking on one of these projects but need funding to do so, I encourage you to apply!

Reply
The Online Sports Gambling Experiment Has Failed
Alex Lawsen10mo3716

A similar service exists in the UK - https://www.gamstop.co.uk/

I don't know if "don't even discuss other methods until you've tried this first" seems right to me, but I do think such services seem pretty great, and would guess that expanding/building on them (including e.g. requiring that any gambling advertising included an ad for them) would be a lot more tractable than pursuing harder bans.

What actually works is clearly the most important thing here, but aesthetically I do like the mechanism of "give people the ability to irreversibly self exclude" as a response to predatory/addictive systems.

Reply
Rowing vs steering
Alex Lawsen1y30

I think this post is great. Thanks for writing it!

Reply1
What are examples of someone doing a lot of work to find the best of something?
Answer by Alex LawsenJul 28, 202321

James Hoffman's coffee videos have this kind of vibe. The "tasting every Nespresso pod" one is a clear example, but I also really appreciate e.g. the explanations of how to blind taste

https://youtube.com/@jameshoffmann

Reply
AI x-risk, approximately ordered by embarrassment
Alex Lawsen2y40

Thanks, both for the thoughts and encouragement!

I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:


Appreciate you doing a quick version. I'm excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.

P(Doom) for each scenario would also be useful.


Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like:
- P(scenario B leads to doom | scenario A turns out not to be an issue by default)
- P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve)
- P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)

etc.

Reply
Deceptive failures short of full catastrophe.
Alex Lawsen2y30

I think my suggest usage is slightly better but I'm not sure it's worth the effort of trying to make people change, though I find 'camouflage' as a term useful when I'm trying to explain to people.

Reply
Deceptive failures short of full catastrophe.
Alex Lawsen2y30

Good question. I think there's a large overlap between them, including most of the important/scary cases that don't involve deceptive alignment (which are usually both). I think listing examples feels like the easiest way of explaining where they come apart:
- There's some kinds of 'oversight failure' which aren't 'scalable oversight failure' e.g. the ball grabbing robot hand thing. I don't think the problem here was oversight simply failing to scale to superhuman. This does count as camouflage.
- There's also some kinds of scalable oversight failure where the issue looks more like 'we didn't try at all' than 'we tried, but selecting based only on what we could see screwed us'. Someone just deciding to deploy a system and essentially just hoping that it's aligned would fall into this camp, but a more realistic case would be something like only evaluating a system based on its immediate effects, and then the long-run effects being terrible. You might not consider this a 'failure of scalable oversight', and instead want to call it a 'failure to even try scalable oversight', but I think the line is blurry - maybe people tried some scalable oversight stuff, it didn't really work, and then they gave up and said 'short term is probably fine'.
- I think most failures of scalable oversight have some story which roughly goes "people tried to select for things that would be good, and instead got things that looked like they would be good to the overseer". These count as both.

Reply
alexrjl's Shortform
Alex Lawsen3y10

Ok, I think this might actually be a much bigger deal than I thought. The basic issue is that weight decay should push things to be simpler if they can be made simpler without harming training loss.

This means that models which end up deceptively aligned should expect their goals to shift over time (to ones that can be more simply represented). Of course this also means that, even if we end up with a perfectly aligned model, if it isn't yet capable of gradient hacking, we shouldn't expect it to stay aligned, but instead we should expect weight decay to push it towards a simple proxy for the aligned goal, unless it is immediately able to realise this and help us freeze the relevant weights (which seems extremely hard).

(Intuitions here mostly coming from the "cleanup" stage that Neel found in his grokking paper)

Reply
alexrjl's Shortform
Alex Lawsen3y30

[epistemic status: showerthought] Getting language models to never simulate a character which does objectionable stuff is really hard. But getting language models to produce certain behaviours for a wide variety of prompts is much easier.

If we're worried about conditioning generative models getting more dangerous the more powerful LMs get, what if we fine tuned in an association between [power seeking/deception] and [wild overconfidence, including missing obvious flaws in plans, doing long "TV supervillain" style speeches before carrying out the final stage of a plan, etc.].

If the world model that gets learned is one where power seeking has an extremely strong association with poor cognition, maybe we either get bad attempts at treacherous turns before good ones, or models learning to be extremely suspicious of instrumental reasoning leading to power seeking given it's poor track record.

Reply
alexrjl's Shortform
Alex Lawsen3y10

I still think there's something here and still think that it's interesting, but since writing it has occurred to me that something like root access to the datacenter, including e.g. 'write access to external memory of which there is no oversight', could bound the potential drift problem at lower capability levels than I was initially thinking for a 'pure' gradient-hack of the sort described here.

Reply
Load More
No wikitag contributions to display.
151AI x-risk, approximately ordered by embarrassment
Ω
2y
Ω
7
17ELCK might require nontrivial scalable alignment progress, and seems tractable enough to try
2y
0
31A tension between two prosaic alignment subgoals
2y
8
33Deceptive failures short of full catastrophe.
3y
5
3alexrjl's Shortform
3y
14
27Thoughts on 'List of Lethalities'
3y
0
27An easy win for hard decisions
3y
0
44Incentive Problems With Current Forecasting Competitions.
5y
21