Alex Lawsen — LessWrong

LESSWRONG
LW

100+ concrete projects and open problems in evals

Thanks for putting this list together. It seems like a great resource for people interested in the field and looking for somewhere to start.

I think many of the projects described within it seem likely to be within scope for this request for proposals. If you're reading this and are excited about taking on one of these projects but need funding to do so, I encourage you to apply!

The Online Sports Gambling Experiment Has Failed

Alex Lawsen1y3815

A similar service exists in the UK - https://www.gamstop.co.uk/

I don't know if "don't even discuss other methods until you've tried this first" seems right to me, but I do think such services seem pretty great, and would guess that expanding/building on them (including e.g. requiring that any gambling advertising included an ad for them) would be a lot more tractable than pursuing harder bans.

What actually works is clearly the most important thing here, but aesthetically I do like the mechanism of "give people the ability to irreversibly self exclude" as a response to predatory/addictive systems.

Rowing vs steering

Alex Lawsen1y30

I think this post is great. Thanks for writing it!

What are examples of someone doing a lot of work to find the best of something?

Answer by Alex LawsenJul 28, 202321

James Hoffman's coffee videos have this kind of vibe. The "tasting every Nespresso pod" one is a clear example, but I also really appreciate e.g. the explanations of how to blind taste

https://youtube.com/@jameshoffmann

AI x-risk, approximately ordered by embarrassment

Alex Lawsen3y40

Thanks, both for the thoughts and encouragement!

I'd love to see the most important types of work for each failure mode. Here's my very quick version, any disagreements or additions are welcome:

Appreciate you doing a quick version. I'm excited for more attempts at this and would like to write something similar myself, though I might structure it the other way round if I do a high effort version (take an agenda, work out how/if it maps onto the different parts of this). Will try to do a low-effort set of quick responses to yours soon.

P(Doom) for each scenario would also be useful.

Also in the (very long) pipeline, and a key motivation! Not just for each scenario in isolation, but also for various conditionals like:
- P(scenario B leads to doom | scenario A turns out not to be an issue by default)
- P(scenario B leads to doom | scenario A turns out to be an issue that we then fully solve)
- P(meaningful AI-powered alignment progress is possible before doom | scenario C is solved)

etc.

Deceptive failures short of full catastrophe.

Alex Lawsen3y30

I think my suggest usage is slightly better but I'm not sure it's worth the effort of trying to make people change, though I find 'camouflage' as a term useful when I'm trying to explain to people.

Deceptive failures short of full catastrophe.

Alex Lawsen3y30

Good question. I think there's a large overlap between them, including most of the important/scary cases that don't involve deceptive alignment (which are usually both). I think listing examples feels like the easiest way of explaining where they come apart:
- There's some kinds of 'oversight failure' which aren't 'scalable oversight failure' e.g. the ball grabbing robot hand thing. I don't think the problem here was oversight simply failing to scale to superhuman. This does count as camouflage.
- There's also some kinds of scalable oversight failure where the issue looks more like 'we didn't try at all' than 'we tried, but selecting based only on what we could see screwed us'. Someone just deciding to deploy a system and essentially just hoping that it's aligned would fall into this camp, but a more realistic case would be something like only evaluating a system based on its immediate effects, and then the long-run effects being terrible. You might not consider this a 'failure of scalable oversight', and instead want to call it a 'failure to even try scalable oversight', but I think the line is blurry - maybe people tried some scalable oversight stuff, it didn't really work, and then they gave up and said 'short term is probably fine'.
- I think most failures of scalable oversight have some story which roughly goes "people tried to select for things that would be good, and instead got things that looked like they would be good to the overseer". These count as both.

alexrjl's Shortform

Alex Lawsen3y10

Ok, I think this might actually be a much bigger deal than I thought. The basic issue is that weight decay should push things to be simpler if they can be made simpler without harming training loss.

This means that models which end up deceptively aligned should expect their goals to shift over time (to ones that can be more simply represented). Of course this also means that, even if we end up with a perfectly aligned model, if it isn't yet capable of gradient hacking, we shouldn't expect it to stay aligned, but instead we should expect weight decay to push it towards a simple proxy for the aligned goal, unless it is immediately able to realise this and help us freeze the relevant weights (which seems extremely hard).

(Intuitions here mostly coming from the "cleanup" stage that Neel found in his grokking paper)

alexrjl's Shortform

Alex Lawsen3y30

[epistemic status: showerthought] Getting language models to never simulate a character which does objectionable stuff is really hard. But getting language models to produce certain behaviours for a wide variety of prompts is much easier.

If we're worried about conditioning generative models getting more dangerous the more powerful LMs get, what if we fine tuned in an association between [power seeking/deception] and [wild overconfidence, including missing obvious flaws in plans, doing long "TV supervillain" style speeches before carrying out the final stage of a plan, etc.].

If the world model that gets learned is one where power seeking has an extremely strong association with poor cognition, maybe we either get bad attempts at treacherous turns before good ones, or models learning to be extremely suspicious of instrumental reasoning leading to power seeking given it's poor track record.

alexrjl's Shortform

Alex Lawsen3y10

I still think there's something here and still think that it's interesting, but since writing it has occurred to me that something like root access to the datacenter, including e.g. 'write access to external memory of which there is no oversight', could bound the potential drift problem at lower capability levels than I was initially thinking for a 'pure' gradient-hack of the sort described here.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments