Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling.
I write the ML Safety Newsletter.
DMs open, especially for promising opportunities in AI Safety and potential collaborators.
(I might write a post on this at some point.)
There's a meditation technique that I have used to improve my typing speed, but that seems pretty generalizable: Open up a typing test[1] and try to predict my mistakes in advance of them happening. This could look like my finger slipping, or running into a faulty muscle memory for a certain word, or just having a cache miss and stumbling for a second. Then, I use this awareness to not make those mistakes, ideally stopping them before they happen even once.
I've learned to type from scratch several times, going from hunt-and-peck to touch typing QWERTY, to touch typing Colemak, to learning to use the Kinesis Advantage 2, to learning the CharaChorder 2 and its custom layout, which is now my daily driver. I only started doing this meditation about halfway through learning Colemak, and it noticeably boosted my accuracy in a fairly lasting way. However, it's pretty straining to meditate while also trying to type as fast as you can, especially on the CharaChorder, because it carries an entirely new kind of cognitive load that I'm still learning.
I would probably generalize this if I were trying to get really good at another DEX-reliant skill, but for now I'm not. It feels related to the part of me that more generally notices when I'm Predictably Wrong, but in practice it has felt like a meaningfully different thing to train.
How did you determine its cost and speed, given that we don't have access to a unified model, just some router between models? Unless I'm misunderstanding something about what GPT-5 even is.
It's a balance between getting the utility out of using smarter and smarter assistants and not being duped by them. This is really hard, and it's definitely not a bet that everyone should make.
I mostly agree with this, but also think it's good to just say the sane things labs should do, even if I don't expect statements like mine to make a difference on average.
There's some hope that, because interpretable CoT is mundanely useful, there's an incentive for even the capabilities people to keep it.
(I don't think you included it.) The blurb on corrigibility is also really good. Yet another thing I'm not sure Eliezer has actually written up anywhere else.
This does seem to be getting closer, yes. I still think the models are overall too stupid to do meaningful deception yet, though I haven't gotten to play around with Opus 4. My use cases have also shifted in this time toward less hackable things.
Shrug, it does work for me over long spans (usually months before I need another). This was a recent patch to a recent problem, but I've had this technique for years, and it does in fact get me out of positive feedback loops of hedonic set-point raising, no ketamine required. If I had to guess why it lasts, I'd say it serves as a good reminder and willpower booster that lets me resist further really-useless superstimuli.
Oops, I wrote that without fully thinking about diffusion models. I meant to contrast diffusion LMs to more traditional autoregressive language transformers, yes. Thanks for the correction, I'll clarify my original comment.
Fixed, thank you!