I applied for Thomas Kwa's SPAR stream, but I have some doubts about the direction of the research, so I'm posting them here to get feedback. Kwa wants to train models to produce something close to neuralese as reasoning traces and then evaluate white-box and black-box monitoring against those traces. It seems obvious to me that once a model switches to neuralese we already know something is wrong, so why test our monitors against neuralese at all?
Wikipedia?
source on the ED risk?
do you want to stop worrying?
I think that on most websites only about 1-10% of users actually post anything. I suspect the number of people having these weird interactions with LLMs (and stopping before posting about them) is something like 10-10,000 times (most likely around 100 times) larger than what we see here.
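A minimal back-of-the-envelope sketch of that multiplier, as a Python snippet; the observed count is a made-up placeholder, and only the 1-10% participation rate and the ~100x best guess come from the estimate above:

```python
# Back-of-the-envelope lurker multiplier (illustrative numbers only).
observed_posters = 50  # hypothetical count of people we actually see posting about this

# Assumption from above: only ~1-10% of users ever post anything.
participation_low, participation_high = 0.01, 0.10

implied_min = observed_posters / participation_high  # ~10x the visible count
implied_max = observed_posters / participation_low   # ~100x the visible count
best_guess = observed_posters * 100                  # the "most likely ~100x" guess

print(f"Implied affected users: {implied_min:.0f}-{implied_max:.0f}, best guess ~{best_guess}")
```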
why?
The goals we set for AIs in training are proxy goals. We humans also set proxy goals: we use KPIs and budgets, and we talk about solving alignment or ending malaria (proxies for increasing utility and saving lives), and so on. Yet we can somehow focus on a proxy goal while keeping the higher-level goal in mind at the same time. How is this possible? How can we teach AIs to do that?
So I got what I wanted. I tried the Zed code editor and, well... it's free and very agentic. I haven't tried Cursor, but I suspect it's on about the same level.
I don't think that anymore. I think it's possible to get labs to use your work (e.g. if you devise a new eval or a new mech interp technique that solves some important problem), but it has to be good enough, and you need to find a way to communicate it. I changed my mind after EAG London.
I briefly read the 6pack.care website and your post. It sounds to me like an idea that supplements existing AI safety paradigms rather than solving the core problem of aligning AIs. Looking at your website, I see that it already assumes AI is mostly aligned; issues with rogue AIs are not mentioned in the risks section, and so on.