This isn't directly related to TMS, but I've been trying to get an answer to this question for years, and maybe you have one.
When doing TMS, or any depression treatment, or any supplementation experiment, etc., it would make sense to track the effects objectively (in addition to, not as a replacement for, subjective monitoring). I haven't found any particularly good option for this, especially if I want to self-administer it most days. Quantified Mind comes close, but its interface makes it really hard to construct a custom battery and run an indefinite experiment.
Do you know of anything?
Would you say that models designed from the ground up to be collaborative and capabilitarian would be a net win for alignment, even if they're not explicitly weakened in terms of helping people develop capabilities? I'd be worried that they could multiply human efforts equally, but with humans spending more effort on capabilities, that's still a net negative.
I really appreciate the call-out that modern RL for AI does not equal reward-seeking (though I also appreciate @tailcalled's reminder that historical RL did involve reward during deployment); this point has been made before, but not so thoroughly or clearly.
A framing that feels alive for me is that AlphaGo didn't significantly innovate in goal-directed search (applying MCTS was clever, but not new), but did innovate in both data generation (using search to generate training data, which in turn improves the search) and offline RL.
before:
after:
Here the difference seems to be only spacing, but I've also seen bulleted lists appear. I think, though I can't recall for sure, that I've seen something similar happen to top-level posts.
This is also mitigated by automatically generated images like Gravatar or the SSH key randomart visualization. I wonder if they can be made small enough to add to usernames everywhere while still maintaining enough distinguishable representations.
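For a sense of what a username-derived image could look like, here's a minimal identicon-style sketch in Python. The hashing scheme and 5x5 grid are illustrative assumptions, not how Gravatar or SSH randomart actually work:

```python
import hashlib

def identicon(name: str, size: int = 5):
    """Deterministic identicon-style grid from a username hash.
    Horizontal symmetry (as in GitHub-style identicons) makes the
    pattern easier to recognize at a glance. Purely illustrative."""
    digest = hashlib.sha256(name.encode()).digest()
    half = (size + 1) // 2  # generate 3 columns, mirror to 5
    grid = []
    for row in range(size):
        cells = [digest[(row * half + col) % len(digest)] % 2 == 1
                 for col in range(half)]
        grid.append(cells + cells[-2::-1])  # mirror for symmetry
    return grid

# Same name always yields the same pattern; different names usually differ.
print("\n".join("".join("#" if c else "." for c in row)
                for row in identicon("alice")))
```

Even a grid this coarse gives 2^15 distinguishable patterns, which suggests a tiny inline image could carry enough entropy to disambiguate similar usernames.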
Note that every issue you mentioned here can be dealt with by trading off capabilities.
Yes. The trend I see is "pursue capabilities, worry about safety as an afterthought if at all". Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad's proposal), but most orgs don't want to bite the bullet of a heavy alignment tax.
I also think you're underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the "HF" part lets the model know what responses are desired. Also, for deception, I wasn't specifically thinking of strategic deception; for general deception, limiting situational awareness doesn't prevent it arising (though it lessens its danger), and if you want to avoid the capability, you'd need to avoid any mention of e.g. honesty in the training.
I think the "sleeper agent" paper needs to be reframed. The model isn't plotting to install a software backdoor; the training data instructed it to do so. Put simply, sabotaged data was used for model training.
This framing is explicitly discussed in the paper. The point was to argue that RLHF without targeting the hidden behavior wouldn't eliminate it. One threat to validity is the possibility that artificially induced hidden behavior is different than naturally occurring hidden behavior, but that's not a given.
If you want a 'clean' LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean.
First of all, it's impossible to get "100% clean data", but there is a question of whether, say, five 9s of cleanliness is enough; it shouldn't be, if you want a training pipeline that's capable of learning from rare examples. Separate from that, some behavior is either subtle or emergent; examples include "power seeking", "sycophancy", and "deception". You can't reliably eliminate these from the training data because they're not purely properties of the data.
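To make the "five 9s" point concrete, here's a back-of-envelope sketch; the corpus size is an illustrative assumption, not a figure from the thread:

```python
# Back-of-envelope: even "five 9s" of cleanliness leaves a huge absolute
# number of dirty examples at pretraining scale.
tokens = 10**13           # assumed pretraining corpus size (illustrative)
dirty_rate = 1 - 0.99999  # five 9s clean => ~1e-5 dirty
dirty_tokens = tokens * dirty_rate
print(f"{dirty_tokens:.0e} dirty tokens")  # → 1e+08
```

If the pipeline is sensitive enough to learn from rare examples, on the order of a hundred million dirty tokens is plenty to learn an unwanted behavior from.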
Since the sleeper agents paper, I've been thinking about a special case of this, namely trigger detection in sleeper agents. The simple triggers in the paper seem like they should be easy to detect by the "attention monitoring" approach you allude to, but it also seems straightforward to design more subtle, or even steganographic, triggers.
A question I'm curious about is whether hard-to-detect triggers also induce more transfer from the non-triggered case to the triggered case. I suspect not, but I could see it happening, and it would be nice if it did.
Not my medical miracle post, just my comment on it.
Yes, though I wouldn't restrict it to "clinical" because I care about non-medical outcomes, and "bio-physical" seems restrictive, though based on your example, that seems to be just my interpretation of the term.
These are legitimate biomarkers, but they're not what I want, and I'm struggling to explain specifically why; the two things that come up are that they have low statistical power and that they're a particularly lagging indicator (imagine, for contrast, being able to tell whether an antidepressant will work for you after taking it for a week, even if it takes two months to feel the effects). They're fine and useful for statistics, and even for measuring the effectiveness of a treatment in an individual, but a lot less useful for experimenting.
That sounds really cool. I'm assuming there's nothing actionable available right now for patients?
Yep. This is basically what I'm hoping to monitor in myself. For example, better vigilance might translate to better focus on work tasks, or better selective attention might imply better impulse control.
QM doesn't work well on phones, hasn't been updated in years, and has major UX issues for my use case that make it too hard to work with. It also doesn't expose the raw statistics. CogniFit (the only app I've found that does assessment and not just "brain training") reports even less.
Do you have a specific app that you know of?
I don't think this is true. My alternative hypothesis (which I think is also compatible with the data) is that it's not hard, but there's no money in it, so there's not much commercial "free energy" making it happen; it's tedious, so there's not much hobbyist "free energy"; and academia is slow at things like this.