rpglover64

Not my medical miracle post, just my comment on it.

"Objectively" for me would translate to "biomarker" i.e., a bio-physical signal that predicts a clinical outcome.

Yes, though I wouldn't restrict it to "clinical" because I care about non-medical outcomes, and "bio-physical" seems restrictive, though based on your example, that seems to be just my interpretation of the term.

Note that for depression and many psychological issues this means that we find the biomarkers by asking people how they feel

These are legitimate biomarkers, but they're not what I want, and I'm struggling to explain specifically why; the two things that come up are that they have low statistical power and they're a particularly lagging indicator (imagine for contrast e.g. being able to tell whether an antidepressant would work for you after taking it for a week, even if it takes two months to feel the effects). They're fine and useful for statistics, and even for measuring the effectiveness of a treatment in an individual, but a lot less useful for experimenting.

I'm assuming you mean biomarkers for psychological / mental health outcomes specifically. This is spiritually pretty close to what my lab studies - ways to predict how TMS will affect individuals, and adjust it to make it work better in each person.

That sounds really cool. I'm assuming there's nothing actionable available right now for patients?

Our philosophy [...] is that the effects of an intervention will manifest most reliably in reactions to very simple cognitive tasks like vigilance, working memory, and so on. Most serious health issues impact your reaction times, accuracy, bias, etc. in subtle but statistically reliable ways.

Yep. This is basically what I'm hoping to monitor in myself. For example, better vigilance might translate to better focus on work tasks, or better selective attention might imply better impulse control.

Measuring these with random sampling from a phone app and doing good statistics on the data is probably your best bet for objectively assessing interventions. Maybe that is what Quantified Mind does, I'm not sure?

QM doesn't work so well on a phone, hasn't been updated in years, and has major UX issues for my use case that make it too hard to work with. It also doesn't expose the raw statistics. Cognifit (the only app I've found that does assessment and not just "brain training") reports even less.

Do you have a specific app that you know of?
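As a rough illustration of the "good statistics" step on self-collected data, here is a minimal sketch of a permutation test comparing reaction times with and without an intervention. The function name `permutation_test` and the reaction-time numbers are made up for illustration; they don't come from Quantified Mind, Cognifit, or any real dataset, and a real analysis would also need to handle practice effects and time-of-day confounds.

```python
import random
from statistics import mean


def permutation_test(baseline, treatment, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean reaction time.

    Returns (observed difference, approximate p-value).
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(baseline)
    pooled = list(baseline) + list(treatment)
    n_base = len(baseline)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the condition labels and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[n_base:]) - mean(pooled[:n_base])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical reaction times in milliseconds (made-up numbers).
baseline_ms = [412, 398, 430, 405, 421, 417, 409, 426]
treatment_ms = [389, 401, 380, 395, 384, 399, 391, 387]

observed_diff, p = permutation_test(baseline_ms, treatment_ms)
print(f"mean difference: {observed_diff:.1f} ms, p ~ {p:.4f}")
```

A permutation test is used here only because it makes few distributional assumptions about reaction times; it is one option among several, not a recommendation from the thread.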

The short answer is that if this were easy, it would already be popular, because we clearly need it. A lot of academic labs and industry people are trying to do this all the time. There is growing success, but it's slow growing and fraught with non-replicable work.

I don't think this is true. My alternative hypothesis (which I think is also compatible with the data) is that it's not hard, but there's no money in it, so there's not much commercial "free energy" making it happen; it's tedious, so there's not much hobbyist "free energy"; and academia is slow at things like this.

This isn't directly related to TMS, but I've been trying to get an answer to this question for years, and maybe you have one.

When doing TMS, or any depression treatment, or any supplementation experiment, etc., it would make sense to track the effects objectively (in addition to, not as a replacement for, subjective monitoring). I haven't found any particularly good option for this, especially if I want to self-administer it most days. Quantified Mind comes close, but it's really hard to use their interface to construct a custom battery and an indefinite experiment.

Do you know of anything?


Would you say that models designed from the ground up to be collaborative and capabilitarian would be a net win for alignment, even if they're not explicitly weakened in terms of helping people develop capabilities? I'd be worried that they could multiply human efforts equally, but with humans spending more effort on capabilities, that's still a net negative.

I really appreciate the call-out that modern RL for AI does not equal reward-seeking (though I also appreciate @tailcalled's reminder that historical RL did involve reward during deployment); this point has been made before, but not so thoroughly or clearly.

A framing that feels alive for me is that AlphaGo didn't significantly innovate in goal-directed search (applying MCTS was clever, but not new) but did innovate in both data generation (using search to generate training data, which in turn improves the search) and offline RL.

before: 

after: 

Here the difference seems to be only spacing, but I've also seen bulleted lists appear. I think, though I can't recall for sure, that I've seen something similar happen to top-level posts.

@Habryka @Raemon I'm experiencing weird rendering behavior on Firefox on Android. Before voting, comments are sometimes rendered incorrectly in a way that gets fixed after I vote on them.

Is this a known issue?

This is also mitigated by automatic images like gravatar or the ssh key visualization. I wonder if they can be made small enough to just add to usernames everywhere while maintaining enough distinguishable representations.

Note that every issue you mentioned here can be dealt with by trading off capabilities.

Yes. The trend I see is "pursue capabilities, worry about safety as an afterthought if at all". Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad's proposal), but most orgs don't want to bite the bullet of a heavy alignment tax.

I also think you're underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the "HF" part lets the model know what responses are desired. Also, for deception, I wasn't specifically thinking of strategic deception; for general deception, limiting situational awareness doesn't prevent it arising (though it lessens its danger), and if you want to avoid the capability, you'd need to avoid any mention of e.g. honesty in the training.

The "sleeper agent" paper I think needs to be reframed. The model isn't plotting to install a software backdoor, the training data instructed it to do so. Or simply put, there was sabotaged information used for model training.

This framing is explicitly discussed in the paper. The point was to argue that RLHF without targeting the hidden behavior wouldn't eliminate it. One threat to validity is the possibility that artificially induced hidden behavior is different than naturally occurring hidden behavior, but that's not a given.

If you want a 'clean' LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean.

First of all, it's impossible to get "100% clean data", but there is a question of whether 5 9s of cleanliness is enough; it shouldn't be, if you want a training pipeline that's capable of learning from rare examples. Separate from that, some behavior is either subtle or emergent; examples include "power seeking", "sycophancy", and "deception". You can't reliably eliminate them from the training data because they're not purely properties of data.
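To put rough numbers on "5 9s of cleanliness", here is a back-of-the-envelope sketch. The corpus sizes are illustrative assumptions, not figures from the sleeper agents paper or from any real training run, and `expected_unclean` is a hypothetical helper.

```python
def expected_unclean(n_examples: int, clean_fraction: float) -> float:
    """Expected number of unclean examples in a corpus of n_examples."""
    return n_examples * (1.0 - clean_fraction)


FIVE_NINES = 0.99999  # 99.999% of examples are clean

# Assumed corpus sizes, chosen only to show the scaling.
for n in (10**6, 10**9, 10**12):
    print(f"{n:>16,} examples -> ~{round(expected_unclean(n, FIVE_NINES)):,} unclean")
```

At a billion examples, five nines still leaves on the order of ten thousand unclean examples, which is plenty for a pipeline that can learn from rare examples.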

Since the sleeper agents paper, I've been thinking about a special case of this, namely trigger detection in sleeper agents. The simple triggers in the paper seem like they should be easy to detect by the "attention monitoring" approach you allude to, but it also seems straightforward to design more subtle, or even steganographic, triggers.

A question I'm curious about is whether hard-to-detect triggers also induce more transfer from the non-triggered case to the triggered case. I suspect not, but I could see it happening, and it would be nice if it does.
