If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
I hear you as saying "If we don't have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?"
One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.
But if you want the best evaluation of a research assistant's capabilities, I agree using it as a research assistant is more reliable.
A separate issue I have here is the assumption that you don't have to worry about teaching an AI to make human-friendly decisions if you're using it as a research assistant, and therefore we can go full speed ahead trying to make general-purpose AI as long as we mean to use it as a research assistant. A big "trust us, we're the good guys" vibe.
Relative to string theory, getting an AI to help us do AI alignment is much more reliant on teaching the AI to give good suggestions in the first place - and not merely "good" in the sense of highly rated, but good in the contains-hard-parts-of-outer-alignment kinda way. So I disagree with the assumption in the first place.
And then I also disagree with the conclusion. Technology proliferates, and there are misuse opportunities even within an organization that's 99% "good guys." But maybe this is a strategic disagreement more than a factual one.
Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate - and if you don't eliminate them, they'll keep happening until some do slip through. I think I take them more seriously than you.
This was a cool, ambitious idea. I'm still confused about your brain score results. Why did the "none" fine-tuned models have good results? Were none of your models successful at learning the brain data?
I got 7/18.
See the discussion section.
We have developed a physics-based, and observable early warning signal characterizing the tipping point of the AMOC: the minimum of the AMOC-induced freshwater transport at 34°S in the Atlantic, here indicated by FovS. The FovS minimum occurs 25 years (9 to 41, 10 and 90% percentiles) before the AMOC tipping event. The quantity FovS has a strong basis in conceptual models, where it is an indicator of the salt-advection feedback strength. Although FovS has been shown to be a useful measure of AMOC stability in GCMs, the minimum feature has so far not been connected to the tipping point because an AMOC tipping event had up to now not been found in these models. The FovS indicator is observable, and reanalysis products show that its value and, more importantly, its trend are negative at the moment. The latest CMIP6 model simulations indicate that FovS is projected to decrease under future climate change. However, because of freshwater biases, the CMIP6 FovS mean starts at positive values and only reaches zero around the year 2075. Hence, no salt-advection feedback–induced tipping is found yet in these models under climate change scenarios up to 2100 and longer simulations under stronger forcing would be needed (as we do here for the CESM) to find this. In observations, the estimated mean value of FovS is already quite negative, and therefore, any further decrease is in the direction of a tipping point (and a stronger salt-advection feedback). A slowdown in the FovS decline indicates that the AMOC tipping point is near.
Model year 1750 does not mean 1750 years from now. The model is subtly different from reality in several ways. Their point is they found some indicator (this FovS thing) that hits a minimum a few decades before the big change, in a way that maybe generalizes from the model to reality.
In the model, this indicator starts at 0.20, slowly decreases, and hits a minimum at -0.14 of whatever units, ~25 years before the AMOC tipping point.
In reality, this indicator was already at -0.5, and is now somewhere around -0.1 or -0.15.
This is a bit concerning, although to reiterate, the model is subtly different from reality in several ways. Exact numerical values don't generalize that well, it's the more qualitative thing - the minimum of their indicator - that has a better chance of warning us, and we have not (as far as we can tell) hit a minimum. Yet.
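To make the qualitative signal concrete, here's a toy sketch (all numbers invented, not taken from the paper) of watching for a minimum in a noisy FovS-like time series - the point is that the early warning is the *turning point*, not any particular numerical value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy FovS-like series: starts positive, declines slowly, bottoms out
# around model year 100, then rebounds slightly before a hypothetical
# tipping event. All values are made up for illustration.
years = np.arange(200)
trend = 0.20 - 0.0034 * years
rebound = np.where(years > 100, 0.005 * (years - 100), 0.0)
fovs = trend + rebound + rng.normal(0.0, 0.005, size=years.size)

# Smooth with a moving average so noise alone doesn't create false minima.
window = 11
kernel = np.ones(window) / window
smoothed = np.convolve(fovs, kernel, mode="valid")

# The early-warning signal is where the smoothed indicator bottoms out.
min_index = int(np.argmin(smoothed))
print("smoothed minimum near model year", min_index + window // 2)
```

In the paper's model run, an analogous minimum shows up roughly 25 years before the tipping event; in observations, we haven't (as far as we can tell) seen one yet.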
There's a huge amount of room for you to find whatever patterns are most eye-catching to you, here.
I was sampling random embeddings at various distances from the centroid and prompting GPT-J to define them. One of these random embeddings, sampled at distance 5, produced the definition [...]
How many random embeddings did you try sampling, that weren't titillating? Suppose you kept looking until you found mentions of female sexuality again - would this also sometimes talk about holes, or would it instead sometimes talk about something totally different?
How would the AI do something like this if it ditched the idea that there existed some perfect U*?
Assuming the existence of things that turn out not to exist does weird things to a decision-making process. In extreme cases, it starts "believing in magic" and throwing away all hope of good outcomes in the real world in exchange for the tiniest advantage in the case that magic exists.
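As a toy illustration (all numbers made up): once you assign even a sliver of probability to a magical outcome with an enormous payoff, the expected-utility comparison can be dominated by that sliver.

```python
# Two actions: "sensible" does well in the real world;
# "ritual" is useless in reality but pays off hugely *if* magic exists.
p_magic = 1e-9            # tiny credence that magic exists
u_sensible = 100.0        # solid real-world payoff either way
u_ritual_real = 0.0       # the ritual wastes effort in the real world
u_ritual_magic = 1e12     # astronomical payoff conditional on magic

eu_sensible = (1 - p_magic) * u_sensible + p_magic * u_sensible
eu_ritual = (1 - p_magic) * u_ritual_real + p_magic * u_ritual_magic

# Expected utility now favors the ritual, despite its near-certain uselessness.
print(eu_sensible, eu_ritual)
```

The decision process throws away essentially all real-world value for a payoff conditional on something that doesn't exist.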
Is this an alignment approach? How does it solve the problem of getting the AI to do good things and not bad things? Maybe this is splitting hairs, sorry.
It's definitely possible to build AI safely if it's temporally and spatially restricted, if the plans it optimizes are never directly used as they were modeled to be used but are instead run through processing steps that involve human and AI oversight, if it's never used on broad enough problems that oversight becomes challenging, and so on.
But I don't think of this as alignment per se, because there's still tremendous incentive to use AI for things that are temporally and spatially extended, that involve planning based on an accurate model of the world, that react faster than human oversight allows, that are complicated domains that humans struggle to understand.
My take is that they work better the more that the training distribution anticipates the behavior we want to incentivize, and also the better that humans understand what behavior they're aiming for.
So if used as a main alignment technique, they only work in a sort of easy-mode world, where if you get a par-human AI to have kinda-good behavior on the domain we used to create it, that's sufficient for the human-AI team to do better at creating the next one, and so on until you get a stably good outcome. A lot like the profile of RLHF, except trading off human feedback for AI generalization.
I think the biggest complement to activation steering is research on how to improve (from a human perspective) the generalization of AI internal representations. And I think a good selling point for activation steering research is that the reverse is also true - if you can do okay steering by applying a simple function to some intermediate layer, that probably helps do research on all the things that might make that even better.
Overall, though, I'm not that enthusiastic about it as a rich research direction.
the simplistic view that IRL agents hold about ground truth in human values (ie. the human behavior they’re observing is always perfectly displaying the values)
IRL typically involves an error model - a model of how humans make errors. If you've ever seen the phrase "Boltzmann-rational" in an IRL paper, it's the assumption that humans most often do the best thing but can sometimes do arbitrarily bad things (just with an exponentially decreasing probability).
This is still simplistic, but it's simplistic on a higher level :P
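A minimal sketch of what "Boltzmann-rational" means in practice (the reward values here are invented for illustration): action probability is proportional to the exponential of the action's value, so the best action is most likely, but no action - however bad - has exactly zero probability.

```python
import numpy as np

def boltzmann_policy(rewards, beta=1.0):
    """P(a) proportional to exp(beta * r(a)): higher-reward actions are
    more likely, but every action keeps nonzero probability."""
    rewards = np.asarray(rewards, dtype=float)
    logits = beta * rewards
    logits -= logits.max()        # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy per-action rewards, as an IRL model might score a human's options.
rewards = [3.0, 1.0, -2.0]
probs = boltzmann_policy(rewards, beta=1.0)
```

Turning `beta` up makes the modeled human more reliably optimal; `beta = 0` makes them choose uniformly at random.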
If you haven't read Reducing Goodhart, it's pretty related to the topic of this post.
Ultimately I'm not satisfied with any proposals we have so far. There's sort of a philosophy versus engineering culture difference, where in philosophy we'd want to hoard all of these unsatisfying proposals, and occasionally take them out of their drawer and look at them again with fresh eyes, while in engineering the intuition would be that the effort is better spent making progress toward new and different ideas.
I think there's a divide here between implementing ethics, and implementing meta-ethics. E.g. trying to give rules for how to weight your past and future selves, vs. trying to give rules for what good rules are. When in doubt, shift gears towards implementing meta-ethics: it's cheaper because we don't have the time to write down a complete ethics for an AI to follow, it's necessary because we can't write down a complete ethics for an AI to follow, and it's unavoidable because AIs in the real world will naturally do meta-ethics.
To expand on that last point - a sufficiently clever AI operating in the real world will notice that it itself is part of the real world. Actions like modifying itself are on the table, and have meta-ethical implications. This simultaneously makes it hard to prove convergence for any real-world system, while also making it seem likely that all sufficiently clever AIs in the real world will converge to a state that's stable under consideration of self-modifying actions.