I am a scientist by vocation (and also by profession), in particular a biologist. Entangled with this calling is an equally deep and long-standing interest in epistemology. The kinds of scientific explanations I find satisfying are quantitative (often statistical) models or theories that parsimoniously account for empirical biological observations in normative (functional, teleological) terms.
My degrees (BS, PhD) are in biology but my background is interdisciplinary, including also philosophy, psychology, mathematics/statistics, and computer science/machine learning. For the last few decades I have been researching the neurobiology of sensory perception, decision-making, and value-based choice in animals (including humans) and in models.
I also do work in philosophy of science (focusing on the warrant of inductive inference and methodological rigor in exploratory research) and philosophy of biology (focusing on teleology and the biological basis of goal-directed behavior).
I consider myself a visitor to your forum, in that my context comes mainly from outside it. A few months ago I had never heard of this group, and had never considered the possibility of catastrophic risk from AGI.
Update: I have updated on AI risk such that I now consider it a good reason to prioritize research that might mitigate that risk over other projects, and to refrain from lines of research that could advance progress toward AGI. I'm considering working directly on AI safety, but it remains to be seen if this is a fit.
I have several ideas that fall within the broad scope of LW's interests, but which I have been developing independently for a long time (decades) outside of this conversation. Many of these ideas seem similar to ones others express here (which is exciting!). But in the interest of preserving potentially important subtle differences, I will be articulating them in my own terms instead of recasting them in LW-ese, at least as a starting point. If it turns out I've arrived entirely independently at exactly the same idea others have reached, that adds more epistemic support than if my line of reasoning had been influenced by theirs, or vice versa.
I think this kind of funding has outsized impact.
Voting or otherwise delegating selection to the hive mind seems like a good way to minimize any potential impact.
Delegating to an already successful, widely respected authority in the domain is better, but those people are probably already steering most of the effort and funding in the field, either directly or by swaying general opinion.
A nomination-based approach seems like a good way to tap the wisdom of many different highly qualified people without filtering through a consensus process. It also reduces the opportunity for gaming by insincere candidates, reduces the number of people who will spend time applying, and limits the number of applications you have to read.
For example: pick a nominating committee of maybe 10 people you think are wise, smart, knowledgeable, independent-thinking, and different from each other, who would not be candidates now but would be in a position to be aware of potential candidates. Ask each of them to nominate one candidate per available slot, with a brief statement about why. The nominators' identities should be secret, even from each other. Your goals and criteria should be articulated to them.
You could either screen those based on the nominators' statements or invite short preliminary applications from all the nominees; either way, decide which of them you are personally most excited about and invite only 2-3 full applications or interviews per available slot. In the end, pick recipients based on your own judgement.
OK, I thought this was a bigger concern than it appears to be; I'll edit accordingly.
What I meant by that is that individual instantiations adapt to individual users' preferences, i.e., they develop personas that can veer off in bad or less-aligned directions over extended use. If this is the case, what is this called if not "learning after deployment"?
The Wikipedia entry is a little unclear, but it is correct that dopamine is not a "pleasure" signal.
Dopamine is a reward prediction error signal.
Dopamine is released when animals receive a reward (such as food when hungry, or water when thirsty) that they are not expecting. This causes changes in the brain that increase the probability of the animal initiating whatever action they just did in similar sensory contexts/environments. The animals don’t necessarily “enjoy” the dopamine or want the dopamine, they enjoy and want the food or water.
When we say the food or water is "rewarding" we mean, operationally, that it reinforces (increases the frequency of) the preceding behavior. Mechanistically, we think this reinforcement is mediated by the dopamine signal.
In clicker-training, once the animal knows a clicker sound always predicts a real reward, dopamine is released when they hear the clicker, because this is the moment they get the news that a reward is coming. This brings the dopamine signal closer in time to the behavior that caused it, improving temporal credit assignment for the neural weight updates. When the actual reward arrives, it is expected, so no additional dopamine is released.
Likewise, once the animal knows (or believes) their behavior in response to a given cue will be rewarded, dopamine is released when they initiate the action, because this is the moment they first get the update that a reward is coming; no additional dopamine is then released at clicker or reward time, as those are expected.
Failure to get an expected reward causes a transient decrease in dopamine at the time the animal had expected the reward. This disappointment signal leads to changes in the brain to decrease the probability of the animal initiating whatever action they just did in similar contexts/environments.
So, this is the sense in which dopamine marks the anticipation or prediction of a reward.
(There is also evidence that dopamine plays a role in signaling unexpected punishments, and that serotonin is also involved in signaling unexpected rewards and/or punishments, but I don't know the literature on those claims as well.)
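If it helps to make the logic concrete, here is a minimal temporal-difference (TD) learning sketch of the account above, which is the standard computational model of this dopamine signal. The state names, parameters, and trial counts are illustrative assumptions rather than a description of any particular experiment; the TD error `delta` plays the role of the dopamine signal.

```python
# Minimal tabular TD(0) sketch of reward prediction error (RPE).
# States, GAMMA, ALPHA, and trial counts are illustrative assumptions.

GAMMA = 0.9   # discount factor
ALPHA = 0.1   # learning rate

V = {"clicker": 0.0, "delay": 0.0}  # learned values of the within-trial states


def run_trial(V, reward=1.0, learn=True):
    """One trial: clicker -> short delay -> (reward or nothing).

    Returns the TD error (the 'dopamine-like' signal) at each event.
    """
    errors = {}

    # Clicker onset: the clicker arrives at an unpredictable moment, so the
    # pre-trial baseline (value ~0) cannot learn to predict it away.
    errors["clicker"] = GAMMA * V["clicker"] - 0.0

    # Clicker -> delay: update the clicker's value toward the delay's value.
    delta = GAMMA * V["delay"] - V["clicker"]
    if learn:
        V["clicker"] += ALPHA * delta
    errors["delay"] = delta

    # Delay -> outcome: the reward (if any) is delivered here.
    delta = reward - V["delay"]
    if learn:
        V["delay"] += ALPHA * delta
    errors["reward time"] = delta

    return errors


first = run_trial(V)                              # naive animal
for _ in range(500):                              # extended clicker training
    trained = run_trial(V)
omitted = run_trial(V, reward=0.0, learn=False)   # expected reward withheld


def rounded(errors):
    return {k: round(v, 2) for k, v in errors.items()}


print("first trial:   ", rounded(first))    # big error at reward time
print("after training:", rounded(trained))  # error has moved to the clicker
print("reward omitted:", rounded(omitted))  # negative dip at reward time
```

Running this shows the pattern described above: on the first trial the error spikes at reward delivery; after training it has moved to the clicker (the earliest reliable predictor) and the fully expected reward produces roughly zero error; and withholding an expected reward produces a negative dip at the time the reward was due, the "disappointment" signal.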
A (beautiful) song for existential despair: "Doomsday." Words by Joseph Hart (1762), music composed and arranged by Abraham Woods (1789). Performed by the group Landless:
An even less-formed thought: in the case of children, another important factor in their buying into restrictions on their own behavior is the realization that universal enforcement of those restrictions also protects them from others' bad behavior. And it might be important to their learning process that they practice enforcing these norms in their own social interactions. (I am speaking from parenting experience; I don't know the research literature on this topic.) Not sure how this applies to AGI alignment, but this seems to fall more in the self-other overlap bin?
(Epistemic status: I'm new to the alignment research field; I'm sure I'm not the first to have this thought, but it also does not seem to be a dominant thread in the current conversation.)
Most attempts to ensure alignment of LLMs have involved Reinforcement Learning from Human Feedback only after extensive pretraining on massive bodies of text, perhaps with a checklist of rules tacked on to intercept bad behaviors that might nevertheless occur. This seems too late. I take it people are now filtering the pretraining text to keep potentially harmful content from getting baked into the LLM's model in the first place, and/or imposing an RLHF round earlier in pretraining, with some success. It seems to me this is insufficient.
Babies are corrigible. Humans have a protracted period of development during which they are small, weak, and dependent on parents, who provide RLHF on what kinds of behavior are acceptable or unacceptable, at a time when this feedback is extremely salient. Children have a lot of time to internalize expectations ranging from "don't steal your sister's blocks" to "don't hit the dog" before there's any need to tackle more complex (and more potentially harmful) bad behaviors.
It's probably a good thing children get a lot of this sort of feedback on a relatively limited behavioral repertoire before they are anywhere near big enough to overpower adults, and before they are too independent or agentic. By the time they have the cognitive tools to conceive of and carry out long-term plans with substantial impact on the world, much less the physical capacities to realize those plans, they have hopefully internalized a strong innate sense of what sorts of behavior are acceptable, and where the red lines are for strictly unacceptable behavior. They will keep developing and potentially changing their values after that, but the baked-in primitives are pretty sticky. ("Give me a child until he is five and I'll have him for life.")
It seems worth looking more into developmental alignment by mimicking key aspects of human moral development. This might look like "age-appropriate" pre-filtering of input text corpus into a sequence of developmental stages, gradual scaling up of model size/complexity, and more integral continuous RLHF over the course of "pre-training".
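To make the shape of the suggestion concrete, here is a toy sketch of what a staged curriculum with interleaved feedback could look like. Everything in it (the stage definitions, the maturity tags, the stand-in update functions) is a hypothetical placeholder I made up for illustration, not a description of any existing training pipeline; a real system would of course be far more involved.

```python
# Toy, runnable sketch of "developmental" pretraining: age-appropriate
# filtering of the corpus, gradual capacity scale-up, and feedback
# interleaved with pretraining rather than tacked on at the end.
# Every name, stage, and number here is a hypothetical placeholder.

import random

random.seed(0)

# (stage name, maximum allowed "maturity" of training documents, capacity)
STAGES = [("infant", 1, 0.1), ("child", 2, 0.4), ("adolescent", 3, 1.0)]

# Dummy corpus: each document is tagged with a maturity level from 1 to 3.
CORPUS = [{"text": f"doc {i}", "maturity": random.randint(1, 3)}
          for i in range(300)]


def pretrain_step(model, doc):
    """Stand-in for an ordinary language-modeling update."""
    model["pretrain_steps"] += 1
    return model


def feedback_update(model, stage):
    """Stand-in for a continuous RLHF-style update from an overseer."""
    model["feedback_rounds"] += 1
    return model


def train_developmentally():
    model = {"capacity": 0.0, "pretrain_steps": 0, "feedback_rounds": 0}
    for stage, max_maturity, capacity in STAGES:
        model["capacity"] = capacity                    # gradual scale-up
        stage_docs = [d for d in CORPUS if d["maturity"] <= max_maturity]
        for i, doc in enumerate(stage_docs):
            model = pretrain_step(model, doc)           # ordinary pretraining
            if i % 10 == 0:                             # feedback throughout,
                model = feedback_update(model, stage)   # not only at the end
        print(f"{stage}: {len(stage_docs)} docs, capacity {capacity}")
    return model


train_developmentally()
```

The point of the sketch is only the ordering: feedback on a limited repertoire arrives while the model is still small and its training data is still simple, and the allowed content and capacity grow together, stage by stage.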
I get excited about the possibility of contributing to AI alignment research whenever people talk about the problem being hard in this particular way. The problems are still illegible. The field needs relentless Original Seeing. Every assumption needs to be Truly Doubted. The approaches that will turn out to be fruitful probably haven't even been imagined yet. It will be important to be able to learn and defer while simultaneously questioning, to come up with bold ideas while not being attached to them. It will require knowing your stuff, yet it might be optimal not to know too much about what other people have thought so far ("I would suggest not reading very much more."). It'll be important for people attempting this to be exceptionally good at rigorous, original conceptual thinking, thinking about thinking, grappling with minds, zooming all the way in and all the way out, constantly. It'll probably require making ego and career sacrifices. Bring it on!
However, here's an observation: the genuine potential to make such a contribution is itself highly illegible. Not only to the field, but perhaps to the potential contributor as well.
Apparently a lot of people fancy themselves to have Big Ideas or to be great at big-picture thinking, and most of them aren't nearly as useful as they think they are. I feel like I've seen that sentiment many times on LW, and I'm guessing that's behind statements like:
It's not remotely sufficient, and is often anti-helpful, to just be like "Wait, actually what even is alignment? Alignment to what?".
or the Talent Needs post that said
This presents a bit of a paradox. Suppose there exist a few rare, high-potential contributors not already working in AI who would be willing and able to take up the challenge you describe. It seems like the only way they could make themselves sufficiently legibly useful would be to work their way up through the ranks of much less abstract technical AI research until they distinguish themselves. That's likely to deter horizontal movement of mature talent from other disciplines. I'm curious if you think this is true; or if you think starting out in object-level technical research is necessary training/preparation for the kind of big-picture work you have in mind; or if you think there are other routes of entry.
Hopefully implied by omission, but I probably should have said that nobody was badly hurt; the workers all got out (one tried to go back in to rescue our parakeet, but the firefighters wouldn’t let her).
I think we are agreeing: there may be a small set of core values shared by a large fraction of humans, so there's hope most people will buy into AI aligning on at least those. And most of those are probably also in fact good values, or at least if AI were aligned to those values, that would go a long way to decreasing X-risk. All I am saying is, these are nevertheless specific values, which exclude and reject other possible value systems. But it's taboo to say one set of values is right and another is wrong, so people dance around it.
This is mostly tangential to the alignment conversation; I'm not suggesting we need to solve metaethics or resolve all ethical disputes to move forward with AI alignment efforts. I think sticking with a few widely-accepted values is a good practical strategy. But I hear some people objecting "yes, but whose values?" and I can see that they are not at all satisfied with the answer "oh no, we're not advocating any particular values". So I think it would serve the cause better to be clear that these are, in fact, particular values.