Hi, I am a physicist, an Effective Altruist, and an AI safety student/researcher.
1) Parenting is known to have little effect on children's character.
This is not counter-evidence to my claim. The value framework a child learns about from their parents is just one of many value frameworks they hear about from many, many people. My claim is that the power lies in noticing the hypothesis at all. Which ideas you are told more often (e.g. by your parents) doesn't matter.
As far as I know, what culture you are in very much influences your values, which my claim would predict.
2) While children learn to follow rules, teens are good at figuring out what is in their interest.
I'm not making any claims about rule following.
Blogposts are the result of noticing a difference in beliefs: either between you and others, or between you and yourself, across time.

I have lots of ideas that I don't communicate. Sometimes I read a blogpost and think "yeah, I knew that, why didn't I write this?" And the answer is that I did not have an imagined audience.

My blogposts almost always spawn after I have explained a thing ~3 times in meatspace. Generalizing from these conversations, I form an imagined audience which is some combination of the ~3 people I talked to. And then I can write. (In a conversation I don't need to imagine an audience; I can just probe the person in front of me and try different explanations until one works. When writing a blogpost, I don't have this option. I have to imagine the audience.)

Another way to form an imagined audience is to write for your past self. I've noticed that a lot of things I read are like this. When you have just learned or realized something, and the past you who did not know the thing is still fresh in your memory, it is easier to write the thing. This shortform is of this type.

I wonder if I'm unusually bad at remembering the thoughts and beliefs of my past self? My experience is that I pretty quickly forget what it was like not to know a thing. But I see others writing things aimed at their past selves from years ago.

I think I'm writing this shortform as a message to my future self, for when I have forgotten this insight. I want my future self to remember this idea of how blogposts spawn. I think it will help her guide her writing, but also help her not be annoyed when someone else writes a popular thing that I already knew: "why did I not write this?" There is an answer to the question "why did I not write this?", and the answer is "because I did not know how to write it". A blogpost is a bridge between a land of not knowing and a land of knowing. Knowing the destination of the bridge is not enough to build the bridge. You also have to know the starting point.
I almost totally agree with this post. This comment is just nitpicking and speculation.
Evolution has another advantage, one that is related to "getting lots of tries" but also importantly different. It's not just that evolution got to tinker a lot before landing on a failproof solution. Evolution doesn't even need a failproof solution. Evolution is "trying to find" a genome which, in interaction with reality, forms a brain that causes that human to have lots of kids. Evolution found a solution that mostly works, but sometimes doesn't. Some humans decided that celibacy was the cool thing to do, or got too obsessed with something else to take the time to have a family. Note that this is different from how the recent distributional shift (mainly access to birth control, but also something about living in a rich country) has caused previously child-rich populations to have, on average, less than replacement birth rate.

Evolution is fine with getting the alignment right in most of the minds, or even just a minority, if they are good enough at making babies. We might want better guarantees than that.

Going back to alignment with other humans. Evolution did not directly optimise for human-to-human alignment, but still produced humans that mostly care about other humans. Studying how this works seems like a great idea! But evolution also did not exactly nail human-to-human alignment. Most, but definitely not all, humans care about other humans. Ideally we want to build something much, much more robust.

Crazy (probably bad) idea: suppose we can build an AI design + training regime that mostly, but not certainly, turns out human-aligned AIs, and where the uncertainty is mostly random noise that is uncorrelated between AIs. Then maybe we should build lots of AIs with similar power and hope that, because the majority are aligned, this will turn out fine for us. Like how you don't need every single person in a country to care about animals in order for that country to implement animal protection laws.
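The "build lots of AIs and hope the majority is aligned" idea can be put in numbers with a quick binomial sketch. The per-AI alignment probability p = 0.7 and the function name are mine, for illustration; the crucial assumption is the one from the idea itself, that failures are independent across AIs:

```python
from math import comb

def p_majority_aligned(n: int, p: float) -> float:
    """Probability that a strict majority of n AIs are aligned,
    assuming each AI is independently aligned with probability p."""
    k_min = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

# If alignment "mostly works" per AI, scale helps a lot -- but only
# because the failures are assumed uncorrelated:
for n in (1, 11, 101):
    print(n, round(p_majority_aligned(n, 0.7), 4))
```

The whole effect rests on the uncorrelated-noise assumption: a correlated failure mode (e.g. a shared flaw in the training regime) hits all n AIs at once, and then building more of them does not help.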
This is probably too obvious to write, but I'm going to say it anyway. It's my short form, and approximately no-one reads short forms. Or so I'm told.
Human value formation is to a large part steered by other humans suggesting value systems to you. You get some hard-to-interpret reward signal from your brainstem, or something. There are lots of "hypotheses" for the "correct reward function" you should learn.
(Quotation marks because there is no ground truth for what values you should have. But this is mathematically equivalent to learning the true distribution generating the data from a finite number of data points. Also, there is maybe some ground truth of what the brainstem rewards, or maybe not. According to Steve, there is a loop where, when the brainstem doesn't know whether things are good or not, it just mirrors the cortex's own opinion back to the cortex.)
To locate the hypothesis, you listen to other humans. I make this claim not just for moral values, but also for personal preferences. Maybe someone suggests to you "candy is tasty", and since this seems to fit your observations, now you also like candy. This is a bad example, since for taste specifically the brainstem has pretty clear opinions. Except there is acquired taste... so maybe not a terrible example.
Another example: you join a hobby. You notice you like being at the hobby place doing the hobby thing. Your hobby friend says (i.e. offers the hypothesis) "this hobby is great". This seems to fit your data, so now you believe you like the hobby. And because you believe you like the hobby, you end up actually liking the hobby, because of a self-reinforcing loop. Although this doesn't always work. Maybe after some time your friends quit the hobby and this makes it less fun, and you realise (change your hypothesis) that you mainly liked the hobby for the people.
Maybe there is a ground truth about what we want for ourselves? I.e. we can end up with wrong beliefs about what we want due to peer pressure, commercials, etc., but with enough observation we will notice what it is we actually want.
Clearly humans are not 100% malleable, but it also seems like even our personal preferences are path-dependent (i.e. they pick up lasting influences from our environment). So maybe some annoying mix...
What is alignment? (operationalisation)
Toy model: Each agent has a utility function it wants to maximise. The input to the utility function is a list of values describing the state of the world. Different agents can have different input vectors. Assume that every utility function monotonically increases, decreases, or stays constant for changes in each input variable (I did say it was a toy model!). An agent is said to value something if its utility function increases with increasing quantity of that thing. Note that if an agent's utility function decreases with increasing quantity of a thing, then the agent values the negative of that thing.
In this toy model agent A is aligned with agent B if and only if A values everything B values.
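A minimal sketch of this toy model in code (the sign-vector representation and the agent names are mine): each agent maps each world-state variable to +1 (its utility increases in it), -1 (it values the negative), or 0 (indifferent). A is aligned with B iff A matches B's sign on every variable B is not indifferent to:

```python
def is_aligned(a: dict, b: dict) -> bool:
    """A is aligned with B iff A values everything B values:
    same sign on every variable B is not indifferent to."""
    return all(a.get(var, 0) == sign for var, sign in b.items() if sign != 0)

# Hypothetical agents for illustration:
human = {"happiness": +1, "suffering": -1}
servant = {"happiness": +1, "suffering": -1, "paperclips": +1}

print(is_aligned(servant, human))  # True: servant values everything the human values
print(is_aligned(human, servant))  # False: the relation is not symmetric
```

This also makes the properties below easy to check mechanically: the relation is transitive but not symmetric.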
Q: How well does this operationalisation match my intuitive understanding of alignment? A: Good, but not perfect.
This definition of alignment is transitive, but not symmetric. This matches the properties I think a definition of alignment should have.
How about if A values a lot of things that B doesn't care about, and only cares very little about the things B cares about? That would count as aligned in this operationalisation, but it does not necessarily match my intuitive understanding of alignment.
What is alignment? (operationalisation second try)
Agent A is aligned with agent B, if and only if, when we give more power (influence, compute, improved intelligence, etc.) to A, then things get better according to B’s values, and this relation holds for arbitrary increases of power.
This operationalisation points to exactly what we want, but is also not very helpful.
Here's what you wrote:
This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations.
Do you still agree with yourself?
In that case I'm confused about this statement:
This interpretation makes sense even in the absence of “agents” with “beliefs”
What are priors in the absence of something like agents with beliefs?
Support for AI safety research is up: 69% of respondents believe society should prioritize AI safety research “more” or “much more” than it is currently prioritized, up from 49% in 2016.
What is this number if you only include people who participated in both surveys?
We’ve shown that the probability P[q|X] summarizes all the information in X relevant to q, and throws out as much irrelevant information as possible.
This seems correct. Let's say two different points in the data configuration space, X_1 and X_2, provide equal evidence for q. Then P[q|X_1] = P[q|X_2]. The two different data possibilities are mapped to the same point in this compressed map. So far so good.

(I assume that I should interpret the object P[q|X] as a function over X, not as a point probability for a specific X.)
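A concrete instance of this (the coin example is mine): let q be "the coin is heads-biased (p = 0.9)" versus fair, and let X be a sequence of flips. Two sequences that differ only in order carry equal evidence for q, so they map to the same point P[q|X]; the ordering is exactly the irrelevant information being thrown away.

```python
def posterior_biased(flips: str, prior: float = 0.5) -> float:
    """P[biased | flips] for a 0.9-heads coin vs a fair coin, via Bayes."""
    like_biased, like_fair = 1.0, 1.0
    for f in flips:
        like_biased *= 0.9 if f == "H" else 0.1
        like_fair *= 0.5
    num = like_biased * prior
    return num / (num + like_fair * (1 - prior))

x1, x2 = "HHT", "THH"  # same counts, different order: equal evidence
print(posterior_biased(x1), posterior_biased(x2))  # identical values
```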
First, hopefully this provides some intuition for interpreting a probability P[q|X] as a representation of the information in X relevant to q. In short: probabilities directly represent information. This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations. That means we can potentially apply it in a broader variety of situations - we can talk about simple mechanical processes which produce “maps” of the world, and the probabilistic calculations embedded in those processes.
I don't think this works.
The map P[q|X] has gotten rid of all the irrelevant information in the data, but it still contains some information that never came from the data. I.e. P[q|X] is not generated only from the information in X relevant for q.
E.g. from P[q|X] we can get
P[q] = sum_X P[q|X] P[X]
i.e. the prior probability of q. And if the prior of q were different, P[q|X] would be different too.
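The prior-dependence is easy to see in a toy setup (coin example mine): hold the data X fixed and change only the prior P[q], and the "compressed map" P[q|X] changes, so it carries information that never came from X.

```python
def posterior(flips: str, prior: float) -> float:
    """P[q | flips] for q = "coin lands heads with p = 0.9" vs a fair coin."""
    like_q, like_fair = 1.0, 1.0
    for f in flips:
        like_q *= 0.9 if f == "H" else 0.1
        like_fair *= 0.5
    return like_q * prior / (like_q * prior + like_fair * (1 - prior))

# Identical data, different priors: the map X -> P[q|X] differs.
print(posterior("HH", prior=0.5))
print(posterior("HH", prior=0.1))
```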
The way you can't (shouldn't) get rid of priors here feels similar to how you can't (shouldn't) get rid of coordinates in physics. In this analogy, the choice of prior is analogous to the choice of origin. Your choice of origin is completely subjective (even more so than the prior). Technically you can represent position in a coordinate-free way (only relative positions), but no one does, because doing so destroys other things.
(I'm being maximally critical, because you asked for it)
I don't think this is true:
But there’s a biological analogy: classical conditioning. E.g. I can choose to do X right before Y, and then I’ll learn an association between X and Y which I wouldn’t have learned if I’d done X a long time before doing Y.
I could not find any study that tests this directly, but I don't expect conditioning to work if you yourself cause the unconditioned stimulus (US), Y in your example. My understanding of conditioning is that if there is no surprise, there is no learning. For example: if you first condition an animal to expect A to be followed by C, and then expose it to A+B followed by C, it will not learn to associate B with C. This is a well-replicated result, and the textbook explanation (which I believe) is that no learning occurs because C is already explained by A (i.e. there is no surprise).
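The "no surprise, no learning" account is formalized in the Rescorla-Wagner model, and it reproduces the blocking result described above. A minimal simulation (parameter values are illustrative, not taken from any particular study):

```python
def rescorla_wagner(trials, v=None, alpha=0.3, lam=1.0, n_reps=100):
    """Rescorla-Wagner updates: on each trial, the prediction error
    (lam minus the summed strength of all present stimuli) drives
    learning for every stimulus that is present."""
    v = dict(v or {})
    for _ in range(n_reps):
        for stimuli in trials:
            error = lam - sum(v.get(s, 0.0) for s in stimuli)
            for s in stimuli:
                v[s] = v.get(s, 0.0) + alpha * error
    return v

v1 = rescorla_wagner([{"A"}])                  # phase 1: A -> C, A fully learned
blocked = rescorla_wagner([{"A", "B"}], v=v1)  # phase 2: A+B -> C
naive = rescorla_wagner([{"A", "B"}])          # control: A+B -> C from scratch

print(round(blocked.get("B", 0.0), 3))  # near 0: B is blocked, no surprise left
print(round(naive["B"], 3))             # near 0.5: B learned while C was surprising
```

Since A already fully predicts C after phase 1, the prediction error in phase 2 is essentially zero, so B acquires almost no associative strength, exactly the blocking result.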
Does this matter for understanding gradient hacking in future AGIs? Maybe?
Since humans are the closest thing we have to an AGI, it does make sense to try to understand things like gradient hacking in ourselves. Or if we don't have this problem, it would be very interesting to understand why not.
Are there other examples of biological gradient hacking?

(1) I heard that whatever you do while taking nicotine will be reinforced (I don't remember the source, but it seems plausible to me). But this would be more analogous to directly overwriting the backprop signal, rather than manipulating the gradient by controlling the training data. If we end up with an AI that can just straightforwardly edit its outer learning regime in this way, then I think we are outside the scope of what you are talking about. However, if this nicotine hack works, it is interesting that it is not used more. Maybe the effect is not strong enough to be useful?
(2) You give another example:
Humans often reason about our goals in order to produce more coherent versions of them. Since we know while doing the reasoning that the concepts we produce will end up ingrained as our goals, this could be seen as a form of gradient hacking.
I can't decide if I think this should count as gradient hacking.
(3) I know that I to some extent absorb the values of people around me, and I have used this for self-manipulation. This is the best analogue to gradient hacking I can think of for humans. Unfortunately, I don't expect this to tell us much about AIs, since this method depends on a specific human drive towards conformity.
I'm curious whether an opposite strategy works for contrarians: if you want to self-manipulate, should you hang out with people who believe/value the opposite of what you want yourself to believe/value?