Question to people working on Technical AI Alignment: How are you currently making a living?
I want to, ideally, focus on Technical Alignment Research full-time, and I feel a bit lost. Any advice or encouragement, or even discouragement, would be appreciated.
Edit: Here is johnswentworth's answer to this question in 2021. I don't know how much the space has changed since then.
How long do you think something should be before it is no longer a quick take and should instead be a top level post? Or is it not about the length? Maybe it's about the amount of research and editing that goes into it?
I don't know the exact cutoff, but I think a decent number of quick takes should just be posts. We already have the Personal Blog tag, and I just rely on the moderators to decide whether something should be promoted to the frontpage.
I think the question of whether something belongs in shortform vs. a post is separate from whether it's a good post. I was just assuming mods wouldn't promote something they think is too niche or undeveloped. The Personal Blogpost tag specifically calls out "niche topics" and "personal ramblings".
I think I agree. I'm imagining quick takes fill the role of quick, Twitter-like back and forth of idea snippets, and general questions like the one I just posed, but it seems like you can write entire article-length content in them, which makes me wonder what different people think the distinction should be.
Giving things a title is, in my experience, a very difficult process, and it changes the structure of an essay from a conversational, open-ended style to a more "I know where I want the reader to go" situation. Both have their place, but they do feel quite different.
Oh, that's a really interesting answer: the difference is like the difference between a named and an anonymous function. I do think there is a kind of important semiotic power in naming things, so I can understand wanting to avoid that, but I also feel pretty comfortable writing a post and then slapping a name on it based on how the vibe turned out. This is what I just did with Ball+Gravity has a "Downhill" Preference. It started as a quick take, but then became a long take, so I copied and pasted it into a post and gave it a name. That's also what inspired this question.
A contradiction that isn't a contradiction:
I hold both of these views:
Why might this seem like a contradiction? Either I should think that more money should be put into technical AI alignment so I and other people can get paid to do it, or I should conclude that AI policy is more important and try to work on that instead.
Why do I believe this is not actually a contradiction? In my worldview, AI is a very important and potentially existentially dangerous technology, and the current shepherds of its development are not handling their responsibility with commensurate wisdom. AI policy is, then, the more important and constrained focus. But I do not believe that technical AI alignment should be completely forgotten, and I do not believe I have the aptitude or desire to do policy work. I intrinsically like research and theory building.
I wonder how many other people feel the way I feel. It would be quite a problem if it were the majority of us.
Given the existence of a hill and gravity, the roundness of a ball encodes its preference for being at the bottom of the hill. Without the hill and gravity, the roundness of the ball could mean many things.
Given the existence of a ball and gravity, the bottom of a hill encodes a preference for where the ball should be.
Neither the ball nor the hill is modelling the world, yet together they steer reality towards repeatable outcomes. Is it wrong to call this a preference even though it does not depend on world modelling?
Is it right to say preferences in this context only exist as the interplay of multiple parts of a system? Are the preferences of world-modelling agents fully contained in those agents, or, like the ball and the hill, do an agent's preferences exist as an interplay between itself and the world?
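As a toy illustration of steering without modelling, here's a minimal sketch I'm adding for concreteness; the quadratic hill, the friction constant, and the step sizes are arbitrary choices, not part of the original framing. Wherever the ball starts, it ends up near the bottom without representing anything.

```python
def hill_slope(x):
    """Slope of a simple quadratic valley whose bottom is at x = 0."""
    return 2 * x

def roll(x0, steps=5000, dt=0.01, friction=0.5):
    """Crude ball-on-a-hill dynamics: gravity pulls downhill, friction damps motion."""
    x, v = x0, 0.0
    for _ in range(steps):
        v += (-hill_slope(x) - friction * v) * dt  # acceleration from slope plus drag
        x += v * dt
    return x

# Wherever the ball starts, it settles near the bottom (x = 0):
for start in (-3.0, 0.5, 2.7):
    print(start, "->", round(roll(start), 3))
```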
Not wrong, if used metaphorically, but I think that "preference", which implies an agent that is aware of options and capable of making choices, maybe muddies whatever you're trying to express. The ball and the hill are not that kind of thing. Preference, in ordinary parlance, suggests ranking one option above others. Often the options are qualitative: "I prefer Chocolate Ice-Cream to Strawberry". In economics it's about "optimal choice", which again raises the question: do the hill and the ball have the capacity to take alternatives? Is there some utility they are maximizing?
Spinoza says that if a stone which has been projected through the air, had consciousness, it would believe that it was moving of its own free will. I add this only, that the stone would be right. The impulse given it is for the stone what the motive is for me, and what in the case of the stone appears as cohesion, gravitation, rigidity, is in its inner nature the same as that which I recognise in myself as will, and what the stone also, if knowledge were given to it, would recognise as will. -
Arthur Schopenhauer
If your objective is to describe the most probable or likely outcome of a system that is better modeled using Daniel Dennett's Physical Stance than his Intentional Stance, then I'd avoid using "preference". In the example you've given, there's nothing to suggest the ball will be anywhere else and nothing to suggest it has "options"; therefore there are no preferences to speak of.
Preference implies alternative outcomes.
Thanks for engaging : )
I think the phrase "aware of and capable of making choices" hides most of the complexity I am interested in focusing on. What really is awareness? The word "aware" implies that it is a boolean thing, like "either some system is aware or it is not", but I think that's wrong. I think "awareness" varies in amount and kind.
And "making choices" is similarly complicated. The ball could stay put or roll, but it chooses to roll. You could say it never had the choice to do anything but roll because the mechanism which determined its choice to roll, its roundness, is so obvious and exposed, but suppose I understood the mechanisms of some human's mind well enough to predict that human's actions with the same accuracy? Would it be right to suggest that humans do not make choices since the choices were determined by the mechanisms by which humans choose?
It seems to me the Physical Stance and the Intentional Stance both describe the same systems. It is my feeling that in order to understand complex decision-making systems, such as humans, AIs, and sociotechnical systems, we need language that can describe them clearly. So I guess what I might be doing here is trying to force an exploration of the boundary between where the physical stance and the intentional stance apply.
I could believe that a symbolic representation of other objects is the quality required to say that a system is aware, but then, is roundness symbolic? Where is the distinction between symbolic and mechanical?
Likewise, I could very much imagine that alternative outcomes are required for preference, but then whether some system has preferences depends on how well understood it is. That has uncomfortable implications. If an ASI understood humans sufficiently well, would that ASI be justified in claiming that humans do not have preferences? I'm much more comfortable admitting any system that affects outcomes has preferences than denying the preferences of any sufficiently well understood system.
... Oh, also, I didn't put as much emphasis on it, but I really am interested in the question of whether an agent's preferences exist as an interplay between the world and itself. I feel that would have important implications for Agent Foundations and AI Alignment.
but suppose I understood the mechanisms of some human's mind well enough to predict that human's actions with the same accuracy? Would it be right to suggest that humans do not make choices since the choices were determined by the mechanisms by which humans choose?
I don't think we need to suppose... I'd guess you probably do this frequently. You have family members, friends, and/or lovers, or people of whom you have intimate knowledge and whose behavior you have an extremely good track record of predicting?
If an ASI understood humans sufficiently well, would that ASI be justified in claiming that humans do not have preferences? I'm much more comfortable admitting any system that affects outcomes has preferences than denying the preferences of any sufficiently well understood system.
I don't think it would be any more justified in claiming that humans don't have preferences than I would be in claiming that anybody I know really well doesn't have preferences. If you can predict which newspaper or soft drink your father buys from the store, that doesn't mean he had no choice in the matter. If there are no other newspapers in stock, or only one brand of soft drink, then he has no choice. But, realistically, you can't choose alternatives you're not aware of.
A simple test of whether something is a choice or not is to ask: "if the agent believed something else or had very different desires, would the outcome be very different?" If, no matter what the agent desires or believes, the outcome would always be the same, then it's not a choice.
If someone goes up to the fridge at a store and there's an orange drink and a strawberry drink, and you know they love orange flavor, so they buy the orange, that's still a choice. But imagine you knew they HATED orange, or loved strawberry instead: hypothetically they would then choose the strawberry. Therefore it was a choice.
Conversely, imagine a spectator high up on an embankment at a motor race. They are in a sea of people, a mere speck as seen from the track, so they have no earthly way of affecting the result of the race. There are twenty racers. It doesn't matter which one this single spectator desires or wishes to win; the result is hypothetically always the same. This is not a choice.
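To make that test concrete, here's a minimal sketch; the drink_outcome and race_outcome functions are made-up stand-ins for the two examples above, not anyone's actual model. The idea is to hold the world fixed, vary the agent's desires, and see whether the outcome changes.

```python
def drink_outcome(desire):
    """The shopper buys whichever stocked flavour they desire."""
    stocked = {"orange", "strawberry"}
    return desire if desire in stocked else "nothing"

def race_outcome(desire):
    """The spectator's desire has no influence on who wins the race."""
    return "racer 7 wins"

def is_choice(outcome_fn, possible_desires):
    """Counterfactual test: it's a choice iff varying the agent's desires
    (or beliefs) can change the outcome."""
    return len({outcome_fn(d) for d in possible_desires}) > 1

desires = ["orange", "strawberry", "cherry"]
print(is_choice(drink_outcome, desires))  # True: the outcome tracks the desire
print(is_choice(race_outcome, desires))   # False: the outcome never varies
```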
I am not familiar with any credible model where a ball can "desire" to go up and, contingent on that alone, it does. This is why it is best represented by the "physical" stance in Dennett's typology.
The word "aware" implies that it is a boolean thing, like "either some system is aware or it is not", but I think that's wrong. I think "awareness" varies in amount and kind.
Abstractly, I agree with this, and I think there's a spectrum of awareness in ways that do influence choices. But I'm struggling for examples right now... the best that comes to mind is a couple deciding where to go to dinner. One of them says "let's have Italian" knowing there is an Italian restaurant nearby. They aren't strictly aware of the menu, which could include Ragù, Calzone, Osso Buco, or dozens of other choices, but they are aware of at least one restaurant nearby, in their price range, that does "Italian".
Likewise, preferences themselves often exist in parallel. If orange isn't available, maybe they go for banana, or cherry. And likewise, choices are often prompted by complex decision-making models operating on dozens of different dimensions or factors, even for something as simple as buying a shirt: is it comfortable? Do I like the pattern or the colour? Is the material breathable? What are the washing instructions? Etc.
A lot of this is black box analysis. I'm interested in white box analysis. I guess maybe "black box vs white box" means the same thing as "intentional stance vs physical stance".
You speak of knowing the preferences of something, with the implication that you have observed the past behaviour of the system and can infer its future behaviour based on an abstract model of its "intentions" or "preferences". Is this what is meant by the "intentional stance"? I think so, and it is indeed a valid way to examine the world.
But within a person, and within an AI model, there is some mechanism that causes those preferences to be so... and that is the kind of understanding I am focusing on: predicting the choice of orange flavour not from past behaviour involving flavour choices or statements about preferences, but by examining the body, brain, and brain state with enough skill to see how and where the preference for orange is encoded, and predicting based on that. Is this the "physical stance"? In that case I think I might be interested in merging the physical and intentional stances.
For example, I might know that balls roll down hills not because I have analyzed them as physical objects, but because I have observed them roll down hills before. Is this not the same as the intentional stance? Modelling the preferences of the ball based on its past behaviour?
On the other hand, it isn't too difficult to understand how roundness vs flatness affects rolling. The flat object stays where it is put and the round object rolls down the hill. You can see mechanically why this is the case, but you could also just as well know it by inference, and I would suggest that most people learn about physical laws first by observing the behaviours of objects and only later in life learn about things like friction and force and gravity.
I haven't noticed anything you have said that categorically distinguishes the behaviour of an object rolling down a hill from the behaviour of a person expressing their preferences by choosing what they want.
I think there's a perverse incentive around the learning and usage of math. This is majorly coloured by my experiences in calculus class, where it seemed like most students were interested in memorizing how to use the formulas to get the correct answers, and thereby good grades, without necessarily understanding anything about what the calculations they were doing represented or were used for.
Maybe the problem here is entirely Goodharting on grades, but I wonder if there is a broader phenomenon.
There are three objects here:
The dynamic I'm hypothesizing is:
In general, (2) is more important than (1), and (1) is held in higher esteem than (2). Ideally people have a strong grasp of both (1) and (2), but people are intrinsically and extrinsically motivated to seek (3), and (2) is easier to fake than (1), so people are motivated to fake (2) and put their effort into signalling competence in (1), which would ideally imply competence in (2), but doesn't necessarily.
I feel this relates to what Grant Sanderson of 3b1b talked about in Math's pedagogical curse. But of course I also worry this is something I'm imagining because I feel motivated to try to understand math that is too difficult for my level of skill and I want to rationalize away my incompetencies. Is it imposter syndrome or honest self knowledge?
What do you think? Is (2) more important than (1), or maybe it's not important to be skilled at both, and we need people better at (1) and people better at (2)? Do you agree the dynamic I described exists, and does it feel common, or marginal? Or maybe my entire framing is flawed in some way?
Thought while doing Transformers from Scratch:
Inside a transformer block, the MLP embeds into a higher-dimensional space, applies an activation function, and then projects back down to the original dimension. From a semantic distribution perspective, here are three intuitions for why this makes sense:
The more dimensions you have the more complicated "knots" you can untangle. With 2d, you can pull the centre out of a line. With 3d, you can pull the centre out of a disk. With 4d, you can pull the centre out of a ball. Etc...
Each dimension in the activation space is like an independent fold applied to the semantic space. The more folds you have the more you can transform the semantics.
The embedding space can be thought of as partitioned into independent copies of subspaces of the input space, each transformed independently. The higher the dimension of the embedding space, the more copies of larger subspaces of the input (up to and including the entire input space) can be independently transformed.
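For concreteness, here's a minimal sketch of the block being described, assuming a PyTorch-style implementation; the 4x expansion factor and GELU are common defaults (as in GPT-2-style models), not anything the intuitions above depend on.

```python
import torch
import torch.nn as nn

class TransformerMLP(nn.Module):
    """The MLP inside a transformer block: up-project, nonlinearity, down-project."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)    # embed into the higher-dimensional space
        self.act = nn.GELU()                                  # activation applied in the wide space
        self.down = nn.Linear(expansion * d_model, d_model)   # project back to the residual-stream width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))

mlp = TransformerMLP(d_model=512)
x = torch.randn(2, 16, 512)   # (batch, sequence, d_model)
print(mlp(x).shape)           # torch.Size([2, 16, 512]): same width going out as coming in
```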