We don't have to solve any deep philosophical problems here finding the one true pointer to "society's values", or figuring out how to analogize society to an individual.
I agree with this, in a nutshell. After all, you can put in almost whatever values you like and it will work, which is the point of my long comment.
My point is that once you have instrumental goals like survival and technological progress nailed down for everyone, alignment in practice should reduce to this:
Everyone has their own personal superintelligence that they can brainwash to do whatever they want.
And the alignment problem is simple enough: How do you brainwash an AI to have your goals?
I mean FTX happened in the last 6 months! That caused incredibly large harm for the world.
I agree, but I have very different takeaways on what FTX means for the Rationalist community.
I think the major takeaway is that human society is somewhat more adequate, relative to our values, than we think, and this matters.
To be blunt, FTX was always a fraud, because Bitcoin and cryptocurrency violate a fundamental axiom of good money: its value must be stable, or at least change slowly. A single Bitcoin or other cryptocurrency has a wildly unstable price, so it's not a good store of value. The root issue is the deeply stupid idea of fixing the supply, which, combined with variable demand, leads to wild price swings.
It's possible to salvage some value out of crypto, but it can't be tied to real money.
Most groups have way better ideas for money than Bitcoin and cryptocurrency.
OpenAI and Anthropic are two of the most central players in an extremely bad AI arms race that is causing enormous harm. I really feel like it doesn't take a lot of imagination to think about how our extensive involvement in those organizations could be bad for the world. And a huge component of the Lightcone Offices was causing people to work at those organizations, as well as support them in various other ways.
I don't agree, at least in this world, and this is related to a very important crux in AI Alignment/AI safety: Can it be solved solely via iteration and empirical work? My answer is yes, and one of the biggest examples is Pretraining from Human Feedback. I'll explain why it's the first real breakthrough of empirical alignment:
It almost completely avoids deceptive alignment, because it lets us specify the base goal as human values before the model has generalization capabilities. The goal is also pretty simple and myopic, so simplicity bias doesn't have as much incentive to make the model deceptively aligned. Basically, we first pretrain the base goal, which is way more outer aligned than the standard MLE goal, and then we let the AI generalize. This inverts the usual order of alignment and capabilities: RLHF and other alignment solutions first give capabilities, then try to align the model, which is of course not going to work all that well compared to PHF. In particular, it means that more capabilities lead to better and better inner alignment by default.
Conditional training, the objective that worked best for pretraining from human feedback, has a number of outer alignment benefits compared to RLHF and fine-tuning, even setting aside that it effectively solves inner alignment and prevents deceptive alignment.
One major benefit is that, since it's offline training, there is never a way for any model to affect the distribution of data that we use for alignment, so there's never a way or incentive to gradient hack or shift the distribution. In essence, we avoid embedded agency problems by recreating a Cartesian boundary that actually works in an embedded setting. While it will likely fade away in time, we only need to have it work once, and then we can dispense with the Cartesian boundary.
Again, this shows increasing alignment with scale, which is good because we found the holy grail of alignment: A competitive alignment scheme that scales well with model data and allows you to crank capabilities up and get better and better results from alignment.
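To make the data side of conditional training concrete, here is a minimal sketch of how I understand it: each segment of the offline pretraining corpus gets a control token based on a reward score, and the model is then trained with the ordinary language modeling objective on the tagged text. The scorer, the threshold and the toy corpus below are made up for illustration, and only the tagging step is shown, not the training itself.

```python
from typing import Callable, List

# Control tokens prepended to each training segment.
GOOD, BAD = "<|good|>", "<|bad|>"

def tag_corpus(segments: List[str],
               score: Callable[[str], float],
               threshold: float) -> List[str]:
    """Prepend a control token to every segment based on its reward score."""
    return [(GOOD if score(s) >= threshold else BAD) + s for s in segments]

def toy_score(text: str) -> float:
    """Hypothetical reward: minus the number of blocklisted words.
    A real setup would use a learned classifier or rule-based scorer."""
    blocklist = {"idiot", "stupid"}
    words = (w.strip(".,!?") for w in text.lower().split())
    return -float(sum(w in blocklist for w in words))

# The corpus is fixed before training, so the model being trained never
# influences which data it sees (the offline property discussed above).
corpus = ["Thanks for the help.", "You are an idiot."]
tagged = tag_corpus(corpus, toy_score, threshold=0.0)

for line in tagged:
    print(line)
```

At sampling time you would condition on the good token, so the model still learns from the bad text without being asked to imitate it.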
Here's a link if you're interested:
Finally, I don't think you realize how well we did in getting companies to care about alignment, or how good it is that LLMs are being pursued before RL, which means we can have simulators before agentic systems arise.
This is mostly because I think even a best case alignment scenario can never be more than "everyone has their own personal superintelligence that they can brainwash to do whatever they want."
This is related to fundamental disagreements I have around morality and values that make me pessimistic about trying to align groups of people, or indeed trying to align with the one true morality/values.
To state the disagreements I have:
Essentially, it's trivialism, applied to morality, with a link below:
The reason reality doesn't face the problem of being trivial is that, for our purposes, we don't have the power to warp reality into whatever we want (often talked about under different names, including omnipotence, administrator access to reality, and more). In morality, by contrast, we do have the power to change our values to anything else, thus generating inconsistent but complete values, unlike the universe we find ourselves in, which is probably consistent and incomplete.
There is no way to coherently talk about something like a society's or humanity's values in the general case, and in the case where everyone is aligned, all we can talk about is optimal redistribution of goods.
This means that many attempts to analogize society's or humanity's values to those of, say, an individual person rely on two techniques that are subjective:
Carrying out a simplification or homogenization of the multiple preferences of the individuals that make up that society;
Modeling your own personal preferences as if these were the preferences of society as a whole.
That means it is never a nation or humanity that acts on morals or values, but specific people with their own values who take those actions.
Here's a link to it.
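One standard way to see why aggregating preferences can fail is a Condorcet cycle. This is my own toy illustration, not necessarily the argument the linked post makes: three hypothetical voters each have perfectly coherent rankings over options A, B and C, yet pairwise majority voting produces a "group preference" that goes in a circle.

```python
# Three hypothetical voters, each with a consistent ranking (best first).
voters = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x: str, y: str, rankings) -> bool:
    """True if a strict majority of rankings place x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

# Pairwise majority verdict for every ordered pair of distinct options.
pairwise = {
    (x, y): majority_prefers(x, y, voters)
    for x in "ABC" for y in "ABC" if x != y
}

# The majority prefers A to B, B to C, and C to A: a cycle.
for pair in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(pair, pairwise[pair])
```

Each individual here has transitive preferences, but the majority relation does not, so there is no single ranking you can point to as "what the group wants" without picking a further rule.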
So my conclusion is: yes, I really do bite the bullet here and support "everyone has their own personal superintelligence that they can brainwash to do whatever they want".
This is an uncomfortable conclusion to come to, but I do suspect it will lead to better modeling of people's values.
Final notes: I do want to point out a comment I made that seems relevant to this comment with slight modifications:
One important implication of the post relating to AI Alignment: It is impossible for AI to be aligned with society, conditional on the individuals not being all aligned with each other. Only in the N=1 case can guaranteed alignment be achieved.
In the pointers ontology, you can't point to a real world thing that is a society, culture or group having preferences or values, unless all members have the same preferences.
And thus we need to be more modest in our alignment ambitions. Only AI aligned to individuals is at all feasible. And that makes the technical alignment groups look way better.
It's also the best retort to attempted collectivist cultures and societies.
I admit, my views on this generally favor the first interpretation over the second with regard to which alignment goals to favor, and I generally don't think that targeting the second goal makes any sense.
I'll mention my own issues with IBP, and where the fatal issue lies in my opinion.
The most fatal objection is, as you said, the monotonicity principle issue. I suspect this is an issue because IBP is trying to unify capabilities and values/morals, when I think they are strictly separate types of things, and in general the unification heuristic is going too far.
To be honest, if Vanessa managed to focus on how capable the IBP agent is, without trying to shoehorn an alignment solution into it, I think the IBP model might actually work.
I disagree on whether maximization of values is advisable, but I agree that the monotonicity principle is pointing to a fatal issue in IBP.
Another issue is that it's trying to solve an impossible problem: it's trying to stop simulation hypotheses from forming even if the AI already has a well calibrated belief that we are being simulated by a superintelligence. But even under the most optimistic assumptions, if the AI is actually acausally cooperating with the simulator, we are no more equipped to fight against it than we are against alien invasions. Worst case, it would be equivalent to fighting an omnipotent and omniscient god, which is pretty obviously unsolvable.
I think that the best case for full automation is you get the best iteration speeds, and iteration matters more than virtually anything else for making progress.
This is potentially one of the biggest value propositions of AI: the ability to iterate very fast on something is arguably its main value proposition.
This is also why uploads would be huge for the economy, due to their ability to copy and iterate at a vastly higher level on vastly larger scales.
This also implies that existential or catastrophic risk would be a significant issue, depending on how amenable it is to iteration. Indeed, I think a lot of debates on pessimism vs optimism should focus on how iterable particular risks are.
Admittedly, I got that from the Deceptive alignment is <1% likely post.
Even if you don't believe that post, Pretraining from human preferences shows that instilling alignment with human values first as a base goal, thus outer aligning the model before giving it world modeling capabilities, works wonders for alignment and has many benefits compared to RLHF.
Given the fact that it has a low alignment tax, I suspect that there's a 50-70% chance that this plan, or a successor, will be adopted for alignment.
Here's the post:
To focus on why I don't think LLMs have an inner life that qualifies as consciousness: I think it has to do with the lack of writable memory under the LLM's control, so there's no space to store its subjective experiences.
Gerald Monroe mentioned that current LLMs don't have memories that last beyond the interaction, which is a critical factor for myopia, and in particular prevents deceptive alignment from happening.
If LLMs had memory that could be written to in order to store their subjective experiences beyond the interaction, this would make them conscious, and would also make it way easier for an LLM AI to do deceptive alignment, as it's then easy to be non-myopic.
But writable memory under the control of the LLM is critically absent from current LLMs (though GPT-4 and PaLM-E may have writable memories under the hood).
Writable memory that can store anything is the reason why consciousness can exist at all in humans without appealing to theories that flat out cannot work under the current description of reality.
I do think there's a bit more lurking here, and the basic implication of Dan Luu's tweets is that you can have only one priority at all: two is already a mess where nothing gets done, and it gets worse with the number of priorities you have.