Software Engineer interested in AI and AI safety.
Here's an argument for why current alignment methods like RLHF are already much better than what evolution can do.
Evolution has to encode information about the human brain's reward function in roughly 1 GB of genetic information, which means it may have to rely on a lot of simple heuristics that don't generalize well, such as "sweet foods are good".
In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is roughly 100 GB of information. The capacity of these reward models to encode complex human values may therefore already be about two orders of magnitude larger than that of the human genome, and this advantage will probably increase as models get larger.
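A rough back-of-the-envelope check of those numbers (the 4-bytes-per-parameter assumption, i.e. fp32, is mine):

```python
# Illustrative arithmetic only; assumes 4 bytes (fp32) per parameter.
reward_model_params = 25e9                    # ~25B parameters
reward_model_bytes = reward_model_params * 4  # ~100 GB
genome_bytes = 1e9                            # ~1 GB of genetic information

ratio = reward_model_bytes / genome_bytes
print(f"Reward model: ~{reward_model_bytes / 1e9:.0f} GB")
print(f"Genome:       ~{genome_bytes / 1e9:.0f} GB")
print(f"Ratio:        ~{ratio:.0f}x (~2 orders of magnitude)")
```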
One thing I've noticed is that current models like Claude 3.5 Sonnet can now generate non-trivial 100-line programs, such as small games, that work in one shot with no syntax or logical errors. I don't think that was possible with earlier models like GPT-3.5.
I donated $100, roughly equivalent to my yearly spending on Twitter/X Premium, because I believe LessWrong offers similar value. I would encourage most readers to do the same.
Update: I've now donated $500 in total for philanthropic reasons.
If you're interested in doing a PhD in AI in the UK, I recommend applying for the Centres for Doctoral Training (CDTs) in AI such as:
Note that these programs are competitive, with an acceptance rate of roughly 10%.
I agree. I don't see a clear distinction between what's in the model's predictive model and what's in the model's preferences. Here is a line from the paper "Learning to summarize from human feedback":
"To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x."
Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
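Here's a minimal sketch of that setup (assuming a Hugging Face-style causal LM; the class and variable names are mine, not from the paper):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Pretrained LM body plus a randomly initialized linear head that outputs a scalar reward."""

    def __init__(self, model_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)              # pretrained weights
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)  # randomly initialized

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Summarize each sequence with the hidden state of its final (non-padding) token.
        last_token = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_token]
        return self.reward_head(pooled).squeeze(-1)                        # one scalar per sequence

# The paper's training objective is then to predict the human-preferred summary,
# i.e. maximize log sigmoid(r(x, y_preferred) - r(x, y_rejected)).
```

Because the backbone starts from the pretrained weights, the scalar head only has to learn how to read preferences out of features the language model already has.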
I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.
I'll use the definition of optimization from Wikipedia: "Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives".
Best-of-n or rejection sampling is an alternative to RLHF which involves generating several responses from an LLM and returning the one with the highest reward model score. I think it's reasonable to describe this process as optimizing for reward because it's searching for LLM outputs that achieve the highest reward from the reward model.
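A minimal sketch of best-of-n sampling (the `generate` and `score` interfaces are assumptions, not any particular library's API):

```python
def best_of_n(prompt, llm, reward_model, n: int = 16):
    """Generate n candidate responses and return the one the reward model scores highest."""
    candidates = [llm.generate(prompt) for _ in range(n)]         # sample n responses
    scores = [reward_model.score(prompt, c) for c in candidates]  # scalar reward per response
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```

Note that the LLM's weights are never updated here; all of the optimization pressure comes from the search over samples.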
I'd also argue that AlphaGo/AlphaZero is optimizing for reward. In the AlphaGo paper it says, "At each time step $t$ of each simulation, an action is selected from state $s_t$ so as to maximize action value plus a bonus", and the formula is $a_t = \operatorname{argmax}_a \big( Q(s_t, a) + u(s_t, a) \big)$, where $u(s_t, a)$ is an exploration bonus.
Action values Q are calculated as the mean value (estimated probability of winning) of all board states in the subtree below an action. The value of each possible future board state is calculated using a combination of a value function estimation for that state and the mean outcome of dozens of random rollouts until the end of the game (return +1 or -1 depending on who wins).
The value function predicts the return (the expected sum of future rewards) from a position, whereas the random rollouts compute the actual average reward by simulating future moves until the end of the game, when the reward function returns +1 or -1.
So I think AlphaZero is optimizing for a combination of predicted reward (from the value function) and actual reward, which is calculated using multiple rollouts until the end of the game.
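For concreteness, here is a rough sketch of that selection step (the tree node structure and the exact form of the exploration bonus are my approximations of the paper's PUCT rule, not code from DeepMind):

```python
import math

def select_action(node, c_puct: float = 1.0):
    """Pick the action maximizing Q(s, a) + u(s, a), where the exploration bonus u
    is proportional to the prior P(s, a) and shrinks as the visit count N(s, a) grows."""
    total_visits = sum(child.visit_count for child in node.children.values())

    def puct_score(child):
        q = child.total_value / child.visit_count if child.visit_count > 0 else 0.0   # mean value Q(s, a)
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)  # exploration bonus u(s, a)
        return q + u

    return max(node.children.items(), key=lambda item: puct_score(item[1]))[0]
```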
SummaryBot summary from the EA Forum:
Executive summary: Geoffrey Hinton, a pioneer in AI, discusses the history and current state of neural networks, and warns about potential existential risks from superintelligent AI while suggesting ways to mitigate these risks.
Key points:
Maybe. The analogy he gives is that the AI could be like a very intelligent personal assistant to a relatively dumb CEO. The CEO is still in charge but it makes sense to delegate a lot of tasks to the more competent assistant.
The parent-and-child outcome seems a bit worse than that, because a small child is usually completely dependent on their parent, and all of their resources are controlled by the parent unless they have pocket money or something like that.
Why? o1 is much more capable than GPT-4o at math, programming, and science.