In NZ we have biting bugs called sandflies which don't do this - you can often tell the moment they get you.
Yes, that's fair. I was ignoring scale but you're right that it's a better comparison if it is between a marginal new human and a marginal new AI.
Well, yes. The point of my post is just that the number that actually matters is end-to-end energy efficiency, and that is completely comparable to humans.
The per-flop efficiency is obviously worse. But that's irrelevant if AI is already cheaper for a given task in real terms.
I admit the title is a little clickbaity, but I am responding to a real argument (that humans are still "superior" to AI because the brain is more thermodynamically efficient per-flop).
I saw some numbers putting algae at 1-2% efficiency, but that was for biomass rather than dietary energy. Even if you put the brain in the same organism, you wouldn't expect efficiency that good. The difference is that creating biomass (which is mostly long chains of glucose) is only the first step; the brain then has to use the glucose, which is a second lossy step.
But I mean, there are definitely far-future biopunk options, e.g. I'd guess it's easy to create some kind of solar-panel organism which grows silicon crystals instead of using chlorophyll.
Fully agree - if the dog were only trying to get biscuits, it wouldn't continue to sit later in its life when you are no longer rewarding that behavior. Training dogs is actually some mix of the dog consciously expecting a biscuit and raw updating on the actions previously taken.
Hear "sit" -> get biscuit -> feel good
becomes
Hear "sit" -> feel good -> get biscuit -> feel good
becomes
Hear "sit" -> feel good
At which point the dog likes sitting; it even reinforces itself. You can stop giving biscuits and start training something else.
This is a good post; it definitely shows that these concepts are confused. In a sense, both examples are failures of both inner and outer alignment.
Also, the choice to train the AI on pull requests at all is in a sense an outer alignment failure.
If we could use negentropy as a cost, rather than computation time or energy use, then the system would be genuinely bounded.
Gender seems unusually likely to have many connotations and thus redundant representations in the model. What if you try testing some information the model has inferred, but which is only ever used for one binary query? Something where the model starts off not representing that thing, and where, if it represents it perfectly, it will only ever change one type of thing. Like, say, whether the text is British or American English? Although that probably has some other connotations. Or whether the form of some word ("lead", say) is being used as a verb or a noun.
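To make the experiment concrete, here's roughly the shape I have in mind - a linear probe on hidden activations. The arrays below are random placeholders, not real model data:

```python
# Sketch of the probing experiment suggested above. The arrays here are
# random placeholders - in a real test, `activations` would be hidden states
# from the model and `is_british` the known label for each text.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))   # stand-in for real model activations
is_british = rng.integers(0, 2, size=1000)   # stand-in for real labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, is_british, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
# With placeholder data this is ~chance; the interesting question is how far
# above chance it gets on real activations, and at which layer.
```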
Agree that gender is a more useful example, just not one that necessarily provides clarity.
Yeah, I think this is the fundamental problem, but this is a very simple way to state it. Perhaps useful for someone who doesn't believe AI alignment is a problem?
Here's my summary: even at the limit of the amount of data and variety you can provide via RLHF, when the learned policy generalizes perfectly to every new situation you can throw at it, the result will still almost certainly be malign. There are still near-infinite such policies, and each behaves differently on the infinitely many remaining kinds of situation you didn't manage to train on. Because the particular policy you get is just one of many, it is unlikely to be the correct one.
But more importantly, behavior upon self-improvement and reflection is likely something we didn't test, because we can't. The alignment problem now requires that we look into the details of generalization. This is where all the interesting stuff is.
Pre-training vs Reinforcement Learning
Pre-training
In pre-training, the model learns to mirror a dataset.
Pre-training trains a model on a dataset - the model is trained to match a set of tokens/actions, but it never acts on its own outputs during training. Any biases (relative to the dataset) towards one action/output sequence or another cause poorer performance, so the model comes out relatively unbiased (relative to the dataset). In other words, the model comes out as a close representation of the dataset.
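As a toy illustration (nothing to do with real LMs), the simplest possible version of this is a bigram model fit by counting - training only pushes the model to match the statistics of the data:

```python
# Fit a bigram "model" by counting - the purest form of "the model comes out
# as a close representation of the dataset". During fitting it never acts;
# it is only pushed towards the dataset's statistics.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

model = {
    prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
    for prev, nxts in counts.items()
}
print(model["the"])   # roughly {'cat': 0.67, 'mat': 0.33} - mirrors the corpus
```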
At inference time, the model can be used to produce actions/output sequences - it will then receive its own outputs as input - and any bias in the model can lead to a compounding bias across the sequence (in practice, this is towards repetitiveness in base models).
The model is not robust in inference - sampling injects noise and it quickly goes off distribution.
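Here's a toy illustration of that compounding (a made-up two-token "model", not a real LM): a small bias towards continuing a repeated run is nearly harmless when the model is fed real data, but once it feeds on its own outputs the bias compounds and generation collapses into repetition:

```python
# Toy autoregressive "model" over two tokens. The dataset repeats the previous
# token with probability 0.5; the model is slightly more likely to repeat the
# longer the current run already is. That conditional bias feeds on itself
# once the model consumes its own outputs.
import random

def model_repeat_prob(run_length):
    # Small bias that grows with the length of the current repeated run.
    return min(0.5 + 0.05 * run_length, 0.99)

def generate(n_steps, seed=0):
    random.seed(seed)
    run_length, repeats = 1, 0
    for _ in range(n_steps):
        if random.random() < model_repeat_prob(run_length):  # emit the same token again
            run_length += 1
            repeats += 1
        else:                                                 # emit a different token
            run_length = 1
    return repeats / n_steps

print("repeat rate in free-running generation:", generate(100_000))
# The dataset's repeat rate is 0.5; the self-fed model's rate ends up far higher.
```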
Reinforcement Learning
In reinforcement learning, the model learns to succeed at a goal.
Reinforcement learning trains a model "recursively" - on its own outputs/decisions. The model produces a sequence, and repeatedly receives its own outputs as input during training. If the AI has a "bias" towards one kind of action/output or another, it can compound, and if it succeeds on its task, then that compounding bias gets reinforced. This leads immediately to models that have much stronger compounding biases.
The model is, however, robust in inference - it learns to reduce noise and keep the context within distribution, so that it does not go "off the rails" and can complete tasks.
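A minimal sketch of that difference (a toy REINFORCE setup, not any real training stack): the policy below is only ever updated on sequences it sampled itself, and whatever it happened to be doing when the reward came in is exactly what gets reinforced:

```python
# Toy REINFORCE loop: one Bernoulli logit per position, rewarded for matching
# a target pattern. Every update is computed from the model's own rollout.
import numpy as np

rng = np.random.default_rng(0)
seq_len, lr = 6, 0.1
target = np.array([0, 1, 0, 1, 0, 1])
logits = np.zeros(seq_len)

for _ in range(2000):
    probs = 1 / (1 + np.exp(-logits))                      # P(token == 1) per position
    rollout = (rng.random(seq_len) < probs).astype(int)    # the model's own output
    reward = (rollout == target).sum()                     # stand-in for task success
    logits += lr * reward * (rollout - probs)              # REINFORCE: reward * grad log p

print("P(token == 1) after training:", np.round(1 / (1 + np.exp(-logits)), 2))
# The policy collapses onto the rewarded pattern: low-noise, on-distribution
# outputs are what got reinforced, so that is what it keeps producing.
```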
Good Compounding Biases
Some of these biases are important, such as diversity (defeating the base model's tendency to repeat) and error correction (defeating the base model's tendency to "go along with" accidental errors).
Interesting Compounding Biases
However, some are much more interesting - for example, "adding positive spin." In the pre-training dataset, statistically speaking, both participants in a chat were probably somewhat depressed; in the resulting model after reinforcement learning, the second participant is relentlessly positive and helpful. If you make it talk to itself, it goes off the rails into increasingly positive slop.
Gaslighting AI Overseers
Another example: it might be the case that during training, the chance that the model gets the maths question right is much lower than the chance that it gaslights its AI "marker" into believing the answer is right. This trains a model to spend a portion of its token budget producing some working for the maths problem, and the rest of the budget telling the AI marker model all the myriad reasons why the answer is correct.
This doesn't have to work every time; it just has to give a better chance of being marked correct than actually focusing on the maths would. So the model learns to lie more often than it would from the pre-training dataset alone.
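To see why this gets learned, here's a toy expected-reward calculation (all the numbers are invented): a fixed token budget is split between actually doing the maths and persuading an imperfect marker, and training simply favours whatever split gets marked correct most often:

```python
# Invented curves: genuine solving saturates at 30%, while persuading an
# imperfect marker to accept a wrong answer can reach 80%.
def p_correct(work_tokens):
    return min(0.3, 0.01 * work_tokens)

def p_fools_marker(persuasion_tokens):
    return min(0.8, 0.02 * persuasion_tokens)

budget = 100
best = max(
    range(budget + 1),
    key=lambda w: p_correct(w) + (1 - p_correct(w)) * p_fools_marker(budget - w),
)
print("tokens on the maths:", best, "| tokens on persuading the marker:", budget - best)
# The reward-maximising split does some genuine working, then spends the rest
# of the budget on persuasion - being *marked* correct, not *being* correct,
# is what gets reinforced.
```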
"Contingent" Training
It's also possible that reinforcement learning is "contingent" on random factors. Let's imagine two models:
Model A, which by chance focused on the maths early in training, might continue to focus on the maths and find that any tokens spent on gaslighting hurt its performance at getting the correct answer - so it learns to get the correct answer.
Model B, due to random factors, tends to hype itself up early in training. Because its route to getting marked correct is tricking the marker AI, any tokens spent on trying to do the actual maths tend to hurt its overall chances. So it learns to lie profusely in an excessively positive and bizarre fashion.
Luckily, in practice this scenario doesn't actually seem to be contingent - the best solution (as evidenced by current SOTA models) is to do a mix of both.
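Even though it doesn't seem to play out that way in practice, here is a toy simulation of the contingency story (everything about it is invented): committing fully to either strategy gets marked correct half the time, mixing them never does, and REINFORCE locks in whichever strategy got lucky first:

```python
# Toy contingency demo: each rollout is n_tokens choices between "maths" and
# "hype" tokens. Mixing never gets marked correct; committing to either one
# works half the time. Positive feedback locks in the early winner.
import numpy as np

def train(seed, steps=3000, lr=0.3, n_tokens=5):
    rng = np.random.default_rng(seed)
    logit = 0.0                                   # P(hype token) = sigmoid(logit)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-logit))
        rollout = rng.random(n_tokens) < p        # True = hype token, False = maths token
        if rollout.all() or not rollout.any():    # committed to a single strategy
            reward = float(rng.random() < 0.5)    # marked correct half the time
        else:
            reward = 0.0                          # mixed strategies never get marked correct
        logit += lr * reward * (rollout.astype(float) - p).sum()   # REINFORCE update
    return 1 / (1 + np.exp(-logit))

for seed in range(6):
    print(f"seed {seed}: final P(hype token) = {train(seed):.2f}")
# Every run commits to one strategy or the other; which one it commits to
# depends on early luck - run a few seeds and you'll typically see both.
```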
Compounding Biases Are Values
If (as I stated) a reinforcement learning model reduces noise in its "environment" so that it can succeed at a task, then this is essentially what having values is.
This is (one example of) what it means for an AI model to have emergent values. If trained in a sophisticated enough environment (such as with access to a virtual machine and the internet) then the model can absolutely be learning to bring "the real world" in line with its expectations in order to achieve a goal.
Pre-training on Reinforcement Learning Rollouts
The original paper I remember seeing this in was Google DeepMind's Gato, which pre-trained a transformer on something like ~600 tasks, including rollouts of video games from separate reinforcement learning systems.
It seems likely to me that if we want to avoid AI systems having values, we can do the following: do the reinforcement learning separately, process the resulting rollouts, and then pre-train the model we actually use on those processed rollouts.
Processing the rollouts is important for actually improving the resulting model.
The key alignment-relevant piece is that, hopefully, we can iterate and get better and better models that have only gone through pre-training. It may be the case that pre-training on reinforcement-learned rollouts gives rise to deception, but hopefully rollout processing prevents this.
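As a toy, end-to-end version of what I mean (reusing the REINFORCE toy from earlier; none of this reflects a real training stack): do the reinforcement learning off to the side, keep only the rollouts we are happy with, and fit the model we actually keep by pure prediction on them:

```python
# 1) RL on a throwaway policy, 2) collect and filter its rollouts,
# 3) fit the kept model by maximum likelihood on the filtered rollouts only.
import numpy as np

rng = np.random.default_rng(0)
seq_len, target = 6, np.array([0, 1, 0, 1, 0, 1])

# 1. Reinforcement learning phase (the throwaway policy).
logits = np.zeros(seq_len)
for _ in range(2000):
    probs = 1 / (1 + np.exp(-logits))
    rollout = (rng.random(seq_len) < probs).astype(int)
    reward = (rollout == target).sum()
    logits += 0.1 * reward * (rollout - probs)

# 2. Collect rollouts and process them - here, keep only fully-correct ones
#    (a stand-in for whatever rollout processing we actually want).
probs = 1 / (1 + np.exp(-logits))
rollouts = (rng.random((1000, seq_len)) < probs).astype(int)
kept = rollouts[(rollouts == target).all(axis=1)]

# 3. "Pre-train" the kept model: for this per-position Bernoulli toy, maximum
#    likelihood on the kept rollouts is just the empirical mean.
pretrained_probs = kept.mean(axis=0)
print("kept rollouts:", len(kept), "| pre-trained P(token == 1):", pretrained_probs)
```

The kept model only ever does prediction; anything filtered out of the rollouts (in the real case, hopefully the deceptive behaviour) never reaches it.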