Sharing representations between different neural networks does not require superintelligence
I don’t think you can train one transformer on a dataset that mentions fact Y but doesn’t contain any mentions of fact X, train a second transformer on a dataset that contains X but not Y, and then easily share the knowledge of X and Y between them
If you were more like the person you wish to be, and you were smarter, do you think you’d still want our descendants, when they need to optimise, to leave alone beings who’d prefer to be left alone? If you would still want that, why is it not CEV?
A guy from Conjecture told me about a proposal along the lines of “let’s create a human-level AI system that’s built kind of like humans and is safe to run at a human speed”, and it seemed like a surprisingly bad proposal, so I looked up this post and it still looks surprisingly bad:
Even if you succeed at this, how exactly do you plan to use it? Running a single human-equivalent system at human speed seems like the kind of thing one can get by simply, you know, hiring someone; running a thousand of these things at 1000x normal speed means you’re running some completely ...
The bias I'm talking about isn't in its training data, it's in the model, which doesn't perfectly represent the training data.
If you designed an aligned AI system that successfully helps prevent the destruction of the world until you figure out how to make an AI that correctly does CEV, you have solved alignment. The issue is that without understanding minds to a sufficient level and without solving agent foundations, I don't expect you to be able to design a system that avoids all the failure modes that happen by default. Building such a syste...
Q: Why isn’t this optimised against during training? How is having this dynamic helpful for predicting what a human scientist will say?
The structure of the post might already be too spiral and confusing, so I’m answering this in a comment.
A: The model doesn't generate text during training, so feedback loop dynamics are not directly penalized.
Being able to generally predict how parts of humans work, when humans notice something weird, or when human authors want the characters to break the 4th wall, understanding how agents generally operate and how hu...
The explain-it-like-I’m-5 version would be something more in the direction of:
“You trained LLMs to look at a text and think really hard about what would be the next word. They’re not infinitely smart, so they have some habits of how they pick the next word. When they were trained, these habits were useful.
The computer picks one of the words the LLM predicted as the likely next one, and adds it to the text, and then repeats the process, so the LLM has to look at the text with the previous word just added and use its habits again and again, many many times, and this...
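(To make that loop concrete, here’s a minimal toy sketch of the pick-a-word-and-feed-it-back process. The “model” here, its tiny vocabulary, and its habit of repeating itself are all made up for illustration; a real LLM would supply the probability distribution instead.)

```python
import random

# Tiny stand-in vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "."]

def toy_model(tokens):
    """Stand-in for an LLM: returns a probability distribution over VOCAB
    given the text so far. Its made-up 'habit' is to slightly prefer
    repeating the last word."""
    weights = [1.0] * len(VOCAB)
    if tokens and tokens[-1] in VOCAB:
        weights[VOCAB.index(tokens[-1])] += 2.0
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt, n_steps, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_steps):
        probs = toy_model(tokens)                      # model looks at the whole text so far
        word = rng.choices(VOCAB, weights=probs)[0]    # sampler picks one of the likely words
        tokens.append(word)                            # ...and the word is fed back in next step
    return tokens

out = generate(["the"], 10)
print(out)
```

The point of the sketch is just the shape of the loop: the model’s “habits” get applied again and again to text that the model itself (plus the sampler) produced.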
“There can potentially be conditions for a Sharp Left Turn-like dynamic in a distribution of simulacra LLMs think about, because LLMs might naturally select entities better at increasing their influence”.
Or, since people mostly don’t understand what a Sharp Left Turn is, maybe “some relevant parts of the model’s cognition might be gradually taken over by increasingly agentic simulacra, until some specific coherent goal-oriented entity is in control”
By myopic I mean https://www.lesswrong.com/tag/myopia — that it was trained to predict the next token and doesn’t get much lower loss from having goals about anything longer-term than predicting the next token correctly.
I assume the weights are frozen, I’m surprised to see this as a question.
Some quick replies off the top of my head: if GPT-7 has a much larger context window; or if there are kinds of prompts the dynamic converges to that aren’t too long; and you get an AGI that’s smart and goal-oriented and needs to spend some of the space that it has to ...
Huh? The claim in this post is that this might happen even if you don’t explicitly design an AutoGPT-like prompt.
Also, the current LLMs probably don’t spend many resources on simulating themselves generating the text; this didn’t help at all during training, and LLMs aren’t some magical computable Solomonoff induction that converges to itself. You wouldn’t see LLMs trained just to predict the next token in a pre-2021 dataset outputting some random token with high certainty because they know that if they output it and it’s chosen, it becomes the correct next token.
Is it fair to say that you’re trying to:
(Hi Johannes!)
I strongly disagree. While it’s ok to be in whatever state of mind causes you to suffer less and be more productive, I don’t think it makes sense to literally forget what it is that you’re trying to achieve. If nothing you do influences the outcome at all, if there’s literally no way to influence the p(doom), then what you do is not what Yudkowsky uses the word “dignity” for. Dignity isn’t about living for longer, and it isn’t about doing something that has no chance of working. Don’t do things you think have literally no chance of actuall...
Anthropic Immortality Alignment
Bro just initialize an enormous neural network with terabytes of quantum randomness; in some branches, it’s already aligned AGI
How do I test whether I actually understand the sort of thing Nate describes and not just consider it obvious in hindsight? (I feel like I was able to explain it to people before reading this post)
The new page will also resemble a random string composed of "0" and "1", so this process will continue indefinitely.
I don't think this is true: there will be patterns emerging, and anything that appeared randomly and looks slightly less than random will make future continuations even less random until the LLM converges to something full of patterns.
Another way to say it: if it were always outputting 50/50 for 0 and 1, and you were running the process indefinitely, it would necessarily go through a string of all zeros, and I don't think you think it would o...
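(A quick numeric sketch of that 50/50 baseline: under a truly memoryless fair coin, any specific length-n continuation has probability 2^-n, yet long samples still contain runs and local patterns. The function and numbers below are purely illustrative; the argument above is that a real LLM, unlike this coin, conditions on those patterns once they appear.)

```python
import random

def longest_run(bits):
    """Length of the longest run of identical symbols in a sequence."""
    best = cur = 0
    prev = None
    for b in bits:
        cur = cur + 1 if b == prev else 1
        prev = b
        best = max(best, cur)
    return best

# Any one specific 20-bit string under a memoryless fair coin:
p_specific = 0.5 ** 20
print("P(specific 20-bit string) =", p_specific)

# Even genuinely random output contains runs that "look like" patterns:
rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(1 << 16)]
print("longest run in 65536 fair flips:", longest_run(bits))
```

The difference claimed in the comment above is what happens *after* such a run appears: the coin’s future stays 50/50, while the LLM’s conditional predictions shift towards continuing the pattern.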
I don’t expect everyone to disregard the danger; I do expect most people building capable AI systems to continue to hide hard problems. Hiding the hard problems is much easier than solving them, but I guess produces plausible-sounding solutions just as well.
Roughly human-level humans don’t contribute significantly to AI alignment research and can’t be pivotally used. So I don’t think you think that a roughly human-level AI system can contribute significantly to AI alignment research. Maybe you (as many seem to) think that if someone runs not-that-superhuma...
In RLHF, gradient descent will steer the model towards being more agentic about the entire output (and, speculatively, more context-aware), because that’s the best way to produce a token on a high-reward trajectory. The lowest loss is achievable by a superintelligence that thinks about a sequence of tokens that would be best at hacking human brains (or a model of human brains) and implements it token by token.
That seems quite different from a GPT that focuses entirely on predicting the current token and isn’t incentivized to care about the tokens tha...
I’m surprised. It feels to me that there’s an obvious difference between predicting one token of text from a dataset and trying to output a token in a sequence with some objective about the entire sequence.
RLHF models optimize for the entire output to be rated highly, not just for the next token, so (if they’re smart enough) they perform better if they think about which current tokens will make it easier for the entire output to be rated highly (instead of outputting just one current token that a human would like).
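(A toy numeric contrast between the two objectives; all numbers, token names, and the “rater” below are made up for illustration, not any real training setup. Next-token prediction scores each position on its own, while an RLHF-style reward scores the whole sampled output, so a locally less likely token can still be the better move if it sets up a higher-reward continuation.)

```python
import math

def next_token_loss(prob_of_target):
    """Cross-entropy at a single position: depends only on this one token."""
    return -math.log(prob_of_target)

def sequence_reward(tokens):
    """Hypothetical rater that only likes the pair ('setup', 'payoff')
    as a whole output."""
    return 2.0 if tokens == ["setup", "payoff"] else 0.0

# At step 1, the next-token objective favours the locally likeliest word...
probs_step1 = {"plain": 0.6, "setup": 0.4}
greedy = max(probs_step1, key=probs_step1.get)

print("next-token favourite:", greedy)
print("whole-output reward if we follow it:", sequence_reward([greedy, "payoff"]))
print("whole-output reward for the 'plan':", sequence_reward(["setup", "payoff"]))
```

The gap between the two printed rewards is the incentive the comment above points at: under the sequence-level objective, the model does better by choosing tokens for what they set up, not for how likely they are in isolation.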
[RETRACTED after Scott Aaronson’s reply by email]
I'm surprised by Scott Aaronson's approach to alignment. He has mentioned in a talk that a research field needs at least one of two things: experiments or a rigorous mathematical theory, and so he's focusing on the experiments that are possible with the current AI systems.
The alignment problem is centered around optimization producing powerful consequentialist agents when you're searching in spaces that contain capable agents. The dynamics at the level of superhuman general agents are not something ...
Oh. I was thinking the diamond maximizer problem is about making the AI care about the specific goal of maximizing diamond, not about making the AI have some consistent goal slot instead of lots of cognitive spaghetti code. I think it’s simple to make a highly agentic AI with some specific goals it’s really good at maximizing (if you have infinite compute and don’t care about what these goals actually are; I have no idea how to point that at diamond). Should I write a description somewhere?
Thanks for the comment! (And sorry for the delayed reply: I was at a CFAR workshop when I posted this, and after the workshop took some days off.)
The text's target audience was two people who I'd expect to understand my intuitions, so I did not attempt to fully justify some claims. Added that to the post. Also, I apologize for the post title; it's a claim that I haven't justified in the text and that represents my views to a lesser extent. I stand by the other three mentioned claims. I'm not sure where exactly the crux lies, though. I'd be interested in h...
There’s a scan of 1 mm^3 of a human brain: 1.4 petabytes, with hundred(s?) of millions of synapses
https://ai.googleblog.com/2021/06/a-browsable-petascale-reconstruction-of.html
(“There’s no way” is too strong a claim. My expectation is that there’s a way to train something from scratch, using <1% of the compute that was used to train either of the LLMs, that works better.)
But I was talking about sharing the internal representations between the two already trained transformers.