Send me anonymous feedback: https://docs.google.com/forms/d/e/1FAIpQLScLKiFJbQiuRYBhrBbVYUo_c6Xf0f8DN_blbfpJ-2Ml39g1zA/viewform
Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.
Some quick info about me:
I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).
You can also find me on the EA Forum.
Feel free to reach out by sending me a PM here or on my website.
Yesterday, Paul Christiano announced that he had left OpenAI.
I think it's worth checking whether the manufacturer's website supports some verification procedure (in which the customer types in a unique code that appears on the respirator). Consider googling the term "[manufacturer name] validation".
(example)
To support and add to what ErickBall wrote: in my personal experience with respirators, one with headbands (rather than ear loops) and a nose clip + nose foam is more likely to seal well.
[COVID-19 related]
It was nice to see this headline:
The topic of risks related to morally relevant computations seems very important, and I hope a lot more work will be done on it!
My tentative intuition is that learning is not directly involved here. If the weights of a trained RL agent are no longer being updated after some point[1], I'd guess the model is similarly likely to experience pain before and after that point (assuming the environment stays the same).
Consider the following hypothesis, which does not involve a direct relationship between learning and pain: at sufficiently large scale (and in sufficiently complex environments), TD learning tends to create components within the network, call them "evaluators", that track certain metrics correlating with expected return. In practice, the model is trained to optimize directly for the output of the evaluators (so maximizing that output becomes the mesa-objective). Suppose we label possible outputs of the evaluators as "pain" and "pleasure". We then get something that seems analogous to humans: a human cares directly about pleasure and pain (things that correlated with expected evolutionary fitness in the ancestral environment), even when those things don't affect their evolutionary fitness accordingly (e.g. pleasure from eating chocolate, or pain from getting a vaccine shot).
In TD learning, if from some point onward the model always predicted the future perfectly, the TD error (and hence the gradient) would always be zero and no weights would be updated; see the toy sketch below. Also, if an already-trained RL agent is deployed and no reinforcement learning takes place after deployment (which seems like a plausible setup for products/services that companies sell to customers), the weights would obviously not be updated. ↩︎
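To illustrate the first point in that footnote, here is a minimal toy sketch (my own illustration, not code from any system discussed here): tabular TD(0) on a tiny deterministic chain whose value estimates are initialized to the exact values. Every TD error is zero, so the values never change, even though the agent keeps interacting with the environment.

```python
import numpy as np

# Toy illustration (hypothetical setup): tabular TD(0) on a 3-state deterministic
# chain 0 -> 1 -> 2 (terminal), under a fixed policy.
gamma = 0.9
rewards = {0: 1.0, 1: 0.0}      # reward received when leaving each non-terminal state
transitions = {0: 1, 1: 2}      # deterministic transitions

# Exact values for this chain: V(2) = 0, V(1) = 0 + gamma*V(2) = 0, V(0) = 1 + gamma*V(1) = 1
V = np.array([1.0, 0.0, 0.0])
alpha = 0.1

for episode in range(3):
    s = 0
    while s != 2:
        s_next = transitions[s]
        td_error = rewards[s] + gamma * V[s_next] - V[s]  # delta = r + gamma*V(s') - V(s)
        V[s] += alpha * td_error                          # the update is proportional to delta
        print(f"episode {episode}, state {s}: td_error = {td_error:+.3f}")
        s = s_next

# Every printed td_error is +0.000, so V never changes. With function approximation, the
# semi-gradient TD update is alpha * delta * grad V(s), so perfect prediction (delta = 0)
# likewise means zero gradient and frozen weights.
```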
My understanding is that the 2020 algorithms in Ajeya Cotra's draft report refer to algorithms that train a neural network on a given architecture (rather than algorithms that search for a good neural architecture etc.). So the only "special sauce" that can be found by such algorithms is one that corresponds to special weights of a network (rather than special architectures etc.).
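To make that reading concrete, here is a hypothetical sketch (the function and names are mine, not from the report): a "2020 algorithm" in this sense optimizes the weights of a fixed, given architecture, and the outer loop that searches over architectures is precisely what it does not include.

```python
import torch
import torch.nn as nn

def train_given_architecture(model: nn.Module, data_loader, epochs: int = 1) -> nn.Module:
    """A "2020 algorithm" in the sense above: SGD over the weights of a fixed architecture."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, labels in data_loader:
            optimizer.zero_grad()
            loss_fn(model(inputs), labels).backward()
            optimizer.step()
    return model  # the architecture is unchanged; any "special sauce" found lives in the weights

# What such an algorithm does *not* do is the outer loop of an architecture search, e.g.:
#   for candidate_architecture in search_space:
#       score = evaluate(train_given_architecture(candidate_architecture, data_loader))
# That outer loop is the kind of "special architecture" search I read the report as excluding.
```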
Great post!
we’ll either have to brute-force search for the special sauce like evolution did
I would drop the "brute-force" here (evolution is not a random/naive search).
Re the footnote:
This "How much special sauce is needed?" variable is very similar to Ajeya Cotra's variable "how much compute would lead to TAI given 2020's algorithms."
I don't see how they are similar.
One might argue:
We don't need the model to use so much optimization power that it breaks the operator. We just need it to perform roughly at human level, and then we can deploy many instances of the trained model and accomplish very useful things (e.g. via factored cognition).
So I think it's important to also note that getting a neural network to "perform roughly at human-level in an aligned manner" may be a much harder task than getting it to achieve maximal rating by breaking the operator. The former may be a much narrower target. This point is closely related to what you wrote here in the context of amplification:
Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like "being smart" and "being a good person" and "still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy", is a pretty huge ask.
It seems to me obvious, though this is the sort of point where I've been surprised about what other people don't consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku's Go play so well that a scholar couldn't tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator's abilities in addition to your own.
Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. [...]
It does seem useful to distinguish between thinking about what gradient hacking failures look like in worlds where they cause an existential catastrophe, and thinking about how best to pursue empirical research on gradient hacking today.
I couldn't find such an endorsement in Scott Alexander's linked post. The closest thing I could find was:
If I'm missing something, please let me know. I downvoted the OP and wrote this comment because I think and feel that such inaccuracies are bad (even if unintentional), and I don't want them to occur on LW.