Thanks for the questions :)
I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no,
but I'm broadly a bit confused about what period this is a commitment for.
Yeah, we haven't totally settled this yet; the application form asks a lot of questions about availability. I think the simplest, more specific answer is "you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer".
Also, are people going th
Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?
No, it seems very likely for the model to not say that it's deceptive, I'm just saying that the model seems pretty likely to think about being deceptive. This doesn't help unless you're using interpretability or some other strategy to evaluate the model's deceptiveness without relying on noticing deception in its outputs.
I’d call it our language model adversarial training project, maybe? Your proposal seems fine too
Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it were a competitive sport. My guess is still that the best humans would be worse than GPT-3, and I'm unsure whether they'd be worse than GPT-2.
(There's no limit on anyone spending a bunch of time practicing this game, if for some reason someone gets really into it I'd enjoy hearing about the results.)
The first thing I imagine is that nobody asks those questions. But let's set that aside.
I disagree fwiw
The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either.
Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a hu
What do you imagine happening if humans ask the AI questions like the following:
I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it's powerful enough to kill us all as a side effect of its god tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And ... (read more)
[writing quickly, sorry for probably being unclear]
If the AI isn't thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power.
The humans don't want to have the AI grab power, and so they'll try in various ways to make it so that they'll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will... (read more)
At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there even if the ants try in a coordinated way to resist them, and without the humans ever proactively trying to prevent the ants from resisting, e.g. by killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the
My guess is that the Long-Term Future Fund is the best you can do. (I'm a fund manager on a different EA fund.)
Ok, sounds like you're using "not too much data/time" in a different sense than I was thinking of; I suspect we don't disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult, and plausibly impossible, with any amount of practice.
That is, I suspect humans could be trained to perform very well, in the usual sense of "training" for humans where not too much data/time is necessary.
I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1). EDIT: These results are now posted here.
I'm wary of the assumption that we can judge "human ability" on a novel task X by observing performance after an hour of practice.
There are some tasks where performance improves with practice but plateaus within one hour. I'm thinking of relatively easy video games. Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies. But most interesting things that humans "can do" take much longer to learn than this.
Here are some things that humans "can do," but require >> 1 hour of practi... (read more)
I expect I would improve significantly with additional practice (e.g. I think a 2nd hour of playing the probability-assignment game would get a much higher score than my 1st in expectation). My subjective feeling was that I could probably learn to do as well as GPT-2-small (though estimated super noisily) but there's definitely no way I was going to get close to GPT-2.
Yes, humans are way worse than even GPT-1 at next-token prediction, even after practicing for an hour. EDIT: These results are now posted here.
(I run the team that created that game. I made the guess-most-likely-next-token game and Fabien Roger made the other one.)
The optimal strategy for picking probabilities in that game is to say what your probability for those two next tokens would have been if you hadn't updated on being asked about them. What's your problem with this?
It's somewhat unfortunate that this scoring system is kind of complicated. But I don't know how to construct a simpler game such that we can unbiasedly infer human perplexity from what the humans do.
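The property being aimed for here (that reporting your true probability is what maximizes your score) can be illustrated with the log scoring rule, used below as a simplified stand-in rather than the game's actual scoring system; the numbers are purely illustrative.

```python
import numpy as np

# Log scoring rule for a binary event: report probability q for the event,
# receive log(q) if it happens and log(1 - q) if it doesn't.
def expected_log_score(true_p, reported_q):
    return true_p * np.log(reported_q) + (1 - true_p) * np.log(1 - reported_q)

true_p = 0.3  # the player's actual belief about the event
candidates = np.linspace(0.01, 0.99, 99)  # reports 0.01, 0.02, ..., 0.99
scores = [expected_log_score(true_p, q) for q in candidates]
best_report = candidates[int(np.argmax(scores))]
print(best_report)  # expected score peaks at the true belief, 0.3
```

Any "proper" scoring rule has this shape: lying about your probability can only lose you points in expectation, which is what lets the experimenter read off the player's actual beliefs.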
I think that in that first sentence, OP is comparing PaLM to other large LMs rather than to Chinchilla.
Thanks for all these comments. I agree with a bunch of this. I might try later to explain more precisely where I agree and disagree.
(I'm very unconfident about my takes here)
IMO, I see burnout (people working hard at the expense of their long-term resources and capacities) more often than I expect is optimal for helping with AI risk.
How often do you think is optimal, if you have a quick take? I unconfidently think it seems plausible that there should be high levels of burnout. For example, I think there are a reasonable number of people who are above the hiring bar if they can consistently work obsessively for 60 hours a week, but aren't if they only work 35 hours a week. Such a person... (read more)
I appreciate this comment a lot. Thank you. I appreciate that it’s sharing an inside view, and your actual best guess, despite these things being the sort of thing that might get social push-back!
My own take is that people depleting their long-term resources and capacities is rarely optimal in the present context around AI safety. My attempt to share my reasoning is pretty long, sorry; I tried to use bolding to make it skimmable.
0. … (read more)
I agree that it's inconvenient that these two concepts are often both referred to with the same word. My opinion is that we should settle on using "infohazard" to refer to the thing you're proposing calling "sociohazard", because it's a more important concept that comes up more often, and you should make up a new word for "knowledge that is personally damaging". I suggest "cognitohazard".
I think you'll have an easier time disambiguating this way than disambiguating in the way you proposed, among other reasons because you're more influential among the people who primarily think of "cognitohazard" when they hear "infohazard".
Yeah, but you have the same problem if you were using all three of the nodes in the hidden layer.
Another way of phrasing my core objection: the original question without further assumptions seems equivalent to "given two global minima of a function, is there a path between them entirely comprised of global minima?", to which the answer is obviously no.
Various takes on this research direction, from someone who hasn't thought about it very much:
Given two optimal models in a neural network’s weight space, is it possible to find a path between them comprised entirely of other optimal models?
I think that the answer here is definitely no, without further assumptions. For example, consider the case where you have a model with two ReLU neurons that, for optimal models, are doing different things; the model where you swap the two neurons is equally performant, but there's no continuous path between these models.... (read more)
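The neuron-swap argument above can be made concrete with a tiny numpy sketch (a minimal two-neuron illustration, not any particular trained model): permuting two hidden ReLU neurons leaves the computed function unchanged, but the straight-line midpoint between the two weight settings computes something else, so the set of optimal models isn't convex and the two optima need not be connected by a path of optima.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0)

def forward(W1, W2, x):
    # A one-hidden-layer ReLU network.
    return W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))   # two hidden ReLU neurons
W2 = rng.normal(size=(1, 2))
x = rng.normal(size=3)

# Swap the two hidden neurons: permute the rows of W1 and columns of W2.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
W1s, W2s = P @ W1, W2 @ P

# The swapped model computes exactly the same function...
assert np.allclose(forward(W1, W2, x), forward(W1s, W2s, x))

# ...but the midpoint of the straight line between the two weight settings
# averages the two neurons together and almost always computes something else.
mid = forward((W1 + W1s) / 2, (W2 + W2s) / 2, x)
```

So if the two (equally performant) models are both optimal, the straight path between them passes through non-optimal models, which is why the question needs extra assumptions before the answer can be yes.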
I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge.
I feel kind of bad about some actions of mine related to this. (This has been on my list to write... (read more)
Which might have an advantage if it can care less about paying alignment taxes, in some ways.
I unconfidently suspect that human-level AIs won't have a much easier time with the alignment problem than we expect to have.
I totally agree. I quite like "Mundane solutions to exotic problems", a post by Paul Christiano, about how he thinks about this from a prosaic alignment perspective.
Ruby isn't saying that computers have faster clock speeds than biological brains (which is definitely true), he's claiming something like "after we have human-level AI, AIs will be able to get rapidly more powerful by running on faster hardware"; the speed increase is relative to some other computers, so the speed difference between brains and computers isn't relevant.
Also, running faster and duplicating yourself keeps the model human-level in an important sense. A lot of threat models run through the model doing things that humans can’t understand even given a lot of time, and so those threat models require something stronger than just this.
I agree we can duplicate models once we’ve trained them, this seems like the strongest argument here.
What do you mean by “run on faster hardware”? Faster than what?
I expect unaligned human-level AIs to try the same thing and have much more success because optimizing code and silicon hardware is easier than optimizing flesh brains.
I agree that human-level AIs will definitely try the same thing, but it's not obvious to me that it will actually be much easier for them. Current machine learning techniques produce models that are hard to optimize for basically the same reasons that brains are; AIs will be easier to optimize for various reasons but I don't think it will be nearly as extreme as this sentence makes it sound.
I naively expect the option of "take whatever model constitutes your mind and run it on faster hardware and/or duplicate it" should be relatively easy and likely to lead to fairly extreme gains.
But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!
I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at... (read more)
I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI", apologies if this was confusing.
My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your action is able to do one of these "positive-sum" pivotal acts in a single ac... (read more)
It seems pretty plausible that the AI will trade for compute with some other person around the world.
Whether this is what I'm trying to call a zero-sum action depends on whose resources it's trading. If the plan is "spend a bunch of the capital that its creators have given it on compute somewhere else", then I think this is importantly zero-sum--the resources are being taken from the creators of AI, which is why the AI was able to spend so many resources. If the plan was instead "produce some ten trillion dollar invention, then sell it, then use the proceeds to buy compute elsewhere", this would seem less zero-sum, and I'm saying that I expect the first kind of thing to happen before the second.
I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to give them power or compute of some kind. Also, all of course selected for plans that are least like
Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.
I’m looking forward to the day where it turns out that adding “Let’s think through it as if we were an AI who knows that if it gets really good scores during fine tuning on helpfulness, it will be given lots of influence later” increases helpfulness by 5% and so we add it to our prompt by default.
Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.
I suspect that some knowledge transfers. For example, I suspect that increasingly large LMs learn features of language roughly in order of their importance for predicting English, and so I'd expect that LMs that get similar language modeling losses usually know roughly the same features of English. (You could just run two LMs on the same text and see their logprobs on the correct next token for every token, and then make a scatter plot; presumably there will be a bunch of correlation, but you might notice patterns in the things that one LM did much better than the other.)
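The scatter-plot comparison might look something like the following sketch. The logprob arrays here are synthetic stand-ins (generated so the two "models" share most of their per-token difficulty); with real models you'd fill them by running each LM over the same token sequence and reading off log p(correct next token) at every position.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens = 1000

# Synthetic per-token logprobs for two hypothetical LMs on the same text:
# most of each token's difficulty is shared, plus some model-specific noise.
shared = rng.normal(loc=-3.0, scale=1.5, size=n_tokens)
lm_a = shared + rng.normal(scale=0.5, size=n_tokens)
lm_b = shared + rng.normal(scale=0.5, size=n_tokens)

# If both models track roughly the same features of English, the per-token
# logprobs should be strongly correlated.
corr = np.corrcoef(lm_a, lm_b)[0, 1]

# The interesting tokens are the ones far off the diagonal: things one
# model knows that the other doesn't.
outliers = np.argsort(np.abs(lm_a - lm_b))[-10:]
```

In a real comparison you'd then look at the actual text around the outlier positions to see what kind of knowledge fails to transfer.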
And the methodology for playing with LMs probably transfers.
But I generally have no idea here, and it seems really useful to know more about this.
Yeah I wrote an interface like this for personal use, maybe I should release it publicly.
I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI c... (read more)
Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.
This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which ... (read more)
I’m not that excited for projects along the lines of “let’s figure out how to make human feedback more sample efficient”, because I expect that non-takeover-risk-motivated people will eventually be motivated to work on that problem, and will probably succeed quickly given motivation. (Also I guess because I expect capabilities work to largely solve this problem on its own, so maybe this isn’t actually a great example?) I’m fairly excited about projects that try to apply human oversight to problems that the humans find harder to oversee, because I think that this is important for avoiding takeover risk but that the ML research community as a whole will procrastinate on it.
Are any of these ancient discussions available anywhere?
In hindsight this is obviously closely related to what Paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7
Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.
[epistemic status: speculative]
A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.
A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in ... (read more)
Something I think I’ve been historically wrong about: A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system. Similarly with debate... (read more)
Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central). A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups. In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).
Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).
It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.
FWIW if you look at Rob Bensinger's survey of people who work on long-term AI risk, the average P(AI doom) is closer to Ord than MIRI. So I'd say that Ord isn't that different from most people he talks to.
You might enjoy these posts where people argue for particular values of P(AI doom), all of which are much lower than Eliezer's:
Ok, that's a very reasonable answer.
What part of the ELK report are you saying felt unworkable?
By "checkable" do you mean "machine checkable"?
I'm confused because I understand you to be asking for a bound on the derivative of an EfficientNet model, but it seems quite easy (though perhaps kind of a hassle) to get this bound.
I don't think the floating point numbers matter very much (especially if you're ok with the bound being computed a bit more loosely).
Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.
You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
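The crude version of that bound can be sketched in a few lines of numpy. This uses a toy fully-connected ReLU net rather than an actual EfficientNet (assembling the per-layer norms for conv layers is the "hassle" part), but the principle is the same: ReLU is 1-Lipschitz, so the product of the weight matrices' spectral norms bounds how far the logits can move for a given input perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy 3-layer ReLU network standing in for a real image model.
Ws = [rng.normal(size=(64, 784)) / 28,
      rng.normal(size=(64, 64)) / 8,
      rng.normal(size=(10, 64)) / 8]

def forward(x):
    for W in Ws[:-1]:
        x = np.maximum(W @ x, 0)  # ReLU is 1-Lipschitz, so it can't amplify changes
    return Ws[-1] @ x

# Crude Lipschitz bound: the product of the spectral norms of the weights.
L = np.prod([np.linalg.norm(W, ord=2) for W in Ws])

# Flipping the least significant bit of one pixel perturbs the input by at
# most eps in L2 norm, so the logits move by at most L * eps.
eps = 1 / 255
x = rng.normal(size=784)
delta = np.zeros(784)
delta[0] = eps
observed = np.linalg.norm(forward(x + delta) - forward(x))
assert observed <= L * eps  # guaranteed, though usually very loose
```

This bounds the logits rather than the class probabilities directly, but softmax is itself Lipschitz, so a bound on the probabilities follows; the catch is just that the product-of-norms bound is typically far from tight, which is why the tighter certified-robustness techniques exist.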
Am I correct that you wouldn't find a bound acceptable, you specifically want the exact maximum?
Suppose you have three text-generation policies, and you define "policy X is better than policy Y" as "when a human is given a sample from both policy X and policy Y, they prefer the sample from the former more than half the time". That definition of "better" is intransitive.
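A concrete instance of the cycle, borrowing the classic nontransitive dice as stand-in "quality scores" for the samples (purely illustrative numbers: imagine each policy produces one of its three scores uniformly at random, and the human prefers whichever sample scores higher):

```python
from itertools import product

# Score sets for three hypothetical policies (the classic nontransitive dice).
A, B, C = [2, 4, 9], [1, 6, 8], [3, 5, 7]

def beats(x, y):
    # Fraction of (sample from x, sample from y) pairs where x is preferred.
    wins = sum(a > b for a, b in product(x, y))
    return wins / (len(x) * len(y))

print(beats(A, B), beats(B, C), beats(C, A))  # each is 5/9 > 0.5: a cycle
```

So A is "better" than B, B is "better" than C, and C is "better" than A, each by the same 5/9 margin, even though every individual comparison is perfectly well-defined.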
I think we prefer questions on the EA Forum.
Thanks, glad to hear you appreciate us posting updates as we go.