All of Buck's Comments + Replies

Buck1moΩ380

Thanks for the questions :)

I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no,

Probably no. 

but I am broadly a bit confused about what period this is a commitment for.

Yeah we haven't totally settled this yet; the application form asks a lot of questions about availability. I think the simplest more specific answer is "you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer".

Also, are people going th

... (read more)
5Ben Pace1mo
Thanks for the answers! :)
Buck2moΩ231

Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?

No, it seems very likely that the model won't say that it's deceptive; I'm just saying that the model seems pretty likely to think about being deceptive. This doesn't help unless you're using interpretability or some other strategy to evaluate the model's deceptiveness without relying on noticing deception in its outputs.

I’d call it our language model adversarial training project, maybe? Your proposal seems fine too

Buck4moΩ384

Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it was a competitive sport. I still think my guess is that the best humans would be worse than GPT-3, and I'm unsure if they're worse than GPT-2.

(There's no limit on anyone spending a bunch of time practicing this game, if for some reason someone gets really into it I'd enjoy hearing about the results.)

Buck4moΩ664

The first thing I imagine is that nobody asks those questions. But let's set that aside.

I disagree fwiw

The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either.

I agree.

Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a hu

... (read more)
1jacquesthibs2mo
Any additional or new thoughts on this? Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently? Do you believe it's way more likely that we'd be unable to prompt things out of the model only if it were deceptive? Could you say more? Separately: suppose I have a chain-of-thought model detailing the steps it will take to reach some outcome, and we've fine-tuned on previous chains of thought while giving process-level feedback. However, even if you are trying to get it to externalize its thoughts/reasoning, it could lead to extinction via side effect. So you might ask the model at each individual thought (or just the entire plan) if we'll be happy with the outcome. How exactly would the model end up querying its internal world model in the way we would want it to?
Buck4moΩ694

What do you imagine happening if humans ask the AI questions like the following:

  • Are you an unaligned AI?
  • If we let you keep running, are you (or some other AI) going to end up disempowering us?
  • If we take the action you just proposed, will we be happy with the outcomes?

I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it's powerful enough to kill us all as a side effect of its god tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And ... (read more)

5johnswentworth4mo
The first thing I imagine is that nobody asks those questions. But let's set that aside. The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether they'll be happy with the outcome or whatever. (It's essentially the same mistake as a GOFAI person looking at a node in some causal graph that says "will_kill_humans", and seeing that node set to 99% False, and thinking that somehow implies the GOFAI will not kill humans.) Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a human asks "Are you an unaligned AI?" or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it's an unaligned AI. The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it's an unaligned AI, and the humans interpret that as an answer to their original question, ther
Buck4moΩ254928

[writing quickly, sorry for probably being unclear]

If the AI isn't thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power.

The humans don't want to have the AI grab power, and so they'll try in various ways to make it so that they'll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will... (read more)

At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there even if the ants try in a coordinated way to resist them, and even if the humans never proactively try to stop that resistance by, e.g., killing all the ants.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the

... (read more)

Ok, sounds like you're using "not too much data/time" in a different sense than I was thinking of; I suspect we don't disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult, and plausibly impossible, with any amount of practice.

6jacob_cannell3mo
The human brain internally is performing very similar computations [https://www.nature.com/articles/s42003-022-03036-1] to transformer LLMs [https://www.biorxiv.org/content/10.1101/2022.06.08.495348v1.abstract] - as expected from all the prior research indicating strong similarity between DL vision features and primate vision - but that doesn't mean we can immediately extract those outputs and apply them towards game performance.

That is, I suspect humans could be trained to perform very well, in the usual sense of "training" for humans where not too much data/time is necessary.

 

I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1).

EDIT: These results are now posted here.

I'm wary of the assumption that we can judge "human ability" on a novel task X by observing performance after an hour of practice.

There are some tasks where performance improves with practice but plateaus within one hour.  I'm thinking of relatively easy video games.  Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies.  But most interesting things that humans "can do" take much longer to learn than this.

Here are some things that humans "can do," but require >> 1 hour of practi... (read more)

4Owain_Evans4mo
It could be useful to look at performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level at a foreign language. E.g. You might find GPT-3 is at a level on 15 different languages that would take a smart human (say) 30 months to achieve (2 months per language). Foreign languages are just a small fraction of the training data.

I expect I would improve significantly with additional practice (e.g. I think a 2nd hour of playing the probability-assignment game would get a much higher score than my 1st in expectation). My subjective feeling was that I could probably learn to do as well as GPT-2-small (though estimated super noisily) but there's definitely no way I was going to get close to GPT-2.

Yes, humans are way worse than even GPT-1 at next-token prediction, even after practicing for an hour.
EDIT: These results are now posted here

3Yitz4mo
Is there some reasonable-ish way to think about loss in the domain(s) that humans are (currently) superior at? (This might be equivalent to asking for a test of general intelligence, if one wants to be fully comprehensive)

(I run the team that created that game. I made the guess-most-likely-next-token game and Fabien Roger made the other one.)

The optimal strategy for picking probabilities in that game is to say what your probability for those two next tokens would have been if you hadn't updated on being asked about them. What's your problem with this?

It's kind of sad that this scoring system is kind of complicated. But I don't know how to construct simpler games such that we can unbiasedly infer human perplexity from what the humans do.

I think that in that first sentence, OP is comparing PaLM to other large LMs rather than to Chinchilla.

Thanks for all these comments. I agree with a bunch of this. I might try later to explain more precisely where I agree and disagree.

(I'm very unconfident about my takes here)

IMO, I see burnout (people working hard at the expense of their long-term resources and capacities) more often than I expect is optimal for helping with AI risk.

How often do you think is optimal, if you have a quick take? I unconfidently think it seems plausible that there should be high levels of burnout. For example, I think there are a reasonable number of people who are above the hiring bar if they can consistently work obsessively for 60 hours a week, but aren't if they only work 35 hours a week. Such a person... (read more)

3Sébastien Larivée4mo
Insight volume/quality doesn't seem meaningfully correlated with hours worked (see June Huh for an extreme example); high-insight people tend to have work schedules optimized for their mental comfort. I don't think encouraging someone who's producing insights at 35 hours per week to work 60 hours per week will result in more alignment progress, and I also doubt that the only people producing insight are those working 60 hours per week. EDIT: this of course relies on the prior belief that more insights are what we need for alignment right now.

I appreciate this comment a lot.  Thank you.  I appreciate that it’s sharing an inside view, and your actual best guess, despite these things being the sort of thing that might get social push-back!

My own take is that people depleting their long-term resources and capacities is rarely optimal in the present context around AI safety.

My attempt to share my reasoning is pretty long, sorry; I tried to use bolding to make it skimmable.

In terms of my inside-view disagreement, if I try to reason about people as mere means to an end (e.g. “labor”):

0. ... (read more)

6Nisan4mo
Taking on a 60-hour/week job to see if you burn out seems unwise to me. Some better plans:

  • Try lots of jobs on lots of teams, to see if there is a job you can work 60 hours/week at.
  • Pay attention to what features of your job are energizing vs. costly. Notice any bad habits that might cause burnout.
  • Become more productive per hour.

I agree that it's inconvenient that these two concepts are often both referred to with the same word. My opinion is that we should settle on using "infohazard" to refer to the thing you're proposing calling "sociohazard", because it's a more important concept that comes up more often, and you should make up a new word for "knowledge that is personally damaging". I suggest "cognitohazard".

I think you'll have an easier time disambiguating this way than disambiguating in the way you proposed, among other reasons because you're more influential among the people who primarily think of "cognitohazard" when they hear "infohazard".

3gjm4mo
To me "cognitohazard" seems like a good term for basilisks and their less exotic brethren -- things that can somehow mess up your thinking when you hear them -- but not for things more like spoilers. I'm not sure "infohazard" is great for that either but it seems less weird to me. (I don't think I would ever refer to a spoiler as either an "infohazard" or a "cognitohazard".) Separately: Perhaps "infohazard" is, at present, unfixably ambiguous and we should use (say) "cognitohazard" for things that are individually harmful and "sociohazard" for things that are collectively harmful, and "infohazard" not at all.
7Dweomite5mo
My personal exposure to the term "infohazard" comes primarily from fiction where it referred to knowledge that harms the knower. (To give an example I recently encountered: Worth the Candle.) My model predicts that getting scholars to collectively switch terminology is hard, but still easier than getting fiction authors to collectively switch terminology. I don't think there's any action that could plausibly be taken by the LessWrong community that would break the associations that "infohazard" currently has in fiction. Even if you could magically get all the authors to switch to "cognitohazard", I don't think that would help very much, because "infohazard" is similar enough that someone who isn't previously aware of a formal distinction between them is likely to map them onto the same mental bucket. If I had godlike powers to dictate what terms people use, I wouldn't use any term containing the word "hazard" to refer to information that is harmless to you but that someone else doesn't want you to know. This flies in the face of my intuitive sense of how the term "hazard" is commonly used. That's, like...imagine if some plutocrats were trying to keep most people poor so that they could control them better, and they started referring to money as "finance-hazard" or something; this term would strike me as being obviously an attempt at manipulation. If the person calling something a "hazard" does not themselves want to be protected from it, then I call BS.

Yeah, but you'd have the same problem if you were using all three of the nodes in the hidden layer.

1Quintin Pope5mo
Actual trained neural networks are known to have redundant parameters, as demonstrated by the fact that we can prune them so much.

Another way of phrasing my core objection: the original question without further assumptions seems equivalent to "given two global minima of a function, is there a path between them entirely comprised of global minima", which is obviously false.

2Stephen Fowler5mo
Hey Buck, love the line of thinking. We definitely aren't trying to say "any two ridges can be connected for any loss landscape" but rather that the ridges for overparameterised networks are connected.
1Thomas Larsen5mo
TL;DR: I agree that the answer to the question above definitely isn't always yes, because of your counterexample, but I think that moving forward on a similar research direction might be useful anyways. One can imagine partitioning the parameter space into sets that arrive at basins where each model in the basin has the same, locally optimal performance; this is like a Rashomon set (relaxing the requirement from global minima so that we get a partition of the space). The models which can compress the training data (and thus have free parameters) are generally more likely to be found by random selection and search, because the free parameters mean that the dimensionality of this set is higher, and hence exponentially more likely. Thus, we can move within these high-dimensional regions of locally optimal loss, which could allow us to find models that are more interpretable (or maybe more desirable along another axis), which is the stated motivation in the article. This seems super relevant to alignment! The default path to AGI right now to me seems like something like a LLM world model hooked up to some RL to make it more agenty, and I expect this kind of theory to apply to LLMs, because of the large number of parameters. I'm hoping that this theory gets us better predictions on which Rashomon sets are found (this would look like a selection theorem), and the ability to move within a Rashomon set towards parameters that are better. Such a selection theorem seems likely because of the dimensionality argument above.

Various takes on this research direction, from someone who hasn't thought about it very much:

Given two optimal models in a neural network’s weight space, is it possible to find a path between them comprised entirely of other optimal models?

I think that the answer here is definitely no, without further assumptions. For example, consider the case where you have a model with two ReLU neurons that, for optimal models, are doing different things; the model where you swap the two neurons is equally performant, but there's no continuous path between these models.... (read more)
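To make the symmetry in that counterexample concrete, here is a minimal sketch (a toy two-neuron ReLU layer, not any model discussed in the post) showing that swapping the hidden neurons gives a functionally identical model at a distinct point in weight space:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))  # hidden-layer weights, one row per ReLU neuron
w2 = rng.normal(size=2)       # output weights

def f(x, W1, w2):
    # two-neuron ReLU layer followed by a linear readout
    return w2 @ np.maximum(W1 @ x, 0.0)

# Swap the two hidden neurons: permute the rows of W1 and the entries of w2.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
W1_swapped, w2_swapped = P @ W1, P @ w2

x = rng.normal(size=2)
assert np.allclose(f(x, W1, w2), f(x, W1_swapped, w2_swapped))
# Identical function (hence identical loss), but a different point in weight
# space; if the two neurons do different things, there is no obvious continuous
# path of equally-performant models between the two parameter settings.
```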

1Lech Mazur5mo
Have you seen this paper [http://proceedings.mlr.press/v119/shevchenko20a.html]? They find that "SGD solutions are connected via a piecewise linear path, and the increase in loss along this path vanishes as the number of neurons grows large."
2DaemonicSigil5mo
In your example, I think even adding just one more node, h3, to the hidden layer would suffice to connect the two solutions. One node per dimension of input suffices to learn the function, but it's also possible for two nodes to share the task between them, where the share of the task they are picking up can vary continuously from 0 to 1. So just have h3 take over x2 from h2, then h2 takes over x1 from h1, and then h1 takes over x2 from h3.
5Buck5mo
Another way of phrasing my core objection: the original question without further assumptions seems equivalent to "given two global minima of a function, is there a path between them entirely comprised of global minima", which is obviously false.

I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge. 

I feel kind of bad about some actions of mine related to this. (This has been on my list to write... (read more)

1Olivier Faure5mo
Strong upvote for this. Doing things you find fun is extremely efficient. Studying things you don't like is inefficient, no matter how useful these things may turn out to be for alignment or x-risk.

Which might have an advantage if it can care less about paying alignment taxes, in some ways.

I unconfidently suspect that human-level AIs won't have a much easier time with the alignment problem than we expect to have.

2Lanrian6mo
Agree it's not clear. Some reasons why they might:

  • If training environments' inductive biases point firmly towards some specific (non-human) values, then maybe the misaligned AIs can just train bigger and better AI systems using similar environments that they were trained in, and hope that those AIs will end up with similar values.
  • Maybe values can differ a bit, and cosmopolitanism or decision theory can carry the rest of the way. Just like Paul says [https://ai-alignment.com/sympathizing-with-ai-e11a4bf5ef6e] he'd be pretty happy with intelligent life that came from a similar distribution that our civ came from.
  • Humans might need to use a bunch of human labor to oversee all their human-level AIs. The HLAIs can skip this, insofar as they can trust copies of themself. And when training even smarter AI, it's a nice benefit to have cheap copyable trustworthy human-level overseers.
  • Maybe you can somehow gradually increase the capabilities of your HLAIs in a way that preserves their values.
  • (You have a lot of high-quality labor at this point, which really helps for interpretability and making improvements through other ways than gradient descent.)

I totally agree. I quite like "Mundane solutions to exotic problems", a post by Paul Christiano, about how he thinks about this from a prosaic alignment perspective.

Ruby isn't saying that computers have faster clock speeds than biological brains (which is definitely true); he's claiming something like "after we have human-level AI, AIs will be able to get rapidly more powerful by running on faster hardware"; the speed increase is relative to some other computers, so the speed difference between brains and computers isn't relevant.

Also, running faster and duplicating yourself keeps the model human-level in an important sense. A lot of threat models run through the model doing things that humans can’t understand even given a lot of time, and so those threat models require something stronger than just this.

2Beckeck6mo
I think clever duplication of human intelligence is plenty sufficient for general superhuman capacity in the important sense (by which I mean something like 'it has capacities that would be extinction-causing if (it believes) minimizing its loss function is achieved by turning off humanity (which could turn it off / start other (proto-)AGIs)'). For one, I don't think humanity is that robust in the status quo, and two, a team of internally aligned (because copies) human-level intelligences capable of graduate-level biology seems plenty existentially scary.

I agree we can duplicate models once we’ve trained them, this seems like the strongest argument here.

What do you mean by “run on faster hardware”? Faster than what?

2Noosphere896mo
Faster than biological brains, by 6 orders of magnitude.

I expect unaligned human-level AIs to try the same thing and have much more success because optimizing code and silicon hardware is easier than optimizing flesh brains.

I agree that human-level AIs will definitely try the same thing, but it's not obvious to me that it will actually be much easier for them. Current machine learning techniques produce models that are hard to optimize for basically the same reasons that brains are;  AIs will be easier to optimize for various reasons but I don't think it will be nearly as extreme as this sentence makes it sound.

I naively expect the option of "take whatever model constitutes your mind and run it on faster hardware and/or duplicate it" should be relatively easy and likely to lead to fairly extreme gains.

Buck6moΩ5127

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at... (read more)

Buck6moΩ413

I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI", apologies if this was confusing.

My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your action is able to do one of these "positive-sum" pivotal acts in a single ac... (read more)

It seems pretty plausible that the AI will trade for compute with some other person around the world.

Whether this is what I'm trying to call a zero-sum action depends on whose resources it's trading. If the plan is "spend a bunch of the capital that its creators have given it on compute somewhere else", then I think this is importantly zero-sum--the resources are being taken from the creators of AI, which is why the AI was able to spend so many resources. If the plan was instead "produce some ten trillion dollar invention, then sell it, then use the proceeds to buy compute elsewhere", this would seem less zero-sum, and I'm saying that I expect the first kind of thing to happen before the second.

I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to give them power or compute of some kind. Also, all of course selected for plans that are least like

... (read more)
6habryka6mo
Yeah, OK, I think this distinction makes sense, and I do feel like this distinction is important. Having settled this, my primary response is: Sure, I guess it's the most prototypical catastrophic action until we have solved it, but like, even if we solve it, we haven't solved the problem where the AI does actually get a lot smarter than humans and takes a substantially more "positive-sum" action and kills approximately everyone with the use of a bioweapon, or launches all the nukes, or develops nanotechnology. We do have to solve this problem first, but the hard problem is the part where it seems hard to stop further AI development without having a system that is also capable of killing all (or approximately all) the humans, so calling this easy problem the "prototypical catastrophic action" feels wrong to me. Solving this problem is necessary, but not sufficient for solving AI Alignment, and while it is this stage and earlier stages where I expect most worlds to end, I expect most worlds that make it past this stage to not survive either. I think given this belief, I would think your new title is more wrong than the current title (I mean, maybe it's "mostly", because we are going to die in a low-dignity way as Eliezer would say, but it's not obviously where most of the difficulty lies).

From Twitter:

Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3. 

I’m looking forward to the day where it turns out that adding “Let’s think through it as if we were an AI who knows that if it gets really good scores during fine tuning on helpfulness, it will be given lots of influence later” increases helpfulness by 5% and so we add it to our prompt by default.
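For reference, the trick quoted above is purely a change to the prompt string; a minimal sketch (a hypothetical helper, not tied to any particular model API) might look like:

```python
def zero_shot_cot_prompt(question: str) -> str:
    # Append the zero-shot chain-of-thought trigger phrase before the model answers.
    return f"Q: {question}\nA: Let's think step by step."

print(zero_shot_cot_prompt("A farmer has 17 sheep and buys 5 more. How many sheep are there?"))
```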

Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.

I suspect that some knowledge transfers. For example, I suspect that increasingly large LMs learn features of language roughly in order of their importance for predicting English, and so I'd expect that LMs that get similar language modeling losses usually know roughly the same features of English. (You could just run two LMs on the same text and see their logprobs on the correct next token for every token, and then make a scatter plot; presumably there will be a bunch of correlation, but you might notice patterns in the things that one LM did much better than the other.)

And the methodology for playing with LMs probably transfers.

But I generally have no idea here, and it seems really useful to know more about this.
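A minimal sketch of that scatter-plot comparison, assuming the Hugging Face transformers library and two small GPT-2-family models (picked here only because they share a tokenizer, so per-token logprobs line up):

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_logprobs(model_name: str, text: str) -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)           # predictions for tokens 2..n
    return logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)  # logprob of the actual next token

text = "Some shared evaluation text goes here."
a = next_token_logprobs("gpt2", text)
b = next_token_logprobs("distilgpt2", text)

plt.scatter(a.numpy(), b.numpy(), s=10)
plt.xlabel("gpt2 logprob of correct next token")
plt.ylabel("distilgpt2 logprob of correct next token")
plt.show()
```

Points far off the diagonal are the tokens one model predicts much better than the other, which is exactly where you'd look for features one model has learned and the other hasn't.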

Buck7moΩ511

Yeah I wrote an interface like this for personal use, maybe I should release it publicly.

2Yitz7mo
Please do!
Buck8moΩ212

I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.

Examples of small AI c... (read more)

Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.

This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which ... (read more)

I’m not that excited for projects along the lines of “let’s figure out how to make human feedback more sample efficient”, because I expect that non-takeover-risk-motivated people will eventually be motivated to work on that problem, and will probably succeed quickly given motivation. (Also I guess because I expect capabilities work to largely solve this problem on its own, so maybe this isn’t actually a great example?) I’m fairly excited about projects that try to apply human oversight to problems that the humans find harder to oversee, because I think that this is important for avoiding takeover risk but that the ML research community as a whole will procrastinate on it.

Buck8moΩ410

Are any of these ancient discussions available anywhere?

In hindsight this is obviously closely related to what paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7

Buck8moΩ811

Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.

3Richard_Ngo8mo
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to. How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients. Two potential alternatives to the thing you said:

  • maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
  • maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously can't let them do it too often or else your model just wireheads.)
Buck8moΩ1118

[epistemic status: speculative]

A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.

A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in ... (read more)

2Buck8mo
In hindsight this is obviously closely related to what paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7 [https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7]
Buck8moΩ811

Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.

Buck8moΩ2340

Something I think I’ve been historically wrong about:

A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate; why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.

Similarly with debate... (read more)

Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central).  A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups.

In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).


Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).

It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.

Answer by BuckApr 05, 202216

FWIW if you look at Rob Bensinger's survey of people who work on long-term AI risk, the average P(AI doom) is closer to Ord than MIRI. So I'd say that Ord isn't that different from most people he talks to.

You might enjoy these posts where people argue for particular values of P(AI doom), all of which are much lower than Eliezer's:

What part of the ELK report are you saying felt unworkable?

9P.8mo
ELK itself seems like a potentially important problem to solve, the part that didn't make much sense to me was what they plan to do with the solution, their idea based on recursive delegation.

By "checkable" do you mean "machine checkable"?

I'm confused because I understand you to be asking for a bound on the derivative of an EfficientNet model, but it seems quite easy (though perhaps kind of a hassle) to get this bound.

I don't think the floating point numbers matter very much (especially if you're ok with the bound being computed a bit more loosely).

3Zac Hatfield-Dodds1y
Ah, crux: I do think the floating-point matters! Issues of precision, underflow, overflow, and NaNs bedevil model training and occasionally deployment-time behavior. By analogy, if we deploy an AGI the ideal mathematical form of which is aligned, we may still be doomed, even if it's plausibly our best option in expectation. Checkable meaning that I or someone I trust with this has to be able to check it! Maxwell's proposal is simple enough that I can reason through the whole thing, even over float32 rather than R, but for more complex arguments I'd probably want it machine-checkable for at least the tricky numeric parts.
Buck1yΩ24

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.

 

You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
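A minimal sketch of that crude bound, using a toy fully-connected ReLU network as a stand-in (an actual EfficientNet uses other nonlinearities and layer types, so this is illustrative only): since ReLU is 1-Lipschitz, the product of the layers' operator norms bounds how far the logits can move for a given input perturbation.

```python
import torch
import torch.nn as nn

# Toy stand-in for the real architecture.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

lipschitz_bound = 1.0
for layer in model.modules():
    if isinstance(layer, nn.Linear):
        # Spectral norm (largest singular value) of the weight matrix;
        # ReLU and Flatten each contribute a factor of at most 1.
        lipschitz_bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()

eps = 1.0 / 255.0  # flipping the least significant bit of one pixel, with inputs in [0, 1]
# Softmax is 1-Lipschitz, so the change in any class probability is also at most this.
print("logit change per LSB flip is at most", lipschitz_bound * eps)
```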

3Zac Hatfield-Dodds1y
If you can produce a checkable proof of this over the actual EfficientNet architecture, I'd pay out the prize. Note that this uses floating-point numbers, not the reals!
Buck1yΩ35

Am I correct that you wouldn't find a bound acceptable, you specifically want the exact maximum?

4Buck1y
You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
3Zac Hatfield-Dodds1y
I'd award half the prize for a non-trivial bound.
Buck1yΩ36

Suppose you have three text-generation policies, and you define "policy X is better than policy Y" as "when a human is given a sample from both policy X and policy Y, they prefer the sample from the former more than half the time". That definition of "better" is intransitive.
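A minimal worked example of that intransitivity, using hypothetical numeric "quality scores" as stand-ins for a human's judgments of text samples (the same structure as non-transitive dice):

```python
from itertools import product

# Hypothetical quality scores a human would assign to samples from each policy.
policy_a = [2, 2, 4, 4, 9, 9]
policy_b = [1, 1, 6, 6, 8, 8]
policy_c = [3, 3, 5, 5, 7, 7]

def win_rate(x, y):
    """Fraction of sample pairs where the human prefers x's sample to y's."""
    return sum(a > b for a, b in product(x, y)) / (len(x) * len(y))

print(win_rate(policy_a, policy_b))  # ~0.56: A "better than" B
print(win_rate(policy_b, policy_c))  # ~0.56: B "better than" C
print(win_rate(policy_c, policy_a))  # ~0.56: C "better than" A, completing a cycle
```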

3adamShimi1y
Hum, I see. And is your point that it should not create a problem because you're only doing comparison X vs Y and Z vs Y (where Y is the standard policy and X and Z are two of your conservative policies) but you don't really care about the comparison between X and Z?
Buck1yΩ25

I think we prefer questions on the EA Forum.

Buck1yΩ12

Thanks, glad to hear you appreciate us posting updates as we go.
