All of capybaralet's Comments + Replies

[AN #156]: The scaling hypothesis: a plan for building AGI

I think the contradiction may only be apparent, but I thought it was worth mentioning anyways.  
My point was just that we might actually want certifications to say things about specific algorithms.

[AN #156]: The scaling hypothesis: a plan for building AGI

Second, we can match the certification to the types of people and institutions, that is, our certifications talk about the executives, citizens, or corporations (rather than e.g. specific algorithms, that may be replaced in the future). Third, the certification system can build in mechanisms for updating the certification criteria periodically.

* I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts.  This appears to contradict the "Second" point ab... (read more)

2rohinmshah10dThe idea with the "Second" point is that the certification would be something like "we certify that company X has a process Y for analyzing and fixing potential problem Z whenever they build a new algorithm / product", which seems like it is consistent with your belief here? Unless you think that the process isn't enough, you need to certify the analysis itself.
Gwern's "Why Tool AIs Want to Be Agent AIs: The Power of Agency"

I see that it has references to papers from this year, so presumably it has been updated to reflect any changes in view.

Gwern's "Why Tool AIs Want to Be Agent AIs: The Power of Agency"

I wonder if gwern has changed their view on RL/meta-learning at all given GPT, scaling laws, and current dominance of training on big offline datasets.  This would be somewhat in line with skybrian's comment on Hacker News: 

1capybaralet25dI see that it has references to papers from this year, so presumably it has been updated to reflect any changes in view.
How will OpenAI + GitHub's Copilot affect programming?

One of the major, surprising consequences of this is that it's likely to become infeasible to develop software in secret. 

Or maybe developers will find good ways of hacking value out of cloud hosted AI systems while keeping their actual code base secret.  e.g. maybe you have a parallel code base that is developed in public, and a way of translating code-gen outputs there into something useful for your actual secret code base.


Covid 5/13: Moving On

Honest question: Why are people not concerned about 1) long COVID and 2) variants?


Is there something(s) that I haven't read that other people have? I haven't been following closely...


My best guess is:

1) There's good reason to believe vaccines protect you from it (but I haven't personally seen that)

2) We'll hear about them if they start to be a problem


1/2) Enough people are getting vaccinated that rates of COVID and infectiousness are low, so it's becoming unlikely to be exposed to a significant amount of it in the first place.



What does vaccine effectiveness as a function of time look like?

From that figure, it looks to me like roughly 0 protection until day 10 or 11, and then near perfect protection after that.  Surprisingly non-smooth!

2Phil4moWith a seven-day incubation period, does that mean it's 0 protection until about day 4, then near-perfect protection after that? (As per jimrandomh's comment of 4/17.)
How many micromorts do you get per UV-index-hour?

Oh yeah, sorry I was not clear about this...

I am actually trying to just consider the effects via cancer risk in isolation, and ignoring the potential benefits (which I think do go beyond just Vitamin D... probably a lot of stuff happening that we don't understand... certainly seems to have effect on mood, e.g.)

How many micromorts do you get per UV-index-hour?

Looks like just correlations, tho(?)
I basically wouldn't update on a single study that only looks at correlation.

AI x-risk reduction: why I chose academia over industry

You can try to partner with industry, and/or advocate for big government $$$.
I am generally more optimistic about toy problems than most people, I think, even for things like Debate.
Also, scaling laws can probably help here.

AI x-risk reduction: why I chose academia over industry

um sorta modulo a type error... risk is risk.  It doesn't mean the thing has happened (we need to start using some sort of phrase like "x-event" or something for that, I think).

1Ikaxas5moI've started using the phrase "existential catastrophe" in my thinking about this; "x-catastrophe" doesn't really have much of a ring to it though, so maybe we need something else that abbreviates better?
AI x-risk reduction: why I chose academia over industry

Yeah we've definitely discussed it!  Rereading what I wrote, I did not clearly communicate what I intended to...I wanted to say that "I think the average trend was for people to update in my direction".  I will edit it accordingly.

I think the strength of the "usual reasons" has a lot to do with personal fit and what kind of research one wants to do.  Personally, I basically didn't consider salary as a factor.

AI x-risk reduction: why I chose academia over industry

When you say academia looks like a clear win within 5-10 years, is that assuming "academia" means "starting a tenure-track job now?" If instead one is considering whether to begin a PhD program, for example, would you say that the clear win range is more like 10-15 years?


Also, how important is being at a top-20 institution? If the tenure track offer was instead from University of Nowhere, would you change your recommendation and say go to industry?

My cut-off was probably somewhere between top-50 and top-100, and I was prepared to go anywhere in ... (read more)

2Daniel Kokotajlo5moMakes sense. I think we don't disagree dramatically then. Also makes sense -- just checking, does x-risk-inducing AI roughly match the concept of "AI-induced potential point of no return" or is it importantly different? It's certainly less of a mouthful so if it means roughly the same thing maybe I'll switch terms. :)
"Beliefs" vs. "Notions"

Thanks!  Quick question: how do you think these notions compare to factors in an undirected graphical model?  (This is the closest thing I know of to how I imagine "notions" being formalized).

2Vanessa Kosoy5moHmm. I didn't encounter this terminology before, but, given a graph and a factor you can consider the convex hull of all probability distributions compatible with this graph and factor (i.e. all probability distributions obtained by assigning other factors to the other cliques in the graph). This is a crisp infradistribution. So, in this sense you can say factors are a special case of infradistributions (although I don't know how much information this transformation loses). It's more natural to consider, instead of a factor, either the marginal probability distribution of a set of variables or the conditional probability distribution of a set of variables on a different set of variables. Specifying one of those is a linear condition on the full distribution so it gives you a crisp infradistribution without having to take convex hull, and no information is lost.
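The claim that fixing a marginal is a linear condition, so that the compatible distributions already form a convex set, can be checked numerically. This is only a sketch under assumptions that are not in the reply: two binary variables X and Y, a hypothetical target marginal P(X=1) = 0.3, and two arbitrary example joints that satisfy it.

```python
TARGET_P_X1 = 0.3  # hypothetical marginal constraint on X (an assumption)


def marginal_x1(joint):
    """P(X=1) for a joint distribution given as {(x, y): probability}."""
    return sum(p for (x, _), p in joint.items() if x == 1)


def mix(joint_a, joint_b, lam):
    """Convex combination lam * joint_a + (1 - lam) * joint_b."""
    return {k: lam * joint_a[k] + (1.0 - lam) * joint_b[k] for k in joint_a}


# Two different joints over (X, Y) that share the marginal P(X=1) = 0.3.
joint_a = {(0, 0): 0.7, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.3}
joint_b = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.05}

# Every mixture of the two should still satisfy the same marginal constraint.
mixtures = [mix(joint_a, joint_b, step / 10.0) for step in range(11)]
all_compatible = all(abs(marginal_x1(m) - TARGET_P_X1) < 1e-12 for m in mixtures)
```

Because the constraint is a sum of entries of the joint, it survives any convex combination, which is why no convex hull needs to be taken in this case.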
"Beliefs" vs. "Notions"

Cool!  Can you give a more specific link please?

2Vanessa Kosoy5moThe concept of infradistribution was defined here (Definition 7) although for the current purpose it's sufficient to use crisp infradistributions (Definition 9 here, it's just a compact convex set of probability distributions). Sharp infradistributions (Definition 10 here) are the special case of "pure (2)". I also talked about the connection to formal logic here.
"Beliefs" vs. "Notions"

True, but it seems the meaning I'm using it for is primary:

Imitative Generalisation (AKA 'Learning the Prior')

It seems like z* is meant to represent "what the human thinks the task is, based on looking at D".
So why not just try to extract the posterior directly, instead of the prior and the likelihood separately?
(And then it seems like this whole thing reduces to "ask a human to specify the task".)

1Beth Barnes4moWe're trying to address cases where the human isn't actually able to update on all of D and form a posterior based on that. We're trying to approximate 'what the human posterior would be if they had been able to look at all of D'. So to do that, we learn the human prior, and we learn the human likelihood, then have the ML do the computationally-intensive part of looking at all of D and updating based on everything in there. Does that make sense?
[AN #141]: The case for practicing alignment work on GPT-3 and other large models

Interesting... Maybe this comes down to different taste or something.  I understand, but don't agree with, the cow analogy... I'm not sure why, but one difference is that I think we know more about cows than DNNs or something.

I haven't thought about the Zipf-distributed thing.

> Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter's model doesn't. Presumably you mean something else but idk what.

I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get a... (read more)

2rohinmshah5moWith this assumption, asymptotically (i.e. with enough data) this becomes a nearest neighbor classifier. For the $d$-dimensional manifold assumption in the other model, you can apply the arguments from the other model to say that you scale as $D^{-c/d}$ for some constant $c$ (probably $c = 1$ or $2$, depending on what exactly we're quantifying the scaling of). I'm not entirely sure how you'd generalize the Zipf assumption to the "within epsilon" case, since in the original model there was no assumption on the smoothness of the function being predicted (i.e. [0, 0, 0] and [0, 0, 0.000001] could have completely different values.)
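To get a feel for what a loss scaling as D to the power -c/d implies, here is a small sketch. It assumes the power law holds exactly, takes c = 1 by default, and uses two hypothetical helper names; none of this comes from either paper.

```python
def loss_ratio(data_multiplier, manifold_dim, c=1.0):
    """Factor by which a loss proportional to D**(-c/d) changes when D is scaled."""
    return data_multiplier ** (-c / manifold_dim)


def data_needed_to_halve_loss(manifold_dim, c=1.0):
    """Multiplier on D needed to cut the loss in half under the same power law."""
    return 2.0 ** (manifold_dim / c)


# Effect of 10x more data for a few illustrative manifold dimensions.
ratios = {d: loss_ratio(10.0, d) for d in (1, 2, 10, 100)}

# How much more data is needed to halve the loss.
halving = {d: data_needed_to_halve_loss(d) for d in (1, 2, 10)}
```

Under these assumptions, ten times more data cuts the loss tenfold when d = 1 but only by about 2% when d = 100, which is the sense in which the exponent depends on the manifold dimension.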
[AN #141]: The case for practicing alignment work on GPT-3 and other large models

I have a hard time saying which of the scaling laws explanations I like better (I haven't read either paper in detail, but I think I got the gist of both).
What's interesting about Hutter's is that the model is so simple, and doesn't require generalization at all. 
I feel like there's a pretty strong Occam's Razor-esque argument for preferring Hutter's model, even though it seems wildly less intuitive to me.
Or maybe what I want to say is more like "Hutter's model DEMANDS refutation/falsification".

I think both models also are very interesting for underst... (read more)

4rohinmshah5mo?? Overall this claim feels to me like:
* Observing that cows don't float into space
* Making a model of spherical cows with constant density ρ and showing that as long as ρ is more than density of air, the cows won't float
* Concluding that since the model is so simple, Occam's Razor says that cows must be spherical with constant density.
Some ways that you could refute it:
* It requires your data to be Zipf-distributed -- why expect that to be true?
* The simplicity comes from being further away from normal neural nets -- surely the one that's closer to neural nets is more likely to be true?
Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter's model doesn't. Presumably you mean something else but idk what.
The case for aligning narrowly superhuman models

Thanks for the response!
I see the approaches as more complementary.  
Again, I think this is in keeping with standard/good ML practice.

A prototypical ML paper might first describe a motivating intuition, then formalize it via a formal model and demonstrate the intuition in that model (empirically or theoretically), then finally show the effect on real data.

The problem with only doing the real data (i.e. at scale) experiments is that it can be hard to isolate the phenomena you wish to study.  And so a positive result does less to confirm the motiva... (read more)

The feeling of breaking an Overton window

I think I know the feeling quite well.  I think for me anyways, it's basically "fear of being made fun of", stemming back to childhood.  I got made fun of a lot, and physically bullied as well (a few examples that jump to mind are: having my face shoved into the snow until I was scared of suffocating, being body slammed and squished the whole 45-minute bus ride home because I sat in the back seat (which the "big kids" claimed as their right), being shoulder-checked in the hall).

At some point I developed an attitude of "fuck those people", and dec... (read more)

The case for aligning narrowly superhuman models

I haven't read this in detail (hope to in the future); I only skimmed based on section headers.
I think the stuff about "what kinds of projects count" and "advantages over other genres" seems to miss an important alternative, which is to build and study toy models of the phenomena we care about.  This is a bit like the gridworlds stuff, but I thought the description of that work missed its potential, and didn't provide much of an argument for why working at scale would be more valuable.

This approach (building and studying toy models) is popular in ML re... (read more)

4Ajeya Cotra5moThe case in my mind for preferring to elicit and solve problems at scale rather than in toy demos (when that's possible) is pretty broad and outside-view, but I'd nonetheless bet on it: I think a general bias toward wanting to "practice something as close to the real thing as possible" is likely to be productive. In terms of the more specific benefits I laid out in this section, I think that toy demos are less likely to have the first and second benefits ("Practical know-how and infrastructure" and "Better AI situation in the run-up to superintelligence"), and I think they may miss some ways to get the third benefit ("Discovering or verifying a long-term solution") because some viable long-term solutions may depend on some details about how large models tend to behave. I do agree that working with larger models is more expensive and time-consuming, and sometimes it makes sense to work in a toy environment instead, but other things being equal I think it's more likely that demos done at scale will continue to work for superintelligent systems, so it's exciting that this is starting to become practical.
Fun with +12 OOMs of Compute

There's a ton of work in meta-learning, including Neural Architecture Search (NAS).  AIGAs (Clune) is a paper that argues a similar POV to what I would describe here, so I'd check that out.  

I'll just say "why it would be powerful": the promise of meta-learning is that -- just like learned features outperform engineered features -- learned learning algorithms will eventually outperform engineered learning algorithms.  Taking the analogy seriously would suggest that the performance gap will be large -- a quantitative step-change.  

The u... (read more)

Fun with +12 OOMs of Compute

Sure, but in what way?
Also I'd be happy to do a quick video chat if that would help (PM me).

2Daniel Kokotajlo5moWell, I've got five tentative answers to Question One in this post. Roughly, they are: Souped-up AlphaStar, Souped-up GPT, Evolution Lite, Engineering Simulation, and Emulation Lite. Five different research programs basically. It sounds like what you are talking about is sufficiently different from these five, and also sufficiently promising/powerful/'fun', that it would be a worthy addition to the list basically. So, to flesh it out, maybe you could say something like "Here are some examples of meta-learning/NAS/AIGA in practice today. Here's a sketch of what you could do if you scaled all this up +12 OOMs. Here's some argument for why this would be really powerful."
Fun with +12 OOMs of Compute

I only read the prompt.  
But I want to say: that much compute would be useful for meta-learning/NAS/AIGAs, not just scaling up DNNs.  I think that would likely be a more productive research direction.  And I want to make sure that people are not ONLY imagining bigger DNNs when they imagine having a bunch more compute, but also imagining how it could be used to drive fundamental advances in ML algos, which could plausibly kick off something like recursive self-improvement (even if DNNs are in some sense a dead end).

9antimonyanthony1moSomething I'm wondering, but don't have the expertise in meta-learning to say confidently (so, epistemic status: speculation, and I'm curious for critiques): extra OOMs of compute could overcome (at least) one big bottleneck in meta-learning, the expense of computing second-order gradients. My understanding is that most methods just ignore these terms or use crude approximations, like this, because they're so expensive. But at least this paper found some pretty impressive performance gains from using the second-order terms. Maybe throwing lots of compute at this aspect of meta-learning would help it cross a threshold of viability, like what happened for deep learning in general around 2012. I think meta-learning is a case where we should expect second-order info to be very relevant to optimizing the loss function in question, not just a way of incorporating the loss function's curvature. In the first paper I linked, the second-order term accounts for how the base learner's gradients depend on the meta-learner's parameters. This seems like an important feature of what their meta-learner is trying/supposed to do, i.e., use the meta-learned update rule to guide the base learner - and the performance gains in the second paper are evidence of this. (Not all meta-learners have this structure, though, and MAML apparently doesn't get much better when you use Hessians. Hence my lack of confidence in this story.)
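To illustrate what "ignoring the second-order terms" means, here is a toy sketch. It is an assumption-laden example and not the method of either linked paper: it uses a MAML-style setup in which the meta-parameter is the initial `theta`, with one-dimensional quadratic losses chosen so that the second-order factor can be written in closed form. All constants and function names are hypothetical.

```python
ALPHA = 0.1     # inner-loop learning rate (hypothetical value)
A_TRAIN = 3.0   # curvature (second derivative) of the toy training loss
T_TRAIN = 1.0   # minimizer of the toy training loss
T_VAL = 2.0     # minimizer of the toy validation loss


def inner_step(theta):
    """One gradient step on L_train(theta) = 0.5 * A_TRAIN * (theta - T_TRAIN)**2."""
    grad_train = A_TRAIN * (theta - T_TRAIN)
    return theta - ALPHA * grad_train


def outer_loss(theta):
    """Validation loss 0.5 * (adapted - T_VAL)**2, as a function of the initial theta."""
    adapted = inner_step(theta)
    return 0.5 * (adapted - T_VAL) ** 2


def full_meta_grad(theta):
    """Exact derivative of outer_loss w.r.t. theta, including the second-order factor."""
    adapted = inner_step(theta)
    d_adapted_d_theta = 1.0 - ALPHA * A_TRAIN  # depends on the Hessian of L_train
    return (adapted - T_VAL) * d_adapted_d_theta


def first_order_meta_grad(theta):
    """Approximation that treats d(adapted)/d(theta) as 1, dropping the Hessian term."""
    return inner_step(theta) - T_VAL


def finite_difference(theta, h=1e-6):
    """Central-difference estimate of the true meta-gradient, used as a check."""
    return (outer_loss(theta + h) - outer_loss(theta - h)) / (2.0 * h)


theta0 = 0.0
exact = full_meta_grad(theta0)
approx = first_order_meta_grad(theta0)
numeric = finite_difference(theta0)
```

In this toy the exact gradient matches the finite-difference estimate, while the first-order version is off by the factor `1 - ALPHA * A_TRAIN`. In high dimensions that factor becomes a Hessian-vector product, which is where the extra compute cost comes from.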
2Daniel Kokotajlo5moInteresting, could you elaborate? I'd love to have a nice, fleshed-out answer along those lines to add to the five I came up with. :)
5lsusr5moIt weirds me out how little NAS (Neural Architecture Search) in particular (and throwing compute at architecture search in general) is used in industry.
the scaling “inconsistency”: openAI’s new insight

if your model gets more sample-efficient as it gets larger & n gets larger, it's because it's increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can't squeeze blood from a stone; once you approach the intrinsic entropy, there's not much to learn.

I found this confusing.  It sort of seems like you're assuming that a Bayes-optimal learner achieves the B... (read more)

capybaralet's Shortform

I basically agree, but I do assign it to Moloch. *shrug*

Any work on honeypots (to detect treacherous turn attempts)?

I strongly disagree.  
I think this is emblematic of the classic AI safety perspective/attitude, which has impeded and discouraged practical progress towards reducing AI x-risk by supporting an unnecessary and misleading emphasis on "ultimate solutions" that address the "arbitrarily intelligent agent trapped in a computer" threat model.
This is an important threat model, but it is just one of many.

My question is inspired by the situation where a scaled up GPT-3-like model is fine-tuned using RL and/or reward modelling.  In this case, it seems like ... (read more)

Tips for the most immersive video calls

Any tips for someone who's already bought the C920 and isn't happy with the webcam on their computer?  (e.g. details on the 2 hour process :P)

Has anyone researched specification gaming with biological animals?

There are probably a lot of things that people do with animals that can be viewed as "automatic training", but I don't think people are viewing them this way, or trying to create richer reward signals that would encourage the animals to demonstrate increasingly impressive feats of intelligence.

Industrial literacy

The claim I'm objecting to is:

all soil loses its fertility naturally over time

I guess your interpretation of "naturally" is "when non-sustainably farmed"? ;) 

My impression is that we know how to keep farmland productive without using fertilizers by rotating crops, letting fields lie fallow sometimes, and involving fauna.  Of course, this might be much less efficient than using synthetic fertilizers, so I'm not saying that's what we should be doing. 

1swarriner10moSee my comments above for some discussion of this topic. Broadly speaking we do know how to keep farmland productive but there are uncaptured externalities and other inadequacies to be accounted for.
Is there any work on incorporating aleatoric uncertainty and/or inherent randomness into AIXI?

Is there a reference for this?

I was inspired to think of this by this puzzle (which I interpret as being about the distinction between epistemic and aleatoric uncertainty):

"To present another example, suppose that five tosses of a given coin are planned and that the agent has equal strength of belief for two outcomes, both beginning with H, say the outcomes HTTHT and HHTTH. Suppose the first toss is made, and results in a head. If all that the agent learns is that a head occurred on the first toss it seems unreasonable for him to move to a greater confi... (read more)

3Charlie Steiner10moI don't have a copy of Li and Vitanyi on hand, so I can't give you a specific section, but it's in there somewhere (probably Ch. 3). By "it" here I mean discussion of what happens to Solomonoff induction if we treat the environment as being drawn from a distribution (i.e. having "inherent" randomness). Neat puzzle! Let's do the math real quick: Suppose you have one coin with bias 0.1, and another with bias 0.9. You choose one coin at random and flip it a few times. Before flipping, flipping 3 H and 2 T seems just as likely as flipping 2 H and 3 T, no matter the order. $P(\text{HHHTT}) = P(\text{HHTTT}) = (0.5 \times 0.9^3 \times 0.1^2) + (0.5 \times 0.9^2 \times 0.1^3) = 0.00405$. After your first flip, you notice that it's a H. You now update your probability that you grabbed the heads-biased coin: $P(\text{heads bias} \mid \text{H}) = \frac{0.5 \times 0.9}{0.5} = 0.9$. Now $P(\text{HHTT} \mid \text{H}) = (0.9 \times 0.9^2 \times 0.1^2) + (0.1 \times 0.9^2 \times 0.1^2) = 0.0081$. And $P(\text{HTTT} \mid \text{H}) = (0.1 \times 0.9^3 \times 0.1) + (0.9 \times 0.9 \times 0.1^3) = 0.0081$. Huh, that's weird. That's, like, super unintuitive. But if you look at the terms for $P(\text{HHTT} \mid \text{H})$ and $P(\text{HTTT} \mid \text{H})$, notice that they both simplify to $(0.9^3 \times 0.1^2) + (0.9^2 \times 0.1^3)$. You think it's more likely that you have the heads-biased coin, but because you know the coin must be biased, the further sequence "HHTT" isn't as likely as the sequence "HTTT", and both this difference in likelihood and your probability of what coin you have are the same number, the bias of the coin!
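The arithmetic in this reply can be checked with a few lines of code. This is a minimal sketch assuming exactly the two-coin setup described above (heads probabilities 0.1 and 0.9, each coin picked with probability 0.5); the function names are illustrative only.

```python
BIASES = (0.1, 0.9)  # P(heads) for the tails-biased and heads-biased coin
PRIOR = (0.5, 0.5)   # each coin is equally likely to be picked


def seq_prob(seq, p_heads):
    """Probability of an exact H/T sequence for a coin with P(H) = p_heads."""
    prob = 1.0
    for flip in seq:
        prob *= p_heads if flip == "H" else 1.0 - p_heads
    return prob


def marginal(seq):
    """P(seq), averaging over which coin was picked."""
    return sum(w * seq_prob(seq, b) for w, b in zip(PRIOR, BIASES))


def conditional(rest, seen):
    """P(rest | seen) = P(seen followed by rest) / P(seen)."""
    return marginal(seen + rest) / marginal(seen)


p_hhhtt = marginal("HHHTT")
p_hhttt = marginal("HHTTT")
p_heads_coin_given_h = PRIOR[1] * BIASES[1] / marginal("H")
p_hhtt_given_h = conditional("HHTT", "H")
p_httt_given_h = conditional("HTTT", "H")
```

The two conditionals stay equal because each is just the corresponding five-flip probability divided by the same P(H) = 0.5, so learning only that the first toss was a head cannot separate two outcomes that both start with H.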
Weird Things About Money

I really like this.  I read part 1 as being about the way the economy or society implicitly imposes additional pressures on individuals' utility functions.  Can you provide a reference for the theorem that Kelly bettors predominate?

EtA: an observation: the arguments for expected value also assume infinite value is possible, which (modulo infinite ethics style concerns, a significant caveat...) also isn't realistic. 


AGI safety from first principles: Control

Which previous arguments are you referring to?

3Richard_Ngo10moThe rest of the AGI safety from first principles sequence. This is the penultimate section; sorry if that wasn't apparent. For the rest of it, start here.
Industrial literacy

That the food you eat is grown using synthetic fertilizers, and that this is needed for agricultural productivity, because all soil loses its fertility naturally over time if it is not deliberately replenished.

This claim doesn't make sense.  If it were true, plants would not have survived to the present day.

Steelmanning (which I would say OP doesn't do a good job of...), I'll interpret this as: "we are technologically reliant on synthetic fertilizers to grow enough food to feed the current population".  But in any case, there are harmful environm... (read more)

9Davidmanheim10moNo, the claim as written is true - agriculture will ruin soil over time, which has happened in recent scientific memory in certain places in Africa. And if you look at the biblical description of parts of the Middle East, it's clear that desertification had taken a tremendous toll over the past couple thousand years. That's not because of fertilizer usage, it's because agriculture is about extracting food and moving it elsewhere, usually interrupting the cycle of nutrients, which happens organically otherwise. Obviously, natural habitats don't do this in the same way, because the varieties of plants shift over time, fauna is involved, etc.
capybaralet's Shortform

Some possible implications of more powerful AI/technology for privacy:

1) It's as if all of your logged data gets poured over by a team of super-detectives to make informed guesses about every aspect of your life, even those that seem completely unrelated to those kinds of data.

2) Even data that you try to hide can be read from things like reverse engineering what you type based on the sounds of you typing, etc.

3) Powerful actors will deploy advanced systems to model, predict, and influence your behavior, and extreme privacy precautions starting now may be ... (read more)

capybaralet's Shortform

We learned about RICE as a treatment for injuries (e.g. sprains) in middle school, and it's since struck me as odd that you would want to inhibit the body's natural healing response.

It seems like RICE is being questioned by medical professionals, as well, but consensus is far off.

Anyone have thoughts/knowledge about this?

capybaralet's Shortform

Whelp... that's scary: 
Chip Huyen:

4. You won’t need to update your models as much
One mindboggling fact about DevOps: Etsy deploys 50 times/day. Netflix 1000s times/day. AWS every 11.7 seconds. MLOps isn’t an exemption. For online ML systems, you want to update them as fast as humanly possible. (5/6)

4Dagon10moWhat part is scary? I think they're missing out on the sheer variety of model usage - probably as variable as software deployments. But I don't think there's anything particularly scary about any given point on the curve. Some really do get built, validated, and deployed twice a year. Some have CI pipelines that re-train with new data and re-validate every few minutes. Some are self-updating, and re-sync to a clean state periodically. Some are running continuous a/b tests of many candidate models, picking the best-performer for a customer segment every few minutes, and adding/removing models from the pool many times per day.
Inviting Curated Authors to Give 5-Min Online Talks

I think I've had a few curated posts.  How could I find them?

2Raemon10moGo to your user profile (clicking on your username in the top-right), then, in the posts section, click the gear icon, and change "All Posts" to "Curated"
Radical Probabilism [Transcript]

Abram Demski: But it's like, how do you do that if “I don't have a good hypothesis” doesn't make any predictions?

One way you can imagine this working is that you treat “I don't have a good hypothesis” as a special hypothesis that is not required to normalize to 1.  
For instance, it could say that observing any particular real number, r, has probability epsilon > 0.
So now it "makes predictions", but this doesn't just collapse to including another hypothesis and using Bayes rule.

You can also imagine updating this special hypothesis (which I called a "Socratic hypothesis" in comments on the original blog post on Radical Probabilism) in various ways. 
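One possible reading of this idea can be sketched in code. Everything specific here is an assumption for illustration and not the formalization from the talk or the original comments: the ordinary hypotheses are two Gaussians, the special hypothesis gives every real observation the same constant score `EPSILON` (so it is not a normalized density over the reals), and weights are multiplied by scores and renormalized across hypotheses.

```python
import math

EPSILON = 0.01  # hypothetical constant score the special hypothesis gives any real r


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used for the ordinary hypotheses."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def update(weights, scores):
    """Multiply each weight by its score for the observation, then renormalize."""
    scored = {name: weights[name] * scores[name] for name in weights}
    total = sum(scored.values())
    return {name: value / total for name, value in scored.items()}


def run(observations):
    """Process observations one at a time, starting from fixed initial weights."""
    weights = {"N(0,1)": 0.45, "N(5,1)": 0.45, "no good hypothesis": 0.10}
    for r in observations:
        scores = {
            "N(0,1)": gaussian_pdf(r, 0.0, 1.0),
            "N(5,1)": gaussian_pdf(r, 5.0, 1.0),
            "no good hypothesis": EPSILON,  # same score for every r
        }
        weights = update(weights, scores)
    return weights


well_explained = run([0.1, -0.3, 0.2])
poorly_explained = run([20.0, -15.0, 40.0])
```

With data near 0 the first Gaussian takes almost all of the weight, while with data far from both Gaussians the special hypothesis does. The update looks like Bayes' rule mechanically, but the special entry does not correspond to any probability distribution over the reals, which is one way to read the remark that this does not collapse to adding an ordinary hypothesis.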

[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

Regarding ARCHES, as an author:

  • I disagree with Critch that we should expect single/single delegation(/alignment) to be solved "by default" because of economic incentives.  I think economic incentives will not lead to it being solved well-enough, soon enough (e.g. see:  I guess Critch might put this in the "multi/multi" camp, but I think it's more general (e.g. I attribute a lot of the risk here to human irrationality/carelessness)
  • RE: "I fin
... (read more)
2rohinmshah10moIndeed, this is where my 10% comes from, and may be a significant part of the reason I focus on intent alignment whereas Critch would focus on multi/multi stuff. Basically all of my arguments for "we'll be fine" rely on not having a huge discontinuity like that, so while I roughly agree with your prediction in that thought experiment, it's not very persuasive. (The arguments do not rely on technological progress remaining at its current pace.) At least in the US, our institutions are succeeding at providing public infrastructure (roads, water, electricity...), not having nuclear war, ensuring children can read, and allowing me to generally trust the people around me despite not knowing them. Deepfakes and facial recognition are small potatoes compared to that. I agree this is overall a point against my position (though I probably don't think it is as strong as you think it is).
[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

these usually don’t assume “no intervention from longtermists”

I think the "don't" is a typo?

3rohinmshah10moNo, I meant it as written. People usually give numbers without any assumptions attached, which I would assume means "I predict that in our actual world there is an X% chance of an existential catastrophe due to AI".
Why GPT wants to mesa-optimize & how we might change this

By managing incentives I expect we can, in practice, do things like: "[telling it to] restrict its lookahead to particular domains"... or remove any incentive for control of the environment.

I think we're talking past each other a bit here.

capybaralet's Shortform

For all of the hubbub about trying to elaborate better arguments for AI x-risk, it seems like a lot of people are describing the arguments in Superintelligence as relying on FOOM, agenty AI systems, etc. without actually justifying that description via references to the text.

It's been a while since I read Superintelligence, but my memory was that it anticipated a lot of counter-arguments quite well.  I'm not convinced that it requires such strong premises to make a compelling case.  So maybe someone interested in this project of clarifying the arguments should start with establishing that the arguments in Superintelligence really have the weaknesses they are claimed to?

Why GPT wants to mesa-optimize & how we might change this

My intuitions on this matter are:
1) Stopping mesa-optimizing completely seems mad hard.
2) Managing "incentives" is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence. 
3) On the other hand, it probably won't scale forever.

To elaborate on the incentive management thing... if we figure that stuff out and do it right and it has the promise that I think it does... then it won't restrict lookahead to particular domains, but it will remove incentives for instrumental goal seeking.  

If we're st... (read more)

4John_Maxwell10moAs I mentioned in the post, I don't think this is a binary, and stopping mesa-optimization "incompletely" seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn't seem mad hard to me.
I'm less optimistic about this approach.
1. There is a stochastic aspect to training ML models, so it's not enough to say "the incentives favor Mesa-Optimizing for X over Mesa-Optimizing for Y". If Mesa-Optimizing for Y is nearby in model-space, we're liable to stumble across it.
2. Even if your mesa-optimizer is aligned, if it doesn't have a way to stop mesa-optimization, there's the possibility that your mesa-optimizer would develop another mesa-optimizer inside itself which isn't necessarily aligned.
3. I'm picturing value learning via (un)supervised learning, and I don't see an easy way to control the incentives of any mesa-optimizer that develops in the context of (un)supervised learning. (Curious to hear about your ideas though.)
My intuition is that the distance between Mesa-Optimizing for X and Mesa-Optimizing for Y is likely to be smaller than the distance between an Incompetent Mesa-Optimizer and a Competent Mesa-Optimizer. If you're shooting for a Competent Human Values Mesa-Optimizer, it would be easy to stumble across a Competent Not Quite Human Values Mesa-Optimizer along the way. All it would take would be having the "Competent" part in place before the "Human Values" part. And running a Competent Not Quite Human Values Mesa-Optimizer during training is likely to be dangerous. On the other hand, if we have methods for detecting mesa-optimization or starving it of compute that work reasonably well, we're liable to stumble across an Incompetent Mesa-Optimizer and run it a few times, but it's less likely that we'll hit the smaller target of a Competent Mesa-Optimizer.
capybaralet's Shortform

Moloch is not about coordination failures.  Moloch is about the triumph of instrumental goals.  Maybe we can defeat Moloch with sufficiently good coordination.  It's worth a shot at least.

capybaralet's Shortform

Treacherous turns don't necessarily happen all at once. An AI system can start covertly recruiting resources outside its intended purview in preparation for a more overt power grab.

This can happen during training, without a deliberate "deployment" event. Once the AI has started recruiting resources, it can outperform AI systems that haven't done that on-distribution with resources left over which it can devote to pursuing its true objective or instrumental goals.

Why GPT wants to mesa-optimize & how we might change this

I didn't read the post (yet...), but I'm immediately skeptical of the claim that beam search is useful here ("in principle"), since GPT-3 is just doing next step prediction (it is never trained on its own outputs, IIUC). This means it should always just match the conditional P(x_t | x_1, .., x_{t-1}). That conditional itself can be viewed as being informed by possible future sequences, but conservation of expected evidence says we shouldn't be able to gain anything by doing beam search if we already know that conditional. Now it... (read more)

4John_Maxwell10moYeah, that's the possibility the post explores. Is there an easy way to detect if it's started doing that / tell it to restrict its lookahead to particular domains? If not, it may be easier to just prevent it from mesa-optimizing in the first place. (The post has arguments for why that's (a) possible and (b) wouldn't necessarily involve a big performance penalty.)