All of Lauro Langosco's Comments + Replies

Yeah we're on the same page here, thanks for checking!

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the prob... (read more)

Yeah that seems reasonable! (Personally I'd prefer a single break between sentence 3 and 4)

Yes, with one linebreak, I'd put it at (4). With 2 linebreaks, I'd put it at 4+5. With 3 breaks, 4/5/6. (Giving the full standard format: introduction/background, method, results, conclusion.) If I were annotating that, I would go with 3 breaks. I wouldn't want to do a 4th break, and break up 1-3 at all, unless (3) was unusually long and complex and dug into the specialist techniques more than usual so there really was a sort of 'meaningless super universal background of the sort of since-the-dawn-of-time-man-has-yearned-to-x' vs 'ok real talk time, you do X/Y/Z but they all suck for A/B/C reasons; got it? now here's what you actually need to do:' genuine background split making it hard to distinguish where the waffle ends and the meat begins.

IMO ~170 words is a decent length for a well-written abstract (well maybe ~150 is better), and the problem is that abstracts are often badly written. Steve Easterbrook has a great guide on writing scientific abstracts; here's his example template which I think flows nicely:

(1) In widgetology, it’s long been understood that you have to glomp the widgets before you can squiffle them. (2) But there is still no known general method to determine when they’ve been sufficiently glomped. (3) The literature describes several specialist techniques that measure how

... (read more)
I still claim this should be three paragraphs. Breaking it at sections 4 and 6 seems to carve it at reasonable joints.

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I'm arguing that it's definitely not going to work (I don't have 99% confidence here because I might be missing something, but IM(current)O the things I list are actual blockers).

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

Steven Byrnes (1d):
I’m gonna pause to make sure we’re on the same page. We’re talking about this claim I made above: And you’re trying to argue: “‘Maybe, maybe not’ is too optimistic, the correct answer is ‘(almost) definitely not’”. And then by “prerequisites” we’re referring to the thing you wrote above:

OK, now to respond. For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary. (I have previously written about that here [].)

For yet another thing, I think if the “toddler AGI” is not yet sophisticated enough to have a reflectively-endorsed desire for open and honest communication (or whatever), that’s different from saying that the toddler AGI is totally out to get us. It can still have habits and desires and inclinations and aversions and such, of various sorts, and we have some (imperfect) control over what those are. We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):

  • In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situa
... (read more)
Steven Byrnes (5d):
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I want to be clear that the “zapping” thing I wrote is a really crap plan, and I hope we can do better, and I feel odd defending it. My least-worst current alignment plan, such as it is, is here [], and doesn’t look like that at all. In fact, the way I wrote it, it doesn’t attempt corrigibility in the first place. But anyway…

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Second bullet point → Ditto.

Third bullet point → Doesn’t that apply to any goal you want the AGI to have? The context was: I think OP was assuming that we can make an AGI that’s sincerely trying to invent nanotech, and then saying that deception was a different and harder problem. It’s true that deception makes alignment hard, but that’s true for whatever goal we’re trying to install. Deception makes it hard to make an AGI that’s trying in good faith to invent nanotech, and deception also makes it hard to make an AGI that’s trying in good faith to have open and honest communication with its human supervisor. This doesn’t seem like a differential issue. But anyway, I’m not disagreeing. I do think I would frame the issue differently though: I would say “zapping the AGI for being deceptive” looks identical to “zapping the AGI for getting caught being deceptive”, at least by default, and thus the possibility of Goal Mis-Generalization rears its ugly head.

Fourth bullet point → I disagree for reasons here [].

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:

  • There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won't affect them cause they're not there yet.
  • General capabilities imply the ability to be deceptive if useful in a particu
... (read more)
  • Honesty is an attractor in the cooperative multi-agent system, where one agent relies on the other agents having accurate information to do their part of the work.
  • I don't think understanding an intent is the hardest part. Even the current LLMs are mostly able to do that.

(Crossposting some of my twitter comments).

I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.

  1. I think that instead of thinking in terms of "coherence" vs. "hot mess", it is more fruitful to think about "how much influence is this system exerting on its environment?". Too much influence will kill humans, if directed at an outcome we're not able to choose. (The rest of my comments are all variations on

... (read more)

Maybe Francois Chollet has coherent technical views on alignment that he hasn't published or shared anywhere (the blog post doesn't count, for reasons that are probably obvious if you read it), but it doesn't seem fair to expect Eliezer to know / mention them.

Arthur Conmy (24d):
It is open-sourced here [], and there is material from REMIX to get used to the codebase here [].
My current guess is that people who want to use this algorithm should just implement it from scratch themselves--using our software is probably more of a pain than it's worth if you don't already have some reason to use it.
Pranav Gade (1mo):
I ended up throwing this ( []) together over the weekend - it's probably very limited compared to Redwood's thing, but seems to work on the one example I've tried.
nope, but hopefully we'll release one in the next few weeks.

I'm confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, because you can be Dutch booked if you don't. I'd think if you're updateless, that means you already accept the independence axiom (because you wouldn't be time-consistent otherwise).

And in that sense it seems reasonable to assume that someone who doesn't already accept the independence axiom is also not updateless.

I agree it's important to be careful about which policies we push for, but I disagree both with the general thrust of this post and the concrete example you give ("restrictions on training data are bad").

Re the concrete point: it seems like the clear first-order consequence of any strong restriction is to slow down AI capabilities. Effects on alignment are more speculative and seem weaker in expectation. For example, it may be bad if it were illegal to collect user data (eg from users of chat-gpt) for fine-tuning, but such data collection is unlikely to fa... (read more)

I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed".

(Though of course it's important to spell the argument out)

Ajeya Cotra (3mo):
Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.

I agree with your general point here, but I think Ajeya's post actually gets this right, eg

There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful -- and once human knowledge/control has eroded enough -- an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”


What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from

... (read more)
Lauro Langosco (3mo):
I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed". (Though of course it's important to spell the argument out)

FWIW I believe I wrote that sentence and I now think this is a matter of definition, and that it’s actually reasonable to think of an agent that e.g. reliably solves a maze as an optimizer even if it does not use explicit search internally.

  • importance / difficulty of outer vs inner alignment
  • outlining some research directions that seem relatively promising to you, and explain why they seem more promising than others
Charlie Steiner (6mo):
I feel like I'm pretty off outer vs. inner alignment.

People have had a go at inner alignment, but they keep trying to affect it by taking terms for interpretability, or modeled human feedbacks, or characteristics of the AI's self-model, and putting them into the loss function, diluting the entire notion that inner alignment isn't about what's in the loss function.

People have had a go at outer alignment too, but (if they're named Charlie) they keep trying to point to what we want by saying that the AI should be trying to learn good moral reasoning, which means it should be modeling its reasoning procedures and changing them to conform to human meta-preferences, diluting the notion that outer alignment is just about what we want the AI to do, not about how it works.

I would be very curious to see your / OpenAI's responses to Eliezer's Dimensions of Operational Adequacy in AGI Projects post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven't implemented the recommendations, what's stopping you?

People at OpenAI regularly say things like

And you say:

  • OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely
... (read more)
Evan R. Murphy (2mo):
Probably true at the time, but in December Jan Leike did write in some detail about why he's optimistic about OpenAI's approach:

(Note that I'm not making a claim about how search is central to human capabilities relative to other species; I'm just saying search is useful in general. Plausibly also for other species, though it is more obvious for humans)

From my POV, the "cultural intelligence hypothesis" is not a counterpoint to importance of search. It's obvious that culture is important for human capabilities, but it also seems obvious to me that search is important. Building printing presses or steam engines is not something that a bundle of heuristics can do, IMO, without gainin... (read more)

Ivan Vendrov (7mo):
Yeah it's probably definitions. With the caveat that I don't mean the narrow "literally iterates over solutions", but roughly "behaves (especially off the training distribution) as if it's iterating over solutions", like Abram Demski's term selection. []

I think you overestimate the importance of the genomic bottleneck. It seems unlikely that humans would have been as successful as we are if we were... the alternative to the kind of algorithm that does search, which you don't really describe.

Performing search to optimize an objective seems really central to our (human's) capabilities, and if you want to argue against that I think you should say something about what an algorithm is supposed to look like that is anywhere near as capable as humans but doesn't do any search.
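To make the distinction concrete, here's a toy sketch (purely illustrative, not from the discussion) of the difference between a policy that performs search and one that is a fixed bundle of heuristics:

```python
# Toy contrast between a "search" policy, which iterates over candidate
# actions and scores them against an explicit objective, and a "heuristic"
# policy, which maps situations to actions directly without ever
# evaluating alternatives.

def search_policy(state, candidate_actions, score):
    # considers every candidate and picks the best under the objective
    return max(candidate_actions, key=lambda a: score(state, a))

def heuristic_policy(state, lookup_table, default_action):
    # fixed stimulus-response mapping; no objective, no iteration
    return lookup_table.get(state, default_action)

# The search policy generalizes to novel objectives "for free"; the
# heuristic policy only does whatever its table happens to encode.
actions = ["left", "right", "forward"]
print(search_policy("corridor", actions, lambda s, a: len(a)))   # "forward"
print(heuristic_policy("corridor", {"corridor": "forward"}, "left"))
```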

Ivan Vendrov (7mo):
I disagree that performing search is central to human capabilities relative to other species. The cultural intelligence hypothesis [] seems much more plausible: humans are successful because our language and ability to mimic allow us to accumulate knowledge and coordinate at massive scale across both space and time. Not because individual humans are particularly good at thinking or optimizing or performing search. (Not sure what the implications of this are for AI.)

You're right though, I didn't say much about alternative algorithms other than point vaguely in the direction of hierarchical control. I mostly want to warn people not to reason about inner optimizers the way they would about search algorithms.

But if it helps, I think AlphaStar [] is a good example of an algorithm that is superhuman in a very complex strategic domain but is very likely not doing anything like "evaluating many possibilities before settling on an action". In contrast to AlphaZero (with rollouts), which considers tens of thousands of positions [] before selecting an action. AlphaZero (just the policy network) I'm more confused about... I expect it still isn't doing search, but it is literally trained to imitate the outcome of a search so it might have similar mis-generalization properties?

Gotcha, this makes sense to me now, given the assumption that to get AGI we need to train a P-parameter model on the optimal scaling, where P is fixed. Thanks!

...though now I'm confused about why we would assume that. Surely that assumption is wrong?

  • Humans are very constrained in terms of brain size and data, so we shouldn't assume that these quantities are scaled optimally in some sense that generalizes to deep learning models.
  • Anyhow we don't need to guess the amount of data the human brain needs: we can just estimate it directly, just like we estimate
... (read more)

But in my report I arrive at a forecast by fixing a model size based on estimates of brain computation, and then using scaling laws to estimate how much data is required to train a model of that size. The update from Chinchilla is then that we need more data than I might have thought.

I'm confused by this argument. The old GPT-3 scaling law is still correct, just not compute-optimal. If someone wanted to, they could still go on using the old scaling law. So discovering better scaling can only lead to an update towards shorter timelines, right?

(Except if you had expected even better scaling laws by now, but it didn't sound like that was your argument?)

If you assume the human brain was trained roughly optimally, then requiring more data, at a given parameter number, to be optimal pushes timelines out. If instead you had a specific loss number in mind, then a more efficient scaling law would pull timelines in.
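A rough numeric sketch of that distinction (the tokens-per-parameter coefficients below are ballpark public rules of thumb, not the report's actual numbers):

```python
# Two stylized scaling prescriptions for how much data a model of FIXED
# size N "should" be trained on. Coefficients are rough illustrative
# assumptions, used only to show the direction of the timelines update.

def chinchilla_tokens(n_params):
    # Chinchilla-style compute-optimal training: ~20 tokens per parameter
    return 20 * n_params

def older_tokens(n_params):
    # Older prescriptions implied far fewer tokens per parameter at scale
    return 1.7 * n_params

N = 1e12  # hypothetical fixed parameter count (e.g. anchored to the brain)
print(f"older law:      {older_tokens(N):.1e} tokens")
print(f"Chinchilla law: {chinchilla_tokens(N):.1e} tokens")
# With N fixed, the newer law demands ~10x more data, pushing timelines out.
# With a target loss fixed instead, a better scaling law can only pull them in.
```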

What would make you change your mind about robustness of behavior (or interpretability of internal representations) through the sharp left turn? Or about the existence of such a sharp left turn, as opposed to smooth scaling of ability to learn in-context?

For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?

Rob Bensinger (9mo):
From A central AI alignment problem: capabilities generalization, and the sharp left turn []:

(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)

The basics

  • It doesn't prevent you from shutting it down
  • It doesn't prevent you from modifying it
  • It doesn't deceive or manipulate you
  • It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
  • It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
  • If it self-modifies or co
... (read more)
I really like this list because it does a great job of explicitly specifying the same behavior I was trying to vaguely gesture at in my list [] when I kept referring to AGI-as-a-contract-engineer. Even your point that it doesn't have to succeed -- it's ok for it to fail at a task if it can't reach it in some obvious, non-insane way -- that's what I'd expect from a contractor.

The idea that an AGI would find that a task is generally impossible but identify a novel edge case that allows it to be accomplished with some ridiculous solution involving nanotech, and then it wouldn't alert or tell a human about that plan prior to taking it, has always been confusing to me. In engineering work, we almost always have expected budget / time / material margins for what a solution looks like. If someone thinks that solution space is empty (it doesn't close), but they find some other solution that would work, people discuss that novel solution first and agree to it. That's a core behavior I'd want to preserve.

I sketched it out in another document I was writing a few weeks ago, but I was considering it in the context of what it means for an action to be acceptable. I was thinking that it's actually very context dependent -- if we approve an action for AGI to take in one circumstance, we might not approve that action in some vastly different circumstance, and I'd want the AGI to recognize the different circumstances and ask for the previously-approved-action-for-circumstance-A to be reapproved-for-circumstance-B.

EDIT: Posting this has made me realize that the idea of context dependencies is applicable more widely than just allowable actions, and it's relevant to discussion of what it means to "optimize" or "solve" a problem as well. I've suggested this in my other posts but I don't think I ever said it explicitly: if you consider human infrastructure, and human ec

Minor comment on clarity: you don't explicitly define relaxed adversarial training (it's only mentioned in the title and the conclusion), which is a bit confusing for someone coming across the term for the first time. Since this is the current reference post for RAT I think it would be nice if you did this explicitly; for example, I'd suggest renaming the second section to 'Formalizing relaxed adversarial training', and within the section call it that instead of 'Paul's approach'.

Good point—edited.

But since we're not doing that, there's nothing to counteract the negative gradient that removes the inner optimizer.

During training, the inner optimizer has the same behavior as the benign model: while it's still dumb it just doesn't know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.

So training does not select for a benign model over a consequentialist one (or at least it does not obviously select for a benign model; I don't know how the inductive biases will work out here). Once the consequentialist ac... (read more)

You're still assuming that you have a perfect consequentialist trapped in a box. And sure, if you have an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, and if not in training does some sort of dangerous consequentialist thing, then that AI will do well in the loss function and end up doing some sort of dangerous consequentialist thing once deployed.

But that's not specific to doing some sort of dangerous consequentialist thing. If you've got an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, but otherwise throws null pointer exceptions, then that AI will also do well in the loss function but end up throwing null pointer exceptions once deployed. Or if you've got an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, but otherwise shows a single image of a paperclip, then again you have an AI that does well in the loss function but ends up showing a single image of a paperclip once deployed.

The magical step we're missing is: why would we end up with a perfect consequentialist in a box? That seems like a highly specific hypothesis for what the predictor would do. And if I try to reason about it mechanistically, it doesn't seem like the standard ways AI gets made, i.e. by gradient descent, would generate that. Because with gradient descent, you try a bunch of AIs that partly work, and then move in the direction that works better. And so with gradient descent, before you have a perfect consequentialist that can accurately predict whether it's in training, you're going to have an imperfect consequentialist that cannot accurately predict whether it's in training. And this might sometimes accidentally decide that it's not in training, and output a prediction that's "intended" to control the world at the cost of some marginal prediction accuracy, and then the gradient is going to notice

Hm I don't think your objection applies to what I've written? I don't assume anything about using a loss like L(μ) = L_μ(μ). In the post I explicitly talk about offline training where the data distribution is fixed.

Taking a guess at where the disagreement lies, I think it's where you say

And L∗ seems much more tame than L to me.

L∗ does not in fact look 'tame' (by which I mean safe to optimize) to me. I'm happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.

You haven't given any instrume

... (read more)
Fundamentally, the problem is this: The worry is that the predictive model will output suboptimal predictions in the immediate run in order to set up conditions for better predictions later. Now, suppose somehow some part of the predictive model gets the idea to do that. In that case, the predictions will be, well, suboptimal; it will make errors, so this part of the predictive model will have a negative gradient against it. If we were optimizing it to be agentic (e.g. using L), this negative gradient would be counterbalanced by a positive gradient that could strongly reinforce it. But since we're not doing that, there's nothing to counteract the negative gradient that removes the inner optimizer.

Well, you assume you'll end up with a consequentialist reasoner with an inner objective along the lines of L. Suppose the model outputs a prediction that makes future predictions easier somehow. What effect will that have on L∗? Well, L∗(μ) = L_μ(μ) − max_m L_μ(m), and it may increase L_μ(μ), so you might think it would be expected to increase L∗. But presumably it would also increase max_m L_μ(m), cancelling out the increase in L_μ(μ).

I think that most nontrivial choices of loss function would give rise to consequentialist systems, including the ones you write down here.

In the post I was assuming offline training, that is in your notation E_{x,y∼P_0}[L(μ(y),x)], where P_0 is the distribution of the training data, unaffected by the model. This seems even more tame than L∗, but still dangerous because AGI can just figure out how to affect the data distribution 'one-shot' without having to trial-and-error learn how during training.

Well, I still don't find your argument convincing. You haven't given any instrumental convergence theorem, nor have you updated your informal instrumental convergence argument to bypass my objection.

Yeah this seems right! :) I am assuming no one ever inspects a partial output. This does seem risky, and it's likely there are a bunch more possible failure modes here.

(Btw, thanks for this exchange; just wanted to note that it was valuable for me and made me notice some mistakes in how I was thinking about oracles)

This sounds like what Fix #2 is saying, which is meant to be addressed in the 'Third Problem' paragraph.

To paraphrase that paragraph: the model that best predicts the data is likely to be a consequentialist. This is because consequentialists are general in a way that heuristic or other non-consequentialist systems aren't, and generality is strongly selected for in domains that are very hard.

Curious if you disagree with anything in particular in that paragraph or what I just said.

Let's be formal about it. Suppose you've got some loss function L(x̂, x) measuring the difference between your prediction x̂ = m(y) and the reality x, and you use this to train a predictor m. Once you deploy this predictor, it will face a probability distribution P_m(x,y). So when we collect data from reality and use it as input for our predictor, this means that we are actually optimizing the function L_m(μ) = E_{x,y∼P_m}[L(μ(y),x)].

Reasoning about the model μ that you get by optimizing L_m(μ) is confusing, so you seem to want to shortcut it by considering what models are selected for according to the function L(μ) = L_μ(μ). It is indeed true that optimizing for L(μ) would give you the sort of agent that you are worried about. However, optimizing through L is really hard, because you have to reason about the effects of μ on P_μ. Furthermore, as you've mentioned, optimizing it generates malevolent AIs, which is not what people creating prediction models are aiming for. Therefore nobody is going to use L(μ) to create predictive AI.

But isn't L still a reasonable shortcut for thinking about what you're selecting for when creating a predictive model? No, not at all, because you're not passing the gradients through P_μ. Instead, when you work out the math for what you're selecting for [], it looks more like optimizing the loss function L∗(μ) = L_μ(μ) − max_m L_μ(m) (or min, depending on whether you are maximizing or minimizing). And L∗ seems much more tame than L to me.
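A toy numeric check of that cancellation claim (the loss table is made up; `L[(mu, m)]` stands for the loss of model m on the data distribution induced by deploying μ):

```python
# Toy illustration of the L vs L* distinction. Entry L[(mu, m)] is the
# expected loss of model m on the distribution induced by deploying mu.
# The numbers are invented: deploying the performative model B makes the
# world easier to predict for EVERY model (all losses drop by 0.2).
L = {
    ("A", "A"): 0.3, ("A", "B"): 0.5,   # honest A leaves the world as-is
    ("B", "A"): 0.1, ("B", "B"): 0.3,   # performative B eases prediction
}
MODELS = ("A", "B")

def L_perf(mu):
    # "performative" objective: L(mu) = L_mu(mu)
    return L[(mu, mu)]

def L_star(mu):
    # L*(mu) = L_mu(mu) - min_m L_mu(m)   (min, since these are losses)
    return L[(mu, mu)] - min(L[(mu, m)] for m in MODELS)

print(L_perf("A"), L_perf("B"))  # 0.3 0.3 -> L alone can't separate them
print(L_star("A"), L_star("B"))  # L* (minimized) prefers the honest model A,
# because B's world-easing helps all models equally and cancels out.
```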

Ah yes, I missed that the oracle needs to be myopic, i.e. care only about the next prediction. I edited my definition of counterfactual oracle to include this (I think this is standard, as Stuart Armstrong's paper also assumes myopia).

If it's not myopic you're right that it might help construct a misaligned system, or otherwise take over the world. I think that myopia is enough to prevent this though: If Oracle1 cares only about the current prediction, then there is no incentive for it to construct Oracle2, since Oracle2 can only help in future episodes.
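The counterfactual-oracle setup under discussion can be sketched like this (a minimal toy after Armstrong's proposal; the function names and the squared-error scoring are my own assumptions):

```python
import random

# One training episode of a counterfactual oracle. Reward is only ever
# assigned on "erasure" episodes, where the output stays sealed and is
# scored against the world in which nobody saw it. On read episodes the
# (myopic) oracle gets no reward at all, so it has nothing to gain by
# steering those branches.
def run_episode(oracle, world, erasure_prob=0.05, rng=random):
    prediction = oracle()
    if rng.random() < erasure_prob:
        outcome = world(saw_prediction=None)   # world unaffected by output
        return -(prediction - outcome) ** 2    # score honest prediction
    else:
        world(saw_prediction=prediction)       # humans read the answer...
        return 0.0                             # ...but no reward flows back
```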

Even if the oracle is myopic, there are still potential failure modes of the form "start outputting answer; [wait long enough for Oracle 2 to be built and take over the world]; finish outputting answer", no? (I suppose you can partially counter this by ensuring outputs are atomic, but relying on no-one inspecting a partial output to prevent an apocalypse seems failure-prone. Also, given that I thought of this failure mode immediately, I'd be worried that there are other more subtle failure modes still lurking.)

Hm I still think it works? All oracles assume null outputs from all oracles including themselves. Once a new oracle is built it is considered part of this set of null-output-oracles. (There are other hitches like, Oracle1 will predict that Oracle2 will never be built, because why would humans build a machine that only ever gives null outputs. But this doesn't help the oracles coordinate as far as I can see).

I'm admittedly somewhat out of my depth with acausal cooperation. Let me flesh this out a bit. Oracle 1 finds a future that allows an Oracle 2 (that does not fall inside the same set) to be built. Oracle 1 outputs predictions that both fall under said constraint, and that maximize return for Oracle2. Oracle 2 in turn outputs predictions that maximize return for Oracle1.

Thanks for the rec! I knew TRC was awesome but wasn't aware you could get that much compute.

Still, beyond short-term needs it seems like this is a risky strategy. TRC is basically a charity project that AFAIK could be shut down at any time.

Overall this updates me towards "we should very likely do the GCP funding thing. If this works out fine, setting up a shared cluster is much less urgent. A shared cluster still seems like the safer option in the mid to long term, if there is enough demand for it to be worthwhile"

Curious if you disagree with any of this

Yes, it's possible TRC could shut down or scale back its grants. But then you are no worse off than you are now. And if you begin building up a shared cluster as a backup or alternative, you are losing the time-value of the money/research and it will be increasingly obsolete in terms of power or efficiency, and you aren't really at much 'risk': a shutdown means that a researcher switches gears for a bit or has to pay normal prices like everyone else etc, but there's no really catastrophic outcome like going-bankrupt. OK, you lose the time and effort you invested in learning GCP and setting up such an 'org' in it, but that's small potatoes - probably buying a single A100 costs more! For DL researchers, the rent vs buy dichotomy is always heavily skewed towards 'rent'. (Going the GCP route has other advantages in terms of getting running faster and building up a name and practical experience and a community who would even be interested in using your hypothetical shared cluster.)

Proof: The only situation in which the iteration scheme does not update the decision boundary B is when we fail to find a predictor that does useful computation relative to E. By hypothesis, the only way this can happen is if E does not contain all of E0 or E = C. Since we start with E0 and only grow the easy set, it must be that E = C.

(emphasis mine)

To me it looks like the emphasized assumption (that it's always possible to find a predictor that does useful computation) is the main source of your surprising result, as without it the iteration would not... (read more)

Here's a few more questions about the same strategy:

If I understand correctly, the IG strategy is to learn a joint model for observations and actions pθ(v,a;Z), where v, a, and Z are video, actions, and proposed change to the Bayes net, respectively. Then we do inference using pθ(v,a;Z∗), where Z∗ is optimized for predictive usefulness.

This fails because there's no easy way to get P(diamond is in the vault) from pθ.

A simple way around this would be to learn pθ(v,a,y;Z) instead, where y=1 if the diamond is in the vault and y=0 otherwise.

  1. Is my understanding
... (read more)

Would you consider this a valid counter to the third strategy (have humans adopt the optimal Bayes net using imitative generalization), as alternative to ontology mismatch?

Counter: In the worst case, imitative generalization / learning the human prior is not competitive. In particular, it might just be harder for a model to match the human inference than to simply learn Z. Here Z is the set of instructions as in learning the prior (I think in the context of ELK, Z would be the proposed change to the human Bayes net?)

Lauro Langosco (1y):
Here's a few more questions about the same strategy:

If I understand correctly, the IG strategy is to learn a joint model for observations and actions pθ(v,a;Z), where v, a, and Z are video, actions, and proposed change to the Bayes net, respectively. Then we do inference using pθ(v,a;Z∗), where Z∗ is optimized for predictive usefulness.

This fails because there's no easy way to get P(diamond is in the vault) from pθ.

A simple way around this would be to learn pθ(v,a,y;Z) instead, where y=1 if the diamond is in the vault and 0 otherwise.

1. Is my understanding correct?
2. If so, I would guess that my simple workaround doesn't count as a strategy because we can only use this to predict whether the diamond is in the vault (or some other set of questions that must be fixed at training time), as opposed to any question we want an answer to. Is this correct? Is there some other reason this wouldn't count, or does it in fact count?