Wiki Contributions


strictly weaker

add "than Condorcet" in this sentence since it's only implied but not said


I think we agree modulo terminology, with respect to your remarks up to the part about the Krakovna paper, which I had to sit and think a little bit more about.

For the Krakovna paper, you're right that it has a different flavor than I remembered - it still seems, though, that the proof relies on having some ratio of recurrent vs. non-recurrent states. So if you did something like 1000x the number of terminal states, the reward function is 1000x less retargetable to recurrent-states - I think this is still true even if the new terminal states are entirely unreachable as well?

With respect to the CNN example I agree, at least at a high-level - though technically the theta reward vectors are supposed to be of length |S| and specify the rewards for each state, which is slightly different than being the weights of a CNN - without redoing the math, it's plausible that an analogous theorem would hold. Regardless, the non-shutdown result gives retargetability because it assumes there's a single terminal state and many recurrent states. The retargetability is really just the ratio (number of recurrent states) / (number of terminal states), which needn't be greater than one.

Anyways, as the comments from TurnTrout talk about, as soon as there's a nontrivial inductive bias over these different reward-functions (or any other path-dependence-y stuff that deviates from optimality), the theorem doesn't go through, as retargetability is all based on counting how many of the functions in that set are A-preferring vs. B-preferring - there may be an adaptation to the argument that uses some prior over generalizations and stuff, though - but then that prior is the inductive bias, which as you noted with those TurnTrout remarks, is its own whole big problem :')

I'll try and add a concise caveat to your doc, thanks for the discussion :)


Hi thanks for the response :) So I'm not sure what distinction you're making between utility and reward functions, but as far as I can tell we're referring to the same object - the thing which is changed in the 'retargeting' process, the parameters theta - but feel free to correct me if the paper distinguishes between these in a way I'm forgetting; I'll be using "utility function", "reward function" and "parameters theta" interchangeably, but will correct if so.

I think perhaps we're just calling different objects as "agents" - I mean p(__ | theta) for some fixed theta (i.e. you can't swap the theta and still call it the same agent, on the grounds that in the modern RL framework, probably we'd have to retrain a new agent using the same higher-level learning process), and you perhaps think of this theta as an input to the agent, which can be changed without changing the agent? If this is the definition you are using, then I believe your remarks are correct. Either way, I think the relevant subtlety weakens the theorems a fair bit from what a first-reading would suggest, and thus is worth talking about.

I'll outline my meaning more below to try and clarify:

The salient distinction I wanted to make was that this theorem applies over something like ensembles of agents - that if you choose a parameter-vector theta at random, the agent will have a random preference over observations; then because some actions immediately foreclose a lot of observations (e.g., dying prevents you from leaving Moctezuma temple room 1), these actions become unlikely. My point is that the nominal thrust of the theorems is weaker than proving that an agent will likely seek power; it proves that selecting from the ensemble of agents in this way will see agents seek power.

The difference is that this doesn't super clearly apply to a single agent. In particular, in real life, we do not select these theta uniformly-at-random at all; the model example is RLHF, the point of which is, schematically, to concentrate a bunch of probability mass within this overall space of thetas to ones we like. Given that any selection process we actually use should be ~invariant to option variegation (i.e. shouldn't favour a family of outcomes more just because it has 'more members'), this severely limits the applicability of the theorems to practical agent-selection-processes - their premises (I think) basically amount to assuming option-variegation is the only thing that matters.

As a concrete example of what I mean, consider the MDP where you can choose either 0 to end the game, or 1 to continue, up to a total of T steps. Say the universe of possible reward functions is the integers 1 to T, representing the one timestep at which an agent gets reward for quitting, getting 0 otherwise. Also ignore time-discounting for simplicity. Then the average optimal policy will be to go to T/2 (past which point half of all possible rewards favor continuing, and half would have ended it earlier) - however this says nothing about what a particular trained agent will do. If we wanted to train an RL agent to end at, say, step 436, even if T is 10^6, the power-seeking theorems aren't a meaningful barrier to doing so because we specify the reward function rather than selecting it from an ensemble at random. Similarly, even though a room with an extra handful of atoms has combinatorially many more states than one without those extra atoms, we wouldn't reasonably expect an RL agent to prefer the former to the latter solely on the grounds that having more states means that more reward functions will favor those states.
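To make the counting in this example concrete, here is a minimal sketch in Python. The function names and the brute-force search are illustrative choices of mine rather than anything from the papers, and it assumes that reward function k "favors continuing" at step t exactly when k > t.

```python
from fractions import Fraction


def optimal_quit_step(k: int, T: int) -> int:
    """Brute-force the best quitting step for reward function k.

    Reward function k pays 1 for quitting at step k and 0 otherwise,
    with no time-discounting (both as in the example above).
    """
    return max(range(1, T + 1), key=lambda t: 1 if t == k else 0)


def fraction_favoring_continue(t: int, T: int) -> Fraction:
    """Share of the T reward functions for which continuing at step t is optimal.

    Assumption: reward function k favors continuing at step t iff k > t.
    """
    return Fraction(sum(1 for k in range(1, T + 1) if k > t), T)


def mean_optimal_quit_step(T: int) -> Fraction:
    """Optimal quitting step averaged uniformly over all T reward functions."""
    return Fraction(sum(optimal_quit_step(k, T) for k in range(1, T + 1)), T)
```

With T = 1000, `fraction_favoring_continue(500, 1000)` is 1/2, while `optimal_quit_step(436, 1000)` is simply 436: the ensemble average and the behavior under one specified reward function are separate questions.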

That said, the stronger view that individual agents trained will likely seek power isn't without support even with these caveats - V. Krakovna's work (which you also list) does seem to point more directly in the direction of particular agents seeking power, as it extends the theorems in the direction of out-of-distribution generalization. It seems more reasonable to model out-of-distribution generalization via this uniform-random selection than the overall reward-function selection, even as this still isn't a super-duper realistic model of the generalization, since it still depends on the option-variegation. I expect that if the universe of possible reward functions doesn't scale with the number of possible states (as it would not if you used a fixed-architecture NN to represent the reward function), this theorem would not go through in the same way.

Let me know your thoughts :)

Remarks on the Slow Boring post about "Kritiks" in debate, copied here for a friend who wanted to reference them, and lightly edited to fit the shortform format:

In the format of debate I do ("British Parliamentary" (BP), also my favorite, having also done "Public Forum"), this sort of thing is sorta-kinda explicitly not allowed. Ironically, I think BP is somewhere where this would be most appropriate - because you don't have the 'motion' until 15 minutes prior to start, focus on ground-facts is low (only relatively, compared to other research-heavy styles); you spend a lot of time arguing about "moral weighing-criteria" and "framing" and things like this.

We do have a pretty clear line where you can, say, go full-Marxist in saying some particular impact is the most important - you have to defend this framing, not just assert "Marxism says X" - but you definitely can't tell the motion to go fuck itself. (A late-debate reframing is called a "squirrel" and only allowed if the opening-half was batshit insane with the framing).

Specifically, you pretty much can get away with saying "zoos are anthropocentric and that's bad because...", but then you have to justify that and "weigh it out" against the other cases people bring. That is, you can win with a case that would be Kritik-adjacent (i.e. radical but you still can't challenge the motion), but you then have to argue why that's the relevant framing of the debate, why it's the most important thing, etc.

I only know EU debate for BP, so it's possible it's just that US debate as a whole is weird, but I don't think so since I've seen e.g. Princeton compete. In the Slow Boring article they note that it's worst in "Policy" debate, which sounds reasonable since that's always been the most Goodharted format (I've heard of classes where they teach you to speak at 350 WPM). Also worth noting that I never debated at an especially high level, and only in the Netherlands, but I've watched a few world-finals debates and some online workshops and nothing Kritik-flavored has come up.

I find BP is actually one of the best antidotes to Twitter hot-takes, in part I'd guess because framing etc. is an integral part of the debate anyways, so what would be Kritik "flip-the-table" BS in Policy can be (is required to be) much more productive in BP. Perhaps this is the structure of the format, but it may also just be the community.

Hi just a quick comment regarding the power-seeking theorems post: the definition you give of "power" as expected utility of optimal behavior is not the same as that used in the power-seeking theorems.

The theorems are not about any particular agent, but are statements about processes which produce agents. The definition of power is more about the number of states an agent can access. Colloquially, they're more of the flavor "for a given optimizing-process, training it on most utility functions will cause the agent to take actions which give it access to a wide-range of states"

Critically, this is a statement about behavior of different agents trained with respect to different utility functions, then averaged over all possible utility functions. Because you average over all possible utility-functions, states with high-power are just those which have more utility functions associated with them, which is, speaking loosely, proportional to the number of states the agent can get to from the state in question.
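A toy count may help here. The sketch below is an illustration under my own assumptions, not the construction used in the papers: it assumes a start state with two actions, where hypothetical action A leads to one terminal state and action B leads to three, and it treats "all possible utility functions" as every strict ranking of the four terminal states.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical toy environment: which terminal states each first action can reach.
REACHABLE = {"A": ["a1"], "B": ["b1", "b2", "b3"]}


def preferred_action(utility: dict) -> str:
    """Action whose reachable set contains the highest-utility terminal state."""
    return max(REACHABLE, key=lambda action: max(utility[s] for s in REACHABLE[action]))


def fraction_preferring(action: str) -> Fraction:
    """Share of all strict rankings of terminal states under which `action` is optimal."""
    states = [s for reachable in REACHABLE.values() for s in reachable]
    rankings = list(permutations(range(len(states))))
    hits = sum(
        1 for ranks in rankings
        if preferred_action(dict(zip(states, ranks))) == action
    )
    return Fraction(hits, len(rankings))
```

Here `fraction_preferring("B")` comes out to 3/4, matching B's share of the reachable terminal states (3 of 4), even though no individual utility function was told to "seek options".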

Moderate (85%) confidence; based on having read two of the three listed power-seeking papers a few months ago for MATS apps, and writing an unfinished summary about them.

Tl;dr is that your argument doesn't meaningfully engage the counterproposition, and I think this not only harms your argument, but severely limits the extent to which the discussion in the comments can be productive. I'll confess that the wall of text below was written because you made me angry, not because I'm so invested in epistemic virtue - that said, I hope it will be taken as constructive criticism which will help the comments-section be more valuable for discussion :)

  • Missing argument pieces: you lack an argument for why higher fertility rates are good, but perhaps more importantly, to whom such benefits accrue (i.e. how much of the alleged benefit is spillover/externalities). Your proposal also requires a metric of justification (i.e. "X is good" is typically insufficient to entail "the government should do X" - more is needed). I think you engage this somewhat when you discuss rationale for the severity of the law, but your proposal would require the deliberate denial of a right of free association to certain people - if you think this is okay, you should explicitly state the criterion of severity by which such a decision should be made. If this restriction is ultimately an acceptable cost, why does the state - in particular - have the right / obligation to enforce this, as opposed to leaving it to individual good judgement / families / community / etc? (though you address this in some of the comments, you really only address the "worst-case" scenario; see my next remark)

  • You need more nuanced characterizations, and a comparison of the different types of outcomes you could expect to see. While you take a mainline case with respect to technology, your characterization of "who is seeking AI romantic partners and why" is, with respect, very caricatured - there's no real engagement with the cases your contralocutor would bring up - what about the former incel who is now happily married to his AI gf? Or the polycule that has an AI just because they think it's neat? Is this extremely unlikely in your reasoning? Relatively unimportant? More engagement with the cases inconvenient to your argument would make it much stronger (and incidentally tends to make it sound more considerate).

  • You should be more comparative: compare your primary impacts with other possible concerns in terms of magnitude, group-of-incidence (i.e. is it harming groups to whom society has incurred some special obligation to protect?), and severity. You also need a stronger support for some assertions on which your case hinges: why should it be so much more likely that an AI is going to fall far short of a human partner? If that's not true, how important is this compared to your points on fertility? How should these benefits be compared to what a contralocutor would argue occurs in the best case (in likelihood, magnitude, and relative importance)?

(these last two are a little bit more specific - I get that you're not trying to really write the bill right now, so if this is unfair to the spirit of your suggestion, no worries, feel free to ignore it)

  • your legislation proposal is highly specific and also socially nonstandard - the "18 is an adult" line is, of course, essentially arbitrary and there's a reasonable case to make either way, but because 30 is very high (I'm not aware of any similarly-sweeping law that sets it above 21 in the US at least), you do assume a burden of proof that, imo, is heavier than just stating that the frontal cortex stops developing at 25, then tacking on five years (why?). Similarly for the remaining two criteria - to suggest that mental illnesses disqualify someone from being in a romantic relationship I think clearly requires some qualification / justification - it may or may not be the case that one might be a danger to one's partner, but again, what is the comparative likelihood? What justifies legal intervention when the partner could presumably leave or not according to their judgement? How harmful would this be to someone wrongly diagnosed? (to be clear, I interpreted you as arguing that some people should be prevented from having partners, and only then should AI partners be made available to people - it's ambiguous to me if that was the case you were trying to make, or if restricting people's partners was a ground assumption that you assumed - if the latter, it's not clear why)

  • for what it's worth, your comment on porn/onlyfans, etc. doesn't actually stop the whataboutism - you compare AI romance to things nominally in the same reference class, assert that they're bad too, and assert that it has no bearing on your argument that these things are not restricted in the same way. It's fine to bite the bullet and say "I stand by my reasoning above, these should be banned too", but you should argue that that is true explicitly if that's what you think. It is also not a broadly accepted fact that online-dating/porn/onlyfans/etc. constitute a net harm; a well-fleshed out argument above could lead someone to conclude this, but it weakens your argument to merely assert it in the teeth of plausible objections.

Thanks for reading, hope it was respectful and productive :)

Answer by CharlesRW72

(Personal bias is heavily towards the upskilling side of the scale) There are three big advantages to “problem first, fundamentals later”:

  1. You get experience doing research directly
  2. You save time
  3. Anytime you go to learn something for the problem, you will always have the context of “what does this mean in my case?”

3 is a mixed bag - sometimes this will be useful because it brings together ideas from far-away in idea-space; other times it will make it harder for you to learn things on their own terms - you may end up basing your understanding on non-central examples of a thing and end up having trouble putting it in its own context, thus making it harder to make it a central node in your knowledge-web. This makes it harder to use and less cognitively available.

By contrast, the advantages of going hard for context-learning:

  1. Learning bottom-up helps resolve this “tools in-context” problem
  2. I’d bet that if you focus on learning this way, you will feel less of the “why can’t I just get this and move past it already” pressure - which imo is highly likely to end up with poor learning overall
  3. Studying things “in order” will give you a lot more knowledge of how a field progresses - giving a feel for what moving things forward should “feel like from the inside”.
  4. Spaced repetition basically solves the “how do I remember basic isolated facts” problem - an integrated “bottom-up” approach is better-suited to building a web of knowledge which will then give you affordances as to when to use particular tools.

The “bottom-up” approach has the risk of making you learn only central examples of concepts - this is best mitigated by taking the time and effort to be playful with whatever you’re learning. Having your “Hamming problem” in mind will also help - it doesn’t need to be a dichotomy between bottom-up and top-down, in that regard.

My recommendation would be to split it and try: for alignment, there’s clearly a “base” of linalg, probability, etc. that in my estimation would be best consumed in its own context, while much of the rest of the work in the field is conceptual enough that mentally tagging what the theories are about (“natural abstractions” or “myopia of LLMs”) is probably sufficient for you to know what you’ll need and when, thus good to index as needed.

Hi!  I have a pretty good amount of experience with playing this game - I have a google spreadsheet collecting all sorts of data wrt food, exercise, habits etc.  that I've been collecting for quite a while.  I've had some solid successes (I'd say improved quality-of-life by 100x, but starting from an unrepresentatively low baseline), but also can share a few difficulties I've had with this approach; I'll just note some general difficulties and then talk about how this might translate into a useful app sorta thing that one could use.

1. It's hard to know what data to collect in advance of having the theory you want to test (this applies mainly to whole classes of hypothesis - tracking what you eat is helpful for food intolerance hypotheses, but not as much for things affecting your sleep).
   - I would recommend starting with a few classes of hypothesis and doing exploratory data analysis once you have data.  e.g. I divide my spreadsheet into "food", "sleep", "habits tracking", "activities", and resultant variables "energy", "mood", and "misc. notes" (each estimated for each ~1/3 of the day).  These might be different depending on your symptoms, but for non-specific symptoms, are a good place to start.  Gut intuitions are, I find, more worth heeding in choosing hypothesis classes than specific hypotheses.

2. Multiple concurrent problems can mean that you might tend to discard hypotheses prematurely.  As an example, I'm both Celiac (gluten-free) and soy-intolerant (though in a way that has no relation to soy-intolerance symptoms that I've seen online - take this as a datapoint).  Getting rid of gluten helped a little, getting rid of soy helped a little, but each individually was barely above the signal-noise threshold; if I were unluckier, I would have missed both.  It's also worth noting that I've found "intervened on X, no effect on Y after 3 weeks of perfect compliance" is, in my experience, only moderately strong evidence to discard the hypothesis (rather than damningly strong like it feels)
   - Things like elimination-diets can help with this if you are willing to invest the effort.  If you do it one at a time, it might be worth trying more than once, with a large time interval in between.
   - If you're looking for small effects, be aware of what your inter-day variability is; at least for me, there's a bias to assume that my status on day N is a direct result of days 1-(N-1).  Really there's just some fundamental amount of variability that you have to know to take as your noise threshold.  You can measure this by 'living the same day' - sounds a little lame, but the value-of-information I've found to be high.
   - If you have cognitive symptoms (in my case, mood fluctuations), note that there might also be some  interference in estimating-in-the-moment, which can be counteracted a little bit by taking recordings once at the time, and retroactively again the next ~day.

3.  Conditioning on success: as DirectedEvolution says in another comment, the space of hypotheses of things you could do to improve becomes enormous if you allow for a large window of time-delay and conjunctions of hypotheses.  I find a useful thinking technique for looking at complicated hypotheses is to ask "would I expect to succeed if the ground-truth were in a hypothesis-class of this level of complexity, and with no incremental improvements by implementing only parts of the plan?".  I don't feel like I've ever had luck testing conjunctions, but timescales are trickier - a physicist's estimate would be to take whatever is the variability timescale of symptom severity, and look in this range.
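The inter-day variability point under item 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: scores are daily self-ratings, the baseline comes from 'living the same day' several times, the helper names are hypothetical, and the cutoff of k = 2 baseline standard deviations is an arbitrary illustrative choice rather than a validated rule.

```python
from statistics import mean, stdev


def noise_threshold(baseline_scores, k=2.0):
    """Noise floor: k sample standard deviations of the 'same day' baseline scores.

    Assumption: k = 2 is an arbitrary cutoff chosen for illustration.
    """
    return k * stdev(baseline_scores)


def exceeds_noise(baseline_scores, intervention_scores, k=2.0):
    """True if the shift in mean score is larger than the baseline noise floor."""
    shift = mean(intervention_scores) - mean(baseline_scores)
    return abs(shift) > noise_threshold(baseline_scores, k)
```

A shift that stays below this floor isn't evidence of no effect; it just means the data collected so far can't distinguish the intervention from ordinary day-to-day variation.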

As an overall comment on the app idea: definitely a good idea and I'd love to use it!  And super-duper would double-love to have a data-set over "mysterious chronic illnesses like mine".  I think there could be a lot of value-added also in a different aspect of what you're talking about accomplishing - specifically, having a well-curated list of "things people have found success by tracking in the past", and "types of hypothesis which people have found success by testing in the past" might be more valuable than the ability to do a lot of statistics on your data (I've found that any hypothesis which is complex enough to need statistics fails my 'conditioning-on-success' test)

Hope there's something useful in here; just something I think about a lot, so sorry if I went on too long ahah.  I expect this advice is biased towards reflecting what actually worked for me - food-eliminations rather than other interventions, named-disease-diagnoses, etc. so feel free to correct as you see fit.

There are also a number of good posts about self-experimentation on, and a few more good ones at

From one playing the same game, best wishes in making things better using the scientific method :)

Fwiw, my experience with MOOCs/OCW has been extremely positive (mainly in math and physics). Regarding the issue of insufficient depth, for a given 'popular topic' like ML, there are indeed strong incentives for lots of folks to put out courses on these, so there'll be lots to choose from, albeit with a wide variance in quality - cf. all the calculus courses on edX.

That said, I find that as you move away from 'intro level', courses are generally of a more consistent quality, where they tend to rely a lot less on a gimmicky MOOC structure and follow much more closely the traditional style of lecture --> reading --> exercises/psets.  I find this to be true of things that are about equivalent to courses you'd take in ~3rd year uni - e.g.  "Introduction to Operating System Scheduling" as opposed to the "Introduction to Coding" courses or whatever the equivalent is for your area of study.  If you start to look at more advanced/specific courses, you may find the coverage quality improves.

I definitely can't promise this would work for ML, but I've found it useful to think about what I want to learn from a course (concepts?  technique?  'Deep understanding'?) before actually searching for one.  This provides a good gauge: you can usually figure out within a few minutes whether a course meets it, and I think this may be especially relevant to ML, where courses are fairly strongly divided along a tradeoff of 'practice-heavy' vs 'theory-heavy'.

It's worth noting, though, that my physics/math perspective might not be as valid in learning ML, as in the former the most effective way to learn is more-or-less by following some traditional course of studies, whereas the latter has a lot of different websites which teach both technique and theory at a basic-intermediate level; I'd be surprised if they, combined with projects, were less effective than MOOCs at the level for which they're written.  As others have noted, it may be worth looking at OpenCourseWare - MIT's being the most extensive that I'm aware of.  They also offer a 'curriculum map' so you can get a feel for which courses have prerequisites and what the general skillset of someone taking a given class should be.