All of Shoshannah Tekofsky's Comments + Replies

Should we have a "rewrite the Rationalist Basics Discourse" contest?

Not that I think anything is gonna beat this. But still :D

Ps: can be both content and/or style

5the gears to ascension2mo
rewrite contests are, in general, a wonderful idea, if you ask me.

Thank you! I appreciate the in-depth comment.

Do you think any of these groups hold that all of the alignment problem can be solved without advancing capabilities?


And I appreciate the correction -- I admit I was confused about this, and may not have done enough of a deep-dive to untangle this properly. Originally I wanted to say "empiricists versus theorists" but I'm not sure where I got the term "theorist" from either.


And to both examples, how are you conceptualizing a "new idea"? Cause I suspect we don't have the same model on what an idea is.

Good question. I'm using the term "idea" pretty loosely and glossily.

Things that would meet this vague definition of "idea":

  • The ELK problem (like going from nothing to "ah, we'll need a way of eliciting latent knowledge from AIs")
  • Identifying the ELK program as a priority/non-priority (generating the arguments/ideas that go from "this ELK thing exists" to "ah, I think ELK is one of the most important alignment directions" or "nope, this particular problem/approach doesn't matter much")
  • An ELK proposal
  • A specific modification to an ELK proposal that makes it 5% better.

So new ideas could include new problems/subproblems we haven't discovered, solutions/proposals, code to help us implement proposals, ideas that help us prioritize between approaches, etc.

How are you defining "idea" (or do you have a totally different way of looking at things)?

Two things that worked for me:

  1. Produce stuff, a lot of stuff, and make it findable online. This makes it possible for people to see your potential and reach out to you.

  2. Send an email to anyone you admire asking if they are interested in going for a coffee (if you have the funds to fly out to them) or doing a video call. Explain why you admire them and why this would be high value to you. I did this for 4 people without filtering on 'how likely are they to answer' and one of them said 'yeah sure' and I think the email made them happy cause a reasonable subset of people like learning how they have touched others' lives in a positive way.

Even in experiments, I think most of the value is usually from observing lots of stuff, more than from carefully controlling things.

I think I mostly agree with you but have the "observing lots of stuff" categorized as "exploratory studies" which are badly controlled affairs where you just try to collect more observations to inform your actual eventual experiment. If you want to pin down a fact about reality, you'd still need to devise a well-controlled experiment that actually shows the effect you hypothesize to exist from your observations so far.

If you a

...

There is an EU Telegram group where they are, among other things, collecting data on where people are in Europe. I'll DM an invite.

That makes a lot of sense! And I was indeed also thinking of Elicit.

Note: The meetup this month is Wednesday, Jan 4th, at 15:00. I'm in Berkeley currently, and I couldn't see how times were displayed for you guys cause I have no option to change time zones on LW. I apologize if this has been confusing! I'll get a local person to verify dates and times next time (or even set them).

Did you accidentally forget to add this post to your research journal sequence?

I thought I added it but apparently hadn't pressed submit. Thank you for pointing that out!


  1. optimization algorithms (finitely terminating)
  2. iterative methods (convergent)

That sounds as if they are always finitely terminating or convergent, which they're not. (I don't think you wanted to say they are.)

I was going by the Wikipedia definition:

To solve problems, researchers may use algorithms that terminate in a finite number of steps, or iterative methods that converge to a

...
5Leon Lang3mo
I see. I think I was confused since, in my mind, there are many Turing machines that simply do not "optimize" anything. They just compute a function.

I think I wanted to point to a difference in the computational approach of different algorithms that find a path through the universe. If you chain together many locally found heuristics, then you carve out a path through reality over time that may lead to some "desirable outcome". But the computation would be vastly different from another algorithm that thinks about the end result and then makes a whole plan of how to reach this. It's basically the difference between deontology and consequentialism. This post is on similar themes.

I'm not at all sure if we disagree about anything here, though.

I would say that if you remember the plan and retrieve it later for repeated use, then you do this by learning and the resulting computation is not planning anymore. Planning is always the thing you do at the moment to find good results now, and learning is the thing you do to be able to use a solution repeatedly.

Part of my opinion also comes from the intuition that planning is the thing that derives its use from the fact that it is applied in complex environments in which learning by heart is often useless. The very reason why planning is useful for intelligent agents is that they cannot simply learn heuristics to navigate the world.

To be fair, it might be that I don't have the same intuitive connection between planning and learning in my head that you do, so if my comments are beside the point, then feel free to ignore :)

Conceptually it does, thank you! I wouldn't call these parameters and hyperparameters, though. Low-level and high-level features might be better terms.

Again, I think the shard theory of human values might be an inspiration for these thoughts, as well as this post on AGI motivation [https://w

Oh my, this looks really great. I suspect between this and the other list of AIS researchers, we're all just taking different cracks at generating a central registry of AIS folk so we can coordinate at all different levels on knowing what people are doing and knowing who to contact for which kind of connection. However, maintaining such an overarching registry is probably a full time job for someone with high organizational and documentation skills.

Yup, another instance of this is the longtermist census, that likely has the most entries but is not public. Then there's AI Safety Watch, the EA Hub (with the right filters), the mailing list of people who went through AGISF, I'm sure SERI MATS has one, other mailing lists like AISS's opportunities one, other training programs, student groups, people in various entries on []... Yeah, there's some organizing to do. Maybe the EA forum's proposed new profile features will end up being the killer app?

Great idea!

So my intuition is that letting people edit a file that is publicly linked is inviting a high probability of undesirable results (like accidental wipes, unnoticed changes to the file, etc). I'm open to looking into this if the format gains a lot of traction and people find it very useful. For the moment, I'll leave the file as-is so no one's entry can be accidentally affected by someone else's edits. Thank you for the offer though!

Yeah, that is a risk. Have you checked out ASAP? Seems pretty related.

Thank you for sharing! I actually have a similar response myself but assumed it was not general. I'm going to edit the image out.

EDIT: Both points are moot using Stuart Armstrong's narrower definition of the Orthogonality thesis that he argues in General purpose intelligence: arguing the Orthogonality thesis:

High-intelligence agents can exist having more or less any final goals (as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence).

Old post:

I was just working through my own thoughts on the Orthogonality thesis and did a search on LW on existing material and found this essay. I had pretty much the same thoughts on intel...

Hmm, that wouldn't explain the different qualia of the rewards, but maybe it doesn't have to. I see your point that they can mathematically still be encoded into one reward signal that we optimize through weighted factors.

I guess my deeper question would be: do the different qualia of different reward signals achieve anything in our behavior that can't be encoded through summing the weighted factors of different reward systems into one reward signal that is optimized?

Another framing here would be homeostasis - if you accept humans aren't happiness optim...

1Aprillion (Peter Hozák)4mo
Allostasis is a more biologically plausible explanation of "what a brain does" than homeostasis, but to your point: I do think optimizing for happiness and doing kinda-homeostasis are "just the same somehow". I have a slightly circular view that the extension of happiness exists as an output of a network with 86 billion neurons and 60 trillion connections, and that it is a thing that the brain can optimize for. Even if the intension of happiness as defined by a few English sentences is not the thing, and even if optimization for slightly different things would be very fragile (the attractor of happiness might be very small and surrounded by dystopian tar pits), I do think it is something that exists in the real world and is worth searching for. Though if we cannot find any intension that is useful, perhaps other approaches to AI Alignment and not the "search for human happiness" will be more practical.

Clawbacks refer to grants that have already been distributed but would need to be returned. You seem to be thinking of grants that haven't been distributed yet. I hope both get resolved but they would require different solutions. The post above is only about clawbacks though.

Good point. I meant both, since the same logic applies

As a grantee, I'd be very interested in hearing what informs your estimate, if you feel comfortable sharing.

I don't have any special insights. But I would assume that the total amount of undistributed grants is not huge, and there are EA-adjacent funding orgs that have funds available. Using previously selected now-unfunded grantees saves them the work of identifying promising projects. Plus it limits the fallout from the current FTX debacle. Win/win.

Sure. For instance, hugging/touch, good food, or finishing a task all deliver a different type of reward signal. You can be saturated on one but not the others and then you'll seek out the other reward signals. Furthermore, I think these rewards are biochemically implemented through different systems (oxytocin, something-sugar-related-unsure-what, and dopamine). What would be the analogue of this in AI?

1Aprillion (Peter Hozák)4mo
I see. These are implemented differently in humans, but my intuition about the implementation details is that "reward signal" as a mathematically abstract object can be modeled by a single value even if individual components are physically implemented by different mechanisms, e.g. an animal could be modeled as if it was optimizing for a Pareto optimum between a bunch of normalized criteria:

reward = S(hugs) + S(food) + S(finishing tasks) + S(free time) - S(pain) ...

People spend their time cooking, risk cutting fingers, in order to have better food and build relationships. But no one would want to get cancer to obtain more hugs, presumably not even to increase number of hugs from 0 to 1, so I don't feel human rewards are completely independent magisteria, there must be some biological mechanism to integrate the different expected rewards and pains into decisions.

Spending energy on computation of expected value can be included in the model, we might decide that we would get lower reward if we overthink the current decision and that would be possible to model as included in the one "reward signal" in theory, even though it would complicate predictability of humans in practice (however, it turns out that humans can be, in fact, hard to predict, so I would say this is a complication of reality, not a useless complication in the model).
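
A minimal sketch of the single-signal model above, in Python. Everything here is an assumption for illustration: S is taken to be a saturating squash (tanh), the component names and numbers are made up, and nothing about real neurochemistry is implied. It only tries to show that "seek out whichever reward you are not saturated on" can fall out of one summed scalar.

```python
import math

def S(x, scale=1.0):
    """Hypothetical normalizer: squashes a raw amount into [0, 1) with diminishing returns."""
    return math.tanh(x / scale)

def reward(state):
    """One scalar built from separately sourced signals, mirroring the formula above."""
    return (S(state["hugs"]) + S(state["food"]) + S(state["tasks"])
            + S(state["free_time"]) - S(state["pain"]))

def best_next_unit(state, options):
    """Pick the option whose +1 increment raises the scalar reward the most."""
    def gain(name):
        bumped = dict(state, **{name: state[name] + 1})
        return reward(bumped) - reward(state)
    return max(options, key=gain)

# Made-up state: well fed, one task done, no hugs yet.
state = {"hugs": 0.0, "food": 5.0, "tasks": 1.0, "free_time": 1.0, "pain": 0.0}
choice = best_next_unit(state, ["hugs", "food", "tasks"])
print(choice)
```

With food already saturated, the marginal gain from more food is close to zero, so the summed signal favors the unsaturated component without needing separate reward channels in the model.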

ah, like that. Thank you for explaining. I wouldn't consider that a reversal cause you're then still converting intuitions into testable hypotheses. But the emphasis on discussion versus experimentation is then reversed indeed.

What would be the sensible reverse of number 5? I can generate them for 1-4 and 6, but I am unsure what the benefit could be of confusing intuitions with testable hypotheses?

Reversal: when you have different intuitions about high-level questions, it's often not worth spending a lot of time debating them extensively - instead, move onto doing whatever research your intuitions imply will be valuable.

I really appreciate that thought! I think there were a few things going on:

  • Definitions and Degrees: I think in common speech and intuitions it is the case that failing to pick the optimal option doesn't mean something is not an optimizer. I think this goes back to the definition confusion, where 'optimizer' in CS or math literally picks the best option to maximize X no matter the other concerns. While in daily life, if one says they optimize on X then trading off against lower concerns at some value greater than zero is still considered optimizing. E.g. s...
I wouldn't say "picks the best option" is the most interesting thing in the conceptual cluster around "actual optimizer". A more interesting thing is "runs an ongoing, open-ended, creative, recursive, combinatorial search for further ways to greatly increase X".

I mean certainly this is pointing at something deep and important. But the shift here I would say couldn't be coming from agentic IGF maximization, because agentic IGF maximization would have already, before your pregnancy, cared in the same qualitative way, with the same orientation to the intergenerational organism, though about 1/8th as much, about your cousins, and 1/16th as much about the children of your cousins. Like, of course you care about those people, maybe in a similar way as you care about your children, and maybe connected to IGF in some way; but something got turned on, which looks a lot like a genetically programmed mother-child caring, which wouldn't be an additional event if you'd been an IGF maxer. (One could say, you care about your children mostly intrinsically, not mostly because of an IGF calculation. Yes this intrinsic care is in some sense put there by evolution for IGF reasons, but that doesn't make them your reasons.)

Hm. I don't agree that this is very plausible; what I agreed with was that human evolution is closer to an IGF maxer, or at least some sort of myopic IGF maxer, in the sense that it only "takes actions" according to the criterion of IGF.

It's a little plausible. I think it would have to look like a partial Baldwinization of pointers to the non-genetic memeplex of explicit IGF maximization; I don't think evolution would be able to assemble brainware that reliably in relative isolation does IGF, because that's an abstract calculative idea whose full abstractly calculated implications are weird a

On further reflection, I changed my mind (see title and edit at top of article). Your comment was one of the items that helped me understand the concepts better, so just wanted to add a small thank you note. Thank you!


On that note, I was wondering if there was any way I could tag the people that engaged me on this (cause it's spread between 2 articles) just so I can say thanks? Seems like the right thing to do to high five everyone after a lost duel or something? Dunno, there is some sentiment there where a lightweight acknowledgement/update would be a useful thing to deliver in this case, I feel, to signal that people's comments actually had an effect. DM'ing everyone or replying to each comment again would give everyone a notification but generates a lot of clutter and overhead, so that's why tagging seemed like a good route.

4Ben Pace6mo
No especially good suggestion from me. Obvious options:

  • You could make a comment that links to the most helpful comments.
  • You could make one PM convo that includes everyone (you can add multiple people to a PM convo) and link them to the comment.

Agree that tagging/mentions would be nice here.

I wasn't sure how I hadn't argued that, but between all the different comments, I've now pieced it together. I appreciate everyone engaging me on this, and I've updated the essay to "deprecated" with an explanation at the top that I no longer endorse these views.

Applause for putting your thoughts out there, and applause for updating. Also maybe worth saying: it's worth "steelmanning" your past self; maybe the intuitions you expressed in the post are still saying something relevant that wasn't integrated into the picture, even if it wasn't exactly "actually some humans are literally IGF maximizers". Like, you said something true about X, and you thought that IGF meant X, but now you don't think IGF means X, but you still maybe said something worthwhile about X.

Thank you. Between all the helpful comments, I've updated my point of view and updated this essay to deprecated with an explanation + acknowledgement at the top.

In return, your new disclaimer at the beginning of the article made me notice something I was confused about -- whether we should apply the label "X maximizer" only to someone who actually achieves the highest possible value of X, or also to someone who tries but maybe fails. In other words, are we only talking about internal motivation, or describing the actual outcome and expecting perfection?

To use an analogy, imagine a chess-playing algorithm. Is it correct to call it a "chess victory maximizer"? On one hand, the algorithm does not care about anything other than winning at chess. On the other hand, if a better algorithm comes later and defeats the former one, will we say that the former one is not an actual chess victory maximizer, because it did some (in hindsight) non-victory-maximizing moves, which is how it lost the game?

When talking about humans, imagine that a random sci-fi mutation turns someone into a literal fitness maximizer, but at the same time, that human's IQ remains only 100. So the human would literally stop caring about anything other than reproduction, but maybe would not be smart enough to notice the most efficient strategy, and would use a less efficient one. Would it still be okay to call such a human a fitness maximizer? Is it about "trying, within your limits", or is it "doing the theoretically best thing"?

I suppose, if I talked to such a guy, and told him e.g. "hey, do you realize that donating at a sperm clinic would result in way more babies than just hooking up with someone every night and having unprotected sex?", if the guy would immediately react by "oh shit, no more sex anymore, I need to save all my sperm for donation" then I would see no objection to calling him a maximizer. His cognitive skills are weak, but his motivation is flawless.

(But I still stand by my original point, that humans are not even like this. The guys who supposedly maximize the number of their children would actually not be willing to give up sex forever, i
5Ben Pace6mo
Woop, take credit for changing your mind!

The surrogacy example originally struck me as very unrealistic cause I presumed it was mostly illegal (it is in Europe but apparently not in some States of the US) and heavily frowned upon here for ethical reasons (but possibly not in the US?). So my original reasoning was that you'd get in far more trouble for applying for many surrogates than for swapping out sperm at the sperm bank.

I guess if this is not the case then it might have been a fetish for those doctors? I'm slightly confused about the matter now what internal experience put them up to it if t...

Yes, good point. I was looking at those statistics for a bit. Poorer parents do indeed tend to maximize their number of offspring no matter the cost while richer parents do not. It might be that parents overestimate the IGF payoffs of quality, but then that just makes them bad/incorrect optimizers. It wouldn't make them less of an optimizer.

I think there are also some other subtle nuances going on. For instance, I'd consider myself fairly close to an IGF optimizer but I don't care about all genes/traits equally. There is a multigenerational "strain" I ide...

I think the notion that people are adaptation-executors, who like lots of things a little bit in context-relevant situations, predicts our world more than the model of fitness-maximizers, who would jump on this medical technology and aim to have 100,000s of children soon after it was built.

I think this skips the actual social trade-offs of the strategy you outline above:

  1. The likely backlash in society against any woman who tries this is very high. Any given rich woman would have to find surrogate women who are willing to accept the money and avoid bei...

My claim was purely that some people do actually optimize on this. It's just fairly hard, and their success also relies on how their abilities to game the system compare to how strong the system is. E.g. there was that fertility doctor that just used his own sperm all the time.

Yes, the story of the doctor was the inspiration for my comment. Compared to him, other "maximizers" clearly did not do enough. And as Gwern wrote, even the doctor could have done much better. (Also, I have no evidence here, but I wonder how much of what the doctor did was a strategy, and how much was just exploiting a random opportunity. Did he become a fertility doctor on purpose to do this, or did he just choose a random high-status job, and then noticed an opportunity? I suppose we will never know.)
I'm not sure which one you mean because there's a few examples of that, but he still has not maximized even for quite generous interpretations of 'maximize': none of those doctors so much as lobbied their fellow doctors to use him as their exclusive sperm donor, for example, nor offered to bribe them; none of the doctors I've read about appear to have spent any money at all attempting to get more offspring, much less to the extent of making any dent in their high doctor-SES standard of living (certainly no one went, 'oh, so that is what he had secretly devoted his life to maximizing, we were wondering'), much less paid for a dozen surrogacies with the few million net assets they'd accumulate over a lifetime. You can't excuse this as a typical human incompetence because it requires only money to cut a check, which they had.

Makes sense. I'm starting to suspect I overestimated the number of people who would take these deals, but I think there still would be more for the above than for the original thought experiments.

Here is my best attempt at working out my thoughts on this, but I noticed I reached some confusion at various points. I figured I'd post it anyway in case it either actually makes sense or people have thoughts they feel like sharing that might help my confusion.

Edit: The article is now deprecated. Thanks to everyone commenting here for helping me understand the different definitions of optimizer. I do suspect my misunderstanding of Nate's point might mirror why there is relatively common pushback against his claim? But maybe I'm typical minding.

They are a small minority currently cause the environment changes so quickly right now. Things have been changing insanely fast in the last century or so but before the industrial revolution and especially before the agricultural revolution, humans were much better optimized for IGF, I think. Evolution is still 'training' us and these last 100 years have been a huge change compared to the generation length of humans. Nate is stating that humans genetically are not IGF maximizers, and that is false. We are, we are just currently heavily being 'retrained'.

Re:...

I disagree humans don't optimize IGF:

  1. We seem to have different observational data. I do know some people who make all their major life decisions based on quality and quantity of offspring. Most of them are female but this might be a bias in my sample. Specifically, quality trades off against quantity: waiting to find a fitter partner and thus losing part of your reproductive window is a common trade off. Similarly, making sure your children have much better lives than you by making sure your own material circumstances (or health!) are better is another.
...
5Ben Pace6mo
  • Given the ability to medically remove, store, and artificially inseminate eggs, current technologies make it possible for a woman to produce many more children than the historical limit of ~50 (i.e. one every 9 months for a woman's entire reproductive years), and closer to the limit (note that each woman produces 100,000s of eggs).
  • I don't have a worked out plan, but I could see a woman removing most of her eggs, somehow causing many other women to use her eggs to have children (whether it's by finding infertile women, or paying people, or showing that the eggs would be healthier than others'), and having many more children than historically possible.
  • I suspect many women could have 50-100 children this way, and that peak women could have 10,000s of children this way, closer to the male model of reproduction.
  • I'd be interested to know the maximum number of children any woman has had in history, and also since the invention of this sort of medical technology.
  • I imagine that such a world would have a market (and class system) based around being able to get your eggs born. There are services where a different woman will have your children, but I think the maximizer world would look more like poor women primarily being paid to have children (and being pregnant >50% of their lives) and rich women primarily paying to have children (and having 1000s of children born).
  • I think the notion that people are adaptation-executors, who like lots of things a little bit in context-relevant situations, predicts our world more than the model of fitness-maximizers, who would jump on this medical technology and aim to have 100,000s of children soon after it was built.
  • I also suspect that population would skyrocket relative to the current numbers (e.g. be 10-1000x the current size). Perhaps efforts to colonize Mars would have been sustained during the 20th century, as this planet would have been more
2Thomas Kwa6mo
The reason why we're talking about humans and IGF is because there's an analogy to AGI. If we select on the AI to be corrigible (or whatever nice property) in subhuman domains, will it generalize out-of-distribution to be corrigible when superhuman and performing coherent optimization? Humans are not generalizing out of distribution. The average woman who wants to raise high quality children does not have the goal of maximizing IGF; she does not try to instill the value of maximizing IGF into them, nor use the far more effective strategies of donating eggs, trying to get around egg donation limits, or getting her male relatives to donate sperm. If the environment stabilizes, additional selection pressure might cause these people to become a majority. But we might not have additional selection pressure in the AGI case.
6Rob Bensinger6mo
Is this the best strategy for maximizing IGF? Do happier and wealthier kids have more offspring? Given that wealthier countries tend to have lower birth rates, I wonder if the IGF-maximizing strategy would instead often look like trying to have lots of poor children with few options? (I'll note as an aside that even if this is false, it should definitely be a thing many parents seriously consider doing and are strongly tempted by, if the parents are really maximizing IGF rather than maximizing proxies like "their kids' happiness". It would be very weird, for example, if an IGF maximizer reacted to this strategy with revulsion.) I'd be similarly curious if there are cases where making your kids less happy, less intelligent, less psychologically stable, etc. increased their expected offspring. This would test to what extent 'I want lots and lots and lots of kids' parents are maximizing IGF per se, versus maximizing some combination of 'have lots of descendants', 'make my descendants happy (even if this means having fewer of them)', etc.
3Shoshannah Tekofsky6mo
Here is my best attempt at working out my thoughts on this, but I noticed I reached some confusion at various points. I figured I'd post it anyway in case it either actually makes sense or people have thoughts they feel like sharing that might help my confusion. Edit: The article is now deprecated. Thanks to everyone commenting here for helping me understand the different definitions of optimizer. I do suspect my misunderstanding of Nate's point might mirror why there is relatively common pushback against his claim? But maybe I'm typical minding.
In the long term, we would expect humans to end up directly optimizing IGF (assuming no revolutions like AI doom or similar) due to evolution. The way this proceeds in practice is that people vary on the extent to which they optimize IGF vs other things, and those who optimize IGF pass on their genes, leading to higher optimization of IGF. So yes eventually these sorts of people will win, but as you admit yourself they are a small minority, so humans as they currently exist are mostly not IGF maximizers. Also, regarding quality vs quantity, it's my impression that society massively overinvests in quality relative to what would be implied by IGF. Society is incredibly safe compared to the past, so you don't need much effort to make them survive. Insofar as there is an IGF value in quality, it's probably in somehow convincing your children to also optimize for IGF, rather than do other things.

Thank you for the comment!

Possibly such a proof exists. With more assumptions, you can get better information on human values, see here. This obviously doesn't solve all concerns.

Those are great references! I'm going to add them to my reading list, thank you.

Only a few people think about this a lot -- I currently can only think of the Center on Long-Term Risk on the intersection of suffering focus and AI Safety. Given how bad suffering is, I'm glad that there are people thinking about it, and do not think that a simple inefficiency argument is enough.

I'd h...

1Leon Lang6mo
I think I basically agree (though maybe not with as much confidence as you), but I think that doesn't mean that huge amounts of suffering will not dominate the future. For example, if there will be not one but many superintelligent AI systems determining the future, this might create suffering due to cooperation failures.

What distinguishes capabilities and intelligence to your mind, and what grounds that distinction? I think I'd have to understand that to begin to formulate an answer.

I've unfortunately been quite distracted, but better a late reply than no reply. With capabilities I mean how well a system accomplishes different tasks. This is potentially high dimensional (there can be many tasks that two systems are not equally good at). Also it can be more and less general (optical character recognition is very narrow because it can only be used for one thing, generating / predicting text is quite general). Also, systems without agency can have strong and general capabilities (a system might generate text or images without being agentic). This is quite different from the definition by Legg and Hutter, which is more specific to agents. However, since last week I have updated on strongly and generally capable non-agentic systems being less likely to actually be built (especially before agentic systems). In consequence, the difference between my notion of capabilities and a more agent related notion of intelligence is less important than I thought.

Great job writing up your thoughts, insights, and model!

My mind is mainly attracted to the distinction you make between capabilities and agency. In my own model, agency is a necessary part of increasing capabilities, and will per definition emerge in superhuman intelligence. I think the same conclusion follows from the definitions you use as follows:

You define "capabilities" by the Legg and Hutter definition you linked to, which reads:

 Intelligence measures an agent's ability to achieve goals in a wide range of environments

You define "agency" as... (read more)

Thanks for your replies. I think our intuitions regarding intelligence and agency are quite different. I deliberately mostly stuck to the word ‘capabilities’, because in my intuition you can have systems with very strong and quite general capabilities that are not agentic. One very interesting point is what you wrote: “Presumably the problem happens somewhere between "the smartest animal we know" and "our intelligence", and once we are near that, recursive self-improvement will make the distinction moot”. Can you explain this position more? In my intuition, building and improving intelligent systems is far harder than that. I hope to come back later to your answer on information about the real world.

Yes, agreed. The technique is only aimed at the "soft" edge of this, where people might in reality even disagree if something is still in or outside the Overton Window. I do think a gradient-type model of controversiality is a more realistic model of how people are socially penalized than a binary model. The exercise is not aimed at sharing views that would lead to heavy social penalties indeed, and I don't think anyone would benefit from running it that way. It's a very relevant distinction you are raising.

Good question!

My thinking on this is slightly different than @omark's. Specifically:

  • Everyone commits to being vulnerable by sharing their own controversial statements. This symmetry is often not present in normal conversation, where you focus on one topic where one person might have a controversial opinion and the other does not.
  • It's much higher density on iterating through controversial opinions than a normal conversation would be.
  • It's a session you can sign up for where you can trust everyone is coming to the session with the same intention to grow and s
... (read more)
1M. Y. Zuo6mo
That’s interesting, though I don’t see how the commitment mechanism could work without some arbiter to decide if the follow-up statement is actually controversial. How do you envision disputes along the lines of not-actually-that-controversial will be resolved?

My intuition is that there is a gradient from controversial statements to this-will-cause-unrecoverable-social-status damage. I think I might have implicitly employed a 'softer' definition of Overton window as 'statements that make others or yourself uncomfortable to express/debate', where the 'harder' definition would be statements you can't socially recover from. I think intuitively I wouldn't presume anyone wants to share the latter and I don't see much benefit in doing so. But overall, my concept of the Overton window is much more a gradient than a binary, and this exercise aims to allow people to stretch through the (perceived) low range.

Moved the addendum into the comments, cause it seemed to mess up the navigation. This seems like a more elegant solution.


Addendum: Experiments

These are experiments we ran at an AIS-x-rationality meetup to explore novelty generation strategies. I've added a short review to each exercise description.

Session 1

Exercise 1: Inside View

  • Split in pairs
  • 5 minute timer
  • Instructions: explain your internal model of the AI Alignment problem. If someone is done talking, then remaining time can be filled with questions.
  • Switch

Review: This was great priming but h... (read more)


I dug through the comments too and someone referred to this article by Holden Karnofsky, but I don't actually agree with that for adults (kids, sure).

Yes, but that's not what I meant by my question. It's more like ... do we have a way of applying kinds of reward signals to AI, or can we only apply different amounts of reward signals? My impression is the latter, but humans seem to have the former. So what's the missing piece?

1Aprillion (Peter Hozák)4mo
hm, I gave it some time, but still confused .. can you name some types of reward that humans have?

I was thinking of the structure of Generative Adversarial Networks. Would that not apply in this case? It would involve 2 competing AGIs in the end though. I'm not sure if they'd just collaborate to set both their reward functions to max, or if that will never happen due to possible game theoretic considerations.

3Donald Hobson8mo
In a GAN, one network tries to distinguish real images from fake. The other network tries to produce fake images that fool the first net. Both of these are simple formal tasks. "exploits in the objective function" could be considered as "solutions that score highly that the programmers didn't really intend". The problem is that it's hard to formalize what the programmers really intended. Given an evolutionary search for walking robots, a round robot that tumbles over might be a clever unexpected solution, or reward hacking, depending on the goals of the developers. Are the robots intended to transport anything fragile? Anything that can't be spun and tossed upside down? Whether the tumblebot is a clever unexpected design or a reward hack depends on things that are implicit in the developers' minds, not part of the program at all. A lot of novice AI safety ideas look like "AI 1 has this simple specifiable reward function. AI 2 oversees AI 1. AI 2 does exactly what we want, however hard that is to specify and is powered by pure handwavium"

Thank you for your thoughtful reply!


Did you check out the list of specification gaming or the article? It's quite good! Most of the errors are less like missing rungs and more like exploitable mechanics.

I found that I couldn't follow through with making those sorts of infinite inescapable playgrounds for humans, I always want the game to lead out, to life, health and purpose...

But what would that be for AGI? If they escape the reward functions we want them to have, then they are very unlikely to develop a reward function that will be kind or tolerant... (read more)

2mako yass8mo
The reward function that you wrote out is, in a sense, never the one you want them to have, because you can't write out the entirety of human values. We want them to figure out human values to a greater level of detail than we understand them ourselves. There's a sense in which that (figuring out what we want and living up to it) could be the reward function in the training environment, in which case you kind of would want them to stick with it. Just being concerned with the broader world and its role in it, I guess. I realize this is a dangerous target to shoot for and we should probably build more passive assistant systems first (to help us to hit that target more reliably when we decide to go for it later on).

Thanks for doing this!

I was trying to work out how the alignment problem could be framed as a game design problem and I got stuck on this idea of rewards being of different 'types'. Like, when considering reward hacking, how would one hack the reward of reading a book or exploring a world in a video game? Is there such a thing as 'types' of reward in how reward functions are currently created? Or is it that I'm failing to introspect on reward types and they are essentially all the same pain/pleasure axis attached to different items?

That last explanation se... (read more)

You should distinguish between “reward signal” as in the information that the outer optimization process uses to update the weights of the AI, and “reward signal” as in observations that the AI gets from the environment that an inner optimizer within the AI might pay attention to and care about. From evolution’s perspective, your pain, pleasure, and other qualia are the second type of reward, while your inclusive genetic fitness is the first type. You can’t see your inclusive genetic fitness directly, though your observations of the environment can let you guess at it, and your qualia will only affect your inclusive genetic fitness indirectly by affecting what actions you take. To answer your question about using multiple types of reward: For the “outer optimization” type of reward, in modern ML the loss function used to train a network can have multiple components. For example, an update on an image-generating AI might say that the image it generated had too much blue in it, and didn’t look enough like a cat, and the discriminator network was able to tell it apart from a human generated image. Then the optimizer would generate a gradient descent step that improves the model on all those metrics simultaneously for that input. For “intrinsic motivation” type rewards, the AI could have any reaction whatsoever to any particular input, depending on what reactions were useful to the outer optimization process that produced it. But in order for an environmental reward signal to do anything, the AI has to already be able to react to it.
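To make the "loss function with multiple components" point concrete, here is a minimal sketch in plain Python. Everything in it is a hypothetical stand-in: the three penalty functions, their targets, the equal weights, and the three-number "parameter vector" are toy assumptions, not how a real image model or discriminator works. It only shows that a single gradient descent step on a weighted sum can improve several components at once.

```python
# Toy sketch: one loss made of several components, one gradient step.
# All names, targets and weights below are hypothetical illustrations.

def blue_penalty(params):
    # params[0] stands in for "amount of blue"; assumed target is 0.2
    return (params[0] - 0.2) ** 2

def cat_penalty(params):
    # params[1] stands in for "cat-likeness"; assumed target is 1.0
    return (params[1] - 1.0) ** 2

def realism_penalty(params):
    # params[2] stands in for how well the output fools a discriminator
    return (params[2] - 1.0) ** 2

# (weight, component) pairs; the weights are arbitrary choices
COMPONENTS = [(1.0, blue_penalty), (1.0, cat_penalty), (1.0, realism_penalty)]

def total_loss(params):
    # The "outer" training signal: a weighted sum of all components
    return sum(weight * component(params) for weight, component in COMPONENTS)

def numerical_gradient(loss, params, eps=1e-6):
    # Central finite differences, to keep the sketch dependency-free
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad

def gradient_step(params, lr=0.1):
    # One plain gradient descent update on the combined loss
    grad = numerical_gradient(total_loss, params)
    return [p - lr * g for p, g in zip(params, grad)]

params = [0.9, 0.1, 0.3]
before = total_loss(params)
after_params = gradient_step(params)
after = total_loss(after_params)
```

In this toy case each component depends on its own parameter, so one step lowers all three penalties. In a real network the components share weights and can pull against each other, which is where the choice of weights starts to matter.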
1Aprillion (Peter Hozák)8mo
Sounds like an AI would be searching for Pareto optimality to satisfy multiple (types of) objectives in such a case.
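For readers unfamiliar with the term, a small sketch of what Pareto optimality means for two objectives may help. The candidate scores are made up, and lower is assumed to be better on both objectives. Note this is only the definition, not a claim about what gradient descent on a weighted loss actually finds.

```python
# Toy sketch of Pareto optimality; the candidate scores are hypothetical.
# Each candidate is a tuple of losses (lower is better on every objective).

def dominates(a, b):
    # a dominates b if it is no worse everywhere and strictly better somewhere
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    # Keep the points that no other point dominates
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1), (0.6, 0.6), (0.9, 0.9)]
front = pareto_front(candidates)
```

Here `(0.6, 0.6)` and `(0.9, 0.9)` drop out because `(0.5, 0.5)` beats them on both objectives, while the remaining three are trade-offs that cannot be improved on one objective without getting worse on the other.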

I've recently started looking at AIS and I'm trying to figure out how I would like to contribute to the field. My sole motivation is that all timelines see either my kids or grandkids dying from AGI. I want them to die of old age after having lived a decent life.

I get the sense that motivation ranks as quite selfish, but it's a powerful one for me. If working on AIS is the one best thing I can do for their current and future wellbeing, then I'll do that.

>My sole motivation is that all timelines see either my kids or grandkids dying from AGI.

Would that all people were so selfish!
