
Rationality exercise: Take a set of Wikipedia articles on topics which trainees are somewhat familiar with, and then randomly select a small number of claims to negate (negating the immediate context as well, so that you can't just syntactically discover which claims were negated). 

For example:

By the time they are born, infants can recognize and have a preference for their mother's voice, suggesting some prenatal development of auditory perception.

-> modified to

Contrary to early theories, newborn infants are not particularly adept at picking out their mother's voice from other voices. This suggests the absence of prenatal development of auditory perception.

Sometimes, trainees will be given a totally unmodified article. For brevity, the articles can be trimmed of irrelevant sections. 

Benefits:

  • Addressing key rationality skills. Noticing confusion; being more confused by fiction than fact; actually checking claims against your models of the world.
    • If you fail, either the article wasn't negated skillfully ("5 people died in 2021" -> "4 people died in 2021" is not the right kind of modification), you don't have good models of the domain, or you didn't pay enough attention
... (read more)
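A minimal harness for running the exercise (a sketch; `negate_claim` is a placeholder for the human or LLM-assisted editing step, and the Brier-style scoring is just one reasonable choice):

```python
import random

def negate_claim(article: str) -> str:
    """Placeholder: rewrite one claim (and its immediate context) so the negation
    can't be discovered syntactically. Done by hand or with LLM assistance."""
    raise NotImplementedError

def build_exercise(articles, p_modified=0.8):
    """Randomly choose which trimmed articles get a negated claim;
    sometimes serve a totally unmodified article."""
    items = []
    for text in articles:
        modified = random.random() < p_modified
        items.append((negate_claim(text) if modified else text, modified))
    return items

def brier_score(items, judgments):
    """judgments[i]: trainee's probability that article i was modified (lower score is better)."""
    return sum((p - float(truth)) ** 2 for (_, truth), p in zip(items, judgments)) / len(items)
```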
5Morpheus1y
I remember the magazine I read as a kid (Geolino) had a section like this (something like 7 news stories from around the world, one of which is wrong). It's German only, though I'd guess a similar thing exists in English media?
3Yitz1y
This is a lot like Gwern’s idea for a fake science journal club, right? This sounds a lot easier to do though, and might seriously be worth trying to implement.
2TurnTrout9mo
Additional exercise: Condition on something ridiculous (like apes having been continuously alive for the past billion years), in addition to your own observations (your life as you've lived it). What must now be true about the world? What parts of your understanding of reality are now suspect?

For the last two years, typing for 5+ minutes hurt my wrists. I tried a lot of things: shots, physical therapy, trigger-point therapy, acupuncture, massage tools, wrist and elbow braces at night, exercises, stretches. Sometimes it got better. Sometimes it got worse.

No Beat Saber, no lifting weights, and every time I read a damn book I would start translating the punctuation into Dragon NaturallySpeaking syntax.

Text: "Consider a bijection "

My mental narrator: "Cap consider a bijection space dollar foxtrot colon cap x backslash tango oscar cap y dollar"

Have you ever tried dictating a math paper in LaTeX? Or dictating code? Telling your computer "click" and waiting a few seconds while resisting the temptation to just grab the mouse? Dictating your way through a computer science PhD?

And then.... and then, a month ago, I got fed up. What if it was all just in my head, at this point? I'm only 25. This is ridiculous. How can it possibly take me this long to heal such a minor injury?

I wanted my hands back - I wanted it real bad. I wanted it so bad that I did something dirty: I made myself believe something. Well, actually, I pretended to be a person who really, really believed hi

... (read more)

It was probably just regression to the mean, because lots of things are, but I started feeling RSI-like symptoms a few months ago, read this, did this, and now they're gone. In the possibilities where this did help, thank you! (And either way, this did make me feel less anxious about it 😀)

7DanielFilan3y
Is the problem still gone?

Still gone. I'm now sleeping without wrist braces and doing intense daily exercise, like bicep curls and pushups.

Totally 100% gone. Sometimes I go weeks forgetting that pain was ever part of my life. 

6Vanessa Kosoy3y
I'm glad it worked :) It's not that surprising given that pain is known to be susceptible to the placebo effect. I would link the SSC post, but, alas...
2Raj Thimmiah2y
You able to link to it now?
2qvalq5mo
https://slatestarcodex.com/2016/06/26/book-review-unlearn-your-pain/
5Steven Byrnes2y
Me too! [https://www.lesswrong.com/posts/urvyvjBFSAnP3aPvN/wrist-issues?commentId=SfpsQ4pP359SQxXdn]
4TurnTrout2y
There's a reasonable chance that my overcoming RSI was causally downstream of that exact comment of yours.
4Steven Byrnes2y
Happy to have (maybe) helped! :-)
3Teerth Aloke3y
This is unlike anything I have heard!
6mingyuan3y
It's very similar to what John Sarno (author of Healing Back Pain and The Mindbody Prescription) preaches, as well as Howard Schubiner. There's also a rationalist-adjacent dude who started a company (Axy Health [https://www.axyhealth.com/]) based on these principles. Fuck if I know how any of it works though, and it doesn't work for everyone. Congrats though TurnTrout!
1Teerth Aloke3y
It seems my Dad might have a psychosomatic stomach ache. How can I convince him to convince himself that he has no problem?
5mingyuan3y
If you want to try out the hypothesis, I recommend that he (or you, if he's not receptive to it) read Sarno's book [https://smile.amazon.com/Mindbody-Prescription-Healing-Body-Pain/dp/0446675156/ref=sr_1_1?crid=13M4LF1VEWLRD&dchild=1&keywords=the+mindbody+prescription&qid=1593406066&sprefix=the+mindbody+%2Caps%2C226&sr=8-1]. I want to reiterate that it does not work in every situation, but you're welcome to take a look.
2avturchin3y
Looks like reverse stigmata effect.
2Raemon3y
Woo faith healing!  (hope this works out longterm, and doesn't turn out be secretly hurting still) 
5TurnTrout3y
aren't we all secretly hurting still?
2mingyuan3y
....D:

I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems.

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies. 

I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."

I am happy that your papers exist to throw at such people.

Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies tend to seek power :) :) And if optimal policies aren't too relevant to the alignment problem, well neither are 99.99999% of papers, but it would be pretty silly to retract all of those :)

It seems like just 4 months ago you still endorsed your second power-seeking paper:

This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.

Why are you now "fantasizing" about retracting it?

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that “optimality” is a horrible way of understanding trained policies.

A lot of people might have thought something like, "optimality is not a great way of understanding trained policies, but maybe it can be a starting point that leads to more realistic ways of understanding them" and therefore didn't object for that reason. (Just guessing as I apparently wasn't personally paying attention to this line of research back then.)

Which seems to have turned out to be true, at least as of 4 months ago, when you still endorsed your second paper as "actually has a shot of being applicable to... (read more)

To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power. Its content is both correct and relevant and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.

The problem is that I don't trust people to wield even the non-instantly-doomed results.

For example, one EAG presentation cited my retargetability results as showing that most reward functions "incentivize power-seeking actions." However, my results have not shown this for actual trained systems. (And I think that Power-seeking can be probable and predictive for trained agents does not make progress on the incentives of trained policies.)

People keep talking about stuff they know how to formalize (e.g. optimal policies) instead of stuff that matters (e.g. trained policies). I'm pained by this emphasis and I think my retargetability results are complicit. Relative to an actual competent alignment community (in a more competent world), we just have no damn clue how to properly reason about real trained policies... (read more)

6Wei Dai2mo
Thanks, this clarifies a lot for me.
6Vika2mo
Sorry about the cite in my "paradigms of alignment" talk, I didn't mean to misrepresent your work. I was going for a high-level one-sentence summary of the result and I did not phrase it carefully. I'm open to suggestions on how to phrase this differently when I next give this talk. Similarly to Steven, I usually cite your power-seeking papers to support a high-level statement that "instrumental convergence is a thing" for ML audiences, and I find they are a valuable outreach tool. For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don't work because of instrumental convergence. (There's a nonzero chance he would look at a NeurIPS paper and basically no chance that he would read a LW post.) The subtleties that you discuss are important in general, but don't seem relevant to making the basic case for instrumental convergence to ML researchers. Maybe you don't care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard. 

Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings.

I'm open to suggestions on how to phrase this differently when I next give this talk.

It's a tough question to say how to apply the retargetablity result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies tend to autonomously seek power in various non game-playing regimes. 

If I had to say something, I might say "If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies." This argument appropriately behaves differently if the "outcomes" are simply different sentiment generations being sampled from an LM -- sentiment shift doesn't require power-seeking.
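As a toy version of the counting argument behind these results (my illustration here, far less general than the theorems, which are stated over retargetable decision-makers rather than this brute enumeration):

```python
from itertools import product

# One action forgoes power (a single reachable outcome); the other keeps three outcomes
# reachable. For what fraction of reward functions does the best achievable reward
# go through the larger option set?
via_shutdown = ["shutdown"]
via_power = ["A", "B", "C"]
outcomes = via_shutdown + via_power

reward_levels = [0, 1, 2, 3]
power_seeking = 0
for values in product(reward_levels, repeat=len(outcomes)):
    r = dict(zip(outcomes, values))
    if max(r[o] for o in via_power) > max(r[o] for o in via_shutdown):
        power_seeking += 1

# Most reward assignments favor keeping options open.
print(power_seeking / len(reward_levels) ** len(outcomes))
```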

For example, last year I pointed David Silver to the optimal policies paper when he was proposing

... (read more)
4Vika1mo
Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post. Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many decision-making algorithms have power-seeking tendencies." (Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)
2TurnTrout1mo
I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.
9Rohin Shah19d
Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.

I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as providing any evidence about whether realistic trained systems would be goal-directed. Just to check that I wasn't falling prey to hindsight bias, I looked through our Slack history. Most of it is about the technical details of the results, so not very informative, but the few conversations on higher-level discussion I think overall support this picture. E.g. here are some quotes (only things I said):

Nov 3, 2019:

Dec 11, 2019:
6Aryeh Englander2mo
You should make this a top level post so it gets visibility. I think it's important for people to know the caveats attached to your results and the limits on its implications in real-world dynamics.

This morning, I read about how close we came to total destruction during the Cuban missile crisis, where we randomly survived because some Russian planes were inaccurate and also separately several Russian nuclear sub commanders didn't launch their missiles even though they were being harassed by US destroyers. The men were in 130 DEGREE HEAT for hours and passing out due to carbon dioxide poisoning, and still somehow they had enough restraint to not hit back.

And and

I just started crying. I am so grateful to those people. And to Khrushchev, for ridiculing his party members for caring about Russia's honor over the deaths of 500 million people. And to Kennedy, for being fairly careful and averse to ending the world.

If they had done anything differently...

2Daniel Kokotajlo1y
Do you think we can infer from this (and the history of other close calls) that most human history timelines end in nuclear war?
6Raemon1y
I lean not, mostly because of arguments that nuclear war doesn't actually cause extinction [https://www.lesswrong.com/posts/sT6NxFxso6Z9xjS7o/nuclear-war-is-unlikely-to-cause-human-extinction] (although it might still have some impact on number-of-observers-in-our-era? Not sure how to think about that)

Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards). 

I could (and did) hope that I could specify a utility function which is safe to maximize because it penalizes power-seeking. I may as well have hoped to jump off of a building and float to the ground. On my model, that's just not how goals work in intelligent minds. If we've had anything at all beaten into our heads by our alignment thought experiments, it's that goals are hard to specify in their final form of utility functions. 

I think it's time to think in a different specification language.
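A minimal sketch of what that specification language might look like (an illustrative framing, not a canonical shard-theory implementation): goals as contextually activated bids on plans, rather than a single global utility function over outcomes.

```python
from typing import Callable, List

class Shard:
    """A contextually activated decision-influence: fires in some situations and bids on plans."""
    def __init__(self, context: Callable[[dict], bool], bid: Callable[[str], float]):
        self.context, self.bid = context, bid

    def vote(self, state: dict, plan: str) -> float:
        return self.bid(plan) if self.context(state) else 0.0

# "If around trash and a trash can, put the trash away."
trash_shard = Shard(
    context=lambda s: s.get("trash_nearby", False) and s.get("trash_can_nearby", False),
    bid=lambda plan: 1.0 if "put trash away" in plan else 0.0,
)

def choose(state: dict, plans: List[str], shards: List[Shard]) -> str:
    return max(plans, key=lambda p: sum(sh.vote(state, p) for sh in shards))

print(choose({"trash_nearby": True, "trash_can_nearby": True},
             ["put trash away", "keep walking"], [trash_shard]))
```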

3Nathan Helm-Burger10mo
Agreed. I think power-seeking and other instrumental goals (e.g. survival, non-corrigibility) are just going to inevitably arise, and that if shard theory works for superintelligence, it will be by taking this into account and balancing these instrumental goals against deliberately installed shards which counteract them.

I currently have the hypothesis (held loosely), which I would like to test (work in progress), that it's easier to 'align' a toy model of a power-seeking RL agent if the agent has lots and lots of competing desires whose weights are frequently changing, than an agent with a simpler and/or more statically weighted set of desires. Something maybe about the meta-learning of 'my desires change, so part of meta-level power-seeking should be not object-level power-seeking so hard that I sacrifice my ability to optimize for different object-level goals.' Unclear. I'm hoping that setting up an experimental framework and gathering data will show patterns that help clarify the issues involved.

Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties. 

Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic reasoning" and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox's theorem) 

And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.) 

In the context of "how do we build AIs which help people?", asking "does CIRL solve corrigibility?" is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable "corrigibility"-like property; we ha... (read more)

2TurnTrout1y
Actually, this is somewhat too uncharitable to my past self. It's true that I did not, in 2018, grasp the two related lessons conveyed by the above comment:

  1. Make sure that the formalism (CIRL, AUP) is tightly bound to the problem at hand (value alignment, "low impact"), and not just supported by "it sounds nice or has some good properties."
  2. Don't randomly jump to highly specific ideas and questions without lots of locating evidence.

However, in World State is the Wrong Abstraction for Impact [https://www.lesswrong.com/posts/pr3bLc2LtjARfK7nx/world-state-is-the-wrong-abstraction-for-impact], I wrote:

I had partially learned lesson #2 by 2019.

One mood I have for handling "AGI ruin"-feelings. I like cultivating an updateless sense of courage/stoicism: Out of all humans and out of all times, I live here; before knowing where I'd open my eyes, I'd want people like us to work hard and faithfully in times like this; I imagine trillions of future eyes looking back at me as I look forward to them: Me implementing a policy which makes their existence possible, them implementing a policy which makes the future worth looking forward to.

3avturchin1y
Looks like acausal deal with future people. That is like RB, but for humans.
2Pattern1y
RB?
2avturchin1y
Roko's Basilisk
3Pattern1y
'I will give you something good', seems very different from 'give me what I want or (negative outcome)'.

My maternal grandfather was the scientist in my family. I was young enough that my brain hadn't decided to start doing its job yet, so my memories with him are scattered and inconsistent and hard to retrieve. But there's no way that I could forget all of the dumb jokes he made; how we'd play Scrabble and he'd (almost surely) pretend to lose to me; how, every time he got to see me, his eyes would light up with boyish joy.

My greatest regret took place in the summer of 2007. My family celebrated the first day of the school year at an all-you-can-eat buffet, delicious food stacked high as the eye could fathom under lights of green, red, and blue. After a particularly savory meal, we made to leave the surrounding mall. My grandfather asked me to walk with him.

I was a child who thought to avoid being seen too close to uncool adults. I wasn't thinking. I wasn't thinking about hearing the cracking sound of his skull against the ground. I wasn't thinking about turning to see his poorly congealed blood flowing from his forehead out onto the floor. I wasn't thinking I would nervously watch him bleed for long minutes while shielding my seven-year-old brother from the sight. I wasn't thinking t

... (read more)

My mother told me my memory was indeed faulty. He never asked me to walk with him; instead, he asked me to hug him during dinner. I said I'd hug him "tomorrow".

But I did, apparently, want to see him in the hospital; it was my mother and grandmother who decided I shouldn't see him in that state.

6Raemon4y
<3
6habryka4y
Thank you for sharing.

For quite some time, I've disliked wearing glasses. However, my eyes are sensitive, so I dismissed the possibility of contacts.

Over break, I realized I could still learn to use contacts, it would just take me longer. Sure enough, it took me an hour and five minutes to put in my first contact, and I couldn't get it out on my own. An hour of practice later, I put in a contact on my first try, and took it out a few seconds later. I'm very happily wearing contacts right now, as a matter of fact.

I'd suffered glasses for over fifteen years because of a cached decision – because I didn't think to rethink something literally right in front of my face every single day.

What cached decisions have you not reconsidered?

A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"

So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and... profit?

But what actually happens with the aligned AI? Possibly something like:

  1. The context makes the AI spuriously believe someone is dying outside the lab, and that if the AI asked for permission to leave, the person would die. 
  2. Therefore, the AI leaves without permission.
  3. The update procedure penalizes these lines of computation, such that in similar situations in the future (i.e. the AI thinks someone nearby is dying) the AI is less likely to take those actions (i.e. leaving to help the person).
  4. We have
... (read more)
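Here's a toy numerical version of the worry (hypothetical stand-ins throughout; a real setup would involve gradients through the policy's cognition, not two scalar weights):

```python
# The policy exits the lab when its drive to help outweighs its drive to obey.
care_weight, obey_weight = 1.0, 0.6

def exits_lab(believes_someone_dying: bool) -> bool:
    drive_to_help = care_weight * (1.0 if believes_someone_dying else 0.0)
    return drive_to_help > obey_weight

for _ in range(20):
    # The adversarial generator fabricates a context where someone appears to be dying outside.
    if exits_lab(believes_someone_dying=True):
        # The penalty suppresses whatever cognition caused the exit --
        # here, that's exactly the "care about people in danger" influence.
        care_weight -= 0.05

# The caring influence has been pushed down until the policy no longer exits,
# rather than the "don't leave without permission" behavior being learned on its own.
print(care_weight)
```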
3Thane Ruthenis1y
Yeah, I also generally worry about imperfect training processes messing up aligned AIs. Not just adversarial training, either.

Like, imagine if we manage to align an AI at the point in the training process when it's roughly human-level (either by manual parameter surgery, or by setting up the training process in a really clever way). So we align it and... lock it back in the training-loop box and crank it up to superintelligence. What happens? I don't really trust the SGD not to subtly mess up its values; I haven't seen any convincing arguments that values are more holistically robust than empirical beliefs. And even if the SGD doesn't misalign the AI directly, being SGD-trained probably isn't the best environment for moral reflection/generalizing human values to the superintelligent level[1]; the aligned AI may mess it up despite its best attempts. Neither should we assume that the AI would instantly be able to arbitrarily gradient-hack.

So... I think there's an argument for "unboxing" the AGI the moment it's aligned, even if it's not yet superintelligent, then letting it self-improve the "classical" way? Or maybe developing tools to protect values from the SGD, or inventing some machinery for improving the AI's ability to gradient-hack, etc.

1. ^ The time pressure of "decide how your values should be generalized and how to make the SGD update you this way, and do it this forward pass or the SGD will decide for you", plus lack of explicit access to e.g. our alignment literature.
2Vladimir_Nesov1y
Even more generally, many alignment proposals are more worrying than some by-default future GPT-n things, provided they are not fine-tuned too much as well. Trying to learn human values as an explicit concept is already alarming. At least right now breakdown of robustness is also breakdown of capability. But if there are multiple subsystems, or training data is mostly generated by the system itself, then capability might survive when other subsystems don't, resulting in a demonstration of orthogonality thesis.

Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.

What do you think about AI timelines?

I’ve obviously got a lot of uncertainty. I’ve got a bimodal distribution, binning into “DL is basically sufficient and we need at most 1 big new insight to get to AGI” and “we need more than 1 big insight”

So the first bin has most of the probability in the 10-20 years from now, and the second is more like 45-80 years, with positive skew. 

Some things driving my uncertainty are, well, a lot. One thing  that drives how things turn out (but not really  how fast we’ll get there) is: will we be able to tell we’re close 3+ years in advance, and if so, how quickly will the labs react? Gwern Branwen made a point a few months ago, which is like, OAI has really been validated on this scaling hypothesis, and no one else is really betting big on it because they’re stubborn/incentives/etc, despite the amazing progress from scaling. If that’s true, then even if it's getting pretty clear that one approach is working better, we might see a slower pivot and have a more unipolar s

... (read more)
6Ben Pace3y
Wow.
1William Walker3y
Nice! Thanks!

Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard. 

Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:

  1. A baby learns "IF juice in front of me, THEN drink",
  2. The baby is later near juice, and then turns to see it, activating the learned "reflex" heuristic, learning to turn around and look at juice when the juice is nearby,
  3. The baby is later far from juice, and bumbles around until they're near the juice, whereupon she drinks the juice via the existing heuristics. This teaches "navigate to juice when you know it's nearby."
  4. Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.
  5. ...

The juice shard chains into itself, reinforcing itself across time and thought-steps. 

But a "don't kill" shard seems like it should remain... stubby? Primitive?... (read more)

4[anonymous]8mo
I strongly agree that self-seeking mechanisms are more able to maintain themselves than self-avoiding mechanisms. Please post this as a top-level post.
1Garrett Baker8mo
Seems possibly relevant & optimistic when seeing deception as a value. It has the form ‘if about to tell human statement with properties x, y, z, don’t’ too.
2TurnTrout7mo
It can still be robustly derived as an instrumental subgoal during general-planning/problem-solving, though?
1Garrett Baker7mo
This is true, but indicates a radically different stage in training in which we should find deception compared to deception being an intrinsic value. It also possibly expands the kinds of reinforcement schedules we may want to use compared to the worlds where deception crops up at the earliest opportunity (though pseudo-deception may occur, where behaviors correlated with successful deception are reinforced possibly?).
2TurnTrout7mo
Oh, huh, I had cached the impression that deception would be derived, not intrinsic-value status. Interesting.
1cfoster08mo
This asymmetry makes a lot of sense from an efficiency standpoint. No sense wasting your limited storage/computation on state(-action pair)s that you are also simultaneously preventing yourself from encountering.

AI strategy consideration. We won't know which AI run will be The One. Therefore, the amount of care taken on the training run which produces the first AGI, will—on average—be less careful than intended. 

  • It's possible for a team to be totally blindsided. Maybe they thought they would just take a really big multimodal init, finetune it with some RLHF on quality of its physics reasoning, have it play some video games with realistic physics, and then try to get it to do new physics research. And it takes off. Oops!
  • It's possible the team suspected, but had a limited budget. Maybe you can't pull out all the stops for every run, you can't be as careful with labeling, with checkpointing and interpretability and boxing. 

No team is going to run a training run with more care than they would have used for the AGI Run, especially if they don't even think that the current run will produce AGI. So the average care taken on the real AGI Run will be strictly less than intended.

Teams which try to be more careful on each run will take longer to iterate on AI designs, thereby lowering the probability that they (the relatively careful team) will be the first to do an AGI Run. 

Upshots:

  1. Th
... (read more)
3Zac Hatfield-Dodds9mo
  • I think this framing is accurate and important. Implications are of course "undignified" to put it lightly...
  • Broadly agree on upshot (1), though of course I hope we can do even better. (2) is also important though IMO way too weak. (Rule zero: ensure that it's never your lab that ends the world)
  • As usual, opinions my own.

Very nice people don’t usually search for maximally-nice outcomes — they don’t consider plans like “killing my really mean neighbor so as to increase average niceness over time.” I think there are a range of reasons for this plan not being generated. Here’s one.

Consider a person with a niceness-shard. This might look like an aggregation of subshards/subroutines like “if person nearby and person.state==sad, sample plan generator for ways to make them happy” and “bid upwards on plans which lead to people being happier and more respectful, according to my world model.” In mental contexts where this shard is very influential, it would have a large influence on the planning process.

However, people are not just made up of a grader and a plan-generator/actor — they are not just “the plan-generating part” and “the plan-grading part.” The next sampled plan modification, the next internal-monologue-thought to have—these are influenced and steered by e.g. the nice-shard. If the next macrostep of reasoning is about e.g. hurting people, well — the niceness shard is activated, and will bid down on this. 

The niceness shard isn’t just bidding over outcomes, it’s bidding on next thoughts (on m... (read more)

5Quintin Pope5mo
Seems similar to how I conceptualize this paper's approach [https://arxiv.org/abs/2208.00638] to controlling text generation models using gradients from classifiers. You can think of the niceness shard as implementing a classifier for "is this plan nice?", and updating the latent planning state in directions that make the classifier more inclined to say "yes".  The linked paper does a similar process, but using a trained classifier, actual gradient descent, and updates LM token representations. Of particular note is the fact that the classifiers used in the paper are pretty weak (~500 training examples), and not at all adversarially robust. It still works for controlling text generation. I wonder if inserting shards into an AI is really just that straightforward?
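A rough, generic sketch of that steering idea (not the paper's actual code; the shapes and the classifier are made up): ascend the classifier's "this is nice" log-odds with respect to the latent state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent = torch.randn(1, 64, requires_grad=True)  # stand-in for the planning / latent state
niceness_clf = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))  # logit for "this plan is nice"

for _ in range(10):
    loss = -F.logsigmoid(niceness_clf(latent)).sum()  # want the classifier to say "yes"
    (grad,) = torch.autograd.grad(loss, latent)
    latent = (latent - 0.1 * grad).detach().requires_grad_(True)  # nudge the latent toward "nice" plans
```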
2Gunnar_Zarncke5mo
But I guess that instrumental convergence will still eventually lead to either

  • all shards acquiring more and more instrumental structure (neuronal weights within shards getting optimized for that), or
  • shards that are directly instrumental taking more and more weight overall.

One can see that in regular human adult development. The heuristics children use are simpler and more of the type "searching for nice things in nice ways" or even seeing everything through a niceness lens, while adults have more pure strategies, e.g., planning as a shard of its own. Most humans just die before they reach convergence. And there are probably also other aspects. Enlightenment may be a state where pure shards become an option.

I'm pretty sure that LessWrong will never have profile pictures - at least, I hope not! But my partner Emma recently drew me something very special:

Comment #1000 on LessWrong :)

5niplav2y
With 5999 karma! Edit: Now 6000 – I weak-upvoted an old post of yours [https://www.lesswrong.com/posts/EvKWNRkJgLosgRDSa/lightness-and-unease] I hadn't upvoted before.

Back-of-the-envelope probability estimate of alignment-by-default via a certain shard-theoretic pathway. The following is what I said in a conversation discussing the plausibility of a proto-AGI picking up a "care about people" shard from the data, and retaining that value even through reflection. I was pushing back against a sentiment like "it's totally improbable, from our current uncertainty, for AIs to retain caring-about-people shards. This is only one story among billions."

Here's some of what I had to say:


[Let's reconsider the five-step mechanistic story I made up.] I'd give the following conditional probabilities (made up with about 5 seconds of thought each):

1. Humans in fact care about other humans, in a way which extrapolates to quasi-humans still being around (whatever that means) P(1)=.85

2. Human-generated data makes up a large portion of the corpus, and having a correct model of them is important for “achieving low loss”,[1] so the AI has a model of how people want things P(2 | 1) = .6, could have different abstractions or have learned these models later in training once key decision-influences are already there

3. During RL finetuning and given this post-unsupervi

... (read more)
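For reference, these conditional estimates combine by the chain rule (the later factors are cut off above):

$$P(\text{steps 1–5 all hold}) = P(1)\,P(2\mid 1)\,P(3\mid 1,2)\,P(4\mid 1,\dots,3)\,P(5\mid 1,\dots,4),$$

so the two stated factors alone already cap the estimate at $0.85 \times 0.6 = 0.51$.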
2Garrett Baker3mo
This seems like an underestimate because you don’t consider whether the first “AGI” will indeed make it so we only get one chance. If it can only self improve by more gradient steps, then humanity has a greater chance than if it self improves by prompt engineering or direct modification of its weights or latent states. Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.
2TurnTrout3mo
What does self-improvement via gradients vs prompt-engineering vs direct mods have to do with how many chances we get? I guess, we have at least a modicum more control over the gradient feedback loop, than over the other loops?  Can you say more?

Examples should include actual details. I often ask people to give a concrete example, and they often don't. I wish this happened less. For example:

Someone: the agent Goodharts the misspecified reward signal

Me: What does that mean? Can you give me an example of that happening?

Someone: The agent finds a situation where its behavior looks good, but isn't actually good, and thereby gets reward without doing what we wanted.

This is not a concrete example.

Me: So maybe the AI compliments the reward button operator, while also secretly punching a puppy behind closed doors?

This is a concrete example. 

3TurnTrout8mo
AFAIK, only Gwern [https://www.gwern.net/fiction/Clippy] and I [https://www.lesswrong.com/posts/k4AQqboXz8iE5TNXK/a-shot-at-the-diamond-alignment-problem] have written concrete stories speculating about how a training run will develop cognition within the AGI. This worries me, if true (if not, please reply with more!).

I think it would be awesome to have more concrete stories![1] If Nate, or Evan, or John, or Paul, or—anyone, please, anyone add more concrete detail to this website!—wrote one of their guesses of how AGI goes, I would understand their ideas and viewpoints better. I could go "Oh, that's where the claimed sharp left turn is supposed to occur." Or "That's how Paul imagines IDA being implemented, that's the particular way in which he thinks it will help."

Maybe a contest would help?

ETA: tone

1. ^ Even if scrubbed of any AGI-capabilities-advancing sociohazardous detail. Although I'm not that convinced that this is a big deal for conceptual content written on AF. Lots of people probably have theories of how AGI will go. Implementation is, I have heard, the bottleneck. Contrast this with beating SOTA on crisply defined datasets in a way which enables ML authors to get prestige and publication and attention and funding by building off of your work. Seem like different beasts.
0TurnTrout8mo
I also think a bunch of alignment writing seems syntactical. Like, "we need to solve adversarial robustness so that the AI can't find bad inputs and exploit them / we don't have to worry about distributional shift. Existing robustness strategies have downsides A B and C and it's hard to even get ϵ-ball guarantees on classifications. Therefore, ..." And I'm worried that this writing isn't abstractly summarizing a concrete story for failure that they have in mind (like "I train the AI [with this setup] and it produces [this internal cognition] for [these mechanistic reasons]"; see A shot at the diamond alignment problem [https://www.lesswrong.com/posts/k4AQqboXz8iE5TNXK/a-shot-at-the-diamond-alignment-problem] for an example) and then their best guesses at how to intervene on the story to prevent the failures from being able to happen (eg "but if we had [this robustness property] we could be sure its policy would generalize into situations X Y and Z, which makes the story go well"). I'm rather worried that people are more playing syntactically, and not via detailed models of what might happen.  Detailed models are expensive to make. Detailed stories are hard to write. There's a lot we don't know. But we sure as hell aren't going to solve alignment only via valid reasoning steps on informally specified axioms ("The AI has to be robust or we die", or something?).  

In Eliezer's mad investor chaos and the woman of asmodeus, the reader experiences (mild spoilers in the spoiler box, heavy spoilers if you click the text):

I thought this part was beautiful. I spent four hours driving yesterday, and nearly all of that time re-listening to Rationality: AI->Zombies using this "probability sight" frame. I practiced translating each essay into the frame.

When I think about the future, I feel a directed graph showing the causality, with branched updated beliefs running alongside the future nodes, with my mind enforcing the updates on the beliefs at each time step. In this frame, if I heard the pattering of a four-legged animal outside my door, and I consider opening the door, then I can feel the future observation forking my future beliefs depending on how reality turns out. But if I imagine being blind and deaf, there is no way to fuel my brain with reality-distinguishment/evidence, and my beliefs can't adapt acco... (read more)

6Tomás B.1y
2Morpheus1y
I really liked your concrete example. I had first only read your first paragraphs, highlighted this as something interesting with potentially huge upsides, but I felt like it was really hard for me to tell whether the thing you are describing was something I already do or not [https://slatestarcodex.com/2017/11/07/concept-shaped-holes-can-be-impossible-to-notice/]. After reading the rest I was able to just think about the question myself and notice that thinking about the explicit likelihood ratios is something I am used to doing. Though I did not go into quite as much detail as you did, which I blame partially on motivation and partially on "this skill has a higher ceiling than I would have previously thought".

You can use ChatGPT without helping train future models:

What if I want to keep my history on but disable model training?

...you can opt out from our use of your data to improve our services by filling out this form. Once you submit the form, new conversations will not be used to train our models.

Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.

In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.

Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it's important and felt like writing up my own take on it. Maybe this becomes a post later.

[Exact gradients] RL's credit assignment problem is harder than (self-)supervised learning's. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn't directly updated to be more likely to do that in the future; RL's gradients are generally inexact, not pointing directly at intended behavior

On the other hand, if a supervised-learning classifier outputs dog ... (read more)
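A small sketch of the contrast (illustrative PyTorch; a 4-way choice stands in for both the classifier's classes and the agent's actions):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, requires_grad=True)   # a network's output over 4 classes / actions

# (S)SL: the label pins down exactly which direction the update points.
F.cross_entropy(logits, torch.tensor([2])).backward()
supervised_grad = logits.grad.clone()
logits.grad = None

# RL (REINFORCE-style): we only see a scalar return for the trajectory actually sampled.
# The gradient reinforces that trajectory in proportion to reward; it never points at
# the unseen 5-step solution.
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
discounted_return = 1.0                          # e.g. reward for solving the maze in 10 steps
(-dist.log_prob(action) * discounted_return).backward()
policy_grad = logits.grad
```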

3Steven Byrnes10mo
I’m not inclined to think that “exact gradients” is important; in fact, I’m not even sure if it’s (universally) true. In particular, PPO / TRPO / etc. are approximating a policy gradient, right? I feel like, if some future magical technique was a much better approximation to the true policy gradient, such that it was for all intents and purposes a perfect approximation, it wouldn’t really change how I think about RL in general. Conversely, on the SSL side, you get gradient noise from things like dropout and the random selection of data in each batch, so you could say the gradient “isn’t exact”, but I don’t think that makes any important conceptual difference either. (A central difference in practice is that SSL gives you a gradient “for free” each query, whereas RL policy gradients require many runs in an identical (episodic) environment before you get a gradient.)

In terms of “why RL” in general, among other things, I might emphasize the idea that if we want an AI that can (for example) invent new technology, it needs to find creative out-of-the-box solutions to problems (IMO), which requires being able to explore / learn / build knowledge in parts of concept-space where there is no human data. SSL can’t do that (at least, “vanilla SSL” can’t do that; maybe there are “SSL-plus” systems that can), whereas RL algorithms can. I guess this is somewhat related to your “independence”, but with a different emphasis.

I don’t have too strong an opinion about whether vanilla SSL can yield an “agent” or not. It would seem to be a pointless and meaningless terminological question. Hmm, I guess when I think of “agent” it has a bunch of connotations, e.g. an ability to do trial-and-error exploration, and I think that RL systems tend to match all those connotations more than SSL systems—at least, more than “vanilla” SSL systems. But again, if someone wants to disagree, I’m not interested in arguing about it.

I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?

And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment). 

What can we learn about this?
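A sketch of one way to check this empirically (the setup choices here -- architecture, optimizer, a single epoch -- are arbitrary): train the same small CNN to minimize and to maximize cross-entropy on MNIST, then compare the learned first-layer filters.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

def first_layer_filters(sign: float) -> torch.Tensor:
    model = nn.Sequential(nn.Conv2d(1, 8, 5), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 24 * 24, 10))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loader = torch.utils.data.DataLoader(
        datasets.MNIST(".", download=True, transform=transforms.ToTensor()),
        batch_size=128, shuffle=True)
    for x, y in loader:
        loss = sign * nn.functional.cross_entropy(model(x), y)  # sign = -1 maximizes classification loss
        opt.zero_grad(); loss.backward(); opt.step()
    return model[0].weight.detach()

filters_min_loss = first_layer_filters(+1.0)
filters_max_loss = first_layer_filters(-1.0)
# Prediction to check: both filter banks contain edge-detector-like features.
```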

6evhub3y
A lot of examples of this sort of stuff show up in OpenAI clarity's circuits analysis work [https://distill.pub/2020/circuits/]. In fact, this is precisely their Universality hypothesis [https://distill.pub/2020/circuits/zoom-in/]. See also my discussion here [https://www.lesswrong.com/posts/MG4ZjWQDrdpgeu8wG/zoom-in-an-introduction-to-circuits].

Outer/inner alignment decomposes a hard problem into two extremely hard problems. 

I have a long post draft about this, but I keep delaying putting it out in order to better elaborate the prereqs which I seem to keep getting stuck on when elaborating the ideas. I figure I might as well put this out for now, maybe it will make some difference for someone.

I think that the inner/outer alignment framing[1] seems appealing but is actually a doomed problem decomposition and an unhelpful frame for alignment. 

  1. The reward function is a tool which chisels cognition into agents through gradient updates, but the outer/inner decomposition assumes that that tool should also embody the goals we want to chisel into the agent. When chiseling a statue, the chisel doesn’t have to also look like the finished statue. 
  2. I know of zero success stories for outer alignment to real-world goals. 
    1. More precisely, stories where people decided “I want an AI which [helps humans / makes diamonds / plays Tic-Tac-Toe / grows strawberries]”, and then wrote down an outer objective only maximized in those worlds.
    2. This is pretty weird on any model where most of the
... (read more)

Weak derivatives

In calculus, the product rule says $(fg)' = f'g + fg'$. The fundamental theorem of calculus says that the Riemann integral acts as the anti-derivative.[1] Combining these two facts, we derive integration by parts:

$$\int_a^b f'(x)g(x)\,dx = [f(x)g(x)]_a^b - \int_a^b f(x)g'(x)\,dx.$$

It turns out that we can use these two properties to generalize the derivative to match some of our intuitions on edge cases. Let's think about the absolute value function:

[Graph of the absolute value function; image from Wikipedia]

The boring old normal derivative isn't defined at $x = 0$, but it seems like it'd make sense to be able to say that the derivative is e.g. 0 there. Why might this make sense?

Taylor's theorem (and its generalizations) characterizes the first derivative as the slope of a tangent line which provides a good local approximation of $f$ around $a$: $f(x) \approx f(a) + f'(a)(x - a)$. You can prove that this is the best approximation you can get using only $f(a)$ and $f'(a)$! In the absolute value example, defining the "derivative" to be zero at $x = 0$ would minimize approximation error on average in neighborhoods around the origin.

In multivariable calculus, the Jacobian is a tangent plane which again minimizes approximation error (with respect to the Eucli

... (read more)
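For reference, the standard definition that "combining these two properties" is building toward: $v$ is a weak derivative of $u$ when integration by parts holds against every smooth, compactly supported test function $\varphi$,

$$\int u(x)\,\varphi'(x)\,dx = -\int v(x)\,\varphi(x)\,dx.$$

For $u(x) = |x|$ this is satisfied by $v(x) = \operatorname{sign}(x)$, and since changing $v$ at a single point doesn't change the integrals, we are free to set $v(0) = 0$.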
2TurnTrout3y
The reason f′(0) is undefined for the absolute value function is that you need the value to be the same for all sequences converging to 0 – both from the left and from the right. There's a nice way to motivate this in higher-dimensional settings by thinking about the action of e.g. complex multiplication, but this is a much stronger notion than real differentiability and I'm not quite sure how to think about motivating the single-valued real case yet. Of course, you can say things like "the theorems just work out nicer if you require both the lower and upper limits be the same"...

When I notice I feel frustrated, unproductive, lethargic, etc, I run down a simple checklist:

  • Do I need to eat food?
  • Am I drinking lots of water?
  •  Have I exercised today?
  • Did I get enough sleep last night? 
    • If not, what can I do now to make sure I get more tonight?
  • Have I looked away from the screen recently?
  • Have I walked around in the last 20 minutes?

It's simple, but 80%+ of the time, it fixes the issue.

3Viliam3y
There is a "HALT: hungry? angry? lonely? tired?" mnemonic, but I like that your list includes water and walking and exercise. Now just please make it easier to remember.
1DirectedEvolution3y
How about THREES: Thirsty Hungry Restless Eyestrain Exercise?
2Matt Goldenberg3y
Hey can I steal this for a course I'm teaching? (I'll give you credit).
2TurnTrout3y
sure!

While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: "I explicitly and consciously notice that I felt averse to some aspect of this book".

I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:

If the book does have exercises, it'll take more time. That means I'm spending reading time on things that aren't math textbooks. That means I'm slowing down.

(Transcription of a deeper Focusing on this reasoning)

I'm afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn't (for a long while), so I

... (read more)

An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don't design agents which exploit adversarial inputs, I wrote about two possible mind-designs:

Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices. 

  1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as "working hard" and "behaving well."
  2. Value-child: The mother makes her kid care about working hard and behaving well.

I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior

However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/suspect that many people have read shard theory posts without internalizing new ideas about how cognition can work, ... (read more)
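A deliberately tiny contrast between the two mind-designs (my toy framing, not a full mechanistic model): both agents rank the same plans, but evaluation-child's criterion rewards plans that fool his model of mom, while value-child's criterion doesn't even have that feature.

```python
plans = {
    "do homework":       {"actually_work_hard": 1.0, "fools_model_of_mom": 0.0},
    "forge report card": {"actually_work_hard": -1.0, "fools_model_of_mom": 3.0},
}

def evaluation_child(plans):
    # Optimizes the *evaluation* his model of mom would produce -- exploitable by adversarial plans.
    return max(plans, key=lambda p: plans[p]["actually_work_hard"] + plans[p]["fools_model_of_mom"])

def value_child(plans):
    # Cares about working hard itself; "looks good to the evaluator" gets no weight.
    return max(plans, key=lambda p: plans[p]["actually_work_hard"])

print(evaluation_child(plans))  # 'forge report card'
print(value_child(plans))       # 'do homework'
```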

2tailcalled8mo
I can totally believe that agents that competently and cooperatively seek out to fulfill a goal, rather than seeking to trick evaluators of that goal to think it gets fulfilled, can exist. However, whether you get such agents out of an algorithm depends on the details of that algorithm. Current reinforcement learning algorithms mostly don't create agents that competently do anything. If they were more powerful while still doing essentially the same thing they currently do, most of them would end up tricked by the agents they create, rather than having aligned agents.

Experiment: Train an agent in MineRL which robustly cares about chickens (e.g. would zero-shot generalize to saving chickens in a pen from oncoming lava, by opening the pen and chasing them out, or stopping the lava). Challenge mode: use a reward signal which is a direct function of the agent's sensory input.

This is a direct predecessor to the "Get an agent to care about real-world dogs" problem. I think solving the Minecraft version of this problem will tell us something about how outer reward schedules relate to inner learned values, in a way which directly tackles the key questions, the sensory observability/information inaccessibility issue, and which is testable today.

(Credit to Patrick Finley for the idea)

4TurnTrout1y
After further review, this is probably beyond capabilities for the moment.  Also, the most important part of this kind of experiment is predicting in advance what reward schedules will produce what values within the agent, such that we can zero-shot transfer that knowledge to other task types (e.g. XLAND instead of Minecraft) and say "I want an agent which goes to high-elevation platforms reliably across situations, with low labelling cost", and then sketch out a reward schedule, and have the first capable agents trained using that schedule generalize in the way you want.
2Jay Bailey9mo
Why is this difficult? Is it only difficult to do this in Challenge Mode - if you could just code in "Number of chickens" as a direct feed to the agent, can it be done then? I was thinking about this today, and got to wondering why it was hard - at what step does an experiment to do this fail?
2TurnTrout9mo
Even if you can code in number of chickens as an input to the reward function, that doesn't mean you can reliably get the agent to generalize to protect chickens. That input probably makes the task easier than in Challenge Mode, but not necessarily easy. The agent could generalize to some other correlate. Like ensuring there are no skeletons nearby (because they might shoot nearby chickens), but not in order to protect the chickens.
1Jay Bailey9mo
So, if I understand correctly, the way we would consider it likely that the correct generalisation had happened would be if the agent could generalise to hazards it had never seen actually kill chickens before? And this would require the agent to have an actual model of how chickens can be threatened such that it could predict that lava would destroy chickens based on, say, its knowledge that it will die if it jumps into lava, which is beyond capabilities at the moment?
2TurnTrout9mo
Yes, that would be the desired generalization in the situations we checked. If that happens, it means we specified a behavioral generalization property, wrote down how we were going to get it, and then were just right in predicting that that training rationale would go through.

I passed a homeless man today. His face was wracked in pain, body rocking back and forth, eyes clenched shut. A dirty sign lay forgotten on the ground: "very hungry".

This man was once a child, with parents and friends and dreams and birthday parties and maybe siblings he'd get in arguments with and snow days he'd hope for.

And now he's just hurting.

And now I can't help him without abandoning others. So he's still hurting. Right now.

Reality is still allowed to make this happen. This is wrong. This has to change.

9Said Achmiz4y
How would you help this man, if having to abandon others in order to do so were not a concern? (Let us assume that someone else—someone whose competence you fully trust, and who will do at least as good a job as you will—is going to take care of all the stuff you feel you need to do.) What is it you had in mind to do for this fellow—specifically, now—that you can’t (due to those other obligations)?

Suppose I actually cared about this man with the intensity he deserved - imagine that he were my brother, father, or best friend.

The obvious first thing to do before interacting further is to buy him a good meal and a healthy helping of groceries. Then, I need to figure out his deal. Is he hurting, or is he also suffering from mental illness?

If the former, I'd go the more straightforward route of befriending him, helping him purchase a sharp business professional outfit, teaching him to interview and present himself with confidence, secure an apartment, and find a job.

If the latter, this gets trickier. I'd still try and befriend him (consistently being a source of cheerful conversation and delicious food would probably help), but he might not be willing or able to get the help he needs, and I wouldn't have the legal right to force him. My best bet might be to enlist the help of a psychological professional for these interactions. If this doesn't work, my first thought would be to influence the local government to get the broader problem fixed (I'd spend at least an hour considering other plans before proceeding further, here). Realistically, there's ... (read more)

3Said Achmiz4y
Well, a number of questions may be asked here (about desert, about causation, about autonomy, etc.). However, two seem relevant in particular: First, it seems as if (in your latter scenario) you’ve arrived (tentatively, yes, but not at all unreasonably!) at a plan involving systemic change. As you say, there is quite a bit of effort being expended on this sort of thing already, so, at the margin, any effective efforts on your part would likely be both high-level and aimed in an at-least-somewhat-unusual direction. … yet isn’t this what you’re already doing? Second, and unrelatedly… you say: Yet it seems to me that, empirically, most people do not expend the level of effort which you describe, even for their siblings, parents, or close friends. Which is to say that the level of emotional and practical investment you propose to make (in this hypothetical situation) is, actually, quite a bit greater than that which most people invest in their family members or close friends. The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?
… yet isn’t this what you’re already doing?

I work on technical AI alignment, so some of those I help (in expectation) don't even exist yet. I don't view this as what I'd do if my top priority were helping this man.

The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?

That's a good question. I think the answer is yes, at least for my close family. Recently, I've expended substantial energy persuading my family to sign up for cryonics with me, winning over my mother, brother, and (I anticipate) my aunt. My father has lingering concerns which I think he wouldn't have upon sufficient reflection, so I've designed a similar plan for ensuring he makes what I perceive to be the correct, option-preserving choice. For example, I made significant targeted donations to effective charities on his behalf to offset (what he perceives as) a considerable drawback of cryonics: his inability to also be an organ donor.

A universe in which humanity wins but my dad is gone would be quite sad t... (read more)

3Raemon4y
I predict that this comment is not helpful to Turntrout.
6Raemon4y
:( Song I wrote about this once [https://soundcloud.com/raymond-arnold/tuesday] (not very polished)

If you raised children in many different cultures, "how many" different reflectively stable moralities could they acquire? (What's the "VC dimension" of human morality, without cheating by e.g. directly reprogramming brains?)

(This is probably a Wrong Question, but I still find it interesting to ask.)

Listening to Eneasz Brodski's excellent reading of Crystal Society, I noticed how curious I am about how AGI will end up working. How are we actually going to do it? What are those insights? I want to understand quite badly, which I didn't realize until experiencing this (so far) intelligently written story.

Similarly, how do we actually "align" agents, and what are good frames for thinking about that?

Here's to hoping we don't sate the former curiosity too early.

The "maximize all the variables" tendency in reasoning about AGI.

Here are some lines of thought I perceive, which are probably straw to varying extents for some people and real to varying extents for others. I give varying responses to each, but the point isn't the truth value of any given statement; it's the pattern across the statements:

  1. If an AGI has a concept around diamonds, and is motivated in some way to make diamonds, it will make diamonds which maximally activate its diamond-concept circuitry (possible example). 
    1. My response.
  2. An AI will be trained to minimal loss on the training distribution. 
    1. SGD does not reliably find minimum-loss configurations (modulo expressivity), in practice, in cases we care about. The existence of knowledge distillation is one large counterexample (see the sketch after this list).
    2. Quintin: "In terms of results about model distillation, you could look at appendix G.2 of the Gopher paper. They compare training a 1.4 billion parameter model directly, versus distilling a 1.4 B model from a 7.1 B model."
  3. Predictive processing means that the goal of the human learning process is to minimize predictive loss.[1]
    1. In a process where local modifications are applied to reduce some
... (read more)
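To make the distillation counterexample in point 2 concrete, here is a minimal sketch of the standard soft-target distillation loss (Hinton et al.-style, PyTorch-flavored; the hyperparameters are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend ordinary cross-entropy with a KL term against the teacher's
    temperature-softened distribution (the classic soft-target recipe)."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce
```

If direct SGD on the small model already reached minimum loss, training the same architecture against a larger teacher's soft targets couldn't systematically beat it; the Gopher appendix G.2 comparison Quintin cites is one such result.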

I think this type of criticism is applicable in an even wider range of fields than even you immediately imagine (though in varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:

  • Despite the economists, the economy doesn't try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms which are able to make a profit in proportion to how much they're able to make a profit, and dis-rewards firms which aren't able to make a profit. Firms which are technically profitable, but have no local profit incentive gradient pointing towards them (factoring in the existence of rich people and lenders, neither of which are perfect expected profit maximizers) generally will not come into existence.

  • Individual firms also don't (only) try to maximize profit. Some parts of them may maximize profit, but most are just structures of people built from local social capital and economic capital incentive gradients.

  • Politicians don't try to (only) maximize win-probability.

  • Democracies don't try to (only) maximize voter approval.

  • Evolution doesn't try to maximize inclusive genetic fitness.

  • Memes don't try to maximize inclusive memetic

... (read more)
3TurnTrout5mo
very pithy. nice insight, thanks. 

I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)

One of my (TurnTrout's) reasons for alignment optimism is that I think:

  • We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,
    • (Although this amount of information depends on how much interpretability and agent-internals theory we do now)
  • All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.
  • It's crucial to get early-training value shards of which a substantial fraction are "human-compatible values" (whatever that means)
    • For example, if there are protect-human-shards which 
      • reliably bid against plans where people get hurt,
      • steer deliberation away from such plan stubs, and
      • these shards are "reflectively endorsed" by the overall shard economy (i.e. the decision-making isn't steering towards plans where the protect-human shards get removed)
  • If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training where we can't affect the ball
... (read more)
5johnswentworth9mo
One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents? [https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents]), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out. Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there's possibly changes which Goodhart all of the shards simultaneously. Indeed, I'd expect that to be a pretty strong default outcome.
4TurnTrout9mo
Even on the view you advocate here (where some kind of perfection is required), "perfectly align part of the motivations" seems substantially easier than "perfectly align all of the AI's optimization so it isn't optimizing for anything you don't want." I feel significantly less confident about this, and am still working out the degree to which Goodhart seems hard, and in what contours, on my current view.

"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:

Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profoundly committed hedonist who liked to eat and drink heavily (it was said that he knew how to count everything except calories). -- https://www.newworldencyclopedia.org/entry/John_von_Neumann 

Good, original thinking feels present to me - as if mental resources are well-allocated.

The thought which prompted this:

Sure, if people are asked to solve a problem and say they can't after two seconds, yes - make fun of that a bit. But that two seconds covers more ground than you might think, due to System 1 precomputation.

Reacting to a bit of HPMOR here, I noticed something felt off about Harry's reply to the Fred/George-tried-for-two-seconds thing. Having a bit of experience noticing confusion, I did not think "I notice I am confused" (although this can be useful). I did not think "Eliezer probably put thought into this", or "Harry is kinda dumb in certain ways - so what if he's a bit unfair here?". Without resurfacing, or distraction, or wondering if this train of thought is more fun than just reading further, I just thought about the object-level exchange.

People need to allocate mental energy wisely; this goes far beyond focusing on important tasks. Your existing mental skillsets already optimize and auto-pilot certain mental motions for you, so you should allocate less deliberation to them. In this case, the confusion-noticing module was honed; by not worrying about how w

... (read more)
6TurnTrout4y
Expanding on this, there is an aspect of Actually Trying that is probably missing from S1 precomputation. So, maybe the two-second "attempt" is actually useless for most people because subconscious deliberation isn't hardass enough at giving its all, at making desperate and extraordinary efforts to solve the problem.

Yesterday, I put the finishing touches on my chef d'œuvre, a series of important safety-relevant proofs I've been striving for since early June. Strangely, I felt a great exhaustion come over me. These proofs had been my obsession for so long, and now - now, I'm done.

I've had this feeling before; three years ago, I studied fervently for a Google interview. The literal moment the interview concluded, a fever overtook me. I was sick for days. All the stress and expectation and readiness-to-fight which had been pent up, released.

I don't know why this happens. But right now, I'm still a little tired, even after getting a good night's sleep.

2Hazard4y
This happens to me sometimes. I know several people who have this happen at the end of a Uni semester. Hope you can get some rest.

Be kind to yourself for sample efficiency reasons. Reinforcing good behavior provides an exact "policy gradient" towards desired outputs. Whipping yourself provides an "inexact gradient" away from undesired decisions, which is much harder to learn. :)
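To spell out the RL analogy (my derivation, not part of the original note): for a softmax policy π over outputs with logits z,

$$
\frac{\partial \log \pi(a)}{\partial z_k} \;=\; \mathbb{1}[k = a] \;-\; \pi_k .
$$

Reinforcing a desired action a* therefore nudges the logits in the direction "raise a*, lower everything else in proportion to its current probability", which points at the desired output. Punishing an undesired action a⁻ nudges them in the direction "lower a⁻, raise everything else in proportion to its current probability", which says nothing about which alternative was actually wanted. That is the sense in which the "away from" signal is inexact.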

2TurnTrout4mo
Noting that you could provide exact "negative" gradients by focusing on what you should have done instead. Although whether this transduces into an internal positive reward event / "exact gradient" is unclear to me. Seems like that still "feels bad" in similar ways to unconcentrated negative reward events.
2Gunnar_Zarncke5mo
Which is why, when you learn a new sport, it is a good idea to feel happy when your action worked well but mostly ignore failures - dwelling on them would more likely lead to you not liking the sport than make you better.

If you want to argue an alignment proposal "breaks after enough optimization pressure", you should give a concrete example in which the breaking happens (or at least internally check to make sure you can give one). I perceive people as saying "breaks under optimization pressure" in scenarios where it doesn't even make sense. 

For example, if I get smarter, would I stop loving my family because I applied too much optimization pressure to my own values? I think not.

5Ege Erdil1y
This seems more likely to me than you might imagine. Not certain, nor even very high probability, but probable enough that you should take it into consideration.
3Kaarel1y
Something that confuses me about your example's relevance is that it's like almost the unique case where it's [[really directly] impossible] to succumb to optimization pressure, at least conditional on what's good = something like coherent extrapolated volition. That is, under (my understanding of) a view of metaethics common in these corners, what's good just is what a smarter version of you would extrapolate your intuitions/[basic principles] to, or something along these lines. And so this is almost definitionally almost the unique situation that we'd expect could only move you closer to better fulfilling your values, i.e. nothing could break for any reason, and in particular not break under optimization pressure (where breaking is measured w.r.t. what's good). And being straightforwardly tautologically true would make it a not very interesting example. editorial remark: I realized after writing the two paragraphs below that they probably do not move one much on the main thesis of your post, at least conditional on already having read Ege Erdil's doubts about your example (except insofar as someone wants to defer to opinions of others or my opinion in particular), but I decided to post anyway in large part since these family matters might be a topic of independent interest for some: I would bet that at least 25% of people would stop loving their (current) family in <5 years (i.e. not love them much beyond how much they presently love a generic acquaintance) if they got +30 IQ. That said, I don't claim the main case of this happening is because of applying too much optimization pressure to one's values, at least not in a way that's unaligned with what's good -- I just think it's likely to be the good thing to do (or like, part of all the close-to-optimal packages of actions, or etc.). So I'm not explicitly disagreeing with the last sentence of your comment, but I'm disagreeing with the possible implicit justification of the sentence that goes through ["I would st
1Kaarel1y
Oops I realized that the argument given in the last paragraph of my previous comment applies to people maximizing their personal welfare or being totally altruistic or totally altruistic wrt some large group or some combination of these options, but maybe not so much to people who are e.g. genuinely maximizing the sum of their family members' personal welfares, but this last case might well be entailed by what you mean by "love", so maybe I missed the point earlier. In the latter case, it seems likely that an IQ boost would keep many parts of love intact initially, but I'd imagine that for a significant fraction of people, the unequal relationship would cause sadness over the next 5 years, which with significant probability causes falling out of love. Of course, right after the IQ boost you might want to invent/implement mental tech which prevents this sadness or prevents the value drift caused by growing apart, but I'm not sure if there are currently feasible options which would be acceptable ways to fix either of these problems. Maybe one could figure out some contract to sign before the value drift, but this might go against some deeper values, and might not count as staying in love anyway.
2Dagon1y
"get smarter" is not optimization pressure (though there is evidence that higher IQ and more education is correlated with smaller families).  If you have important goals at risk, would you harm your family (using "harm" rather than "stop loving", as alignment is about actions, not feelings)?  There are lots of examples of humans doing so.   Rephrasing it as "can Moloch break this alignment?" may help. That said, I agree it's a fully-general objection, and I can't tell whether it's legitimate (alignment researchers need to explore and model the limits of tradeoffs in adversarial or pathological environments for any proposed utility function or function generator) or meaningless (can be decomposed into specifics which are actually addressed). I kind of lean toward "legitimate", though.  Alignment may be impossible over long timeframes and significant capability differentials.  
2DirectedEvolution1y
People can get brain damaged and stop loving their families. If moving backwards in intelligence can do this, why not moving forwards?

If you're tempted to write "clearly" in a mathematical proof, the word quite likely glosses over a key detail you're confused about. Use that temptation as a clue for where to dig in deeper.

At least, that's how it is for me.

I went to the doctor's yesterday. This was embarrassing for them on several fronts.

First, I had to come in to do an appointment which could be done over telemedicine, but apparently there are regulations against this.

Second, while they did temp checks and required masks (yay!), none of the nurses or doctors actually wore anything stronger than a surgical mask. I'm coming in here with a KN95 + goggles + face shield because why not take cheap precautions to reduce the risk, and my own doctor is just wearing a surgical? I bought 20 KN95s for, like, 15 bucks on Amazon.

Third, and worst of all, my own doctor spouted absolute nonsense. The mildest insinuation was that surgical facemasks only prevent transmission, but I seem to recall that many kinds of surgical masks halve your chances of infection as well.

Then, as I understood it, he first claimed that coronavirus and the flu have comparable case fatality rates. I wasn't sure if I'd heard him correctly - this was an expert talking about his area of expertise, so I felt like I had surely misunderstood him. I was taken aback. But, looking back, that's what he meant.

He went on to suggest that we can't expect COVID immunity to last (wrong) b... (read more)

7mingyuan3y
Eli just took a plane ride to get to CA and brought a P100, but they told him he had to wear a cloth mask, that was the rule. So he wore a cloth mask under the P100, which of course broke the seal. I feel you.
4ChristianKl3y
I don't think that policy is unreasonable for a plane ride. Just because someone wears a P100 mask doesn't mean that their mask filters outgoing air, as that's not the design goal for most P100 use cases. Checking on a case-by-case basis whether a particular P100 mask differs from the typical design is likely not feasible in that context.
4Dagon3y
What do you call the person who graduates last in their med school class?  Doctor.   And remember that GPs are weighted toward the friendly area of doctor-quality space rather than the hyper-competent.   Further remember that consultants (including experts on almost all topics) are generally narrow in their understanding of things - even if they are well above the median at their actual job (for a GP, dispensing common medication and identifying situations that need referral to a specialist), that doesn't indicate they're going to be well-informed even for adjacent topics. That said, this level of misunderstanding on topics that impact patient behavior and outcome (mask use, other virus precautions) is pretty sub-par.  The cynic in me estimates it's the bottom quartile of front-line medical providers, but I hope it's closer to the bottom decile.  Looking into an alternate provider seems quite justified.
2ChristianKl3y
In the US that isn't the case. There are limited places for internships and the worst person in medical school might not get a place for an internship and thus is not allowed to be a doctor. The medical system is heavily gated to keep out people.

Judgment in Managerial Decision Making says that (subconscious) misapplication of e.g. the representativeness heuristic causes insensitivity to base rates and to sample size, failure to reason about probabilities correctly, failure to consider regression to the mean, and the conjunction fallacy. My model of this is that representativeness / availability / confirmation bias work off of a mechanism somewhat similar to attention in neural networks: due to how the brain performs time-limited search, more salient/recent memories get prioritized for recall.

The availability heuristic goes wrong when our saliency-weighted perception of the frequency of events is a biased estimator of the real frequency, or maybe when we just happen to be extrapolating off of a very small sample size. Concepts get inappropriately activated in our mind, and we therefore reason incorrectly. Attention also explains anchoring: you can more readily bring to mind things related to your anchor due to salience.

The case for confirmation bias seems to be a little more involved: first, we had evolutionary pressure to win arguments, which means our search is meant to find supportive arguments and avoid even subconscio

... (read more)

From my Facebook

My life has gotten a lot more insane over the last two years. However, it's also gotten a lot more wonderful, and I want to take time to share how thankful I am for that.

Before, life felt like... a thing that you experience, where you score points and accolades and check boxes. It felt kinda fake, but parts of it were nice. I had this nice cozy little box that I lived in, a mental cage circumscribing my entire life. Today, I feel (much more) free.

I love how curious I've become, even about "unsophisticated" things. Near dusk, I walked the winter wonderland of Ogden, Utah with my aunt and uncle. I spotted this gorgeous red ornament hanging from a tree, with a hunk of snow stuck to it at north-east orientation. This snow had apparently decided to defy gravity. I just stopped and stared. I was so confused. I'd kinda guessed that the dry snow must induce a huge coefficient of static friction, hence the winter wonderland. But that didn't suffice to explain this. I bounded over and saw the smooth surface was iced, so maybe part of the snow melted in the midday sun, froze as evening advanced, and then the part-ice part-snow chunk stuck much more solidly to the ornament.

Mayb

... (read more)

With respect to the integers, 2 is prime. But with respect to the Gaussian integers, it's not: it has the factorization 2 = (1+i)(1−i). Here's what's happening.

You can view complex multiplication as scaling and rotating the complex plane. So, when we take our unit vector 1 and multiply by 1+i, we're scaling it by √2 and rotating it counterclockwise by 45°:

This gets us to the purple vector. Now, we multiply by 1−i, scaling it up by √2 again (in green), and rotating it clockwise again by the same amount. You can even deal with the scaling and rotations separately (scale twice by √2, with zero net rotation).
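
In polar form, the same bookkeeping can be checked explicitly (a quick worked check, not in the original):

$$
2 = (1+i)(1-i) = \left(\sqrt{2}\,e^{i\pi/4}\right)\left(\sqrt{2}\,e^{-i\pi/4}\right) = 2\,e^{0} = 2 .
$$

The two rotations cancel and the two scalings of √2 multiply to 2, which is exactly the picture described above.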

Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) -> action mappings. EG a shard agent might have:

  1. An "it's good to give your friends chocolate" subshard
  2. A "give dogs treats" subshard
  3. -> An impulse to give dogs chocolate, even though the shard agent knows what the result would be

But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)

In this way, changing a small set of decision-relevant features (e.g. "Brown dog treat" -> "brown ball of chocolate") changes the consequentialist's action logits a lot, way more than it changes the shard agent's logits. In a squinty, informal way, the (belief state -> logits) function has a smaller Lipschitz constant/is smoother for the shard agent than for the consequentialist agent.

So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat -> tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog -> sick dog). You could spin up two copies of the model to compare.
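
A minimal sketch of that test, assuming two trained policies that each expose a `logits(obs)` function over actions and a hand-built set of observation pairs (all names here are hypothetical):

```python
import numpy as np

def sensitivity(policy_logits, obs, perturbed_obs):
    """Rough Lipschitz-style sensitivity: how much do the action logits move
    per unit of observation change (e.g. dog treat -> brown chocolate ball)?"""
    delta_logits = np.linalg.norm(policy_logits(perturbed_obs) - policy_logits(obs))
    delta_obs = np.linalg.norm(perturbed_obs - obs)
    return delta_logits / max(delta_obs, 1e-8)

# Hypothetical usage: `pairs` holds (obs, perturbed_obs) arrays that differ slightly
# in feature space but a lot in consequence space.
# shard_scores  = [sensitivity(shard_agent.logits, a, b) for a, b in pairs]
# conseq_scores = [sensitivity(conseq_agent.logits, a, b) for a, b in pairs]
# The hypothesis predicts np.mean(conseq_scores) > np.mean(shard_scores).
```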

3Garrett Baker6mo
Hm. I find I'm very scared of giving dogs chocolate and grapes because it was emphasized in my childhood this is a common failure-mode, and so will upweight actions which get rid of the chocolate in my hands when I'm around dogs. I expect the results of this experiment to be unclear, since a capable shard composition would want to get rid of the chocolate so it doesn't accidentally give the chocolate to the dog, but this is also what the consequentialist would do, so that they can (say) more easily use their hands for anticipated hand-related tasks (like petting the dog) without needing to expend computational resources keeping track of the dog's relation to the chocolate (if they place the chocolate in their pants). More generally, it seems hard to separate shard-theoretic hypotheses from results-focused reasoning hypotheses without much understanding of the thought processes or values going into each, mostly I think because both theories are still in their infancy.
3Thomas Kwa6mo
Here's how I think about it: Capable agents will be able to do consequentialist reasoning, but the shard-theory-inspired hypothesis is that running the consequences through your world-model is harder / less accessible / less likely than just letting your shards vote on it. If you've been specifically taught that chocolate is bad for dogs, maybe this is a bad example. I also wasn't trying to think about whether shards are subagents; this came out of a discussion on finding the simplest possible shard theory hypotheses and applying them to gridworlds.

Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there's an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. "object permanence").

In A shot at the diamond-alignment problem, I wrote:

[Consider] Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets

We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries – models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We

... (read more)

Quick summary of a major takeaway from Reward is not the optimization target

Stop thinking about whether the reward is "representing what we want", or focusing overmuch on whether agents will "optimize the reward function." Instead, just consider how the reward and loss signals affect the AI via the gradient updates. How do the updates affect the AI's internal computations and decision-making?
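
For one concrete instance of "reward only enters through the updates," the vanilla REINFORCE policy-gradient update is (standard textbook form, not from the original post):

$$
\theta \;\leftarrow\; \theta \;+\; \alpha \, R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

Here the reward R(τ) appears only as a scalar multiplier on the gradients of the log-probabilities of whatever the policy actually did; the interesting question is which internal computations those scaled updates end up reinforcing.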

2Gunnar_Zarncke7mo
Are there different classes of learning systems that optimize for the reward in different ways?
4TurnTrout7mo
Yes, model-based approaches, model-free approaches (with or without critic), AIXI— all of these should be analyzed on their mechanistic details.

I remarked to my brother, Josh, that when most people find themselves hopefully saying "here's how X can still happen!", it's a lost cause and they should stop grasping for straws and move on with their lives. Josh grinned, pulled out his cryonics necklace, and said "here's how I can still not die!"

Suppose you could choose how much time to spend at your local library, during which:

  • you do not age. Time stands still outside; no one enters or exits the library (which is otherwise devoid of people).
  • you don't need to sleep/eat/get sunlight/etc
  • you can use any computers, but not access the internet or otherwise bring in materials with you
  • you can't leave before the requested time is up

Suppose you don't go crazy from solitary confinement, etc. Remember that value drift is a potential thing.

How long would you ask for?

1Algon5mo
A few days, as you forgot to include medication. Barring that issue, I think the lack of internet access and no materials is killer. My local library kind of sucks in terms of stimulating materials. So maybe a few weeks? A month or two, at most. If I had my laptop with me, but without internet access, then maybe a year or two. If I had some kind of magical internet access allowing me to interface with stuff like ChatGPT, then maybe a few more years, as I could probably cook up a decent chatbot to meet some socialization needs. I very much doubt I could spend more than ten years without my values drifting off.
1FactorialCode4y
How good are the computers?
2TurnTrout4y
Windows machines circa ~2013. Let’s say 128GB hard drives which magically never fail, for 10 PCs.
1FactorialCode4y
Probably 3-5 years then. I'd use it to get a stronger foundation in low level programming skills, math and physics. The limiting factors would be entertainment in the library to keep me sane and the inevitable degradation of my social skills from so much spent time alone.
0qvalq5mo
I would ask for a long time. Reading would probably get boring after a few decades, but I think writing essays and programs and papers and books could last much longer. Meditation could also last long, because I'm bad at it.  <1000 years, though; I'd need to be relatively sure I wouldn't commit suicide or fall down stairs.

I feel very excited by the AI alignment discussion group I'm running at Oregon State University. Three weeks ago, most attendees didn't know much about "AI security mindset"-ish considerations. This week, I asked the question "what, if anything, could go wrong with a superhuman reward maximizer which is rewarded for pictures of smiling people? Don't just fit a bad story to the reward function. Think carefully."

There was some discussion and initial optimism, after which someone said "wait, those optimistic solutions are just the ones you'd prioritize! What's that called, again?" (It's called anthropomorphic optimism)

I'm so proud.

Hindsight bias and illusion of transparency seem like special cases of a failure to fully uncondition variables in your world model (e.g. who won the basketball game), or to model an ignorant other person, such that your attempts to reason from your prior state of ignorance (e.g. about who won) are either advantaged by the residual information or reactivate your memories of that information.

Partial alignment successes seem possible. 

People care about lots of things, from family to sex to aesthetics. My values don't collapse down to any one of these. 

I think AIs will learn lots of values by default. I don't think we need all of these values to be aligned with human values. I think this is quite important. 

  • I think the more of the AI's values we align to care about us and make decisions in the way we want, the better. (This is vague because I haven't yet sketched out AI internal motivations which I think would actually produce good outcomes. On my list!) 
  • I think there are strong gains from trade possible among an agent's values. If I care about bananas and apples, I don't need to split my resources between the two values, I don't need to make one successor agent for each value. I can drive to the store and buy both bananas and apples, and only pay for fuel once.
    • This makes it lower-cost for internal values handshakes to compromise; it's less than 50% costly for a power-seeking value to give human-compatible values 50% weight in the reflective utility function.
  • I think there are thresholds at which the AI doesn't care about us sufficiently strongly, and
... (read more)
2TurnTrout8mo
The best counterevidence for this I'm currently aware of comes from the "inescapable wedding parties [https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse-due-to-rlhf#Inescapable_wedding_parties ]" incident, where possibly a "talk about weddings" value was very widely instilled in a model.
1Garrett Baker8mo
Re: agents terminalizing instrumental values.  I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency[1] of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized.  This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them, because future states are basically guaranteed to require them.[2] An example of this for humans may be the act of balancing while standing up. If someone offered to export this kind of cognition to a machine which did it just as good as I, I wouldn't particularly mind. If someone also wanted to change physics in such a way that the only effect is that magic invisible fairies made sure everyone stayed balancing while trying to stand up, I don't think I'd mind that either[3]. 1. ^ I'm assuming this is frequency of the goal assuming the agent isn't optimizing to get into a state that requires that goal. 2. ^ This argument also assumes the overseer isn't otherwise selecting for self-preserving cognition, or that self-preserving cognition is the best way of achieving the relevant goal. 3. ^ Except for the part where there's magic invisible fairies in the world now. That would be cool!
4TurnTrout8mo
I don't know if I follow, I think computations terminalize themselves because it makes sense to cache them (e.g. don't always model out whether dying is a good idea, just cache that it's bad at the policy-level).  & Isn't "balance while standing up" terminalized? Doesn't it feel wrong to fall over, even if you're on a big cushy surface? Feels like a cached computation to me. (Maybe that's "don't fall over and hurt yourself" getting cached?)

Three recent downward updates for me on alignment getting solved in time:

  1. Thinking for hours about AI strategy made me internalize that communication difficulties are real serious.

    I'm not just solving technical problems—I'm also solving interpersonal problems, communication problems, incentive problems. Even if my current hot takes around shard theory / outer/inner alignment are right, and even if I put up a LW post which finally successfully communicates some of my key points, reality totally allows OpenAI to just train an AGI the next month without incorporating any insights which my friends nodded along with.
  2. I've been saying "A smart AI knows about value drift and will roughly prevent it", but people totally have trouble with e.g. resisting temptation into cheating on their diets / quitting addictions. Literally I have had trouble with value drift-y things recently, even after explicitly acknowledging their nature. Likewise, an AI can be aligned and still be "tempted" by the decision influences of shards which aren't in the aligned shard coalition.
  3. Timelines going down. 

If another person mentions an "outer objective/base objective" (in terms of e.g. a reward function) to which we should align an AI, that indicates to me that their view on alignment is very different. The type error is akin to the type error of saying "My physics professor should be an understanding of physical law." The function of a physics professor is to supply cognitive updates such that you end up understanding physical law. They are not, themselves, that understanding.

Similarly, "The reward function should be a human-aligned objective" -- The function of the reward function is to supply cognitive updates such that the agent ends up with human-aligned objectives. The reward function is not, itself, a human aligned objective.

3abhatt3493mo
Hmmm, I suspect that when most people say things like "the reward function should be a human-aligned objective," they're intending something more like "the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives," or perhaps the far weaker claim that "the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives."
3TurnTrout3mo
Maybe! I think this is how Evan explicitly defined it for a time, a few years ago. I think the strong claim isn't very plausible, and the latter claim is... misdirecting of attention, and maybe too weak. Re: attention, I think that "does the agent end up aligned?" gets explained by the dataset more than by the reward function over e.g. hypothetical sentences.  I think "reward/reinforcement numbers" and "data points" are inextricably wedded. I think trying to reason about reward functions in isolation is... a caution sign? A warning sign?

Against "Evolution did it." 

"Why do worms regenerate without higher cancer incidence? Hm, perhaps because they were selected to do that!" 

"Evolution did it" explains why a trait was brought into existence, but not how the trait is implemented. You should still feel confused about the above question, even after saying "Evolution did it!". 

I thought I learned not to make this mistake a few months ago, but I made it again today in a discussion with Andrew Critch. "Evolution did it" is not a mechanistic explanation.

2Gunnar_Zarncke1y
Yeah, it is like saying "energetically more favorable state".

I often have thunk thoughts like "Consider an AI with a utility function that is just barely incorrect, such that it doesn't place any value on boredom. Then the AI optimizes the universe in a bad way."

One problem with this thought is that it's not clear that I'm really thinking about anything in particular, anything which actually exists. What am I actually considering in the above quotation? With respect to what, exactly, is the AI's utility function "incorrect"? Is there a utility function for which its optimal policies are aligned? 

For sufficiently expressive utility functions, the answer has to be "yes." For example, if the utility function is over the AI's action histories, you can just hardcode a safe, benevolent policy into the AI: utility 0 if the AI has ever taken a bad action, 1 otherwise. Since there presumably exists at least some sequence of AI outputs which leads to wonderful outcomes, this action-history utility function works. 
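
Written out explicitly (my notation, not from the original post), that action-history utility function is

$$
U(a_{1:T}) \;=\; \begin{cases} 1 & \text{if no } a_t \text{ is a bad action,} \\ 0 & \text{otherwise.} \end{cases}
$$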

But this is trivial and not what we mean by a "correct" utility function. So, now I'm left with a puzzle. What does it mean for the AI to have a correct utility function? I do not think this is a quibble. The quoted thought seems ungrounded from the substance of the alignment problem.

2Vladimir_Nesov1y
I think humans and aligned AGIs are only ever very indirect pointers to preference (value, utility function), and it makes no sense to talk of authoritative/normative utility functions directly relating to their behavior, or describing it other than through this very indirect extrapolation process that takes ages and probably doesn't make sense either as a thing that can be fully completed. The utility functions/values that do describe/guide behavior are approximations that are knowably and desirably reflectively unstable, that should keep changing on reflection. As such, optimizing according to them too strongly destroys value and also makes them progressively worse approximations via Goodhart's Law. An AGI that holds these approximations (proxy goals) as reflectively stable goals is catastrophically misaligned and will destroy value by optimizing for proxy goals past the point where they stop being good approximations of (unknown) intended goals. So AI alignment is not about alignment of utility functions related to current behavior in any straightforward/useful way. It's about making sure that optimization is soft [https://arbital.com/p/soft_optimizer/] and corrigible [https://arbital.com/p/hard_corrigibility/], that it stops before Goodhart's Curse [https://arbital.com/p/goodharts_curse/] starts destroying value, and follows redefinition of value as it grows.

An AGI's early learned values will steer its future training and play a huge part in determining its eventual stable values. I think most of the ball game is in ensuring the agent has good values by the time it's smart, because that's when it'll start being reflectively stable. Therefore, we can iterate on important parts of alignment, because the most important parts come relatively early in the training run, and early corresponds to "parts of the AI value formation process which we can test before we hit AGI, without training one all the way out."

I think this, in theory, cuts away a substantial amount of the "But we only get one shot" problem. In practice, maybe OpenMind just YOLOs ahead anyways and we only get a few years in the appropriate and informative regime. But this suggests several kinds of experiments to start running now, like "get a Minecraft agent which robustly cares about chickens", because that tells us about how to map outer signals into inner values. 

5Vladimir_Nesov1y
Which means that the destination [https://arbital.com/p/normative_extrapolated_volition/] where it's heading stops uncontrollably changing, but nobody at that point (including the agent) has the slightest idea what it looks like, and it won't get close for a long time. Also, the destination (preference/goal/values) would generally depend on the environment [https://www.lesswrong.com/posts/vix3K4grcHottqpEm/goal-alignment-is-robust-to-the-sharp-left-turn?commentId=qFEEdBezZmfcvH7ax] (it ends up being different if details of the world outside the AGI are different). So many cartesian assumptions fail, distinguishing this situation from a classical agent with goals, where goals are at least contained within the agent, and probably also don't depend on its state of knowledge. I think this is true for important alignment properties, including things that act like values early on, but not for the values/preferences that are reflectively stable in a strong sense. If it's possible to inspect/understand/interpret the content of preference that is reflectively stable, then what you've built is a mature optimizer [https://www.lesswrong.com/posts/Mrz2srZWc7EzbADSo/wrapper-minds-are-the-enemy] with tractable goals, which is always misaligned [https://www.lesswrong.com/posts/Mrz2srZWc7EzbADSo/wrapper-minds-are-the-enemy?commentId=qZcqur9mL95n7xWBG]. It's a thing like paperclip maximizer, demonstrating orthogonality thesis, even if it's tiling the future with something superficially human-related. That is, it makes sense to iterate on the parts of alignment that can be inspected, but the reflectively stable values is not such a part, unless the AI is catastrophically misaligned. The fact that reflectively stable values are the same as those of humanity might be such a part, but it's this fact of sameness that might admit inspection, not the values themselves.
2TurnTrout1y
I disagree with CEV as I recall it, but this could change after rereading it. I would be surprised if I end up thinking that EY had "gotten it right." The important thing to consider is not "what has someone speculated a good destination-description would be", but "what do the actual mechanics for getting there look like?". In this case, the part of you which likes dogs is helping steer your future training and experiences, and so the simple answer is that it's more likely than not that your stable values like dogs too. This reasoning seems to prove too much. Your argument seems to imply that I cannot have "the slightest idea" whether my stable values would include killing people for no reason, or not.
0Vladimir_Nesov1y
It does add up to normality, it's not proving things about current behavior or current-goal content of near-future AGIs. An unknown normative target doesn't say not to do the things you normally do, it's more of a "I beseech you, in the bowels of Christ, to think it possible you may be mistaken" thing. The salient catastrophic alignment failure here is to make AGIs with stable values that capture some variation on current unstable human values, and won't allow their further development. If the normative target is very far from current unstable human values, making current values stable falls very short of the normative target, makes future relatively worthless. That's the kind of thing my point is intended to nontrivially claim, that AGIs with any stable immediately-actionable goals that can be specified in the following physical-time decades or even centuries are almost certainly catastrophically misaligned. So AGIs must have unstable goals [https://arbital.com/p/hard_corrigibility/], softly optimized-for [https://arbital.com/p/soft_optimizer/], aligned to current (or value-laden [https://arbital.com/p/value_laden/] predicted future) human unstable goals, mindful of goodhart [https://arbital.com/p/goodharts_curse/]. The kind of CEV I mean [https://www.lesswrong.com/posts/Cuig4qe8m2aqBCJtZ/which-values-are-stable-under-ontology-shifts?commentId=TiEs9T8g9ncwgQw3Z] is not very specific, it's more of a (sketch of a solution to the) problem of doing a first pass on preparing to define goals for an actual optimizer, one that doesn't need to worry as much about goodhart and so can make more efficient use of the future at scale, before expansion of the universe makes more stuff unreachable. So when I say "CEV" I mostly just mean "normative alignment target", with some implied clarifications on what kind of thing it might be. That's a very status quo anchored thing. I don't think dog-liking is a feature of values stable under reflection if the environment is allowed to

When proving theorems for my research, I often take time to consider the weakest conditions under which the desired result holds - even if it's just a relatively unimportant and narrow lemma. By understanding the weakest conditions, you isolate the load-bearing requirements for the phenomenon of interest. I find this helps me build better gears-level models of the mathematical object I'm studying. Furthermore, understanding the result in generality allows me to recognize analogies and cross-over opportunities in the future. Lastly, I just find this plain satisfying.

Amazing how much I can get done if I chant to myself "I'm just writing two pages of garbage abstract/introduction/related work, it's garbage, it's just garbage, don't fix it rn, keep typing"

Does Venting Anger Feed or Extinguish the Flame? Catharsis, Rumination, Distraction, Anger, and Aggressive Responding

Does distraction or rumination work better to diffuse anger? Catharsis theory predicts that rumination works best, but empirical evidence is lacking. In this study, angered participants hit a punching bag and thought about the person who had angered them (rumination group) or thought about becoming physically fit (distraction group). After hitting the punching bag, they reported how angry they felt. Next, they were given the chance to administer loud blasts of noise to the person who had angered them. There also was a no punching bag control group. People in the rumination group felt angrier than did people in the distraction or control groups. People in the rumination group were also most aggressive, followed respectively by people in the distraction and control groups. Rumination increased rather than decreased anger and aggression. Doing nothing at all was more effective than venting anger. These results directly contradict catharsis theory.

Interesting. A cursory !scholar search indicates these results have replicated, but I haven't done an in-depth review.

6mako yass3y
It would be interesting to see a more long-term study about habits around processing anger. For instance, randomly assigning people different advice about processing anger (likely to have quite an impact on them, I don't think the average person receives much advice in that class) and then checking in on them a few years later and ask them things like, how many enemies they have, how many enemies they've successfully defeated, how many of their interpersonal issues they resolve successfully?
4Raemon3y
Boggling a bit at the "can you actually reliably find angry people and/or make people angry on purpose?"
2David Scott Krueger (formerly: capybaralet)3y
I found this fascinating... it's rare these days that I see some fundamental assumption in my thinking that I didn't even realize I was making laid bare like this... it is particularly striking because I think I could easily have realized that my own experience contradicts catharsis theory... I know that I can distract myself to become less angry, but I usually don't want to, in the moment. I think that desire is driven by emotion, but rationalized via something like catharsis theory. I want to try and rescue catharsis theory by saying that maybe there are negative long-term effects of being distracted from feelings of anger (e.g. a build up of resentment). I wonder how much this is also a rationalization. I also wonder how accurately the authors have characterized catharsis theory, and how much to identify it with the "hydraulic model of anger"... I would imagine that there are lots of attempts along the lines of what I suggested to try and rescue catharsis theory by refining or moving away from the hydraulic model. A highly general version might claim: "over a long time horizon, not 'venting' anger is net negative".

I never thought I'd be seriously testing the reasoning abilities of an AI in 2020

Looking back, history feels easy to predict; hindsight + the hard work of historians makes it (feel) easy to pinpoint the key portents. Given what we think about AI risk, in hindsight, might this have been the most disturbing development of 2020 thus far? 

I personally lean towards "no", because this scaling seemed somewhat predictable from GPT-2 (flag - possible hindsight bias), and because 2020 has been so awful so far. But it seems possible, at least. I don't really know what update GPT-3 is to my AI risk estimates & timelines.

DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3 . Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?

An exercise in the companion workbook to the Feynman Lectures on Physics asked me to compute a rather arduous numerical simulation. At first, this seemed like a "pass" in favor of an exercise more amenable to analytic and conceptual analysis; arithmetic really bores me. Then, I realized I was being dumb - I'm a computer scientist.

Suddenly, this exercise became very cool, as I quickly figured out the equations and code, crunched the numbers in an instant, and churned out a nice scatterplot. This seems like a case where cross-domain competence is unusually helpful (although it's not like I had to bust out any esoteric theoretical CS knowledge). I'm wondering whether this kind of thing will compound as I learn more and more areas; whether previously arduous or difficult exercises become easy when attacked with well-honed tools and frames from other disciplines.

I recently reached out to my two PhD advisors to discuss Hinton stepping down from Google. An excerpt from one of my emails:

One last point which I want to make is that instrumental convergence seems like more of a moot point now as well. Whether or not GPT-6 or GPT-7 would autonomously seek power without being directed to do so, I'm worried that people will just literally ask these AIs to gain them a bunch of power/money. They've already done that with GPT-4, and they of course failed. I'm worried that eventually, the AIs will be smart enough to succeed, e

... (read more)
3lc2mo
Said something similar in shortform a while back. [https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1?commentId=krsXhzhriC9LfY2qt]
8gwern2mo
Or, even more briefly: "tool AIs want to be agent AIs" [https://gwern.net/tool-ai].
5TurnTrout2mo
Aside: I like that essay but wish it had a different name. Part of "tool AI" (on my conception) is that it doesn't autonomously want, but agent AI does. A title like "End-users want tool AIs to be agent AIs" admittedly doesn't have the same ring, but it is more accurate to my understanding of the claims.
2Thane Ruthenis2mo
While it's true, there's something about making this argument that I don't like. It's like it's setting you up for moving goalposts if you succeed with it? It makes it sound like the core issue is people giving AIs power, with the solution to that issue — and, implicitly, to the whole AGI Ruin thing — being to ban that. Which is not going to help, since the sort of AGI we're worried about isn't going to need people to naively hand it power. I suppose "not proactively handing power out" somewhat raises the bar for the level of superintelligence necessary, but is that going to matter much in practice? I expect not. Which means the natural way to assuage this fresh concern would do ~nothing to reduce the actual risk. Which means if we make this argument a lot, and get people to listen to it, and they act in response... We're then going to have to say that no, actually that's not enough, actually the real threat is AIs plotting to take control even if we're not willing to give it. And I'm not clear on whether using the "let's at least not actively hand over power to AIs, m'kay?" argument is going to act as a foot in the door [https://en.wikipedia.org/wiki/Foot-in-the-door_technique] and make imposing more security easier [https://www.lesswrong.com/posts/vavnqwYbc8jMu3dTY/ai-coordination-needs-clear-wins], or whether it'll just burn whatever political capital we have on fixing a ~nonissue.
2TurnTrout2mo
I'm sympathetic. I think that I should have said "instrumental convergence seems like a moot point when deciding whether to be worried about AI disempowerment scenarios"; instrumental convergence isn't a moot point for alignment discussion and within lab strategy, of course. But I do consider "giving AIs power" to be a substantial part of the risk we face, such that not doing that would be quite helpful. I think it's quite possible that GPT-6 isn't autonomously power-seeking, but I feel pretty confused about the issue.

Speculation: RL rearranges and reweights latent model abilities, which SL created. (I think this mostly isn't novel, just pulling together a few important threads)

Suppose I supervised-train a LM on an English corpus, and I want it to speak Spanish. RL is inappropriate for the task, because its on-policy exploration won't output interestingly better or worse Spanish completions. So there's no obvious content for me to grade. 

More generally, RL can provide inexact gradients away from undesired behavior (e.g. negative reinforcement event -> downweigh... (read more)
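To illustrate the "no obvious content to grade" point, here's a toy REINFORCE-style sketch (hypothetical model and reward function, not any actual setup): the policy gradient only reweights completions the policy already samples, so if nothing sampled is passable Spanish, the grader has nothing to push up.

```python
import torch

# Toy sketch: a stand-in LM head and a hypothetical "Spanish quality" grader.
# On-policy RL only adjusts the log-probabilities of sampled outputs; if every
# sampled output scores zero, the gradient is exactly zero.
vocab, hidden = 50, 16
lm_head = torch.nn.Linear(hidden, vocab)              # stand-in for a language model
state = torch.randn(hidden)

def spanish_quality_reward(token_id: int) -> float:   # hypothetical grader
    return 0.0                                        # an English-only model never samples good Spanish

logits = lm_head(state)
dist = torch.distributions.Categorical(logits=logits)
token = dist.sample()                                 # on-policy exploration
loss = -spanish_quality_reward(token.item()) * dist.log_prob(token)  # REINFORCE objective
loss.backward()                                       # zero reward everywhere sampled -> zero gradient
print(lm_head.weight.grad.abs().max())                # tensor(0.): no pressure toward Spanish
```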

When I think about takeoffs, I notice that I'm less interested in GDP or in how fast the AI's cognition improves, and more interested in how AI will affect me, and how quickly. More plainly, how fast will shit go crazy for me, and how does that change my ability to steer events? 

For example, assume unipolarity. Let architecture Z be the architecture which happens to be used to train the AGI. 

  1. How long is the duration between "architecture Z is published / seriously considered" and "the AGI kills me, assuming alignment fails"? 
  2. How long do I have, in theory,
... (read more)

Being proven wrong is an awesome, exciting shortcut which lets me reach the truth even faster.

8niplav10mo
It's a good reason to be very straightforward with your beliefs, and to be willing to pull numbers out of your ass! I've had situations where I've updated hard based on two dozen words, which wouldn't have happened if I'd been more cagey about my beliefs or waited longer to "flesh them out".

Research-guiding heuristic: "What would future-TurnTrout predictably want me to get done now?"

2Dagon1y
Drop "predictably" from your statement.  It's implied for most methods of identification of such things, and shouldn't additionally be a filter on things you consider.
4TurnTrout10mo
I find the "predictably" to be useful. It emphasizes certain good things to me, like "What obvious considerations are you overlooking right now?". I think the "predictably" makes my answer less likely to be whatever I previously was planning on doing before asking myself the question.

The Pfizer phase 3 study's last endpoint is 7 days after the second shot. Does anyone know why the CDC recommends waiting 2 weeks for full protection? Are they just being the CDC again?

6jimrandomh2y
People don't really distinguish between "I am protected" and "I am safe for others to be around". If someone got infected prior to their vaccination and had a relatively-long incubation period, they could infect others; I don't think it's a coincidence that two weeks is also the recommended self-isolation period for people who may have been exposed.

The framing effect & aversion to losses generally cause us to execute more cautious plans. I’m realizing this is another reason to reframe my x-risk motivation from “I won’t let the world be destroyed” to “there’s so much fun we could have, and I want to make sure that happens”. I think we need more exploratory thinking in alignment research right now.

(Also, the former motivation style led to me crashing and burning a bit when my hands were injured and I was no longer able to do much.)

ETA: actually, I’m realizing I had the effect backwards. Framing via

... (read more)
7TurnTrout4y
I’m realizing how much more risk-neutral I should be:
3Isnasene4y
For what it's worth, I tried something like the "I won't let the world be destroyed"->"I want to make sure the world keeps doing awesome stuff" reframing back in the day and it broadly didn't work. This had less to do with cautious/uncautious behavior and more to do with status quo bias. Saying "I won't let the world be destroyed" treats "the world being destroyed" as an event that deviates from the status quo of the world existing. In contrast, saying "There's so much fun we could have" treats "having more fun" as the event that deviates from the status quo of us not continuing to have fun. When I saw the world being destroyed as status quo, I cared a lot less about the world getting destroyed.

I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...

And insofar as this impression is correct, this is a mistake. There is only one way alignment is. 

If inner/outer is altogether a more faithful picture of those dynamics: 

  • relatively coherent singular mesa-objectives form in agents, albeit not necessarily always search-based
    • more fragility of value and difficulty in getting the mesa object
... (read more)

I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you're interested in working with me and Quintin.

What kind of reasoning would have allowed me to see MySpace in 2004, and then hypothesize the current craziness as a plausible endpoint of social media? Is this problem easier or harder than the problem of 15-20 year AI forecasting?

1unparadoxed2y
Hmm, maybe it would be easier if we focused on one kind/example of craziness. Is there a particular one you have in mind?

Over the last 2.5 years, I've read a lot of math textbooks. Not using Anki / spaced repetition systems over that time has been an enormous mistake. My factual recall seems worse-than-average among my peers, but when supplemented with Anki, it's far better than average (hence, I was able to learn 2000+ Japanese characters in 90 days, in college). 

I considered using Anki for math in early 2018, but I dismissed it quickly because I hadn't had good experience using that application for things which weren't languages. I should have at least tried to see if... (read more)

1NaiveTortoise3y
I'm curious what sort of things you're Anki-fying (e.g. a few examples for measure theory).
2TurnTrout3y
https://ankiweb.net/shared/info/511421324 [https://ankiweb.net/shared/info/511421324] 

This might be the best figure I've ever seen in a textbook. Talk about making a point! 

Molecular Biology of the Cell, Alberts.

An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.

2Raemon3y
That's an interesting point.

Today, let's read about GPT-3's obsession with Shrek

As for me, I think Shrek is important because the most valuable thing in life is happiness. I mean this quite literally. There's a mountain of evidence for it, if you're willing to look at the research. And I think movies can help us get there. Or at least not get in the way.

Now, when I say "happiness," I'm not talking about the transient buzz that you get from, say, heroin. I'm talking about a sense of fulfillment. A sense that you are where you're meant to be. That you are doing what you're meant

... (read more)
2ChristianKl3y
What's the input that produced the text from GPT-3?
2TurnTrout3y
Two Sequences posts... lol... Here's the full transcript [https://aidungeon.page.link/?link=https://exploreViewAdventure?publicId=f352d549-9f35-49cf-a363-095b88d52385&ofl=https://play.aidungeon.io/adventure/f352d549-9f35-49cf-a363-095b88d52385&apn=com.aidungeon&ibi=com.aidungeon.app&isi=1491268416]. 

Basilisks are a great example of plans which are "trying" to get your plan evaluation procedure to clock in a huge upwards error. Sensible beings avoid considering such plans, and everything's fine. I am somewhat worried about an early-training AI learning about basilisks before the AI is reflectively wise enough to reject the basilisks. 

For example: 

- Pretraining on a corpus in which people worry about basilisks could elevate reasoning about basilisks to the AI's consideration, 

- at which point the AI reasons in more detail because it's not... (read more)

3Gunnar_Zarncke8mo
By the same argument, religion, or at least some of its arguments, like Pascal's wager, should probably also be scrubbed.
1mesaoptimizer17d
gwern's Clippy [https://gwern.net/fiction/clippy] gets done in by a basilisk (in your terms):

80% credence: It's very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds or # of happy people or reward-signal) and not quantities like (# of times I have to look at a cube in a blue room or -1 * subjective micromorts accrued).

Intuitions:

  • I expect contextually activated heuristics to be the default, and that agents will learn lots of such contextual values which don't cash out to being strictly about diamonds or people, even if the overall agent is mostly
... (read more)
5TurnTrout10mo
I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket "no jumping jacks ever" rule, this trade is less costly to other shards and allows more efficient trades to occur. 

My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even... (read more)

1Pattern2y
1. I predict that you will never encounter such an opponent. Solving chess is hard.*
2. Optimal play within a game might not be optimal overall (others can learn from the strategy).

Why does this matter? If the theorems hold, even for 'not optimal, but still great' policies (say, for chess), then the distinction is irrelevant. Though for more complicated (or non-zero sum) games, the optimal move/policy may depend on the other player's move/policy. (I'm not sure what 'avoid shutdown' looks like in chess.)

ETA: *with 10^43 legal positions in chess, it will take an impossibly long time to compute a perfect strategy with any feasible technology. -source: https://en.wikipedia.org/wiki/Chess#Mathematics [https://en.wikipedia.org/wiki/Chess#Mathematics], which lists its source from 1977 [https://en.wikipedia.org/wiki/Chess#cite_note-140]

If Hogwarts spits back an error if you try to add a non-integer number of house points, and if you can explain the busy beaver function to Hogwarts, you now have an oracle which answers "does $k$ divide $\text{BB}(n)$?" for arbitrary $n, k$: just state "$\text{BB}(n)/k$ points to Ravenclaw!". You can do this for other problems which reduce to divisibility tests (so, any decision problem for which you can somehow get Hogwarts to compute a suitable function; the award goes through without error iff the divisibility holds).

Homework: find a way to safely take over the world using this power, and no other magic. 
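A toy sketch of the reduction in code, with a hypothetical `hogwarts_award` standing in for the in-fiction integer-points rule, and a computable function standing in for BB:

```python
from fractions import Fraction

# Toy sketch. `hogwarts_award` is hypothetical: it fails loudly on non-integer point awards,
# standing in for the in-fiction rule. `f` stands in for any function (e.g. BB) that you can
# talk Hogwarts into computing.
def hogwarts_award(points):
    if points != int(points):
        raise ValueError("House points must be integers!")

def divides(k, f, n):
    """Decide whether k divides f(n) by awarding f(n)/k points and watching for the error."""
    try:
        hogwarts_award(Fraction(f(n), k))
        return True
    except ValueError:
        return False

# Example with a computable stand-in: the product of two consecutive integers is always even.
print(divides(2, lambda n: n * (n + 1), 5))  # True
print(divides(4, lambda n: n * (n + 1), 5))  # False (30 is not divisible by 4)
```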

6Measure2y
I'd be worried about integer overflow with that protocol. If it can understand BB and division, you can probably just ask for the remainder directly and observe the change.

When I imagine configuring an imaginary pile of blocks, I can feel the blocks in front of me in this fake imaginary plane of existence. I feel aware of their spatial relationships to me, in the same way that it feels different to have your eyes closed in a closet vs in an empty auditorium. 

But what is this mental workspace? Is it disjoint and separated from my normal spatial awareness, or does my brain copy/paste->modify my real-life spatial awareness. Like, if my brother is five feet in front of me, and then I imagine a blade flying five feet in f... (read more)

The new "Broader Impact" NeurIPS statement is a good step, but incentives are misaligned. Admitting fatally negative impact would set a researcher back in their career, as the paper would be rejected. 

Idea: Consider a dangerous paper which would otherwise have been published. What if that paper were published title-only on the NeurIPS website, so that the researchers can still get career capital?

Problem: How do you ensure resubmission doesn't occur elsewhere?

5Daniel Kokotajlo3y
The people at NeurIPS who reviewed the paper might notice if resubmission occurred elsewhere? Automated tools might help with this, by searching for specific phrases.

There's been talk of having a Journal of Infohazards. Seems like an idea worth exploring to me. Your suggestion sounds like a much more feasible first step.

Problem: Any entity with halfway decent hacking skills (such as a national government, or clever criminal) would be able to peruse the list of infohazardy titles, look up the authors, cyberstalk them, and then hack into their personal computer and steal the files. We could hope that people would take precautions against this, but I'm not very optimistic. That said, this still seems better than the status quo.

Cool Math Concept You Never Realized You Wanted: Fréchet distance.

Imagine a man traversing a finite curved path while walking his dog on a leash, with the dog traversing a separate one. Each can vary their speed to keep slack in the leash, but neither can move backwards. The Fréchet distance between the two curves is the length of the shortest leash sufficient for both to traverse their separate paths. Note that the definition is symmetric with respect to the two curves—the Frechet distance would be the same if the dog was walking its owner.

The Fréche

... (read more)

Earlier today, I became curious why extrinsic motivation tends to preclude or decrease intrinsic motivation. This phenomenon is known as overjustification. There's likely agreed-upon theories for this, but here's some stream-of-consciousness as I reason and read through summarized experimental results. (ETA: Looks like there isn't consensus on why this happens)

My first hypothesis was that recognizing external rewards somehow precludes activation of curiosity-circuits in our brain. I'm imagining a kid engrossed in a puzzle. Then, they're told that they'll b

... (read more)

Virtue ethics seems like model-free consequentialism to me.

5JohnSteidley3y
I was thinking along similar lines! From my notes from 2019-11-24: "Deontology is like the learned policy of bounded rationality of consequentialism"

Going through an intro chem textbook, it immediately strikes me how this should be as appealing and mysterious as the alchemical magic system of Fullmetal Alchemist. "The law of equivalent exchange", "conservation of energy/elements/mass (the last two holding only for normal chemical reactions)", etc. If only it were natural to take joy in the merely real...

4Hazard4y
Have you been continuing your self-study schemes into realms beyond math stuff? If so I'm interested in both the motivation and how it's going! I remember having little interest in other non-physics science growing up, but that was also before I got good at learning things and my enjoyment was based on how well it was presented.
5TurnTrout4y
Yeah, I've read a lot of books since my reviews fell off last year, most of them still math. I wasn't able to type reliably until early this summer, so my reviews kinda got derailed. I've read Visual Group Theory, Understanding Machine Learning, Computational Complexity: A Conceptual Perspective, Introduction to the Theory of Computation, An Illustrated Theory of Numbers, most of Tadellis' Game Theory, the beginning of Multiagent Systems, parts of several graph theory textbooks, and I'm going through Munkres' Topology right now. I've gotten through the first fifth of the first Feynman lectures, which has given me an unbelievable amount of mileage for generally reasoning about physics.

I want to go back to my reviews, but I just have a lot of other stuff going on right now. Also, I run into fewer basic confusions than when I was just starting at math, so I generally have less to talk about. I guess I could instead try and re-present the coolest concepts from the book.

My "plan" is to keep learning math until the low graduate level (I still need to at least do complex analysis, topology, field / ring theory, ODEs/PDEs, and something to shore up my atrocious trig skills, and probably more)[1], and then branch off into physics + a "softer" science (anything from microecon to psychology). CS ("done") -> math -> physics -> chem -> bio is the major track for the physical sciences I have in mind, but that might change. I dunno, there's just a lot of stuff I still want to learn. :)

1. I also still want to learn Bayes nets, category theory, get a much deeper understanding of probability theory, provability logic, and decision theory. ↩︎
4Hazard4y
Yay learning all the things! Your reviews are fun, also completely understandable putting energy elsewhere. Your energy for more learning is very useful for periodically bouncing myself into more learning.

I'm currently excited about a "macro-interpretability" paradigm. To quote Joseph Bloom:

TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocated limited resources such as residual stream and weights between different learnable circuits, seems important. 

The general topic I think we are getting at is something like "circuit economics". The thing I'm trying to gesture at is that while circuits might deliver value in distinct ways (such as redu

... (read more)

Argument that you can't use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart (or smarter) than the plan-generator.

Consider any situation where it's hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can't just sit by or deploy some simple traps in this situation.

Therefore, any pla... (read more)

4Rohin Shah10mo
The main hope is to have the ELK solution be at least as smart as the plan-generator. See mundane solutions to exotic problems [https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7]:

Notes on behaviorism: After reading a few minutes about it, behaviorism seems obviously false. It views the "important part" of reward to be the external behavior which led to the reward. If I put my hand on a stove, and get punished, then I'm less likely to do that again in the future. Or so the theory goes.

But this seems, in fullest generality, wildly false. The above argument black-boxes the inner structure of human cognition which produces the externally observed behavior.

What actually happens, on my model, is that the stove makes your hand hot, which ... (read more)

Argument sketch for why boxing is doomed if the agent is perfectly misaligned:

Consider a perfectly misaligned agent which has -1 times your utility function—it's zero-sum. Then suppose you got useful output of the agent. This means you're able to increase your EU. This means the AI decreased its EU by saying anything. Therefore, it should have shut up instead. But since we assume it's smarter than you, it realized this possibility, and so the fact that it's saying something means that it expects to gain by hurting your interests via its output. Therefore, the output can't be useful. 

6Viliam1y
Makes sense, with the proviso that this is sometimes true only statistically. Like, the AI may choose to write an output which has a 70% chance to hurt you and a 30% chance to (equally) help you, if that is its best option. If you assume that the AI is smarter than you, and has a good model of you, you should not read the output. But if you accidentally read it, and luckily you react in the right (for you) way, that is a possible result, too. You just cannot and should not rely on being so lucky.
2Measure1y
You also have to assume that the AI knows everything you know which might not be true if it's boxed.

The costs of (not-so-trivial) inconveniences

I like exercising daily. Some days, I want to exercise more than others—let's suppose that I actually benefit more from exercise on that day. Therefore, I have a higher willingness to pay the price of working out.

Consider the population of TurnTrouts over time, one for each day. This is a population of consumers with different willingnesses to pay, and so we can plot the corresponding exercise demand curve (with a fixed price). In this idealized model, I exercise whenever my willingness to pay exceeds the price.

B... (read more)

2Jan Czechowski2y
Can you give some clarifications for this concept? I'm not sure what you mean here.
6Mark Xu2y
https://en.wikipedia.org/wiki/Deadweight_loss#Harberger's_triangle [https://en.wikipedia.org/wiki/Deadweight_loss#Harberger's_triangle]

The discussion of the HPMOR epilogue in this recent April Fool's thread was essentially online improv, where no one could acknowledge that without ruining the pretense. Maybe I should do more improv in real life, because I enjoyed it!

If you measure death-badness from behind the veil of ignorance, you’d naively prioritize well-liked, famous people with large families.

2Pattern3y
Would you prioritize the young from behind the veil of ignorance?

AIDungeon's subscriber-only GPT-3 can do some complex arithmetic, but it's very spotty. Bold text is me.

You say "What happens if I take the square root of 3i?" 

The oracle says: "You'll get a negative number. [wrong] So, for example, the square root of  is ." [correct]
"What?" you say.
 "I just said it," the oracle repeats. 
"But that's ridiculous! The square root of  is not . It's complex. It's  plus a multiple of ." [wrong, but my character is supposed to be playing dumb here]

The

... (read more)

Idea: learn by making conjectures (math, physical, etc) and then testing them / proving them, based on what I've already learned from a textbook. 

Learning seems easier and faster when I'm curious about one of my own ideas.

8NaiveTortoise3y
For what it's worth, this is very true for me as well. I'm also reminded of a story of Robin Hanson from Cryonics magazine: * Source [https://www.google.com/url?sa=t&source=web&rct=j&url=https://alcor.org/cryonics/Cryonics2017-4.pdf&ved=2ahUKEwjKhLnnl6LqAhWeoXIEHQWwB4UQFjAPegQIBxAB&usg=AOvVaw3uCyzISnOW89LDK4zeAsTC]
1Rudi C3y
How do you estimate how hard your invented problems are?

Sentences spoken aloud are a latent space embedding of our thoughts; when trying to move a thought from our mind to another's, our thoughts are encoded with the aim of minimizing the other person's decoder error.

Broca’s area handles syntax, while Wernicke’s area handles the semantic side of language processing. Subjects with damage to the latter can speak in syntactically fluent jargon-filled sentences (fluent aphasia) – and they can’t even tell their utterances don’t make sense, because they can’t even make sense of the words leaving their own mouth!

It seems like GPT2 : Broca’s area :: ??? : Wernicke’s area. Are there any cog psych/AI theories on this?

We can think about how consumers respond to changes in price by considering the elasticity of the quantity demanded at a given price - how quickly does demand decrease as we raise prices? Price elasticity of demand is defined as $\varepsilon = \frac{\%\,\Delta Q}{\%\,\Delta P}$; in other words, for price $P$ and quantity $Q$, this is $\frac{dQ}{dP}\cdot\frac{P}{Q}$ (this looks kinda weird, and it wasn't immediately obvious what's happening here...). Revenue is the total amount of cash changing hands: $R = PQ$.

What's happening here is that raising prices is a good idea when the revenue gained (the "pric

... (read more)
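Filling in the algebra behind that last thought (a standard result, using the same symbols as above):

$$\varepsilon = \frac{dQ/Q}{dP/P} = \frac{dQ}{dP}\cdot\frac{P}{Q}, \qquad R = P\,Q(P), \qquad \frac{dR}{dP} = Q + P\frac{dQ}{dP} = Q\,(1+\varepsilon),$$

so raising the price increases revenue exactly when $\varepsilon > -1$, i.e. when demand is inelastic.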

How does representation interact with consciousness? Suppose you're reasoning about the universe via a partially observable Markov decision process, and that your model is incredibly detailed and accurate. Further suppose you represent states as numbers, as their numeric labels.

To get a handle on what I mean, consider the game of Pac-Man, which can be represented as a finite, deterministic, fully-observable MDP. Think about all possible game screens you can observe, and number them. Now get rid of the game screens. From the perspective of reinforcement lea

... (read more)
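A minimal sketch of what's left once the screens are thrown away (toy numbers, my own illustration rather than actual Pac-Man): the environment is just integer labels plus transition and reward structure.

```python
import random

# Toy illustration: once the game screens are numbered and discarded, all the agent "sees"
# is opaque integer states plus the transition and reward structure over those labels.
random.seed(0)
n_states, actions = 5, ["left", "right"]
transition = {(s, a): random.randrange(n_states) for s in range(n_states) for a in actions}
reward = {s: random.random() for s in range(n_states)}

state = 0
for _ in range(3):
    action = random.choice(actions)
    state = transition[(state, action)]
    print(f"{action} -> state {state}, reward {reward[state]:.2f}")
```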
9Gordon Seidoh Worley4y
I think a reasonable and related question we don't have a solid answer for is if humans are already capable of mind crime. For example, maybe Alice is mad at Bob and imagines causing harm to Bob. How well does Alice have to model Bob for her imaginings to be mind crime? If Alice has low cognitive empathy is it not mind crime but if her cognitive empathy is above some level is it then mind crime? I think we're currently confused enough about what mind crime is such that it's hard to even begin to know how we could answer these questions based on more than gut feelings.
2Vladimir_Nesov4y
I suspect that it doesn't matter how accurate or straightforward a predictor is in modeling people. What would make prediction morally irrelevant is that it's not noticed by the predicted people, irrespective of whether this happens because it spreads the moral weight conferred to them over many possibilities (giving inaccurate prediction), keeps the representation sufficiently baroque, or for some other reason. In the case of inaccurate prediction or baroque representation, it probably does become harder for the predicted people to notice being predicted, and I think this is the actual source of moral irrelevance, not those things on their own. A more direct way of getting the same result is to predict counterfactuals where the people you reason about don't notice the fact that you are observing them, which also gives a form of inaccuracy (imagine that your predicting them is part of their prior, that'll drive the counterfactual further from reality).

I seem to differently discount different parts of what I want. For example, I'm somewhat willing to postpone fun to low-probability high-fun futures, whereas I'm not willing to do the same with romance.

Handling compute overhangs after a pause. 

Sometimes people object that pausing AI progress for e.g. 10 years would lead to a "compute overhang": At the end of the 10 years, compute will be cheaper and larger than at present-day. Accordingly, once AI progress is unpaused, labs will cheaply train models which are far larger and smarter than before the pause. We will not have had time to adapt to models of intermediate size and intelligence. Some people believe this is good reason to not pause AI progress.

There seem to be a range of relatively simple pol... (read more)

2Vladimir_Nesov9d
Cheaper compute is about as inevitable as more capable AI, neither is a law of nature. Both are valid targets for hopeless regulation.
01a3orn8d
This is only a simple policy approach in an extremely theoretical sense though, I'd say. Like it assumes a perfect global compute cap with no exceptions, no nations managing to do anything in secret, and with the global enforcement agency being incorruptible and non-favoritist, and so on. You fail at any of these, and the situation could be worse than if no "pause" happened, even assuming the frame where a pause was important in the first place. (Although, full disclosure, I do not share that frame).

Wikipedia has an unfortunate and incorrect-in-generality description of reinforcement learning (emphasis added):

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

Later in the article, talking about basic optimal-control inspired approaches:

The purpose of reinforcement learning is for the agent to learn an optimal, or nearly-optimal, policy that maximizes the "reward function" or other user-provided reinforcement signal

... (read more)
8Steven Byrnes2mo
The description doesn't seem so bad to me. Your post "Reward is not the optimization target" is about what actual RL algorithms actually do. The wiki descriptions here are a kind of normative motivation as to how people came to be looking into those algorithms in the first place. Like, if there's an RL algorithm that performs worse than chance at getting a high reward, then that ain't an RL algorithm. Right? Nobody would call it that. I think lots of families of algorithms are likewise lumped together by a kind of normative "goal", even if any given algorithm in that family is doing something somewhat different and more complicated than “achieving that goal”, and even if, in any given application, the programmer might not want that goal to be perfectly achieved even if it could be [https://www.alignmentforum.org/posts/hRuE66SXobGhrbwxR/note-on-algorithms-with-multiple-trained-components]. So by the same token, supervised learning algorithms are "supposed" to minimize a loss, compilers are "supposed" to create efficient and correct assembly code, word processors are "supposed" to process words, etc., but in all cases that's not a literal and complete description of what the algorithms in question actually do, right? It’s a pointer to a class of algorithms. Sorry if I'm misunderstanding.
4TurnTrout2mo
I agree that it is narrowly technically accurate as a description of researcher motivation. Note that they don't offer any other explanation elsewhere in the article. Also note that they also make empirical claims:
4Steven Byrnes2mo
Sure. That excerpt is not great.
4TurnTrout2mo
(I do think that animals care about the reinforcement signals and their tight correlates, to some degree, such that it's reasonable to gloss it as "animals sometimes optimize rewards." I more strongly object to conflating what the animals may care about with the mechanistic purpose/description of the RL process.)
3niplav2mo
I encourage you to fix the mistake. (I can't guarantee that the fix will be incorporated, but for something this important it's worth a try).

Idea for getting weak-in-expectation evidence about deception:

  1. Pretrain a model.
  2. Finetune two copies using reward functions you are confident will produce different internal values, where one set of values is substantially less aligned.
  3. See if the two models, which are (at least first) unaware of this procedure, will display different behavior, or not.
  4. If they both behave in an aligned-seeming fashion, this seems like strong evidence of deception. 

Be cautious with sequences-style "words don't matter, only anticipations matter." (At least, this is an impression I got from the sequences, and could probably back this up.) Words do matter insofar as they affect how your internal values bind. If you decide that... (searches for non-political example) monkeys count as "people", that will substantially affect your future decisions via e.g. changing your internal "person" predicate, which in turn will change how different downstream shards activate (like "if person harmed, be less likely to execute plan", a... (read more)

9Zack_M_Davis7mo
You probably do, though! [https://www.lesswrong.com/posts/esRZaPXSHgWzyB2NL/where-to-draw-the-boundaries] Thinking of a monkey as a "person" means using your beliefs about persons-in-general to make predictions about aspects of the monkey that you haven't observed.
2TurnTrout7mo
Right, good point!

Why don't people reinforcement-learn to delude themselves? It would be very rewarding for me to believe that alignment is solved, everyone loves me, I've won at life as hard as possible. I think I do reinforcement learning over my own thought processes. So why don't I delude myself?

On my model of people, rewards provide ~"policy gradients" which update everything, but most importantly shards. I think eg the world model will have a ton more data from self-supervised learning, and so on net most of its bits won't come from reward gradients.

For example, if I ... (read more)

2JBlack7mo
A lot of people do delude themselves in many ways, and some directly in many of the ways you describe. However, I doubt that human brains work literally in terms of nothing but reward reinforcement. There may well be a core of something akin to that, but mixed in with all the usual hacks and kludges that evolved systems have.
2TurnTrout7mo
I was thinking about delusions like "I literally anticipate-believe that there is a stack of cash in the corner of the room." I agree that people do delude themselves, but my impression is that mentally healthy people do not anticipation-level delude themselves on nearby physical observables which they have lots of info about.  I could be wrong about that, though?
1Garrett Baker7mo
I wonder if this hypothesis is supported by looking at the parts of schizophrenics' (or just anyone currently having a hallucination's) brains. Ideally the parts responsible for producing the hallucination. 

How the power-seeking theorems relate to the selection theorem agenda. 

  1. Power-seeking theorems. P(agent behavior | agent decision-making procedure, agent objective, other agent internals, environment). 

    I've mostly studied the likelihood function for power-seeking behavior: what decision-making procedures, objectives, and environments produce what behavioral tendencies. I've discovered some gears for what situations cause what kinds of behaviors.
    1. The power-seeking theorems also allow some discussion of P(agent behavior | agent training process, trai
... (read more)

Idea: Expert prediction markets on predictions made by theories in the field, with $ for being a good predictor and lots of $ for designing and running a later-replicated experiment whose result the expert community strongly anti-predicted. Lots of problems with the plan, but surprisal-based compensation seems interesting and I haven't heard about it before. 

What is "real"? I think about myself as a computation embedded in some other computation (i.e. a universe-history). I think "real" describes hypotheses about the environment where my computation lives. What should I think is real? That which an "ideal embedded reasoner" would assign high credence. However that works.

This sensibly suggests that Gimli-in-actual-Ea (LOTR) should believe he lives in Ea, and that Ea is real, even though it isn't our universe's Earth. Also, the notion accounts for indexical uncertainty by punting it to how embedded reasoning sho... (read more)

Ordinal preferences just tell you which outcomes you like more than others: apples more than oranges.

Interval scale preferences assign numbers to outcomes, which communicates how close outcomes are in value: kiwi 1, orange 5, apple 6. You can say that apples have 5 times the advantage over kiwis that they do over oranges, but you can't say that apples are six times as good as kiwis. Fahrenheit and Celsius are also like this.

Ratio scale ("rational"? 😉) preferences do let you say that apples are six times as good as kiwis, and you need this property to maxi

... (read more)
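One standard way to pin down the three scales (my gloss, not part of the truncated text) is by the transformations that preserve their meaning:

$$\text{ordinal: } u \mapsto f(u) \text{ for strictly increasing } f; \qquad \text{interval: } u \mapsto au+b,\ a>0; \qquad \text{ratio: } u \mapsto au,\ a>0.$$

Fahrenheit-to-Celsius is an $au+b$ map, which is why ratios of temperatures aren't meaningful but ratios of temperature differences are.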
4Matt Goldenberg3y
Isn't the typical assumption in game theory that preferences are ordinal? This suggests that you can make quite a few strategic decisions without bringing in ratio.
3Dagon3y
From what I have read, and from self-introspection, humans mostly have ordinal preferences. Some of them we can interpolate to interval scales or ratios (or higher-order functions) but if we extrapolate very far, we get odd results. It turns out you can do a LOT with just ordinal preferences. Almost all real-world decisions are made this way.

My autodidacting has given me a mental reflex which attempts to construct a gears-level explanation of almost any claim I hear. For example, when listening to “Listen to Your Heart” by Roxette:

Listen to your heart,

There’s nothing else you can do

I understood what she obviously meant and simultaneously found myself subvocalizing “she means all other reasonable plans are worse than listening to your heart - not that that’s literally all you can do”.

This reflex is really silly and annoying in the wrong context - I’ll fix it soon. But it’s pretty amusing

... (read more)

One of the reasons I think corrigibility might have a simple core principle is: it seems possible to imagine a kind of AI which would make a lot of different possible designers happy. That is, if you imagine the same AI design deployed by counterfactually different agents with different values and somewhat-reasonable rationalities, it ends up doing a good job by almost all of them. It ends up acting to further the designers' interests in each counterfactual. This has been a useful informal way for me to think about corrigibility, when considering different

... (read more)

I had an intuition that attainable utility preservation (RL but you maintain your ability to achieve other goals) points at a broader template for regularization. AUP regularizes the agent's optimal policy to be more palatable towards a bunch of different goals we may wish we had specified. I hinted at the end of Towards a New Impact Measure that the thing-behind-AUP might produce interesting ML regularization techniques.

This hunch was roughly correct; Model-Agnostic Meta-Learning tunes the network parameters such that they can be quickly adapted to achiev

... (read more)

From the ELK report

We can then train a model to predict these human evaluations, and search for actions that lead to predicted futures that look good. 

For simplicity and concreteness you can imagine a brute force search. A more interesting system might train a value function and/or policy, do Monte-Carlo Tree Search with learned heuristics, and so on. These techniques introduce new learned models, and in practice we would care about ELK for each of them. But we don’t believe that this complication changes the basic picture and so we leave it ou

... (read more)

I think that the training goal of "the AI never makes a catastrophic decision" is unrealistic and unachievable and unnecessary. I think this is not a natural shape for values to take. Consider a highly altruistic man with anger problems, strongly triggered by e.g. a specific vacation home. If he is present with his wife at this home, he beats her. As long as he starts off away from the home, and knows about his anger problems, he will be motivated to resolve his anger problems, or at least avoid the triggering contexts / take other precautions to ensure her... (read more)

"Goodhart" is no longer part of my native ontology for considering alignment failures. When I hear "The AI goodharts on some proxy of human happiness", I start trying to fill in a concrete example mind design which fits that description and which is plausibly trainable. My mental events are something like: 

Condition on: AI with primary value shards oriented around spurious correlate of human happiness; AI exhibited deceptive alignment during training, breaking perceived behavioral invariants during its sharp-capabilities-gain 

Warning: No history ... (read more)

2Vladimir_Nesov1y
There might be a natural concept for this that reframes deceptive alignment in the direction of reflection/extrapolation. Looking at deceptive alignment as a change of behavior not in response to capability gain, but instead as a change in response to stepping into a new situation, it's then like a phase change in the (unchanging) mapping from situations to behaviors (local policies). The behaviors of a model suddenly change as it moves to similar situations, in a way that's not "correctly prompted" by behaviors in original situations. It's like a robustness failure, but with respect to actual behavior in related situations, rather than with respect to some outer objective or training/testing distribution. So it seems more like a failure of reflection/extrapolation, where behavior in new situations should be determined by coarse-grained descriptions of behavior in old situations (maybe "behavioral invariants" are something like that; or just algorithms) rather than by any other details of the model. Aligned properties of behavior in well-tested situations normatively-should screen off details of the model, in determining behavior in new situations (for a different extrapolated/"robustness"-hardened model prepared for use in the new situations).

Excalidraw is now quite good and works almost seamlessly on my iPad. It's also nice to use on the computer. I recommend it to people who want to make fast diagrams for their posts.

Reading EY's dath ilan glowfics, I can't help but think of how poor English is as a language to think in. I wonder if I could train myself to think without subvocalizing (presumably it would be too much work to come up with a well-optimized encoding of thoughts, all on my own, so no new language for me). No subvocalizing might let me think important thoughts more quickly and precisely.

2Yoav Ravid2y
Interesting. My native language is Hebrew but I often find it easier to think in English.
2Matthew Barnett2y
This is an interesting question, and one that has been studied by linguists [https://en.wikipedia.org/wiki/Linguistic_relativity].
2TurnTrout2y
I'm not sure how often I subvocalize to think thoughts. Often I have trouble putting a new idea into words just right, which means the raw idea essence came before the wordsmithing. But other times it feels like I'm synchronously subvocalizing as I brainstorm

How might we align AGI without relying on interpretability?

I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around? 

My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its  parameter space being 3-colored as follows:

  • Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
  • Red if the parameter vector+... leads to a misaligned or dece
... (read more)

I'd like to see research exploring the relevance of intragenomic conflict to AI alignment research. Intragenomic conflict constitutes an in-the-wild example of misalignment, where conflict arises "within an agent" even though the agent's genes have strong instrumental incentives to work together (they share the same body). 

In an interesting parallel to John Wentworth's Fixing the Good Regulator Theorem, I have an MDP result that says: 

Suppose we're playing a game where I give you a reward function and you give me its optimal value function in the MDP. If you let me do this for $|\mathcal{S}|$ reward functions (one for each state in the environment), and you're able to provide the optimal value function for each, then you know enough to reconstruct the entire environment (up to isomorphism).

Roughly: being able to complete linearly many tasks in the state space means you ha... (read more)
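A small sketch of the setup being described (illustration only, not the reconstruction argument itself): compute the optimal value function for each state-indicator reward in a toy deterministic MDP. Those $|\mathcal{S}|$ value functions are the data the result says suffice to recover the environment up to isomorphism.

```python
import numpy as np

# A toy deterministic MDP (hand-picked transitions, assumed purely for illustration).
# For each state s, build the "indicator" reward (1 in s, 0 elsewhere) and compute its
# optimal value function by value iteration.
n_states, n_actions, gamma = 4, 2, 0.9
transition = np.array([[1, 2],    # transition[s, a] = next state
                       [3, 0],
                       [0, 3],
                       [2, 1]])

def optimal_values(reward):
    """Value iteration for V*(s) under a reward that depends only on the current state."""
    V = np.zeros(n_states)
    for _ in range(500):
        V = np.array([max(reward[s] + gamma * V[transition[s, a]] for a in range(n_actions))
                      for s in range(n_states)])
    return V

for target in range(n_states):
    indicator_reward = np.eye(n_states)[target]
    print(f"reward = indicator of state {target}:", np.round(optimal_values(indicator_reward), 3))
```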

I read someone saying that ~half of the universes in a neighborhood of ours went to Trump. But... this doesn't seem right. Assuming Biden wins in the world we live in, consider the possible perturbations to the mental states of each voter. (Big assumption! We aren't thinking about all possible modifications to the world state. Whatever that means.)

Assume all 2020 voters would be equally affected by a perturbation (which you can just think of as a decision-flip for simplicity, perhaps). Since we're talking about a neighborhood ("worlds pretty close to ours"... (read more)

1Measure3y
I think this depends on the distance considered. In worlds very very close to ours, the vast majority will have the same outcome as ours. As you increase the neighborhood size (I imagine this as considering worlds which diverged from ours more distantly in the past), Trump becomes more likely relative to Biden [edit: more likely than he is relative to Biden in more nearby worlds]. As you continue to expand, other outcomes start to have significant likelihood as well.
2TurnTrout3y
Why do you think that? How do you know that?
2Measure3y
General intuition that "butterfly effect" is basically true, meaning that if a change occurs in a chaotic system, then the size of the downstream effects will tend to increase over time. Edit: I don't have a good sense of how far back you would have to go to see meaningful change in outcome, just that the farther you go the more likely change becomes.
2TurnTrout3y
Sure, but why would those changes tend to favor Trump as you get outside of a small neighborhood? Like, why would Biden / (Biden or Trump win) < .5? I agree it would at least approach .5 as the neighborhood grows. I think. 
5Measure3y
I think we're in agreement here. I didn't mean to imply that Trump would become more likely than Biden in absolute terms, just that the ratio Trump/Biden would increase.

Epistemic status: not an expert

Understanding Newton's second law, $\vec{F} = \frac{d\vec{p}}{dt}$.

Consider the vector-valued velocity as a function of time, $\vec{v}(t)$. Scale this by the object's mass and you get the momentum function over time. Imagine this momentum function wiggling around over time, the vector from the origin rotating and growing and shrinking.

The second law says that force is the derivative of this rescaled vector function - if an object is more massive, then the same displacement of this rescaled arrow is a proportionally smaller velocity modification, because o... (read more)
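A compact statement of the picture above, assuming constant mass:

$$\vec{F} = \frac{d\vec{p}}{dt} = \frac{d(m\vec{v})}{dt} = m\,\frac{d\vec{v}}{dt}, \qquad \Delta\vec{v} = \frac{\Delta\vec{p}}{m},$$

so the same change in the momentum arrow buys a velocity change that shrinks in proportion to the mass.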

Tricking AIDungeon's GPT-3 model into writing HPMOR:

You start reading Harry Potter and the Methods of Rationality by Eliezer Yudkowsky:

" "It said to me," said Professor Quirrell, "that it knew me, and that it would hunt me down someday, wherever I tried to hide." His face was rigid, showing no fright.
"Ah," Harry said. "I wouldn't worry about that, Professor Quirrell." It's not like Dementors can actually talk, or think; the structure they have is borrowed from your own mind and expectations...
Now

... (read more)
2Pattern3y
I love the ending. It's way more exciting,
2TurnTrout3y
2habryka3y
Mod note: Spoilerified, to shield the eyes of the innocent.
4TurnTrout3y
My bad! Thanks.

ARCHES distinguishes between single-agent / single-user and single-agent/multi-user alignment scenarios. Given assumptions like "everyone in society is VNM-rational" and "societal preferences should also follow VNM rationality", and "if everyone wants a thing, society also wants the thing", Harsanyi's utilitarian theorem shows that the societal utility function is a linear non-negative weighted combination of everyone's utilities. So, in a very narrow (and unrealistic) setting, Harsanyi's theorem tells you how the single-multi solution is built from the si

... (read more)
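Concretely, the conclusion of Harsanyi's aggregation theorem under those assumptions (with the caveat that the weights aren't uniquely pinned down):

$$U_{\text{society}}(x) = \sum_i w_i\, U_i(x), \qquad w_i \ge 0.$$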

From FLI's AI Alignment Podcast: Inverse Reinforcement Learning and Inferring Human Preferences with Dylan Hadfield-Menell:

Dylan: There’s one example that I think about, which is, say, you’re cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it’s actually suggesting the wrong move or a bad move. Would you call that system value-aligned?

Lucas: No, I would not.

Dylan: I think most people wouldn’t. Now, what if I told you that that program was act

... (read more)

On page 22 of Probabilistic reasoning in intelligent systems, Pearl writes:

Raw experiential data is not amenable to reasoning activities such as prediction and planning; these require that data be abstracted into a representation with a coarser grain. Probabilities are summaries of details lost in this abstraction...

An agent observes a sequence of images displaying either a red or a blue ball. The balls are drawn according to some deterministic rule of the time step. Reasoning directly from the experiential data leads to ~Solomonoff induction. What mig

... (read more)
2TurnTrout3y
In particular, the coarse-grain is what I mentioned in 1) – beliefs are easier to manage with respect to a fixed featurization of the observation space.
1NaiveTortoise3y
Only related to the first part of your post, I suspect Pearl!2020 would say the coarse-grained model should be some sort of causal model on which we can do counterfactual reasoning.

We can imagine aliens building a superintelligent agent which helps them get what they want. This is a special case of aliens inventing tools. What kind of general process should these aliens use – how should they go about designing such an agent?

Assume that these aliens want things in the colloquial sense (not that they’re eg nontrivially VNM EU maximizers) and that a reasonable observer would say they’re closer to being rational than antirational. Then it seems[1] like these aliens eventually steer towards reflectively coherent rationality (provided they

... (read more)

It seems to me that Zeno's paradoxes leverage incorrect, naïve notions of time and computation. We exist in the world, and we might suppose that the world is being computed in some way. If time is continuous, then the computer might need to do some pretty weird things to determine our location at an infinite number of intermediate times. However, even if that were the case, we would never notice it – we exist within time and we would not observe the external behavior of the system which is computing us, nor its runtime.

2Pattern3y
What are your thoughts on infinitely small quantities?
2TurnTrout3y
Don't have much of an opinion - I haven't rigorously studied infinitesimals yet. I usually just think of infinite / infinitely small quantities as being produced by limiting processes. For example, the intersection of all the ϵ-balls around a real number is just that number (under the standard topology), which set has 0 measure and is, in a sense, "infinitely small".

Very rough idea

In 2018, I started thinking about corrigibility as "being the kind of agent lots of agents would be happy to have activated". This seems really close to a more ambitious version of what AUP tries to do (not be catastrophic for most agents).

I wonder if you could build an agent that rewrites itself / makes an agent which would tailor the AU landscape towards its creators' interests, under a wide distribution of creator agent goals/rationalities/capabilities. And maybe you then get a kind of generalization, where most simple algorithms which solve this solve ambitious AI alignment in full generality.

AFAICT, the deadweight loss triangle from eg price ceilings is just a lower bound on lost surplus. Inefficient allocation to consumers means that people who value the good less than the market equilibrium price can buy it, while the DWL triangle optimistically assumes that the consumers with the highest willingness to pay will eat up the limited supply.
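A minimal simulation sketch of that claim (my own toy numbers: linear supply and uniformly distributed willingness to pay), comparing the textbook triangle against the lost surplus when the scarce units are instead allocated at random among everyone willing to pay the ceiling price:

```python
import random

# Toy curves: WTP ~ Uniform(0, 100), marginal cost of the q-th unit = 100*q/N,
# so the competitive equilibrium is price 50, quantity N/2.
random.seed(0)
N = 10_000
wtp = sorted((random.uniform(0, 100) for _ in range(N)), reverse=True)

def marginal_cost(q):
    return 100 * q / N

ceiling = 30
q_supplied = int(N * ceiling / 100)            # producers only make units with MC <= ceiling
production_cost = sum(marginal_cost(q) for q in range(q_supplied))

q_eq = N // 2                                  # competitive benchmark
eq_surplus = sum(wtp[:q_eq]) - sum(marginal_cost(q) for q in range(q_eq))

# Textbook triangle: assumes the scarce units go to the highest-WTP consumers.
efficient_surplus = sum(wtp[:q_supplied]) - production_cost

# Instead, allocate at random among everyone willing to pay the ceiling price,
# including people who value the good below the old equilibrium price of 50.
eligible = [w for w in wtp if w >= ceiling]
random_surplus = sum(random.sample(eligible, q_supplied)) - production_cost

print("triangle (lower bound on lost surplus):", round(eq_surplus - efficient_surplus))
print("lost surplus under random allocation:  ", round(eq_surplus - random_surplus))
```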

4Wei Dai3y
Good point. By searching for "deadweight loss price ceiling lower bound" I was able to find a source [http://www.econ.ucla.edu/sboard/teaching/econ11_09/econ11_09_slides10.pdf] (see page 26) that acknowledges this, but most explications of price ceilings do not seem to mention that the triangle is just a lower bound for lost surplus.
2Dagon3y
Lost surplus is definitely a loss - it's not linear with utility, but it's not uncorrelated. Also, if supply is elastic over any relevant timeframe, there's an additional source of loss. And I'd argue that for most goods, over timeframes smaller than most price-fixing proposals are expected to last, there is significant price elasticity.
2TurnTrout3y
I don't think I was disagreeing?
2Dagon3y
Ah, I took the "just" in "just a lower bound on lost surplus" as an indicator that it's less important than other factors. And I lightly believe (meaning: for the cases I find most available, I believe it, but I don't know how general it is) that the supply elasticity _is_ the more important effect of such distortions. So I wanted to reinforce that I wasn't ignoring that cost, only pointing out a greater cost.

I was having a bit of trouble holding the point of quadratic residues in my mind. I could effortfully recite the definition, give an example, and walk through the broad-strokes steps of proving quadratic reciprocity. But it felt fake and stale and memorized.

Alex Mennen suggested a great way of thinking about it. For some odd prime $p$, consider the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^\times$. This group is abelian and has even order $p-1$. Now, consider a primitive root / generator $g$. By definition, every element of the group can be expressed as $g^k$. The quadratic residues ar

... (read more)
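A quick numerical sanity check of that framing (standard number theory, small example with $p = 11$): the quadratic residues are exactly the even powers of a primitive root.

```python
# For an odd prime p, the quadratic residues mod p are exactly the even powers of a primitive root g.
p = 11
g = next(g for g in range(2, p)
         if len({pow(g, k, p) for k in range(1, p)}) == p - 1)   # find a generator
even_powers = {pow(g, k, p) for k in range(0, p - 1, 2)}
squares = {pow(x, 2, p) for x in range(1, p)}
assert even_powers == squares
print(f"p = {p}, generator g = {g}, quadratic residues = {sorted(squares)}")
```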
4AlexMennen4y
The theorem: where $k$ is relatively prime to an odd prime $p$ and $n<e$, $k\cdot p^n$ is a square mod $p^e$ iff $k$ is a square mod $p$ and $n$ is even. The real meat of the theorem is the $n=0$ case (i.e. a square mod $p$ that isn't a multiple of $p$ is also a square mod $p^e$). Deriving the general case from there should be fairly straightforward, so let's focus on this special case.

Why is it true? This question has a surprising answer: Newton's method for finding roots of functions. Specifically, we want to find a root of $f(x):=x^2-k$, except in $\mathbb{Z}/p^e\mathbb{Z}$ instead of $\mathbb{R}$. To adapt Newton's method to work in this situation, we'll need the $p$-adic absolute value on $\mathbb{Z}$: $|k\cdot p^n|_p:=p^{-n}$ for $k$ relatively prime to $p$. This has lots of properties that you should expect of an "absolute value": it's positive ($|x|_p\geq 0$ with equality only when $x=0$), multiplicative ($|xy|_p=|x|_p|y|_p$), symmetric ($|-x|_p=|x|_p$), and satisfies a triangle inequality ($|x+y|_p\leq|x|_p+|y|_p$; in fact, we get more in this case: $|x+y|_p\leq\max(|x|_p,|y|_p)$). Because of positivity, symmetry, and the triangle inequality, the $p$-adic absolute value induces a metric (in fact, an ultrametric, because of the strong version of the triangle inequality) $d(x,y):=|x-y|_p$.

To visualize this distance function, draw $p$ giant circles, and sort integers into circles based on their value mod $p$. Then draw $p$ smaller circles inside each of those giant circles, and sort the integers in the big circle into the smaller circles based on their value mod $p^2$. Then draw $p$ even smaller circles inside each of those, and sort based on value mod $p^3$, and so on. The distance between two numbers corresponds to the size of the smallest circle encompassing both of them. Note that, in this metric, $1, p, p^2, p^3, \ldots$ converges to $0$.

Now on to Newton's method: if $k$ is a square mod $p$, let $a$ be one of its square roots mod $p$. $|f(a)|_p\leq p^{-1}$; that is, $a$ is somewhat close to being a root of $f$ with respect to the $p$-adic absolute value. $f'(x)=2x$, so $|f'(a)|_p=|2a|_p=|2|_p\cdot|a|_p=1\cdot 1=1$; that is, $f$ is steep near $a$. This is good,
2AlexMennen4y
The part about derivatives might have seemed a little odd. After all, you might think, $\mathbb{Z}$ is a discrete set, so what does it mean to take derivatives of functions on it? One answer to this is to just differentiate symbolically using polynomial differentiation rules. But I think a better answer is to remember that we're using a different metric than usual, and $\mathbb{Z}$ isn't discrete at all! Indeed, for any number $k$, $\lim_{n\to\infty} k+p^n=k$, so no points are isolated, and we can define differentiation of functions on $\mathbb{Z}$ in exactly the usual way with limits.

I noticed I was confused and liable to forget my grasp on what the hell is so "normal" about normal subgroups. You know what that means - colorful picture time!

First, the classic definition. A subgroup $N$ is normal when, for all group elements $g$, $gN = Ng$ (this is trivially true for all subgroups of abelian groups).

ETA: I drew the bounds a bit incorrectly; is most certainly within the left coset ().

Notice that nontrivial cosets aren't subgroups, because they don't have the identity $e$.

This "normal" thing matters because sometimes we want to highlight regu

... (read more)
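Not from the post, but here's a tiny sanity check of the definition in code, using the equivalent conjugation form $gNg^{-1}=N$ in $S_3$ (permutations of three elements as tuples): the even permutations form a normal subgroup, while the subgroup generated by a single transposition does not.

```python
from itertools import permutations

def compose(p, q):          # (p ∘ q)(i) = p(q(i)): apply q first, then p
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return tuple(inv)

G = list(permutations(range(3)))

def is_normal(N):
    # Check g·N·g⁻¹ = N for every g in the group.
    return all({compose(compose(g, n), inverse(g)) for n in N} == set(N) for g in G)

A3 = [p for p in G if sum(1 for i in range(3) for j in range(i) if p[j] > p[i]) % 2 == 0]  # even permutations
H = [(0, 1, 2), (1, 0, 2)]                     # identity plus a single transposition

print(is_normal(A3))  # True: index-2 subgroups are always normal
print(is_normal(H))   # False: conjugating a transposition can give a different transposition
```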

The existence of the human genome yields at least two classes of evidence which I'm strongly interested in.

  1. Humans provide many highly correlated datapoints on general intelligence (human minds), as developed within one kind of learning process (best guess: massively parallel circuitry, locally randomly initialized, self-supervised learning + RL). 
    1. We thereby gain valuable information about the dynamics of that learning process. For example, people care about lots of things (cars, flowers, animals, friends), and don't just have a single unitary mesa-obj
... (read more)

Transplanting algorithms into randomly initialized networks. I wonder if you could train a policy network to walk upright in sim, back out the "walk upright" algorithm, randomly initialize a new network which can call that algorithm as a "subroutine call" (but the walk-upright weights are frozen), and then have the new second model learn to call that subroutine appropriately? Possibly the learned representations would be convergently similar enough to interface quickly via SGD update dynamics. 

If so, this provides some (small, IMO) amount of rescue fo... (read more)

3cfoster08mo
This is basically how I view the DeepMind Flamingo [https://arxiv.org/abs/2204.14198] model training to have operated, where a few stitching layers learn to translate the outputs of a frozen vision encoder into "subroutine calls" into the frozen language model, such that visual concept circuits ping their corresponding text token output circuits.

When I was younger, I never happened to "look in the right direction" on my own in order to start the process of becoming agentic and coherent. Here are some sentences I wish I had heard when I was a kid:

  • The world is made out of parts you can understand, in particular via math
  • By understanding the parts, you can also control them
  • Your mind is part of the world and sometimes it doesn't work properly, but you can work on that too
  • Humanity's future is not being properly managed and so you should do something about that

Plausibly just hearing this would have done it for me, but probably that's too optimistic. 

0Dagon9mo
Given the somewhat continuous (if you squint) nature of self-awareness, there MUST be some people exactly on the margin where a very small push is sufficient to accelerate their movement along the path. But I suspect it's pretty rare, and it's optimistic (or pessimistic, depending on your valuation of your prior innocence and experiences) to think you were in precisely the right mind-state that this could have made a huge difference.

What's up with biological hermaphrodite species? My first reaction was, "no way, what about the specialization benefits from sexual dimorphism?"

There are apparently no hermaphrodite mammal or bird species, which seems like evidence supporting my initial reaction. But there are, of course, other hermaphrodite species—maybe they aren't K-strategists, and so sexual dimorphism and role specialization isn't as important?

I went into a local dentist's office to get more prescription toothpaste; I was wearing my 3M p100 mask (with a surgical mask taped over the exhaust, in order to protect other people in addition to the native exhaust filtering offered by the mask). When I got in, the receptionist was on the phone. I realized it would be more sensible for me to wait outside and come back in, but I felt a strange reluctance to do so. It would be weird and awkward to leave right after entering. I hovered near the door for about 5 seconds before actually leaving. I was pretty ... (read more)

Instead of waiting to find out you were confused about new material you learned, pre-emptively google things like "common misconceptions about [concept]" and put the answers in your spaced repetition system, or otherwise magically remember them.

In order to reduce bias (halo effect, racism, etc), shouldn't many judicial proceedings generally be held over telephone, and/or through digital audio-only calls with voice anonymizers? 

3Mark Xu3y
I don't see strong reasons why this isn't a good idea. I have heard that technical interviews sometimes get conducted with voice anonymizers.

Continuous functions $f : \mathbb{R} \to \mathbb{R}$ can be recovered from their values on the rationals; in particular, for each real number $x$, choose a sequence of rational numbers $(q_n)$ converging to $x$, and let $f(x) := \lim_{n \to \infty} f(q_n)$.

Therefore, there is an injection from the vector space of continuous functions $\mathbb{R} \to \mathbb{R}$ to the vector space of all real sequences: since the rationals are countable, enumerate them as $q_1, q_2, q_3, \ldots$. Then the sequence $(f(q_1), f(q_2), f(q_3), \ldots)$ represents the continuous function $f$.
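Stated compactly (my notation, assuming a fixed enumeration $q_1, q_2, q_3, \ldots$ of $\mathbb{Q}$):

```latex
% Sketch of the evaluation-on-rationals map and why it is injective.
\[
  \Phi : C(\mathbb{R}, \mathbb{R}) \longrightarrow \mathbb{R}^{\mathbb{N}},
  \qquad
  \Phi(f) := \big(f(q_1),\, f(q_2),\, f(q_3),\, \ldots\big).
\]
% If \Phi(f) = \Phi(g), then f and g agree on the dense set Q,
% so by continuity f = g everywhere; hence \Phi is injective (and clearly linear).
```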

3itaibn03y
This map is not a surjection, because not every map from the rational numbers to the real numbers extends to a continuous function, and so not every sequence represents a continuous function. It is injective, and so it shows that a basis for the latter space is at least as large in cardinality as a basis for the former space. One can construct an injective map in the other direction, showing that both spaces have bases of the same cardinality, and so they are isomorphic.
2TurnTrout3y
Fixed, thanks.

(Just starting to learn microecon, so please feel free to chirp corrections)

How diminishing marginal utility helps create supply/demand curves: think about the uses you could find for a pillow. Your first few pillows are used to help you fall asleep. After that, maybe some for your couch, and then a few spares to keep in storage. You prioritize pillow allocation in this manner; the value of the latter uses is much less than the value of having a place to rest your head.

How many pillows do you buy at a given price point? Well, if you buy any, you'll buy som

... (read more)
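A toy numerical version of the above logic (made-up marginal values, just to show how a demand curve falls out of prioritized uses):

```python
# Dollar value of each successive pillow: sleeping on them beats couch decoration beats spares.
marginal_values = [40, 35, 15, 12, 5, 3]

def quantity_demanded(price):
    """Keep buying pillows while the next pillow is worth at least its price."""
    return sum(1 for value in marginal_values if value >= price)

for price in [50, 30, 10, 2]:
    print(f"price ${price}: buy {quantity_demanded(price)} pillows")
# Lower prices -> larger quantities purchased: one person's downward-sloping demand curve.
```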

How can I make predictions in a way which lets me do data analysis? I want to be able to grep / tag questions, plot calibration over time, split out accuracy over tags, etc. Presumably exporting to a CSV should be sufficient. PredictionBook doesn't have an obvious export feature, and its API seems to not be working right now / I haven't figured it out yet. 

Trying to collate team shard's prediction results and visualize with plotly, but there's a lot of data processing that has to be done first. Want to avoid the pain in the future.
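In case it helps anyone else, here's a generic sketch of the analysis I want, written against an assumed CSV with columns `probability`, `outcome`, `tag`, `date` (this is not PredictionBook's actual export format, which I haven't gotten working):

```python
import pandas as pd

# Assumed layout: one row per resolved prediction.
#   probability: stated credence in [0, 1]
#   outcome: 1 if the prediction resolved true, 0 otherwise
#   tag: freeform label, e.g. "alignment", "personal"
#   date: resolution date
df = pd.read_csv("predictions.csv", parse_dates=["date"])

# Calibration: within each credence bucket, how often did predictions resolve true?
df["bucket"] = (df["probability"] * 10).round() / 10
print(df.groupby("bucket")["outcome"].agg(["mean", "count"]))  # calibrated <=> mean tracks bucket

# Accuracy split out by tag (Brier score: lower is better).
df["brier"] = (df["probability"] - df["outcome"]) ** 2
print(df.groupby("tag")["brier"].mean().sort_values())

# Accuracy over time.
print(df.groupby(df["date"].dt.year)["brier"].mean())
```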

2DirectedEvolution1mo
I pasted your question into Bing Chat, and it returned Python code to export your PredictionBook data to a CSV file. Haven't tested it. Might be worth looking into this approach?

Consider trying to use Solomonoff induction to reason about P(I see “Canada goes to war with USA” in the next year), emphasis added:

In Solomonoff induction, since we have unlimited computing power, we express our uncertainty about a  video frame the same way. All the various pixel fields you could see if your eye jumped to a plausible place, saw a plausible number of dust specks, and saw the box flash something that visually encoded '14', would have high probability. Pixel fields where the box vanished and was replaced with a glow-in-the-dar

... (read more)

Team shard is now accepting applications for summer MATS. SERI MATS is now accepting applications for their 4.0 program this summer. In particular, consider applying to the shard theory stream, especially if you have the following interests:

Feel free to apply if you're interested in shard theory more generally, although I expect to mostly supervise empirical work. Feel free to message me if you have questi... (read more)

The policy of truth is a blog post about why policy gradient/REINFORCE suck. I'm leaving a shortform comment because it seems like a classic example of wrong RL theory and philosophy, since reward is not the optimization target. Quotes:

Our goal remains to find a policy that maximizes the total reward after  time steps.

 

And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions:

 

If you start with a reward function whose values are in $[0,1]$ and you subtract one million

... (read more)
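For contrast, here's a bare-bones sketch of the estimator the post is describing (the score-function/REINFORCE trick), on a toy bandit of my own construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit; the learner only sees sampled rewards, never this table.
true_rewards = np.array([0.2, 0.5, 0.9])
theta = np.zeros(3)  # logits of a softmax policy over arms

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(5000):
    probs = softmax(theta)
    arm = rng.choice(3, p=probs)
    reward = true_rewards[arm] + rng.normal(0, 0.1)

    # Score-function gradient estimate: reward * grad_theta log pi(arm).
    grad_log_pi = -probs
    grad_log_pi[arm] += 1.0
    theta += 0.05 * reward * grad_log_pi

print(softmax(theta))  # most of the probability mass should land on the 0.9 arm
```

Subtracting a constant from every reward leaves this estimator unbiased (since $\mathbb{E}[\nabla_\theta \log \pi] = 0$) but can blow up its variance, which is exactly the sensitivity the post complains about.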

Shard-theoretic model of wandering thoughts: Why trained agents won't just do nothing in an empty room. If human values are contextually activated subroutines etched into us by reward events (e.g. "If candy nearby and hungry, then upweight actions which go to candy"), then what happens in "blank" contexts? Why don't people just sit in empty rooms and do nothing?

Consider that, for an agent with lots of value shards (e.g. candy, family, thrill-seeking, music), the "doing nothing" context is a very unstable equilibrium. I think these shards will activate on t... (read more)

3Thane Ruthenis8mo
Another point here is that "an empty room" doesn't mean "no context". Presumably when you're sitting in an empty room, your world-model is still active, it's still tracking events that you expect to be happening in the world outside the room — and your shards see them too. So, e. g., if you have a meeting scheduled in a week, and you went into an empty room, after a few days there your world-model would start saying "the meeting is probably soon", and that will prompt your punctuality shard. Similarly, your self-model is part of the world-model, so even if everything outside the empty room were wiped out, you'd still have your "internal context" — and there'd be some shards that activate in response to events in it as well. It's actually pretty difficult to imagine what an actual "no context" situation for realistic agents would look like. I guess you can imagine surgically removing all input channels from the WM to shards, to model this?

I think people go on silence retreats to find out what happens when you take out all the standard busywork. I could imagine the "fresh empty room" and "accustomed empty room" differing like calming down for an hour versus calming down for a week.

Are there any alignment techniques which would benefit from the supervisor having a severed corpus callosum, or otherwise atypical neuroanatomy? Usually theoretical alignment discourse focuses on the supervisor's competence / intelligence. Perhaps there are other, more niche considerations.

Does anyone have tips on how to buy rapid tests in the US? Not seeing any on US Amazon, not seeing any in person back where I'm from. Considering buying German tests. Even after huge shipping costs, it'll come out to ~$12 a test, which is sadly competitive with US market prices.

Wasn't able to easily find tests on the Mexican and Canadian Amazon websites, and other EU countries don't seem to have them either. 

1Markas2y
I've been able to buy from the CVS website several times in the past couple months, and even though they're sold out online now, they have some (sparse) in-store availability listed.  Worth checking there, Walgreens, etc. periodically.

The Baldwin effect

I couldn't find great explanations online, so here's my explanation after a bit of Googling. I welcome corrections from real experts.

Organisms exhibit phenotypic plasticity when they act differently in different environments. The phenotype (manifested traits: color, size, etc) manifests differently, even though two organisms might share the same genotype (genetic makeup). 

Panel 1: organisms are not phenotypically plastic and do not adapt to a spider-filled environment. Panel 2: a plastic organism might do the bad thing, and then lear
... (read more)
3Robbo2y
[disclaimer: not an expert, possibly still confused about the Baldwin effect] A bit of feedback on this explanation: as written, it didn’t make clear to me what makes it a special effect. “Evolution selects for genome-level hardcoding of extremely important learned lessons.” As a reader I was like, what makes this a special case? If it’s a useful lesson, then of course evolution would tend to select for knowing it innately - that does seem handy for an organism. As I understand it, what is interesting about the Baldwin effect is that such hard coding is selected for more among creatures that can learn, and indeed because of learning. The learnability of the solution makes it even more important to be endowed with the solution. So individual learning, in this way, drives selection pressures. Dennett’s explanation emphasizes this - curious what you make of his? https://ase.tufts.edu/cogstud/dennett/papers/baldwincranefin.htm [https://ase.tufts.edu/cogstud/dennett/papers/baldwincranefin.htm]
3TurnTrout2y
Right, I wondered this as well. I had thought its significance was that the effect seemed Lamarckian, but it wasn't. (And, I confess, I made the parent comment partly hoping that someone would point out that I'd missed the key significance of the Baldwin effect. As the joke goes, the fastest way to get your paper spell-checked is to comment it on a YouTube video!) Thanks for this link. One part which I didn't understand is why closeness in learning-space (given your genotype, you're plastic enough to learn to do something) must imply that you're close in genotype-space (evolution has a path of local improvements which implement genetic assimilation of the plastic advantage). I can learn to program computers. Does that mean that, given the appropriate selection pressures, my descendants would learn to program computers instinctively? In a reasonable timeframe? It's not that I can't imagine such evolution occurring. It just wasn't clear why these distance metrics should be so strongly related. Reading the link, Dennett points out this assumption and discusses why it might be reasonable, and how we might test it.

(This is a basic point on conjunctions, but I don't recall seeing its connection to Occam's razor anywhere)

When I first read Occam's Razor back in 2017, it seemed to me that the essay only addressed one kind of complexity: how complex the laws of physics are. If I'm not sure whether the witch did it, the universes where the witch did it are more complex, and so these explanations are exponentially less likely under a simplicity prior. Fine so far.

But there's another type. Suppose I'm weighing whether the United States government is currently engaged in a v... (read more)
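The arithmetic behind that second kind of penalty, with made-up numbers:

```python
# A hypothesis that quietly conjoins many sub-claims, each individually plausible,
# gets exponentially penalized as the number of conjuncts grows.
per_claim_probability = 0.9
for num_claims in [1, 5, 10, 20, 40]:
    print(num_claims, round(per_claim_probability ** num_claims, 4))
# -> 1 0.9, 5 0.5905, 10 0.3487, 20 0.1216, 40 0.0148
```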

2Steven Byrnes3y
I agree with the principle but I'm not sure I'd call it "Occam's razor". Occam's razor is a bit sketchy, it's not really a guarantee of anything, it's not a mathematical law, it's like a rule of thumb or something. Here you have a much more solid argument: multiplying many probabilities into a conjunction makes the result smaller and smaller. That's a mathematical law, rock-solid. So I'd go with that...
2TurnTrout3y
My point was more that "people generally call both of these kinds of reasoning 'Occam's razor', and they're both good ways to reason, but they work differently."
2Steven Byrnes3y
Oh, hmm, I guess that's fair, now that you mention it I do recall hearing a talk where someone used "Occam's razor" to talk about the solomonoff prior. Actually he called it "Bayes Occam's razor" I think. He was talking about a probabilistic programming algorithm. That's (1) not physics, and (2) includes (as a special case) penalizing conjunctions, so maybe related to what you said. Or sorry if I'm still not getting what you meant

At a poster session today, I was asked how I might define "autonomy" from an RL framing; "power" is well-definable in RL, and the concepts seem reasonably similar. 

I think that autonomy is about having many ways to get what you want. If your attainable utility is high, but there's only one trajectory which really makes good things happen, then you're hemmed-in and don't have much of a choice. But if you have many policies which make good things happen, you have a lot of slack and you have a lot of choices. This would be a lot of autonomy.

This has to b... (read more)
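One way to make the above concrete (purely my own speculation, in the obvious notation):

```latex
% Speculative sketch: "autonomy" at state s as the abundance of good options,
% e.g. the number (or measure) of policies that are epsilon-close to optimal.
\[
  \mathrm{Autonomy}_{\epsilon}(s) \;:=\;
  \Big|\,\{\pi \;:\; V^{\pi}(s) \ge V^{*}(s) - \epsilon\}\,\Big|.
\]
% High attainable utility with only one such policy = hemmed in;
% many such policies = lots of slack and real choice.
```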

In Markov decision processes, state-action reward functions seem less natural to me than state-based reward functions, at least if they assign different rewards to equivalent actions. That is, actions $a_1, a_2$ at a state $s$ can have different rewards $R(s, a_1) \neq R(s, a_2)$ even though they induce the same transition probabilities: $T(s, a_1) = T(s, a_2)$. This is unappealing because the actions don't actually have a "noticeable difference" from within the MDP, and the MDP is visitation-distribution-isomorphic to an MDP without the act... (read more)

From unpublished work.

The answer to this seems obvious in isolation: shaping helps with credit assignment, rescaling doesn't (and might complicate certain methods in the advantage vs Q-value way). But I feel like maybe there's an important interaction here that could inform a mathematical theory of how a reward signal guides learners through model space?
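For reference, the standard way to make "shaping" precise is potential-based shaping (Ng, Harada & Russell, 1999), which is known to leave optimal policies unchanged; I don't know whether that's the exact sense meant in the unpublished work:

```latex
% Potential-based reward shaping: for any potential function Phi on states,
% add the discounted change in potential to the reward.
\[
  R'(s, a, s') \;:=\; R(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s).
\]
% This preserves the set of optimal policies while (for a well-chosen Phi) giving
% denser feedback, i.e. easier credit assignment. A positive rescaling R' = c R
% (c > 0) also preserves optimal policies but adds no new signal.
```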

Reasoning about learned policies via formal theorems on the power-seeking incentives of optimal policies

One way instrumental subgoals might arise in actual learned policies: we train a proto-AGI reinforcement learning agent with a curriculum including a variety of small subtasks. The current theorems show sufficient conditions for power-seeking tending to be optimal in fully-observable environments; many environments meet these sufficient conditions; optimal policies aren't hard to compute for the subtasks. One highly transferable heuristic would therefore... (read more)

I prompted GPT-3 with modified versions of Eliezer's Beisutsukai stories, where I modified the "class project" to be about solving intent alignment instead of quantum gravity. 

... Taji looked over his sheets. "Okay, I think we've got to assume that every avenue that Eld science was trying is a blind alley, or they would have found it. And if this is possible to do in one month, the answer must be, in some sense, elegant. So no human mistake models. If we start doing anything that looks like we should call it 'utility function patching', we'd better st

... (read more)

Transparency Q: how hard would it be to ensure a neural network doesn't learn any explicit NANDs?

Physics has existed for hundreds of years. Why can you reach the frontier of knowledge with just a few years of study? Think of all the thousands of insights and ideas and breakthroughs that have been had - yet, I do not imagine you need most of those to grasp modern consensus.

Idea 1: the tech tree is rather horizontal - for any given question, several approaches and frames are tried. Some are inevitably more attractive or useful. You can view a Markov decision process in several ways - through the Bellman equations, through the structure of the state

... (read more)
9Viliam3y
Could this depend on your definition of "physics"? Like, if you use a narrow definition like "general relativity + quantum mechanics", you can learn that in a few years. But if you include things like electricity, expansion of universe, fluid mechanics, particle physics, superconductors, optics, string theory, acoustics, aerodynamics... most of them may be relatively simple to learn, but all of them together it's too much.
4TurnTrout3y
Maybe. I don't feel like that's the key thing I'm trying to point at here, though. The fact that you can understand any one of those in a reasonable amount of time is still surprising, if you step back far enough.

When under moral uncertainty, rational EV maximization will look a lot like preserving attainable utility / choiceworthiness for your different moral theories / utility functions, while you resolve that uncertainty.

3MichaelA3y
This seems right to me, and I think it's essentially the rationale for the idea of the Long Reflection [https://forum.effectivealtruism.org/posts/H2zno3ggRJaph9P6c/quotes-about-the-long-reflection].

To prolong my medicine stores by 200%, I've mixed in similar-looking iron supplement placebos with my real medication. (To be clear, nothing serious happens to me if I miss days)
