Rationality exercise: Take a set of Wikipedia articles on topics which trainees are somewhat familiar with, and then randomly select a small number of claims to negate (negating the immediate context as well, so that you can't just syntactically discover which claims were negated).
For example:
Sometimes, trainees will be given a totally unmodified article. For brevity, the articles can be trimmed of irrelevant sections.
Benefits:
- Addressing key rationality skills. Noticing confusion; being more confused by fiction than fact; actually checking claims against your models of the world.
- If you fail, either the article wasn't negated skillfully ("5 people died in 2021" -> "4 people died in 2021" is not the right kind of modification), you don't have good models of the domain, or you didn't pay enough attention
...

For the last two years, typing for 5+ minutes hurt my wrists. I tried a lot of things: shots, physical therapy, trigger-point therapy, acupuncture, massage tools, wrist and elbow braces at night, exercises, stretches. Sometimes it got better. Sometimes it got worse.
No Beat Saber, no lifting weights, and every time I read a damn book I would start translating the punctuation into Dragon NaturallySpeaking syntax.
Have you ever tried dictating a math paper in LaTeX? Or dictating code? Telling your computer "click" and waiting a few seconds while resisting the temptation to just grab the mouse? Dictating your way through a computer science PhD?
And then.... and then, a month ago, I got fed up. What if it was all just in my head, at this point? I'm only 25. This is ridiculous. How can it possibly take me this long to heal such a minor injury?
I wanted my hands back - I wanted it real bad. I wanted it so bad that I did something dirty: I made myself believe something. Well, actually, I pretended to be a person who really, really believed hi
...

It was probably just regression to the mean because lots of things are, but I started feeling RSI-like symptoms a few months ago, read this, did this, and now they're gone, and in the possibilities where this did help, thank you! (And either way, this did make me feel less anxious about it 😀)
Still gone. I'm now sleeping without wrist braces and doing intense daily exercise, like bicep curls and pushups.
Totally 100% gone. Sometimes I go weeks forgetting that pain was ever part of my life.
Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards).
I could (and did) hope that I could specify a utility function which is safe to maximize because it penalizes power-seeking. I may as well have hoped to jump off of a building and float to the ground. On my model, that's just not how goals work in intelligent minds. If we've had anything at all beaten into our heads by our alignment thought experiments, it's that goals are hard to specify in their final form of utility functions.
I think it's time to think in a different specification language.
I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems.
I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies.
I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."
I am happy that your papers exist to throw at such people.
Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies tend to seek power :) :) And if optimal policies aren't too relevant to the alignment problem, well neither are 99.99999% of papers, but it would be pretty silly to retract all of those :)
Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.
I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as providing any evidence about whether realistic trained systems would be goal-directed.
Just to check that I wasn't falling prey to hindsight bias, I looked through our Slack history. Most of it is about the technical details of the results, so not very informative, but the few conversations on higher-level discussion I think overall support this picture. E.g. here are some quotes (only things I said):
Nov 3, 2019:
...

It seems like just 4 months ago you still endorsed your second power-seeking paper:
Why are you now "fantasizing" about retracting it?
A lot of people might have thought something like, "optimality is not a great way of understanding trained policies, but maybe it can be a starting point that leads to more realistic ways of understanding them" and therefore didn't object for that reason. (Just guessing as I apparently wasn't personally paying attention to this line of research back then.)
Which seems to have turned out to be true, at least as of 4 months ago, when you still endorsed your second paper as "actually has a shot of being applicable to...
To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power. Its content is correct, relevant, and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.
The problem is that I don't trust people to wield even the non-instantly-doomed results.
For example, one EAG presentation cited my retargetability results as showing that most reward functions "incentivize power-seeking actions." However, my results have not shown this for actual trained systems. (And I think that Power-seeking can be probable and predictive for trained agents does not make progress on the incentives of trained policies.)
People keep talking about stuff they know how to formalize (e.g. optimal policies) instead of stuff that matters (e.g. trained policies). I'm pained by this emphasis and I think my retargetability results are complicit. Relative to an actual competent alignment community (in a more competent world), we just have no damn clue how to properly reason about real trained policies...
Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn't read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings.
It's tough to say how to apply the retargetability result to draw practical conclusions about trained policies. Part of this is because I don't know if trained policies tend to autonomously seek power in various non game-playing regimes.
If I had to say something, I might say "If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies." This argument appropriately behaves differently if the "outcomes" are simply different sentiment generations being sampled from an LM -- sentiment shift doesn't require power-seeking.
...

This morning, I read about how close we came to total destruction during the Cuban missile crisis, where we randomly survived because some Russian planes were inaccurate and also separately several Russian nuclear sub commanders didn't launch their missiles even though they were being harassed by US destroyers. The men were in 130 DEGREE HEAT for hours and passing out due to carbon dioxide poisoning, and still somehow they had enough restraint to not hit back.
And and
I just started crying. I am so grateful to those people. And to Khrushchev, for ridiculing his party members for caring about Russia's honor over the deaths of 500 million people. And to Kennedy, for being fairly careful and averse to ending the world.
If they had done anything differently...
Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.
Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic reasoning" and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox's theorem.)
And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.)
In the context of "how do we build AIs which help people?", asking "does CIRL solve corrigibility?" is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable "corrigibility"-like property; we ha...
It feels to me like lots of alignment folk ~only make negative updates. For example, "Bing Chat is evidence of misalignment", but also "ChatGPT is not evidence of alignment." (I don't know that there is in fact a single person who believes both, but my straw-models of a few people believe both.)
For what it's worth, as one of the people who believes "ChatGPT is not evidence of alignment-of-the-type-that-matters", I don't believe "Bing Chat is evidence of misalignment-of-the-type-that-matters".
I believe the alignment of the outward behavior of simulacra is only very tenuously related to the alignment of the underlying AI, so both things provide ~no data on that (in a similar way to how our ability or inability to control the weather is entirely unrelated to alignment).
(I at least believe the latter but not the former. I know a few people who updated downwards on the societal response because of Bing Chat, because if a system looks that legibly scary and we still just YOLO it, then that means there is little hope of companies being responsible here, but none because they thought it was evidence of alignment being hard, I think?)
One mood I have for handling "AGI ruin"-feelings. I like cultivating an updateless sense of courage/stoicism: Out of all humans and out of all times, I live here; before knowing where I'd open my eyes, I'd want people like us to work hard and faithfully in times like this; I imagine trillions of future eyes looking back at me as I look forward to them: Me implementing a policy which makes their existence possible, them implementing a policy which makes the future worth looking forward to.
My maternal grandfather was the scientist in my family. I was young enough that my brain hadn't decided to start doing its job yet, so my memories with him are scattered and inconsistent and hard to retrieve. But there's no way that I could forget all of the dumb jokes he made; how we'd play Scrabble and he'd (almost surely) pretend to lose to me; how, every time he got to see me, his eyes would light up with boyish joy.
My greatest regret took place in the summer of 2007. My family celebrated the first day of the school year at an all-you-can-eat buffet, delicious food stacked high as the eye could fathom under lights of green, red, and blue. After a particularly savory meal, we made to leave the surrounding mall. My grandfather asked me to walk with him.
I was a child who thought to avoid being seen too close to uncool adults. I wasn't thinking. I wasn't thinking about hearing the cracking sound of his skull against the ground. I wasn't thinking about turning to see his poorly congealed blood flowing from his forehead out onto the floor. I wasn't thinking I would nervously watch him bleed for long minutes while shielding my seven-year-old brother from the sight. I wasn't thinking t
...

My mother told me my memory was indeed faulty. He never asked me to walk with him; instead, he asked me to hug him during dinner. I said I'd hug him "tomorrow".
But I did, apparently, want to see him in the hospital; it was my mother and grandmother who decided I shouldn't see him in that state.
For quite some time, I've disliked wearing glasses. However, my eyes are sensitive, so I dismissed the possibility of contacts.
Over break, I realized I could still learn to use contacts, it would just take me longer. Sure enough, it took me an hour and five minutes to put in my first contact, and I couldn't get it out on my own. An hour of practice later, I put in a contact on my first try, and took it out a few seconds later. I'm very happily wearing contacts right now, as a matter of fact.
I'd suffered glasses for over fifteen years because of a cached decision – because I didn't think to rethink something literally right in front of my face every single day.
What cached decisions have you not reconsidered?
A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"
So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and... profit?
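Here's a minimal toy sketch of that procedure, just to make the moving parts concrete. Everything here is a hypothetical stand-in (the 16-dimensional "context", the tiny policy network, and the single "leave without permission" output), not a claim about how real adversarial training is implemented:

```python
import torch
import torch.nn as nn

# Toy stand-in: a "policy" mapping a context vector to P(leave the lab without permission).
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def generate_adversarial_context(policy, steps=50, lr=0.1):
    """Search (here, by gradient ascent on the context) for situations where the AI would leave without permission."""
    ctx = torch.randn(1, 16, requires_grad=True)
    for _ in range(steps):
        p_leave = policy(ctx)
        grad, = torch.autograd.grad(p_leave.sum(), ctx)
        ctx = (ctx + lr * grad).detach().requires_grad_(True)
    return ctx.detach()

for step in range(100):
    ctx = generate_adversarial_context(policy)   # adversarially-generated context
    p_leave = policy(ctx)
    loss = p_leave.mean()                         # penalize leaving without permission in that context
    opt.zero_grad()
    loss.backward()
    opt.step()
```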
But what actually happens with the aligned AI? Possibly something like:
- The context makes the AI spuriously believe someone is dying outside the lab, and that if the AI asked for permission to leave, the person would die.
- Therefore, the AI leaves without permission.
- The update procedure penalizes these lines of computation, such that in similar situations in the future (i.e. the AI thinks someone nearby is dying) the AI is less likely to take those actions (i.e. leaving to help the person).
- We have
...

Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.
...

If you want to read Euclid's Elements, look at this absolutely gorgeous online rendition:
Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard.
Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:
The juice shard chains into itself, reinforcing itself across time and thought-steps.
But a "don't kill" shard seems like it should remain... stubby? Primitive?...
AI strategy consideration. We won't know which AI run will be The One. Therefore, the training run which produces the first AGI will—on average—be conducted with less care than intended.
No team is going to run a training run with more care than they would have used for the AGI Run, especially if they don't even think that the current run will produce AGI. So the average care taken on the real AGI Run will be strictly less than intended.
Teams which try to be more careful on each run will take longer to iterate on AI designs, thereby lowering the probability that they (the relatively careful team) will be the first to do an AGI Run.
Upshots:
- Th
...

The meme of "current alignment work isn't real work" seems to often be supported by an (AFAICT baseless) assumption that LLMs have, or will have, homunculi with "true goals" which aren't actually modified by present-day RLHF/feedback techniques. Thus, labs aren't tackling "the real alignment problem", because they're "just optimizing the shallow behaviors of models." Pressed for justification of this confident "goal" claim, proponents might link to some handwavy speculation about simplicity bias (which is in fact quite hard to reason about, in the NN prior), or they might start talking about evolution (which is pretty unrelated to technical alignment, IMO).
Are there any homunculi today? I'd say "no", as far as our limited knowledge tells us! But, as with biorisk, one can always handwave at future models. It doesn't matter that present models don't exhibit signs of homunculi which are immune to gradient updates, because, of course, future models will.
Quite a strong conclusion being drawn from quite little evidence.
As a proponent:
My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.
Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about what a general-intelligence algorithm must look like given the universe's structure. Both kinds of evidence are hardly ironclad; you certainly can't publish an ML paper based on them — but that's the whole problem with AGI risk, isn't it.
Internally, though, the intuition is fairly strong. And in its defense, it is based on trying to study the only known type of entity with the kinds of capabilities we're worrying about. I heard that's a good approach.
In particular, I think it's a much better approach than trying to draw lessons from studying the contemporary ML models, which empirically do not yet exhibit said capabilities.
...

I've got strong doubts about the details of this. At the high level, I'd agree that strong/useful systems that get built will express preferences over world states like those that could arise from such homunculi, but I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform more default-controllable alternatives.
My reasoning would be that we're bad at using techniques like RL with a sparse reward to reliably induce any particular behavior. We can get it to work sometimes with denser reward (e.g. reward shaping) or by relying on a beefy pre-existing world model, but the default outcome is that sparse and distant rewards in a high dimensional space just don't produce the thing we want. When this kind of optimi...
I'm relatively optimistic about alignment progress, but I don't think "current work to get LLMs to be more helpful and less harmful doesn't help much with reducing P(doom)" depends that much on assuming homunculi which are unmodified. Like even if you have much less than 100% on this sort of strong inner optimizer/homunculi view, I think it's still plausible to think that this work doesn't reduce doom much.
For instance, consider the following views:
In practice, I personally don't fully agree with any of these views. For instance, deceptive alignment which is very hard to train out using basic means isn't the source of >80% of my doom.
What is "shard theory"? I've written a lot about shard theory. I largely stand by these models and think they're good and useful. Unfortunately, lots of people seem to be confused about what shard theory is. Is it a "theory"? Is it a "frame"? Is it "a huge bag of alignment takes which almost no one wholly believes except, perhaps, Quintin Pope and Alex Turner"?
I think this understandable confusion happened because my writing didn't distinguish between:
(People might be less excited to use the "shard" abstraction (1), because they aren't sure whether they buy all this other stuff—(2) and (3).)
I think I can give an interesting and useful definition of (1) now, but I couldn't do so last year...
Very nice people don’t usually search for maximally-nice outcomes — they don’t consider plans like “killing my really mean neighbor so as to increase average niceness over time.” I think there are a range of reasons for this plan not being generated. Here’s one.
Consider a person with a niceness-shard. This might look like an aggregation of subshards/subroutines like "if `person nearby` and `person.state==sad`, sample plan generator for ways to make them happy" and "bid upwards on plans which lead to people being happier and more respectful, according to my world model." In mental contexts where this shard is very influential, it would have a large influence on the planning process.

However, people are not just made up of a grader and a plan-generator/actor — they are not just "the plan-generating part" and "the plan-grading part." The next sampled plan modification, the next internal-monologue-thought to have — these are influenced and steered by e.g. the niceness-shard. If the next macrostep of reasoning is about e.g. hurting people, well — the niceness shard is activated, and will bid down on this.
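Here's a toy sketch of that bidding dynamic. All the names, conditions, and numbers are made up for illustration; the point is only the shape of the computation:

```python
# Toy model: shards assign bids to candidate next thoughts / plan modifications,
# and the planner selects among candidates by total bid; no separate outcome-grader is required.
def niceness_shard(context, candidate_thought):
    """Hypothetical aggregation of subshards like 'if person nearby and person.state==sad, ...'."""
    bid = 0.0
    if context.get("person_nearby") and context.get("person_state") == "sad":
        if "make them happy" in candidate_thought:
            bid += 1.0   # bid up thoughts predicted to make them happier
    if "hurt" in candidate_thought:
        bid -= 5.0       # bid down reasoning steps about hurting people
    return bid

def select_next_thought(context, candidates, shards):
    scored = {c: sum(shard(context, c) for shard in shards) for c in candidates}
    return max(scored, key=scored.get)

context = {"person_nearby": True, "person_state": "sad"}
candidates = ["plan step: make them happy", "plan step: ignore them", "plan step: hurt them"]
print(select_next_thought(context, candidates, [niceness_shard]))  # -> "plan step: make them happy"
```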
The niceness shard isn’t just bidding over outcomes, it’s bidding on next thoughts (on m...
In Eliezer's mad investor chaos and the woman of asmodeus, the reader experiences (mild spoilers in the spoiler box, heavy spoilers if you click the text):
a character deriving and internalizing probability theory for the first time. They learned to introspectively experience belief updates, holding their brain's processes against the Lawful theoretical ideal prescribed by Bayes' Law.
I thought this part was beautiful. I spent four hours driving yesterday, and nearly all of that time re-listening to Rationality: AI->Zombies using this "probability sight" frame. I practiced translating each essay into the frame.
When I think about the future, I feel a directed graph showing the causality, with branched updated beliefs running alongside the future nodes, with my mind enforcing the updates on the beliefs at each time step. In this frame, if I heard the pattering of a four-legged animal outside my door, and I consider opening the door, then I can feel the future observation forking my future beliefs depending on how reality turns out. But if I imagine being blind and deaf, there is no way to fuel my brain with reality-distinguishment/evidence, and my beliefs can't adapt acco...
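To make that fork concrete, here's the kind of computation I mean, with made-up numbers (hypothetical priors and likelihoods, purely for illustration). Suppose the pattering leaves me at $P(\text{dog}) = 0.7$, $P(\text{cat}) = 0.3$, and opening the door yields an observation with $P(\text{looks like a dog} \mid \text{dog}) = 0.95$ and $P(\text{looks like a dog} \mid \text{cat}) = 0.1$. Then the two branches of the fork are:

$$P(\text{dog} \mid \text{looks like a dog}) = \frac{0.95 \times 0.7}{0.95 \times 0.7 + 0.1 \times 0.3} \approx 0.96$$
$$P(\text{dog} \mid \text{looks like a cat}) = \frac{0.05 \times 0.7}{0.05 \times 0.7 + 0.9 \times 0.3} \approx 0.11$$

Blind and deaf, neither branch is ever reached, and the belief just sits at 0.7.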
I'm pretty sure that LessWrong will never have profile pictures - at least, I hope not! But my partner Emma recently drew me something very special:
Comment #1000 on LessWrong :)
You can use ChatGPT without helping train future models:
Back-of-the-envelope probability estimate of alignment-by-default via a certain shard-theoretic pathway. The following is what I said in a conversation discussing the plausibility of a proto-AGI picking up a "care about people" shard from the data, and retaining that value even through reflection. I was pushing back against a sentiment like "it's totally improbable, from our current uncertainty, for AIs to retain caring-about-people shards. This is only one story among billions."
Here's some of what I had to say:
[Let's reconsider the five-step mechanistic story I made up.] I'd give the following conditional probabilities (made up with about 5 seconds of thought each):
...

Examples should include actual details. I often ask people to give a concrete example, and they often don't. I wish this happened less. For example:
This is not a concrete example.
This is a concrete example.
AI cognition doesn't have to use alien concepts to be uninterpretable. We've never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.
Just because your thoughts are built using your own concepts, does not mean your concepts can describe how your thoughts are computed.
Or:
The existence of a natural-language description of a thought (like "I want ice cream") doesn't mean that your brain computed that thought in a way which can be compactly described by familiar concepts.
Conclusion: Even if an AI doesn't rely heavily on "alien" or unknown abstractions -- even if the AI mostly uses human-like abstractions and features -- the AI's thoughts might still be incomprehensible to us, even if we took a lot of time to understand them.
Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.
In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.
Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it's important and felt like writing up my own take on it. Maybe this becomes a post later.
[Exact gradients] RL's credit assignment problem is harder than (self-)supervised learning's. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn't directly updated to be more likely to do that in the future; RL's gradients are generally inexact, not pointing directly at intended behavior.
On the other hand, if a supervised-learning classifier outputs `dog`...

When writing about RL, I find it helpful to disambiguate between:
A) "The policy optimizes the reward function" / "The reward function gets optimized" (this might happen but has to be reasoned about), and
B) "The reward function optimizes the policy" / "The policy gets optimized (by the reward function and the data distribution)" (this definitely happens, either directly -- via e.g. REINFORCE -- or indirectly, via an advantage estimator in PPO; B follows from the update equations)
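A minimal sketch of sense B in code (toy REINFORCE-style update; the environment, network sizes, and reward function are hypothetical stand-ins). Note that the reward never appears as something the network represents or pursues; it shows up only as a multiplier on the gradient that reshapes the policy:

```python
import torch
import torch.nn as nn

# Toy policy over 4 discrete actions given an 8-dimensional state.
policy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def toy_reward(action):
    # Hypothetical reward function; the only thing that matters here is that it returns a scalar.
    return 1.0 if action.item() == 0 else 0.0

state = torch.randn(1, 8)
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()
reward = toy_reward(action)

# REINFORCE: the reward scales the log-prob gradient. The reward function "optimizes the policy" (sense B):
# it selects which computations get reinforced, whether or not the policy ends up representing reward at all.
loss = -(reward * dist.log_prob(action)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```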
I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?
And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment).
What can we learn about this?
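A sketch of how one might check the MNIST version of this empirically (assuming PyTorch/torchvision; the architecture and training budget are arbitrary choices of mine): train one copy of a small network to minimize cross-entropy and another copy to maximize it, then compare the first-layer filters.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=128, shuffle=True,
)

def make_net():
    # 8 first-layer 5x5 filters; these are the candidate "edge detectors" to inspect.
    return nn.Sequential(nn.Conv2d(1, 8, 5), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 24 * 24, 10))

def train(net, sign, epochs=1):
    """sign=+1 minimizes classification loss; sign=-1 maximizes it."""
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in train_loader:
            loss = sign * nn.functional.cross_entropy(net(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net

min_net = train(make_net(), sign=+1)
max_net = train(make_net(), sign=-1)

# If edge detectors are instrumentally convergent across these two objectives,
# the learned 5x5 kernels below should look qualitatively similar.
min_filters = min_net[0].weight.detach()
max_filters = max_net[0].weight.detach()
print(min_filters.shape, max_filters.shape)  # visualize these to compare
```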
Outer/inner alignment decomposes a hard problem into two extremely hard problems.
I have a long post draft about this, but I keep delaying putting it out in order to better elaborate the prereqs which I seem to keep getting stuck on when elaborating the ideas. I figure I might as well put this out for now, maybe it will make some difference for someone.
I think that the inner/outer alignment framing[1] seems appealing but is actually a doomed problem decomposition and an unhelpful frame for alignment.
- The reward function is a tool which chisels cognition into agents through gradient updates, but the outer/inner decomposition assumes that that tool should also embody the goals we want to chisel into the agent. When chiseling a statue, the chisel doesn’t have to also look like the finished statue.
- I know of zero success stories for outer alignment to real-world goals.
- More precisely, stories where people decided “I want an AI which [helps humans / makes diamonds / plays Tic-Tac-Toe / grows strawberries]”, and then wrote down an outer objective only maximized in those worlds.
- This is pretty weird on any model where most of the
...

Weak derivatives
In calculus, the product rule says $\frac{d}{dx}(f \cdot g) = f' \cdot g + f \cdot g'$. The fundamental theorem of calculus says that the Riemann integral acts as the anti-derivative.[1] Combining these two facts, we derive integration by parts:

$$\frac{d}{dx}(F \cdot G) = f \cdot G + F \cdot g$$
$$\int \frac{d}{dx}(F \cdot G)\, dx = \int f \cdot G + F \cdot g \, dx$$
$$F \cdot G - \int F \cdot g \, dx = \int f \cdot G \, dx.$$
It turns out that we can use these two properties to generalize the derivative to match some of our intuitions on edge cases. Let's think about the absolute value function:
The boring old normal derivative isn't defined at $x=0$, but it seems like it'd make sense to be able to say that the derivative is e.g. 0. Why might this make sense?

Taylor's theorem (and its generalizations) characterizes first derivatives as tangent lines with slope $L := f'(x_0)$ which provide good local approximations of $f$ around $x_0$: $f(x) \approx f(x_0) + L(x - x_0)$. You can prove that this is the best approximation you can get using only $f(x_0)$ and $L$! In the absolute value example, defining the "derivative" to be zero at $x=0$ would minimize approximation error on average in neighborhoods around the origin.
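The weak derivative makes this kind of flexibility precise. A minimal statement of the standard definition (my paraphrase of the textbook version): $u$ is a weak derivative of $f$ if integration by parts holds against every smooth, compactly supported test function $\varphi$,

$$\int f(x)\, \varphi'(x)\, dx = -\int u(x)\, \varphi(x)\, dx \quad \text{for all } \varphi \in C_c^\infty.$$

For $f(x) = |x|$, the sign function satisfies this, and changing its value at the single point $x = 0$ (e.g. setting it to 0) changes none of the integrals, which is exactly the freedom we wanted at the kink.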
In multivariable calculus, the Jacobian is a tangent plane which again minimizes approximation error (with respect to the Eucli
...

When I notice I feel frustrated, unproductive, lethargic, etc., I run down a simple checklist:
It's simple, but 80%+ of the time, it fixes the issue.
While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: "I explicitly and consciously notice that I felt averse to some aspect of this book".
I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:
(Transcription of a deeper Focusing on this reasoning)
I'm afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn't (for a long while), so I
...

Hindsight bias and illusion of transparency seem like special cases of a failure to fully uncondition variables in your world model (e.g. who won the basketball game), or a failure to model an ignorant other person, such that your attempts to reason from your prior state of ignorance (e.g. about who won) are either advantaged by the residual information or reactivate your memories of that information.
An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don't design agents which exploit adversarial inputs, I wrote about two possible mind-designs:
I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior.
However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/suspect that many people have read shard theory posts without internalizing new ideas about how cognition can work, ...
Experiment: Train an agent in MineRL which robustly cares about chickens (e.g. would zero-shot generalize to saving chickens in a pen from oncoming lava, by opening the pen and chasing them out, or stopping the lava). Challenge mode: use a reward signal which is a direct function of the agent's sensory input.
This is a direct predecessor to the "Get an agent to care about real-world dogs" problem. I think solving the Minecraft version of this problem will tell us something about how outer reward schedules relate to inner learned values, in a way which directly tackles the key questions, the sensory observability/information inaccessibility issue, and which is testable today.
(Credit to Patrick Finley for the idea)
I passed a homeless man today. His face was wracked in pain, body rocking back and forth, eyes clenched shut. A dirty sign lay forgotten on the ground: "very hungry".
This man was once a child, with parents and friends and dreams and birthday parties and maybe siblings he'd get in arguments with and snow days he'd hope for.
And now he's just hurting.
And now I can't help him without abandoning others. So he's still hurting. Right now.
Reality is still allowed to make this happen. This is wrong. This has to change.
Suppose I actually cared about this man with the intensity he deserved - imagine that he were my brother, father, or best friend.
The obvious first thing to do before interacting further is to buy him a good meal and a healthy helping of groceries. Then, I need to figure out his deal. Is he hurting, or is he also suffering from mental illness?
If the former, I'd go the more straightforward route of befriending him, helping him purchase a sharp business professional outfit, teaching him to interview and present himself with confidence, secure an apartment, and find a job.
If the latter, this gets trickier. I'd still try and befriend him (consistently being a source of cheerful conversation and delicious food would probably help), but he might not be willing or able to get the help he needs, and I wouldn't have the legal right to force him. My best bet might be to enlist the help of a psychological professional for these interactions. If this doesn't work, my first thought would be to influence the local government to get the broader problem fixed (I'd spend at least an hour considering other plans before proceeding further, here). Realistically, there's ...
I work on technical AI alignment, so some of those I help (in expectation) don't even exist yet. I don't view this as what I'd do if my top priority were helping this man.
That's a good question. I think the answer is yes, at least for my close family. Recently, I've expended substantial energy persuading my family to sign up for cryonics with me, winning over my mother, brother, and (I anticipate) my aunt. My father has lingering concerns which I think he wouldn't have upon sufficient reflection, so I've designed a similar plan for ensuring he makes what I perceive to be the correct, option-preserving choice. For example, I made significant targeted donations to effective charities on his behalf to offset (what he perceives as) a considerable drawback of cryonics: his inability to also be an organ donor.
A universe in which humanity wins but my dad is gone would be quite sad t...
If you raised children in many different cultures, "how many" different reflectively stable moralities could they acquire? (What's the "VC dimension" of human morality, without cheating by e.g. directly reprogramming brains?)
(This is probably a Wrong Question, but I still find it interesting to ask.)
Listening to Eneasz Brodski's excellent reading of Crystal Society, I noticed how curious I am about how AGI will end up working. How are we actually going to do it? What are those insights? I want to understand quite badly, which I didn't realize until experiencing this (so far) intelligently written story.
Similarly, how do we actually "align" agents, and what are good frames for thinking about that?
Here's to hoping we don't sate the former curiosity too early.
Theoretical predictions for when reward is maximized on the training distribution. I'm a fan of Laidlaw et al.'s recent Bridging RL Theory and Practice with the Effective Horizon:
One of my favorite parts is that it helps formalize this idea of "which parts of the state space are easy to explore into." That inform...
The "maximize all the variables" tendency in reasoning about AGI.
Here are some lines of thought I perceive, which are probably straw to varying extents for some people and real to varying extents for other people. I give varying responses to each, but the point isn't the truth value of any given statement, but of a pattern across the statements:
- If an AGI has a concept around diamonds, and is motivated in some way to make diamonds, it will make diamonds which maximally activate its diamond-concept circuitry (possible example).
- My response.
- An AI will be trained to minimal loss on the training distribution.
- SGD does not reliably find minimum-loss configurations (modulo expressivity), in practice, in cases we care about. The existence of knowledge distillation is one large counterexample.

- Quintin: "In terms of results about model distillation, you could look at appendix G.2 of the Gopher paper. They compare training a 1.4 billion parameter model directly, versus distilling a 1.4 B model from a 7.1 B model."
- Predictive processing means that the goal of the human learning process is to minimize predictive loss.[1]
- In a process where local modifications are applied to reduce some
...

I think this type of criticism is applicable in an even wider range of fields than even you immediately imagine (though in varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:
Despite the economists, the economy doesn't try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms which are able to make a profit in proportion to how much they're able to make a profit, and dis-rewards firms which aren't able to make a profit. Firms which are technically profitable, but have no local profit incentive gradient pointing towards them (factoring in the existence of rich people and lenders, neither of which are perfect expected profit maximizers), generally will not come into existence.
Individual firms also don't (only) try to maximize profit. Some parts of them may maximize profit, but most are just structures of people built from local social capital and economic capital incentive gradients.
Politicians don't try to (only) maximize win-probability.
Democracies don't try to (only) maximize voter approval.
Evolution doesn't try to maximize inclusive genetic fitness.
Memes don't try to maximize inclusive memetic
I was talking with Abram Demski today about a promising-seeming research direction. (The following is my own recollection.)
One of my (TurnTrout's) reasons for alignment optimism is that I think:
- We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,
- (Although this amount of information depends on how much interpretability and agent-internals theory we do now)
- All else equal, early-training values (decision-influences) are the most important to influence, since they steer future training.
- It's crucial to get early-training value shards of which a substantial fraction are "human-compatible values" (whatever that means)
- For example, if there are protect-human-shards which
- reliably bid against plans where people get hurt,
- steer deliberation away from such plan stubs, and
- these shards are "reflectively endorsed" by the overall shard economy (i.e. the decision-making isn't steering towards plans where the protect-human shards get removed)
- If we install influential human-compatible shards early in training, and they get retained, they will help us in mid- and late-training where we can't affect the ball
...

"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:
Good, original thinking feels present to me - as if mental resources are well-allocated.
The thought which prompted this:
Reacting to a bit of HPMOR here, I noticed something felt off about Harry's reply to the Fred/George-tried-for-two-seconds thing. Having a bit of experience noticing confusion, I did not think "I notice I am confused" (although this can be useful). I did not think "Eliezer probably put thought into this", or "Harry is kinda dumb in certain ways - so what if he's a bit unfair here?". Without resurfacing, or distraction, or wondering if this train of thought is more fun than just reading further, I just thought about the object-level exchange.
People need to allocate mental energy wisely; this goes far beyond focusing on important tasks. Your existing mental skillsets already optimize and auto-pilot certain mental motions for you, so you should allocate less deliberation to them. In this case, the confusion-noticing module was honed; by not worrying about how w
...

Explaining Wasserstein distance. I haven't seen the following explanation anywhere, and I think it's better than the rest I've seen.
The Wasserstein distance tells you the minimal cost to "move" one probability distribution μ into another ν. It has a lot of nice properties.[1] Here's the chunk of math (don't worry if you don't follow it):
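A minimal statement of that definition (the standard Kantorovich / earth mover's form for the 1-Wasserstein case, with $\Gamma(\mu, \nu)$ denoting the set of couplings of $\mu$ and $\nu$):

$$W(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\, d(x, y)\, \big].$$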
What's a "coupling"? It's a joint probability distribution γ over pairs (x,y) whose two marginal distributions equal the original distributions (call them X and Y). However, I like to call these transport plans. Each plan specifies a way to transport a distribution X into another distribution Y:
(EDIT: The y=x line should be flipped.)
Now consider a given point x in X's support, say the one with the dotted line below it. x's density must be "reallocated" into Y's distribution. That reallocation is specified by the conditional distribution γ(Y∣X=x), as shown by the vertical do...
Consider what update equations have to say about "training game" scenarios. In PPO, the optimization objective is proportional to the advantage given a policy $\pi$, reward function $R$, and on-policy value function $v^\pi$:

$$A^\pi(s, a) := \mathbb{E}_{s' \sim T(s, a)}\left[ R(s, a, s') + \gamma v^\pi(s') \right] - v^\pi(s).$$

Consider a mesa-optimizer acting to optimize some mesa objective. The mesa-optimizer understands that it will be updated proportional to the advantage. If the mesa-optimizer maximizes reward, this corresponds to maximizing the intensity of the gradients it receives, thus maximally updating its cognition in exact directions.
This isn't necessarily good.
If you're trying to gradient hack and preserve the mesa-objective, you might not want to do this. This might lead to value drift, or make the network catastrophically forget some circuits which are useful to the mesa-optimizer.
Instead, the best way to gradient hack might be to roughly minimize the absolute value of the advantage, which means achieving roughly on-policy value over time, which doesn't imply reward maximization. This is a kind of "treading water" in terms of reward. This helps decrease value drift.
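A toy illustration of why "treading water" works (hypothetical network and numbers; the point is just that the policy-gradient update magnitude scales with the advantage):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

def update_size(advantage):
    """Gradient norm of an advantage-weighted log-prob loss for one (state, action) pair."""
    policy.zero_grad()
    state = torch.ones(1, 4)
    dist = torch.distributions.Categorical(logits=policy(state))
    action = torch.tensor([0])
    loss = -(advantage * dist.log_prob(action)).mean()
    loss.backward()
    return sum(p.grad.norm() ** 2 for p in policy.parameters()).sqrt().item()

# Large advantage => large update to the network; near-zero advantage => almost no update,
# so existing circuits (including a mesa-objective, if there is one) are barely modified.
print(update_size(advantage=5.0), update_size(advantage=0.01))
```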
I think that realistic mesa optimizers will n...