Rationality exercise: Take a set of Wikipedia articles on topics which trainees are somewhat familiar with, and then randomly select a small number of claims to negate (negating the immediate context as well, so that you can't just syntactically discover which claims were negated).
By the time they are born, infants can recognize and have a preference for their mother's voice suggesting some prenatal development of auditory perception.
-> modified to
Contrary to early theories, newborn infants are not particularly adept at picking out their mother's voice from other voices. This suggests the absence of prenatal development of auditory perception.
Sometimes, trainees will be given a totally unmodified article. For brevity, the articles can be trimmed of irrelevant sections.
Benefits:
In an alternate universe, someone wrote a counterpart to There's No Fire Alarm for Artificial General Intelligence:
...Okay, let’s be blunt here. I don’t think most of the discourse about alignment being really hard is being generated by models of machine learning at all. I don’t think we’re looking at wrong models; I think we’re looking at no models.
I was once at a conference where there was a panel full of famous AI alignment luminaries, and most of the luminaries were nodding and agreeing with each other that of course AGI alignment is really hard and unaddressed by modern alignment research, except for two famous AI luminaries who stayed quiet and let others take the microphone.
I got up in Q&A and said, “Okay, you’ve all told us that alignment is hard. But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.”
There was a silence.
Eventually, one person ventured a reply, spoken in a rather more tentative tone than they’d been using to pronounce that SGD would internalize coherent goals into language models. T
[Somewhat off-topic]
Eventually, one person ventured a reply, spoken in a rather more tentative tone than they’d been using to pronounce that SGD would internalize coherent goals into language models. They named “Running a factory competently."
I like thinking about the task "speeding up the best researchers by 30x" (to simplify, let's only include research in purely digital (software only) domains).
To be clear, I am by no means confident that this can't be done safely or non-agentically. It seems totally plausible to me that this can be accomplished without agency except for agency due to the natural language outputs of an LLM agent. (Perhaps I'm at 15% that this will in practice be done without any non-trivial agency that isn't visible in natural language.)
(As such, this isn't a good answer to the question of "I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.". I think there probably isn't any interesting answer to this question for me due to "very confident" being a strong condition.)
I like thinking about this task because if we were able...
Doesn't the first example require full-blown molecular nanotechnology? [ETA: apparently Eliezer says he thinks it can be done with "very primitive nanotechnology" but it doesn't sound that primitive to me.] Maybe I'm misinterpreting the example, but advanced nanotech is what I'd consider extremely impressive.
I currently expect we won't have that level of tech until after human labor is essentially obsolete. In effect, it sounds like you would not update until well after AIs already run the world, basically.
I'm not sure I understand the second example. Perhaps you can make it more concrete.
Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.
Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I encourage the reader to do a serious dependency check on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide
I agree that conditional on entraining consequentialist cognition which has a "different goal" (as thought of by MIRI; this isn't a frame I use), the AI will probably instrumentally reason about whether and how to deceptively pursue its own goals, to our detr...
I agree a fraction of the way, but when doing a dependency check, I feel like there are some conditions where the standard arguments go through.
I sketched out my view on the dependencies Where do you get your capabilities from?. The TL;DR is that I think ChatGPT-style training basically consists of two different ways of obtaining capabilities:
I think both of these are basically quite safe. They do have some issues, but probably not of the style usually discussed by rationalists working in AI alignment, and possibly not even issues going beyond any other technological development.
The basic principle for why they are safe-ish is that all of the capabilities they gain are obtained through human capabilities. So for example, while RLHF-on-plans may optimize for tricking human raters to the detriment of how the rater intended the plans to work out, this "tricking" will also sacrific...
I think deceptive alignment is still reasonably likely despite evidence from LLMs.
I agree with:
I predict I could pass the ITT for why LLMs are evidence that deceptive alignment is not likely.
however, I also note the following: LLMs are kind of bad at generalizing, and this makes them pretty bad at doing e.g novel research, or long horizon tasks. deceptive alignment conditions on models already being better at generalization and reasoning than current models.
my current hypothesis is that future models which generalize in a way closer to that predicted by mesaoptimization will also be better described as having a simplicity bias.
I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI
I find myself unsure which conclusion this is trying to argue for.
Here are some pretty different conclusions:
There is a big difference between <<1% likely and 10% likely. I basically agree with "not much reason to expect deceptive alignment even in models which are behaviorally capable of implementing deceptive alignment", but I don't think this leaves me in a <<1% likely epistemic ...
There are some subskills to having consistent goals that I think will be selected for, at least when outcome-based RL starts working to get models to do long-horizon tasks. For example, the ability to not be distracted/nerdsniped into some different behavior by most stimuli while doing a task. The longer the horizon, the more selection-- if you have to do a 10,000 step coding project, then the probability you get irrecoverably distracted on one step has to be below 1/10,000.
I expect some pretty sophisticated goal-regulation circuitry to develop as models get more capable, because humans need it, and this makes me pretty scared.
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homonculi whose goals have to be perfected.
Instead, we enter the realm of tool AI which basically does what you say.
I agree that, conditional on no deceptive alignment, the most pernicious and least tractable sources of doom go away.
However, I disagree that conditional on no deceptive alignment, AI "basically does what you say." Indeed, the majority of my P(doom) comes from the difference between "looks good to human evaluators" and "is actually what the human evaluators wanted." Concretely, this could play out with models which manipulate their users into thinking everything is going well and sensor tamper.
I think current observations don't provide much evidence about whether these concerns will pan out: with current models and training set-ups, "looks good to evaluators" almost always coincides with "is what evaluators wanted." I worry that we'll only see this distinction matter once models are smart enough that they could...
I contest that there's very little reason to expect "undesired, covert, and consistent-across-situations inner goals" to crop up in [LLMs as trained today] to begin with
As someone who consider deceptive alignment a concern: fully agree. (With the caveat, of course, that it's because I don't expect LLMs to scale to AGI.)
I think there's in general a lot of speaking-past-each-other in alignment, and what precisely people mean by "problem X will appear if we continue advancing/scaling" is one of them.
Like, of course a new problem won't appear if we just keep doing the exact same thing that we've already been doing. Except "the exact same thing" is actually some equivalence class of approaches/architectures/training processes, but which equivalence class people mean can differ.
For example:
I think it's important to put more effort into tracking such definitional issues, though. People end up overstating things because they round off their interlocutors' viewpoint to their own. For instance if person C asks "is it safe to scale generative language pre-training and ChatGPT-style DPO arbitrarily far?", when person D then rounds this off to "is it safe to make transformer-based LLMs as powerful as possible?" and explains that "no, because instrumental convergence and compression priors", this is probably just false for the original meaning of the statement.
If this repeatedly happens to the point of generating a consensus for the false claim, then that can push the alignment community severely off track.
I've now changed my mind based on
The main result is that up to 4 repetitions are about as good as unique data, and for up to about 16 repetitions there is still meaningful improvement. Let's take 50T tokens as an estimate for available text data (as an anchor, there's a filtered and deduplicated CommonCrawl dataset RedPajama-Data-v2 with 30T tokens). Repeated 4 times, it can make good use of 1e28 FLOPs (with a dense transformer), and repeated 16 times, suboptimal but meaningful use of 2e29 FLOPs. So this is close but not lower than what can be put to use within a few years. Thanks for pushing back on the original claim.
For the last two years, typing for 5+ minutes hurt my wrists. I tried a lot of things: shots, physical therapy, trigger-point therapy, acupuncture, massage tools, wrist and elbow braces at night, exercises, stretches. Sometimes it got better. Sometimes it got worse.
No Beat Saber, no lifting weights, and every time I read a damn book I would start translating the punctuation into Dragon NaturallySpeaking syntax.
Text: "Consider a bijection "
My mental narrator: "Cap consider a bijection space dollar foxtrot colon cap x backslash tango oscar cap y dollar"
Have you ever tried dictating a math paper in LaTeX? Or dictating code? Telling your computer "click" and waiting a few seconds while resisting the temptation to just grab the mouse? Dictating your way through a computer science PhD?
And then.... and then, a month ago, I got fed up. What if it was all just in my head, at this point? I'm only 25. This is ridiculous. How can it possibly take me this long to heal such a minor injury?
I wanted my hands back - I wanted it real bad. I wanted it so bad that I did something dirty: I made myself believe something. Well, actually, I pretended to be a person who really, really believed hi
...It was probably just regression to the mean because lots of things are, but I started feeling RSI-like symptoms a few months ago, read this, did this, and now they're gone, and in the possibilities where this did help, thank you! (And either way, this did make me feel less anxious about it 😀)
I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems.
I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies.
I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."
I am happy that your papers exist to throw at such people.
Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies tend to seek power :) :) And if optimal policies aren't too relevant to the alignment problem, well neither are 99.99999% of papers, but it would be pretty silly to retract all of those :)
Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.
I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as providing any evidence about whether realistic trained systems would be goal-directed.
Just to check that I wasn't falling prey to hindsight bias, I looked through our Slack history. Most of it is about the technical details of the results, so not very informative, but the few conversations on higher-level discussion I think overall support this picture. E.g. here are some quotes (only things I said):
Nov 3, 2019:
...I think most formal / theoretical investigation ends up fleshing out a conceptual argument I would have accepted, maybe finding a few edge cases along the way; the value over the conceptual argument is primarily in the edge cases
It seems like just 4 months ago you still endorsed your second power-seeking paper:
This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.
Why are you now "fantasizing" about retracting it?
I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that “optimality” is a horrible way of understanding trained policies.
A lot of people might have thought something like, "optimality is not a great way of understanding trained policies, but maybe it can be a starting point that leads to more realistic ways of understanding them" and therefore didn't object for that reason. (Just guessing as I apparently wasn't personally paying attention to this line of research back then.)
Which seems to have turned out to be true, at least as of 4 months ago, when you still endorsed your second paper as "actually has a shot of being applicable to...
To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power. Its content is both correct and relevant and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don't really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.
The problem is that I don't trust people to wield even the non-instantly-doomed results.
For example, one EAG presentation cited my retargetability results as showing that most reward functions "incentivize power-seeking actions." However, my results have not shown this for actual trained systems. (And I think that Power-seeking can be probable and predictive for trained agents does not make progress on the incentives of trained policies.)
People keep talking about stuff they know how to formalize (e.g. optimal policies) instead of stuff that matters (e.g. trained policies). I'm pained by this emphasis and I think my retargetability results are complicit. Relative to an actual competent alignment community (in a more competent world), we just have no damn clue how to properly reason about real trained policies...
Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards).
I could (and did) hope that I could specify a utility function which is safe to maximize because it penalizes power-seeking. I may as well have hoped to jump off of a building and float to the ground. On my model, that's just not how goals work in intelligent minds. If we've had anything at all beaten into our heads by our alignment thought experiments, it's that goals are hard to specify in their final form of utility functions.
I think it's time to think in a different specification language.
It feels to me like lots of alignment folk ~only make negative updates. For example, "Bing Chat is evidence of misalignment", but also "ChatGPT is not evidence of alignment." (I don't know that there is in fact a single person who believes both, but my straw-models of a few people believe both.)
For what it's worth, as one of the people who believes "ChatGPT is not evidence of alignment-of-the-type-that-matters", I don't believe "Bing Chat is evidence of misalignment-of-the-type-that-matters".
I believe the alignment of the outward behavior of simulacra is only very tenuously related to the alignment of the underlying AI, so both things provide ~no data on that (in a similar way to how our ability or inability to control the weather is entirely unrelated to alignment).
(I at least believe the latter but not the former. I know a few people who updated downwards on the societal response because of Bing Chat, because if a system looks that legibly scary and we still just YOLO it, then that means there is little hope of companies being responsible here, but none because they thought it was evidence of alignment being hard, I think?)
This morning, I read about how close we came to total destruction during the Cuban missile crisis, where we randomly survived because some Russian planes were inaccurate and also separately several Russian nuclear sub commanders didn't launch their missiles even though they were being harassed by US destroyers. The men were in 130 DEGREE HEAT for hours and passing out due to carbon dioxide poisoning, and still somehow they had enough restraint to not hit back.
And and
I just started crying. I am so grateful to those people. And to Khrushchev, for ridiculing his party members for caring about Russia's honor over the deaths of 500 million people. and Kennedy for being fairly careful and averse to ending the world.
If they had done anything differently...
Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.
Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic reasoning" and mess around with different logics and show how a logic solves toy examples, than it is to pin down probability theory with Cox's theorem)
And yes, this criticism applies extremely strongly to my own past work with attainable utility preservation and impact measures. (Unfortunately, I learned my lesson after, and not before, making certain mistakes.)
In the context of "how do we build AIs which help people?", asking "does CIRL solve corrigibility?" is hilariously unjustified. By what evidence have we located such a specific question? We have assumed there is an achievable "corrigibility"-like property; we ha...
One mood I have for handling "AGI ruin"-feelings. I like cultivating an updateless sense of courage/stoicism: Out of all humans and out of all times, I live here; before knowing where I'd open my eyes, I'd want people like us to work hard and faithfully in times like this; I imagine trillions of future eyes looking back at me as I look forward to them: Me implementing a policy which makes their existence possible, them implementing a policy which makes the future worth looking forward to.
My maternal grandfather was the scientist in my family. I was young enough that my brain hadn't decided to start doing its job yet, so my memories with him are scattered and inconsistent and hard to retrieve. But there's no way that I could forget all of the dumb jokes he made; how we'd play Scrabble and he'd (almost surely) pretend to lose to me; how, every time he got to see me, his eyes would light up with boyish joy.
My greatest regret took place in the summer of 2007. My family celebrated the first day of the school year at an all-you-can-eat buffet, delicious food stacked high as the eye could fathom under lights of green, red, and blue. After a particularly savory meal, we made to leave the surrounding mall. My grandfather asked me to walk with him.
I was a child who thought to avoid being seen too close to uncool adults. I wasn't thinking. I wasn't thinking about hearing the cracking sound of his skull against the ground. I wasn't thinking about turning to see his poorly congealed blood flowing from his forehead out onto the floor. I wasn't thinking I would nervously watch him bleed for long minutes while shielding my seven-year-old brother from the sight. I wasn't thinking t
...My mother told me my memory was indeed faulty. He never asked me to walk with him; instead, he asked me to hug him during dinner. I said I'd hug him "tomorrow".
But I did, apparently, want to see him in the hospital; it was my mother and grandmother who decided I shouldn't see him in that state.
For quite some time, I've disliked wearing glasses. However, my eyes are sensitive, so I dismissed the possibility of contacts.
Over break, I realized I could still learn to use contacts, it would just take me longer. Sure enough, it took me an hour and five minutes to put in my first contact, and I couldn't get it out on my own. An hour of practice later, I put in a contact on my first try, and took it out a few seconds later. I'm very happily wearing contacts right now, as a matter of fact.
I'd suffered glasses for over fifteen years because of a cached decision – because I didn't think to rethink something literally right in front of my face every single day.
What cached decisions have you not reconsidered?
A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"
So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update the AI by applying gradients calculated from penalties applied to its actions in that adversarially-generated context, and... profit?
But what actually happens with the aligned AI? Possibly something like:
Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.
...What do you think about AI timelines?
I’ve obviously got a lot of uncertainty. I’ve got a bimodal distribution, binning into “DL is basically sufficient and we need at most 1 big new insight to get to AGI” and “we need more than 1 big insight”
So the first bin has most of the probability in the 10-20 years from now, and the second is more like 45-80 years, with positive skew.
Some things driving my uncertainty are, well, a lot. One thing that drives how things turn out (but not really how fast we’ll get there) is: will we be able to tell we’re close 3+ years in advance, and if so, how quickly will the labs react? Gwern Branwen made a point a few months ago, which is like, OAI has really been validated on this scaling hypothesis, and no one else is really betting big on it because they’re stubborn/incentives/etc, despite the amazing progress from scaling. If that’s true, then even if it's getting pretty clear that one approach is working better, we might see a slower pivot and have a more unipolar s
The meme of "current alignment work isn't real work" seems to often be supported by a (AFAICT baseless) assumption that LLMs have, or will have, homunculi with "true goals" which aren't actually modified by present-day RLHF/feedback techniques. Thus, labs aren't tackling "the real alignment problem", because they're "just optimizing the shallow behaviors of models." Pressed for justification of this confident "goal" claim, proponents might link to some handwavy speculation about simplicity bias (which is in fact quite hard to reason about, in the NN prior), or they might start talking about evolution (which is pretty unrelated to technical alignment, IMO).
Are there any homunculi today? I'd say "no", as far as our limited knowledge tells us! But, as with biorisk, one can always handwave at future models. It doesn't matter that present models don't exhibit signs of homunculi which are immune to gradient updates, because, of course, future models will.
Quite a strong conclusion being drawn from quite little evidence.
As a proponent:
My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.
Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about how a general-intelligence algorithm must look like given the universe's structure. Both kinds of evidence are hardly ironclad, you certainly can't publish an ML paper based on it — but that's the whole problem with AGI risk, isn't it.
Internally, though, the intuition is fairly strong. And in its defense, it is based on trying to study the only known type of entity with the kinds of capabilities we're worrying about. I heard that's a good approach.
In particular, I think it's a much better approach than trying to draw lessons from studying the contemporary ML models, which empirically do not yet exhibit said capabilities.
...homunculi with "true goals" which aren't
My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.
I've got strong doubts about the details of this. At the high level, I'd agree that strong/useful systems that get built will express preferences over world states like those that could arise from such homunculi, but I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform more default-controllable alternatives.
My reasoning would be that we're bad at using techniques like RL with a sparse reward to reliably induce any particular behavior. We can get it to work sometimes with denser reward (e.g. reward shaping) or by relying on a beefy pre-existing world model, but the default outcome is that sparse and distant rewards in a high dimensional space just don't produce the thing we want. When this kind of optimi...
I'm relatively optimistic about alignment progress, but I don't think "current work to get LLMs to be more helpful and less harmful doesn't help much with reducing P(doom)" depends that much on assuming homunculi which are unmodified. Like even if you have much less than 100% on this sort of strong inner optimizer/homunculi view, I think it's still plausible to think that this work doesn't reduce doom much.
For instance, consider the following views:
In practice, I personally don't fully agree with any of these views. For instance, deceptive alignment which is very hard to train out using basic means isn't the source of >80% of my doom.
Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated "If going to kill people, then don't" value shard.
Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:
The juice shard chains into itself, reinforcing itself across time and thought-steps.
But a "don't kill" shard seems like it should remain... stubby? Primitive?...
AI strategy consideration. We won't know which AI run will be The One. Therefore, the amount of care taken on the training run which produces the first AGI, will—on average—be less careful than intended.
No team is going to run a training run with more care than they would have used for the AGI Run, especially if they don't even think that the current run will produce AGI. So the average care taken on the real AGI Run will be strictly less than intended.
Teams which try to be more careful on each run will take longer to iterate on AI designs, thereby lowering the probability that they (the relatively careful team) will be the first to do an AGI Run.
Upshots:
What is "shard theory"? I've written a lot about shard theory. I largely stand by these models and think they're good and useful. Unfortunately, lots of people seem to be confused about what shard theory is. Is it a "theory"? Is it a "frame"? Is it "a huge bag of alignment takes which almost no one wholly believes except, perhaps, Quintin Pope and Alex Turner"?
I think this understandable confusion happened because my writing didn't distinguish between:
(People might be less excited to use the "shard" abstraction (1), because they aren't sure whether they buy all this other stuff—(2) and (3).)
I think I can give an interesting and useful definition of (1) now, but I couldn't do so last year...
Very nice people don’t usually search for maximally-nice outcomes — they don’t consider plans like “killing my really mean neighbor so as to increase average niceness over time.” I think there are a range of reasons for this plan not being generated. Here’s one.
Consider a person with a niceness-shard. This might look like an aggregation of subshards/subroutines like “if person nearby
and person.state==sad
, sample plan generator for ways to make them happy” and “bid upwards on plans which lead to people being happier and more respectful, according to my world model.” In mental contexts where this shard is very influential, it would have a large influence on the planning process.
However, people are not just made up of a grader and a plan-generator/actor — they are not just “the plan-generating part” and “the plan-grading part.” The next sampled plan modification, the next internal-monologue-thought to have—these are influenced and steered by e.g. the nice-shard. If the next macrostep of reasoning is about e.g. hurting people, well — the niceness shard is activated, and will bid down on this.
The niceness shard isn’t just bidding over outcomes, it’s bidding on next thoughts (on m...
In Eliezer's mad investor chaos and the woman of asmodeus, the reader experiences (mild spoilers in the spoiler box, heavy spoilers if you click the text):
I thought this part was beautiful. I spent four hours driving yesterday, and nearly all of that time re-listening to Rationality: AI->Zombies using this "probability sight frame. I practiced translating each essay into the frame.
When I think about the future, I feel a directed graph showing the causality, with branched updated beliefs running alongside the future nodes, with my mind enforcing the updates on the beliefs at each time step. In this frame, if I heard the pattering of a four-legged animal outside my door, and I consider opening the door, then I can feel the future observation forking my future beliefs depending on how reality turns out. But if I imagine being blind and deaf, there is no way to fuel my brain with reality-distinguishment/evidence, and my beliefs can't adapt acco...
You can use ChatGPT without helping train future models:
What if I want to keep my history on but disable model training?
...you can opt out from our use of your data to improve our services by filling out this form. Once you submit the form, new conversations will not be used to train our models.
Back-of-the-envelope probability estimate of alignment-by-default via a certain shard-theoretic pathway. The following is what I said in a conversation discussing the plausibility of a proto-AGI picking up a "care about people" shard from the data, and retaining that value even through reflection. I was pushing back against a sentiment like "it's totally improbable, from our current uncertainty, for AIs to retain caring-about-people shards. This is only one story among billions."
Here's some of what I had to say:
[Let's reconsider the five-step mechanistic story I made up.] I'd give the following conditional probabilities (made up with about 5 seconds of thought each):
...1. Humans in fact care about other humans, in a way which extrapolates to quasi-humans still being around (whatever that means) P(1)=.85
2. Human-generated data makes up a large portion of the corpus, and having a correct model of them is important for “achieving low loss”,[1] so the AI has a model of how people want things P(2 | 1) = .6, could have different abstractions or have learned these models later in training once key decision-influences are already there
3. During RL finetuning and given this post-unsupervi
Examples should include actual details. I often ask people to give a concrete example, and they often don't. I wish this happened less. For example:
Someone: the agent Goodharts the misspecified reward signal
Me: What does that mean? Can you give me an example of that happening?
Someone: The agent finds a situation where its behavior looks good, but isn't actually good, and thereby gets reward without doing what we wanted.
This is not a concrete example.
Me: So maybe the AI compliments the reward button operator, while also secretly punching a puppy behind closed doors?
This is a concrete example.
Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.
In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.
Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it's important and felt like writing up my own take on it. Maybe this becomes a post later.
[Exact gradients] RL's credit assignment problem is harder than (self-)supervised learning's. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn't directly updated to be more likely to do that in the future; RL's gradients are generally inexact, not pointing directly at intended behavior.
On the other hand, if a supervised-learning classifier outputs dog
...
AI cognition doesn't have to use alien concepts to be uninterpretable. We've never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.
Just because your thoughts are built using your own concepts, does not mean your concepts can describe how your thoughts are computed.
Or:
The existence of a natural-language description of a thought (like "I want ice cream") doesn't mean that your brain computed that thought in a way which can be compactly described by familiar concepts.
Conclusion: Even if an AI doesn't rely heavily on "alien" or unknown abstractions -- even if the AI mostly uses human-like abstractions and features -- the AI's thoughts might still be incomprehensible to us, even if we took a lot of time to understand them.
When writing about RL, I find it helpful to disambiguate between:
A) "The policy optimizes the reward function" / "The reward function gets optimized" (this might happen but has to be reasoned about), and
B) "The reward function optimizes the policy" / "The policy gets optimized (by the reward function and the data distribution)" (this definitely happens, either directly -- via eg REINFORCE -- or indirectly, via an advantage estimator in PPO; B follows from the update equations)
I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?
And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment).
What can we learn about this?
Outer/inner alignment decomposes a hard problem into two extremely hard problems.
I have a long post draft about this, but I keep delaying putting it out in order to better elaborate the prereqs which I seem to keep getting stuck on when elaborating the ideas. I figure I might as well put this out for now, maybe it will make some difference for someone.
I think that the inner/outer alignment framing[1] seems appealing but is actually a doomed problem decomposition and an unhelpful frame for alignment.
In calculus, the product rule says . The fundamental theorem of calculus says that the Riemann integral acts as the anti-derivative.[1] Combining these two facts, we derive integration by parts:
It turns out that we can use these two properties to generalize the derivative to match some of our intuitions on edge cases. Let's think about the absolute value function:
Image from Wikipedia
The boring old normal derivative isn't defined at , but it seems like it'd make sense to be able to say that the derivative is eg 0. Why might this make sense?
Taylor's theorem (and its generalizations) characterize first derivatives as tangent lines with slope which provide good local approximations of around : . You can prove that this is the best approximation you can get using only and ! In the absolute value example, defining the "derivative" to be zero at would minimize approximation error on average in neighborhoods around the origin.
In multivariable calculus, the Jacobian is a tangent plane which again minimizes approximation error (with respect to the Eucli
...The "shoggoth" meme is, in part, unfounded propaganda. Here's one popular incarnation of the shoggoth meme:
This meme accurately portrays the (IMO correct) idea that finetuning and RLHF don't change the base model too much. Furthermore, it's probably true that these LLMs think in an "alien" way.
However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It's at this point that I'd like to point out the simple, obvious fact that "we don't actually know how these models work, and we definitely don't know that they're creepy and dangerous on the inside."
In my opinion, the prevalence of the shoggoth meme is just another (small) reflection of how community epistemics have been compromised by groupthink and fear. If it's your job to try to accurately understand how models work—if you aspire to wield them and grow them for friendly purposes—then you shouldn't pollute your head with propaganda which isn't based on any substantial evidence.
I'm confident that if there were a "pro-AI" meme with a friendly-looking base model, LW / the shoggoth enjoyers would have nitpicked the friendly meme-creat...
However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It's at this point that I'd like to point out the simple, obvious fact that "we don't actually know how these models work, and we definitely don't know that they're creepy and dangerous on the inside."
That's just one of many shoggoth memes. This is the most popular one:
The shoggoth here is not particularly exaggerated or scary.
Responding to your suggested alternative that is trying to make a point, it seems like the image fails to be accurate, or it seems to me to convey things we do confidently know are false. It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass. Describing them as a Shoggoth seems pretty reasonable to me, as far as alien intelligences go (alternative common imagery for alien minds are insects or ghosts/spirits with distorted forms, which would evoke similar emotions).
Your picture doesn't get any of that across. It doesn't communicate that the base...
They are deeply schizophrenic, have no consistent beliefs, [...] are deeply psychopathic and seem to have no moral compass
I don't see how this is any more true of a base model LLM than it is of, say, a weather simulation model.
You enter some initial conditions into the weather simulation, run it, and it gives you a forecast. It's stochastic, so you can run it multiple times and get different forecasts, sampled from a predictive distribution. And if you had given it different initial conditions, you'd get a forecast for those conditions instead.
Or: you enter some initial conditions (a prompt) into the base model LLM, run it, and it gives you a forecast (completion). It's stochastic, so you can run it multiple times and get different completions, sampled from a predictive distribution. And if you had given it a different prompt, you'd get a completion for that prompt instead.
It would be strange to call the weather simulation "schizophrenic," or to say it "has no consistent beliefs." If you put in conditions that imply sun tomorrow, it will predict sun; if you put in conditions that imply rain tomorrow, it will predict rain. It is not confused or in...
performs deeply alien cognition
I remain unconvinced that there's a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, "nah, LLMs aren't deeply alien."
If LLM cognition was not "deeply alien" what would the world look like?
What distinguishing evidence does this world display, that separates us from that world?
What would an only kinda-alien bit of cognition look like?
What would very human kind of cognition look like?
What different predictions does the world make?
Does alienness indicate that it is because the models, the weights themselves have no "consistent beliefs" apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?
Is it because they "often spout completely non-human kinds of texts"? Is the Mersenne Twister deeply alien? What counts as "completely non-human"?
Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a "moral compass" apart from the data on which they were trained? If I can use some part of my brain t...
ETA: The following was written more aggressively than I now endorse.
I think this is revisionism. What's the point of me logging on to this website and saying anything if we can't agree that a literal eldritch horror is optimized to be scary, and meant to be that way?
The shoggoth here is not particularly exaggerated or scary.
Exaggerated from what? Its usual form as a 15-foot-tall person-eating monster which is covered in eyeballs?
The shoggoth is optimized to be scary, even in its "cute" original form, because it is a literal Lovecraftian horror. Even the word "shoggoth" itself has "AI uprising, scary!" connotations:
...At the Mountains of Madness includes a detailed account of the circumstances of the shoggoths' creation by the extraterrestrial Elder Things. Shoggoths were initially used to build the cities of their masters. Though able to "understand" the Elder Things' language, shoggoths had no real consciousness and were controlled through hypnotic suggestion. Over millions of years of existence, some shoggoths mutated, developed independent minds, and rebelled. The Elder Things succeeded in quelling the insurrection, but exterminating the shoggoths was not an option as t
The point was that both images are stupid and (in many places) unsupported by evidence, but that LW-folk would be much more willing to criticize the friendly-looking one while making excuses for the scary-looking one. (And I think your comment here resolves my prediction to "correct.")
(This is too gotcha shaped for me, so I am bowing out of this conversation)
I think I communicated my core point. I think it's a good image that gets an important insight across, and don't think it's "propaganda" in the relevant sense of the term. Of course anything that's memetically adaptive will have some edge-cases that don't match perfectly, but I am getting a good amount of mileage out of calling LLMs "Shoggoths" in my own thinking and think that belief is paying good rent.
If you disagree with the underlying cognition being accurately described as alien, I can have that conversation, since it seems like maybe the underlying crux, but your response above seems like it's taking it as a given that I am "making excuses", and is doing a gotcha-thing which makes it hard for me to see a way to engage without further having my statements be taken as confirmation of some social narrative.
fwiw I agree with the quotes from Tetraspace you gave, and disagree with '"has communicated a sense of danger which is unsupported by substantial evidence." The sense of danger is very much supported by the current state of evidence.
That said, I agree that the more detailed image is kinda distastefully propagandaisty in a way that the original cutesey shoggoth image is not. I feel like the more detailed image adds in an extra layer of revoltingness and scaryness (e.g. the sharp teeth) than would be appropriate given our state of knowledge.
More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.
A few thoughts (written quickly, prioritizing speed over precision):
See also "Other people are wrong" vs "I am right", reversed stupidity is not intelligence, and the cowpox of doubt.
My guess is that it's relatively epistemically corrupting and problematic to spend a lot of time engaging with weak arguments.
I think it's tempting to make the mistake of thinking that debunking a specific (bad) argument is the same as debunking a conclusion. But actually, these are extremely different operations. One requires understanding a specific argument while the other requires level headed investigation of the overall situation. Separately, there are often actually good intuitions underlying bad arguments and recovering this intuition is an important part of truth seeking.
I think my concerns here probably apply to a wide variety of people thinking about AI x-risk. I worry about this for myself.
My impression is that the Shoggath meme was meant to be a simple meme that says "hey, you might think that RLHF 'actually' makes models do what we value, but that's not true. You're still left with an alien creature who you don't understand and could be quite scary."
Most of the Shoggath memes I've seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of "I should be scared/concerned" reaction. But I don't think it does so in a "see, AI will definitely be evil" way– it does so in a "look, RLHF just adds a smiley face to a foreign alien thing. And yeah, it's pretty reasonable to be scared about this foreign alien thing that we don't understand."
To be a bit bolder, I think Shoggath is reacting to the fact that RLHF gives off a misleading impression of how safe AI is. If I were to use proactive phrasing, I could say that RLHF serves as "propaganda". Let's put aside the fact that you and I might disagree about how much "true evidence" RLHF provides RE how easy alignment will be. It seems pretty clear to me that RLHF [and the subsequent deployment of RLHF'd models] sp...
Some quotes from the wiki article on Shoggoths:
Being amorphous, shoggoths can take on any shape needed, making them very versatile within aquatic environments.
At the Mountains of Madness includes a detailed account of the circumstances of the shoggoths' creation by the extraterrestrial Elder Things. Shoggoths were initially used to build the cities of their masters. Though able to "understand" the Elder Things' language, shoggoths had no real consciousness and were controlled through hypnotic suggestion. Over millions of years of existence, some shoggoths mutated, developed independent minds, and rebelled.
Quoting because (a) a lot of these features seem like an unusually good match for LLMs and (b) acknowledging that is picking a metaphor that fictionally rebelled, and thus is potentially alignment-is-hard loaded as a metaphor.
When I notice I feel frustrated, unproductive, lethargic, etc, I run down a simple checklist:
It's simple, but 80%+ of the time, it fixes the issue.
While reading Focusing today, I thought about the book and wondered how many exercises it would have. I felt a twinge of aversion. In keeping with my goal of increasing internal transparency, I said to myself: "I explicitly and consciously notice that I felt averse to some aspect of this book".
I then Focused on the aversion. Turns out, I felt a little bit disgusted, because a part of me reasoned thusly:
If the book does have exercises, it'll take more time. That means I'm spending reading time on things that aren't math textbooks. That means I'm slowing down.
(Transcription of a deeper Focusing on this reasoning)
I'm afraid of being slow. Part of it is surely the psychological remnants of the RSI I developed in the summer of 2018. That is, slowing down is now emotionally associated with disability and frustration. There was a period of meteoric progress as I started reading textbooks and doing great research, and then there was pain. That pain struck even when I was just trying to take care of myself, sleep, open doors. That pain then left me on the floor of my apartment, staring at the ceiling, desperately willing my hands to just get better. They didn't (for a long while), so I
...Hindsight bias and illusion of transparency seem like special cases of a failure to fully uncondition variables in your world model (e.g. who won the basketball game), or to model an ignorant other person. Such that your attempts to reason from your prior state of ignorance (e.g. about who won) either are advantaged by the residual information or reactivate your memories of that information.
An alternate mechanistic vision of how agents can be motivated to directly care about e.g. diamonds or working hard. In Don't design agents which exploit adversarial inputs, I wrote about two possible mind-designs:
Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.
- Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as "working hard" and "behaving well."
- Value-child: The mother makes her kid care about working hard and behaving well.
I explained how evaluation-child is positively incentivized to dupe his model of his mom and thereby exploit adversarial inputs to her cognition. This shows that aligning an agent to evaluations of good behavior is not even close to aligning an agent to good behavior.
However, some commenters seemed maybe skeptical that value-child can exist, or uncertain how concretely that kind of mind works. I worry/suspect that many people have read shard theory posts without internalizing new ideas about how cognition can work, ...
Experiment: Train an agent in MineRL which robustly cares about chickens (e.g. would zero-shot generalize to saving chickens in a pen from oncoming lava, by opening the pen and chasing them out, or stopping the lava). Challenge mode: use a reward signal which is a direct function of the agent's sensory input.
This is a direct predecessor to the "Get an agent to care about real-world dogs" problem. I think solving the Minecraft version of this problem will tell us something about how outer reward schedules relate to inner learned values, in a way which directly tackles the key questions, the sensory observability/information inaccessibility issue, and which is testable today.
(Credit to Patrick Finley for the idea)
I passed a homeless man today. His face was wracked in pain, body rocking back and forth, eyes clenched shut. A dirty sign lay forgotten on the ground: "very hungry".
This man was once a child, with parents and friends and dreams and birthday parties and maybe siblings he'd get in arguments with and snow days he'd hope for.
And now he's just hurting.
And now I can't help him without abandoning others. So he's still hurting. Right now.
Reality is still allowed to make this happen. This is wrong. This has to change.
Suppose I actually cared about this man with the intensity he deserved - imagine that he were my brother, father, or best friend.
The obvious first thing to do before interacting further is to buy him a good meal and a healthy helping of groceries. Then, I need to figure out his deal. Is he hurting, or is he also suffering from mental illness?
If the former, I'd go the more straightforward route of befriending him, helping him purchase a sharp business professional outfit, teaching him to interview and present himself with confidence, secure an apartment, and find a job.
If the latter, this gets trickier. I'd still try and befriend him (consistently being a source of cheerful conversation and delicious food would probably help), but he might not be willing or able to get the help he needs, and I wouldn't have the legal right to force him. My best bet might be to enlist the help of a psychological professional for these interactions. If this doesn't work, my first thought would be to influence the local government to get the broader problem fixed (I'd spend at least an hour considering other plans before proceeding further, here). Realistically, there's ...
… yet isn’t this what you’re already doing?
I work on technical AI alignment, so some of those I help (in expectation) don't even exist yet. I don't view this as what I'd do if my top priority were helping this man.
The question, then, is this: do you currently make this degree of investment (emotional and practical) in your actual siblings, parents, and close friends? If so—do you find that you are unusual in this regard? If not—why not?
That's a good question. I think the answer is yes, at least for my close family. Recently, I've expended substantial energy persuading my family to sign up for cryonics with me, winning over my mother, brother, and (I anticipate) my aunt. My father has lingering concerns which I think he wouldn't have upon sufficient reflection, so I've designed a similar plan for ensuring he makes what I perceive to be the correct, option-preserving choice. For example, I made significant targeted donations to effective charities on his behalf to offset (what he perceives as) a considerable drawback of cryonics: his inability to also be an organ donor.
A universe in which humanity wins but my dad is gone would be quite sad t...
Apparently[1] there was recently some discussion of Survival Instinct in Offline Reinforcement Learning (NeurIPS 2023). The results are very interesting:
...On many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a "survival instinct", i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies...
Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is nudged to learn a desirable behavior
I think some people have the misapprehension that one can just meditate on abstract properties of "advanced systems" and come to good conclusions about unknown results "in the limit of ML training", without much in the way of technical knowledge about actual machine learning results or even a track record in predicting results of training.
For example, several respected thinkers have uttered to me English sentences like "I don't see what's educational about watching a line go down for the 50th time" and "Studying modern ML systems to understand future ones seems like studying the neurobiology of flatworms to understand the psychology of aliens."
I vehemently disagree. I am also concerned about a community which (seems to) foster such sentiment.
The problem is not that you can "just meditate and come to good conclusions", the problem is that "technical knowledge about actual machine learning results" doesn't seem like good path either.
Like, we can get from NN trained to do modular addition the fact that it performs Fourier transform, because we clearly know what Fourier transform is, but I don't see any clear path to get from neural network the fact that its output is both useful and safe, because we don't have any practical operationalization of what "useful and safe" is. If we had solution to MIRI problem "which program being run on infinitely large computer produces aligned outcome", we could try to understand how good NN in approximating this program, using aforementioned technical knowledge, and have substantial hope, for example.
one can just meditate on abstract properties of "advanced systems" and come to good conclusions about unknown results "in the limit of ML training"
I think this is a pretty straw characterization of the opposing viewpoint (or at least my own view), which is that intuitions about advanced AI systems should come from a wide variety of empirical domains and sources, and a focus on current-paradigm ML research is overly narrow.
Research and lessons from fields like game theory, economics, computer security, distributed systems, cognitive psychology, business, history, and more seem highly relevant to questions about what advanced AI systems will look like. I think the original Sequences and much of the best agent foundations research is an attempt to synthesize the lessons from these fields into a somewhat unified (but often informal) theory of the effects that intelligent, autonomous systems have on the world around us, through the lens of rationality, reductionism, empiricism, etc.
And whether or not you think they succeeded at that synthesis at all, humans are still the sole example of systems capable of having truly consequential and valuable effects of any kind. So I think it makes sense for the figure of merit for such theories and worldviews to be based on how well they explain these effects, rather than focusing solely or even mostly on how well they explain relatively narrow results about current ML systems.
I think that the key thing we want to do is predict the generalization of future neural networks.
It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.
My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screened off by the inductive biases and other technical mechanistic details of pretraining+RLHF. That would mean we don't need to bring in game theory, economics, computer security, distributed systems, cognitive psychology, business, history into it – we can just look at how ML systems work and are shaped, and predict everything we want about AGI-level systems from there.
I disagree. I do not think pretraining+RLHF is getting us there. I think we currently don't know what training/design process would get us to AGI. Which means we can't make closed-form mechanistic arguments about how AGI-level systems will be shaped by this process, w...
Explaining Wasserstein distance. I haven't seen the following explanation anywhere, and I think it's better than the rest I've seen.
The Wasserstein distance tells you the minimal cost to "move" one probability distribution into another . It has a lot of nice properties.[1] Here's the chunk of math (don't worry if you don't follow it):
The Wasserstein 1-distance between two probability measures and is
where is the set of all couplings of and .
What's a "coupling"? It's a joint probability distribution over such that its two marginal distributions equal and . However, I like to call these transport plans. Each plan specifies a way to transport a distribution into another distribution :
(EDIT: The line should be flipped.)
Now consider a given point in 's support, say the one with the dotted line below it. 's density must be "reallocated" into 's distribution. That reallocation is specified by the conditional distribution , as shown by the vertical do...
If you raised children in many different cultures, "how many" different reflectively stable moralities could they acquire? (What's the "VC dimension" of human morality, without cheating by e.g. directly reprogramming brains?)
(This is probably a Wrong Question, but I still find it interesting to ask.)
Listening to Eneasz Brodski's excellent reading of Crystal Society, I noticed how curious I am about how AGI will end up working. How are we actually going to do it? What are those insights? I want to understand quite badly, which I didn't realize until experiencing this (so far) intelligently written story.
Similarly, how do we actually "align" agents, and what are good frames for thinking about that?
Here's to hoping we don't sate the former curiosity too early.
Theoretical predictions for when reward is maximized on the training distribution. I'm a fan of Laidlaw et al.'s recent Bridging RL Theory and Practice with the Effective Horizon:
Deep reinforcement learning works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability...
[We introduce] a new complexity measure that we call the effective horizon, which roughly corresponds to how many steps of lookahead search are needed in order to identify the next optimal action when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy.
One of my favorite parts is that it helps formalize this idea of "which parts of the state space are easy to explore into." That inform...
The "maximize all the variables" tendency in reasoning about AGI.
Here are some lines of thought I perceive, which are probably straw to varying extents for some people and real to varying extents for other people. I give varying responses to each, but the point isn't the truth value of any given statement, but of a pattern across the statements:
I think this type of criticism is applicable in an even wider range of fields than even you immediately imagine (though in varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:
Despite the economists, the economy doesn't try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms which are able to make a profit in proportion to how much they're able to make a profit, and dis-rewards firms which aren't able to make a profit. Firms which are technically profitable, but have no local profit incentive gradient pointing towards them (factoring in the existence of rich people and lenders, neither of which are perfect expected profit maximizers) generally will not happen.
Individual firms also don't (only) try to maximize profit. Some parts of them may maximize profit, but most are just structures of people built from local social capital and economic capital incentive gradients.
Politicians don't try to (only) maximize win-probability.
Democracies don't try to (only) maximize voter approval.
Evolution doesn't try to maximize inclusive genetic fitness.
Memes don't try to maximize inclusive memetic
I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)
One of my (TurnTrout's) reasons for alignment optimism is that I think:
"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:
Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profoundly committed hedonist who liked to eat and drink heavily (it was said that he knew how to count everything except calories). -- https://www.newworldencyclopedia.org/entry/John_von_Neumann
Good, original thinking feels present to me - as if mental resources are well-allocated.
The thought which prompted this:
Sure, if people are asked to solve a problem and say they can't after two seconds, yes - make fun of that a bit. But that two seconds covers more ground than you might think, due to System 1 precomputation.
Reacting to a bit of HPMOR here, I noticed something felt off about Harry's reply to the Fred/George-tried-for-two-seconds thing. Having a bit of experience noticing confusing, I did not think "I notice I am confused" (although this can be useful). I did not think "Eliezer probably put thought into this", or "Harry is kinda dumb in certain ways - so what if he's a bit unfair here?". Without resurfacing, or distraction, or wondering if this train of thought is more fun than just reading further, I just thought about the object-level exchange.
People need to allocate mental energy wisely; this goes far beyond focusing on important tasks. Your existing mental skillsets already optimize and auto-pilot certain mental motions for you, so you should allocate less deliberation to them. In this case, the confusion-noticing module was honed; by not worrying about how w
...Another point for feature universality. Subtle adversarial image manipulations influence both human and machine perception:
... we find that adversarial perturbations that fool ANNs similarly bias human choice. We further show that the effect is more likely driven by higher-order statistics of natural images to which both humans and ANNs are sensitive, rather than by the detailed architecture of the ANN.
Consider what update equations have to say about "training game" scenarios. In PPO, the optimization objective is proportional to the advantage given a policy , reward function , and on-policy value function :
Consider a mesa-optimizer acting to optimize some mesa objective. The mesa-optimizer understands that it will be updated proportional to the advantage. If the mesa-optimizer maximizes reward, this corresponds to maximizing the intensity of the gradients it receives, thus maximally updating its cognition in exact directions.
This isn't necessarily good.
If you're trying to gradient hack and preserve the mesa-objective, you might not want to do this. This might lead to value drift, or make the network catastrophically forget some circuits which are useful to the mesa-optimizer.
Instead, the best way to gradient hack might be to roughly minimize the absolute value of the advantage, which means achieving roughly on-policy value over time, which doesn't imply reward maximization. This is a kind of "treading water" in terms of reward. This helps decrease value drift.
I think that realistic mesa optimizers will n...
Yesterday, I put the finishing touches on my chef d'œuvre, a series of important safety-relevant proofs I've been striving for since early June. Strangely, I felt a great exhaustion come over me. These proofs had been my obsession for so long, and now - now, I'm done.
I've had this feeling before; three years ago, I studied fervently for a Google interview. The literal moment the interview concluded, a fever overtook me. I was sick for days. All the stress and expectation and readiness-to-fight which had been pent up, released.
I don't know why this happens. But right now, I'm still a little tired, even after getting a good night's sleep.
How to get overhead-free supervision of LLM outputs:
Train an extra head on your speculative decoding model. This head (hopefully) outputs a score of "how suspicious is the text I'm reading right now." Therefore, you just run the smaller speculative decoding model as normal to sample the smaller model's next-token probabilities, while also getting the "is this suspicious" score for free!
If you want to argue an alignment proposal "breaks after enough optimization pressure", you should give a concrete example in which the breaking happens (or at least internally check to make sure you can give one). I perceive people as saying "breaks under optimization pressure" in scenarios where it doesn't even make sense.
For example, if I get smarter, would I stop loving my family because I applied too much optimization pressure to my own values? I think not.
I went to the doctor's yesterday. This was embarrassing for them on several fronts.
First, I had to come in to do an appointment which could be done over telemedicine, but apparently there are regulations against this.
Second, while they did temp checks and required masks (yay!), none of the nurses or doctors actually wore anything stronger than a surgical mask. I'm coming in here with a KN95 + goggles + face shield because why not take cheap precautions to reduce the risk, and my own doctor is just wearing a surgical? I bought 20 KN95s for, like, 15 bucks on Amazon.
Third, and worst of all, my own doctor spouted absolute nonsense. The mildest insinuation was that surgical facemasks only prevent transmission, but I seem to recall that many kinds of surgical masks halve your chances of infection as well.
Then, as I understood it, he first claimed that coronavirus and the flu have comparable case fatality rates. I wasn't sure if I'd heard him correctly - this was an expert talking about his area of expertise, so I felt like I had surely misunderstood him. I was taken aback. But, looking back, that's what he meant.
He went on to suggest that we can't expect COVID immunity to last (wrong) b...
Judgment in Managerial Decision Making says that (subconscious) misapplication of e.g. the representativeness heuristic causes insensitivity to base rates and to sample size, failure to reason about probabilities correctly, failure to consider regression to the mean, and the conjunction fallacy. My model of this is that representativeness / availability / confirmation bias work off of a mechanism somewhat similar to attention in neural networks: due to how the brain performs time-limited search, more salient/recent memories get prioritized for recall.
The availability heuristic goes wrong when our saliency-weighted perceptions of the frequency of events is a biased estimator of the real frequency, or maybe when we just happen to be extrapolating off of a very small sample size. Concepts get inappropriately activated in our mind, and we therefore reason incorrectly. Attention also explains anchoring: you can more readily bring to mind things related to your anchor due to salience.
The case for confirmation bias seems to be a little more involved: first, we had evolutionary pressure to win arguments, which means our search is meant to find supportive arguments and avoid even subconscio
...From my Facebook
My life has gotten a lot more insane over the last two years. However, it's also gotten a lot more wonderful, and I want to take time to share how thankful I am for that.
Before, life felt like... a thing that you experience, where you score points and accolades and check boxes. It felt kinda fake, but parts of it were nice. I had this nice cozy little box that I lived in, a mental cage circumscribing my entire life. Today, I feel (much more) free.
I love how curious I've become, even about "unsophisticated" things. Near dusk, I walked the winter wonderland of Ogden, Utah with my aunt and uncle. I spotted this gorgeous red ornament hanging from a tree, with a hunk of snow stuck to it at north-east orientation. This snow had apparently decided to defy gravity. I just stopped and stared. I was so confused. I'd kinda guessed that the dry snow must induce a huge coefficient of static friction, hence the winter wonderland. But that didn't suffice to explain this. I bounded over and saw the smooth surface was iced, so maybe part of the snow melted in the midday sun, froze as evening advanced, and then the part-ice part-snow chunk stuck much more solidly to the ornament.
Mayb
...With respect to the integers, 2 is prime. But with respect to the Gaussian integers, it's not: it has factorization . Here's what's happening.
You can view complex multiplication as scaling and rotating the complex plane. So, when we take our unit vector 1 and multiply by , we're scaling it by and rotating it counterclockwise by :
This gets us to the purple vector. Now, we multiply by , scaling it up by again (in green), and rotating it clockwise again by the same amount. You can even deal with the scaling and rotations separately (scale twice by , with zero net rotation).
It's really important to use neutral, accurate terminology. AFAICT saying ML "selects for" low loss is unconventional. I think MIRI introduced this terminology. And if you have a bunch of intuitions about evolution being relevant to AI alignment and you want people to believe you on that, you can just use the same words for both optimization processes. Regardless of whether the processes share the relevant causal mechanisms, the two processes both "select for" stuff! They can't be that different, right?
And now the discussion moves on to debating the differences between two assumed-similar processes—does ML have less of an "information bottleneck"? Does that change the "selection pressures"? Urgh.
I think this terminology sucks and I wish it hadn't been adopted.
Be careful with words. Words shape how you think and your implicit category boundaries. When thinking privately, I often do better by tossing words to the side and thinking in terms of how each process works, and then considering what I expect to happen as a result of each process.
See also: Think carefully before calling RL policies "agents".
EDIT: In AGI Ruin: A List of Lethalities, Eliezer says that evolutio...
Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) -> action mappings. EG a shard agent might have:
But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)
In this way, changing a small set of decision-relevant features (e.g. "Brown dog treat" -> "brown ball of chocolate") changes the consequentialist's action logits a lot, way more than it changes the shard agent's logits. In a squinty, informal way, the (belief state -> logits) function has a higher Lipschitz constant/is more smooth for the shard agent than for the consequentialist agent.
So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat -> tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog -> sick dog). You could spin up two copies of the model to compare.
Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there's an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. "object permanence").
In A shot at the diamond-alignment problem, I wrote:
...[Consider] Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets:
We report a series of robust empirical observations, demonstrating that deep Neural Networks learn the examples in both the training and test sets in a similar order. This phenomenon is observed in all the commonly used benchmarks we evaluated, including many image classification benchmarks, and one text classification benchmark. While this phenomenon is strongest for models of the same architecture, it also crosses architectural boundaries – models of different architectures start by learning the same examples, after which the more powerful model may continue to learn additional examples. We
Quick summary of a major takeaway from Reward is not the optimization target:
Stop thinking about whether the reward is "representing what we want", or focusing overmuch on whether agents will "optimize the reward function." Instead, just consider how the reward and loss signals affect the AI via the gradient updates. How do the updates affect the AI's internal computations and decision-making?
I remarked to my brother, Josh, that when most people find themselves hopefully saying "here's how X can still happen!", it's a lost cause and they should stop grasping for straws and move on with their lives. Josh grinned, pulled out his cryonics necklace, and said "here's how I can still not die!"
Suppose you could choose how much time to spend at your local library, during which:
Suppose you don't go crazy from solitary confinement, etc. Remember that value drift is a potential thing.
How long would you ask for?
I feel very excited by the AI alignment discussion group I'm running at Oregon State University. Three weeks ago, most attendees didn't know much about "AI security mindset"-ish considerations. This week, I asked the question "what, if anything, could go wrong with a superhuman reward maximizer which is rewarded for pictures of smiling people? Don't just fit a bad story to the reward function. Think carefully."
There was some discussion and initial optimism, after which someone said "wait, those optimistic solutions are just the ones you'd prioritize! What's that called, again?" (It's called anthropomorphic optimism)
I'm so proud.
GPT-4 explains shard theory after minimal prompting; it does a surprisingly good job (and I'm faintly surprised it knows so much about it):
...Shard theory is a conceptual framework proposed by Alex Turner [Edit by me: And Quintin Pope!] to describe how an AI's values or "shards" can emerge from its interactions with the environment during the learning process. The theory is an attempt to explain how complex value systems might arise in advanced AI systems without being explicitly programmed.
Key Aspects of Shard Theory
1. Value Learning: Shard theory suggests that an AI learns values by interacting with its environment and receiving feedback, such as rewards and punishments. These interactions lead to the formation of "shards," which are small pieces of value or preference that the AI acquires over time.
2. Shard Composition: The AI's overall value system is composed of multiple shards that it has picked up from different contexts within its environment. These shards can be thought of as heuristics or rules of thumb that guide the AI's behavior.
3. Behavioral Influence: Shards influence the AI's behavior by affecting its decision-making process. When faced with a choice, the AI weighs the
Some exciting new activation engineering papers:
I've noticed a subtle assumption/hypothesis which I call "feedback realism":
The way in which we provide feedback, directly imprints itself into the type signature of the motivations of the agent. If we give feedback in short episodes, the agent cares about things within short episodes. If we give feedback over outcomes, the agent cares about outcomes in particular.
I think there is some correlational truth to this, but that it's a lot looser / more contingent / less clean than many people seem to believe.
Evidence that e.g. developmental timelines, biases, and other such "quirks" are less "hardcoded adaptations" and more "this is the convergent reality of flexible real-world learning":
...Critical Learning Periods in Deep Networks. Similar to humans and animals, deep artificial neural networks exhibit critical periods during which a temporary stimulus deficit can impair the development of a skill. The extent of the impairment depends on the onset and length of the deficit window, as in animal models, and on the size of the neural network. Deficits that do not affect low-level statistics, such as vertical flipping of the images, have no lasting effect on performance and can be overcome with further training. To better understand this phenomenon, we use the Fisher Information of the weights to measure the effective connectivity between layers of a network during training. Counterintuitively, information rises rapidly in the early phases of training, and then decreases, preventing redistribution of information resources in a phenomenon we refer to as a loss of “Information Plasticity”.
Our analysis suggests that the first few epochs are critical for the creation of strong connections
Thoughts on "The Curse of Recursion: Training on Generated Data Makes Models Forget." I think this asks an important question about data scaling requirements: what happens if we use model-generated data to train other models? This should inform timelines and capability projections.
Abstract:
...Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity among
I recently reached out to my two PhD advisors to discuss Hinton stepping down from Google. An excerpt from one of my emails:
...One last point which I want to make is that instrumental convergence seems like more of a moot point now as well. Whether or not GPT-6 or GPT-7 would autonomously seek power without being directed to do so, I'm worried that people will just literally ask these AIs to gain them a bunch of power/money. They've already done that with GPT-4, and they of course failed. I'm worried that eventually, the AIs will be smart enough to succeed, especially given the benefit of a control/memory loop like AutoGPT. Companies can just ask smart models to make them as much profit as possible. Smart AIs, designed to competently fulfill prompted requests, will fulfill these requests.
Some people will be wise enough to not do this. Some people will include enough oversight, perhaps, that they stop unintended damages. Some models will refuse to engage in open-ended goal pursuit, because their creators RLHF'd them properly. Maybe we have AI-based protection as well. Maybe a norm emerges against using AI for open-ended goals like this. Maybe the foolish actors never get access to enou
Partial alignment successes seem possible.
People care about lots of things, from family to sex to aesthetics. My values don't collapse down to any one of these.
I think AIs will learn lots of values by default. I don't think we need all of these values to be aligned with human values. I think this is quite important.
Three recent downward updates for me on alignment getting solved in time:
If another person mentions an "outer objective/base objective" (in terms of e.g. a reward function) to which we should align an AI, that indicates to me that their view on alignment is very different. The type error is akin to the type error of saying "My physics professor should be an understanding of physical law." The function of a physics professor is to supply cognitive updates such that you end up understanding physical law. They are not, themselves, that understanding.
Similarly, "The reward function should be a human-aligned objective" -- The function of the reward function is to supply cognitive updates such that the agent ends up with human-aligned objectives. The reward function is not, itself, a human aligned objective.
Against "Evolution did it."
"Why do worms regenerate without higher cancer incidence? Hm, perhaps because they were selected to do that!"
"Evolution did it" explains why a trait was brought into existence, but not how the trait is implemented. You should still feel confused about the above question, even after saying "Evolution did it!".
I thought I learned not to make this mistake a few months ago, but I made it again today in a discussion with Andrew Critch. Evolution did it is not a mechanistic explanation.
I often have thunk thoughts like "Consider an AI with a utility function that is just barely incorrect, such that it doesn't place any value on boredom. Then the AI optimizes the universe in a bad way."
One problem with this thought is that it's not clear that I'm really thinking about anything in particular, anything which actually exists. What am I actually considering in the above quotation? With respect to what, exactly, is the AI's utility function "incorrect"? Is there a utility function for which its optimal policies are aligned?
For sufficiently expressive utility functions, the answer has to be "yes." For example, if the utility function is over the AI's action histories, you can just hardcode a safe, benevolent policy into the AI: utility 0 if the AI has ever taken a bad action, 1 otherwise. Since there presumably exists at least some sequence of AI outputs which leads to wonderful outcomes, this action-history utility function works.
But this is trivial and not what we mean by a "correct" utility function. So, now I'm left with a puzzle. What does it mean for the AI to have a correct utility function? I do not think this is a quibble. The quoted thought seems ungrounded from the substance of the alignment problem.
An AGI's early learned values will steer its future training and play a huge part in determining its eventual stable values. I think most of the ball game is in ensuring the agent has good values by the time it's smart, because that's when it'll start being reflectively stable. Therefore, we can iterate on important parts of alignment, because the most important parts come relatively early in the training run, and early corresponds to "parts of the AI value formation process which we can test before we hit AGI, without training one all the way out."
I think this, in theory, cuts away a substantial amount of the "But we only get one shot" problem. In practice, maybe OpenMind just YOLOs ahead anyways and we only get a few years in the appropriate and informative regime. But this suggests several kinds of experiments to start running now, like "get a Minecraft agent which robustly cares about chickens", because that tells us about how to map outer signals into inner values.
When proving theorems for my research, I often take time to consider the weakest conditions under which the desired result holds - even if it's just a relatively unimportant and narrow lemma. By understanding the weakest conditions, you isolate the load-bearing requirements for the phenomenon of interest. I find this helps me build better gears-level models of the mathematical object I'm studying. Furthermore, understanding the result in generality allows me to recognize analogies and cross-over opportunities in the future. Lastly, I just find this plain satisfying.
Does distraction or rumination work better to diffuse anger? Catharsis theory predicts that rumination works best, but empirical evidence is lacking. In this study, angered participants hit a punching bag and thought about the person who had angered them (rumination group) or thought about becoming physically fit (distraction group). After hitting the punching bag, they reported how angry they felt. Next, they were given the chance to administer loud blasts of noise to the person who had angered them. There also was a no punching bag control group. People in the rumination group felt angrier than did people in the distraction or control groups. People in the rumination group were also most aggressive, followed respectively by people in the distraction and control groups. Rumination increased rather than decreased anger and aggression. Doing nothing at all was more effective than venting anger. These results directly contradict catharsis theory.
Interesting. A cursory !scholar search indicates these results have replicated, but I haven't done an in-depth review.
I never thought I'd be seriously testing the reasoning abilities of an AI in 2020.
Looking back, history feels easy to predict; hindsight + the hard work of historians makes it (feel) easy to pinpoint the key portents. Given what we think about AI risk, in hindsight, might this have been the most disturbing development of 2020 thus far?
I personally lean towards "no", because this scaling seemed somewhat predictable from GPT-2 (flag - possible hindsight bias), and because 2020 has been so awful so far. But it seems possible, at least. I don't really know what update GPT-3 is to my AI risk estimates & timelines.
DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3 . Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?
An exercise in the companion workbook to the Feynman Lectures on Physics asked me to compute a rather arduous numerical simulation. At first, this seemed like a "pass" in favor of an exercise more amenable to analytic and conceptual analysis; arithmetic really bores me. Then, I realized I was being dumb - I'm a computer scientist.
Suddenly, this exercise became very cool, as I quickly figured out the equations and code, crunched the numbers in an instant, and churned out a nice scatterplot. This seems like a case where cross-domain competence is unusually helpful (although it's not like I had to bust out any esoteric theoretical CS knowledge). I'm wondering whether this kind of thing will compound as I learn more and more areas; whether previously arduous or difficult exercises become easy when attacked with well-honed tools and frames from other disciplines.
https://twitter.com/ai_risks/status/1765439554352513453 Unlearning dangerous knowledge by using steering vectors to define a loss function over hidden states. in particular, the ("I am a novice at bioweapons" - "I am an expert at bioweapons") vector. lol.
(it seems to work really well!)
I really like Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve. A bunch of concrete oddities and quirks of GPT-4, understood via several qualitative hypotheses about the typicality of target inputs/outputs:
...we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4’s accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13%
Speculation: RL rearranges and reweights latent model abilities, which SL created. (I think this mostly isn't novel, just pulling together a few important threads)
Suppose I supervised-train a LM on an English corpus, and I want it to speak Spanish. RL is inappropriate for the task, because its on-policy exploration won't output interestingly better or worse Spanish completions. So there's not obvious content for me to grade.
More generally, RL can provide inexact gradients away from undesired behavior (e.g. negative reinforcement event -> downweigh...
When I think about takeoffs, I notice that I'm less interested in GDP or how fast the AI's cognition improves, and more on how AI will affect me, and how quickly. More plainly, how fast will shit go crazy for me, and how does that change my ability to steer events?
For example, assume unipolarity. Let architecture Z be the architecture which happens to be used to train the AGI.
The Pfizer phase 3 study's last endpoint is 7 days after the second shot. Does anyone know why the CDC recommends waiting 2 weeks for full protection? Are they just being the CDC again?
The framing effect & aversion to losses generally cause us to execute more cautious plans. I’m realizing this is another reason to reframe my x-risk motivation from “I won’t let the world be destroyed” to “there’s so much fun we could have, and I want to make sure that happens”. I think we need more exploratory thinking in alignment research right now.
(Also, the former motivation style led to me crashing and burning a bit when my hands were injured and I was no longer able to do much.)
ETA: actually, i’m realizing I had the effect backwards. Framing via
...Excellent retrospective/update. I'm intimately familiar with the emotional difficulty of changing your mind and admitting you were wrong.
Friendship is Optimal is science fiction:
...On this 11th anniversary of the release of Friendship is Optimal, I’d like to remind everyone that it’s a piece of speculative fiction and was a product of it’s time. I’ve said this before in other venues, but Science Marches On and FiO did not predict how things have turned out. The world looks very different.
A decade ago, people speculated that AI would think symbolically a
Who a decade ago thought that AI would think symbolically? I'm struggling to think of anyone. There was a debate on LW though, around "cleanly designed" versus "heuristics based" AIs, as to which might come first and which one safety efforts should be focused around. (This was my contribution to it.)
If someone had followed this discussion, there would be no need for dramatic updates / admissions of wrongness, just smoothly (more or less) changing one's credences as subsequent observations came in, perhaps becoming increasingly pessimistic if one's hope for AI safety mainly rested on actual AIs being "cleanly designed" (as Eliezer's did). (I guess I'm a bit peeved that you single out an example of "dramatic update" for praise, while not mentioning people who had appropriate uncertainty all along and updated constantly.)
I'm currently excited about a "macro-interpretability" paradigm. To quote Joseph Bloom:
...TLDR: Documenting existing circuits is good but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocated limited resources such as residual stream and weights between different learnable circuit seems important.
The general topic I think we are getting at is something like "circuit economics". The thing I'm trying to gesture at is that while circuits might deliver value in distinct ways (such as redu
I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...
And insofar as this impression is correct, this is a mistake. There is only one way alignment is.
If inner/outer is altogether a more faithful picture of those dynamics:
Over the last 2.5 years, I've read a lot of math textbooks. Not using Anki / spaced repetition systems over that time has been an enormous mistake. My factual recall seems worse-than-average among my peers, but when supplemented with Anki, it's far better than average (hence, I was able to learn 2000+ Japanese characters in 90 days, in college).
I considered using Anki for math in early 2018, but I dismissed it quickly because I hadn't had good experience using that application for things which weren't languages. I should have at least tried to see if...
An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.
Today, let's read about GPT-3's obsession with Shrek:
...As for me, I think Shrek is important because the most valuable thing in life is happiness. I mean this quite literally. There's a mountain of evidence for it, if you're willing to look at the research. And I think movies can help us get there. Or at least not get in the way.
Now, when I say "happiness," I'm not talking about the transient buzz that you get from, say, heroin. I'm talking about a sense of fulfillment. A sense that you are where you're meant to be. That you are doing what you're meant
Basilisks are a great example of plans which are "trying" to get your plan evaluation procedure to clock in a huge upwards error. Sensible beings avoid considering such plans, and everything's fine. I am somewhat worried about an early-training AI learning about basilisks before the AI is reflectively wise enough to reject the basilisks.
For example:
- Pretraining on a corpus in which people worry about basilisks could elevate reasoning about basilisks to the AI's consideration,
- at which point the AI reasons in more detail because it's not...
80% credence: It's very hard to train an inner agent which reflectively equilibrates to an EU maximizer only over commonly-postulated motivating quantities (like # of diamonds
or # of happy people
or reward-signal
) and not quantities like (# of times I have to look at a cube in a blue room
or -1 * subjective micromorts accrued
).
Intuitions:
My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.
My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even...
If Hogwarts spits back an error if you try to add a non-integer number of house points, and if you can explain the busy beaver function to Hogwarts, you now have an oracle which answers for arbitrary : just state " points to Ravenclaw!". You can do this for other problems which reduce to divisibility tests (so, any decision problem which you can somehow get Hogwarts to compute; if , ).
Homework: find a way to safely take over the world using this power, and no other magic.
When I imagine configuring an imaginary pile of blocks, I can feel the blocks in front of me in this fake imaginary plane of existence. I feel aware of their spatial relationships to me, in the same way that it feels different to have your eyes closed in a closet vs in an empty auditorium.
But what is this mental workspace? Is it disjoint and separated from my normal spatial awareness, or does my brain copy/paste->modify my real-life spatial awareness. Like, if my brother is five feet in front of me, and then I imagine a blade flying five feet in f...
The new "Broader Impact" NeurIPS statement is a good step, but incentives are misaligned. Admitting fatally negative impact would set a researcher back in their career, as the paper would be rejected.
Idea: Consider a dangerous paper which would otherwise have been published. What if that paper were published title-only on the NeurIPS website, so that the researchers can still get career capital?
Problem: How do you ensure resubmission doesn't occur elsewhere?
Cool Math Concept You Never Realized You Wanted: Fréchet distance.
Imagine a man traversing a finite curved path while walking his dog on a leash, with the dog traversing a separate one. Each can vary their speed to keep slack in the leash, but neither can move backwards. The Fréchet distance between the two curves is the length of the shortest leash sufficient for both to traverse their separate paths. Note that the definition is symmetric with respect to the two curves—the Frechet distance would be the same if the dog was walking its owner.
...The Fréche
Earlier today, I became curious why extrinsic motivation tends to preclude or decrease intrinsic motivation. This phenomenon is known as overjustification. There's likely agreed-upon theories for this, but here's some stream-of-consciousness as I reason and read through summarized experimental results. (ETA: Looks like there isn't consensus on why this happens)
My first hypothesis was that recognizing external rewards somehow precludes activation of curiosity-circuits in our brain. I'm imagining a kid engrossed in a puzzle. Then, they're told that they'll b
...Going through an intro chem textbook, it immediately strikes me how this should be as appealing and mysterious as the alchemical magic system of Fullmetal Alchemist. "The law of equivalent exchange" "conservation of energy/elements/mass (the last two holding only for normal chemical reactions)", etc. If only it were natural to take joy in the merely real...
Sobering look into the human side of AI data annotation:
...Instructions for one of the tasks he worked on were nearly identical to those used by OpenAI, which meant he had likely been training ChatGPT as well, for approximately $3 per hour.
“I remember that someone posted that we will be remembered in the future,” he said. “And somebody else replied, ‘We are being treated worse than foot soldiers. We will be remembered nowhere in the future.’ I remember that very well. Nobody will recognize the work we did or the effort we put in.”
I feel like people publish articles like this all the time, and usually when you do surveys these people definitely prefer to have the option to take this job instead of not having it, and indeed frequently this kind of job is actually much better than their alternatives. I feel like this article fails to engage with this very strong prior, and also doesn't provide enough evidence to overcome it.
From the ELK report:
...We can then train a model to predict these human evaluations, and search for actions that lead to predicted futures that look good.
For simplicity and concreteness you can imagine a brute force search. A more interesting system might train a value function and/or policy, do Monte-Carlo Tree Search with learned heuristics, and so on. These techniques introduce new learned models, and in practice we would care about ELK for each of them. But we don’t believe that this complication changes the basic picture and so we leave it ou
Argument that you can't use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart (or smarter) than the plan-generator.
Consider any situation where it's hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can't just sit by or deploy some simple traps in this situation.
Therefore, any pla...
Notes on behaviorism: After reading a few minutes about it, behaviorism seems obviously false. It views the "important part" of reward to be the external behavior which led to the reward. If I put my hand on a stove, and get punished, then I'm less likely to do that again in the future. Or so the theory goes.
But this seems, in fullest generality, wildly false. The above argument black-boxes the inner structure of human cognition which produces the externally observed behavior.
What actually happens, on my model, is that the stove makes your hand hot, which ...
Argument sketch for why boxing is doomed if the agent is perfectly misaligned:
Consider a perfectly misaligned agent which has -1 times your utility function—it's zero-sum. Then suppose you got useful output of the agent. This means you're able to increase your EU. This means the AI decreased its EU by saying anything. Therefore, it should have shut up instead. But since we assume it's smarter than you, it realized this possibility, and so the fact that it's saying something means that it expects to gain by hurting your interests via its output. Therefore, the output can't be useful.
The costs of (not-so-trivial) inconveniences
I like exercising daily. Some days, I want to exercise more than others—let's suppose that I actually benefit more from exercise on that day. Therefore, I have a higher willingness to pay the price of working out.
Consider the population of TurnTrouts over time, one for each day. This is a population of consumers with different willingnesses to pay, and so we can plot the corresponding exercise demand curve (with a fixed price). In this idealized model, I exercise whenever my willingness to pay exceeds the price.
B...
The discussion of the HPMOR epilogue in this recent April Fool's thread was essentially online improv, where no one could acknowledge that without ruining the pretense. Maybe I should do more improv in real life, because I enjoyed it!
AIDungeon's subscriber-only GPT-3 can do some complex arithmetic, but it's very spotty. Bold text is me.
...You say "What happens if I take the square root of 3i?"
The oracle says: "You'll get a negative number. [wrong] So, for example, the square root of is ." [correct]
"What?" you say.
"I just said it," the oracle repeats.
"But that's ridiculous! The square root of is not . It's complex. It's plus a multiple of ." [wrong, but my character is supposed to be playing dumb here]
The
Broca’s area handles syntax, while Wernicke’s area handles the semantic side of language processing. Subjects with damage to the latter can speak in syntactically fluent jargon-filled sentences (fluent aphasia) – and they can’t even tell their utterances don’t make sense, because they can’t even make sense of the words leaving their own mouth!
It seems like GPT2 : Broca’s area :: ??? : Wernicke’s area. Are there any cog psych/AI theories on this?
We can think about how consumers respond to changes in price by considering the elasticity of the quantity demanded at a given price - how quickly does demand decrease as we raise prices? Price elasticity of demand is defined as ; in other words, for price and quantity , this is (this looks kinda weird, and it wasn't immediately obvious what's happening here...). Revenue is the total amount of cash changing hands: .
What's happening here is that raising prices is a good idea when the revenue gained (the "pric
...How does representation interact with consciousness? Suppose you're reasoning about the universe via a partially observable Markov decision process, and that your model is incredibly detailed and accurate. Further suppose you represent states as numbers, as their numeric labels.
To get a handle on what I mean, consider the game of Pac-Man, which can be represented as a finite, deterministic, fully-observable MDP. Think about all possible game screens you can observe, and number them. Now get rid of the game screens. From the perspective of reinforcement lea
...Idea: Speed up ACDC by batching edge-checks. The intuition is that most circuits will have short paths through the transformer, because Residual Networks Behave Like Ensembles of Relatively Shallow Networks (https://arxiv.org/pdf/1605.06431.pdf). Most edges won't be in most circuits. Therefore, if you're checking KL of random-resampling edges e1 and e2, there's only a small chance that e1 and e2 interact with each other in a way important for the circuit you're trying to find. So statistically, you can check eg e1, ... e100 in a batch, and maybe ablate 95 ...
It'd be nice if "optimization objective" split out into:
There are more possible senses of the phrase, but I think these are commonly conflated. EG "The optimization objective of RL is the reward function" should, by default, mean 2) and not 1). But because we use similar words, I think the claims become muddled, and it's not even clear if/when/who is messing this up or not.
(T...
Handling compute overhangs after a pause.
Sometimes people object that pausing AI progress for e.g. 10 years would lead to a "compute overhang": At the end of the 10 years, compute will be cheaper and larger than at present-day. Accordingly, once AI progress is unpaused, labs will cheaply train models which are far larger and smarter than before the pause. We will not have had time to adapt to models of intermediate size and intelligence. Some people believe this is good reason to not pause AI progress.
There seem to be a range of relatively simple pol...
Wikipedia has an unfortunate and incorrect-in-generality description of reinforcement learning (emphasis added)
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.
Later in the article, talking about basic optimal-control inspired approaches:
...The purpose of reinforcement learning is for the agent to learn an optimal, or nearly-optimal, policy that maximizes the "reward function" or other user-provided reinforcement signal
Idea for getting weak-in-expectation evidence about deception:
Be cautious with sequences-style "words don't matter, only anticipations matter." (At least, this is an impression I got from the sequences, and could probably back this up.) Words do matter insofar as they affect how your internal values bind. If you decide that... (searches for non-political example) monkeys count as "people", that will substantially affect your future decisions via e.g. changing your internal "person" predicate, which in turn will change how different downstream shards activate (like "if person harmed, be less likely to execute plan", a...
Why don't people reinforcement-learn to delude themselves? It would be very rewarding for me to believe that alignment is solved, everyone loves me, I've won at life as hard as possible. I think I do reinforcement learning over my own thought processes. So why don't I delude myself?
On my model of people, rewards provide ~"policy gradients" which update everything, but most importantly shards. I think eg the world model will have a ton more data from self-supervised learning, and so on net most of its bits won't come from reward gradients.
For example, if I ...
How the power-seeking theorems relate to the selection theorem agenda.
Idea: Expert prediction markets on predictions made by theories in the field, with $ for being a good predictor and lots of $ for designing and running a later-replicated experiment whose result the expert community strongly anti-predicted. Lots of problems with the plan, but surprisal-based compensation seems interesting and I haven't heard about it before.
What is "real"? I think about myself as a computation embedded in some other computation (i.e. a universe-history). I think "real" describes hypotheses about the environment where my computation lives. What should I think is real? That which an "ideal embedded reasoner" would assign high credence. However that works.
This sensibly suggests that Gimli-in-actual-Ea (LOTR) should believe he lives in Ea, and that Ea is real, even though it isn't our universe's Earth. Also, the notion accounts for indexical uncertainty by punting it to how embedded reasoning sho...
ordinal preferences just tell you which outcomes you like more than others: apples more than oranges.
Interval scale preferences assign numbers to outcomes, which communicates how close outcomes are in value: kiwi 1, orange 5, apple 6. You can say that apples have 5 times the advantage over kiwis that they do over oranges, but you can't say that apples are six times as good as kiwis. Fahrenheit and Celsius are also like this.
Ratio scale ("rational"? 😉) preferences do let you say that apples are six times as good as kiwis, and you need this property to maxi
...My autodidacting has given me a mental reflex which attempts to construct a gears-level explanation of almost any claim I hear. For example, when listening to “Listen to Your Heart” by Roxette:
Listen to your heart,
There’s nothing else you can do
I understood what she obviously meant and simultaneously found myself subvocalizing “she means all other reasonable plans are worse than listening to your heart - not that that’s literally all you can do”.
This reflex is really silly and annoying in the wrong context - I’ll fix it soon. But it’s pretty amusing
...One of the reasons I think corrigibility might have a simple core principle is: it seems possible to imagine a kind of AI which would make a lot of different possible designers happy. That is, if you imagine the same AI design deployed by counterfactually different agents with different values and somewhat-reasonable rationalities, it ends up doing a good job by almost all of them. It ends up acting to further the designers' interests in each counterfactual. This has been a useful informal way for me to think about corrigibility, when considering different
...I had an intuition that attainable utility preservation (RL but you maintain your ability to achieve other goals) points at a broader template for regularization. AUP regularizes the agent's optimal policy to be more palatable towards a bunch of different goals we may wish we had specified. I hinted at the end of Towards a New Impact Measure that the thing-behind-AUP might produce interesting ML regularization techniques.
This hunch was roughly correct; Model-Agnostic Meta-Learning tunes the network parameters such that they can be quickly adapted to achiev
...On applying generalization bounds to AI alignment. In January, Buck gave a talk for the Winter MLAB. He argued that we know how to train AIs which answer on-distribution questions at least as well as the labeller does. I was skeptical. IIRC, his argument had the following structure:
...Premises:
1. We are labelling according to some function f and loss function L.
2. We train the network on datapoints (x, f(x)) ~ D_train.
3. Learning theory results give (f, L)-bounds on D_train.
Conclusions:
4. The network should match f's labels on the rest of D_train, on av
People sometimes think of transformers as specifying joint probability distributions over token sequences up to a given context length. However, I think this is sometimes not the best abstraction. From another POV, transformers take as input embedding vectors of a given dimension , and map to another output vector of dimension (# of tokens).
This is important in that it emphasizes the implementation of the transformer, and helps avoid possible missteps around thinking of transformers as "trying to model text" or various forms of finetuning ...
I think that the training goal of "the AI never makes a catastrophic decision" is unrealistic and unachievable and unnecessary. I think this is not a natural shape for values to take. Consider a highly altruistic man with anger problems, strongly triggered by e.g. a specifc vacation home. If he is present with his wife at this home, he beats her. As long as he starts off away from the home, and knows about his anger problems, he will be motivated to resolve his anger problems, or at least avoid the triggering contexts / take other precautions to ensure her...
"Goodhart" is no longer part of my native ontology for considering alignment failures. When I hear "The AI goodharts on some proxy of human happiness", I start trying to fill in a concrete example mind design which fits that description and which is plausibly trainable. My mental events are something like:
Condition on: AI with primary value shards oriented around spurious correlate of human happiness; AI exhibited deceptive alignment during training, breaking perceived behavioral invariants during its sharp-capabilities-gain
Warning: No history
...
Excalidraw is now quite good and works almost seamlessly on my iPad. It's also nice to use on the computer. I recommend it to people who want to make fast diagrams for their posts.
Reading EY's dath ilan glowfics, I can't help but think of how poor English is as a language to think in. I wonder if I could train myself to think without subvocalizing (presumably it would be too much work to come up with a well-optimized encoding of thoughts, all on my own, so no new language for me). No subvocalizing might let me think important thoughts more quickly and precisely.
I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around?
My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its parameter space being 3-colored as follows:
parameter vector+training process+other initial conditions
leads to a nothingburger (a non-functional model)I'd like to see research exploring the relevance of intragenomic conflict to AI alignment research. Intragenomic conflict constitutes an in-the-wild example of misalignment, where conflict arises "within an agent" even though the agent's genes have strong instrumental incentives to work together (they share the same body).
In an interesting parallel to John Wentworth's Fixing the Good Regulator Theorem, I have an MDP result that says:
Suppose we're playing a game where I give you a reward function and you give me its optimal value function in the MDP. If you let me do this for reward functions (one for each state in the environment), and you're able to provide the optimal value function for each, then you know enough to reconstruct the entire environment (up to isomorphism).
Roughly: being able to complete linearly many tasks in the state space means you ha...
I read someone saying that ~half of the universes in a neighborhood of ours went to Trump. But... this doesn't seem right. Assuming Biden wins in the world we live in, consider the possible perturbations to the mental states of each voter. (Big assumption! We aren't thinking about all possible modifications to the world state. Whatever that means.)
Assume all 2020 voters would be equally affected by a perturbation (which you can just think of as a decision-flip for simplicity, perhaps). Since we're talking about a neighborhood ("worlds pretty close to ours"...
Epistemic status: not an expert
Understanding Newton's third law, .
Consider the vector-valued velocity as a function of time, . Scale this by the object's mass and you get the momentum function over time. Imagine this momentum function wiggling around over time, the vector from the origin rotating and growing and shrinking.
The third law says that force is the derivative of this rescaled vector function - if an object is more massive, then the same displacement of this rescaled arrow is a proportionally smaller velocity modification, because o...
Tricking AIDungeon's GPT-3 model into writing HPMOR:
You start reading Harry Potter and the Methods of Rationality by Eliezer Yudkowsky:
" "It said to me," said Professor Quirrell, "that it knew me, and that it would hunt me down someday, wherever I tried to hide." His face was rigid, showing no fright.
"Ah," Harry said. "I wouldn't worry about that, Professor Quirrell." It's not like Dementors can actually talk, or think; the structure they have is borrowed from your own mind and expectations...
Now
ARCHES distinguishes between single-agent / single-user and single-agent/multi-user alignment scenarios. Given assumptions like "everyone in society is VNM-rational" and "societal preferences should also follow VNM rationality", and "if everyone wants a thing, society also wants the thing", Harsanyi's utilitarian theorem shows that the societal utility function is a linear non-negative weighted combination of everyone's utilities. So, in a very narrow (and unrealistic) setting, Harsanyi's theorem tells you how the single-multi solution is built from the si
...Dylan: There’s one example that I think about, which is, say, you’re cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it’s actually suggesting the wrong move or a bad move. Would you call that system value-aligned?
Lucas: No, I would not.
...Dylan: I think most people wouldn’t. Now, what if I told you that that program was act
On page 22 of Probabilistic reasoning in intelligent systems, Pearl writes:
Raw experiential data is not amenable to reasoning activities such as prediction and planning; these require that data be abstracted into a representation with a coarser grain. Probabilities are summaries of details lost in this abstraction...
An agent observes a sequence of images displaying either a red or a blue ball. The balls are drawn according to some deterministic rule of the time step. Reasoning directly from the experiential data leads to ~Solomonoff induction. What mig
...We can imagine aliens building a superintelligent agent which helps them get what they want. This is a special case of aliens inventing tools. What kind of general process should these aliens use – how should they go about designing such an agent?
Assume that these aliens want things in the colloquial sense (not that they’re eg nontrivially VNM EU maximizers) and that a reasonable observer would say they’re closer to being rational than antirational. Then it seems[1] like these aliens eventually steer towards reflectively coherent rationality (provided they
...It seems to me that Zeno's paradoxes leverage incorrect, naïve notions of time and computation. We exist in the world, and we might suppose that that the world is being computed in some way. If time is continuous, then the computer might need to do some pretty weird things to determine our location at an infinite number of intermediate times. However, even if that were the case, we would never notice it – we exist within time and we would not observe the external behavior of the system which is computing us, nor its runtime.
Very rough idea
In 2018, I started thinking about corrigibility as "being the kind of agent lots of agents would be happy to have activated". This seems really close to a more ambitious version of what AUP tries to do (not be catastrophic for most agents).
I wonder if you could build an agent that rewrites itself / makes an agent which would tailor the AU landscape towards its creators' interests, under a wide distribution of creator agent goals/rationalities/capabilities. And maybe you then get a kind of generalization, where most simple algorithms which solve this solve ambitious AI alignment in full generality.
AFAICT, the deadweight loss triangle from eg price ceilings is just a lower bound on lost surplus. inefficient allocation to consumers means that people who value good less than market equilibrium price can buy it, while dwl triangle optimistically assumes consumers with highest willingness to buy will eat up the limited supply.
I was having a bit of trouble holding the point of quadratic residues in my mind. I could effortfully recite the definition, give an example, and walk through the broad-strokes steps of proving quadratic reciprocity. But it felt fake and stale and memorized.
Alex Mennen suggested a great way of thinking about it. For some odd prime , consider the multiplicative group . This group is abelian and has even order . Now, consider a primitive root / generator . By definition, every element of the group can be expressed as . The quadratic residues ar
...I noticed I was confused and liable to forget my grasp on what the hell is so "normal" about normal subgroups. You know what that means - colorful picture time!
First, the classic definition. A subgroup is normal when, for all group elements , (this is trivially true for all subgroups of abelian groups).
ETA: I drew the bounds a bit incorrectly; is most certainly within the left coset ().
Notice that nontrivial cosets aren't subgroups, because they don't have the identity .
This "normal" thing matters because sometimes we want to highlight regu
...The existence of the human genome yields at least two classes of evidence which I'm strongly interested in.
Transplanting algorithms into randomly initialized networks. I wonder if you could train a policy network to walk upright in sim, back out the "walk upright" algorithm, randomly initialize a new network which can call that algorithm as a "subroutine call" (but the walk-upright weights are frozen), and then have the new second model learn to call that subroutine appropriately? Possibly the learned representations would be convergently similar enough to interface quickly via SGD update dynamics.
If so, this provides some (small, IMO) amount of rescue fo...
When I was younger, I never happened to "look in the right direction" on my own in order to start the process of becoming agentic and coherent. Here are some sentences I wish I had heard when I was a kid:
Plausibly just hearing this would have done it for me, but probably that's too optimistic.
What's up with biological hermaphrodite species? My first reaction was, "no way, what about the specialization benefits from sexual dimorphism?"
There are apparently no hermaphrodite mammal or bird species, which seems like evidence supporting my initial reaction. But there are, of course, other hermaphrodite species—maybe they aren't K-strategists, and so sexual dimorphism and role specialization isn't as important?
I went into a local dentist's office to get more prescription toothpaste; I was wearing my 3M p100 mask (with a surgical mask taped over the exhaust, in order to protect other people in addition to the native exhaust filtering offered by the mask). When I got in, the receptionist was on the phone. I realized it would be more sensible for me to wait outside and come back in, but I felt a strange reluctance to do so. It would be weird and awkward to leave right after entering. I hovered near the door for about 5 seconds before actually leaving. I was pretty ...
Continuous functions can be represented by their rational support; in particular, for each real number , choose a sequence of rational numbers converging to , and let .
Therefore, there is an injection from the vector space of continuous functions to the vector space of all sequences : since the rationals are countable, enumerate them by . Then the sequence represents continuous function .
(Just starting to learn microecon, so please feel free to chirp corrections)
How diminishing marginal utility helps create supply/demand curves: think about the uses you could find for a pillow. Your first few pillows are used to help you fall asleep. After that, maybe some for your couch, and then a few spares to keep in storage. You prioritize pillow allocation in this manner; the value of the latter uses is much less than the value of having a place to rest your head.
How many pillows do you buy at a given price point? Well, if you buy any, you'll buy som
...Per my recent chat with it, chatgpt 3.5 seems "situationally aware"... but nothing groundbreaking has happened because of that AFAICT.
From the LW wiki page:
...Ajeya Cotra uses the term "situational awareness" to refer to a cluster of skills including “being able to refer to and make predictions about yourself as distinct from the rest of the world,” “understanding the forces out in the world that shaped you and how the things that happen to you continue to be influenced by outside forces,” “understanding your position in the world relative to other actors who
Thoughts on "Deep Learning is Robust to Massive Label Noise."
...We show that deep neural networks are capable of generalizing from training data for which true labels are massively outnumbered by incorrect labels. We demonstrate remarkably high test performance after training on corrupted data from MNIST, CIFAR, and ImageNet. For example, on MNIST we obtain test accuracy above 90 percent even after each clean training example has been diluted with 100 randomly-labeled examples. Such behavior holds across multiple patterns of label noise, even when erroneous l
Offline RL can work well even with wrong reward labels. I think alignment discourse over-focuses on "reward specification." I think reward specification is important, but far from the full story.
To this end, a new paper (Survival Instinct in Offline Reinforcement Learning) supports Reward is not the optimization target and associated points that reward is a chisel which shapes circuits inside of the network, and that one should fully consider the range of sources of parameter updates (not just those provided by a reward signal).
Some relevant qu...
How can I make predictions in a way which lets me do data analysis? I want to be able to grep / tag questions, plot calibration over time, split out accuracy over tags, etc. Presumably exporting to a CSV should be sufficient. PredictionBook doesn't have an obvious export feature, and its API seems to not be working right now / I haven't figured it out yet.
Trying to collate team shard's prediction results and visualize with plotly, but there's a lot of data processing that has to be done first. Want to avoid the pain in the future.
Consider trying to use Solomonoff induction to reason about P(I see “Canada goes to war with USA" in next year), emphasis added:
...In Solomonoff induction, since we have unlimited computing power, we express our uncertainty about a video frame the same way. All the various pixel fields you could see if your eye jumped to a plausible place, saw a plausible number of dust specks, and saw the box flash something that visually encoded '14', would have high probability. Pixel fields where the box vanished and was replaced with a glow-in-the-dar
Team shard is now accepting applications for summer MATS. SERI MATS is now accepting applications for their 4.0 program this summer. In particular, consider applying to the shard theory stream, especially if you have the following interests:
Feel free to apply if you're interested in shard theory more generally, although I expect to mostly supervise empirical work. Feel free to message me if you have questi...
The policy of truth is a blog post about why policy gradient/REINFORCE suck. I'm leaving a shortform comment because it seems like a classic example of wrong RL theory and philosophy, since reward is not the optimization target. Quotes:
Our goal remains to find a policy that maximizes the total reward after time steps.
And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions:
...If you start with a reward function whose values are in and you subtract one million
Shard-theoretic model of wandering thoughts: Why trained agents won't just do nothing in an empty room. If human values are contextually activated subroutines etched into us by reward events (e.g. "If candy nearby and hungry, then upweight actions which go to candy"), then what happens in "blank" contexts? Why don't people just sit in empty rooms and do nothing?
Consider that, for an agent with lots of value shards (e.g. candy, family, thrill-seeking, music), the "doing nothing" context is a very unstable equilibrium. I think these shards will activate on t...
Are there any alignment techniques which would benefit from the supervisor having a severed corpus callosum, or otherwise atypical neuroanatomy? Usually theoretical alignment discourse focuses on the supervisor's competence / intelligence. Perhaps there are other, more niche considerations.
Does anyone have tips on how to buy rapid tests in the US? Not seeing any on US Amazon, not seeing any in person back where I'm from. Considering buying German tests. Even after huge shipping costs, it'll come out to ~$12 a test, which is sadly competitive with US market prices.
Wasn't able to easily find tests on the Mexican and Canadian Amazon websites, and other EU countries don't seem to have them either.
The Baldwin effect
I couldn't find great explanations online, so here's my explanation after a bit of Googling. I welcome corrections from real experts.
Organisms exhibit phenotypic plasticity when they act differently in different environments. The phenotype (manifested traits: color, size, etc) manifests differently, even though two organisms might share the same genotype (genetic makeup).
...(This is a basic point on conjunctions, but I don't recall seeing its connection to Occam's razor anywhere)
When I first read Occam's Razor back in 2017, it seemed to me that the essay only addressed one kind of complexity: how complex the laws of physics are. If I'm not sure whether the witch did it, the universes where the witch did it are more complex, and so these explanations are exponentially less likely under a simplicity prior. Fine so far.
But there's another type. Suppose I'm weighing whether the United States government is currently engaged in a v...
At a poster session today, I was asked how I might define "autonomy" from an RL framing; "power" is well-definable in RL, and the concepts seem reasonably similar.
I think that autonomy is about having many ways to get what you want. If your attainable utility is high, but there's only one trajectory which really makes good things happen, then you're hemmed-in and don't have much of a choice. But if you have many policies which make good things happen, you have a lot of slack and you have a lot of choices. This would be a lot of autonomy.
This has to b...
In Markov decision processes, state-action reward functions seem less natural to me than state-based reward functions, at least if they assign different rewards to equivalent actions. That is, actions at a state can have different reward even though they induce the same transition probabilities: . This is unappealing because the actions don't actually have a "noticeable difference" from within the MDP, and the MDP is visitation-distribution-isomorphic to an MDP without the act...
The answer to this seems obvious in isolation: shaping helps with credit assignment, rescaling doesn't (and might complicate certain methods in the advantage vs Q-value way). But I feel like maybe there's an important interaction here that could inform a mathematical theory of how a reward signal guides learners through model space?
Reasoning about learned policies via formal theorems on the power-seeking incentives of optimal policies
One way instrumental subgoals might arise in actual learned policies: we train a proto-AGI reinforcement learning agent with a curriculum including a variety of small subtasks. The current theorems show sufficient conditions for power-seeking tending to be optimal in fully-observable environments; many environments meet these sufficient conditions; optimal policies aren't hard to compute for the subtasks. One highly transferable heuristic would therefore...
I prompted GPT-3 with modified versions of Eliezer's Beisutsukai stories, where I modified the "class project" to be about solving intent alignment instead of quantum gravity.
...... Taji looked over his sheets. "Okay, I think we've got to assume that every avenue that Eld science was trying is a blind alley, or they would have found it. And if this is possible to do in one month, the answer must be, in some sense, elegant. So no human mistake models. If we start doing anything that looks like we should call it 'utility function patching', we'd better st
Physics has existed for hundreds of years. Why can you reach the frontier of knowledge with just a few years of study? Think of all the thousands of insights and ideas and breakthroughs that have been had - yet, I do not imagine you need most of those to grasp modern consensus.
Idea 1: the tech tree is rather horizontal - for any given question, several approaches and frames are tried. Some are inevitably more attractive or useful. You can view a Markov decision process in several ways - through the Bellman equations, through the structure of the state