All of Jon Garcia's Comments + Replies

I would expect that for model-based RL, the more powerful the AI is at predicting the environment and the impact of its actions on it, the less prone it becomes to Goodharting its reward function. That is, after a certain point, the only way to make the AI more powerful at optimizing its reward function is to make it better at generalizing from its reward signal in the direction that the creators meant for it to generalize.

In such a world, when AIs are placed in complex multiagent environments where they engage in iterated prisoner's dilemmas, the more int... (read more)

Disclaimer: I am not a medical doctor nor a nutritionist, just someone who researches nutrition from time to time.

I would be surprised if protein deficiency per se was the actual problem. As I understand it, many vegetables actually have more protein per calorie than meat (probably due to the higher fat content of the latter, since fat is more calorie-dense), although obviously there's less protein per unit mass than in meat (since vegetables are mostly cellulose and water). The point is, though, that if you were getting enough calories to function ... (read more)

Unfortunately, I do not have useful links for this; my understanding comes from non-English podcasts by a nutritionist. Please do not rely on my memory, but maybe this can be helpful for localizing good hypotheses. As I remember it, one complication of veg*n diets and amino acids is that which amino acids your body can produce and which are effectively essential can depend on your personal genes. In the podcast they mentioned that, especially among males, there is a fraction of the population who would need to supplement some "non-essential" amino acids to stay healthy on a veg*n diet. Since these nutrients are usually not considered worth attention (most people really do not need to think about them separately, and most do not restrict their diet to avoid animal sources), they are not included in the usual supplements and nutrition advice (I think the term is "meat-based bioactive compounds"). I think Elizabeth also emphasized this aspect in this post.

Congrats! I went through this thought process as well, and one of your three hypotheses above seems like the right one. Vitamin D isn't the issue (I have tests for it and have heavily supplemented for years), and sulfur itself isn't an issue (onions and broccoli are both pretty big in my diet). However, the lack of sulfur amino acids is the lead hypothesis.

Over the years, I had slowly shifted my diet more and more plant based: lots of vegetables, with occasional meat and a piece of fish every couple of days. As you mentioned, not all protein is create... (read more)

The potato is closest to meat in having about 36 calories per gram of protein and all essential amino acids, but it's still only about 30% of the way to being as proteinaceous as non-lean steak, or 15% of the way to lean steak. As a highly active exerciser, OP may have needed up to 2.5x the protein of a typical person, while only requiring 40% more calories. So no, a low-meat diet that would sustain the protein needs of a low-activity person doesn't always straightforwardly translate to a high-activity lifestyle. An 80 kg man at low activity might need 64 g protein/day and 2500 calories. If he starts exercising intensively, he may need 3500 calories and up to 160 g protein. It's not possible to do that with potatoes: 160 g of protein from potatoes entails eating almost 6000 calories. The only thing that would get you there on a vegan/vegetarian diet is protein powder, which is as protein-dense as lean steak at about 100 calories per 15 g of protein.
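As a sanity check on the arithmetic above, here is a short sketch using the comment's own approximate figures (illustrative round numbers, not authoritative nutrition data):

```python
# Approximate figures from the comment above; illustrative only.
POTATO_CAL_PER_G_PROTEIN = 36          # ~36 calories per gram of protein
POWDER_CAL_PER_G_PROTEIN = 100 / 15    # protein powder: ~100 cal per 15 g

protein_target_g = 160                 # high-activity need for an 80 kg man

# Calories required to hit the protein target from each source:
potato_calories = protein_target_g * POTATO_CAL_PER_G_PROTEIN
powder_calories = protein_target_g * POWDER_CAL_PER_G_PROTEIN

print(potato_calories)         # 5760 -- "almost 6000 calories"
print(round(powder_calories))  # 1067 -- easily within a 3500-calorie budget
```

The potato number lands right at the "almost 6000 calories" figure, while protein powder hits the same target in roughly a third of the daily calorie budget.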

Due to LayerNorm, it's hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.

If I'm interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I'm still not quite clear on why LayerNorm doesn't take care of this.
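One quick way to see why LayerNorm alone doesn't take care of it: LayerNorm is invariant to rescaling the whole residual stream, so only the *relative* magnitudes of features survive normalization. A minimal NumPy sketch (the vectors here are toy values of my own choosing, not from the post):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm over the feature dimension (no learned scale/bias).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

resid = np.array([1.0, -2.0, 3.0, 0.5])       # toy residual stream
new_feature = np.array([0.0, 0.0, 0.0, 1.0])  # toy feature direction

# LayerNorm is (nearly) scale-invariant: rescaling the whole stream is a no-op.
print(np.allclose(layer_norm(resid), layer_norm(1000 * resid)))  # True

# So a network can't shrink the stream to fix growth; it can only make new
# features relatively larger so they dominate after normalization.
out_small = layer_norm(resid + new_feature)
out_big = layer_norm(resid + 1.045 * new_feature)
```

Under this view, exponential growth of the stream is not something LayerNorm opposes; it simply renormalizes whatever relative mixture it is handed.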

To avoid this phenomenon, one idea that springs to mind is to adjust how the residual stream operates. For a neural network module f, the residua... (read more)

I understand the network's "intention" the other way around, I think that the network wants to have an exponentially growing residual stream. And in order to get an exponentially growing residual stream the model increases its weights exponentially. And our speculation for why the model would want this is our "favored explanation" mentioned above.

If both images have the main object near the middle of the image or taking up most of the space (which is usually the case for single-class photos taken by humans), then yes. Otherwise, summing two images with small, off-center items will just look like a low-contrast, noisy image of two items.

Either way, though, I would expect this to result in class-label ambiguity. However, in some cases of semi-transparent-object-overlay, the overlay may end up mixing features in such a jumbled way that neither of the "true" classes is discernible. This would be a case... (read more)

Yes this is exactly right. This is precisely the kind of linearity that I am talking about not the input->output mapping which is clearly nonlinear. The idea being that hidden inside the network is a linear latent space where we can perform linear operations and they (mostly) work. In the points of evidence in the post there is discussion of exactly this kind of latent space editing for stable diffusion. A nice example is this paper. Interestingly this also works for fine-tuning weight diffs for e.g. style transfer.

For an image-classification network, if we remove the softmax nonlinearity from the very end, then x would represent the input image in pixel space, and f(x) would represent the class logits. Then f(x_1 + x_2) = f(x_1) + f(x_2) would represent an image with two objects leading to an ambiguous classification (high log-probability for both classes), and f(a·x) = a·f(x) would represent higher class certainty (softmax temperature = 1/a) when the image has higher contrast. I guess that kind of makes sense, but yeah, I think for real ... (read more)
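The temperature claim can be checked directly: scaling the logits by a factor a before the softmax is identical to evaluating the unscaled logits at temperature 1/a. A minimal sketch (the example numbers are my own):

```python
import numpy as np

def softmax(z, temperature=1.0):
    # Numerically stable softmax at a given sampling temperature.
    z = np.asarray(z, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
a = 3.0

# Scaling logits by a == sampling the original logits at temperature 1/a:
print(np.allclose(softmax(a * logits), softmax(logits, temperature=1 / a)))  # True
```

So in a linear-up-to-the-softmax picture, "higher contrast input" translating to "sharper class distribution" is exactly a temperature change.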

5Steven Byrnes10mo
Well, averaging / adding two images in pixel space usually gives a thing that looks like two semi-transparent images overlaid, as opposed to “an image with two objects”.

I would say we want an ASI to view world-state-optimization from the perspective of a game developer. Not only should it create predictive models of what goals humans wish to achieve (from both stated and revealed preferences), but it should also learn to predict what difficulty level each human wants to experience in pursuit of those goals.

Then the ASI could aim to adjust the world into states where humans can achieve any goal they can think of when they apply a level of effort that would leave them satisfied in the accomplishment.

Humans don't want everyt... (read more)

I agree, hence the "if humanity never makes it to the long-term, this is a moot point."

Last I checked, you can get about 10x as much energy from burning a square meter of biosphere as you can get by collecting a square meter of sunlight for a day.

Even if this is true, it's only because that square meter of biosphere has been accumulating solar energy over an extended period of time. Burning biofuel may help accelerate things in the short term, but it will always fall short of long-term sustainability. Of course, if humanity never makes it to the long-term, this is a moot point.

Disassembling us for parts seems likely to be easier than buildin

... (read more)
5Brendan Long10mo
Yeah, but you might as well take the short-term boost from burning the biosphere and then put solar panels on top.

You heard the LLM, alignment is solved!

But seriously, it definitely has a lot of unwarranted confidence in its accomplishments.

I guess the connection to the real world is what will throw off such systems until they are trained on more real-world-like data.

I wouldn't phrase it that it needs to be trained on more data. More like it needs to be retrained within an actual R&D loop. Have it actually write and execute its own code, test its hypotheses, evaluate the results, and iterate. Use RLHF to evaluate its assessments and a debugger to evaluate its ... (read more)

I agree. A while back, I asked Does non-access to outputs prevent recursive self-improvement? I think that letting such systems learn from experiments with the real world is very dangerous.

"Activation space gradient descent" sounds a lot like what the predictive coding framework is all about. Basically, you compare the top-down predictions of a generative model against the bottom-up perceptions of an encoder (or against the low-level inputs themselves) to create a prediction error. This error signal is sent back up to modify the activations of the generative model, minimizing future prediction errors.

From what I know of Transformer models, it's hard to tell exactly where this prediction error would be generated. Perhaps during few-shot learn... (read more)

Here's a sketch of the predictive-coding-inspired model I think you propose: The initial layer predicts token i+1 from token i for all tokens. The job of each "predictive coding" layer would be to read all the true tokens and predictions from the residual streams, find the error between the prediction and the ground truth, then make a uniform update to all tokens to correct those errors. As in the dual form of gradient descent, where updating all the training data to be closer to a random model also allows you to update a test output to be closer to the output of a trained model, updating all the predicted tokens uniformly also moves prediction n+1 closer to the true token n+1. At the end, an output layer reads the prediction for n+1 out of the latent stream of token n.

This would be a cool way for language models to work:

* it puts next-token-prediction first and foremost, which is what we would expect for a model trained on next-token-prediction.
* it's an intuitive framing for people familiar with making iterative updates to models / predictions.
* it's very interpretable: at each step we can read off the model's current prediction from the latent stream of the final token (and because the architecture is horizontally homogeneous, we can read off the model's "predictions" for mid-sequence tokens too, though as you say they wouldn't be quite the same as the predictions you would get for truncated sequences).

But we have no idea if GPT works like this! I haven't checked if GPT has any circuits that fit this form; from what I've read of the Transformer Circuits sequence, they don't seem to have found predicted tokens in the residual streams. The activation space gradient descent theory is equally compelling, and equally unproven. Someone (you? me? Anthropic?) should poke around in the weights of an LLM and see if they can find something that looks like this.

AI has gotten even faster and associated with that there are people that worry about AI, you know, fairness, bias, social economic displacement. There are also the further out speculative worries about AGI, evil sentient killer robots, but I think that there are real worries about harms, possible real harms today and possibly other harms in the future that people worry about.

It seems that the sort of AI risks most people worry about fall into one of a few categories:

  1. AI/automation starts taking our jobs, amplifying economic inequalities.
  2. The spread of
... (read more)

Yep, ever since Gato, it's been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don't like that idea.

LLMs already have a lot of potential for causing bad outcomes if abused by humans for generating massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots trying to make it evil explicitly.

I would much rather live in a world where the first AGIs weren't b... (read more)

4Seth Herd10mo
Yes, the Auto-GPT approach does evaluate its potential plans against the goals it was given. So all you have to do to get some decent starting alignment is give it good high-level goals (which isn't trivial; don't tell it to reduce suffering or you may find out too late it had a solution you didn't intend...). But because it's also pretty easily interpretable, and can be made to at least start as corrigible with good top-level goals, there's a shot at correcting your alignment mistakes as they arise.

It seems to me that imitation requires some form of prediction in order to work. First make some prediction of the behavioral trajectory of another agent; then try to minimize the deviation of your own behavior from an equivalent trajectory. In this scheme, prediction constitutes a strict subset of the computational complexity necessary to enable imitation. How would GPT's task flip this around?

And if prediction is what's going on, in the much-more-powerful-than-imitation sense, what sort of training scheme would be necessary to produce pure imitation without also training the more powerful predictor as a prerequisite?

1Roman Leventov10mo
"Imitation of itself" constitutes prediction, although this phrase doesn't make much sense. Humans' "linguistic center" or "skill" predicts their own generated text with some veracity, but usually (unless the person is a professional linguist, or a translator, or a very talented writer) are bad at predicting others' generated text (i.e., styles). So, one vector of superhumanness is that GPT is trained to predict extremely wide range of styles, of tens of thousands of notable writers and speakers across the training corpus. Another vector of superhumanness is that GPTs are trained to produce this prediction autoregressively, "on the first try", whereas for people it may take many iterations to craft good writing, speech, or, perhaps most importantly, code. Then, since GPTs can match this skill "intuitively", in a single rollout, when GPTs are themselves applied iteratively, e.g. to iteratively critique and improve their own generation, this could produce a superhuman quality of code, or strategic planning, or rhetoric, etc.

First of all, I strongly agree that intelligence requires (or is exponentially easier to develop as) connectionist systems. However, I think that while big, inscrutable matrices may be unavoidable, there is plenty of room to make models more interpretable at an architectural level.

Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?

I have long thought that Transformer models are actually too general purpose for their own good. By... (read more)

I think the problem is that many things along the rough lines of what you're describing have been attempted in the past and have turned out to work not-so-well (TBF, they were also attempted with older systems; I'm not even sure if anyone's tried to make something like an expert system with a full-fledged transformer). The common wisdom the field has derived from those experiences seems to have been "stochastic gradient descent knows best, just throw your data into a function and let RNJesus sort it out". Which is... not the best lesson, IMO. I think there might be merit in the things you suggest, and they are intuitively appealing to me too. But as it turns out, when an industry is driven by racing to the goal rather than a genuine commitment to proper scientific understanding, it ends up taking all sorts of shortcuts. Who could have guessed.

Would it make sense to have a "Newbie Garden" section of the site? The idea would be to give new users a place to feel like they're contributing to the community, along with the understanding that the ideas shared there are not necessarily endorsed by the LessWrong community as a whole. A few thoughts on how it could work:

  • New users may be directed toward the Newbie Garden (needs a better name) if they try to make a post or comment, especially if a moderator deems their intended contribution to be low-quality. This could also happen by default for all users
... (read more)
I think this works at universities because teachers are paid to grade things (they wouldn't do it otherwise) and students get some legible-to-the-world certificate once they graduate.

Like, we already have a wealth of curriculum / as much content for newbies as they can stand; the thing that's missing is the peer reading group and mentors. We could probably construct peer reading groups (my simple idea is that you basically put people into groups based on what <month> they join, varying the duration until you get groups of the right size, and then you have some private-to-that-group forum / comment system / whatever), but I don't think we have the supply of mentors. [This is a crux--if someone thinks they have supply here or funding for it, I want to hear about it.]


If the alignment problem is the most important problem in history, shouldn't alignment-focused endeavors be more willing to hire contributors who can't/won't relocate?

And remote work has never been easier to implement than it is right now.

Of course there needs to be some filtering out of candidates to ensure resources are devoted to the most promising individuals. But I really don't think that willingness to move correlates strongly enough with competence at solving alignment to warrant treating it like a dealbreaker.

I was also looking to do alignment-focused work remotely, and then, while failing to find any appropriate[1] opportunities, had a bit of a wake-up call which led to me changing my mind. From the "inside", there are some pretty compelling considerations for avoiding remote work. "Context is that which is scarce" - the less "shovel-ready" the work is, the more important it is to have very high bandwidth communication.

I liked remote work at my last job because I was working at a tech company where we had quarterly planning cycles and projects were structured in a way such that everyone working remotely barely made a difference, most of the time. (There were a couple projects near the end where it was clearly a significant drag on our ability to make forward progress, due to the increasing number of stakeholders, and the difficulty of coordinating everything.)

LessWrong is a three-person[2] team, and if we spent basically all of our time developing features the way mature tech companies do, we could probably also be remote with maybe only a 30-40% performance penalty. But in fact a good chunk of our effort goes into attempting to backchain from "solve the alignment problem/end the acute risk period" into "what should we actually be doing". This often does involve working on LessWrong, but not 100% of the time. As an example, we're currently in the middle of a two-week "alignment sprint", where we're spending most of our time diving into object-level research. To say that this style of work[3] benefits from co-location would be understating things.

Now, I do think that LessWrong is on the far end of the spectrum here, but I think this is substantially true for most alignment orgs, given that they tend to be smaller and working in a domain that's both extremely high context and also fairly pre-paradigmatic. In general, coordination and management capacity are severely constrained, and remote work is at its best when you need less coordination effort to achie

No, utility functions are not a property of computer programs in general. They are a property of (a certain class of) agents.

A utility function is just a way for an agent to evaluate states, where positive values are good (for states the agent wants to achieve), negative values are bad (for states the agent wants to avoid), and neutral values are neutral (for states the agent doesn't care about one way or the other). This mapping from states to utilities can be anything in principle: a measure of how close to homeostasis the agent's internal state is, a me... (read more)
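The state-to-value mapping described above can be made concrete with a minimal sketch. All the state names, probabilities, and utilities here are my own illustrative choices:

```python
# A minimal sketch of a utility function as a state -> value mapping, and an
# agent choosing the action whose lottery has the highest expected utility.
utility = {
    "homeostasis": 1.0,   # a state the agent wants to achieve
    "neutral": 0.0,       # a state it doesn't care about either way
    "damage": -1.0,       # a state it wants to avoid
}

# Each action induces a lottery: a probability distribution over states.
actions = {
    "rest":    {"homeostasis": 0.9, "neutral": 0.1},
    "explore": {"homeostasis": 0.2, "neutral": 0.5, "damage": 0.3},
}

def expected_utility(lottery):
    return sum(p * utility[state] for state, p in lottery.items())

best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best)  # rest (expected utility 0.9 vs. 0.2 - 0.3 = -0.1)
```

Nothing here requires the agent to be a computer program in any interesting sense; the utility function is just whatever evaluation the agent's choices happen to be coherent with.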


The ONLY way for humans to maintain dominion over superintelligent AI in this scenario is if alignment was solved long before any superintelligent AI existed. And only then if this alignment solution were tailored specifically to produce robustly submissive motivational schemas for AGI. And only then if this solution were provably scalable to an arbitrary degree. And only then if this solution were well-enforced universally.

Even then, though, it's not really dominion. It's more like having gods who treat the universe as their playground but who also feel compelled to make sure their pet ants feel happy and important.

One of the earliest records of a hierarchical organization comes from the Bible (Exodus 18). Basically, Moses starts out completely "in touch with reality," judging all disputes among the Israelites from minor to severe, from dawn until dusk. His father in law, Jethro, notices that he is getting burnt out, so he gives him some advice on dividing up the load:

You will surely wear yourself out, as well as these people who are with you, because the task is too heavy for you. You cannot do it alone, by yourself. Now listen to my voice—I will give you advice.

... (read more)
Interesting (although I do think "judge" is a fairly different job than "manager", so I'd expect fairly different dynamics).
  1. GPT is called a “decoder only” architecture. Would “encoder only” be equally correct? From my reading of the original transformer paper, encoder and decoder blocks are the same except that decoder blocks attend to the final encoder block. Since GPT never attends to any previous block, if anything I feel like the correct term is “encoder only”.

I believe "encoder" refers exclusively to the part of the model that reads in text to generate an internal representation, while "decoder" refers exclusively to the part that takes the representation created by t... (read more)

2Charlie Steiner1y
Architecturally, I think the big difference is bi-directional (BERT can use future tokens to influence latent features of current tokens) vs. uni-directional (GPT only flows information from past to future). You could totally use the "encoder" to generate text, or the "decoder" to generate latent representations used for another task, though perhaps they're more suited for their typical roles. EDIT: Whoops, was wrong in initial version of comment.
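The bi-directional vs. uni-directional distinction comes down to the attention mask. This toy sketch just builds the two mask patterns (the variable names are mine):

```python
import numpy as np

# GPT-style (uni-directional / causal): token i may attend only to tokens <= i.
# BERT-style (bi-directional): every token may attend to every other token.
n = 4
causal_mask = np.tril(np.ones((n, n), dtype=bool))   # GPT-style
bidirectional_mask = np.ones((n, n), dtype=bool)     # BERT-style

print(causal_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

With the causal mask, information only flows from past to future, which is what makes autoregressive generation cheap: past tokens' representations never need recomputing.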

Well, if you could solve the problem of companies X-washing (persuading consumers to buy from them by only pretending to alleviate their concerns), then you would probably be able to solve deceptive alignment as well.

Since two months is not a very long time to complete a research project, and I don't know what lab resources or datasets you have access to, it's a bit difficult to answer this.

It would be great if you could do something like build a model of human value formation based on the interactions between the hypothalamus, VTA, nucleus accumbens, vmPFC, etc. Like, how does the brain generalize its preferences from its gene-coded heuristic value functions? Can this inform how you might design RL systems that are more robust against reward misspecification?

Again, I doubt you can get beyond a toy model in the two months, but maybe you can think of something you can do related to the above.

Stack Overflow moderators would beg to differ.

But yes, retrodding old ground can be very useful. Just from the standpoint of education, actually going through the process of discovery can instill a much deeper understanding of the subject than is possible just from reading or hearing a lecture about it. And if the discovery is a stepping stone to further discoveries, then those who've developed that level of understanding will be at an advantage to push the boundaries of the field.

It seems to me that "inner" versus "outer" alignment has become a popular way of framing things largely because it has the appearance of breaking down a large problem into more manageable sub-problems. In other words, "If research group A can solve outer alignment, and research group B can solve inner alignment, then we can put them together into one big alignment solution!" Unfortunately, as you alluded to, reality does not cleanly divide along this joint. Even knowing all the details of an alignment failure might not be enough to classify it appropriatel... (read more)

Here's my take:

Like the reward signal in reinforcement learning, next-token prediction is a simple feedback signal that masks a lot of complexity behind the scenes. To predict the next token, the model first has to estimate what sort of persona should be speaking, what they know, how they speak, what the context is, and what they are trying to communicate. Self-attention with multiple attention heads at every layer in the Transformer allows the LLM to keep track of all these things. It's probably not the best way to do it, but it works.

Human bra... (read more)

"Let me see what Chatty thinks," (or whatever humanesque name becomes popular).

I assume people will treat it just like talking to a very knowledgeable friend. Just ask a question, get a response, clarify what you meant or ask a followup question, and so on. Conversation in natural language already comes naturally to humans, so probably a lot more people will become a lot more adept at accessing knowledge.

And in future iterations, this "friend" will be able to create art, weave stories, design elucidating infographics, make entertaining music videos, teach ... (read more)

By "code generating being automated," I mean that humans will program using natural human language, without having to think about the particulars of data structures and algorithms (or syntax). A good enough LLM can handle all of that stuff itself, although it might ask the human to verify if the resulting program functions as expected.

Maybe the models will be trained to look for edge cases that technically do what the humans asked for but seem to violate the overall intent of the program. In other words, situations where the program follows the letter of t... (read more)

Well, I very much doubt that the entire programming world will get access to a four-quintillion-parameter code-generating model within five years. However, I do foresee the descendants of OpenAI Codex getting much more powerful and much more used within that timeframe. After all, Transformers just came out only five years ago, and they've definitely come a long way since.

Human culture changes more slowly than AI technology, though, so I expect businesses to begin adopting such models only with great trepidation at first. Programmers will almost certainly n... (read more)

I must note that code generation is already almost universally automated: practically nobody writes assembly; it is almost always generated by compilers. But no, compilers didn't end programming.
2the gears to ascension1y
yeah, that's probably still another 7 years out by my estimate
yeah I mean, I don't think anyone would reasonably expect it to with the current ratio of who gets gains from trade
half joking on both counts, though I could probably think through and make a less joking version that has a lot more caveats; obviously neither statement is exactly true as stated

The cortex uses traveling waves of activity that help it organize concepts in space and time. In other words, the locally traveling waves provide an inductive bias for treating features that occur close together in space and time as part of the same object or concept. As a result, cortical space ends up mapping out conceptual space, in addition to retinotopic, somatic, or auditory space.

This is kind of like DCT in the sense that oscillations are used as a scaffold for storing or reconstructing information. I think that Neural Radiance Fields (NeRF) use a s... (read more)

Thanks! Your links led me down some interesting avenues.

This just might work. For a little while, anyway.

One hurdle for this plan is to incentivize developers to slap on 20 layers of alignment strategies to their generalist AI models. It may be a hard sell when they are trying to maximize power and efficiency to stay competitive.

You'll probably need to convince them that not having such safeguards in place will lead to undesirable behavior (i.e., unprofitable behavior, or behavior leading to bad PR or bad customer reviews) well below the level of apocalyptic scenarios that AI safety advocates normally talk about. Otherwise, they may not care.

3Cleo Nardo1y
* My mainline success-model looks like this: the key actors know alignment is hard and coordinate to solve it. I'm focusing on this success-model until another success-model becomes mainline.
* I'm more bullish on coordination between the key actors than a lot of the TAIS community.
* I think that the leading alignment-sympathetic actors slowed by the alignment tax will still outpace the alignment-careless actors.
* The assemblage might be cheaper to run than the primitives alone.
* Companies routinely use cryptographic assemblages.

That's an interesting way of thinking about it. I'm reminded of how computers have revolutionized thought over the past century.

Most humans have tended to think primarily by intuition. Glorified pattern-matching, with all the cognitive biases that entails.

Computers, in contrast, have tended to be all formalization, no intuition. They directly manipulate abstract symbols at the lowest level, symbols that humans only deal with at the highest levels of conscious thought.

And now AI is bringing things back around to (surprisingly brittle) intuition. It will be interesting, at least, to see how newer AI systems will bring together formalized symbolic reasoning with pattern-matching intuition.


Sometimes it's easier to reach a destination when you're not aiming for it. You can't reach the sun by pointing a rocket at it and generating thrust. It's hard to climb a mountain by going up the steepest path. You can't solve a Millennium Prize math problem by staring at it until a solution reveals itself.

Sometimes you need to slingshot around a planet a few times to adjust orbital momentum. Sometimes you'll summit faster by winding around or zigzagging. Sometimes you have to play around with unrelated math problems before you notice something ... (read more)

2Johannes C. Mayer1y
You list several possibilities for how directly working on the problem is not the best thing to do. Somebody who is competent and tries to solve the problem would consider these possibilities and make use of them.

I agree that sometimes there will be a promising path that is discovered by accident. You could not have planned for discovering it. Or could you? Even if you can't predict what path will reveal itself, you can be aware that there are paths that will reveal themselves in circumstances that you could not predict. You can still plan to do some specific type of "random stuff" that you expect might lead to something you did not think of before.

There are circumstances where you discover something by accident without even thinking of the possibility (and considering that it might be worth investigating). Even in these circumstances, I expect that somebody who tries to solve AI alignment will make better use of the opportunity for making progress on AI alignment.

To me, it seems that trying to do X is generally better at achieving X the more competent you are. That includes strategies like not trying hard to solve X, insofar as that seems useful. The worse you are at optimizing, the better it is to do random stuff, as you might stumble upon solutions your cognitive optimization process could not find.

A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human.

The problem is, of course, that any possible set of behaviors can be construed as maximizing some utility function. The question is whether doing so actually simplifies the task of reasoning and maki... (read more)

(Edit: What do you mean? This calls to mind a basic introduction to what utility functions do, given below, but on second thought that's probably not what the claim is about. I'll leave the rest of the comment here, as it could be useful for someone.)

A utility function describes decisions between lotteries, which are mixtures of outcomes, or more generally events in a sample space. The setting assumes uncertainty: outcomes are only known to be within some event, not individually. So a situation where a decision can be made is a collection of events/lotteries, one of which gets to be chosen; the choice is the behavior assigned to this situation. This makes situations reuse parts of each other; they are not defined independently. As a result, it becomes possible to act incoherently, for example to pick A from (A, B), pick B from (B, C), and pick C from (A, C). Only satisfying certain properties of collections of behaviors allows the existence of a probability measure and a utility function such that the agent's choice among the collection of events in any situation coincides with picking the event that has the highest expected utility.

Put differently, the issue is that behavior described by a utility function is actually behavior in all possible and counterfactual situations, not in some specific situation. Existence of a utility function says something about which behaviors in different situations can coexist. Without a utility function, each situation could get an arbitrary response/behavior of its own, independently from the responses given for other situations. But requiring a utility function makes that impossible; some behaviors become incompatible with the other behaviors. In the grandparent comment, I'm treating utility functions more loosely, but their role in constraining collections of behaviors assigned to different situations is the same.
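The incoherence example above can be checked mechanically: a set of pairwise choices admits a utility function exactly when some strict ranking of the items reproduces every choice. A brute-force sketch (the function name and the dict encoding of choices are my own, for illustration):

```python
from itertools import permutations

def consistent_with_utility(choices):
    """Return True iff some strict ranking of the items reproduces every
    pairwise choice, i.e. the behavior admits a utility function.

    choices: dict mapping frozenset({a, b}) -> the element chosen from that pair.
    """
    items = set().union(*choices)
    for ranking in permutations(items):
        rank = {x: i for i, x in enumerate(ranking)}  # lower index = preferred
        if all(rank[chosen] < rank[next(iter(pair - {chosen}))]
               for pair, chosen in choices.items()):
            return True
    return False

# The cyclic example from the comment: A from (A, B), B from (B, C), C from (A, C).
cyclic = {frozenset("AB"): "A", frozenset("BC"): "B", frozenset("AC"): "C"}
# Transitive choices, by contrast, are representable by a utility function.
transitive = {frozenset("AB"): "A", frozenset("BC"): "B", frozenset("AC"): "A"}
```

No ranking can satisfy the cyclic choices, which is exactly the sense in which they are incompatible with any utility function.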

The problem is that at the beginning, its plans are generally going to be complete nonsense. It has to have a ton of interaction with (at least a reasonable model of) its environment, both with its reward signal and with its causal structure, before it approaches a sensible output.

There is no utility for the RL agent's operators to have an oracle AI with no practical experience. The power of RL is that a simple feedback signal can teach it everything it needs to know to act rationally in its environment. But if you want it to make rational plans for the re... (read more)

Thanks, I appreciate the explanation!

The RL agent will only know whether its plans are any good if they actually get carried out. Its reward signal is something it essentially has to seek out through trial and error. All (most?) RL agents start out not knowing anything about the impact their plans will have, or even anything about the causal structure of the environment. All of that has to be learned through experience.

For agents that play board games like chess or Go, the environment can be fully determined in simulation. So, sure, in those cases you can have them generate plans and then not... (read more)

I agree with everything you've said. Obviously, AI (in most domains) would need to evaluate its plans in the real world to acquire training data. But my point is that we have the choice to not carry out some of the agent's plans in the real-world. For some of the AI's plans, we can say no -- we have a veto button. It seems to me that the AI would be completely fine with that -- is that correct? If so, it makes safety a much more tractable problem than it otherwise would be.

Overall, I’ve updated from “just aim for ambitious value learning” to “empirically figure out what potential medium-term alignment targets (e.g. human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AGI’s internal concept-language”.

I like this. In fact, I would argue that some of those medium-term alignment targets are actually necessary stepping stones toward ambitious value learning.

Human mimicry, for one, could serve as a good behavioral prior for IRL agents. AI that can reverse-engineer the policy function of ... (read more)

Awesome visualizations. Thanks for doing this.

It occurred to me that LayerNorm seems to be implementing something like lateral inhibition, using extreme values of one neuron to affect the activations of other neurons. In biological brains, lateral inhibition plays a key role in many computations, enabling things like sparse coding and attention. Of course, in those systems, input goes through every neuron's own nonlinear activation function prior to having lateral inhibition applied.

I would be interested in seeing the effect of applying a nonlinearity (suc... (read more)

That was my first thought as well. As far as I know, the most popular simple model used for this in the neuro literature, divisive normalization, uses a similar but not quite identical formula. Different authors use different variations, but it's something shaped like $z_i = \frac{y_i^\alpha}{\beta^\alpha + \sum_j \kappa_{ij} y_j^\alpha}$, where $y_i$ is the unit's activation before lateral inhibition, $\beta$ adds a shift/bias, $\kappa_{ij}$ are the respective inhibition coefficients, and the exponent $\alpha$ modulates the sharpness of the sigmoid (2 is a typical value). Here's an interactive desmos plot with just a single self-inhibiting unit. This function is asymmetric in the way you describe, if I understand you correctly, but to my knowledge it's never gained any popularity outside of its niche. The ML community seems to much prefer Softmax, LayerNorm, et al., and I'm curious if anyone knows whether there's a deep technical reason for these different choices.
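For concreteness, a minimal sketch of the divisive-normalization formula above. The uniform inhibition matrix and the example activations are chosen purely for illustration:

```python
import numpy as np

def divisive_normalization(y, kappa, beta=1.0, alpha=2.0):
    """z_i = y_i^a / (beta^a + sum_j kappa_ij * y_j^a).

    y: nonnegative pre-inhibition activations, shape (n,)
    kappa: lateral inhibition coefficients, shape (n, n)
    """
    ya = y ** alpha
    return ya / (beta ** alpha + kappa @ ya)

y = np.array([1.0, 2.0, 4.0])
kappa = np.ones((3, 3))  # uniform lateral inhibition (illustrative assumption)
z = divisive_normalization(y, kappa)
```

Each unit's response is squashed by the pooled activity of its neighbors, so the outputs stay bounded and the strongest unit claims most of the normalized activity — qualitatively similar to what Softmax or LayerNorm achieve, but via an explicitly competitive mechanism.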

I think grading in some form will be necessary in the sense that we don't know what value heuristics will be sufficient to ensure alignment in the AI. We will most likely need to add corrections to its reward signals on the fly, even as it learns to extrapolate its own values from those heuristics. In other words, grading.

However, it seems the crucial point is that we need to avoid including grader evaluations as part of the AI's self-evaluation model, for the same reason that we shouldn't give it access to its reward button. In other words, don't build th... (read more)

Could part of the problem be that the actor is optimizing against a single grader's evaluations? Shouldn't it somehow take uncertainty into account?

Consider having an ensemble of graders, each learning or having been trained to evaluate plans/actions from different initializations and/or using different input information. Each grader would have a different perspective, but that means that the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of man... (read more)
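One simple way to make the ensemble "take uncertainty into account" is to score a plan by the graders' mean evaluation minus a penalty for their disagreement. A toy sketch (the scoring rule and numbers are just one illustrative choice, not a claim about how graders would actually be implemented):

```python
import numpy as np

def ensemble_score(evals, risk_aversion=1.0):
    """Score a plan under an ensemble of graders, penalizing disagreement.

    evals: per-grader evaluations of a single plan.
    """
    evals = np.asarray(evals, dtype=float)
    return evals.mean() - risk_aversion * evals.std()

# A plan all graders mildly like beats a plan that exploits one grader.
robust = ensemble_score([0.8, 0.7, 0.75])
exploit = ensemble_score([5.0, 0.1, 0.0])
```

Under this rule, a plan that looks great to one grader but poor to the others scores badly, which is the "crystallizing from many points of view" intuition above.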

I think that the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you're still building a system which runs computations at cross-purposes, where part of the system (the actor) is trying to trick and part (each individual grader) is trying to not be tricked.  In this situation, I would look for a way of looking at alignment such that this unnatural problem disappears. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders in their own heads. 
This relates closely to how to "solve" Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.

Could we solve alignment by just getting an AI to learn human preferences through training it to predict human behavior, using a "current best guess" model of human preferences to make predictions and updating the model until its predictions are accurate, then using this model as a reward signal for the AI? Is there a danger in relying on these sorts of revealed preferences?

On a somewhat related note, someone should answer, "What is this Coherent Extrapolated Volition I've been hearing about from the AI safety community? Are there any holes in that plan?"

The main problems with CEV are that there is no method for it in practice, and no proof that it could work in principle.
I'm lazy, so I'll just link to this

Different parts of me get excited about this in different directions.

On the one hand, I see AI alignment as highly solvable. When I scan out among a dozen different subdisciplines in machine learning, generative modeling, natural language processing, cognitive science, computational neuroscience, predictive coding, etc., I feel like I can sense the faint edges of a solution to alignment that is already holographically distributed among collective humanity.

Getting AGI that has the same natural abstractions that biological brains converge on, that uses inter... (read more)

Come to think of it, couldn't this be applied to model corrigibility itself?

Have an AI that's constantly coming up with predictive models of human preferences, generating an ensemble of plans for satisfying human preferences according to each model. Then break those plans into landmarks and look for clusters in goal-space.

Each cluster could then form a candidate basin of attraction of goals for the AI to pursue. The center of each basin would represent a "robust bottleneck" that would be helpful across predictive models; the breadth of each basin would acc... (read more)

When you say "optimization target," it seems like you mean a single point in path-space that the planner aims for, where this point consists of several fixed landmarks along the path which don't adjust to changing circumstances. Such an optimization target could still have some wiggle room (i.e., consist of an entire distribution of possible sub-paths) between these landmarks, correct? So some level of uncertainty must be built into the plan regardless of whether you call it a prediction or an optimization target.

It seems to me that what you're advocating ... (read more)

Yup, that's right.

Also, just a couple minor errors:

  1. In your "The first 31 binary strings in lexical order" figure, you're missing a white square at the top of the fourth 3-bit string.
  2. "diving by W" should be "dividing by W". I know spell check would miss that one.

I didn't notice any other errors. Again, great article.

Nice catches. I love that somebody double-checked all the binary strings. :)

Excellent introduction. Your examples were all very intuitive.

For those who are reading, one way to get an intuition for the difference between binary strings and bits is to look at data compression. To begin with, it's easy to create a code like ASCII, where every character is represented by a binary string of length 8 (usually referred to as 8 "bits" or one byte), allowing up to 2^8 = 256 unique characters. This type of code will allow you to represent a text document in English that's 1024 characters in length with exactly 1 kB of information.
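To make the fixed-code vs. compression point concrete, a small sketch (the sample text and the letter-frequency figure are illustrative):

```python
import math
import zlib

# An 8-bit fixed-length code distinguishes 2**8 = 256 characters,
# so 1024 characters occupy exactly 1024 bytes (1 kB).
text = ("the quick brown fox jumps over the lazy dog " * 24)[:1024]
raw = text.encode("ascii")

# English is redundant, so a variable-length code can do better:
packed = zlib.compress(raw)

# Ideally, a symbol with probability p gets about -log2(p) bits.
# 'e' (p ~ 0.127 in English) would get ~3 bits instead of 8.
bits_for_e = -math.log2(0.127)
```

The gap between the 8 bits-per-character of the fixed code and the ~3 bits an optimal code would spend on common letters is exactly the distinction between binary strings and bits of information.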

Exc... (read more)


General success:

  • AGI mostly serves to solve coordination problems among eudaimonic agents or to lower the activation energy necessary for eudaimonic agents to achieve their goals.
  • Any newly minted AGIs come prepackaged with modules for detecting agency in other systems and for harmonizing the needs and goals of all other agents within its sphere of control.
  • The Gaia Hypothesis has become the Gaia Initiative, with ASI steering the evolution of the biosphere into a self-sustaining superorganism.
  • All bodies in the solar system are either in the process of be
... (read more)

Suppose that you gave it a bunch of labeled data about what counts as "good" and "bad".

If your alignment strategy strongly depends on teaching the AGI ethics via labeled training data, you've already lost.

And if your alignment strategy strongly depends on creating innumerable copies of a UFAI and banking on the anthropic principle to save you, then you've already lost spectacularly.

If you can't point to specific submodules within the AGI and say, "Here is where it uses this particular version of predictive coding to model human needs/values/goals," and, "... (read more)

The fox knows many things. The hedgehog knows one important thing.

It turns out that it's optimal to be 3 parts fox to sqrt(2) parts hedgehog:

Mostly due to the limited working memory that Transformers typically use (e.g., a buffer of only the most recent 512 tokens feeding into the decoder). When humans write novels, they have to keep track of plot points, character sheets, thematic arcs, etc. across tens of thousands of words. You could probably get it to work, though, if you augmented the LLM with content-addressable memory and included positional encoding that is aware of where in the novel (percentage-wise) each token resides.
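The fixed-buffer limitation can be sketched in a few lines; the 512-token window is the hypothetical figure from the comment above:

```python
def sliding_context(tokens, window=512):
    """Keep only the most recent `window` tokens, as a fixed-buffer
    Transformer decoder would: everything earlier is simply forgotten."""
    return tokens[-window:]

novel = list(range(50_000))   # stand-in for a novel's worth of token ids
ctx = sliding_context(novel)  # plot points before token 49488 are invisible
```

Content-addressable memory would let the model retrieve earlier material on demand instead of discarding it, which is why augmenting the architecture seems necessary for novel-length coherence.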

Well, yes, but I was speaking on the short term of the next few trillion years. For that matter, we have the power to cause irreversible collapse today. I would prefer to see a future that is capable of sustaining something worthwhile for as long as physics allows.

1 · Gerald Monroe · 1y
Sustainable is a badly used term for this reason. Lots of inefficient things are sustainable for the short term of a few hundred thousand years. "Unsustainable" should be limited to things that are either collapsing now (climate change) or obviously short-term, like relying on low-paid workers for essential jobs while the population is demographically losing workers.