All of michael_mjd's Comments + Replies

I share your disagreement with the original author as to the cause of the relief. For me, I find the modern day and age very confusing and difficult to measure one's value to society. Any great idea you can think of, probably someone else has thought of it, and you have little chance to be important. In a zombie apocalypse, instead of thinking how to out-compete your fellow man with some amazing invention, you fall back to survival. Important things in this world, like foraging for food, fending off zombies, etc, have quicker reward, and it's easier in som... (read more)

If we know they aren't conscious, then it is a non-issue. A random sample from conscious beings would land on the SAI with probability 0. I'm concerned we create something accidentally conscious.

I am skeptical it is easy to avoid. If it can simulate a conscious being, why isn't that simulation conscious? If consciousness is a property of the physical universe, then an isomorphic process would have the same properties. And if it can't simulate a conscious being, then it is not a superintelligence.

It can, however, possibly have a non-conscious outer-program... and avoid simulating people. That seems like a reasonable proposal.

Agree. Obviously alignment is important, but it has always creeped me out in the back of my mind, some of the strategies that involve always deferring to human preferences. It seems strange to create something so far beyond ourselves, and have its values be ultimately that of a child or a servant. What if a random consciousness sampled from our universe in the future, comes from it with probability almost 1? We probably have to keep that in mind too. Sigh, yet another constraint we have to add!

6 Zac Hatfield-Dodds 4mo
Would you say the same of a steam engine, or Stockfish, or Mathematica? All of those vastly exceed human performance in various ways! I don't see much reason to think that very very capable AI systems are necessarily personlike or conscious, or have something-it-is-like-to-be-them - even if we imagine that they are designed and/or trained to behave in ways compatible with and promoting of human values and flourishing. Of course if an AI system does have these things I would also consider it a moral patient, but I'd prefer that our AI systems just aren't moral patients until humanity has sorted out a lot more of our confusions.
At which point maybe the moral thing is to not build this thing.

Hi Critch,

I am curious to hear more of your perspectives, specifically on two points I feel least aligned with, the empathy part, and the Microsoft part. If I hear more I may be able to update in your direction.

Regarding empathy with people working on bias and fairness, concretely, how do you go about interacting with and compromising with them?

My perspective: it's not so much that I find these topics not sufficiently x-risky (but that is true, too), but it is that I perceive a hostility to the very notion of x-risk from a subset of this same group. The... (read more)

This might be a good time for me to ask a basic question on mechanistic interpretability:

Why does targeting single neurons work? Does it work? One would think that if there is a single dimensional quantity to measure, why would it align with the standard basis? Why wouldn't it be aligned to a random one dimensional linear subspace? Then, examining single neurons is likely to give you some weighted combination of concepts instead, rather than a single interpretation...
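To make the worry concrete, here is a minimal sketch, assuming a concept really is stored along a random unit direction in activation space. The width `d = 512` and the use of a participation ratio as "effective number of neurons" are illustrative choices, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hypothetical hidden width, chosen only for illustration

# Assumption: the concept lives along a random unit direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# Loading of the concept on the single best-aligned neuron.
best_neuron_share = np.max(np.abs(direction))

# Participation ratio of a unit vector: 1 for a one-hot direction,
# d for a direction spread evenly over every neuron.
participation = 1.0 / np.sum(direction ** 4)

print(f"largest single-neuron loading: {best_neuron_share:.3f}")
print(f"effective number of neurons:   {participation:.0f} of {d}")
```

Under this assumption no single neuron carries more than a small fraction of the concept, so if single neurons do turn out to be interpretable, something must be breaking the rotational symmetry.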

8 Ben Amitay 4mo
It's not a full answer, but: To the degree that it is true that the quantities align with the standard basis, it must somehow be a result of asymmetry in the activation function. For example, ReLU trivially depends on the choice of basis. If you focus on the ReLU example, it sort of makes sense: if multiple unrelated concepts are expressed in the same neuron, and one of them pushes the neuron in the negative direction, the ReLU may destroy information about the other concepts.
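A small numerical check of the "ReLU depends on the choice of basis" point, as a sketch: apply a random orthogonal change of basis before versus after the nonlinearity and compare. The `radial` function is a made-up, norm-only nonlinearity included purely as a basis-free contrast; it is not something used in real networks as far as this sketch assumes.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def radial(x):
    # Hypothetical contrast case: rescales each vector using only its norm,
    # so it has no preferred coordinate axes.
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x * np.tanh(norms)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(100, d))                 # 100 activation vectors
q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix

def rotate(v):
    # Apply the change of basis to every row.
    return v @ q.T

# If a nonlinearity were basis-independent, rotating before or after
# applying it would give the same result, and the gap would be ~0.
relu_gap = np.linalg.norm(relu(rotate(x)) - rotate(relu(x)))
radial_gap = np.linalg.norm(radial(rotate(x)) - rotate(radial(x)))

print(f"ReLU gap:   {relu_gap:.3f}")
print(f"radial gap: {radial_gap:.2e}")
```

The ReLU gap is clearly nonzero while the radial one vanishes up to floating-point error, which is the asymmetry that could, in principle, pull features toward individual neurons.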

Those are good questions! There's some existing research which addresses some of your questions.

Single neurons often do represent multiple concepts:

It seems to still be unclear why the dimensions are aligned with the standard basis:

Fascinating, thanks for the research. Your analysis makes sense and seems to indicate that for most situations, prompt engineering is always the first plan of attack and often works well enough. Then, a step up from there, OpenAI/etc would most likely experiment with fine-tuning or RLHF as it relates to a specific business need. To train a better chatbot and fill in any gaps, they probably would get more bang for their buck by simply fine-tuning it on a large dataset that matched their needs. For example, if they wanted to do better mathematical reason... (read more)

I agree with the analysis of the ideas overall. I think however, AI x-risk does have some issue regarding communications. First of all, I think it's very unlikely that Yann will respond to the wall of text. Even though he is responding, I imagine him more to be on the level of your college professor. He will not reply to a very detailed post. In general, I think that AI x-risk should aim to explain a bit more, rather than to take the stance that all the "But What if We Just..." has already been addressed. It may have been, but this is not the way to gettin... (read more)

This understanding has so far proven to be very shallow and does not actually control behavior, and is therefore insufficient. Users regularly get around it by asking the AI to pretend to be evil, or to write a story, and so on. It is demonstrably not robust. It is also demonstrably very easy for minds (current-AI, human, dog, corporate, or otherwise) to know things and not act on them, even when those actions control rewards.  If I try to imagine LeCun not being aware of this already, I find it hard to get my brain out of Upton Sinclair "It is difficult to get a man to understand something, when his salary depends on his not understanding it," territory.

Essentially yes, heh. I take this as a learning experience for my writing, I don't know what I was thinking, but it is obvious in hindsight that saying to just "switch on backprop" sounds very naive.

I also confess I haven't done the due diligence to find out what the largest model that has been tried with this actually is, or whether someone has tried it with Pythia or LLaMa. I'll do some more googling tonight.

One intuition why the largest models might be different, is that part of the training/fine-tuning going on will have to do with the model's own output. The largest models are the ones where the model's own output is not essentially word salad.

I have noted the problem of catastrophic forgetting in the section "why it might not work". In general I agree continual learning is obviously a thing, otherwise I would not have used the established terminology. What I believe however is that the problems we face in continual learning in e.g. a 100M BERT model may not be the same as what we observe in models that can now meaningfully self critique. We have explored this technique publicly, but have we tried it with GPT-4? The publicly part was really just a question of whether OpenAI actually did it on this model or not, and it would be an amazing data point if they could say "We couldn't get it to work."

Ah, so the point was whether that had been explored publicly on the very largest language models that exist, because of the whole "sometimes approaches that didn't work at small scale start working when you throw enough compute at them" thing? Makes sense.

It's possible it's downvoted because it might be considered dangerous capability research. It just seems highly unlikely that this would not be one of many natural research directions perhaps already attempted, and I figure we might as well acknowledge it and find out what it actually does in practice.

Or maybe downvotes because it "obviously won't work", but I think it's not obvious to me and would welcome discussion on that.

I'm worried that no matter how far we go, the next step will be one of the natural research directions.

Thanks, this is a great analysis on the power of agentized LLMs, which I probably need to spend some more time thinking about. I will work my way through the post over the next few days. I briefly skimmed the episodic memory section for now, and I see it is like an embedding based retrieval system for past outputs/interactions of the model, reminiscent of the way some Helper chatbots look up stuff from FAQs. My overall intuitions on this:

  • It's definitely something, but the method of embedding and retrieval, if static, would be very limiting
  • Someone will prob
... (read more)

Very interesting write up. Do you have a high level overview of why, despite all of this, P(doom) is still 5%? What do you still see as the worst failure modes?

Noticed this as well. I tried to get it to solve some integration problems, and it could try different substitutions and things, but if they did not work, it kind of gave up and said to numerically integrate it. Also, it would make small errors, and you would have to point them out, though it was happy to fix them.

I'm thinking that most documents it reads tend to omit the whole search/backtrack phase of thinking. Even work that is posted online that shows all the steps, usually filters out all the false starts. It's like how most famous mathematicians were known for throwing away their scratchwork, leaving everyone to wonder how exactly they formed their thought processes...

The media does have its biases but their reaction seems perfectly reasonable to me. Occam's razor suggests this is not only unorthodox, but shows extremely poor judgment. This demonstrates that (a) either Elon is actually NOT as smart as he has been hyped to be, or (b) there's some ulterior motive, but these are long-tailed.

Typically when one joins a company, you don't do anything for X number of months and get the lay of the land. I'm inclined to believe this is not just a local minimum, but typically close to the optimal strategy for a human being (but not ... (read more)

I'll say I definitely think it's too optimistic and I don't put too much stock in it. Still, I think it's worth thinking about.

Yes, absolutely we are not following the rule. The reason why I think it might change with an AGI: (1) currently we humans, despite what we say when we talk about aliens, still place a high prior on being alone in the universe, or from dominant religious perspectives, that we are the most intelligent. Those things combine to make us think there are no consequences to our actions against other life. An AGI, itself a proof of conc... (read more)

That is a very fair criticism. I didn't mean to imply this is something I was very confident in, but was interested in for three reasons:

1) This value function aside, is this a workable strategy, or is there a solid reason for suspecting the solution is all-or-nothing? Is it reasonable to 'look for' our values with human effort, or does this have to be something searched for using algorithms?
2) It sort of gives a flavor to what's important in life. Of course the human value function will be a complicated mix of different sensory inputs, reproduction, and g... (read more)

2 Donald Hobson 1y
At the moment, we don't know how to make an AI that does something simple like making lots of diamonds.  It seems plausible that making an AI that copies human values is easier than hardcoding even a crude approximation to human values. Or maybe not. 

I'm an ML engineer at a FAANG-adjacent company. Big enough to train our own sub-1B parameter language models fairly regularly. I work on training some of these models and finding applications of them in our stack. I've seen the light after I read most of Superintelligence. I feel like I'd like to help out somehow.  I'm in my late 30s with kids, and live in the SF bay area. I kinda have to provide for them, and don't have any family money or resources to lean on, and would rather not restart my career. I also don't think I should abandon ML and try to ... (read more)

3 Adrià Garriga-alonso 1y
You should apply to Anthropic. If you’re writing ML software at semi-FAANG, they probably want to interview you ASAP. The compensation is definitely enough to take care of your family and then save some money!
1 Yonatan Cale 1y
Anthropic offer equity, they can give you more details in private. I recommend applying to both (it's a cheap move with a lot of potential upside), let me know if you'd like help connecting to any of them. If you learn by yourself - I'd totally get one-on-one advice (others linked), people will make sure you're on the best path possible
One of the paths which has non-zero hope in my mind is building a weakly aligned non-self improving research assistant for alignment researchers. Ought and EleutherAI's #accelerating-alignment are the two places I know who are working in this direction fairly directly, though the various language model alignment orgs might also contribute usefully to the project.
Work your way up the ML business  hierarchy to the point where you are having conversations with decision makers.  Try to convince them that unaligned AI is a significant existential risk.  A small chance of you doing this will in expected value terms more than make up for any harm you cause by working in ML given that if you left the field someone else would take your job.
5 Linda Linsefors 1y
Given where you live, I recommend going to some local LW events. There are still LW meetups in the Bay Area, right?
7 Adam Jermyn 1y
Applying to Redwood or Anthropic seems like a great idea. My understanding is that they're both looking for aligned engineers and scientists and are both very aligned orgs. The worst case seems like they (1) say no or (2) don't make an offer that's enough for you to keep your lifestyle (whatever that means for you). In either case you haven't lost much by applying, and you definitely don't have to take a job that puts you in a precarious place financially.

You might want to consider registering for the AGI Safety Fundamentals Course (or reading through the content). The final project provides a potential way of dipping your toes into the water.

Both 80,000hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k's case, also many other problems).

Noting a conflict of interest - I work for 80,000 hours and know of but haven't used AISS. This post is in a personal capacity, I'm just flagging publicly available information rather than giving an insider take.

Pragmatic AI safety is supposed to be a good sequence for helping you figure out what to do. My best advice is to talk to some people here who are smarter than me and make sure you understand the real problems, because the most common outcome besides reading a lot and doing nothing is to do something that feels like work but isn't actually working on anything important.

Has there been effort into finding a "least acceptable" value function, one that we hope would not annihilate the universe or turn it degenerate, even if the outcome itself is not ideal? My example would be to try to teach a superintelligence to value all other agents facing surmountable challenges in a variety of environments. The degeneracy condition of this is that, if it does not value the real world, it will simply simulate all agents in a zoo. However, if the simulations are of faithful fidelity, maybe that's not literally the worst thing. Plus, the zoo, to truly be a good test of the agents, would approach being invisible.

The obvious option in this class is to try to destroy the world in a way that doesn't send out an AI to eat the lightcone that might possibly contain aliens who could have a better shot. I am really not a fan of this option.
4 Donald Hobson 1y
This doesn't select for humanlike minds. You don't want vast numbers of Ataribots similar to current RL, playing games like pong and pac-man. (And a trillion other autogenerated games sampled from the same distribution)   Even if you could somehow ensure it was human minds playing these games, the line between a fun game and total boredom is complex and subtle.

I can see the argument of capabilities vs safety both ways. On the one hand, by working on capabilities, we may get some insights. We could figure out how much data is a factor, and what kinds of data they need to be. We could figure out how long term planning emerges, and try our hand at inserting transparency into the model. We can figure out whether the system will need separate modules for world modeling vs reward modeling.  On the other hand, if intelligence turns out to be not that hard, and all we need to do is train a giant decision transforme... (read more)

I think we are getting some information. For example, we can see that token level attention is actually quite powerful for understanding language and also images. We have some understanding of scaling laws. I think the next step is a deeper understanding of how world modeling fits in with action generation -- how much can you get with just world modeling, versus world modeling plus reward/action combined?

If the transformer architecture is enough to get us there, it tells us a sort of null hypothesis for intelligence -- that the structure for predicting seq... (read more)

Not rhetorically, what kind of questions do you think would better lead to understanding how AGI works?

Suppose I'm designing an engine. I try out a new design, and it surprises me - it works much worse or much better than expected. That's a few bits of information. That's basically the sort of information we get from AI experiments today.

What we'd really like is to open up that surprising engine, stick thermometers all over the place, stick pressure sensors all over the place, measure friction between the parts, measure vibration, measure fluid flow and conce... (read more)

I think the desire works because most honest people know, if they give a good-sounding answer that is ultimately meaningless, no benefits will come of the answers given. They may eventually stop asking questions, knowing the answers are always useless. It's a matter of estimating future rewards from building relationships.

Now, when a human gives advice to another human, most of the time it is also useless, but not always. Also, it tends to not be straight up lies. Even in the useless case, people still think there is some utility in there, for example, hav... (read more)

One other thing I'm interested in, is there a good mathematical model of 'search'? There may not be an obvious answer. I just feel like there is some pattern that could be leveraged. I was playing hide and seek with my kids the other day, and noticed that, in a finite space, you expect there to be finite hiding spots. True, but every time you think you've found them all, you end up finding one more. I wonder if figuring out optimizations or discoveries follows a similar pattern. There are some easy ones, then progressively harder ones, but there are far more to be found than one would expect... so to model finding these over time, in a very large room...

If there is a mathematical model of search, it is no good to you unless it is computable.

I agree; I am also not completely sure of the dynamics of the intelligence explosion. I would like to have more concrete footing to figure out what takeoff will look like, as neither fast nor slow takeoff has been proved.

My intuition however is the opposite. I can't disprove a slow takeoff, but to me it seems intuitive that there are some "easy" modifications that should take us far beyond human level. Those intuitions, though they could be wrong, are thus:

- I feel like human capability is limited in some obvious ways. If I had more time and energy to fo... (read more)

AI existential risk is like climate change. It's easy to come up with short slogans that make it seem ridiculous. Yet, when you dig deeper into each counterargument, you find none of them are very convincing, and the dangers are quite substantial. There's quite a lot of historical evidence for the risk, especially in the impact humans have had on the rest of the world. I strongly encourage further, open-minded study.

For ML researchers.

It's easy to imagine that the AI will have an off switch, and that we could keep it locked in a box and ask it questions. But just think about it. If some animals were to put you in a box, do you think you would stay in there forever? Or do you think you'd figure a way out that they hadn't thought of?

Policy makers

AI x-risk. It sounds crazy for two reasons. One, because we are used to nothing coming close to human intelligence, and two, because we are used to AI being unintelligent. For the first, the only point of comparison is imagining something that is to us what we are to cats. For the second, though we have not quite succeeded yet, it only takes one. If you have been following the news, we are getting close.

Policy makers.

Yeah, I tend to agree. Just wanted to make sure I'm not violating norms. In that case, my specific thoughts are as follows, with a thought to implementing AI transparency at the end.

There is the observation that the transformer architecture doesn't have a hidden state like an LSTM. I thought for a while something like this was needed for intelligence, to have a compact representation of the state one is in. (My biased view, that I've updated away from, was that the weights represented HOW to think, and less about knowledge.) However, it's really intractabl... (read more)

I think this is absolutely correct. GPT-3/PaLM is scary impressive, but ultimately relies on predicting missing words, and its actual memory during inference is just the words in its context! What scares me about this is that I think there are some really simple low hanging fruit to modify something like this to be, at least, slightly more like an agent. Then plugging things like this as components into existing agent frameworks, and finally, having entire research programs think about it and experiment on it. Seems like the problem would crack. You never ... (read more)

5 Lone Pine 1y
My opinion is that you're not going to be able to crack the alignment problem if you have a phobia of infohazards. Essentially you need a 'Scout Mindset'. There are already smart people working hard on the problem, including in public such as on podcasts, so realistically the best (or worst) we could do on this forum is attempt to parse out what is known publicly about the scary stuff (eg agency) from DeepMind's papers and then figure out if there is a path forward towards alignment.

As an ML engineer, I think it's plausible. I also think there are some other factors that could act to cushion or mitigate slowdown. First, I think there are more low hanging fruit available. Now that we've seen what large transformer models can do on the text domain, and in a text-to-image Dall-E model, I think the obvious next step is to ingest large quantities of video data. We often talk about the sample inefficiency of modern methods as compared with humans, but I think humans are exposed to a TON of sensory data in building their world model. This see... (read more)

I work at a large, not quite FAANG company, so I'll offer my perspective. It's getting there. Generally, the research results are good, but not as good as they sound in summary. Despite the very real and very concerning progress, most papers you take at face value are a bit hyped. The exceptions to some extent are the large language models. However, not everyone has access to these. The open source versions of them are good but not earth shattering. I think they might be if the goal is to generate fluent-sounding chatbots, but this is not the goal of most w... (read more)

I posted something I think could be relevant to this:

The takeaway is that a sufficiently advanced agent that wants to hedge against the possibility of itself being destroyed by a greater power may decide the only surviving plan is to allow the lesser life forms some room to optimize their own utility. It's sort of an asymmetrical infinite game theoretic chain. If every agent kills lower agents, only the maximum survives and no one knows if they are the maximum. If there even is a maximum.
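A toy version of that chain, as a rough sketch only: it assumes a finite set of agents with hidden, randomly drawn power levels, and that every agent follows the same policy. The function names and the uniform power distribution are illustrative assumptions, not part of the linked post.

```python
import random

def survivors(powers, spare_weaker):
    """Agents left once every agent applies the same policy."""
    if spare_weaker:
        return list(powers)   # nobody is eliminated
    return [max(powers)]      # only the strongest agent remains

def survival_chance(n_agents, spare_weaker, trials=10_000, seed=0):
    """Estimated chance that an agent of unknown rank survives."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        powers = [rng.random() for _ in range(n_agents)]
        me = powers[0]  # "me" has no idea where it sits in the ranking
        if me in survivors(powers, spare_weaker):
            survived += 1
    return survived / trials

kill_rate = survival_chance(10, spare_weaker=False)
spare_rate = survival_chance(10, spare_weaker=True)
print(f"kill-the-weaker policy:  {kill_rate:.2f}")
print(f"spare-the-weaker policy: {spare_rate:.2f}")
```

With ten agents, an agent that cannot rule out something stronger survives only about one time in ten under the universal "kill lower agents" policy, which is the hedge the argument relies on. The sketch says nothing about whether real agents would actually adopt a shared policy.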

3 Lone Pine 1y
Interesting. I think this is the reason why people like equality and find Nietzsche so nauseating. (Nietzsche's vision, in my interpretation, was that people with the opportunities to dominate others should take those opportunities, even if it causes millions of average people to suffer.)

War. Poverty. Inequality. Inhumanity. We have been seeing these for millennia caused by nation states or large corporations. But what are these entities, if not greater-than-human-intelligence systems, who happen to be misaligned with human well-being? Now, imagine that kind of optimization, not from a group of humans acting separately, but by an entity with a singular purpose, with an ever diminishing proportion of humans in the loop.

Audience: all, but maybe emphasizing policy makers

Thanks for pointing to ECL, this looks fascinating!

I like to think of it not like trying to show that agent B is not a threat to C. The way it’s set up we can probably assume B has no chance against C. C also may need to worry about agent D, who is concerned about hypothetical agent E, etc. I think that at some level, the decision an agent X makes is the decision all remaining agents in the hierarchy will make.

That said I sort of agree that’s the real fear about this method. It’s kind of like using super-rationality or something else to solve the prisoner’s dilemma. Are you willing to bet your life the oth... (read more)