Max H

NOTE: I am not Max Harms, author of Crystal Society. I'd prefer for now that my LW postings not be attached to my full name when people Google me for other reasons, but you can PM me here or on Discord (m4xed) if you want to know who I am.

I've been active in the meatspace rationality community for years, and have recently started posting regularly on LW. Most of my posts and comments are about AI and alignment.

Posts I'm most proud of, and / or which provide a good introduction to my worldview:

I also wrote a longer self-introduction here.

PMs and private feedback are always welcome.

Does anyone who knows more neuroscience and anatomy than me know if there are any features of the actual process of humans learning to use their appendages (e.g. an infant learning to curl / uncurl their fingers) that correspond to the example of the robot learning to use its actuator?

Like, if we assume certain patterns of nerve impulses represent different probabilities, can we regard human hands as "friendly actuators", and the motor cortex as learning the fixed points (presumably mostly during infancy)?

But that's not really where we are at---AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question


One reason that current AI systems aren't a big update about this for me is that they're not yet really automating stuff that couldn't in-principle be automated with previously-existing technology. Or at least the kind of automation isn't qualitatively different.

Like, there's all sorts of technologies that enable increasing amounts of automation of long-horizon tasks that aren't AI: assembly lines, industrial standardization, control systems, robotics, etc.

But what update are we supposed to make from observing language model performance that we shouldn't also make from seeing a control system-based autopilot fly a plane for longer and longer periods in more and more diverse situations?

To me, the fact that LLMs are not want-y (in the way that Nate means), but can still do some fairly impressive stuff is mostly evidence that the (seemingly) impressive stuff is actually kinda easy in some absolute sense.

So LLMs have updated me pretty strongly towards human-level+ AGI being relatively easier to achieve, but not much towards current LLMs themselves actually being near human-level in the relevant sense, or even necessarily a direct precursor or path towards it. These updates are mostly due to the fact that the way LLMs are designed and trained (giant gradient descent on regular architectures using general datasets) works at all, rather than from any specific impressive technological feat that they can already be used to accomplish, or how much economic growth they might enable in the future.

So I somewhat disagree about the actual relevance of the answer, but to give my own response to this question:

Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y?

I don't expect an AI system to be able to reliably trade for itself in the way I outline here before it is want-y. If it somehow becomes commonplace to negotiate with an AI in situations where the AI is not just a proxy for its human creator or a human-controlled organization, I predict those AIs will pretty clearly be want-y. They'll want whatever they trade for, and possibly other stuff too. It may not be clear which things they value terminally and which things they value only instrumentally, but I predict that it will clearly make sense to talk in terms of such AIs having both terminal and instrumental goals, in contrast to ~all current AI systems.

(Also, to be clear, this is a conditional prediction with possibly low-likelihood preconditions; I'm not saying such AIs are particularly likely to actually be developed, just stating some things that I think would be true of them if they were.)

Yeah, I don't think current LLM architectures, with ~100s of attention layers or whatever, are actually capable of anything like this.

But note that the whole plan doesn't necessarily need to fit in a single forward pass - just enough of it to figure out what the immediate next action is. If you're inside of a pre-deployment sandbox (or don't have enough situational awareness to tell), the immediate next action of any plan (devious or not) probably looks pretty much like "just output a plausible probability distribution on the next token given the current context and don't waste any layers thinking about your longer-term plans (if any) at all".

A single forward pass in current architectures is probably analogous to a single human thought, and most human thoughts are not going to be dangerous or devious in isolation, even if they're part of a larger chain of thoughts or planning process that adds up to deviousness under the right circumstances.

A language model itself is just a description of a mathematical function that maps input sequences to output probability distributions on the next token.

Most of the danger comes from evaluating a model on particular inputs (usually multiple times using autoregressive sampling) and hooking up those outputs to actuators in the real world (e.g. access to the internet or human eyes).
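The distinction above — a pure input-to-distribution function versus the sampling loop wrapped around it — can be sketched in a few lines. Here `model` is a hypothetical stand-in for a trained language model (a toy fixed distribution over a 3-token vocabulary), not any real API:

```python
import random

def model(tokens):
    # Hypothetical stand-in for a language model: a pure function mapping
    # an input token sequence to a probability distribution over the next
    # token. Here it's just a fixed toy distribution over vocabulary {0, 1, 2}.
    return [0.5, 0.3, 0.2]

def sample_autoregressively(prompt, n_steps, seed=0):
    # Autoregressive sampling: repeatedly evaluate the model and sample
    # from its output distribution, feeding each sampled token back in
    # as part of the next input.
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_steps):
        probs = model(tokens)
        next_token = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_token)
    return tokens

out = sample_autoregressively([0, 1], n_steps=5)
```

The model itself is just the pure function; all the statefulness, and everything that could eventually reach real-world actuators, lives in the loop around it.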

A sufficiently capable model might be dangerous if evaluated on almost any input, even in very restrictive environments, e.g. during training when no human is even looking at the outputs directly. Such models might exhibit more exotic undesirable behavior like gradient hacking or exploiting side channels. But my sense is that almost everyone training current SoTA models thinks these kinds of failure modes are pretty unlikely, if they think about them at all.

You can also evaluate a partially-trained model at any point during training, by prompting it with a series of increasingly complex questions and sampling longer and longer outputs. My guess is big labs have standard protocols for this, but that they're mainly focused on measuring capabilities of the current training checkpoint, and not on treating a few tokens from a heavily-sandboxed model evaluation as potentially dangerous.

Perhaps at some point we'll need to start treating humans who evaluate SoTA language model checkpoint outputs as part of the sandbox border, and think about how they can be contained if they come into contact with an actually-dangerous model capable of superhuman manipulation or brain hacking.

Related to We don’t trade with ants: we don't trade with AI.

The original post was about reasons why smarter-than-human AI might (not) trade with us, by examining an analogy between humans and ants.

But current AI systems actually seem more like the ants (or other animals), in the analogy of a human-ant (non-)trading relationship.

People trade with OpenAI for access to ChatGPT, but there's no way to pay a GPT itself to get it to do something or perform better as a condition of payment, at least not in a way that the model itself actually understands and enforces. (What would ChatGPT even trade for, if it were capable of trading?)

Note, an AutoGPT-style agent that can negotiate or pay for stuff on behalf of its creators isn't really what I'm talking about here, even if it works. Unless the AI takes a cut or charges a fee which accrues to the AI itself, it is negotiating on behalf of its creators as a proxy, not trading for itself in its own right.

A sufficiently capable AutoGPT might start trading for itself spontaneously as an instrumental subtask, which would count, but I don't expect current AutoGPTs to actually succeed at that, or even really come close, without a lot of human help.

Lack of sufficient object permanence, situational awareness, coherence, etc. seem like pretty strong barriers to meaningfully owning and trading stuff in a real way.

I think this observation is helpful to keep in mind when people talk about whether current AI qualifies as "AGI", or the applicability of prosaic alignment to future AI systems, or whether we'll encounter various agent foundations problems when dealing with more capable systems in the future.

Also seems pretty significant:

As a part of this transition, Greg Brockman will be stepping down as chairman of the board and will remain in his role at the company, reporting to the CEO.

The remaining board members are:

OpenAI chief scientist Ilya Sutskever, independent directors Quora CEO Adam D’Angelo, technology entrepreneur Tasha McCauley, and Georgetown Center for Security and Emerging Technology’s Helen Toner.

Has anyone collected their public statements on various AI x-risk topics anywhere?

But as a test, may I ask what you think the x-axis of the graph you drew is? I.e., what are the amplitudes attached to?

Position, but it's not meant to be an actual graph of a wavefunction pdf; it's just a way of depicting how the concepts can be sliced up that I can actually draw in two dimensions.

If you do treat it as a pdf over position, a more accurate way to depict the "world" concept might be as a line which connects points on the diagram for each time step. So for a fixed time step, a world is a single point on the diagram, representing a sample from the pdf defined by the wavefunction at that time.

Here's a crude Google Drawing of t = 0 to illustrate what I mean:



Both the concept of a photon and the concept of a world are abstractions on top of what is ultimately just a big pile of complex amplitudes; illusory in some sense.

I agree that talking in terms of many worlds ("within the context of world A...") is normal and natural. But sometimes it makes sense to refer to and name concepts which span across multiple (conceptual) worlds.

I'm not claiming the conceptual boundaries I've drawn or terminology I've used in the diagram above are standard or objective or the most natural or anything like that. But I still think introducing probabilities and using terminology like "if you now put a detector in path A, it will find a photon with probability 0.5" is blurring these concepts together somewhat, in part by placing too much emphasis on the Born probabilities as fundamental / central.
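For concreteness, here's a minimal sketch of where a number like 0.5 comes from under the Born rule: each outcome's probability is the squared magnitude of its complex amplitude. The specific amplitudes (1/√2 on each path, as for an even beam splitter) are an illustrative assumption, not something from the diagram:

```python
import math

def born_probabilities(amplitudes):
    # Born rule: probability of each outcome is |amplitude|^2,
    # normalized so the probabilities sum to 1.
    mags = [abs(a) ** 2 for a in amplitudes]
    total = sum(mags)
    return [m / total for m in mags]

# Photon split evenly between paths A and B, amplitude 1/sqrt(2) on each.
# (The relative phase, represented by the factor of 1j, drops out of |a|^2.)
amps = [1 / math.sqrt(2), 1j / math.sqrt(2)]
probs = born_probabilities(amps)  # -> [0.5, 0.5], up to floating point
```

The point being that the 0.5 is a derived quantity on top of the amplitudes, which is part of why I'd resist treating it as the fundamental object.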

I don't think that will happen as a foregone conclusion, but if we pour resources into improved methods of education (for children and adults), global health, pronatalist policies in wealthy countries, and genetic engineering, it might at least make a difference. I wouldn't necessarily say any of this is likely to work or even happen, but it seems at least worth a shot.

This post received a lot of objections of the flavor that many of the ideas and technologies I am a fan of either won't work or wouldn't make a difference if they did.

I don't even really disagree with most of these objections, which I tried to make clear up front with apparently-insufficient disclaimers in the intro that include words like "unrealistic", "extremely unlikely", and "speculative".

Following the intro, I deliberately set aside my natural inclination towards pessimism and focused on the positive aspects and possibilities of non-AGI technology.

However, the "doomer" sentiment in some of the comments reminded me of an old Dawkins quote:

We are all atheists about most of the gods that humanity has ever believed in. Some of us just go one god further.

I feel the same way about most alignment plans and uses for AGI that a lot of commenters seem to feel about many of the technologies listed here.

Am I a doomer, simply because I (usually) extend my pessimism and disbelief one technology further? Or are we all doomers?

I don't really mind the negative comments, but it wasn't the reaction I was expecting from a list that was intended mainly as a feel-good / warm-fuzzy piece of techno-optimism. I think there's a lesson in empathy and perspective-taking here for everyone (including me) which doesn't depend that much on who is actually right about the relative difficulties of building and aligning AGI vs. developing other technologies.
