I do alignment research, mostly stuff that is vaguely agent foundations. Formerly on Vivek's team at MIRI. Most of my writing before mid 2023 is not representative of my current views about alignment difficulty.
There aren't really any non-extremely-leaky abstractions in big NNs on top of something like a "directions and simple functions on these directions" layer. (I originally heard this take from Buck)
Of course this depends on what it's trained to do? And it's false for humans, animals, corporations, and markets; we have pretty good abstractions that allow us to predict and sometimes modify the behavior of these entities.
I'd be pretty shocked if this statement was true for AGI.
Yeah I think I agree. It also applies to most research about inductive biases of neural networks (and all of statistical learning theory). Not saying it won't be useful, just that there's a large mysterious gap between great learning theories and alignment solutions, and inside that gap is (probably, usually) something like the levels-of-abstraction mistake.
its notion of regulators generally does not line up with neural networks.
When alignment researchers talk about ontologies and world models and agents, we're (often) talking about potential future AIs that we think will be dangerous. We aren't necessarily talking about all current neural networks.
A common-ish belief is that future powerful AIs will be more naturally thought of as being agentic and having a world model. The extent to which this will be true is heavily debated, and the gooder regulator theorem is kinda part of that debate.
Biphasic cognition might already be an incomplete theory of mind for humans
Nothing wrong with an incomplete or approximate theory, as long as you keep an eye on the things that it's missing and whether they are relevant to whatever prediction you're trying to make.
Here's a mistake some people might be making with mechanistic interpretability theories of impact (and some other things, e.g. how much neuroscience is useful for understanding AI or humans).
When there are multiple layers of abstraction that build up to a computation, understanding the low level doesn't help much with understanding the high level.
Examples:
1. Understanding semiconductors and transistors doesn't tell you much about programs running on the computer. The transistors can be reconfigured into a completely different computer, and you'll still be able to run the same programs. To understand a program, you don't need to be thinking about transistors or logic gates. Often you don't even need to be thinking about the bit level representation of data.
2. The computation happening in single neurons in an artificial neural network doesn't have much relation to the computation happening at a high level. What I mean is that you can switch out activation functions, randomly connect neurons to other neurons, randomly share weights, or replace small chunks of the network with some other differentiable parameterized function. And assuming the thing is still trainable, the overall system will still learn to execute a function that is on a high level pretty similar to whatever high level function you started with (see the sketch after this list).[1]
3. Understanding how neurons work doesn't tell you much about how the brain works. Neuroscientists understand a lot about how neurons work. There are models that make good predictions about the behavior of individual neurons or synapses. I bet that the high level algorithms that are running in the brain are most naturally understood without any details about neurons at all. Neurons probably aren't even a useful abstraction for that purpose.
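Here's a rough sketch of the kind of thing I mean by the activation-function claim in (2) (a toy PyTorch example I made up; the task, sizes, and hyperparameters are arbitrary): two small nets that differ at the neuron level, one ReLU and one tanh, trained on the same data, typically end up computing very nearly the same high-level function.

```python
# Toy illustration (assumes PyTorch): swap the activation function and the
# learned high-level input-output behaviour barely changes, even though the
# neuron-level computations are quite different.
import torch
import torch.nn as nn

torch.manual_seed(0)
xs = torch.linspace(-3, 3, 256).unsqueeze(1)
ys = torch.sin(xs)  # the "high level" function we want the nets to learn

def train(activation):
    net = nn.Sequential(nn.Linear(1, 64), activation, nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(2000):
        opt.zero_grad()
        loss = ((net(xs) - ys) ** 2).mean()
        loss.backward()
        opt.step()
    return net

relu_net = train(nn.ReLU())
tanh_net = train(nn.Tanh())

# Mean squared difference between the two nets' outputs: tiny, even though
# their individual neurons compute very different things.
with torch.no_grad():
    print(((relu_net(xs) - tanh_net(xs)) ** 2).mean().item())
```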
Probably directions in activation space are also usually a bad abstraction for understanding how humans work, kinda analogous to how bit-vectors of memory are a bad abstraction for understanding how a program works.
Of course John has said this better.
You can mess with inductive biases of the training process this way, which might change the function that gets learned, but (my impression is) usually not that much if you're just messing with activation functions.
"they should clearly communicate their non-respectful/-kind alternative communication protocols beforehand, and they should help the other person maintain their boundaries;"
Nate did this.
By my somewhat idiosyncratic views on respectful communication, Nate was roughly as respectful as Thomas Kwa.
I do seem to be unusually emotionally compatible with Nate's style of communication though.
Section 4 then showed how those initial results extend to the case of sequential decision making.
[...]
If she's a resolute chooser, then sequential decisions reduce to a single non-sequential decision.
Ah thanks, this clears up most of my confusion, I had misunderstood the intended argument here. I think I can explain my point better now:
I claim that proposition 3, when extended to sequential decisions with a resolute decision theory, shouldn't be interpreted the way you interpret it. The meaning changes when you make A and B into sequences of actions.
Let's say action A is a list of 1000000 particular actions (e.g. 1000000 small-edits) and B is a list of 1000000 particular actions (e.g. 1 improve-technology, then 999999 amplified-edits).[1]
Proposition 3 says that A is equally likely to be chosen as B (for randomly sampled desires). This is correct. Intuitively this is because A and B are achieving particular outcomes and desires are equally likely to favor "opposite" outcomes.
However this isn't the question we care about. We want to know whether action-sequences that contain "improve-technology" are more likely to be optimal than action-sequences that don't contain "improve-technology", given a random desire function. This is a very different question to the one proposition 3 gives us an answer to.
Almost all optimal action-sequences could contain "improve-technology" at the beginning, while any two particular action sequences are equally likely to be preferred to the other on average across desires. These two facts don't contradict each other. The first fact is true in many environments (e.g. the one I described[2]) and this is what we mean by instrumental convergence. The second fact is unrelated to instrumental convergence.
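Here's a tiny toy example (my own construction, with made-up numbers, not from the paper) showing both facts at once. Three timesteps, a scalar state starting at 0; "+1" and "-1" are small edits, and "improve" makes no edit but triples the size of all later edits; a "desire" is an iid random utility over reachable final states.

```python
# Enumerate every action sequence, sample random desires, and check:
# (1) any two fixed sequences are each preferred about half the time, yet
# (2) for most desires, the best outcome is only reachable with "improve".
import itertools
import random

ACTIONS = ["+1", "-1", "improve"]
T = 3

def final_state(seq):
    state, step_size = 0.0, 1.0
    for a in seq:
        if a == "improve":
            step_size *= 3          # amplify all later edits
        else:
            state += step_size if a == "+1" else -step_size
    return state

sequences = list(itertools.product(ACTIONS, repeat=T))
outcomes = {seq: final_state(seq) for seq in sequences}
reachable = set(outcomes.values())
reachable_without_improve = {s for seq, s in outcomes.items() if "improve" not in seq}

random.seed(0)
N = 20_000
A, B = ("+1", "+1", "+1"), ("improve", "+1", "+1")
a_beats_b = needs_improve = 0
for _ in range(N):
    u = {s: random.random() for s in reachable}   # a random desire over outcomes
    a_beats_b += u[outcomes[A]] > u[outcomes[B]]
    best = max(reachable, key=u.get)
    needs_improve += best not in reachable_without_improve

print(f"P(A preferred to B)             ~ {a_beats_b / N:.2f}")      # ~0.50 (proposition-3-style symmetry)
print(f"P(best outcome needs 'improve') ~ {needs_improve / N:.2f}")  # well above 0.50 (instrumental convergence)
```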
I think the error might be coming from this definition of instrumental convergence:
could we nonetheless say that she's got a better than $1/N$ probability of choosing $a$ from a menu of $N$ acts?
When $a$ is a sequence of actions, this definition makes less sense. It'd be better to define it as something like "from a menu of $N$ initial actions, she has a better than $1/N$ probability of choosing a particular initial action $a_1$".
I'm not entirely sure what you mean by "model", but from your use in the penultimate paragraph, I believe you're talking about a particular decision scenario Sia could find herself in.
Yep, I was using "model" to mean "a simplified representation of a complex real world scenario".
For simplicity, we can make this scenario a deterministic known environment, and make sure the number of actions available doesn't change if "improve-technology" is chosen as an action. This way neither of your biases apply.
E.g. we could define a "small-edit" as adding some small fixed amount to any location in the state vector, and an "amplified-edit" as adding a much larger amount to any location. This preserves the number of actions, and makes the advantage of "amplified-edit" clear. I can go into more detail if you like, this does depend a little on how we set up the distribution over desires.
I read about half of this post when it came out. I didn't want to comment without reading the whole thing, and reading the whole thing didn't seem worth it at the time. I've come back and read it because Dan seemed to reference it in a presentation the other day.
The core interesting claim is this:
My conclusion will be that most of the items on Bostrom's laundry list are not 'convergent' instrumental means, even in this weak sense. If Sia's desires are randomly selected, we should not give better than even odds to her making choices which promote her own survival, her own cognitive enhancement, technological innovation, or resource acquisition.
This conclusion doesn't follow from your arguments. None of your models even include actions that are analogous to the convergent actions on that list.
The non-sequential theoretical model is irrelevant to instrumental convergence, because instrumental convergence is about putting yourself in a better position to pursue your goals later on. The main conclusion seems to come from proposition 3, but the model there is so simple it doesn’t include any possibility of Sia putting itself in a better position for later.
Section 4 deals with sequential decisions, but for some reason mainly gets distracted by a Newcomb-like problem, which seems irrelevant to instrumental convergence. I don't see why you didn't just remove Newcomb-like situations from the model? Instrumental convergence will show up regardless of the exact decision theory used by the agent.
Here's my suggestion for a more realistic model that would exhibit instrumental convergence, while still being fairly simple and having "random" goals across trajectories. Make an environment with 1,000,000 timesteps. Have the world state described by a vector of 1000 real numbers. Have a utility function that is randomly sampled from some Gaussian process (or any other high entropy distribution over functions) on $\mathbb{R}^{1000}$. Assume there exist standard actions which directly make small edits to the world-state vector. Assume that there exist actions analogous to cognitive enhancement, making technology and gaining resources. Intelligence can be used in the future to more precisely predict the consequences of actions on the future world state (you'd need to model a bounded agent for this). Technology can be used to increase the amount or change the type of effect your actions have on the world state. Resources can be spent in the future for more control over the world state. It seems clear to me that for the vast majority of the random utility functions, it's very valuable to have more control over the future world state. So most sampled agents will take the instrumentally convergent actions early in the game and use the additional power later on.
The assumptions I made about the environment are inspired by the real world environment, and the assumptions I've made about the desires are similar to yours, maximally uninformative over trajectories.
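For concreteness, here's a drastically scaled-down sketch of that model (my simplifications: made-up small sizes, random linear desires instead of a Gaussian process sample, and only two candidate policies, "edit every step" vs "improve-technology first, then edit"):

```python
# With a random linear desire w over the final world state, greedy edits are
# optimal, so we can compare the two policies in closed form and just count
# how often the "improve-technology first" policy wins.
import numpy as np

rng = np.random.default_rng(0)
DIM, T = 10, 20               # state dimension and number of timesteps (tiny, made up)
SMALL, AMPLIFIED = 1.0, 2.0   # edit magnitude before / after "improve-technology"

def achievable_utility(w, improve_first):
    steps, magnitude = T, SMALL
    if improve_first:
        steps, magnitude = T - 1, AMPLIFIED   # spend one timestep improving technology
    # Each remaining step, add ±magnitude to the coordinate where |w| is largest.
    return steps * magnitude * np.max(np.abs(w))

N = 10_000
wins = sum(
    achievable_utility(w, True) > achievable_utility(w, False)
    for w in rng.normal(size=(N, DIM))
)
print(f"{wins / N:.1%} of random desires prefer 'improve-technology' first")
# Prints 100.0%: losing one timestep is worth it when all later edits are amplified.
```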
I'm not sure how to implement the rule "don't pay people to kill people". Say we implement it as a utility function over world-trajectories, and any trajectory that involves a killing causally downstream of your actions gets MIN_UTILITY. This allows probabilistic tradeoffs (a small enough chance of a downstream killing gets outweighed by enough utility), so it's probably not what we want. If we use negative infinity instead, then it can't ever take actions in a large or uncertain world. We need to add the patch that the agent must have been aware, at the time of taking its actions, that the actions had a non-negligible chance of causing murder. I think these are vulnerable to blackmail, because you could threaten to cause murders that are causally downstream from its actions.
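To make the probabilistic-tradeoff point concrete (made-up numbers, just arithmetic):

```python
# With any finite penalty, an expected-utility maximizer will accept some
# probability of a murder-containing trajectory if the upside is big enough.
MIN_UTILITY = -1e9   # finite penalty for trajectories with a downstream killing

def expected_utility(p_murder, upside):
    return p_murder * MIN_UTILITY + (1 - p_murder) * upside

safe = 0.0                                   # do nothing
risky = expected_utility(1e-12, 1000.0)      # tiny murder risk, modest gain
print(risky > safe)   # True: the finite penalty gets traded away
```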
Maybe I'm confused and you mean "actions that pattern match to actually paying money directly for murder", in which case it will just use a longer causal chain, or opaque companies that may-or-may-not-cause-murders will appear and trade with it.
If the ultimate patch is "don't take any action that allows unprincipled agents to exploit you for having your principles", then maybe there aren't any edge cases. I'm confused about how to define "exploit" though.
You leave money on the table in all the problems where the most efficient-in-money solution involves violating your constraint. So there's some selection pressure against you if selection is based on money.
We can (kinda) turn this into a money-pump by charging the agent a fee for you to violate the constraint for it. Whenever it encounters such a situation, it pays you a fee and you do the killing.
Whether or not this counts as a money pump, I think it satisfies the reasons I actually care about money pumps, which are something like "adversarial agents can cheaply construct situations where I pay them money, but the world isn't actually different".
Ah I see, I was referring to less complete abstractions. The "accurately predict all behavior" definition is fine, but this comes with a scale of how accurate the prediction is. "Directions and simple functions on these directions" probably misses some tiny details like floating point errors, and if you wanted a human to understand it you'd have to use approximations that lose way more accuracy. I'm happy to lose accuracy in exchange for better predictions about behavior in previously-unobserved situations. In particular, it's important to be able to work out what sort of previously-unobserved situation might lead to danger. We can do this with humans and animals etc, we can't do it with "directions and simple functions on these directions".