I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.
I do alignment research, mostly stuff vaguely in the direction of agent foundations. Currently doing independent research on ontology identification. Formerly on Vivek's team at MIRI. Most of my writing before mid-2023 is not representative of my current views about alignment difficulty.
I think you're wrong to be psychoanalysing why people aren't paying attention to your work. You're overcomplicating it. Most people just think you're wrong upon hearing a short summary, and don't trust you enough to spend time learning the details. Whether your scenario is important or not, from your perspective it'll usually look like people are bouncing off for bad reasons.
For example, I read the executive summary. For several shallow reasons,[1] the scenario seemed unlikely and unimportant. I didn't expect there to be better arguments further on. So I stopped. Other people have different world models and will bounce off for different reasons.
Which isn't to say it's wrong (that's just my current weakly held guess). My point is just that even if you're correct, the way it looks a priori to most worldviews is sufficient to explain why people are bouncing off it and not engaging properly.
Perhaps I'll encounter information in the future that indicates my bouncing off was a mistake, and I'll go back.
There are a couple of layers of maybes, so the scenario doesn't seem likely. I expect power to be more concentrated. I expect takeoff to be faster. I expect capabilities to have a high cap. I expect alignment to be hard for any goal. Something about maintaining a similar societal structure without various chaotic game-board-flips seems unlikely. The goals-instilled-in-our-replacements are pretty specific (institution-aligned), and pretty obviously misaligned from overall human flourishing. Sure humans are usually myopic, but we do sometimes consider the consequences and act against local incentives.
I don't know whether these reasons are correct, or how well you've argued against them. They're weakly held and weakly considered, so I wouldn't have usually written them down. They are just here to make my point more concrete.
The description of how sequential choice can be defined is helpful; I was previously confused about how this was supposed to work. This matches what I meant by preferences over tuples of outcomes. Thanks!
We'd incorrectly rule out the possibility that the agent goes for (B+,B).
There are two things we might want from the idea of incomplete preferences: (1) modelling agents and predicting their behaviour, and (2) building agents whose incompleteness buys us behaviour we couldn't get from complete preferences.
I think modelling an agent as having incomplete preferences is great for (1). Very useful. We make better predictions if we don't rule out the possibility that the agent goes for B after choosing B+. I think we agree here.
For (2), the relevant quote is:
As a general point, you can always look at a decision ex post and back out different ways to rationalise it. The nontrivial task here is prediction, using features of the agent.
If we can always rationalise a decision ex post as being generated by a complete agent, then let's just build that complete agent. Incompleteness isn't helping us, because the behaviour could have been generated by complete preferences.
Perhaps I'm misusing the word "representable"? But what I meant was that any single sequence of actions generated by the agent could also have been generated by an outcome-utility maximizer (that has the same world model). This seems like the relevant definition, right?
That's not right
Are you saying that my description (following) is incorrect?
[incomplete preferences w/ caprice] would be equivalent to 1. choosing the best policy by ranking them in the partial order of outcomes (randomizing over multiple maxima), then 2. implementing that policy without further consideration.
Or are you saying that it is correct, but you disagree that this implies that it is "behaviorally indistinguishable from an agent with complete preferences"? If so, then I think we might disagree on the definition of "behaviorally indistinguishable". I'm using it like: given a single observed sequence of actions from the agent (and knowledge of the agent's world model), can you construct a utility function over outcomes that could have produced that sequence?
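To make that definition concrete, here is a minimal sketch in a toy formalisation of my own (the representation, the restriction to terminal outcomes, and all the names are mine, purely for illustration). For a resolute outcome-utility maximizer only the terminal outcome enters the objective, so an observed trajectory is rationalisable exactly when some utility assignment makes the outcome it ends at maximal among the reachable ones, and such an assignment always exists:

```python
from itertools import permutations

# Toy model: the agent's world model is summarised by the set of terminal outcomes
# reachable from the start, and an observed action sequence is summarised by the
# outcome it actually ends at. Throwing away the path is the point: an
# outcome-utility maximizer's objective only sees the terminal outcome.

def rationalising_ordering(reachable, realised):
    """Return a best-to-worst ordering of outcomes under which a resolute
    outcome-utility maximizer could have produced the observed trajectory,
    or None if no such ordering exists."""
    for ordering in permutations(reachable):
        utility = {o: -i for i, o in enumerate(ordering)}  # earlier = higher utility
        if all(utility[realised] >= utility[o] for o in reachable):
            return ordering
    return None

# Single-souring setup: starting from A, the agent could end up holding A, B, or B+.
# Whatever single outcome we observe, some complete utility function rationalises it.
for observed in ["A", "B", "B+"]:
    print(observed, "->", rationalising_ordering({"A", "B", "B+"}, observed))
```

On any single run, then, the caprice-rule agent looks the same as some complete outcome-utility maximizer; the incompleteness only shows up across counterfactual or repeated choices, which is why it still earns its keep for prediction (point (1) above).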
Or consider another example. The agent trades A for B, then B for A, then declines to trade A for B+. That's compatible with the Caprice rule, but not with complete preferences.
This is compatible with a resolute outcome-utility maximizer (for whom A is a maximum). There's no rule that says an agent must take the shortest route to the same outcome (right?).
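Running the quoted trajectory through the same toy lens (again my own encoding, only to illustrate the point about routes): the final holding is A, so any utility assignment with u(A) maximal, e.g. u(A) > u(B+) > u(B), rationalises accepting A→B, accepting B→A, and then declining A→B+.

```python
# The quoted trajectory: offers arrive in sequence; the agent accepts, accepts, declines.
trajectory = [(("A", "B"), True), (("B", "A"), True), (("A", "B+"), False)]

utility = {"A": 2, "B+": 1, "B": 0}  # one witness assignment, chosen for illustration

holding = "A"
for (give, get), accepted in trajectory:
    assert holding == give  # each offer swaps away the thing currently held
    if accepted:
        holding = get

# A resolute outcome-utility maximizer cares only about where it ends up, so the whole
# path is consistent as long as the final holding has maximal utility among {A, B, B+}.
print(holding, utility[holding] == max(utility.values()))  # -> A True
```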
As Gustafsson notes, if an agent uses resolute choice to avoid the money pump for cyclic preferences, that agent has to choose against their strict preferences at some point.
...
There's no such drawback for agents with incomplete preferences using resolute choice.
Sure, but why is that a drawback? It can't be money-pumped, right? Agents following resolute choice often choose against their local strict preferences in other decision problems (e.g. Newcomb's problem), and this is considered an argument in favour of resolute choice.
I think it's important to note the out-of-distribution (OOD) push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it; it's taking a slightly-broken-but-works machine and putting it into a very different situation, where the broken parts become load-bearing.
My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).
Hmm, good point. Looking at your dialogues has changed my mind; they have higher karma than the ones I was looking at.
You might also be unusual on some axis that makes arguments easier. It takes me a lot of time to go over people's words and work out what beliefs are consistent with them. And the inverse, translating a model into words, also takes a while.
Dialogues are more difficult to create (if done well between people with different beliefs), and are less pleasant to read, but are often higher value for reaching true beliefs as a group.
Dialogues seem under-incentivised relative to comments, given the amount of effort involved. Maybe they would get more karma if we could vote on individual replies, so it's more like a comment chain?
This could also help with skimming a dialogue, because you could skip to the best parts to see whether it's worth reading the whole thing.
The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X, Y, Z components of the algorithm; X passes (e.g.) beliefs to Y in format b; Z can be viewed as a function that takes information in format w and links it with... etc. And infra-Bayesianism might be the theory you use to explain what some of the internal data structures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and from mainstream AI and ML theory) potentially has relevance, but it's not super clear which bits are most likely to be directly useful.
This seems like the ultimate goal of interp research (and it's a good goal). Relatedly, I think the current story for heuristic arguments is to use them to "explain" a trained neural network by breaking it down into something more like an X, Y, Z components explanation.
At this point, we can analyse the overall AI algorithm and understand what happens when it updates its beliefs radically, or how its goals are stored and whether they ever change. And we can try to work out whether the particular structure would change itself in bad-to-us ways if it could self-modify. This is where the work looks much more theoretical, like theoretical analysis of algorithms.
(The above is the "understood" end of the axis. The "not-understood" end looks like making an AI with pure evolution, with no understanding of how it works. There are many levels of partial understanding in between.)
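As a purely made-up toy (not a claim about any real system, or about how such a decomposition would be obtained), here is the kind of thing an algorithmic-level description could let us write down: named components with explicit interfaces, so that questions like "how are the goals stored?" or "what happens on a radical belief update?" become questions about specific, typed pieces of the algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Beliefs:
    # "format b": whatever structure the belief-forming component hands to the planner
    world_state: Dict[str, dict]

@dataclass
class Goals:
    # where the goals live; something we could inspect for drift after updates
    score_outcome: Callable[[dict], float]

def update_beliefs(beliefs: Beliefs, observation: Dict[str, dict]) -> Beliefs:
    # component X: a radical belief update would show up as a large change here
    return Beliefs(world_state={**beliefs.world_state, **observation})

def plan(beliefs: Beliefs, goals: Goals) -> str:
    # component Y: picks the option whose predicted outcome the stored goals score highest
    options = beliefs.world_state.get("options", {"noop": {}})
    return max(options, key=lambda action: goals.score_outcome(options[action]))
```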
This kind of understanding is a prerequisite for the scheme in my post. This scheme could be implemented by modifying a well-understood AI.
Also what is its relation to natural language?
Not sure what you're getting at here.
I propose: the best planners must break the beta.
Because if a planner is going to be the best, it needs to be capable of finding unusual (better!) plans. And if it's capable of finding those, there's ~no benefit to knowing the conventional wisdom about how to do it (the "beta", in climbing slang).
Edit: or maybe: good planners don't need beta?