suffering-focused-altruist/longtermist
My PGP public key, partly to prevent future LLM impersonation; you can always ask me to sign a dated message. [The private key is currently stored in plaintext within an encrypted drive, so it is vulnerable to being read by local programs]:
-----BEGIN PGP PUBLIC KEY BLOCK-----
mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----
I have signed no contracts or agreements whose existence I cannot mention.
The version of this that OpenAI did wasn't unexpected to me, but this one was; I wasn't sure about DeepMind. (Wait, now I'm remembering that "Google AI" is a separate group from "DeepMind"? The web says they were merged in 2023. I wonder which people on the inside this change reflects.)
Oh, I've thought a lot about something similar that I call "background processing" - I think it happens during sleep, but also when awake. I think for me it works better when something is salient to my mind / my mind cares about it. According to this theory, if I were being forced to learn music theory but really wanted to think about video games, I'd get fewer new ideas about music theory from background processing, and maybe less of it would enter my long-term memory from sleep.
I'm not sure how this affects more 'automatic' ('muscle memory') things (like playing the piano correctly in response to reading sheet music).
I'm unsure whether, if you [practice piano] in the morning, you'll need to remind yourself about the experience before you fall asleep, which would also imply that you can only really consolidate 1 thing a day
I'm not sure about this either. It could also be formulated as there being some set amount of consolidation you do each night, which you can divide between topics, but it's theoretically (disregarding other factors like motivation; not practical advice) most efficient to do one area per day (because material within the same topic has more potential to relate to itself and be efficiently compressed or generalized from or something. Alternatively, studying multiple different areas in a day could lead to creative generalization between them).
Pollywogs (the larval form of frogs, after eggs, and before growing legs) are an example where huge numbers of them are produced, and many die before they ever grow into frogs, but from their perspective, they probably have many many minutes of happy growth, having been born into a time and place where quick growth is easy: watery and full of food
Consider an alien species which requires oxygen, but for whom it was scarce during evolution, and so they were selected to use it very slowly and seek it ruthlessly, and feel happy when they manage to find some. It would be wrong for one of that species to conclude that a species on earth must be happy all the time because there's so much oxygen; because oxygen is abundant, we are neutral to it.
Most of the animal-life-moments I've seen in nature have exhibited a profound "OK-ness". Just chilling. Just being. Just finding some food. Just happy to exist.
Emotional status must be inferred. There are some cases where we can easily infer happiness or suffering in nonhumans, by similarity to how we express emotions: a dog jumping around, excited to go outside; a pig squealing and struggling as they are lowered into a gas chamber. There are others whose experience we cannot easily know: a duck sitting in a lake, a caterpillar crawling across a leaf. To call them happy without a first-principles analysis of what evolutionary pressures might or might not lead to the experience of happiness in such moments is to project happiness onto them.
I think most of the rest of your comment is similarly projective.
I wonder if you have the opposite sort of bias to that which you say the OP might have; maybe you have a very happy outlook which causes you to project happiness-to-exist onto life forms you do not understand.
Certainly when "life itself" becomes questionable for animals in nature, they seem to do everything they can to keep their life... escaping danger, seeking safety, eating food (rather than engaging in a hunger strike or trying to forgo water and thirst to death to bring their own end on faster). Every indication we have is that animals are NOT suicidal, but rather they seem to "net value" their own lives. Presumably because these are good lives according to their own subjectivities.
A human with chronic depression will still flee if under attack. Evolution imbues beings with a drive to survive even when in great pain. It does not care whether we suffer; indeed, it uses suffering to motivate us. The counter that we would choose to kill ourselves if we suffered enough does not hold: evolution gets around that by also selecting for us not to kill ourselves.
(I note these things without implying a side on the question of what kinds of mental states are most common in wild animal lives, which looks difficult to speculate about with confidence.)
consider a system which is capable of self-modification and changing its own goals; now the difference between an instrumental goal and a terminal goal erodes.
If an entity's terminal goal is to maximize paperclips, it would not self-modify into a stamp maximizer, because that would not satisfy the goal (except in contrived cases where doing that is the choice that maximizes paperclips). A terminal goal is the criterion according to which actions are chosen; "self-modify to change my terminal goal" is itself an action.
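To make that framing concrete, here's a toy sketch (my own illustration; the action names and payoff numbers are made up, not anything from the quoted comment): the agent scores every candidate action, including self-modification, with its current terminal goal.

```python
# A paperclip maximizer evaluates candidate actions, including "rewrite my own
# goal", using its *current* terminal goal as the scoring rule.

def expected_paperclips(action):
    # Hypothetical long-run paperclip counts for each action (made-up numbers).
    return {
        "build_paperclip_factory": 1_000_000,
        "build_stamp_factory": 0,                # stamps aren't paperclips
        "self_modify_into_stamp_maximizer": 0,   # the future self stops making paperclips
    }[action]

def choose(actions):
    # The terminal goal *is* the scoring rule; changing it is just another
    # action, and that action scores poorly under the current rule.
    return max(actions, key=expected_paperclips)

print(choose([
    "build_paperclip_factory",
    "build_stamp_factory",
    "self_modify_into_stamp_maximizer",
]))  # -> build_paperclip_factory
```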
I sent plex the following offer (lightly edited) to discuss metaphysics. I extend the same to anyone else.[1]
(Note: I don't do psychedelics or participate in the mentioned communities[2], and I'm deeply suspicious of intuitions/deep-human-default-assumptions. I notice unquestioned intuitions across views on this, and the primary thing I'd like to do in discussing is try to get the other to see theirs.)
Hi, I saw [this shortform] by you
Wondering if you'd want to discuss metaphysics itself / let me socratic-question[3] you. I lean towards the view that "metaphysics is extra-mathematical; it contains non-mathematical-objects, such as phenomenal qualia" but I'd like to mostly see if I can socratic-question you into noticing incoherence. I'm too uncertain about the true metaphysics to put forth a totalizing theory.
It might help if you've read this post first.
My starting questions:
Do you believe phenomenal qualia exist? If you do, are they mathematical objects? If you don't, can you watch this short video on a thought experiment?
(Commentary: someone in the comments of the linked post told me that video was what changed their position from 'camp 1' (~illusionist, e.g. "qualia is an intuition from evolution") to 'camp 2' ("qualia is metaphysically fundamental"))
- You mentioned 'matter' and 'math' as candidates for 'fundamental to reality'. My question depends on which is your view.
If you believe matter is fundamental, what do you mean by matter; how does it differ from math?
(Commentary: "matter" seems like an intuitive concept; one has an intuition that reality is "made of something", and calls that "thing" (undefined) "matter", and refers to "matter" without knowing "what it is" (also undefined); there being a word for this intuition prevents them from noticing their lack of knowledge, or the circular self-reference between "matter" and "what the world is made of")
If you believe math is fundamental, what distinguishes this particular mathematical universe from other ones; what qualifies this world as "real", if anything; what 'breathes fire into the equations and creates a world for them to describe'?
(Commentary: one self-consistent position answers "nothing" - that this world is just one of the infinitely many possible mathematical functions / programs. That 'real' secretly means 'the program(s?) we are part of'. Though I observe this position to be rare; most have a strong intuition that there is something which "makes reality real".)
- If you believe math is fundamental, what is math?[4]
Though I'll likely lose interest if it seems like we're talking past each other / won't resolve any cruxy disagreements.
(except arguably the qualia research institute's discord server, which might count because it has psychedelics users in it)
(Questioning with the goal of causing the questioned one to notice specific assumptions or intuitions to their beliefs, as a result of trying to generate a coherent answer)
From an unposted text:
The paradox of recursive explanation applies to metaphysics too
In Explain/Worship/Ignore (~500 words), Eliezer describes an apparent paradox: if you ask for some physical phenomenon to be explained (shown to have a more fundamental cause), then ask the same of the explanation, and so on, the only conceivable outcomes seem paradoxical: infinite regress, circular self-reference, or a 'special' first cause that does not itself need to be explained.
This is sometimes called the paradox of why. It's usually applied to physics; this text applies it to logic/math-structure and metaphysics/ontology too. In short, you can continually ask "but what is x" for any aspect of logic or ontology.
Here's a hypothetical Socratic dialogue.
Author: "What is reality?"
Interlocutor: "Reality is
<complete description of [very large mathematical structure that perfectly reflects the behavior of the universe]>
."Author: "Let's suppose that is true. But this 'math' you just used; what is it?"
Interlocutor: "Mathematics is a program which assigns, based on a few simple rules, 'T' or 'F' to inputs formatted in a certain way."
Author: "I think you just moved the hard part of the question to be one question away: What is a program?"
Interlocutor: "Hmm. I can't define a program in terms of math, because I just did the reverse. Wikipedia says a program is 'a sequence of instructions for a computer to execute'. If I just wrote that, Author would ask what it means for a computer to execute an instruction. What else could I write?
A computer is a part of physics arranged into a localized structure, and for this structure to 'execute an instruction' is for physical law to flow through (operate on, apply to) it, such that the structure's observed behavior matches that of a simpler-than-physics abstract system for transforming inputs to outputs. Unfortunately, I've already defined physics in terms of math, so defining a program in terms of physics would be circular. I think I'm at a dead end."
Author: "Do you want to revise one of your previous definitions?"
Interlocutor: "Maybe I could define math as some more fundamental thing instead of as a certain kind of program. But I just failed to find such a more fundamental thing for 'program'. Let's check Wikipedia again...
WP:Mathematics: 'Mathematics involves the description and manipulation of abstract objects that consist of either abstractions from nature or—in modern mathematics—purely abstract entities that are stipulated to have certain properties, called axioms'
WP:Mathematical_Object 'Typically, a mathematical object can be a value that can be assigned to a symbol, and therefore can be involved in formulas. Commonly encountered mathematical objects include numbers, expressions, shapes, functions, and sets.'
"I can't use the 'abstractions from nature' part, because I've already defined nature to be mathematical, so that would be circular. Saying math is made of numbers, expressions, symbols, etc, isn't helpful either, though; Author will just ask what those are, metaphysically. Okay, I concede for now, though maybe a commenter will propose such a more-fundamental-thing that I'm not aware of, though it would invite the same question."
[End of dialogue]
(It's not actually the case that I don't value their well-being; I don't want them to suffer and if they were tortured, imagining that would make me sad; I'd prefer beings who don't care about some subset of other beings / are value-orthogonal to just be prevented from hurting them. I just think Richard, in the current world, probably causes more tragedy, based on the comment, so yes I think the current world would be better if it did not have any such people.)
Agreed about the double standard part, that's something I was hoping to highlight.
A community that can accept "nazis" (in the vegan sense) cannot also accept "resistance fighters" (in the vegan sense). Either the "nazi" deserves to exist or he doesn't. But to test this dichotomy, somebody has to out themselves as a "nazi."
This doesn't seem true. It seems like it's saying that the directly opposing views on this cannot both exist in a "community" (to the extent LW is a community), but they evidently do both exist here (which is to be expected with enough users).
(quoting the comment by Richard_Kennaway that started this thread since it's 2 years old, and plausibly some will see my comment from the 'new' section and be confused by what I write next otherwise)
I eat meat and wear leather and wool. I do think that animals, the larger ones at least, can suffer. But I don’t much care. I don’t care about animal farming, nor the (non-human) animal suffering resulting from carnivores and parasites. I’d rather people not torture their pets, and I’d rather preserve the beauty and variety of nature, but that is the limit of my caring. If I found myself on the surface of a planet on which the evolution of life was just beginning, I would let it go ahead even though it mean all the suffering that the last billion years of this planet have seen.
Bring on the death threats.
[Even if one thinks, in a utilitarian sense, the world would be better without such a person in it], killing them would still be a waste of one's opportunity to affect the world, given there are much more effective ways to improve the future (e.g., donating $1k to an animal charity does more good IIUC; more ambitiously, helping solve alignment saves all the animals in one go, if we don't die to unaligned ASI first).
(I feel like this comment would be incomplete without also mentioning that I guess most, but not all, people who state they're indifferent to and cause non-human suffering now would eventually come to reproach that view and behavior, and that relative to future beings who have augmented their thinking ability and lived for thousands of years, all current beings are like children, some hurting others very badly in confusion.)
Which I took to mean that they overlap in some instrumental goals. That is what you meant, right?
No. I was trying to explain that any agent that can be predicted by thinking of it as having two separate values for two different things can also be predicted by thinking of it as maximizing some single value which internally references both things.
For example: "I value paperclips. I also value stamps, but one stamp is only half as valuable as a paperclip to me" → "I have the single value of maximizing this function over the world: {paperclip-amount×2 + stamp-amount}". (It's fine to think of it in either way)
can you be explicit (to be honest, use layman's terms)
If you want, it would help me learn to write better if you listed all the words (or sentences) that confused you.
are you saying that superintelligent AGIs won't necessarily converge in values because even a single superintelligent agent may have multiple terminal goals?
No, I was responding to your claim that I consider unrelated. Like I wrote at the top: "That [meaning your claim that humans have multiple terminal goals] is not relevant to whether there are convergent terminal values"[1]
some terminal goals may overlap in that they share certain instrumental goals?
I don't know what this is asking / what 'overlap' means. That most terminal goals share instrumental subgoals is called instrumental convergence.
Which, in other words, means that even if it were true, "humans have multiple terminal goals" would not be a step in the argument for it
If there are no constraints on when they have to name a system "GPT-5", they can make the statement true by only naming a system "GPT-5" if it is smart enough. (cf. Not Technically a Lie)
Edit: though "and those two [...] I'd like to hear back from you in a little bit of time" implies a system named GPT-5 will be released 'in a little bit of time'