quila

some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.

independent researcher theorizing about superintelligence-robust training stories and predictive models
(other things i am: suffering-focused altruist, vegan, fdt agent, artist)

contact: {discord: quilalove, matrix: @quilauwu:matrix.org, email: quila1@protonmail.com}

-----BEGIN PGP PUBLIC KEY BLOCK-----

mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----

Comments

quila · 10

It sounds like you're anthropic updating on the fact that we'll exist in the future

The quote you replied to was meant to be about the past.[1]

I can see why it looks like I'm updating on existing in the future, though.[2] I think it may be more interpretable when framed as choosing actions based on which kinds of paths into the future are more likely, which should include assessing where our observations so far would fall. E.g., I think that "we solve traditional alignment right as the x-risk is very near" is less likely than "we learn something really important from existing AGIs at that point, and that's what enables us to create safe ASI." Because I think that, actually being at that point seems like some evidence {for being on / for acting as if we're on} that second path.
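
A minimal sketch of the update I have in mind, in odds form (the labels A, B, and E are introduced here only for illustration): let A = "traditional alignment is solved right as the x-risk is very near", B = "something learned from existing AGIs at that point enables safe ASI", and E = the observation of actually being at that point. Then

\[
\frac{P(B \mid E)}{P(A \mid E)} \;=\; \frac{P(E \mid B)}{P(E \mid A)} \cdot \frac{P(B)}{P(A)}
\]

If E is roughly as expected on either path (both run through that point), the posterior odds mostly track the prior odds, which I think favor B.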

This should only constitute an anthropic update to the extent you think more-agentic architectures would have already killed us

I do think that. I think that superintelligence is possible to create with much less compute than is being used for SOTA LLMs. Here's a thread with some general arguments for this.

Of course, you could claim that our understanding of the past is not perfect, and thus should still update

I think my understanding of why we've survived so far, re: AI, is far from perfect. For example, I don't know what would have needed to happen for the more-agentic architectures to have been found first, or (framed inversely) how lucky we needed to be to survive this far.

~~~

I'm not sure if this reply will address the disagreement, or if it will still seem from your pov that I'm making some logical mistake. I'm not actually fully sure what the disagreement is. You're welcome to try to help me understand if one remains.

  1. ^

    I originally thought you were asking why it's true of the past, but then I realized we very probably agreed (in principle) in that case.

  2. ^

    And to an extent it internally feels like I'm doing this, and then asking "what do my actions need to be to make this true?", in a similar sense to how an FDT agent would in transparent Newcomb's problem. But framing it like this is probably unnecessarily confusing, and I feel confused about this description.

quila · 10

(I think I misinterpreted your question and started drafting another response; I'll reply to the relevant portions of this reply there)

quila · 10

Suggestion: a marker for recommended posts that are older than some threshold. I was just reading this post, which was recommended to me, and got halfway through before seeing it was 2 years out of date :(

https://www.lesswrong.com/posts/3S4nyoNEEuvNsbXt8/common-misconceptions-about-openai

(Or maybe it's unnecessary and I'll get used to checking post dates on the algorithmic frontpage)

quila · 10

At what point should I post content as top-level posts rather than shortforms?

For example, a recent piece I posted as a shortform was ~250 concise words plus an image: 'Anthropics may support a 'non-agentic superintelligence' agenda'. It would be a top-level post on my blog if I had one set up (maybe soon :p).

Some general guidelines on this would be helpful.

quila · 10

Anthropics may support a 'non-agentic superintelligence' agenda

  1. Creating superintelligence generally leads to runaway optimization.
  2. Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.
  3. By default, I'd expect the 'consistent underlying reason' to be a prolonged alignment effort in the absence of capabilities progress. However, this seems inconsistent with the observation of progressing from AI winters to a period of vast training runs and widespread technical interest in advancing capabilities.
  4. That particular 'consistent underlying reason' is therefore likely not the one which succeeds most often. The actual distribution would then have other, more common paths to survival (a rough sketch of this update is given below).
  5. The actual distribution could look something like this:

[image: distribution over paths to survival, with proportions over time]

Note that the yellow portion doesn't imply that no effort is made to ensure the first ASI's training setup produces a safe system, i.e. that we 'get lucky' by being on an alignment-by-default path.

I'd instead expect the 'luck'/causal determinant to have come earlier, i.e. initial capabilities breakthroughs being of a type which first produced non-agentic general intelligences rather than seed agents, and which inspired us to try to make sure the first superintelligence is non-agentic, too.

(This same argument can also be applied to other possible agendas that may not have been pursued if not for updates caused by early AGIs)
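
A rough formalization of steps 2-4 (a sketch only; the symbols R, S, and O are labels introduced here, not anything standard): for a candidate underlying reason R, the event S of our survival so far, and the observation O of AI winters giving way to vast training runs and widespread capabilities interest,

\[
P(R \mid S, O) \;\propto\; P(O \mid R, S)\, P(S \mid R)\, P(R)
\]

The 'prolonged alignment effort in the absence of capabilities progress' reason is heavily discounted by the P(O | R, S) term, so most of the posterior mass should sit on other reasons, i.e. other paths to survival.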

  1. ^

    Disclaimer: This is presented as Bayesian evidence rather than as a 'sure thing I believe'

  2. ^

    Rather than expecting to 'get lucky many times in a row', e.g. via capabilities researchers continually overlooking a human-findable advance

  3. ^

    (The proportions over time here aren't precise; I put more effort into making this image easy to read. It also doesn't include unknown unknowns.)

quila · 20

my language progression on something, becoming increasingly general: goals/value function -> decision policy (not all functions need to be optimizing towards a terminal value) -> output policy (not all systems need to be agents) -> policy (in the space of all possible systems, there exist some whose architectures do not converge to output layer)

(note: this language isn't meant to imply that a system's behavior must be describable with some simple function; in the limit, the descriptive function and the neural network are the same)

quila · 30

Reflecting on this more, I wrote in a Discord server (then edited to post here):

I wasn't aware the concept of pivotal acts was entangled with the frame of formal inner+outer alignment as the only (or only feasible?) way to cause safe ASI.

I suspect that by default, I and someone operating in that frame might mutually believe each other's agendas to be probably-doomed. This could make discussion more valuable (as in that case, at least one of us should make a large update).

For anyone interested in trying that discussion, I'd be curious what you think of the post linked above. As a comment on it says:

I found myself coming back to this now, years later, and feeling like it is massively underrated. Idk, it seems like the concept of training stories is great and much better than e.g. "we have to solve inner alignment and also outer alignment" or "we just have to make sure it isn't scheming."

In my view, solving formal inner alignment, i.e. devising a general method to create ASI with any specified output-selection policy, is hard enough that I don't expect it to be done.[1] This is why I've been focusing on other approaches which I believe are more likely to succeed.

 

  1. ^

    Though I encourage anyone who understands the problem and thinks they can solve it to try to prove me wrong! I can sure see some directions and I think a very creative human could solve it in principle. But I also think a very creative human might find a different class of solution that can be achieved sooner. (Like I've been trying to do :)

quila · 10

(see reply to Wei Dai)

quila · 50

it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions, while alignment is not yet fully solved

Agreed that some think this, and agreed that formally specifying a simple action policy is easier than a more complex one.[1] 

I have a different model of what the earliest safe ASIs will look like, in most futures where they exist. Rather than 'task-aligned' agents, I expect them to be non-agentic systems which can be used to, e.g., come up with pivotal actions for the human group to take / information to act on.[2]

  1. ^

    although formal 'task-aligned agency' seems potentially more complex than the attempt at a 'full' outer alignment solution that I'm aware of (QACI); that is, specifying what a {GPU, AI lab, shutdown of an AI lab} is seems more complex than QACI itself.

  2. ^

    I think these systems are more attainable; see this post to possibly infer more info (it's proven very difficult for me to write in a way that I expect will be moving to people who have a model focused on 'formal inner + formal outer alignment', but I think evhub has done so well).

quila · 126

On Pivotal Acts

I was rereading some of the old literature on alignment research sharing policies after Tamsin Leake's recent post and came across some discussion of pivotal acts as well.

Hiring people for your pivotal act project is going to be tricky. [...] People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.

This is in a context where the 'pivotal act' example is using a safe ASI to shut down all AI labs.[1]

My thought is that I don't see why a pivotal act needs to be that. I don't see why shutting down AI labs or using nanotech to disassemble GPUs on Earth would be necessary. These may be among the 'most direct' or 'simplest to imagine' possible actions, but in the case of superintelligence, simplicity is not a constraint.

We can instead select for the 'kindest' or 'least adversarial' actions, or more precisely: the functional-decision-theoretically optimal actions that save the future while minimizing the amount of adversariality this creates in the past (present).

Which can be broadly framed as 'using ASI for good'. Which is what everyone wants, even the ones being uncareful about its development.

Capabilities orgs would be able to keep working on fun capabilities projects in those days during which the world is saved, because a group following this policy would choose to use ASI to make the world robust to the failure modes of capabilities projects rather than shutting them down. Because superintelligence is capable of that, and so much more.

  1. ^

    side note: It's orthogonal to the point of this post, but this example also makes me think: if I were working on a safe ASI project, I wouldn't mind if another group who had discreetly built safe ASI used it to shut my project down, since my goal is 'ensure the future lightcone is used in a valuable, tragedy-averse way' and not 'gain personal power' or 'have a fun time working on AI' or something. In my morality, it would be naive to be opposed to that shutdown. But to the extent humanity is naive, we can easily do something else in that future to create better present dynamics (as the main text argues).

    If there is a group for whom 'using ASI to make the world robust to risks and free of harm, in a way where its actions don't infringe on ongoing non-violent activities' is problematic, then this post doesn't apply to them: their issue all along was not with the character of the pivotal act, but possibly with something like 'having my personal cosmic significance as a capabilities researcher stripped away by the success of an external alignment project'.

    Another disclaimer: This post is about a world in which safely usable superintelligence has been created, but I'm not confident that anyone (myself included) currently has a safe and ready method to create it. This post shouldn't be read as an endorsement of possible current attempts to do this. I would of course prefer it if this civilization were one which could coordinate such that no groups were presently working on ASI, precluding this discourse.
