Lorxus — LessWrong

Mathematician, agent foundations researcher, doctor. A strange primordial spirit left over from the early dreamtime, the conditions for the creation of which no longer exist; a creature who was once told to eat math and grow vast and who took that to heart; an escaped feral academic.

Reach out to me on Discord and tell me you found my profile on LW if you've got something interesting to say; you have my explicit permission to try to guess my Discord handle if so. You can't find my old abandoned-for-being-mildly-infohazardously-named LW account but it's from 2011 and has 280 karma.

A Lorxus Favor is worth (approximately) one labor-day's worth of above-replacement-value specialty labor, given and received in good faith, and used for a goal approximately orthogonal to one's desires, and I like LessWrong because people here will understand me if I say as much.

Apart from that, and the fact that I am under no NDAs, including NDAs whose existence I would have to keep secret or lie about, you'll have to find the rest out yourself.

I didn't realize it when I posted this, but the anvil problem points more sharply at what I want to argue about when I say that making the NAS blind to its own existence will make it give wrong answers; I don't think that the wrong answers would be limited to just such narrow questions, either.

How did this turn out?

Flip a coin if you are struggling to decide between options in a situation where there are relatively low stakes. This exposes to you your gut instinct immediately, which is more than good enough most times, and it is far faster than logically finding an answer.

Better yet, if you subscribe to many-worlds and you do actually care about trying both options, use a quantum coin. Don't take one option - take both of them.

...benign scenarios in which AIs get legal rights and get hired to run our society fair and square. A peaceful AI takeover would be good, IMO.
...humans willingly transfer power to AIs through legal and economic processes. I think this second type will likely be morally good, or at least morally neutral.

Why do you believe this? For my part, one of the major ruinous scenarios on my mind is one where humans delegate control to AIs that then goal-misgeneralize, breaking complex systems in the process; another is one where AIs outcompete ~all human economic efforts "fair and square" and end up owning everything, including (e.g.) rights to all water, partially because no one felt strongly enough about ensuring an adequate minimum baseline existence for humans. What makes those possibilities so unlikely to you?

The more EAs I meet, the more I realize that wanting the challenge is a load-bearing pillar of sanity when working on alignment.
When people first seriously think about alignment, a majority freak out. Existential threats are terrifying. And when people first seriously look at their own capabilities, or the capabilities of the world, to deal with the problem, a majority despair. This is not one of those things where someone says “terrible things will happen, but we have a solution ready to go, all we need is your help!”. Terrible things will happen, we don’t have a solution ready to go, and even figuring out how to help is a nontrivial problem. When people really come to grips with that, tears are a common response.
… but for someone who wants the challenge, the emotional response is different. The problem is terrifying? Our current capabilities seem woefully inadequate? Good; this problem is worthy. The part of me which looks at a rickety ladder 30 feet down into a dark tunnel and says “let’s go!” wants this. The part of me which looks at a cliff face with no clear path up and cracks its knuckles wants this. The part of me which looks at a problem with no clear solution and smiles wants this. The response isn’t tears, it’s “let’s fucking do this”.

"Problems worthy of attack prove their worth by fighting back."

Which is to say - despite a lot of other tragedies about me, there is a core part of me, dinged-up and bruised but still fighting, that looks at a beautiful core mystery and says - "No, unacceptable - we must know. We will know. I am hungry, and will chase this truth down, and it will not evade my jaws for long." (Sometimes it even gets what it wants.)

I'm confused - I don't see any? I certainly have some details of arguable value though.

Holy heck! I'm glad you're alright. I would never have thought to make a LW post out of an experience like that. Winning personality, indeed.

I'd propose "(the problem of) abstraction layer underspecification" or maybe just "engineering dyslocation" (in the sense of engineering being about layered abstractions and it'd be a category error to try to do pure materials science to make a spaceship).

Still in search of a name for this, or did you move on?

(Crossposted from https://tiled-with-pentagons.blogspot.com/ with thanks to @johnswentworth for the nudge.)

I read the paper, and I don't think that Yoshua Bengio's Non-Agentic Scientist (NAS) AI idea is likely to work out especially well. Here are a few reasons I'm worried about LawZero's agenda, in rough order of where in the paper my objection arises:

Making the NAS render predictions as though its words will have no effect and/or as if it didn't exist means that its predictive accuracy will be worse. Half the point here is for humans to take its predictions and use them to some end, and this will have effects; the NAS as described can only make predictions about a world it isn't in. More subtly, the NAS will make its predictions and give its probability estimates with respect to a subtly different world - one where the power it uses, the chips comprising it, and the scientists working at its lab were all doing different things; this has impacts on (e.g.) the economy and the weather (though possibly for the economy they even out?).
If the NAS is dumber than an agentic AI, the agentic AI will probably be able to fool our NAS about the purpose of its actions. Wasn't the hope here for the NAS to give advance warning of what an agentic AI might do?
A NAS as described would not do much about the kind of spoiler who would release an unaligned agentic AI. A lot of other plans share this problem, admittedly, but I think it's worth noting that explicitly.
Likewise, arms race dynamics mean that a NAS would not be where any nation-state or large corporate actor would want to stop. In particular, I think it's worth noting that no parallel to SALT is nearly as likely to arise - an AGI would be a massive economic boost to whoever controlled it for however long they controlled it; it wouldn't just be an existential threat to keep in one's back pocket.
"...using unbiased and carefully calibrated probabilistic inference does not prevent an AI from exhibiting deception and bias." (p22)
I'm suspicious about the use of purely synthetic data; this runs a risk of overfitting to some unintended pattern in the generated data, or the synthetic-ness of the data meaning that some important messy implicit aspect of the real world gets missed.
It's not at all clear that there should be a "unique correct probability" in the case of an underspecified query, or one which draws on unknown or missing data, or something like economic predictions where the probability itself affects outcomes. In a similar vein, it's not clear how the NAS would generate or label latent variables, or that those latent variables would correspond to anything human-comprehensible.
Natural language is likely too messy, ambiguous, and polysemantic to give nice clean well-defined queries in.
Reaching the global optimum of the training objective (that is, training to completion) is already fraught - for one, how do we know that we got there, and not to some faraway local optimum that's nearly as good? Additionally, elsewhere in the paper (p35?), the fact that we only aim at an approximation of the global optimum is mentioned.
It seems plausible to me that a combination of Affordances and Intelligence might lead to the arising of a Goal of some kind, or at least Goal-like behavior.
Even a truly safe ideal NAS could (p27) be a key component of a decidedly unsafe agentic AI, or a potent force-multiplier for malfeasant humans.
The definition of "agent" as given seems importantly incomplete. The capacity to pick your own goals feels important; conversely, acting as though you have goals should make you an agent, even if you have no explicit or closed-form goals.
Checking whether something lacks preferences seems very hard.
Even the mere computation of probabilistic answers is fraught. Even if we dodge problems of self-dependent predictions by making the NAS blind to its own existence or effects (itself a fraught move), I doubt that myopia alone will suffice to dodge agenticity; the NAS could (e.g.) pass notes to itself by way of effects it (unknowingly?) has on the world, which then get fed back in as part of data for the next round of analysis.
The comment about "longer theories [being] exponentially downgraded" makes me think of Solomonoff induction. It's not clear to me what language/(prior/Turing-machine) we pick to express the theories in, and it also seems like that choice matters a lot.
I'm not happy about the "false treasures" thing (p33), nor about the part where L0 currently has no plan for tackling it.
It's not clear what the "human-interpretable form" for the explanations (p37) would look like; also, this conflicts with the principle that the only affordance that the NAS has should be the ability to give probability estimates in reply to queries.
Selection on accuracy (p40) seems like the kind of thing that could deep-deception-style cause us to end up with an agentic AI even despite our best efforts.
The "lack of a feedback loop with the outside world" seems like it would result in increasing error as time passes.
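To make the worry about a "unique correct probability" concrete for queries where the announced probability itself affects the outcome, here is a minimal sketch. It is not from the paper. It assumes a hypothetical response function that maps an announced probability p to the chance the event then actually happens, and it treats an announcement as self-consistent only when response(p) = p. The three response functions and every name below are invented purely for illustration.

```python
import math
from typing import Callable, List


def consistent_probabilities(
    response: Callable[[float], float], grid: int = 1000, tol: float = 1e-9
) -> List[float]:
    """Find announced probabilities p in [0, 1] with response(p) == p.

    `response` is a hypothetical model of how the world reacts to the
    announcement. Scans a grid for sign changes of response(p) - p and
    refines each one by bisection.
    """

    def gap(p: float) -> float:
        return response(p) - p

    roots: List[float] = []

    def record(p: float) -> None:
        # Keep p only if it really is self-consistent (bisection can also
        # converge onto a jump discontinuity) and is not a duplicate.
        if abs(gap(p)) <= 1e-6 and all(abs(p - r) > 1e-6 for r in roots):
            roots.append(p)

    for i in range(grid):
        lo, hi = i / grid, (i + 1) / grid
        g_lo, g_hi = gap(lo), gap(hi)
        if g_lo == 0.0:
            record(lo)
        if g_hi == 0.0:
            record(hi)
        if g_lo * g_hi < 0:
            while hi - lo > tol:
                mid = (lo + hi) / 2
                if gap(lo) * gap(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            record((lo + hi) / 2)
    return roots


def self_fulfilling(p: float) -> float:
    # Hypothetical: belief feeds the outcome (e.g. a bank run).
    return 1.0 / (1.0 + math.exp(-12.0 * (p - 0.5)))


def self_defeating(p: float) -> float:
    # Hypothetical: people act smoothly against the forecast.
    return 1.0 - p


def contrarian_step(p: float) -> float:
    # Hypothetical: people act sharply against the forecast.
    return 1.0 if p < 0.5 else 0.0


if __name__ == "__main__":
    for f in (self_fulfilling, self_defeating, contrarian_step):
        print(f.__name__, consistent_probabilities(f))
```

Under these toy assumptions, the self-fulfilling case has three self-consistent answers, the smooth self-defeating case has exactly one, and the step case has none, so "the" correct probability can be non-unique or nonexistent depending on how the world responds to the forecast.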

As a meta point, this seems like the most recent in a long line of "oracle AI" ideas. None of them has worked out especially well, not least because humans see money and power to grab by making more agentic ML systems instead.

Also, not a criticism, but I'm curious about the part where (p24) we want a guardrail that's "...an estimator of probabilistic bounds over worst-case scenarios" and where (p29) "[i]t's important to choose our family of theories to be expressive enough" - to what extent does this mean an infrabayesian approach is indicated?
