The thing I have in mind as a north star looks closest to the GD Challenge in scope, but somewhat closer to the CIP one in implementation? The diff is something like:
I'm glad you're seeing the challenges of consequentialism. I think the next crux is something like: my guess is that consequentialism is a weed which grows in the cracks of any strong cognitive system, and that without formal guarantees of non-consequentialism, any attempt to build an ecosystem of the kind you describe will end up being eaten by processes which are unboundedly goal-seeking. I don't know of any write-up that hits exactly the notes you'd want here, but some maybe-decent intuition pumps in this direction include: The Parable of Predict-O-Matic, Why Tool AIs Want to Be Agent AIs, Averting the convergent instrumental strategy of self-improvement, Averting instrumental pressures, and the other articles under Arbital's corrigibility section.
I'd be open to having an on the record chat, but it's possible we'd get into areas of my models which seem too exfohazardous for public record.
Needs, as you've framed them, have a fuzzy boundary with wants. Do I need respect, or just want it in this situation? So it's easy to wonder if I'm pressuring someone by framing it as a need.
Yeah, the idea is to go back to as basic a preferred pattern as you can. If I were trying to make it super concrete I'd probably try to unpack it to "thing grounded in basic human universal reinforcement signals", with a bunch of @Steven Byrnes's neuroscience, esp this stuff.
Nice! Glad you're getting stuck in, and good to hear you've already read a bunch of the background materials.
The idea of bounded, non-maximizing agents / multipolar scenarios as safer has looked hopeful to many people over the field's development. It's a reasonable place to start, but my guess is that if you zoom in on the dynamics of those systems they look profoundly unstable. I'd be enthusiastic to have a quick call to explore the parts of that debate interactively. I'd link a source explaining it, but I think the alignment community has overall done a not-great job of writing up the response to this so far.[1]
The very quick headline is something like:
I'd be interested to see if we've been missing something, but my guess is that systems containing many moderately capable agents (~top human capabilities researcher) which are trained away from being consequentialists in a fuzzy way almost inevitably fall into the attractor of very capable systems either directly taking power from humans or puppeteering the humans' agency as the AIs improve.
Quick answer-sketches to your other questions:
One direction that I could imagine being promising, and something your skills might be uniquely suited for, would be, with clarity about what technology at physical limits is capable of, to run a large-scale consultation to collect data about humanity's 'north star'. Let people think through where we would actually like to go, so that a system trying to support humanity's flourishing can better understand our values. I funded a small project to try and map people's visions of utopia a few years back (e.g.), but the sampling and structure weren't really the right shape to do this properly.
https://www.lesswrong.com/posts/DJnvFsZ2maKxPi7v7/what-s-up-with-confusingly-pervasive-goal-directedness is one of the less bad attempts to cover this; @the gears to ascension might know of, or be writing up, a better source
(plus lots of applause lights for things which are actually great in most domains, but don't super work here afaict)
It's cool to see you involved in this sphere! I've been seeing and hearing about your work for a while, and have been impressed by both your mechanism design and ability to bring it into large-scale use.
Reading through this, I get the impression that it's missing some background on models of what strong superintelligence looks like: both the challenges of the kind of alignment that's needed to make that go well, and just how extreme the transition will by default end up being.
Even without focusing on that, this work is useful in some timelines and seems worthwhile, but my guess is you'd get a fair amount of improvement in your aim by picking up ideas from some of the people with the longest history in the field. Some of my top picks would be (approx in order of recommendation):
Or, if you'd like something book-length, AI Does Not Hate You: Rationality, Superintelligence, and the Race to Save the World is the best until later this month, when If Anyone Builds It, Everyone Dies comes out.
Working memory bounds aren't super related to non-fuzzy-ness, as you can have a window which slides over context and is still rigorous at every step. Absolute local validity, due to the well-specifiedness of the axioms and rules of inference, is closer to the core.
(realised you mean that the axioms and rules of inference are in working memory, not the whole tower, retracted)
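To make the sliding-window point above concrete, here's a minimal toy sketch (my own illustration, not something from the thread; the AXIOMS, WINDOW, and check names are made up): each proof line may only consult a bounded window of prior lines, yet because the axioms and the single inference rule are fully specified, every accepted step is absolutely locally valid.

```python
# Toy illustration: a proof checker whose "working memory" is a small sliding
# window over previous lines, yet whose verdict is rigorous at every step,
# because the axioms and the inference rule (modus ponens) are well specified.

AXIOMS = {"p", "p -> q", "q -> r"}
WINDOW = 3  # how many previous lines the checker may look back at


def follows_by_modus_ponens(line: str, visible: list[str]) -> bool:
    """True if some antecedent a and the implication 'a -> line' are both visible."""
    return any(f"{a} -> {line}" in visible for a in visible)


def check(proof: list[str]) -> bool:
    for i, line in enumerate(proof):
        visible = proof[max(0, i - WINDOW):i]  # the sliding window
        if line not in AXIOMS and not follows_by_modus_ponens(line, visible):
            return False  # one locally invalid step sinks the whole proof
    return True


print(check(["p", "p -> q", "q", "q -> r", "r"]))  # True
print(check(["p", "r"]))                           # False: 'r' has no local support
```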
I'd go with
Don't make claims that plausibly conflict with their models, unless they can check the claims (you are a valid source for claims purely about your own state).
+
Don't make underspecified requests, and ground your requests in general needs, with space for those needs to be met in other ways.
NVC is a form of variable scoping for human communication. With it, you can write communication code that avoids the most common runtime conflicts.
Human brains are neural networks doing predictive processing. We receive data from the external world, and not all of that data is trusted in the computer-security sense. Some of the data would modify parts of your world model which you'd like to be able to do your own thinking with. It's jarring and unpleasant for someone to send you an informational packet that, as you parse it, moves around parts of your cognition you were relying on not being moved. For example, think back to dissonances you've felt, or seen in others, due to direct forceful claims about their internal state. This makes sense! Suffering, in predictive processing, is unresolved error signal: two conflicting world-models contained in the same system. If the other person's data packet writes claims directly into your world model, rather than going through your normal evaluation functions, you're often going to end up with suffering-flavoured superpositions in your world-model.
NVC is a safe-mode, restricted subset of communication where you make sure the type signature of your conversational object changes the other person's fuzzy, non-secured predictive-processing state only in ways carefully scoped to fairly reliably not collide with their thought-code, while keeping enough flexibility to resolve conflict. You don't necessarily want to run it all the time: it does limit a bunch of communication which is nice in high-trust conversations, where global scope lets you move faster, but it's amazing as a way to get out of, or avoid, otherwise painful conflict spirals.
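A minimal sketch of the scoping framing, purely as illustration (the NVCMessage type and its field names are mine, not a formal spec of NVC; they just label the usual observation/feeling/need/request components): every field is scoped to the speaker's own state or to checkable events, so parsing the message never writes directly into the listener's world model.

```python
# Toy sketch of the scoping metaphor. Speaker-scoped fields only describe the
# speaker's own state or concrete, checkable events.
from dataclasses import dataclass
from typing import Optional


@dataclass
class NVCMessage:
    observation: str               # checkable event: "the meeting started at 9:40"
    feeling: str                   # speaker's own state: "I felt anxious"
    need: str                      # general need, not a specific demand: "predictability"
    request: Optional[str] = None  # specific and refusable: "could we confirm times the day before?"


# An "unscoped" claim, by contrast, writes directly into the listener's model:
#   "You don't respect my time."
# That's the kind of statement which collides with cognition the listener was
# relying on not being moved, and where conflict spirals tend to start.

msg = NVCMessage(
    observation="the meeting started at 9:40",
    feeling="I felt anxious",
    need="predictability",
    request="could we confirm times the day before?",
)
```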
So! Safe scopes:
Seconded; I consistently find your comments both much more valuable and carrying ~zero sneer. I would be dismayed by moderation actions towards you, while supporting those against Said. You might not have a sense of how his comments are different, but you automatically avoid the costly things he brings.
I think there's a happy medium between these two bad extremes, and the vast majority of LWers sit in it generally.
[set 200 years after a positive singularity at a Storyteller's convention]
If We Win Then...
My friends, my friends, good news I say
The anniversary’s today
A challenge faced, a future won
When almost came our world undone
We thought for years, with hopeful hearts
Past every one of the false starts
We found a way to make aligned
With us, the seed of wondrous mind
They say at first our child-god grew
It learned and spread and sought anew
To build itself both vast and true
For so much work there was to do
Once it had learned enough to act
With the desired care and tact
It sent a call to all the people
On this fair Earth, both poor and regal
To let them know that it was here
And nevermore need they to fear
Not every wish was it to grant
For higher values might supplant
But it would help in many ways:
Technologies it built and raised
The smallest bots it could design
Made more and more in ways benign
And as they multiplied untold
It planned ahead, a move so bold
One planet and 6 hours of sun
Eternity it was to run
Countless probes to void disperse
Seed far reaches of universe
With thriving life, and beauty's play
Through endless night to endless day
Now back on Earth the plan continues
Of course, we shared with it our values
So it could learn from everyone
What to create, what we want done
We chose, at first, to end the worst
Diseases, War, Starvation, Thirst
And climate change and fusion bomb
And once these things it did transform
We thought upon what we hold dear
And settled our most ancient fear
No more would any lives be stolen
Nor minds themselves forever broken
Now back to those far speeding probes
What should we make be their payloads?
Well, we are still considering
What to send them; that is our thing.
The sacred task of many aeons
What kinds of joy will fill the heavens?
And now we are at story's end
So come, be us, and let's ascend