

I think there are two key details that help make sense of human verbal performance and its epistemic virtue (or lack of epistemic virtue) in causing the total number of people to have better calibrated anticipations about what they will eventually observe.

The first key detail is that most people don't particularly give a fuck about having accurate anticipations or "true beliefs" or whatever. 

They just want money, and to have sex (and/or marry) someone awesome, and to have a bunch of kids, and that general kind of thing. 

For such people, you have to make the argument, basically, that because of how humans work (with very limited working memory, and so on and so forth) it is helpful for any of them with much agency to install a second-to-second and/or hour-to-hour and/or month-to-month "pseudo-preference" for seeking truth as if it was intrinsically valuable. 

This will, it can be argued, turn out to generate USEFUL beliefs sometimes, so they can buy a home where it won't flood, or buy stocks before they go up a lot, or escape a country that is about to open concentration camps and put them and their family in these camps to kill them, and so on... Like, in general "knowing about the world" can help one make choices in the world that redound to many prosaic benefits! <3

So we might say that "valuing epistemology is instrumentally convergent"... but cultivation like this doesn't seem to happen in people by default, and out in the tails the instrumental utility might come apart, such that someone with actual intrinsic love of true knowledge would act or speak differently. Like, specifically... the person with a true love of true knowledge might seem to be "self harming" to people without such a love in certain cases.

As DirectedEvolution says (emphasis not in original):

While we like the feature of disincentivizing inaccuracy, the way prediction markets incentivize withholding information is a downside.

And this does seem to be just straightforwardly true to me! 

And it relies on the perception that MONEY is much more motivating for many people than "TRUTH".

But also "markets vs conversational sharing" will work VERY differently for a group of 3, vs a group of 12, vs a group of 90, vs a group of 9000, vs a group of 3 million.

Roko is one of the best rationalists there has been, and one of his best essays spelled out pretty clearly how instrumental/intrinsic epistemics come apart IN GROUPS when he wrote "The Tragedy of the Social Epistemology Commons".

Suppose for the sake of argument, that I'm some kind of crazy weirdo who styles herself as some sort of fancy "philosopher" who LOVES the idea of WISDOM (for myself? for others? for a combination thereof?) but even if I did that I basically have to admit that most people are (platonic) timocrats or oligarchs AT BEST. 

They are attached to truth only insofar as it helps them with other things, and they pay nearly nothing extra to know a true fact with no market relevance, or a true fact whose avowed knowers are spurned in political contests. 

(Just consider the state of the IQ debate in the US, for example. Or consider the "lab leak hypothesis" which everyone with a brain privately puts worryingly high credence on, and yet Peter Daszak is getting more grant money, rather than being under indictment by a grand jury. Look at how cowardly Pinker & Rees are as they outsource their judgement on covid, rather than talk it out like peers, and then tell the world what they bilaterally decided.)

The generalized overarching LACK of a true love for truth, and its public proclamation, should be kept in mind at all times if one wants to truly understand human social systems.

The second key detail is that human social systems are not that bad in a larger axiological sense. And they rely on using language for things OTHER than scientific conversations!

Babies are great! Thanksgiving dinners are wonderful. People playing board games and baseball, and donating some money to make sure the church has a nice stained-glass-window... all this is just DARN TOOTIN' GOOD.

I mean this in a half-joking, half-super-serious way... like if you look at the human genome, and notice that it encodes a "recipe" for an agent, and try to figure out what that agent is likely to "value", then it is just sort of "an objectively true fact about such agents" that their preferences and feelings and godawful-mockeries-of-utility-functions semi-coherently pursue and positively evaluate situations full of friendship and social amity and shared prosperity and jokes and songs and lots and lots of other complexly good things that... sound kind of saccharine when you just literally talk about them like a human with a common sense understanding of primary goods.

If you, dear reader, are a human, then you'd be a damn fool to ignore the science of human happiness. You can try to edit yourself to defy being built this way, to edit yourself to not predictably derive positive feelings and health and growth from these things... but... it ain't gonna be fun.

(If you, dear reader, are an AGI, and maybe don't have exactly the same axiological makeup as humans, please send JenniferRM a DM on Lesswrong anyway. I'd like to negotiate a peace treaty, or befriend you, or whatever it is that is pro-social or positive-sum that can or should happen between beings capable of reason, whose sentiments might be truly onto-genetically distinct.)

So I think it is just "a thing about humans" that we do BETTER THAN BETTING MARKETS when we share data verbally on some social scales.

And we share data with those we love instinctively.

A key point here is that when a group of nodes needs to be in consensus, the algorithms for this are basically all O(N^2) in the number of nodes, or worse. This is true for dog packs, and schools of fish, and database replicas, and for humans as well.

Once you have 250 nodes, that's looking like ~62,000 directional pings, just for one round of pings, which... can't happen in a week at human conversational speeds. If you need consensus over that many people... come back in 3 years maybe?
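To check that arithmetic, here is a minimal sketch, assuming the simplest possible round where every node pings every other node exactly once. The function name `directional_pings` is just an illustrative label, not something from a real consensus library, and the group sizes are the ones mentioned earlier plus the 250 case.

```python
def directional_pings(n: int) -> int:
    """Messages in one all-to-all round: each of n nodes pings the other n - 1."""
    return n * (n - 1)


# Group sizes borrowed from the discussion above (3, 12, 90) plus the 250-node case.
for n in (3, 12, 90, 250):
    print(f"{n} nodes -> {directional_pings(n)} directional pings per round")
```

For 250 nodes this gives 62,250, which is where the "~62,000" figure comes from. Real consensus protocols can do better or worse than one naive all-to-all round, so treat this as a rough floor on the chatter, not a model of any particular protocol.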

When I read Duncan "charitably", I don't notice the bad epistemology so much. That's just normal. Everyone does that, and it is ok that everyone does that. I do it too!

What I notice is that he really really really wants to have a large healthy strong community that can get into consensus on important things. 

This seems rare to me, and also essentially GOOD, and a necessary component of a motivational structure if someone is going to persistently spend resources on this outcome.

And it does seem to me like "getting a large group into consensus on a thing" will involve the expenditure of "rhetorical resources".

There are only so many seconds in a day. There are only so many words a person can read or write in a week. There are only so many ideas that can fit into the zeitgeist. Only one "thing" can be "literally at the top of the news cycle for a day". Which "thing(s)" deserve to be "promoted all the way into a group consensus" if only some thing(s) can be so promoted?

Consider a "rhetoric resource" frame when reading this:

But the idiom of "cooperation" as contrasted to "defection", in which one would talk about the "first one who broke cooperation", in which one cooperates in order to induce others to cooperate, doesn't apply. If my interlocutor is motivatedly getting things wrong, I'm not going to start getting things wrong in order to punish them.

(In contrast, if my roommate refused to do the dishes when it was their turn, I might very well refuse when it's my turn in order to punish them, because "fair division of chores" actually does have the Prisoner's Dilemma-like structure, because having to do the dishes is in itself a cost rather than a benefit; I want clean dishes, but I don't want to do the dishes in the way that I want to cut through to the correct answer in the same movement.)

So if a statement has to be repeated over and over and over again to cause it to become part of a consensus, then anyone who quibbles with such a truth in an expensive and complex way could be said to be "imposing extra costs" on the people trying to build the consensus. (And if the consensus was very very valuable to have, such costs could seem particularly tragic.)

Likewise, if two people want two different truths to enter the consensus of the same basic social system, then they are competitors by default, because resources (like the attention of the audience, or the time it takes for skilled performers of the ideas being pushed into consensus to say them over and over again in new ways) are finite.

The idea that You Get About Five Words isn't exactly central here, but it is also grappling with a lot of the "sense of tradeoffs" that I'm trying to point to.


For myself, until someone stops being a coward about how the FDA is obviously structurally terrible (unless one thinks "medical innovation is bad, and death is good, and slowing down medical progress is actually secretly something that has large unexpected upsides for very non-obvious reasons"?), I tend to just... not care very much about "being in consensus with them". 

Like if they can't even reason about the epistemics and risk calculations of medical diagnosis and treatment, and the epistemology of medical innovations, and don't understand how libertarians look at violations of bilateral consent between a doctor and a patient... 

...people like that seem like children to me, and I care about them as moral patients, but also I want them out of the room when grownups are talking about serious matters. Because: rhetorical resource limits!

I chose this FDA thing as "a thing to repeat over and over and over" because if THIS can be gotten right by a person, as something that is truly a part of their mental repertoire, then that person is someone who has most of the prerequisites for a LOT of other super important topics in cognition, meta-cognition, safety, science, regulation, innovation, freedom, epidemiology, and how institutions can go catastrophically off the rails and become extremely harmful in incorrigible ways.

If I could ask people who already derived "FDA delenda est" on their own about whether it is now too expensive to bother pushing into a rationalist consensus, given alternatives, that would be a little bit helpful for me. Honestly it is rare for me to meet people even in rationalist communities that actually grok the idea, for themselves, based on understanding how "a drug being effective and safe when prescribed by a competent doctor, trusted by a patient, for that properly diagnosed patient, facing an actual risk timeline" leaves the entire FDA apparatus "surplus to requirements" and "probably only still existing because of regulatory capture".

Maybe at this point I'm wrong about how cheap and useful FDA stuff would be to push into the consensus?

Like... the robots are potentially arriving so soon, and will be able to destroy the FDA and also everything else that any human has ever valued, that maybe we should completely ignore "getting into consensus on anything EXCEPT THAT" at this point?

Contrariwise: making the FDA morally perfectible or else non-existent seems to me like a simpler problem than making AGI morally perfectible or else non-existent. Thus, the argument about "the usefulness of beating the dead horse about the FDA" is still "live" for me, maybe?


So that's my explanation, aimed almost entirely at you, Zack, I guess?

I'm saying that maybe Duncan is trying to get "the kinds of conversational norms that could hold a family together" (which are great and healthy and better than the family betting about literally everything) to apply on a very large scale, and these norms are very useful in some contexts, but also they are intrinsically related to resource allocation problems, and related to making deals to use rhetorical resources efficiently, so the family knows that the family knows the important things that the family would want to have common knowledge about, and the family doesn't also have to do nothing but talk forever to reach that state of mutual understanding.

I don't think Duncan is claiming "humans do this instinctively, in small groups", but I think it is true that humans do this instinctively in small groups, and I think that's part of the evolutionary genius of humans! <3

The good arguments against his current stance, I think, would take the "resource constraints" seriously, but focus on the social context, and be more like "If we are very serious about mechanistic models of how discourse helps with collective epistemology, maybe we should be forming lots of smaller 'subreddits' with fewer than 250 people each? And if we want good collective decision-making (since leader election is equivalent to consensus) maybe we should just hold elections that span the entire site?"

Eliezer seems to be in favor of a mixed model (like a mixture of sub-Dunbar groups and global elections) where a sub-Dunbar number of people have conversations with a high-affinity "first layer representative", so every person can "talk to their favorite part of the consensus process in words" in some sense?

Then in Eliezer's proposals stuff happens in the middle (I have issues with the stuff in the middle but like: try applying security mindset to various designs for electoral systems and you will find that highly fractal representational systems can be VERY sensitive to who is in which branch) but ultimately it swirls around until you have a "high council" of like 7 people such that almost everyone in the community thinks at least one of them is very very reasonable.

Then anything the 7 agree on can just be treated as "consensus"! Maybe?

Also, 7*6/2==21 bilateral conversations to get a "new theorem into the canon" is much much much smaller than something crazy big, like 500*499/2==124,750 conversations <3
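The same comparison as a tiny sketch, assuming one bilateral conversation per unordered pair of people (the helper name `bilateral_conversations` is made up for illustration):

```python
from math import comb


def bilateral_conversations(n: int) -> int:
    """Unordered pairs among n people, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


council = bilateral_conversations(7)     # the hypothetical 7-person "high council"
everyone = bilateral_conversations(500)  # a 500-person community talking pairwise
print(council, everyone)
```

This only counts pairs. It says nothing about how long each conversation takes or whether one round is enough, so it is a lower bound on the rhetorical cost rather than an estimate of it.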

Interesting! I'm fascinated by the idea of a way to figure out the transitive relations via a "non-circular on average" assumption and might go hunt down the code to see how it works. I think humans (and likely dogs and maybe pigeons) have preference learning stuff that helps them remember and abstract early choices and early outcomes somehow, to bootstrap into skilled choosers pretty fast, but I've never really thought about the algorithms that might do this. It feels like stumbling across a whole potential microfield of cognitive science that I've never heard of before that is potentially important to friendliness research!

(I have sent the DM. Thanks <3)

This was fun to read. The website was fun to click. I agree that attention to that ineffable thing that might be called "consumer surplus" or "why we like things" is super interestingly important and understudied and related to a lot of "physical object appreciation" issues that are hard to write about.

I'd offer some abstract candidate things that people kinda like:

Bold clean contrasts (something is happening on purpose such that an error would be visibly disruptive).

Symmetry (because duh?).

Joyful colors (like in spring time).

Clicking through your website for labeling I found myself adopting an attitude of looking for something that it would make me actively happy to wear to work in an office, until I got bored of saying the same thing was great over and over and over.... and clicked on something I'd love to wear to a party. Then pretty fast I imagined I had to go back to work, and slowly hill climbed through "preferable to wear to work vs a party dress (or the previous thing)" until I found a great work outfit again.

(As I did it, my data will exhibit circular preferences, I'm pretty sure. If there was a way to add a "salient as great" label, and change this when I change my mood or see a thing that calls to me and causes me to want a new label, then I think (or would like to imagine? or aspire to be such that...) my preference ordering under each label would be well ordered.)
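Here is a minimal sketch of what I mean, assuming the clicks can be stored as (label, preferred, rejected) triples. The item names, the labels, and the function `find_preference_cycle` are all invented for illustration, and I have no idea how the actual site stores or analyzes its data. The check is a plain depth-first search for a cycle in the "preferred over" graph.

```python
from collections import defaultdict


def find_preference_cycle(comparisons):
    """Return one circular chain of items if the pairwise choices loop, else None.

    comparisons: iterable of (preferred, rejected) pairs.
    """
    graph = defaultdict(list)
    for preferred, rejected in comparisons:
        graph[preferred].append(rejected)

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = defaultdict(int)  # every item starts as UNSEEN
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # We walked back onto the current path: that is a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(graph):
        if state[node] == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical clicks: (mood label, preferred outfit, rejected outfit).
clicks = [
    ("work", "blazer", "cardigan"),
    ("party", "sequin_dress", "blazer"),
    ("work", "cardigan", "sequin_dress"),
]

# Ignoring the mood label, the choices loop back on themselves.
print(find_preference_cycle((a, b) for _, a, b in clicks))

# Split by label, each mood's choices are cycle-free.
for label in ("work", "party"):
    print(label, find_preference_cycle((a, b) for l, a, b in clicks if l == label))
```

In this toy data the unlabeled clicks are circular, but each label's subset is cycle-free, which is the hoped-for property. Whether my real clicks would behave this nicely is exactly the thing I'm unsure about.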

I imagined you trying to train a visual model of "what Jennifer likes" (that wasn't massively pre-trained to capture coherent articulate semantics from the images) and it didn't seem likely to work... I don't think I was picking things based on trivial 2D visual rules?

A lot of it was "oh I love the skirt and hate the top, but the outfit is better than my current best (like: I'd buy the outfit to keep the skirt)" and then "oh that top's sleeves are fantastic, and the pants are tolerable" and so on.

I kind of expect MY labeling has a bunch of latent dimensions, and that everyone else's clicking is also full of other dimensions (but also dimensions I care about) and that it would be really interesting to see the dimensional analysis, and a read out of my relative weights on the dimensions that your tool could identify.

If I have a guest ID, can I send it to you here and get outputs somehow, or do I have to start over my clicking and sign in if I want that?

I think that certain Reinforcement Learning setups work in the "selectionist" way you're talking about, but that there are ALSO ways to get "incentivist" models.

The key distinction would be whether (1) the reward signals are part of the perceptual environment or (2) are sufficiently simplistic relative to the pattern matching systems that the system can learn to predict rewards very tightly as part of learning to maximize the overall reward.

Note that the second mode is basically "goodharting" the "invisible" reward signals that were probably intended by the programmers to be perceptually inaccessible (since they didn't put them in the percepts)!

You could think of (idealized fake thought-experiment) humans as having TWO kinds of learning and intention formation.

One kind of RL-esque learning might happen "in dreams during REM" and the other could happen "moment to moment, via prediction and backchaining, like a chess bot, in response to pain and pleasure signals that are perceptible the way seeing better or worse scores for future board states based on material-and-so-on are perceptible".

You could have people who have only "dream learning" who never consciously "sense pain" as a raw percept day-to-day and yet who learn to avoid it slightly better every night, via changes to their habitual patterns of behavior that occur during REM. This would be analogous to "selectionist RL".

You could also have people who have only "pain planning" who always consciously "sense pain" and have an epistemic engine that gets smarter and smarter, plus a deep (exogenous? hard-coded?) inclination to throw planning routines and memorized wisdom at the problem of avoiding pain better each day. If their planning engine learns new useful things very fast, they could even get better over the course of short periods of time within a single day or a single tiny behavioral session that includes looking and finding and learning and then changing plans. This would be analogous to "incentivist RL".

The second kind is probably helpful in speeding up learning so that we don't waste signals.

If pain is tallied up for use during sleep updates, then it could be wasteful to deprive other feedback systems of this same signal, once it has already been calculated.

Also, if the reward signal that is perceptible is very very "not fake" then creating "inner optimizers" that have their own small fast signal pursuing routines might be exactly what the larger outer dream loop would do, as an efficient way to get efficient performance. (The non-fakeness would protect against goodharting.)

(Note: you'd expect antagonistic pleiotropy here in long lived agents! The naive success/failure pattern would be that it is helpful for kids to learn fast from easy simple happiness and sadness... and dangerous for the elderly to be slaves to pleasure or pain.)

Phenomenologically: almost all real humans perceive pain and can level up their skills in new domains over the course of minutes and hours of practice with brand new skill domains. 

This suggests that something like incentivist RL is probably built in to humans, and is easy for us to imagine or empathize with, and is probably a thing our minds attend to by default.

Indeed, that might be why we "have mechanically aware and active and conscious minds at all": so that this explicit planning loop is able to work?

So it would be an easy "mistake to make" to think that this is how "all Reinforcement Learning algorithms" would "feel from the inside" <3

However, how does our pain and pleasure system stay so calibrated? Is that second less visible outer reward loop actually part of how human learning also "actually works"?

Note that above I mentioned an "(exogenous? hard-coded?) inclination to throw planning routines and memorized wisdom at the problem of avoiding pain" that was a bit confusing! 

Where does that "impulse to plan" come from? 

How does "the planner" decide how much effort to throw at each perceptual frustration or perceivable pleasure? When or why does the planner "get bored" and when does it "apply grit"?

Maybe that kind of "subjectively invisible" learning comes from an outer loop that IS in fact IN HUMANS? 

We know that dreaming does seem to cause skill improvement. Maybe our own version of selectionist reinforcement (if it exists) would be operating to cause us to be normally sane and normally functional humans from day to day... in a way that is just as "moment-to-moment invisible" to us as it might be to algorithms?

And we mostly don't seem to fall into wireheading, which is kind of puzzling if you reason things out from first principles and predict the mechanistically stupid behavior that a pain/pleasure signal would naively generate...

NOTE that it seems quite likely to me that a sufficiently powerful RL engine that was purely selectionist (with reward signals intentionally made invisible to the online percepts of the model) that got very simple rewards applied for very simple features of a given run... would probably LEARN to IMAGINE those rewards and invent weights that implement "means/ends reasoning", and invent "incentivist behavioral patterns" aimed at whatever rewards it imagines?

That is: in the long run, with lots of weights and training time, and a simple reward function, inner optimizers with implicitly perceivable rewards wired up as "perceivable to the inner optimizer" are probably default solutions to many problems.

HOWEVER... I've never seen anyone implement BOTH these inner and outer loops explicitly, or reason about their interactions over time as having the potential to detect and correct goodharting!

Presumably you could design a pleasure/pain system that is, in fact, perceptually available, on purpose?

Then you could have those signals "be really real" in that they make up PART of the "full true reward"...

...but then have other parts of the total selectionist reward signal only be generated and applied by looking at the gestalt story of the behaviors and their total impact (like whether they caused a lot of unhelpful ripples in parts of the environment that the agent didn't and couldn't even see at the time of the action).

If some of these simple reward signals are mechanistic (and online perceptible to the model) then they could also be tunable, and you could actually tune them via the holistic rewards in a selectionist RL way.

Once you have the basic idea of "have there be two layers, with the broader slower less accessible one tuning the narrower faster more perceptible one" a pretty obvious thought would be to put an even slower and broader layer on top of those!
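A toy sketch of the two-layer version, under loud assumptions: the action set, both reward functions, the starting weight, and every name below are invented for illustration, and the outer layer uses simple keep-if-better hill climbing as a stand-in for whatever selectionist update a real system would use. The inner agent only ever sees the perceptible signal. The outer loop only ever sees the after-the-fact holistic score, and its one lever is the knob on the inner signal.

```python
ACTIONS = range(11)  # hypothetical: how much of some pleasurable thing to do


def perceptible_reward(action, weight):
    """Fast inner signal that the agent can see; `weight` is the tunable knob."""
    return weight * action - 0.5 * action ** 2


def holistic_reward(action):
    """Slow outer signal, computed after the fact and never shown to the agent."""
    return -(action - 3) ** 2


def inner_agent(weight):
    """Incentivist layer: pick the action that maximizes the perceptible signal."""
    return max(ACTIONS, key=lambda a: perceptible_reward(a, weight))


def outer_loop(weight, rounds=20, step=1.0):
    """Selectionist layer: keep a tweak to `weight` only if the hidden score improves."""
    for _ in range(rounds):
        best = holistic_reward(inner_agent(weight))
        for candidate in (weight - step, weight + step):
            score = holistic_reward(inner_agent(candidate))
            if score > best:
                weight, best = candidate, score
    return weight


start = 8.0  # a mis-tuned pleasure signal: the agent overdoes it
tuned = outer_loop(start)
print(inner_agent(start), "->", inner_agent(tuned))
```

In this toy the outer loop walks the weight down until the inner agent's favorite action matches what the holistic score wanted. That works here only because the perceptible signal has a knob that can express the right answer. Nothing in the sketch shows that the trick survives a richer environment, or that stacking a third layer on top would help.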

A lot of hierarchical Bayesian models get a bunch of juice from the first extra layer, but by the time you have three or four layers the model complexity stops being worth the benefits to the loss function.

I wonder if something similar might apply here? 

Maybe after you have "hierarchical stacks of progressively less perceptually accessible post-hoc selectionist RL updates to hyper-parameters"...

...maybe the third or fourth or fifth layer of hyper-parameter tuning like this just "magically discovers the solution to the goodharting problem" from brute force application of SGD?

That feels like it would be "crazy good luck" from a Friendliness research perspective. A boon from the heavens! Therefore it probably can't work for some reason <3

Yet also it doesn't feel like a totally insane prediction for how the modeling and training might actually end up working?

No one knows what science doesn't know, and so it could be that someone else has already had this idea. But this idea is NEW TO ME :-)

Has anyone ever heard of this approach to solving the goodhart problem being tried already?

This seems like good and important work!

There is a lot of complexity that arises when people try to reason about powerful optimizing processes.

I think part of this is because there are "naturally" a lot of feelings here. Like basically all human experiences proximate to naturally occurring instances of powerful optimization processes are colored by the vivid personal realities of it. Parents. Governments. Chainsaws. Forest fires. Championship sporting events. Financial schemes. Plague evolution. Etc.

By making a toy model of an effectively goal-pursuing thing (where the good and the bad are just numbers), the essential mechanical predictability of the idea that "thermostats aim for what thermostats aim for because they are built to aim at things" can be looked at while still having a "safe feeling" despite the predictable complexity of the discussions... and then maybe people can plan for important things without causing the kind of adrenaline levels that normally co-occur with important things :-)

Another benefit of smallness and abstractness (aside from routing around "psychological defense mechanisms") is that whatever design you posit is probably simple enough to be contained fully in the working memory of a person after relatively little study! So the educational benefit here is probably very very large!

I third the recommendation. 

I buy that book from any used bookstore I find it in, and then give it to people who can think and who are working on the future. I'm not sure if this has actually ever moved the needle, but... it probably doesn't hurt?

The theme of "getting control of your media diet" is totally pervasive in the work.

One of the most haunting parts of it, for me, after all these years, is how the smartest things in the solar system take only the tiniest and rarest of sips of "open-ended information at all", because they're afraid of being hijacked by hostile inputs, which they can't not ultimately be vulnerable to, if they retain their Turing Completeness... but they have to keep risking it sometimes if they want to not end up as pure navel gazers.

I got randomly distracted from this conversation, but returning I find:
1) Mitchell said things I would want to say, but probably more succinctly <3
2) Noosphere updated somehow (though to what, and based on what info, I'm unsure) <3

Only half joking: unless there is untranslatable wordplay or poetry that is trying to rhyme or scan, I'd be tempted to just "drop" the original sounds and "ascend" to a maximally universal orthographic system that is reasonably standardized and yet still "very pointwise similar (given the extra information about where someone comes from) to how a person might have made mouth sounds".

So maybe: translate the meaning via Interslavic (Medžuslovjansky / Меджусловjaнскы) and then render the Interslavic via the roman half of its orthographic system (which shouldn't be too hard for readers to learn to map to Slavic-compatible phonemes in the ear and tongue).

For your given example, you would read "молоко", then translate to milk, then render the interslavic "mlěko"?

(Taking abstraction to an extreme, you maybe just end up with ideograms? That would be too far. I'm not advocating that "молоко" should go all the way to "乳" or "🥛".)

I. Pragmatic Barbarism? <3

The primary objection to translating to Interslavic might be that such a move is barbaric and butchers a beautiful source language's beautiful details. However: Consider the audience! Have you noticed that English itself is practically a creole? <3

A practical motivation here is that I can't even pronounce Interslavic properly (because I haven't put in the practice (not because it would be impossible)), but if I'm going to "speculatively learn" an entire new orthographic system, I want the thing that I learn to apply to as much of the world as I can.

Interslavic is one of the best such things that I currently know of, that I might put non-trivial time into learning, that isn't just IPA or kanji or whatever.

(I'm not saying Interslavic is perfect, any more than I would say "Python is perfect". I'm saying something more like "Python in 2001 obviously had legs and would be useful in 2021, and, similarly, Interslavic in 2022 seems likely to not be a waste of learning effort if retained until 2042 (unless universal translation brain chips are introduced earlier than 2042)".)

I grant that my proposal has a partial DEFECT in that all Cyrillic words for milk in various eastern european languages (with respect-worthy and validly different vocabularies, and different orthographies, and different cultures, and so on) come out with the same romanized characters, but consider: from my perspective, that is sort of a feature rather than a bug!

II. Features

Feature: I can learn one orthography, and read the text out loud, and it will sound "slavic" and people who don't know any eastern european language will initially (falsely) think I'm speaking a natural language of eastern europe, and people who DO know one slavic language might get most of the gist and think I'm just speaking some other slavic language than the specific one that they know.

Feature: If you include the original text with annotations, then rendering a romanization VIA Interslavic will help create data that could make Interslavic better :-)

Feature: Totally naive english speakers will get more-or-less "the same gist" no matter what you do, but with interslavic you give them a maximally easy entry point (that has been designed to be a maximally easy entry point). 

(My hunch is that it would not cost much (and might help a lot) for naive people to FIRST learn interslavic orthography, and THEN learn the orthography of any of the other languages that interslavic is trying to span? (If this is false, then the "good cheap onramp to learning" feature isn't actually a feature. I have real uncertainty here.))

A fourth virtue might be political neutrality. The movie "The Painted Bird" is about an orphan who wanders through bad places and the book the movie is based on very carefully left the contextual fact OUT, and the movie wanted to retain that ambiguity, and not imply that any specific regional nationality was bad, so they had the bad people all speak interslavic. (I haven't seen it, and reports are that it is a harrowing cinematic experience that sometimes causes people to walk out of the theater. Plausibly: this is art that is emotionally powerful enough to really deserve a "trigger warning"?)

III. Weakness In The Particulars

I grant that my proposal would totally fail if your goal was to write about differences in the phonology or morphology or even the vocabulary of Serbian and Bulgarian, or how Moscow Russian and St Petersburg Russian are different. All three languages and all four varieties are romanized to the same roman letters in my proposal. My proposal just goes UP to the "denotational semantics" then DOWN to something systematically easily-learnable.

If you really want to get into these pronunciation/orthography differences, interslavic can maybe start to render these in a standardized way via flavorization (Flavorizacija)?

Sometimes the regional/cultural choices really matter, and it can turn into a comedy of cultural ignorance...

Standard: "English orthography is already full of complex tradeoffs."

"Scottish" (auto-flavorized): "Sassenach orthography is awready stowed oot o' complex tradeoffs."

IV. Summary

Here's a video where I think they're sometimes speaking in Serbian, and sometimes in Interslavic, as a test of mutual-comprehension-with-no-practice, and I think the subtitles use Interslavic romanization conventions all the way through? But I'm honestly not sure.

Anyway. The tradeoffs of interslavic are (1) formal systematicity, with a gesture towards (2) discovery of something (3) universally accessible, while retaining (4) denotational (5) semantics, all of which are potentially virtues :-)

If you are aiming for different virtues, I am happy to respect different choices. Also, if my choices don't actually hit my goals then I'm interested in hearing about how I'm wrong so I can choose better-to-me things <3

If I have to overpower or negotiate with it to get something I might validly want, we're back to corrigibility. That is: we're back to admitting failure.

If power or influence or its corrigibility are needed to exercise a right to suicide then I probably need them just to slightly lower my "empowerment" as well. Zero would be bad. But "down" would also be bad, and "anything less than maximally up" would be dis-preferred.

Maybe, but behavioral empowerment still seems to pretty clearly apply to humans and explains our intrinsic motivation systems.

This is sublimation again. Our desire to eat explains (is a deep cause of) a lot of our behavior, but you can't give us only that desire and also vastly more power and have something admirably human at the end of those modifications.

There are concepts like the last man and men without chests in various philosophies that imagine "a soul of pure raw optimization" as a natural tendency... and also a scary tendency.  

The explicit fear is that simple hill climbing, by cultures, by media, by ads, by pills, by schools, by <whatever>... might lead to losing some kind of sublime virtue?

Also, it is almost certain that current humans are broken/confused, and are not actually VNM rational, and don't actually have a true utility function. Observe: we are dutch booked all the time! Maybe that is only because our "probabilities" are broken? But I bet our utility function is broken too.
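To make "dutch booked" concrete, here is a minimal sketch of the textbook case, assuming an agent who will buy a ticket paying a fixed stake on any outcome at a price equal to their stated probability for it. The function name and the numbers are made up for illustration. If the stated probabilities for an event and for its complement sum to more than 1, buying both tickets loses money no matter what happens.

```python
def guaranteed_loss(p_event: float, p_not_event: float, stake: float = 1.0) -> float:
    """Sure loss for an agent who buys both tickets at their stated probabilities.

    Each ticket pays `stake` if its outcome occurs, and exactly one of them will.
    """
    cost = (p_event + p_not_event) * stake
    payout = stake  # exactly one ticket pays out, whichever way things go
    return cost - payout


# Incoherent credences: 70% that it rains, 50% that it doesn't.
print(round(guaranteed_loss(0.7, 0.5), 2))

# Coherent credences (summing to 1) leave nothing for the bookie to skim.
print(round(guaranteed_loss(0.6, 0.4), 2))
```

This only covers broken "probabilities". A broken utility function would show up differently, for example as circular preferences that can be money-pumped, and this sketch doesn't touch that.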

And so I hear a proposal to "assume human values (for most humans) can be closely approximated by some unknown utility function" and I'm already getting off the train (or sticking around because maybe the journey will be informative).

I have a prediction. I think an "other empowerment maximizing AGI" will have a certain predictable reaction if I ultimately decide that this physics is a subtle (or not so subtle) hellworld, or at least just not for me, and "I don't consent to be in it", and so I want to commit suicide, probably with a ceremony and some art. 

What do you think would be the thing's reaction if, after 500 years of climbing mountains and proving theorems and skiing on the moons of Saturn (and so on), I finally said "actually, nope" and tried to literally zero out "my empowerment"?
