Shortform Content

It is said that on this Earth there are two factions, and you must pick one.

  1. The Knights Who Arrive at False Conclusions
  2. The Knights Who Arrive at True Conclusions, Too Late to Be Useful

(Hat tip: I got these names 2 years ago from Robert Miles, who had been playing with GPT-3.)

In case you're interested, I choose the latter, for there is at least the hope of learning from the mistakes.

2. The anchor of a major news network donates lots of money to organizations fighting against gay marriage, and in his spare time he writes editorials arguing that homosexuals are weakening the moral fabric of the country. The news network decides they disagree with this kind of behavior and fire the anchor.

a) This is acceptable; the news network is acting within their rights and according to their principles
b) This is outrageous; people should be judged on the quality of their work and not their political beliefs

12. The principal of a private school is a

... (read more)

Use "Symbiosis" as an objective versus "Alignment problem". Examples of symbiosis exist everywhere in nature. We don't need to reinvent the wheel, folks...

The key word here is "SYMBIOSIS". We have been myopically focused on "alignment" when what we really want is to cultivate (from both a human perspective and an AI perspective) a symbiotic relationship between humans and AI. Consequently, a symbiotic relationship between humans and AI (and AGI understanding of incentives and preference for symbiosis over parasitism) can help to establish a more symbiotic... (read more)

I’m still thinking this through, but I am deeply concerned about Eliezer’s new article for a combination of reasons:

  • I don’t think it will work.
  • Given that it won’t work, I expect we lose credibility and it now becomes much harder to work with people who were sympathetic to alignment, but still wanted to use AI to improve the world.
  • I am not as convinced as he is about doom and I am not as cynical about the main orgs as he is.

In the end, I expect this will just alienate people. And stuff like this concerns me.

I think it’s possible that the most memetically power... (read more)

What is the base rate for Twitter reactions for an international law proposal?

Of course it’s often all over the place. I only shared the links because I wanted to make sure people weren’t deluding themselves with only positive comments.

I think that the keystone human value is about making significant human choices, individually and collectively, including choosing humanity's course.

  • You can't make a choice if you are dead
  • You can't make a choice if you are disempowered
  • You can't make a human choice if you are not a human
  • You can't make a choice if the world is too alien for your human brain
  • You can't make a choice if you are in too much pain or too much bliss
  • You can't make a choice if you let AI make all the choices for you

Anapartistic reasoning: GPT-3.5 gives a bad etymology, but GPT-4 is able to come up with a plausible hypothesis of why Eliezer chose that name: Anapartistic reasoning is reasoning where you revisit the earlier part of your reasoning.

Unfortunately, Eliezer's suggested prompt doesn't seem to work to induce anapartistic reasoning: GPT-4 thinks it should focus on identifying potential design errors or shortcomings in itself. When asked to describe the changes in its reasoning, it doesn't claim to be more corrigible.

We will discuss Eliezer's Hard Problem of C... (read more)

Propositions on SIA

Epistemic status: exploring implications, some of which feel wrong.

  1. If SIA is correct, you should update toward the universe being much larger than it naively (i.e. before anthropic considerations) seems, since there are more (expected) copies of you in larger universes.
    1. In fact, we seem to have to update to probability 1 on infinite universes; that's surprising.
  2. If SIA is correct, you should update toward there being more alien civilizations than it naively seems, since in possible-universes where more aliens appear, more (expected) copies
... (read more)
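
The update in proposition 1 can be made concrete with a toy calculation. The sketch below assumes the usual statement of SIA, under which each hypothesis is reweighted by its expected number of copies of you. The two candidate universes and their copy counts are made-up numbers for illustration, not estimates.

```python
from fractions import Fraction

def sia_posterior(priors, copies):
    """Reweight each hypothesis by its expected number of copies of you (SIA)."""
    weights = [p * n for p, n in zip(priors, copies)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: a small and a large universe, equally likely a priori.
priors = [Fraction(1, 2), Fraction(1, 2)]
copies = [1, 1000]  # assumed expected copies of you in each universe

posterior = sia_posterior(priors, copies)
print(posterior)  # [Fraction(1, 1001), Fraction(1000, 1001)]
```

As the assumed copy count of the larger universe grows without bound, its posterior tends to 1, which is the pull toward infinite universes that 1.1 calls surprising.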
3Tristan Cook21h
Which of them feel wrong to you? I agree with all of them other than 3b, which I'm unsure about - I think this comment does a good job of unpacking things. 2a is Katja Grace's Doomsday argument. I think 2aii and 2aiii depend on whether we're allowing simulations; if a faster expansion speed (either the cosmic speed limit or an engineering limit on expansion) meant more ancestor simulations, then this could cancel out the fact that faster-expanding civilizations prevent more alien civilizations from coming into existence.

I deeply sympathize with the presumptuous philosopher but 1a feels weird.

2a was meant to be conditional on non-simulation.

Actually putting numbers on 2a (I have a post on this coming soon), the anthropic update seems to say (conditional on non-simulation) there's almost certainly lots of aliens all of which are quiet, which feels really surprising.

To clarify what I meant on 3b: maybe "you live in a simulation" can explain why the universe looks old better than "uh, I guess all of the aliens were quiet" can.

Our value function is complex and fragile, but we know of a lot of world states where it is pretty high: our current world and the few thousand years' worth of states before it.

So, we can assume that world states in a certain neighborhood of our past states have some value.

Also, states far outside this neighborhood probably have little or no value, because our values were formed to help us orient and thrive in our ancestral environment. So, in worlds too dissimilar from it, our values will likely lose their meaning, and we will lose the ability to "function" normally, the ability to "human".

For the closing party of the Lightcone Offices, I used Midjourney 5 to make a piece of art to represent a LessWrong essay by each member of the Lightcone team, and printed them out on canvases. I'm quite pleased about how it came out. Here they are.

How I buy things when Lightcone wants them fast

by jacobjacob

(context: Jacob has been taking flying lessons, and someday hopes to do cross-country material runs for the Rose Garden Inn at shockingly fast speeds by flying himself to pick them up)

My thoughts on direct work (and joining LessWrong)

by RobertM

A Quick G

... (read more)

FLI just released Pause Giant AI Experiments: An Open Letter

I don't expect that 6 months would be nearly enough time to understand our current systems well enough to make them aligned. However, I do support this, and did sign the pledge, as getting everybody to stop training AI systems more powerful than GPT-4 for 6 months would be a huge step forward in terms of coordination. I don't expect this to happen. I don't expect that OpenAI will give up its lead here.

See also the relevant manifold market.

Maybe you already thought of this, but it might be a nice project for someone to take the unfinished drafts you've published, talk to you, and then clean them up for you.  Apprentice/student kind of thing. (I'm not personally interested in this, though.)

I like that idea! I definitely welcome people to do that as practice in distillation/research, and to make their own polished posts of the content. (Although I'm not sure how interested I would be in having said person be mostly helping me get the posts "over the finish line".)

I'm really confused by this passage from The Six Mistakes Executives Make in Risk Management (Taleb, Goldstein, Spitznagel):

We asked participants in an experiment: “You are on vacation in a foreign country and are considering flying a local airline to see a special island. Safety statistics show that, on average, there has been one crash every 1,000 years on this airline. It is unlikely you’ll visit this part of the world again. Would you take the flight?” All the respondents said they would.

We then changed the second sentence so it read: “Safety statistic

... (read more)
  1. Human values are complex and fragile. We don't know yet how to make AI pursue such goals.
  2. Any sufficiently complex plan would require pursuing complex and fragile instrumental goals. AGI should be able to implement complex plans. Hence, it's near certain that AGI will be able to understand complex and fragile values (for its instrumental goals).
  3. If we make an AI that is able to successfully pursue complex and fragile goals, that will likely be enough to make it an AGI.

Hence, a complete solution to Alignment will very likely have solving AGI as a side effect. And solving AGI will solve some parts of Alignment, maybe even the hardest ones, but not all of them.

2Neil Warren2d
To elaborate on your idea here a little: It may be that the only way to be truly aware of the world is to have complex and fragile values. Humans are motivated by a thousand things at once and that may give us the impression that we are not agents moving from a clearly defined point A to point B, as AI in its current form is, but are rather just... alive. I'm not sure how to describe that. Consciousness is not an end state but a mode of being. This seems to me like a key part of the solution to AGI: aim for a mode of being, not an end state. For a machine whose only capability is to move from point A to point B, adding a thousand different, complex and fragile goals may be the way to go. As such, solving AGI may also solve most of the alignment problem, so long as the AI's specific cocktail of values is not too different from the average human's. In my opinion there is more to fear from highly capable narrow AI than there is from AGI, for this reason. But then I know nothing.

My theory is that the core of human values is about what the human brain was made for: making decisions. Making meaningful decisions individually and as a group, including collectively making decisions about human fate.

Is it possible to learn a language without learning the values of those who speak it?

Yes, if you learn only the basics of the language, you will learn only the basics of its users' values (if any). But a deep understanding of the language requires knowing the semantics of the words and constructions in it (including the meaning of the words "human" and "values", btw). To understand texts you have to understand the contexts in which they are used, etc. Also, pretty much every human-written text carries some information about human values, because people only talk about things that they see as at least somewhat important or valuable to them. And a lot of texts are related to values much more directly. For example, every text about human relations is directly related to the conflict or alignment of particular people's values. So, if you learn the language from reading text (like LLMs do), you will pick up a lot about people's values along the way (like LLMs did).
Small note but I would think Germans engage in less schadenfreude than other cultures. For a long time my favourite word used to be 'cruelty' specifically for its effectiveness in combating some forms of its referent.

Well, by that logic Germans may experience more schadenfreude, which would presumably mean there is more schadenfreude going on in Germany than elsewhere, so I don't think your point makes sense. You only need a word for something if it exists, especially if it's something you encounter a lot.

It may also be possible that we use facsimiles for words by explaining their meaning with whole sentences, and only occasionally stumble upon a word that catches on and that elegantly encapsulates the concept we want to convey (like "gaslighting").  It may be ... (read more)

I heavily recommend Beren's "Deconfusing Direct vs Amortised Optimisation". It's a very important conceptual clarification.

Probably the most important blog post I've read this year.



Direct optimisers: systems that during inference directly choose actions to optimise some objective function. E.g. AIXI, MCTS, other planning

Direct optimisers perform inference by answering the question: "what output (e.g. action/strategy) maximises or minimises this objective function ([discounted] cumulative return and loss respectively)?"
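
To make the definition concrete, here is a minimal sketch of a direct optimiser, assuming a small discrete action set and a made-up objective. The names `direct_optimise` and `objective` and the example numbers are illustrative and do not come from Beren's post. The point is only that the search over outputs happens at inference time.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")

def direct_optimise(actions: Iterable[A], objective: Callable[[A], float]) -> A:
    """Choose, at inference time, the action that maximises the objective."""
    return max(actions, key=objective)

# Hypothetical objective: return is highest for the action closest to 3.
def objective(action: int) -> float:
    return -(action - 3) ** 2

best = direct_optimise(range(10), objective)
print(best)  # 3
```

Exhaustive search only works for tiny action sets; planners such as MCTS, named above, search more cleverly, but they share the property of evaluating the objective while choosing.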

Amortised optimisers: syst... (read more)

I'm worried about notkilleveryoneism as a meme. Years ago, Tyler Cowen wrote a post about why more econ professors didn't blog, and his conclusion was that it's too easy to make yourself look like an idiot relative to the payoffs. He had also observed this actually play out in a bunch of cases where econ professors started blogs, put their foot in their mouth, and quietly stopped. Since earnest discussion of notkilleveryoneism tends to make everyone, including the high status, look dumb within ten minutes of starting, it seems like there will be a strong inclination towards attribute substitution. People will tend towards 'nuanced' takes that give them more opportunity to signal with less chance of looking stupid.

Worry about looking like an idiot is a VERY fine balance to find. If you get desensitized to it, that makes it too easy to BE an idiot. If you are over-concerned about it, you fail to find correct contrarian takes. 'notkilleveryoneism' IMO is a dumb meme. Intentionally, I presume. If you wanted to appear smart, you'd use more words and accept some of the nuance, right? It feels like a countersignal-attempt, or a really bad model of someone who's not accepting the normal arguments.

I dunno, the problem with "alignment" is that it doesn't unambiguously refer to the urgent problem, but "notkilleveryoneism" does. Alignment used to mean same-values, but then got both relaxed into compatible-values (that boundary-respecting norms allow to notkilleveryone) and strengthened with various AI safety features like corrigibility and soft optimization. Then there is prosaic alignment, which redefines it into bad-word-censure and reliable compliance with requests, neither being about values. Also, "existential catastrophe" inconveniently includes ... (read more)

While looking at the older or more orthodox discussion of notkilleveryoneism, keep this distinction in mind. First AGIs might be safe for a little while, the way humans are "safe", especially if they are not superintelligences. But then they are liable to build other AGIs that aren't as safe. The problem is that supercapable AIs with killeveryone as an instrumental value seem eminently feasible, and the general chaos of the human condition plus market pressures make them likely to get built. Only regulation of the kind that's not humanly feasible (and killseveryone if done incorrectly) has a chance of preventing that in the long term, and getting to that point without stepping on an AI that killseveryone is not obviously the default outcome.

What is magic?

Presumably we call whatever we can't explain "magic" before we understand it, at which point it becomes simply a part of the natural world. This is what many fantasy novels fail to account for; if we actually had magic, we wouldn't call it magic. There are thousands of things in the modern world that would definitely meet a 13th-century person's criteria for magic.

So we do have magic; but why doesn't it feel like magic? I think the answer to this question is to be found in how evenly distributed our magic is. Almost ... (read more)

3Neil Warren2d
Eliezer Yudkowsky is kind of a god around here, isn't he? Would you happen to know what percentage of total upvotes on this website are attributed to his posts? It's impressive how many good ideas, written in clear form, he's had to come up with to reach that level. Cool and everything, but isn't it ultimately proof that LessWrong is still in its fledgling stage (which it may never leave), as it depends so much on the ideas of its founder? I'm not sure how one goes about this, but expanding the LessWrong repertoire in a consequential way seems like a good next step for LessWrong. Perhaps that includes changing the posts in the Library... I don't know. Anyhow, thanks for this comment, it was great reading!
The Creator God, in fact. LessWrong was founded by him. All of the Sequences are worth reading.

Right, but if LessWrong is to become larger, it might be a good idea to stop leaving his posts as the default (the Library, the ones recommended on the front page, etc.). I don't doubt that his writing is worth reading and I'll get to it; I'm just offering an outsider's view on this whole situation, which seems a little stagnant to me in a way.

That last reply of mine, a reply to a reply to a Shortform post I made, can be found after just a little scrolling on the main page of LessWrong. I should be a nobody to the algorithm, yet I'm not. My only... (read more)

I find the standard models of existence (such as Tegmark's Mathematical universe hypothesis) to feel boring, flat, and not self-referential enough.

Logic refers to itself. This is really important.

So here's my take on the nature of existence (not fully serious! consider this a 'playful exploration'):

In the beginning, there was a contradiction.

Why was there a contradiction? The simplest way I can think of explaining it is that "non-existence exists" is a paradox (first noted by the Greek philosopher Parmenides).

And yes, these are just "cute words", but I fin... (read more)

Deceptive alignment doesn't preserve goals.

A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misali... (read more)
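
A toy sketch of the regularisation argument, under strong simplifying assumptions: the goal is represented by a single scalar weight, that weight receives zero gradient from the training loss (because the behaviour is already aligned), and the only force acting on it is L2 weight decay. The learning rate, decay coefficient, and step count are arbitrary illustrative values, and real networks do not store goals this cleanly.

```python
def decay_unused_weight(w0: float, lr: float, weight_decay: float, steps: int) -> float:
    """Apply SGD with L2 weight decay to a weight whose loss gradient is zero."""
    w = w0
    for _ in range(steps):
        loss_grad = 0.0  # assumption: the goal weight does not affect the training loss
        w -= lr * (loss_grad + weight_decay * w)
    return w

w_final = decay_unused_weight(w0=1.0, lr=0.1, weight_decay=0.01, steps=10_000)
print(w_final < 1e-4)  # True
```

Each step multiplies the weight by (1 - lr * weight_decay), so anything the loss does not actively hold in place shrinks geometrically. In this toy picture, that is the sense in which regularisation would modify a goal that behaviour does not depend on.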

This doesn't seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning "X is my misaligned terminal goal, and therefore I'm going to deceptively behave as if I'm aligned" and then acts perfectly like an aligned agent from then on. My claims then would be:

a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up.
b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are... (read more)

1Johannes Treutlein2d
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal. Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.
So I'm imagining the agent doing reasoning like: Misaligned goal --> I should get high reward --> Behavior aligned with reward function. I'm then hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false, though, depending on the details of how terminal and instrumental goals are implemented.) I could also imagine something more like: Misaligned goal --> I should behave in aligned ways --> Aligned behavior, in which case the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option. Or a third, perhaps more realistic, option is that the misaligned goal leads to two separate drives in the agent: "I should get high reward" and "I should behave in aligned ways", and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I'll have a post on that topic up soon).