All of NicholasKross's Comments + Replies

How many times has someone expressed "I'm worried about 'goal-directed optimizers', but I'm not sure what exactly they are, so I'm going to work on deconfusion."? There's something weird about this sentiment, don't you think?

I disagree, and I will take you up on this!

"Optimization" is a real, meaningful thing to fear, because:

  • We don't understand human values, or even necessarily meta-understand them.
  • Therefore, we should be highly open to the idea that a goal (or meta-goal) that we encode (or meta-encode) would be bad for anything powerful to base-level car
... (read more)

If it helps clarify: I (and some others) break down the alignment problem into "being able to steer it at all" and "what to steer it at". This post is about the danger of having the former solved, without the latter being solved well (e.g. through some kind of CEV).

Nah, I think this post is about a third component of the problem: ensuring that the solution to "what to steer at" that's actually deployed is pro-humanity. A totalitarian government successfully figuring out how to load its regime's values into the AGI has by no means failed at figuring out "what to steer at". They know what they want and how to get it. It's just that we don't like the end result.

"Being able to steer at all" is a technical problem of designing AIs, "what to steer at" is a technical problem of precisely translating intuitive human goals into a formal language, and "where is the AI actually steered" is a realpolitiks problem that this post is about.

Love this event series! Can't come this week, but next one I can!

No worries! I make similar mistakes all the time (just check my comment history ;-;)

And I do think your comment is useful, in the same way that Rohin's original comment (which my post is responding to) is useful :)

FWIW, I have an underlying intuition here that's something like “if you're going to go Dark Arts, then go big or go home”, but I don't really know how to operationalize that in detail and am generally confused and sad. In general, I think people who have things like “logical connectives are relevant to the content of the text” threaded through enough of their mindset tend to fall into a trap analogous to the “Average Familiarity” xkcd or to Hofstadter's Law when they try truly-mass communication unless they're willing to wrench things around in what are of

... (read more)

Now, I do separately observe a subset of more normie-feeling/working-class people who don't loudly profess the above lines and are willing to e.g. openly use some generative-model art here and there in a way that suggests they don't have the same loud emotions about the current AI-technology explosion. I'm not as sure what main challenges we would run into with that crowd, and maybe that's whom you mean to target.

That's... basically what my proposal is? Yeah? People that aren't already terminally-online about AI, but may still use chatGPT and/or StableDiff... (read more)

8 · Rana Dexsin · 1mo
Facepalm at self. You're right, of course. I think I confused myself about the overall context after reading the end-note link there and went off at an angle. Now to leave the comment up for history and in case it contains some useful parts still, while simultaneously thanking the site designers for letting me un-upvote myself. 😛

Yeah, mostly agreed. My main subquestion (that led me to write the review, besides this post being referenced in Leake's work) was/sort-of-still-is "Where do the ratios in value-handshakes come from?". The default (at least in the tag description quote from SSC) is uncertainty in war-winning, but that seems neither fully-principled nor nice-things-giving (small power differences can still lead to huge win-% chances, and superintelligences would presumably be interested in increasing accuracy). And I thought maybe ROSE bargaining could be related to that.
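To spell out the default I'm gesturing at (a toy sketch of the SSC framing, in my own notation — not anything from the ROSE posts): if a war between agents A and B gives A a win-probability p while burning value along the way, the natural handshake merges utilities at roughly the threat point:

```latex
% Toy value-handshake sketch (my notation, not from the ROSE posts):
% a war that A wins with probability p destroys resources, so both
% agents prefer the merged utility weighted by the threat point.
% Note the failure mode flagged above: a small capability edge can
% push p toward 0 or 1, making the ratio p : (1 - p) lopsided even
% between near-peers.
$$ U_{\mathrm{merged}} \;=\; p\,U_A \;+\; (1 - p)\,U_B $$
```

That's exactly why "small power differences can still lead to huge win-% chances" bites: the default ratio is just p : (1 − p), which doesn't feel fully principled.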

The relation in my mind was less ROSE --> DT, and more ROSE --?--> value-handshakes --> value-changes --?--> DT.

(On my beliefs, which I acknowledge not everyone shares, expecting something better than "mass delusion of incorrect beliefs that implies that AGI is risky" if you do wide-scale outreach now is assuming your way out of reality.)

I'm from the future, January 2024, and you get some Bayes Points for this!

The "educated savvy left-leaning online person" consensus (as far as I can gather) is something like: "AI art is bad, the real danger is capitalism, and the extinction danger is some kind of fake regulatory-capture hype techbro thing which (if we even bother t... (read more)

So there's a sorta-crux about how much DT alignment researchers would have to encode into the-AI-we-want-to-be-aligned, before that AI is turned on. Right now I'm leaning towards "an AI that implements CEV well, would either turn-out-to-have or quickly-develop good DT on its own", but I can see it going either way. (This was especially true yesterday when I wrote this review.)

And I was trying to think through some of the "DT relevance to alignment" question, and I looked at relevant posts by Tamsin Leake (who... (read more)

2 · Anthony DiGiovanni · 1mo
Sorry, to be clear, I'm familiar with the topics you mention. My confusion is that ROSE bargaining per se seems to me pretty orthogonal to decision theory. I think the ROSE post(s) are an answer to questions like, "If you want to establish a norm for an impartial bargaining solution such that agents following that norm don't have perverse incentives, what should that norm be?", or "If you're going to bargain with someone but you didn't have an opportunity for prior discussion of a norm, what might be a particularly salient allocation [because it has some nice properties], meaning that you're likely to zero-shot coordinate on that allocation?"

Selection Bias Rules (Debatably Literally) Everything Around Us

Currently, I think this is a big crux in how to "do alignment research at all". Debatably "the biggest" or even "the only real" crux.

(As you can tell, I'm still uncertain about it.)

Decision theory is hard. In trying to figure out why DT is useful (needed?) for AI alignment in the first place, I keep running into weirdness, including with bargaining.

Without getting too in-the-weeds: I'm pretty damn glad that some people out there are working on DT and bargaining.

1 · Anthony DiGiovanni · 1mo
Can you say more about what you think this post has to do with decision theory? I don't see the connection. (I can imagine possible connections, but don't think they're relevant.)

Still seems too early to tell if this is right, but man is it a crux (explicit or implicit).


Terence Tao seems to have gotten some use out of the most recent LLMs.

if you take into account the 4-5 staff months these cost to make each year, we net lost money on these

For the record, if each book-set had cost $40 or even $50, I still would have bought them, right on release, every time. (This was before my financial situation improved, and before the present USD inflation.)

I can't speak for everyone's financial situation, though. But I (personally) mentally categorize these as "community-endorsement luxury-type goods", since all the posts are already online anyway.

The rationality community is unusually good about not selling ingroup-merch when it doesn't need or want to. These book sets are the perfect exceptions.

to quote a fiend, "your mind is a labyrinth of anti-cognitive-bias safeguards, huh?"

[emphasis added]

The implied context/story this is from sure sounds interesting. Mind telling it?

I don't think of governments as being... among other things "unified" enough to be superintelligences.

Also, see "Things That Are Not Superintelligence" and "Maybe The Real Superintelligent AI Is Extremely Smart Computers".

The alignment research that is done will be lower quality due to less access to compute, capability knowhow, and cutting edge AI systems.

I think this is false, though it's a crux in any case.

Capabilities withdrawal is good because we don't need big models to do the best alignment work, because that is theoretical work! Theoretical breakthroughs can make empirical research more efficient. It's OK to stop doing capabilities-promoting empirical alignment, and focus on theory for a while.

(The overall idea of "if all alignment-knowledgeable capabilities people ... (read more)

Good catch! Most of it is currently hunches to be tested (and/or theorized on, but ultimately tested). Fixed.

"Exfohazard" is a quicker way to say "information that should not be leaked". AI capabilities has progressed on seemingly-trivial breakthroughs, and now we have shorter timelines.

The more people who know and understand the "exfohazard" concept, the safer we are from AI risk.

More framings help the clarity of the discussion. If someone doesn't understand (or agree with) classic AI-takeover scenarios, this is one of the posts I'd use to explain them.

Funny thing, I had a similar idea to this (after reading some Sequences and a bit about pedagogy). That was the sort-of-multi-modal-based intuition behind Mathopedia.

Is any EA group *funding* adult human intelligence augmentation? It seems broadly useful for lots of cause areas, especially research-bottlenecked ones like AI alignment.

Why hasn't e.g. OpenPhil funded this project?:

I skimmed the page and haven't found whether GeneSmith applied to OpenPhil.

Seems to usually be good faith. People can still be biased of course (and they can't all be right on the same questions, with the current disagreements), but it really is down to differing intuitions, which background-knowledge posts have been read by which people, etc.

To add onto other people's answers:

People have disagreements over what the key ideas about AI/alignment even are.

People with different basic-intuitions notoriously remain unconvinced by each other's arguments, analogies, and even (the significance of) experiments. This has not been solved yet.

Alignment researchers usually spend most time on their preferred vein of research, rather than trying to convince others.

To (try to) fix this, the community's added concepts like "inferential distance" and "cruxes" to our vocabulary. These should be discussed and u... (read more)

Do you think these disagreements stem from a sort of egoistic desire to be known as the 'owner' of that concept? Or to be a forerunner for that vein of research should it become popular?  Or is it a genuinely good faith disagreement on the future of AI and what the best approach is? (Perhaps these questions are outlined in the articles you've linked, which I'll begin reading now. Though I do think it's still useful to perhaps include a summary here too.) Thanks for your help.

I relate to this quite a bit ;-;

People's minds are actually extremely large things that you fundamentally can't fully model

Is this "fundamentally" as in "because you, the reader, are also a bounded human, like them"? Or "fundamentally" as in (something more fundamental than that)?

4 · Tamsin Leake · 2mo
The first one. Alice fundamentally can't fully model Bob because Bob's brain is as large as Alice's, so she can't fit it all inside her own brain without simply becoming Bob.

If timelines weren't so short, brain-computer-based telepathy would unironically be a big help for alignment.

(If a group had the money/talent to "hedge" on longer timelines by allocating some resources to that... well, instead of a hivemind, they first need to run through the relatively-lower-hanging fruit. Actually, maybe they should work on delaying capabilities research, or funding more hardcore alignment themselves, or...)

This point could definitely be its own post. I'd love to see you write this! (I'd of course be willing to proofread/edit it, title it, etc.)

And the AGI, if it's worth the name, would not fail to exploit this.

This sentence is a good short summary of some AI alignment ideas. Good writing!

Someone may think "Anomalous worlds imply the simulation-runners will save us from failing at alignment!"

My reply is: Why are they running a simulation where we have to solve alignment?

At a first pass, if we're in a simulation, it's probably for research, rather than e.g. a video game or utopia. (H/t an IRL friend for pointing this out).

Therefore, if we observe ourselves needing to solve AI alignment (and not having solved it yet), the simulation-runners potentially also need AI alignment to get solved. And if history is any guide, we should not rely on an... (read more)

But you wouldn't study ... MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work.

This particular bit seems wrong; CNNs and LLMs are both built on neural networks. If the findings don't generalize, that could be called a "failure of theory", not an impossibility thereof. (Then again, maybe humans don't have good setups for going 20 steps ahead of data when building theory, so...)

(To clarify, this post is good and needed, so thank you for writing it.)

2 · Thane Ruthenis · 2mo
Yep, there's nonzero mutual information. But not of the sort that's centrally relevant. I'll link to this reply in lieu of just copying it.

I'm most willing to hear meta-level arguments about internal consistency, or specific existing evidence that I don't know about (especially "secret" evidence). Less certain about the governance sections and some of the exact-wordings.

"Enhancing intelligence by making large numbers of gene edits"

So it's not just a mistake. It's a choice, that choice has motivations, and those motivations are in conflict with our motivations, insofar as they shelter themselves from reason.

This still seems, to me, like a special case of "mistake".

It's not just epistemic confusion that can be most easily corrected with good evidence and arguments. That's what I think we're talking about.

"When the Singularity happens, this (money, conflict, the problems I'm experiencing) won't be a problem anymore."

I mean... yeah?

Some things I think would cause people to disagree:

  • They think a friendly-AGI-run society would have some use for money, conflict, etc. I'd say the onus is on them to explain why we would need those things in such a society.
  • They think a good "singularity" would not be particularly "weird" or sci-fi looking, which ignores the evidence of technological development throughout history. I think this is what the "The specific, real reali
... (read more)
3 · Pee Doom · 3mo
They would still understand the concept of paying money for food. The grocery store is pretty amazing but it's fundamentally the same transaction as the village market. I think the burden of proof is on people claiming that money will be 'done away with' because 'post-scarcity', when there will always be economic scarcity. It might take an hour of explanation and emotional adjustment for a time-displaced peasant to understand the gist of the store, but it's part of a clear incremental evolution of stores over time.

I think a basically friendly society is one that exists at all and is reasonably okay (at least somewhat clearly better) compared to the current one. I don't see why economic transactions, conflicts of all sorts, etc wouldn't still happen, assuming the lack of existentially-destructive ones that would preclude the existence of such a hypothetical society. I can see the nature of money changing, but not the fundamentals of there being trades.

I don't think AI can just decide to do away with conflicts via unilateral fiat without an enormous amount of multipolar effort, in what I would consider a friendly society not ran by a world dictator. Like, I predict it would be quite likely terrible to have an ASI with such disproportionate power that it is able to do that, given it could/would be co-opted by power-seekers.

I also think that trying to change things too fast or 'do away with problems' is itself something trending along the spectrum of unfriendliness from the perspective of a lot of humans. I don't think the Poof Into Utopia After FOOM model makes sense, that you have one shot to send a singleton rocket into gravity with the right values or forever hold your peace. This thing itself would be an unfriendly agent to have such totalizing power and make things go Poof without clear democratic deliberation and consent. This seems like one of the planks of SIAI ideology that seems clearly wrong to me, now, though not indubitably so. There seems to b
Because of the laws of thermodynamics holding, basically. I do buy that a lot of stuff could switch over to non-money modes, but if we assume that the basic laws of physics fundamentally still hold true, then this can't work, and this is one of those areas where you need to give hard evidence.

Much more generally, the Industrial Revolution is a good example, in that it really did improve the lives of humans massively, even with imperfect distribution of benefits, but it didn't end conflict or money, and I'd argue there was a use for money (although the Industrial Revolution did drastically reduce the benefits of war to non-ideological actors).

Interestingly enough, while I think this is true over the long term, and potentially even over the short term, I think a major problem is LWers tend to underestimate how long things take to change, and in general have a bit of a bad habit of assuming everything changes at maximum speed. I agree that the medieval peasant would be very surprised at how much things changed, but they'd also detect a lot of continuity, and would have a lot of commonalities, especially on the human side of things.

But it does decrease the credence, potentially substantially, and that could be important.

Now, my general view is that I do think there's reason to believe that AI could be the greatest technology in history, but I agree with the OP that there's a little magic often involved, and it's a little bit of a red flag how much AI gets compared to gods. And contra some people, I do think the psychoanalyze-your-way-to-the-truth approach is more useful than people think, especially if you have good a priori reason to expect biases to drive the discussion, because they can allow you to detect red flags.

Both people here are making conflation/failure-to-decouple mistakes. E.g. tying "community coordination" together with "how to generate and/or filter one's own ideas".

Tabooing most of the topic-words/phrases under discussion, I reckon, would have made this dialogue 3-10x better.

(Will have more thoughts, possibly a response post, once I'm done reading/thinking-through this.)

Agreed. I think of it as:

You need your mind to have at least barely enough correctness-structure/Lawfulness to make your ideas semi-correct, or at least easy to correct them later.

Then you want to increase originality within that space.

And if you need more original ideas, you go outside that space (e.g. by assuming your premises are false, or by taking drugs; yes, these are the same class of thing), and then clawing those ideas back into the Lawfulness zone.

Reading things like this, and seeing how long it took them to remember "Babble vs Prune", makes me w... (read more)

Apparently the most reliable way to make sure feature requests are seen is to use the Intercom.

Heartbreaking: This tag includes both "human agenty-ness" and "AI becoming more agentic".

It'd be cool if tags could be disambiguation-type pages, to "Agency (human)" and "Agency (AI)". The disambiguation page still lets us talk about both, especially if the disambiguation, itself, is also a usable tag.

Feature request (is there a better place to put this?):

A toggle in my user settings, where I can turn on "require an extra click before publishing draft". So I can prevent accidentally posting something with an "are you sure?" dialogue, if I toggle that "on" on my account.

Yes, this has actually happened to me before. And with the Dialogues feature, the risk is only higher because of how much I'm inside the LW editor.

Another idea that Matt suggested was a BlueDot-style "Agent Foundations in the broad sense" course.

I would love this and take this myself, fwiw. (Even if I didn't get in, I'd still make "working through such a course's syllabus" one of my main activities in the short term.)

Now I'm immediately going to this thought chain:

Maybe such an IQ test could be designed. --> Ah, but then specialized AIs (debatably "all of them" or "most of the non-LLM ones") would just fail equally! --> Maybe that's okay? Like, you don't give a blind kid a typical SAT, you give them a braille SAT or a reader or something (or, often, no accommodation).

The capabilities-in-the-wrong-order point is spot-on.

I could also imagine a thing that's not so similar to IQ in the details, but still measures some kind of generic "information processing/retention... (read more)

Non-exhaustive, and a maybe non-representatively useful selection: I realized how easily I can change my state-of-mind, down to emotions and similar things. Not instant/fully at-will, but more "I can see the X, let it pass over/through me, and then switch to Y if I want".

I got a lot out of the post, including self-understanding from observing my reactions to and thoughts about the post.

What kinds of reactions to and thoughts about the post did you have that you got a lot out of observing?

Don't forget Orthogonal's mathematical alignment research, including QACI!

Thanks - and the fact that we don't know who is working on relevant things is exactly the reason we're doing this! 

OpenAI should have an internal-only directory allowing employees and leadership to write up and see each other's beliefs about AI extinction risk and alignment approaches.

To expand on my option a bit more: OpenAI could make space for real in-depth dissent by doing something like "a legally binding commitment to not fire (or deny severance to, even for quitting employees) anyone who shares their opinion, for the next 1-2 years".

This policy is obviously easy to exploit to get salary without working, but honestly that's a low impact given the stakes. (And unlike... (read more)

Anonymity, for at least one or more actions (voting on polls? Creating options?), should maybe be on the table. Then again, they ofc can send the wrong message and create their own incentives(?).
