David James — LessWrong

My top interest is AI safety, followed by reinforcement learning. My professional background is in software engineering, computer science, machine learning. I have degrees in electrical engineering, liberal arts, and public policy. I currently live in the Washington, DC metro area; before that, I lived in Berkeley for about five years.

During yesterday's interview, Eliezer didn't give a great reply to Ezra Klein's question: i.e. "why does even a small amount of misalignment lead to human extinction." I think many people agree with this; still, my goal isn't to criticize EY. Instead, my goal is to find various levels of explanation that have been tested and tend to work for different audiences with various backgrounds. Suggestions?

speck1447 : ... Things get pretty bad about halfway through though, Ezra presents essentially an alignment-by-default case and Eliezer seems to have so much disdain for that idea that he's not willing to engage with it at all (I of course don't know what's in his brain. This is how it reads to me, and I suspect how it reads to normies.)

(This is only a copy-editing suggestion.) Instead of the current cheese-numbering-scheme "1, 2, (c)", I think the authors intend "(a), (b), (c)". So this would be:

(a) Get the cheese!

(b) Go towards the top-right corner, unless you are already there, in which case get the cheese.

P.S. Off-topic but fun: there are indeed "Cheese numbers" discussed on the "googology" wiki

Epistemic status: I’m fairly sure “ethics” does useful work in building human structures that work. My current explanations of how are wordy; I think there should be a briefer way to conceptualize it; I hope you guys help me with that.

My concerns here are not epistemic; they are about length, summarization, and chunking. I’ll offer two suggestions framed as questions:

Please tell the reader what you’re going to tell us right away. What is your central point? Are you primarily proposing a new ethical framework? Are you primarily aiming to improve particular areas; e.g. to AI safety?
Wouldn’t this be better presented as a sequence of posts? To think in reverse, could you make a good argument for why a single post of ~6000+ words is better?

I enjoy LessWrong because of its epistemic norms but often dislike* longer than necessary articles. I found my way here because of the Sequences; let’s do more of them! I would rather see ~600 to 1200 word self-contained pieces that link to other parts, which might be: prerequisite material, literature reviews, lengthy examples, personal motivations, and so on.

* I want authors to succeed at focusing reader attention and community discussion. Excessive length can hurt more than help.

I don't mean to imply that we can't cooperate, but it seems to me free-thinkers often underinvest in coalition building. Mostly I'm echoing e.g. 'it is ok to endorse a book even if you don't agree with every point'. There is a healthy tension between individual stances and coalition membership; we should lean into these tough tradeoffs rather than retreating to the tempting comfort of purity.

If one wants to synthesize a goal that spans this tension, one can define success more broadly so as to factor in public opinion. There are at least two ways of phrasing this:

Rather than assuming one uniform standard of rigor, we can think more broadly. Plan for the audience's knowledge level and epistemic standards.
Alternatively, define one's top-level goal as successful information transmission rather than merely intellectual rigor. Using the information-theoretic model, plan for the channel [1] and the audience's decoding.

I'll give three examples here:

For a place like LessWrong, aim high. Expect that people have enough knowledge (or can get up to speed) to engage substantively with the object-level details. As I understand it, we want (and have) a community where purely strategic behavior is discouraged and unhelpful, because we want to learn together to unpack the full decision graph relating to future scenarios. [2]
For other social media, think about your status there and plan based on your priorities. You might ask questions like: What do you want to say about IABIED? What mix of advocacy, promotion, clarification, agreement, disagreement are you aiming for? How will the channel change (amplify, distort, etc) your message? How will the audience perceive your comments?
For 1-to-1 in-person discussions, you might have more room for experimentation in choosing your message and style. You might try out different objectives. There is a time and place for being mindful of short inferential distances and therefore building a case slowly and deliberately. There is also a time/place for pushing on the Overton window. What does the "persuasion graph" look like for a particular person? Can you be ok with getting someone to agree with your conclusion even if they get there from a less rigorous direction? Even if that other path isn't durable as the person gains more knowledge? (These are hard strategic questions.)

Personally, I am lucky that get to try out many face-to-face conversations with new people many times a week to see what happens. I am not following any survey methodology; this is more open-ended and exploratory so that I can get the contours.

[1]: Technical note: some think of an information-theoretic channel as only suffering from Gaussian noise, but that's only one case. A channel can be any conditional probability distribution p(y|x) (output given input) and need not be memoryless. (Note that the notion of a conditional probability distribution generalizes over the notion of a function, which must be deterministic by definition.)

[2]: I'd like to see more directed-graph summaries of arguments on LessWrong. Here is one from 2012 by Dmytry titled A belief propagation graph (about AI Risk).

Updated on 2025-09-27.

Perhaps Raemon could say more about what he means by "please don't awkwardly distance yourself"?

Please don't awkwardly distance yourself because it didn't end up saying exactly the things you would have said, unless it's actually fucking important.

Raemon, thank for you writing this! I recommend each of us pause and reflect on how we (the rationality community) sometimes have a tendency to undermine our own efforts. See also Why Our Kind Can't Cooperate.

Consider the following numbered points:

In an important sense, other people (and culture) characterize me as perhaps moderate (or something else). I could be right, wrong, anything in between, or not even wrong. I get labeled largely based on what others think and say of me.
How do I decide on my policy positions? One could make a pretty compelling argument (from rationality, broadly speaking) that my best assessments of the world should determine my policy positions.
Therefore, to the extent I do a good job of #2, I should end up recommending policies that I think will accomplish my desired goals even when accounting for how I will be perceived (#1).

This (obvious?) framework, executed well, might subsume various common (even clichéd) advice that gets thrown around:

Be yourself and do what needs to be done, then let the cards fall as they may.
No one will take your advice if you are perceived as crazy.
Many movements are born by passionate people perceived as “extreme” because important issues are often polarizing.
It can be difficult to rally people around a position that feels watered down.
Pick something doable and execute well to build momentum for the next harder thing.
Writing legislation can be an awful slog. Whipping votes requires a lot of negotiation, some unsavory. But all this depends on years of intellectual and cultural groundwork that softened the ground for the key ideas.

P.S. when I first came here to write this comment, I had only a rough feeling along the lines of “shouldn’t I choose my policy positions based on what I think will actually work and not worry about how I’m perceived?” But I chewed on it for a while. I hope this is a better contribution to the discussion, because I think it is quite a messy space to figure out.

Daniel notes: This is a linkpost for Vitalik's post. I've copied the text below so that I can mark it up with comments.

I’m posting this comment in the spirit of reducing confusion, even if only for one other reader.

Daniel’s comments are at the bottom of the post. When I read “mark it up with comments” that suggested to me that a reader can find the comments inline with the text (which isn’t the case here). In other words, I was expecting to see an alternation between blockquotes of Vitalik’s text followed by Daniel’s comments.

Either way works, but with the current style I suggest adding a note clarifying that Daniel’s comments are below the post.

Update Saturday 9 PM ET: I see now that LessWrong’s right margin shows small icons indicating places where the main text has associated comments. I had never noticed this before. Given the intention of this post, these tiny UI elements seem rather too subtle IMO.

LLMs can’t reliably follow rules

I suggest rewriting this as "Present LLMs can’t reliably follow rules". Doing so is clearer and reduces potential misreading. Saying "LLM" is often ambiguous; it could be the current SoTA, but sometimes it means an entire class.

Stronger claims, such as "Vanilla LLMs (without tooling) cannot and will not be able to reliably follow rule-sets as complicated as chess, even with larger context windows, better training, etc ... and here is why." would be very interesting, if there is evidence and reasoning behind them.

A Schelling point is something people can pick without coordination, often because it feels natural or obvious.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments