Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
For future reference, please put text that is pretty close to LLM output into expandable sections and flag it as such. For a relatively fact-heavy post like this, LLM output is great and often helpful, but I don't think we are doing anyone any service by dressing it up as human writing. This is generally part of LessWrong content policy, and we would have rejected this post if it had come from a new user (this doesn't mean the core ideas are bad; indeed, I find this post useful, but I do really think the attractor of everyone pasting content like this is a much worse attractor than the one we are currently in).
Do we then say that Claude's extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
But in that case, wouldn't a rock that has "just ask Evan" written on it be even better than Claude? Like, I felt confident that you were talking about Claude's extrapolated volition in the absence of humans, since turning Claude into a rock that, when asked about ethics, just has "ask Evan" written on it does not seem like any relevant evidence about the difficulty of alignment, or its historical success.
Knowing that you haven't solved the problem is actually really quite useful and important! I think basically no progress has been made on the alignment problem, but I do think the arguments for why it hasn't been solved yet are really quite important for helping humanity navigate the coming decades.
Last week, I was working with a paper that has over 100 upvotes on LessWrong
Just curious whether you meant "score above 100" or "more than 100 votes". Those are quite different facts!
This was the thinking model (I basically always use the thinking model).
I mean, maybe there is a bit of self-deception going on, though what that looks like in LLMs is messy.
But it's clear that the hallucinations point in the direction of sycophancy, and also that the LLM is not trying very hard not to lie, despite this being something I obviously care quite a bit about (and the LLM knows this).
If you want to call them "sycophantically adversarial selective hallucinations", then sure, but I honestly think "lying" is a better descriptor, and more predictive of what LLMs will do in similar situations.
I would also simply bet that if we had access to the CoT in the above case, the answer to what happened would not look that much like "hallucinations". It would look more like "the model realized it can't read it, kind of panicked, tried some alternative ways of solving the problem, and eventually just output this answer". Like, I really don't think the model will have ended up in a cognitive state where it thought it could read the PDF, which is what "hallucination" would imply.
LessWrong is not a forum in which posting in good faith is sufficient to be welcomed! Think of it as a professional community. Just because you are writing a physics paper in good faith doesn't mean it will be well-received by the physics community as a contribution. Similarly here, I think you are missing a large number of prerequisites that are assumed to be understood by participants on LW.
I would recommend checking out the New User's Guide to LessWrong.
This comment got a lot of downvotes (at this time, 2 overall karma with 19 votes). It shouldn't have, and I personally believe this is a sign of people being attached to AI x-risk ideas, with those ideas contributing to their entire persona, rather than of strict disagreement. This is something I bring up in conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
I generally think it makes sense for people to have pretty complicated reasons for why they think something should be downvoted. I think this is especially true for longer content, which would often require an enormous amount of effort to respond to explicitly.
I have some sympathy for being sad here if a comment ends up highly net-downvoted, but FWIW, I think 2 karma feels vaguely in the right vicinity for this comment, maybe I would upvote it to +6, but I would indeed be sad to see it at +20 or whatever since I do think it's doing something pretty tiring and hard to engage with. Directional downvoting is a totally fine use of downvoting, and if you think a comment is overrated but not bad, please downvote it until its karma reflects where you want it to end up!
(This doesn't mean it doesn't make sense to do sociological analysis of cultural trends on LW using downvoting, but I do want to maintain the cultural locus where people can have complicated reasons for downvoting and where statements like "if you disagree strongly with the above comment you should force yourself to outline your views" aren't frequently made. The whole point of the vote system is to get signal from people without forcing them to do huge amounts of explanatory labor. Please don't break that part)
Come on, if you want to argue the fire death point at least give some kind of statistic or do a micromort estimate.
Prevented home accidents do not show up in stats; it is akin to survivorship bias.
Most people do not own fire blankets, so there is little survivorship bias going on here. You can just estimate using base rates.
The expected annual property damage from fire is around $60/year per homeowner, per this random ChatGPT analysis (in other words, not worth worrying about). A fire blanket would need to produce a 50% reduction of all fire risk to start being worth the cost and attention.
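The break-even logic here can be sketched in a few lines. The $60/year expected damage figure comes from the rough estimate above; the blanket's purchase price, lifetime, and the dollar value assigned to the attention/hassle cost are all hypothetical assumptions chosen for illustration, not measured numbers:

```python
# Back-of-the-envelope break-even estimate for owning a fire blanket.
# Only the $60/year figure comes from the discussion above; everything
# else is a hypothetical assumption for illustration.

expected_annual_fire_damage = 60.0  # $/year, rough per-homeowner estimate
blanket_cost = 30.0                 # $, hypothetical purchase price
blanket_lifetime_years = 10         # hypothetical useful life
attention_cost_per_year = 27.0      # $/year, hypothetical hassle/attention cost

# Amortize the purchase price over the blanket's lifetime, then add
# the recurring attention cost to get an annualized total cost.
annualized_cost = blanket_cost / blanket_lifetime_years + attention_cost_per_year

# Fraction of total fire risk the blanket must eliminate to break even.
required_risk_reduction = annualized_cost / expected_annual_fire_damage
print(f"Break-even risk reduction: {required_risk_reduction:.0%}")  # 50%
```

Under these (made-up) cost assumptions, the blanket only pays for itself if it eliminates about half of all fire risk, which is the shape of the claim above; plugging in your own cost and hassle numbers changes the threshold proportionally.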
Honestly, this whole conversation just feels like I am on Reddit with people giving random anecdotes without statistical literacy. You can disagree with me, but you speak with weird authority on issues that you seem to not have actually thought that clearly about.
Well, by my values I highly doubt you are going to do anything except hide a general tendency by patching one individual kind of instance, so I am not sure how I feel about that, but if you learn more about the mechanisms I would be quite curious.