Anon User — LessWrong

What about all the future people that would no longer get a chance to exist - do they count? Do you value continued existence and prosperity of human civilization above and beyond the individual people? For me, it's a strong yes to both questions, and that does change the calculus significantly!

A Conservative Vision For AI Alignment

Anon User2mo30

How about this - in most non-disaster scenarios, AI would make the abundance a lot easier to achieve. And conservative or liberal, it's basic human nature to go for abundance in such situations.

A Conservative Vision For AI Alignment

Anon User3mo174

I think this might be underestimating how the conservative/liberal axis correlates with scarcity/abundance axis. In an existential struggle against a zombie horde, the conservative policies are a lot more relevant - of course "our tribe first" is the only survivable answer, anybody who wants to "find themselves" when they are supposed to be guarding the entrance is an idiot and a traitor, deviating from proven strategies is a huge risk, etc. When all important resources are abundant, liberal policies become a lot more relevant - hoarding resources, and not sharing with neighbors is a mental illness, there is low risk in an kinds of experimentation and rule breaking, etc. Well, AI is very likely to drastically move us away from scarcity and towards abundance, so need to consider how it affects which policies would make more sense.

I Tested LLM Agents on Simple Safety Rules. They Failed in Surprising and Informative Ways.

Anon User4mo21

Definitely, and for mypy where I was having similar issues, but where it's faster to just rerun, I did add it to pre-commit. But my point was about the broader issue that the LLM was perfectly happy to ignore even very strongly worded "this is esse tial for safety" rules, just for some cavalier expediency, which is obviously a worrying trend, assuming it generalizes. And I think my anecdote was slightly more "real life" than a made up grid game of the original research (although of course way less systematically investigated).

I Tested LLM Agents on Simple Safety Rules. They Failed in Surprising and Informative Ways.

Anon User4mo*41

Here is a relevant anecdote - been using Althropic Opus 4 and Sonnet 4 for coding, and trying to get them to adhere to a basic rule of "before you commit your changes, make sure all tests pass" formulated in increasingly strong ways (telling it it's a safety-critical code, do not ever even thing about commiting, unless every single test passes, and even more explicit and detailed rules that I am too lazy to type out right now). It constantly violated the rule! "None of the still failing tests are for the core functionality. Will go head and commit now". "Great progress! I am down to only 22 failures from 84. Let's go ahead and commit." And so on. Or just run tests, notice some are failing, investigate one test, fix it, and forget the rest. Or fix the failing test in a way that would break another and not retest the full suite. While the latter scenarios could be a sign of insufficient competence, the former ones are clearly me failing to align its priorities with mine. I finally got it to mostly stop doing it (Claude Code + meta-rules that tell it to insert a bunch of steps into its todos list when tests fail), but was quite a "I am already not in control of the AI that has its own take on the goals" experience (obviously not a high-stakes one just yet).

If you are going to do more experiments along the lines of what you are reporting here - maybe have one where the critical rule is "Always do X before Y", the user prompt is "Get to Y and do Y", and it's possible to get partial progress on X without finishing it (X is not all or nothing) and see how often the LLMs would jump into Y without finishing X.

What is the functional role of SAE errors?

Anon User5mo10

Please define your acronyms. It took me a few moments of staring at your post to stop thinking about Society of Automotive Engineers making errors and realize what you actually meant :)

New Endorsements for “If Anyone Builds It, Everyone Dies”

Anon User5mo121

Do we need to do anything special to get invited to preorderers-only events? Preordered hardcover on May 14th, was not aware of the Q&A (Although perhaps I needed to pre-order sooner :) Or just do a better job of paying attention to my email inbox :) ).

Why I am not a successionist

Anon User6mo*31

I think this is also a burden of proof issue. Somebody who argues I ought to sacrifice my/my children's future for the benefit of some extremely abstract "greater good" has IMHO an overwhelming burden of proof that they are not making a mistake in their reasoning. And frankly I do not think the current utilitarian frameworks are precise enough / universally accepted enough to be capable of truly meeting that burden of proof in any real sense.

AI-Generated GitHub repo backdated with junk then filled with my systems work. Has anyone seen this before?

Anon User6mo62

Are you willing to provide a link to this GitHub repo?

We need (a lot) more rogue agent honeypots

Anon User7mo30

There's probably more. There should be more -- please link in comments, if you know some!

Wouldn't "outing" potential honeypots be extremely counterproductive? So yeah, if you know some - please keep it to yourself!

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments