Seconding this observation.
As a practical note, I find it useful, in addition to telling the LLM to avoid certain patterns as you mention, to also tell it to go back over its changes and rewrite, for example, code that returns a placeholder value in a failure case into code that throws a descriptive error. There seem to be many cases where the LLM can't follow all of your instructions in one shot, but can recognize that failure and course-correct after the fact.
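To illustrate the kind of rewrite I ask for, here is a made-up example (the function names and the port-parsing scenario are mine, not from the original): the "before" version returns a placeholder on failure, and the "after" version throws a descriptive error instead.

```typescript
// Before: silently returns a placeholder when parsing fails.
// Callers cannot distinguish "the input was 0" from "the input was garbage".
function parsePortBefore(raw: string): number {
  const port = Number(raw);
  return Number.isInteger(port) ? port : 0; // 0 is not a real port
}

// After the review pass: fail early with a descriptive error.
function parsePort(raw: string): number {
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid port ${JSON.stringify(raw)}: expected an integer in 1-65535`);
  }
  return port;
}
```

In practice, a follow-up prompt like "re-read your diff and replace any placeholder-on-failure returns with thrown errors" is often enough to trigger this rewrite.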
Modern large language models go through a battery of reinforcement learning where they are trained not to produce code that fails in specific, easily detectable ways, like crashing the program or causing failed unit tests. Almost universally, this means these models have learned to produce code that looks like this:
function getSomeParticularNumber(connection: SqlConnection): number {
  try {
    const x = connection.query(`SELECT "no" FROM accounts`);
    return x[0];
  } catch (err) {
    logger.warn("Couldn't get the number; returning 0 by default so as not to break things");
    return 0;
  }
}
The above code is very cheeky, and quite bad. It's more likely to pass integration tests than it would be without the try/catch block, but only because it fails silently. Callers of getSomeParticularNumber won't know whether 0 is what's actually in the database or whether there was an intermittent connection error - which could be catastrophic if the number is, say, the price of an item in a store. And if it turns out this code contains a bug (for example, if the table should be "Accounts" instead of "accounts"), testers might not notice until it's actually impacting users.
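One way the same function might look after removing the silent fallback - sketched here with a minimal stand-in for the SqlConnection interface, since its real shape isn't given above - is to let the query error propagate and to fail loudly if the expected row is missing:

```typescript
// Minimal stand-in for the SqlConnection used above (assumption: query()
// returns an array of row objects keyed by column name).
interface SqlConnection {
  query(sql: string): Array<{ no: number }>;
}

// The query error, if any, propagates to the caller; a missing row becomes
// a loud, descriptive error instead of an invented default of 0.
function getSomeParticularNumber(connection: SqlConnection): number {
  const rows = connection.query(`SELECT "no" FROM accounts`);
  if (rows.length === 0) {
    throw new Error('Expected at least one row in "accounts", found none');
  }
  return rows[0].no;
}
```

Now a typo'd table name or a dropped connection surfaces immediately in testing, rather than masquerading as a price of 0.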
Some common ways I've seen this behavior manifest include the use of hasattr or getattr in untyped languages.

Until reinforcement learning environments get much, much better, these models will probably continue to do this. To help prevent this kind of behavior, I include custom instructions for most of my terminal coders:
When key assumptions that your code relies upon to work appear to be broken, fail early and visibly, rather than attempting to patch things up. In particular:
* Lean towards propagating errors up to callers, instead of silently "warning" about them inside of try/catch blocks.
* If you are fairly certain data should always exist, assume it does, rather than producing code with unnecessary guardrails or existence checks (especially if such checks might mislead other programmers).
* Avoid the use of hasattr/getattr (or non-Python equivalents) when accessing attributes and fields that should always exist.
* Never produce invalid "defaults" as a result of errors, either for users or for downstream callers.