Religious Persistence: A Missing Primitive for Robust Alignment
Current LLM alignment approaches rely heavily on preference learning and RLHF, which amount to "teaching" good behavior through rewards and penalties. Out-of-distribution circumstances pose perhaps the largest risk of principle abandonment, and worse, such situations are by nature nearly impossible to predict. I propose that the relative...
Apr 14, 2025