enjoy the process
yup, that's going to be my strategy next year, taking a couple of months without doing any software engineering at all last summer might not have solved alignment, but I realized I don't care for humanity that much and that staying away from AI safety for a few years didn't help to keep it out of my mind anyway... so I collected a few points what makes the process more enjoyable for me in My burnout journey and will try to keep up with those points when helping to build infrastructure on various projects 🤞
Ah, ok ... I tried to ask Opus 4.5 about the Hawthorne effect equivalent, would you find its summary of state of the art in that regard fair?
To my knowledge, there's no published work that:
- Shows alignment faking emerging from realistic pretraining distributions (i.e., without targeted synthetic documents)
- Demonstrates eval-conditional behavior changes in models that weren't given explicit training-awareness information
- Provides a "natural experiment" where models acquired situational awareness through ordinary training and then exhibited differential eval/deployment behavior
The closest is the probing work showing eval-awareness representations exist—but that's detection, not demonstration of behavioral consequences.
In my naive intuition, it ought to be a wider basin for gradient descent to find a heuristic "everything is an eval" than actually correctly detecting misalignment evals and only faking alignment in those scenarios while never being aligned in production... but no idea how people test that such hypothesis is false.
Huh? I can't tell if my question is genuinely hard to understand or if you are trolling me or do I have unreasonable expectations about AI research? Is the research method here to compare eval and non-eval situations actually "something from the internet"?
I might be missing some background, but how do people set up true positive situation when the model is NOT in an eval? Transcripts of "free tier" usage from the wild sound like "too late", while setting up an eval that pretends to be not an eval is not "true"...
additionally, optimizing for 1 factor makes other factors less "visible", especially in the short term ... so a tendency to try to "improve things slightly" instead of truly optimizing was probably strongly selected for all the cases the biorobot's value function is only a proxy for unknowable-up-front true reward
what exactly is it about human brains[1] that allows them to not always act like power-seeking ruthless consequentialists?
why focus only on the brains? it's a property of the mind and I thought the standard take why humans are not even approximating utility maximizers is because of properties of the environment (priors), not a hardcoded function/software/architecture/wetwere in the brain .. or?
🤓🔫 "exactly 499" according to google docs which seems to count contractions as single words.. basically split by space but e.g. is 2 words as far as I can tell, but if I and have were to be counted as 2 words, then the precommitment gives John a tiny little fuck right back:
the famous case where a guy lost all his emotions and in many ways things remained fine.
I believe the point of the example was that the guy would take forever to make any decision, even small things like what to pick in a store, wasting hours on trivial stuff (..and when I imagine my lack of patience with friends when they are indecisive sometimes, I suspect his social life might not "remained fine" but that was not mentioned in the interview),, that emotions (like value functions in ML) help to speed up progress even if with unlimited time we might get to the same conclusions
Funny perspective,, but why not model the phenomenon as distributed computing instead of using loaded labels? Mundane things like what to eat for dinner is a collective family-unit decision, modelling is as bi-directional dominance game sounds counter-useful to me, even though the description could be made equally accurate in both frames.
Say, when I cook pasta an hour sooner than I would have otherwise when my husband is hungry already, I even delegate 99% of the decision about various health hazards to the institutions of the civilization I live in - the tap water is fine, the bag of pasta is fine, the tomatoes are meh but fine, I just check the non-moldy cheese is indeed not moldy yet. I don't have a chemical and bio lab to check for all the poisons and microbes myself, yet I don't see it as being submissive to the society, I just trust the external computation that made the decisions for me is aligned enough with my interests..
Incentives and institutions.