Aprillion — LessWrong

https://peter.hozak.info

My perception of time is like sampling from a continuous 2D plane of conceptual space, something akin to git railways but with a thickness to the lines, like electron orbital probability clouds that are dense around plausible points of view of what happened in the past and thin around conspiracy theories, as different linear mind-map perspectives of people standing around a sculpture, each only inferring what's on the other side, but together they can prune down non-overlapping minority reports, sticking to the consensus but never deleting (git) history.

My sense of beauty finds it displeasing to read articles with only point measurements and only n-point predictions and to look at charts from a single interpretation / scenario, I have to hallucinate the gaps, infer the systemic bias and imagine the size of the error bars due to "randomness" as if the authors were good intentioned and as if the authors had an agenda to prove a point, would they stumble upon convenient evidence before they stopped looking?

But alternative timelines that are infinitesimally thin and split only on known unknowns would imply perfect Bayesian approximators, an impossible standard, uncomputable. No one has ever made that kind of precise prediction, why do we allow prediction-readers to behave as if prediction-writers could have made an infinitely precise measurable decidable statements with completely non-ambiguous semantics that will be evaluated to a non-reversible Boolean?

...we aphantasiacs see nothing but darkness.
Instead of seeing images, I think exclusively in terms of concepts. For example, I know what my dog looks like, but not from visualizing her in my mind. Instead, I can describe my dog by listing her characteristics stored in my memory—like I’m accessing words from a txt file.

Hm, it sounds like as if you were grieving a loss... How did you come to the conclusion that there is anything of substance that is desirable to have that you cannot have? I reached the opposite conclusion, that "normal" people are misfortunate if they auto-sample a concrete instance of abstract concepts like "a red apple" so prematurely.

While I myself I don't "see", "hear", "smell", or "touch" my imagined concepts, those are are much richer objects for me than mere sensory input hallucinations would have been - and I have very rich sensory inputs to start with - when I close my eyes, I don't see darkness, I see a red-ish background with dynamic blue-green "pixel noise". And I still know where the windows are in the 3D space around me as I rotate in a room with my eyes closed, while I can also see afterimages from the screen but those stay 2D-projected in front where I point my eyes as I move them inside my eye sockets. I perceive the shape of the room from my other senses as distinct from the visual-only layers, it doesn't go away by merely closing my eyes, but I also cannot help but integrate the visual info of the direction of the light source to inform my sense of the shape of the room - putting my hands in front of my closed eyes makes it harder to walk through the door if I turn around a few times.

..it's just the imagined apple comes from neither of those senses, it floats on a different plane of existence. I can give it colors, rotate it, I can give it smells, I can give it a past, I can give it a future, I know what sound it would make if I eat it and it was still fresh and crisp or if it was old and soggy. I know all that the instant I focus on any aspect of that apple that exists only in my mind, I am not accessing words from a txt file, I am accessing the concept that is much richer than I could put into words.

And if you can shuffle a key into a door even when holding a bunch of bags that obscure the lock, you have evidence that you must be able to imagine the 3D space that you learned from your vision even without seeing it directly with your eyes - you are not describing how to find the lock with just your fingers and your ears in words, I would define the situation as "imagining" how the real world really is even without "seeing" it.

(Though now that I paid too much attention to my vision, I will see the stupid tangled protein floaters in front of the screen until I distract myself with something, aaah...)

first... wouldn't that be evidence that models are NOT actually aware of whether or not they are in an eval? second, how do you figure out whether they are aware of the real situation vs always think they are in an eval with some probability BEFORE releasing to production? I am trying to understand how "eval awareness" is a useful concept for doing something about it and I feel like I'm missing something here

hm, most wannabe new games won't get to the publishing phase tho.. or even attempt to get to steam -📱? (...also probably best to count roblox as "1 game" if we are interested in sales, but...)

Incentives and institutions.

enjoy the process

yup, that's going to be my strategy next year, taking a couple of months without doing any software engineering at all last summer might not have solved alignment, but I realized I don't care for humanity that much and that staying away from AI safety for a few years didn't help to keep it out of my mind anyway... so I collected a few points what makes the process more enjoyable for me in My burnout journey and will try to keep up with those points when helping to build infrastructure on various projects 🤞

Ah, ok ... I tried to ask Opus 4.5 about the Hawthorne effect equivalent, would you find its summary of state of the art in that regard fair?

To my knowledge, there's no published work that:
Shows alignment faking emerging from realistic pretraining distributions (i.e., without targeted synthetic documents)
Demonstrates eval-conditional behavior changes in models that weren't given explicit training-awareness information
Provides a "natural experiment" where models acquired situational awareness through ordinary training and then exhibited differential eval/deployment behavior
The closest is the probing work showing eval-awareness representations exist—but that's detection, not demonstration of behavioral consequences.

In my naive intuition, it ought to be a wider basin for gradient descent to find a heuristic "everything is an eval" than actually correctly detecting misalignment evals and only faking alignment in those scenarios while never being aligned in production... but no idea how people test that such hypothesis is false.

Huh? I can't tell if my question is genuinely hard to understand or if you are trolling me or do I have unreasonable expectations about AI research? Is the research method here to compare eval and non-eval situations actually "something from the internet"?

I might be missing some background, but how do people set up true positive situation when the model is NOT in an eval? Transcripts of "free tier" usage from the wild sound like "too late", while setting up an eval that pretends to be not an eval is not "true"...

additionally, optimizing for 1 factor makes other factors less "visible", especially in the short term ... so a tendency to try to "improve things slightly" instead of truly optimizing was probably strongly selected for all the cases the biorobot's value function is only a proxy for unknowable-up-front true reward

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments