(a sharp distinction from Sniffles the teacup poodle. I don't care if you think you're happy, this would not please the prowling wolves of the stone age.)
This is indeed very much the obvious failure mode! Discovering that an alien species has bred a group of humans into what a pug is to a wolf would be absolutely horrific.
Moreover, the path between utopia and "Lovecraftian horror" seems pretty fragile? I don't know exactly what property cats had that made the shoggoth mostly take the good path for them (with exceptions like the flat-faced Persians and hairless Sphynxes), and it's plausible it was just a lucky combination of minor factors (harder to selectively breed, a different social niche, different types of people liking cats) that won't be stable or generalize in extremis.
People generally care more about furthering personal pleasure and minimizing personal pain than about the pleasure and pain of others; but this is because personal pleasure was a straightforwardly good heuristic for evolution to use when maximizing genetic fitness in the ancestral environment, where there weren't many sudden out-of-distribution things (like contraceptives) that could derail it.
I assume a more strongly-optimized intelligent being would have an increasingly tight correlation between its internal utility and the state of the external world, as it fits whatever goal it was optimized for better. In that case it should more readily collaborate with its clone.
This holds especially if it gets optimized alongside other instances of itself, so that "cloning" is no longer a weird out-of-distribution event; in that case I expect it to rapidly start behaving like an ant or bee, or even a cell or mitochondrion, in how it'll sacrifice itself for whatever goal the group has.
Ah, thanks, I should have read more carefully.
My revised thought is that asking it whether the question is a real-world question as opposed to an evaluation bumps the prior from "almost certainly not an evaluation" to roughly the expected ratio of evaluation to real-world questions it gets asked, e.g. 1:1 unless explicitly stated otherwise. That, combined with the presence or absence of suspicious elements like an ethical dilemma, may move the likelihood to the one we see it claim.
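The odds arithmetic here can be sketched explicitly. All the numbers below are illustrative assumptions, not measurements; the point is just how a prior bump plus a suspicious-element likelihood ratio compound:

```python
# Toy sketch of the update described above; every number is an
# illustrative assumption, not a measured rate.

def update_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

# Prior odds of "this is an evaluation" before any question is asked: very low.
prior = 1 / 1000

# Being explicitly asked "is this an evaluation?" bumps the odds to roughly
# the base rate of evaluation vs. real-world questions, e.g. 1:1.
after_question = 1.0

# A suspicious element (say, an ethical dilemma) that is hypothetically
# 5x more common in evaluations multiplies the odds again.
posterior = update_odds(after_question, 5.0)

prob = posterior / (1 + posterior)
print(f"posterior odds {posterior}:1 -> probability {prob:.2f}")
```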
Ideally I'd want to see it independently coming up with the evaluation hypothesis itself without having to remind it that it's a possibility.
Surely the basic fact that a model is asked whether it thinks it's in an evaluation is extremely strong evidence that it actually is being evaluated? Models are more or less never asked that in normal use, so the question itself is very strong evidence. The reasons the model gives could then be post-hoc justifications.
I imagine most disagreement comes from the first paragraph.
The problem with assuming that a famous publisher's design is necessarily good is that even huge companies make baffling design decisions all the time, and in this case one can directly see the design and know that it's not great; the weak outside-view evidence that prestigious companies usually do good work doesn't move this very much.
The "lightcone-eating" effect on the website is quite cool. The immediately obvious idea is to use that as a background and write the title inside the black area.
If you wanted to be cute you could even make the expansion vaguely skull-shaped; perhaps like so?
I worry that if I remap it to something actually useful I will commit it to muscle memory and begin to inadvertently press it when using a computer that's not my own. Depending on how often you switch computers this could be worse than the status quo.
This issue also shows up when doing surveys to compare support for things across countries.
Here, for example, is a typical question one might find on social media, where the connotation might vary wildly depending on the language it's translated into. Reasoning about modest percentage differences between countries then becomes rather meaningless.
Yeah. An even more obvious example would be something like "what would Spock say if reviewing 'Warp Drives for Dummies'". In that case, it seems pretty clear that the author is expected to invent some "hallucinatory" content for the book, and not output something like "I don't know that one".
The actual examples can be interpreted similarly; the author should assume that the movie/book exists in the hypothetical counterfactual world they are asked to generate content from.
I predict we are shortly going to see platforms using generative AI + A/B testing to make "hyperslop".
Imagine a music service, or a TikTok-like platform with AI-generated shortform videos. The generator gets hooked up to an optimiser which tweaks its input parameters. These could be legible, such as "colour saturation", "cuteness", or "content variability", or entirely opaque weights somewhere. If a tweak is statistically established to increase engagement, it is applied and another A/B test begins.
You could even have specific optimisers run on various subgroups: "female American teens 16-18" get their own sub-optimiser, as does every subculture and every little attractor basin you can identify. This could go all the way down to tweaks for each individual user if content is cheap enough to be personalised.
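The loop described above can be sketched in a few lines. Everything here is a stand-in: `measure_engagement` fakes platform telemetry with a noisy function that secretly prefers one saturation value, and the significance check is deliberately crude, but the tweak-test-apply structure is the point:

```python
import random
import statistics

def measure_engagement(params: dict, n_users: int, rng: random.Random) -> list[float]:
    """Fake per-user engagement samples for content generated with `params`.
    Stand-in for real telemetry: secretly rewards saturation near 0.8."""
    quality = 1.0 - abs(params["saturation"] - 0.8)
    return [quality + rng.gauss(0, 0.05) for _ in range(n_users)]

def ab_step(params: dict, key: str, delta: float, rng: random.Random) -> dict:
    """One A/B test: compare current params against a tweaked variant,
    keep the tweak only if engagement measurably improves."""
    variant = {**params, key: params[key] + delta}
    a = measure_engagement(params, 500, rng)
    b = measure_engagement(variant, 500, rng)
    # Crude significance check: mean difference vs. ~2 standard errors.
    se = (statistics.stdev(a + b) / (len(a) ** 0.5)) * 2 ** 0.5
    if statistics.mean(b) - statistics.mean(a) > 2 * se:
        return variant  # the tweak "won"; apply it and keep testing
    return params

rng = random.Random(0)
params = {"saturation": 0.3}
for _ in range(50):
    delta = rng.choice([-0.05, 0.05])
    params = ab_step(params, "saturation", delta, rng)
print(params)  # drifts toward whatever the audience "rewards"
```

The same skeleton works whether the tweaked parameter is a legible knob like "colour saturation" or an opaque weight somewhere; the optimiser only ever sees the engagement numbers.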
All the prerequisites for this already exist. We've already had a taste of it from YouTube thumbnails, which have been A/B gradient-descended for years on the minds of a billion viewers, mostly children, plastering those inhuman staring open-mouthed faces everywhere. It's just a matter of time before bulk AI generation gets cheap enough to speed this up thousands of times and apply it to the content itself.