Seems to me there's too much noise to pinpoint the break at a specific month. There are some predictions made in early 2022 with an even later date than those made in late 2021.
But one pivotal thing around that time might have been the chain-of-thought stuff which started to come to attention then (even though there was some stuff floating around Twitter earlier).
It's a terribly organized and presented proof, but I think it's basically right (although it's skipping over some algebraic details, which is common in proofs). To spell it out:
Fix any and . We then have,
Adding to both sides,
Therefore, if (by assumption in that line of the proof) and , we'd have,
which contradicts our assumption that .
Thanks! This is clearer. (To be pedantic, the distance should have a root around the integral, but it's clear what you mean.)
Perhaps of interest to this community is GPT-4 using a Linux terminal to iteratively problem-solve locating and infiltrating a poorly-secured machine on a local network:
That's because the naive inner product suggested by the risk is non-informative,
Hmm, how is this an inner product? I note that it lacks, among other properties, positive definiteness:
Edit: I guess you mean a distance metric induced by an inner product (similar to the examples later on, where you have distance metrics induced by a norm), not an actual inner product? I'm confused by the use of standard inner product notation if that's the intended meaning. Also, in this case, this doesn't seem to be a valid distance metric either, as it lacks symmetry and non-negativity. So I think I'm still confused as to what this is saying.
I think this is overcomplicating things.
We don't have to solve any deep philosophical problems here finding the one true pointer to "society's values", or figuring out how to analogize society to an individual.
We just have to recognize that the vast majority of us really don't want a single rogue to be able to destroy everything we've all built, and we can act pragmatically to make that less likely.
Well, admittedly “alignment to humanity as a whole” is open to interpretation. But would you rather everyone have their own personal superintelligence that they can brainwash to do whatever they want?
I agree this is a problem, and furthermore it’s a subcase of a more general problem: alignment to the user’s intent may frequently entail misalignment with humanity as a whole.
I think that was supposed to be answered by this line:
After each task completion, the agent gets to be near some diamond and receives reward.
Of course the choice of what sort of model we fit to our data can sometimes preordain the conclusion.
Another way to interpret this is there was a very steep update made by the community in early 2022, and since then it’s been relatively flat, or perhaps trending down slowly with a lot of noise (whereas before the update it was trending up slowly).