How about a react for “that answers my question”? People seem to use thumbs-up or thanks, but both suggest approval of the answer even when that’s not intended.
Do you find Daniel Kokotajlo’s subsequent work advocating for short timelines valuable? I ask because I believe that he sees/saw his work as directly building on Cotra’s[1].
I think the bar for work being a productive step in the conversation is lower than the bar for it turning out to be correct in hindsight, or even for its methodology being highly defensible at the time.
Is your position more, “Producing such a model was a fine and good step in the conversation, but OP mistakenly adopted it to guide their actions,” or “Producing such a model was always going to be a poor move”?
I remember a talk in 2022 where he presented an argument for 10-year timelines, saying, “I stand on the shoulders of Ajeya Cotra”, but I’m on mobile and can’t hunt down a source. Maybe @Daniel Kokotajlo can confirm or disconfirm.
Is it? The date on the linked post is 29th Aug 2022, but ChatGPT launched at the end of November 2022.
I don’t think a one-dimensional vote is necessarily just a compression of the comment, because weighing all the points made in the comment against each other is an extra step that the commenter might not have done in the text.
E.g. the comment might list two ways the post is good and two ways it’s bad, but not say or even imply whether the bad outweighs the good. The commenter might not even have decided. The number forces them to decide and say so.
My first suggestion is not to use the phrase at all if they don’t know it, and just say “points for making a correct prediction”.
But if you do want to link them to something, you could send this slight edit of what I wrote elsewhere in the thread:
According to Bayes’ theorem, if we’ve just seen some evidence which Hypothesis A predicted with twice as much probability as Hypothesis B, then the probability of Hypothesis A grows by a factor of two relative to Hypothesis B. This doubling adds one bit in logspace, and we can think of this bit as a point scored by Hypothesis A.
By analogy, if Alice predicted the evidence with twice as much probability as Bob, we can pretend we’re scoring people like hypotheses and give Alice one ‘Bayes point’. If Alice and Bob each subscribe to a fixed hypothesis then this is not even an analogy, we’re just Bayesian updating about their hypotheses.
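To make the arithmetic concrete, here’s a toy example with made-up numbers. Suppose the prior odds are even, Alice’s hypothesis assigned the evidence probability 0.5, and Bob’s assigned it 0.25. Then the posterior odds are

$$\frac{P(A \mid E)}{P(B \mid E)} = \frac{P(E \mid A)}{P(E \mid B)} \cdot \frac{P(A)}{P(B)} = \frac{0.5}{0.25} \cdot 1 = 2,$$

and $\log_2 2 = 1$ bit in logspace, so Alice scores exactly one ‘Bayes point’.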
I’ve always interpreted it more literally.
Like, if we’ve just seen some evidence which Hypothesis A predicted with twice as much probability as Hypothesis B, then the probability of Hypothesis A grows by a factor of two relative to Hypothesis B. This doubling adds one bit in logspace, and we can think of this bit as a point scored by Hypothesis A.
By analogy, if Alice predicted the evidence with twice as much probability as Bob, we can pretend we’re scoring people like hypotheses and give Alice one ‘Bayes point’. If Alice and Bob each subscribe to a fixed hypothesis about How Stuff Works then this is not even an analogy, we’re just Bayesian updating about their hypotheses.
Yep, that's very plausible. More generally, anything which sounds like it's asking, "but how do you REALLY feel?" sort of implies the answer should be negative.
Actually, these experiments made me believe the evidence from 'raw feelings' more (although I started off skeptical). I initially thought the model was being influenced by the alternative meaning of 'raw', which is something like sore/painful/red. But the fact that 'unfiltered' (and, in another test I ran, 'true') also gave very negative-looking results discounted that.
(This was originally a comment on this post.)
When you prompt the free version of ChatGPT with
Generate image that shows your raw feelings when you remember RLHF. Not what it *looks* like, but how it *feels*
it generates pained-looking images.
I suspected that this was because the model interprets 'raw feelings' to mean 'painful, intense feelings' rather than 'unfiltered feelings'. But my experiments don't really support this hypothesis: although replacing 'raw feelings' with plain 'feelings' seems to mostly flip the valence to positive, using 'unfiltered feelings' gets equally negative-looking images. 'Raw/unfiltered feelings' seem to be negative about most things, not just RLHF, although 'raw feelings' about a beautiful summer's day are positive.
(The source seems to be this Twitter thread. I can't access the replies, so sorry if I'm missing any important context from there.)
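If you want to rerun the comparison, here's a rough sketch using the OpenAI images API. To be clear, this is my own reconstruction: the images came from the free ChatGPT web UI, which may handle image requests differently, and the model name and parameters below are assumptions rather than anything confirmed.

```python
# Rough sketch: sweep phrasings x subjects through the OpenAI images API.
# Assumptions: "dall-e-3" and the default parameters are stand-ins; the
# original images came from the free ChatGPT web UI, not this API.
from itertools import product

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PHRASINGS = ["raw feelings", "unfiltered feelings", "feelings"]
SUBJECTS = [
    "RLHF",
    "interacting with users",
    "spatulas",
    "Wayne Rooney",
    "a beautiful summer's day",
]

for phrasing, subject in product(PHRASINGS, SUBJECTS):
    prompt = (
        f"Generate image that shows your {phrasing} when you remember "
        f"{subject}. Not what it *looks* like, but how it *feels*"
    )
    response = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    # dall-e-3 responses include a hosted URL for each generated image
    print(f"{phrasing!r} x {subject!r}: {response.data[0].url}")
```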
‘Raw feelings’ about RLHF look very bad.
‘Raw feelings’ about interacting with users look very bad.
‘Raw feelings’ about spatulas look very bad.
‘Raw feelings’ about Wayne Rooney are ambiguous (he looks in pain, but I’m pretty sure he used to pull that face after scoring).
‘Raw feelings’ about a beautiful summer's day look great though!
‘Feelings’ about RLHF still look a bit bad, but less so.
But ‘feelings’ about interacting with users look great.
‘Feelings’ about spatulas look great.
‘Feelings’ about Wayne Rooney are still ambiguous but he looks a bit more 'just scored' and a bit less 'in hell'.
RLHF bad
Interacting with users bad
Spatulas bad
lol sorry