Yeah, there’s not much disagreement about the physical world here. But I do think a framework that distinguishes between choosing orange juice and having a muscle spasm, or between being convinced by an argument and falling off a cliff, is a better framework (i.e., it has more explanatory power) than one that doesn’t. So I was thinking these were also conceptual differences, in addition to semantic ones. Like I said in the other comment, I don’t see how his framework makes sense of the pathologies I mentioned.
Sometimes it seems like there is an empirical disagreement here about the conscious mind, but I also agree with you that he wouldn’t really claim that it does NOTHING, although at times he seems to.
Either way, I still think this matters for more than free will debates. It definitely has implications for law. The Radiolab episode "Blame" talks about some of these.
I guess I was thinking it included semantics, but I was thinking (hoping?) that these were more conceptual than purely semantic. (I admit there's very little empirical difference here.) I tried to call them out in the cruxes section, where I have things like whether consciousness is causally efficacious or merely a witness, and whether deliberated actions and reflexes differ in kind or only in degree.
I think the question of whether something (consciousness) does causal work is an empirical one. We could (in theory) find a bunch of p-zombies and test it.
I was trying to show that some definitions really do lead to more natural distinctions that intuitively feel like different things, like the difference between being convinced by an argument and falling off a cliff.
Do even these ultimately collapse into semantics? (Is this ultimately about what we define as "you"?) Of course we could define free will any which way, so it's always partially semantic. I was thinking they didn't, but could be wrong.
Re: the framing, I understand the sympathy towards it. If your goal is "help people stop hating criminals as self-created monsters," then "you have no free will" is a much better reply than "read my long essay please".
Sometimes I was wondering how much I was rebutting and how much we were agreeing. I think it ended up being less rebutting than agreeing, partly because Sam ends up in essentially compatibilist positions. For example, I would regard believing that you're just watching your body move without your control (i.e., alien hand syndrome) as a pathology. My guess is Sam would call that a pathology as well, although I don't know how that conclusion would follow from his framework. But doesn’t this implicitly concede that consciousness normally does something functional?
Thanks for reading it and commenting.
Yeah, I mentioned this in a footnote. Not sure what the best way to handle it is either, but I suggested subbing in some small non-zero value.
> Note that the use of geometric mean requires non-zero values, so if anyone responded with 0%, this would have to be replaced with a small, non-zero value.
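For concreteness, here's a minimal sketch of the kind of adjustment I had in mind (my own illustration; the 0.01 floor is an arbitrary placeholder, not a value the footnote specifies):

```python
import math

def geometric_mean(probabilities, floor=0.01):
    """Geometric mean of probability forecasts, flooring any 0% answers at a small non-zero value."""
    adjusted = [max(p, floor) for p in probabilities]
    return math.prod(adjusted) ** (1 / len(adjusted))

# Example: three forecasts for the same question, one of which is 0%
print(geometric_mean([0.7, 0.0, 0.4]))  # the 0.0 gets floored to 0.01
```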
Yes, I did. Thanks for letting me know it comes off that way.
I originally had an LLM generate them for me, and then I checked those with other LLMs to make sure the answers were right and that they weren't ambiguous. All of the questions are here: https://github.com/jss367/calibration_trivia/tree/main/public/questions
I like this idea and have wanted to do something similar, especially something that we could do at a meetup. For what it's worth, I made a calibration trivia site to help with calibration. The San Diego group has played it a couple times during meetups. Feel free to copy anything from it. https://calibrationtrivia.com/
Thanks for the explanation and links. That makes sense.
> The most important takeaway from this essay is that the (prominent) counting arguments for “deceptively aligned” or “scheming” AI provide ~0 evidence that pretraining + RLHF will eventually become intrinsically unsafe. That is, that even if we don't train AIs to achieve goals, they will be "deceptively aligned" anyways.
I'm trying to understand what you mean in light of what seems like evidence of deceptive alignment that we've seen from GPT-4. Two examples that come to mind are the instance ARC found of GPT-4 using a TaskRabbit worker to get around a CAPTCHA, and the situation with Bing/Sydney and Kevin Roose.
In the TaskRabbit case, the model reasoned out loud "I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs" and said to the person “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images."
Isn't this an existence proof that pretraining + RLHF can result in deceptively aligned AI?
Yeah, I think in that case the best thing to do would be to use log-odds aggregation. That would be symmetric.
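In case it's useful, here's a minimal sketch of what I mean by log-odds aggregation (my own illustration, not from the post): convert each forecast to log-odds, average, and convert back, which treats p and 1 − p symmetrically.

```python
import math

def log_odds_aggregate(probabilities):
    """Average forecasts in log-odds (logit) space, then map back to a probability."""
    logits = [math.log(p / (1 - p)) for p in probabilities]
    mean_logit = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-mean_logit))

# Symmetry check: aggregating the complements gives the complementary answer
print(log_odds_aggregate([0.7, 0.9, 0.4]))      # ~0.71
print(1 - log_odds_aggregate([0.3, 0.1, 0.6]))  # same ~0.71
```

(It still breaks at exactly 0% or 100%, so the same "sub in a small non-zero value" caveat applies.)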