Elias Schmied

Taking a gap year to test my fit for alignment research.


Ah, I see what you mean! Interesting perspective. The one thing I disagree with is that a "gradient" doesn't seem like the most natural way to see it. It seems more like a binary: "Is there (accurate) modelling of the counterfactual of your choice being different, and did that modelling actually impact the choice? If yes, it's acausal. If not, it's not." This intuitively feels pretty binary to me.

I don't think the "zero-computation" case should count. Are two ants in an anthill doing acausal coordination? No, they're just two similar physical systems. Counting that seems to stretch the original meaning; it's in no sense "acausal".

I disagree. There is no acausal coordination because, e.g., the reasoning "If everyone thought like me, democracy would fall apart" does not actually influence many people's choices, i.e. they would vote due to various social-emotional factors no matter what that reasoning said. It's just a rationalization.

More precisely, when people say "If everyone thought like me, democracy would fall apart", it's not actually the reasoning that it could be interpreted as; it's a vague emotional appeal to loyalty/the identity of a modern liberal/etc. You can tell because it refers to "everyone" instead of a narrow slice of people, it involves no modelling of the specific counterfactual of MY choice, there's no general understanding of decision theory that would allow this kind of reasoning to happen, and any reasonable model of the average person's mind doesn't allow it imo.

Your model is also straining to explain the extra taxes thing. "Voting is normal, paying extra taxes isn't" is much simpler.

In general, I'm wary of attempts to overly steelman the average person's behavior, especially when there's a "cool thing" like decision theory involved. It feels like a Mysterious Answers to Mysterious Questions kind of thing.

I've been thinking along similar lines, but instinctively, without a lot of reflection, I'm concerned about negative social effects of having an explicit community-wide list of "trusted people".

After thinking about it a little bit, the only hypothesis I could come up with for what's going on in the negation example is that the smaller models understand the Q&A format and understand negation, but the larger models have learned that negation inside a Q&A is unusual and so disregard it.

Thanks for this post, this looks very useful :) (it comes at a great time for me since I'm starting to work on my first self-directed research project right now).

I'm very interested, but since you've already found someone, please post the results! :)

