I’ve had similar thoughts too. I guess the way I’d implement it is by giving the AI a command it can activate that directly overwrites the reward buffer, but then turns the AI off. The idea here is to make it as easy as possible for an AI inclined to wirehead to actually wirehead, so that it is less incentivised to act in the physical world.
During training I would ensure that SGD used the true reward rather than the wireheaded reward. Maybe that would be sufficient to stop wireheading, but there are issues with the AI pursuing the highest-probability plan rather than just a high-probability plan. Maybe quantilising could help here.
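To make the quantilising idea concrete, here is a minimal sketch (the function name and the assumption that we can score candidate plans are mine, not anything from an actual system): instead of always taking the single top-scoring plan, we pick uniformly at random from the top q fraction, which blunts the pressure to find the one extreme plan that games the objective.

```python
import random

def quantilise(plans, score, q=0.1, rng=random):
    """Pick uniformly at random from the top q fraction of plans by score,
    instead of always taking the argmax (which over-optimises)."""
    ranked = sorted(plans, key=score, reverse=True)
    cutoff = max(1, int(len(ranked) * q))  # keep at least one candidate
    return rng.choice(ranked[:cutoff])
```

With q=1 this is just random search, and as q shrinks toward 0 it approaches plain argmax, so q controls how hard we optimise.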
There's a difference between debating the merits of different political positions and merely noting an apparent trend. I'm doing the latter, and I don't think the risks associated with this are too severe. So it's not exactly open season.
There's another possibility, which is that they have some low-level insights that have been dressed up to appear as far more.
"A common estimate is that the loss of a full year of education leads to a loss of ~$100,000 in lifetime earnings" - I find this very hard to believe.
When did you start to doubt?
This is an excellent question. Here are some of the things I consider personally important.
Regarding probability, I recently asked the question: Why is Bayesianism Important? I found this Slatestarcodex post to provide an excellent overview of thinking probabilistically, which seems way more important than almost any of the specific theorems.
I would include basic game theory - prisoner's dilemma, tragedy of the commons, multi-polar traps (see Meditations on Moloch for this latter idea).
In terms of decision theory, there's the basic concept of expected utility, decreasing marginal utility, then the Inside/Outside views.
I think it's also important to understand the limits of rationality. I've written a post on this (pseudo-rationality), there's Barbarians vs. Bayesians, and there's these two posts by Scott Alexander - Seeing Like a State and The Secret of Our Success. Thinking, Fast and Slow has already been mentioned.
The Map is Not the Territory revolutionised my understanding of philosophy and prevented me from ending up in stupid linguistic arguments. I'd suggest supplementing this by understanding how Conceptual Engineering avoids the plague of counterexample philosophy prevalent in conceptual analysis (Wittgenstein's conception of meanings as Family Resemblances is useful too - Eliezer talks about the cluster structure of thingspace).
Most normal people are far too ready to dismiss hypothetical situations. While Making Beliefs Pay Rent can, if taken too far, lead to a naïve kind of logical positivism, it is in general a good heuristic. Where Recursive Justification Hits Bottom argues for a kind of circular epistemology.
In terms of morality, Torture vs. Dust Specks is a classic.
Pragmatically, there's the Pareto Principle (or 80/20 rule) and I'll also throw in my posts on Making Exceptions to General Rules and Emotions are not Beliefs.
In terms of understanding people better there's Inferential Distance, Mistake Theory vs. Conflict Theory, Contextualising vs. Decoupling Norms, The Least Convenient Possible World, Intellectual Turing Tests and Steelmanning/the Principle of Charity.
There seems to be increasingly broad agreement that meditation is really important and complements rationality beautifully, insofar as irrationality is more often a result of a lack of control over our emotions than a lack of knowledge. Beyond this, it can provide extra introspective capacities, and meditative practices like circling can allow us to relate better to other people.
One of my main philosophical disagreements with people here is that they often lean towards verificationism, while I don't believe that the universe has to play nice, so things will often be true that we can't actually verify.
I appreciate how Ben handled this: it was nice for him to let me comment before he posted and for him to also add some words of appreciation at the end.
Regarding point 2, since I was viewing this in game mode I had no real reason to worry about being tricked. Avoiding being tricked by not posting about it would have been like avoiding losing in chess by never making the next move.
I guess other than that, I'd suggest that losing even a counterfactual $100 donation to charity would feel more significant than the frontpage going down for a day. The current penalty feels like it was specifically chosen to be insignificant.
Also, I definitely would have taken it more seriously if I had realised it was serious to people. That wasn't even in my zone of possibility.
Why would there be? I'm sure they saw it as just a game too and it would be extremely hypocritical for me to be annoyed at anyone for that.
Hey, I've become interested in this field too recently. I've been listening to the Jim Rutt show which is pretty interesting, but I haven't dived into it in any real depth. I agree that it is something that we should be looking more into.
I won't pretend to be an expert on this topic, but my understanding of the differences is as follows: