If you are going to read just one thing I wrote, read The Problem of the Criterion.
More AI-related stuff collected over at PAISRI.
Part of the issue is that celebrity, as Wikipedia approaches the word, is broader than just modern TV and film celebrity, and instead includes a wide variety of people who are not likely to be exceptionally attractive but are well known in some other way. Individuals differ in who they find attractive, but many politicians, authors, radio personalities, famous scientists, etc. are not conventionally attractive in the way movie stars are, and yet these people are still celebrities in a broad sense. However, I've not dug into the depths of Wikipedia to see whether, for example, the gap you see holds up when looking at pages that more directly discuss the qualities of film stars.
AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to be able to get better at winning Go because we misspecified what we asked it to measure. This is a kind of failure introduced into the systems by humans failing to make m(X) adequately evaluate X as we intended, since we cared about winning Go games while also minimizing side effects, but maybe when we constructed m(X) we forgot about minimizing side effects.
Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.
I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.
Fun (for some version of "fun").
There are obvious connections to other countdown clocks, like the nuclear doomsday one. But here's a fun connection to think about. The World Transhumanist Association used to have a "death clock" on their website that counted up deaths so you could see how many people died while you were busy browsing their site. Given that AGI could maybe lead to technological breakthroughs that end death (or could kill everyone), maybe there's something in the space of how many more people have to die before we get the tech that can save them?
IDK, just spitballing here.
"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.
For what it's worth, I think this is trying to get at the same insight as logical time but via a different path.
For the curious reader, this is also the same reason we use vector clocks to build distributed systems when we can't synchronize the clocks very well.
And there's something quite interesting about computation as a partial order. It might seem that this only comes up when you have a "distributed" system, but actually you need partial orders to reason about unitary programs when they are non-deterministic (any program with loops and conditionals that can't be unrolled, because they depend on inputs not known before runtime, is non-deterministic in this sense). For this reason, partial orders are the bread-and-butter of program verification.
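To make the vector clock point a bit more concrete, here's a minimal sketch in Python. The function names (`tick`, `receive`, `happened_before`, `concurrent`) and the dict-based representation are my own choices for illustration, not any particular library's API. The thing to notice is that the comparison only gives a partial order: some pairs of events are simply not ordered relative to each other.

```python
from typing import Dict

Clock = Dict[str, int]  # node name -> count of events seen from that node


def tick(clock: Clock, node: str) -> Clock:
    """Return a copy of `clock` advanced by one local event on `node`."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def receive(local: Clock, incoming: Clock, node: str) -> Clock:
    """Merge an incoming clock (element-wise max), then count the receive event."""
    merged = {k: max(local.get(k, 0), incoming.get(k, 0))
              for k in set(local) | set(incoming)}
    return tick(merged, node)


def happened_before(a: Clock, b: Clock) -> bool:
    """True if a <= b in every component and a < b in at least one."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))


def concurrent(a: Clock, b: Clock) -> bool:
    """True if the clocks differ and neither is ordered before the other."""
    keys = set(a) | set(b)
    differ = any(a.get(k, 0) != b.get(k, 0) for k in keys)
    return differ and not happened_before(a, b) and not happened_before(b, a)


a1 = tick({}, "A")          # an event on node A
b1 = tick({}, "B")          # an independent event on node B
b2 = receive(b1, a1, "B")   # B receives a message that A sent after a1
```

Here `a1` and `b1` are concurrent (neither clock dominates the other), while both are ordered before `b2`. No wall-clock time is needed to establish any of that.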
I actually don't think that model is general enough. Like, I think Goodharting is just a fact of control systems observing.
Suppose we have a simple control system with output X and a governor G. G takes a measurement m(X) (an observation) of X. So long as m(X) is not error free (and I think we can agree that no real world system can be actually error free), then X=m(X)+ϵ for some error factor ϵ. Since G uses m(X) to regulate the system to change X, we now have error influencing the value of X. Now applying the standard reasoning for Goodhart, in the limit of optimization pressure (i.e. G regulating the value of X for long enough), ϵ comes to dominate the value of X.
This is a bit handwavy, but I'm pretty sure it's true, which means in theory any attempt to optimize for anything will, under enough optimization pressure, become dominated by error, whether that's human values or something else. The only interesting question is whether we can control the error enough, either through better measurement or less optimization pressure, that we get enough signal to be happy with the output.
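Here's a toy sketch of the setup above, in Python, under some strong simplifying assumptions that are mine and not part of the argument: the governor is a plain proportional controller, the error ϵ is a constant bias rather than noise, and the names (`regulate`, `gain`, `bias`) are made up for illustration. It only shows the first step of the argument, that error in m(X) ends up baked into X, and doesn't demonstrate the limiting claim about ϵ dominating.

```python
def regulate(target: float, bias: float, steps: int, gain: float = 0.5):
    """Drive the measurement m(X) toward `target`; return (X, m(X)).

    Assumes X = m(X) + bias, i.e. the sensor reads X minus a fixed error.
    """
    x = 0.0
    for _ in range(steps):
        measured = x - bias                # m(X): all the governor ever sees
        x += gain * (target - measured)    # correct based on the measurement only
    return x, x - bias


x, m = regulate(target=10.0, bias=0.3, steps=60)
print(f"m(X) = {m:.6f}, X = {x:.6f}")
```

The governor converges happily because m(X) sits at the target, but X itself settles at target plus ϵ. From inside the loop there's no signal left to correct, so whatever deviation remains in X is entirely the error.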
I'm fairly pessimistic on our ability to build aligned AI. My take is roughly that it's theoretically impossible and at best we might build AI that is aligned well enough that we don't lose. I've not written one thing to really summarize this or prove it, though.
The source of my take comes from two facts:
Stuart Armstrong has made a case for (2) with his no free lunch theorem. I've not seen anyone formally make the case for (1), though.
Is this something worth trying to prove? That Goodharting is unavoidable and at most we can try to contain its effects?
I'm many years out from doing math full time, so I'm not sure I could make a rigorous proof of it. But this seems to be something people sometimes disagree on (arguing that Goodharting can be overcome), and I think most of those discussions don't get very precise about what that means.
Indeed. I left a comment on the Facebook version of this basically saying "it's all hermeneutics unless you're just directly experiencing the world without conceptions, so worrying about woo specifically is worrying about the wrong frame".
You do say "a lot"/"most", but at least for me this is totally backwards. I only looked at woo type stuff because it was the only place attempting to explain some aspects of my experience. Rationalists were leaving bits of reality on the floor so I had to look elsewhere and then perform hermeneutics to pick out the useful bits (and then try to bring it back to rationalists, with varying degrees of success).