Alex Flint

Independent AI alignment researcher

Sequences

The accumulation of knowledge

Comments

Thanks for writing this.

Alignment research has a track record of being a long, slow slog. It seems that what we're looking for is a kind of insight that is just very, very hard to see, and the people who have made real progress seem to have done so through long periods of staring at the problem.

With your two-week research sprints, how do you decide what to work on for a given sprint?

Well, suffering is a real thing, like bread or stones. It's not a word that refers to a term in anyone's utility function, although it's of course possible to formulate utility functions that refer to it.

The direct information I'm aware of is (1) CZ's tweets about not acquiring, (2) SBF's own tweets yesterday, (3) the leaked P&L doc from Alameda. I don't think any of these are sufficient to decide "SBF committed fraud" or "SBF did something unethical". Perhaps there is additional information that I haven't seen, though.

(I do think that if SBF committed fraud, then he did something unethical.)

If you view people as Machiavellian actors using models to pursue goals, then you will eventually find social interactions bewildering and terrifying, because there actually is no way to discern honesty or kindness or good intention if you start from the view that each person is ultimately pursuing some kind of goal in an ends-justify-the-means way.

But neither does it really make sense to say "hey let's give everyone the benefit of the doubt because then such-and-such".

I think in the end you have to find a way to trust something that is not the particular beliefs or goals of a person.

In Buddhist ideology, the reason to pick one set of values over another is to find an end to suffering. The Buddha claimed that certain values tended to lead towards the end of suffering and other values tended to lead in the opposite direction. He recommended that people check this claim for themselves.

In this way values are seen as instrumental rather than fundamental in Buddhism -- that is, Buddhists pick values on the basis of the consequences of holding those values, rather than any fundamental rightness of the values themselves.

Now you may say that the "end of suffering" is itself a value; that there is nothing special about the end of suffering except to one who happens to value it. If you take this perspective then you're essentially saying: there is nothing objectively worthwhile in life, only things that certain people happen to value. But if this were true, then you'd expect to be able to go through life and see that each seemingly-worthwhile thing is not intrinsically worthwhile, but only worthwhile from a certain parochial perspective. Is that really the case?

There's mounting evidence that FTX was engaged in theft/fraud, which would be straightforwardly unethical.

I think it's way too early to decide anything remotely like that. As far as I understand, we have a single leaked balance sheet from Alameda and a handful of tweets from CZ (CEO of Binance), who presumably got to look at some aspect of FTX internals when deciding whether to acquire. Do we have any other real information?

I'm curious about this too. I actually have the sense that overall funding for AI alignment already exceeded the supply of shovel-ready projects before FTX was involved. This is normal and expected in a field that many people believe is working on an important problem, but where most of what needs funding is research, and where hardly anyone has promising, scalable uses for money.

I think this led to a lot of prizes being announced. A prize is a good way to deploy funding if you don't see enough shovel-ready projects to exhaust it: you offer prizes to anyone who can formulate and execute new projects, thereby enticing people who weren't previously working on the problem to start working on it. This is a pretty good approach IMO.

With the collapse of FTX, I guess a bunch of prizes will go away.

What else? I'm interested.

Regarding your point on ELK: to make the output of the opaque machine learning system counterfactable, wouldn't it be sufficient to include the whole program trace? By "program trace" I mean the results of all the intermediate computations performed along the way. Yet including a program trace wouldn't help us much if we don't know what function of that trace will tell us, for example, whether the machine learning system is deliberately deceiving us.

So yes, it's necessary to have an information set that includes the relevant information, but isn't the main part of the (ELK) problem to determine what function of that information corresponds to the particular latent variable we're looking for?
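To make "program trace" concrete, here is a minimal sketch in Python (the toy model and all names here are hypothetical illustrations, not anything from the ELK report): every intermediate value gets recorded, but nothing in the trace itself identifies which function of those values corresponds to a latent like deception.

```python
import numpy as np

def toy_model_with_trace(x, weights):
    """Run a tiny feed-forward model and record every intermediate computation."""
    trace = [("input", x)]
    h = x
    for i, w in enumerate(weights):
        pre = h @ w                      # intermediate result
        trace.append((f"preactivation_{i}", pre))
        h = np.maximum(pre, 0.0)         # ReLU nonlinearity
        trace.append((f"activation_{i}", h))
    trace.append(("output", h))
    return h, trace

# The trace contains every intermediate value computed along the way, but
# nothing in it labels which function of these values corresponds to a latent
# such as "the model is deceiving us" -- finding that mapping is the hard part.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 2))]
output, trace = toy_model_with_trace(rng.normal(size=(4,)), weights)
for name, value in trace:
    print(name, value.shape)
```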

If I understand you correctly, the reason that this notion of "counterfactable" connects with what we normally call a counterfactual is that when an event screens off its own history, it's easy to consider other "values" of the "variable" underlying that event without running into any logical contradictions with the other events ("values of other variables") that we're holding fixed.

For example, if I try to consider what would have happened if there had been a snow storm in Vermont last night, while holding fixed the particular weather patterns observed in Vermont and the surrounding areas on the preceding day, then I'm in kind of a tricky spot: on the one hand I'm holding fixed the weather patterns from the previous day (which did not in fact give rise to a snow storm in Vermont last night), and yet I'm also trying to "consider" a snow storm in Vermont. The closer I look into this, the more confused I'm going to get, and in the end I'll find that the notion of "considering that a snow storm took place in Vermont last night" is a bit ill-defined.

What I would like to say is: let's consider a snow storm in Vermont last night; in order to do that let's forget everything that would mess with that consideration.

My question for you is: in the world we live in, the full causal history of any real event contains almost the whole history of Earth from the time of the event backwards, because the Earth is so small relative to the speed of light, and everything that could have interacted with the event is part of the history of that event. So in practice, won't all counterfactable events need to be more or less a full specification of the whole state of the world at a certain point in time?

I expect you could build a system like this that reliably runs around and tidies your house, say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn't have any reflexes that lead to pondering self-improvement in this way).

I agree, but if there is any kind of evolutionary variation in the thing then surely the variations that move towards stronger goal-directedness will be favored.

I think that overcoming this Molochian dynamic is the alignment problem: how do you build a powerful system that carefully balances itself and the whole world in such a way that it does not slip down the evolutionary slope towards pursuing psychopathic goals by any means necessary?

I think this balancing is possible, it's just not the default attractor, and the default attractor seems to have a huge basin.
