LessWrong dev & admin as of July 5th, 2022.
At a high level, I'm sort of confused about why you're choosing to respond to the extremely simplified presentation of Eliezer's arguments that he presented in this podcast.
I do also have some object-level thoughts.
When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms. E.g., I expect that we can apply current alignment techniques such as reinforcement learning from human feedback (RLHF) to evolved architectures.
But not only do current implementations of RLHF fail to robustly enforce the desired external behavior of models in the way that would be necessary to make versions scaled up to superintelligence safe, we also have approximately no idea what sort of internal cognition they generate as a pathway to those behaviors. (I have a further objection to your argument about dimensionality, which I'll address below.)
However, I think such issues largely fall under "ordinary engineering challenges", not "we made too many capabilities advances, and now all our alignment techniques are totally useless". I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.
But they don't need to completely break the previous generations' alignment techniques (assuming those techniques were, in fact, even sufficient in the previous generation) for things to turn out badly. For this to be comforting you need to argue against the disjunctive nature of the "pessimistic" arguments, or else rebut each one individually.
The manifold of mind designs is thus:
- Vastly more compact than mind design space itself.
- More similar to humans than you'd expect.
- Less differentiated by learning process detail (architecture, optimizer, etc), as compared to data content, since learning processes are much simpler than data.
This can all be true, while still leaving the manifold of "likely" mind designs vastly larger than "basically human". But even if that turned out to not be the case, I don't think it matters, since the relevant difference (for the point he's making) is not the architecture but the values embedded in it.
It also assumes that the orthogonality thesis should hold with respect to alignment techniques - that such techniques should be equally capable of aligning models to any possible objective.
This seems clearly false in the case of deep learning, where progress on instilling any particular behavioral tendencies in models roughly follows the amount of available data that demonstrate said behavioral tendency. It's thus vastly easier to align models to goals where we have many examples of people executing said goals.
The difficulty he's referring to is not one of implementing a known alignment technique to target a goal with no existing examples of success (generating a molecularly-identical strawberry), but of devising an alignment technique (or several) which will work at all. I think you're taking for granted premises that Eliezer disagrees with (model value formation being similar to human value formation, and/or RLHF "working" in a meaningful way), and then saying that, assuming those are true, Eliezer's conclusions don't follow? Which, I mean, sure, maybe, but... is not an actual argument that attacks the disagreement.
As far as I can tell, the answer is: don't reward your AIs for taking bad actions.
As you say later, this doesn't seem trivial, since our current paradigm for SotA basically doesn't allow for this by construction. Earlier paradigms which at least in principle[1] allowed for it, like supervised learning, have been abandoned because they don't scale nearly as well. (This seems like some evidence against your earlier claim that "When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms.")
As it happens, I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function. In fact, I think that this almost never happens.
I would be surprised if Eliezer thinks that this is what happens, given that he often uses evolution as an existence proof that this exact thing doesn't happen by default.
I may come back with more object-level thoughts later. I also think this skips over many other reasons for pessimism which feel like they ought to apply even under your models, e.g. "will the org that gets there even bother doing the thing correctly" (& others laid out in Ray's recent post on organizational failure modes). But for now, some positives (not remotely comprehensive):
Though obviously not in practice, since humans will still make mistakes, will fail to anticipate many possible directions of generalization, etc, etc.
I don't understand what part of my comment this is meant to be replying to. Is the claim that modern consumer software isn't extremely buggy because customers have a preference for less buggy software, and therefore will strongly prefer providers of less buggy software?
This model doesn't capture much of the relevant detail:
But also, you could just check whether software has bugs in real life, instead of attempting to derive it from that model (which would give you bad results anyways).
Having both used and written quite a lot of software, I am sorry to tell you that it has a lot of bugs across nearly all domains, and that decisions about whether to fix bugs are only ever driven by revenue considerations to the extent that the company can measure the impact of any given bug in a straightforward enough manner. Tech companies are more likely to catch bugs in payment and user registration flows, because those tend to be closely monitored, but coverage elsewhere can be extremely spotty (and bugs definitely slip through in payment and user registration flows too).
But, ultimately, this seems irrelevant to the point I was making, since I don't really expect an unaligned superintelligence to, what, cause company revenues to dip by behaving badly before it's succeeded in its takeover attempt?
We don't need superintelligence to explain why a person or organization training a model on some new architecture would either fail to notice its growth in capabilities, or stop it if they did notice:
We don't currently live in a world where we have any idea of the capabilities of the models we're training, either before, during, or even for a while after their training. Models are not even robustly tested before deployment,[1] not that this would necessarily make it safe to test them after training (or even train them past a certain point). This is not an accurate representation of reality, even with respect to traditional software, which is much easier to inspect, test, and debug than the outputs of modern ML:
like most all computer systems today, very well tested to assure that its behavior was aligned well with its owners’ goals across its domains of usage
As a rule, this doesn't happen! There are a very small number of exceptions where testing is rather more rigorous (chip design, medical & aerospace stuff, etc.), but even in those domains there is a constant stream of software failures, and we cannot easily apply most of the useful testing techniques used by those fields (such as fuzzing & property-based testing) to ML models.
Bing.
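To make the contrast concrete, here's a minimal sketch (using the `hypothesis` library purely as an illustration; none of this is from the post) of the kind of property-based test that's cheap to write for ordinary software. The point is that the analogous property for an ML model mostly can't be written down as a checkable predicate.

```python
# Illustrative property-based test for ordinary software, using the
# `hypothesis` library. The property ("sorting produces an ordered list
# and is idempotent") is easy to state and check over many generated inputs.
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorted_is_ordered_and_idempotent(xs):
    ys = sorted(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))  # output is ordered
    assert sorted(ys) == ys                         # sorting again changes nothing

# For an ML model, the property we actually care about ("behaves acceptably
# across its whole input domain") has no comparably checkable formulation,
# which is why this style of testing doesn't transfer cleanly.
```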
Fixed the image. The table of contents relies on some hacky heuristics that include checking for bolded text and I'm not totally sure what's going wrong there.
There are people working in the field who I would say are like, sort of like unabashedly good, like, [??], is taking a microscope to these giant inscrutable matrices and trying to figure out what goes on inside there.
This would be Chris Olah, but I don't know if it came through in the audio.
The missing examples are for claims of the form:
The Rationalists repeatedly rely upon sparse evidence, while claiming certainty
They have self-selected for a community of people who call Bayes the be-all-end-all, all of them agreeing they’re right, and they don’t know that they’re horribly wrong… because they don’t check!
...then you DON’T know the be-all-end-all statistical technique — and neither do Scott Alexander or Eliezer Yudkowski, as much as they’d like you to believe otherwise.
I would not be surprised if some random "rationalist" you ran into somewhere was sloppy or imprecise with their usage of Bayes. I would also not be surprised if you misinterpreted some offhand comment as an unjustified claim to statistical rigor. Maybe it was some third, other thing.
As an aside, all the ways in which you claim that Bayes is wrong are... wrong? Applications of the theorem give you wrong results insofar as the inputs are wrong, which in real life is ~always, and yet the same is true of the techniques you mention (which, notably, rely on Bayes). There is always the question of what tool is best for a given job, and here we circle back to the question of where exactly this grievous misuse of Bayes is occurring.
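To spell out the "wrong inputs, wrong outputs" point with a toy example (the numbers below are made up purely for illustration): the arithmetic of the theorem is never the issue; the prior and likelihoods you feed it are.

```python
# Toy illustration with made-up numbers: a diagnostic test with 99%
# sensitivity and 95% specificity. Bayes' theorem is applied correctly
# in both calls; only the assumed prior (base rate) differs.
def posterior(prior, sensitivity, specificity):
    """P(condition | positive test) via Bayes' theorem."""
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

print(posterior(prior=0.01, sensitivity=0.99, specificity=0.95))  # ~0.17 with the correct base rate
print(posterior(prior=0.30, sensitivity=0.99, specificity=0.95))  # ~0.89, same theorem, bad prior
```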
It seems like you want to rate-limit me for an unspecified duration? What are the empirical metrics for that rate-limit being removed? And, the fact that you claim I "didn't provide specific, uncontroversial examples," when I just showed you those specifics again here, implies that you either weren't reading everything very carefully, or you want to mischaracterize me to silence any opposition of your preferred technique: Bayes'-Theorem-by-itself.
Deeply uncharitable interpretations of others' motives are not something we especially tolerate on LessWrong.
Just as a data point, the impression I got with respect to DeepMind was that they'd approved the conversation (contra some other orgs, for which the post said otherwise) and the review was in progress.
That's definitely one of the problems with this post, and while rudeness is generally undesirable it's slightly more forgivable when there's some evidence of the thing that "justifies" it.
This post, and many of @AnthonyRepetto's subsequent replies to comments on it, seem to be attacking a position that the named individuals don't hold, while stridently throwing out a bunch of weird accusations and deeply underspecified claims. "Bayes is persistently wrong" - about what, exactly?
Content like this should include specific, uncontroversial examples of all the claimed intellectual bankruptcy, and not include a bunch of random (and wrong) snipes.
I'm rate-limiting your ability to comment to once per day. You may consider this a warning; if the quality of your argumentation doesn't improve then you will no longer be welcome to post on the site.
I was also looking to do alignment-focused work remotely, and then, while failing to find any appropriate[1] opportunities, had a bit of a wake-up call which led to me changing my mind.
From the "inside", there are some pretty compelling considerations for avoiding remote work.
"Context is that which is scarce" - the less "shovel-ready" the work is, the more important it is to have very high bandwidth communication. I liked remote work at my last job because I was working at a tech company where we had quarterly planning cycles and projects were structured in a way such that everyone working remotely barely made a difference, most of the time. (There were a couple projects near the end where it was clearly a significant drag on our ability to make forward progress, due to the increasing number of stakeholders, and the difficulty of coordinating everything).
LessWrong is a three-person[2] team, and if we spent basically all of our time developing features the way mature tech companies do, we could probably also be remote with maybe only a 30-40% performance penalty. But in fact a good chunk of our effort goes into attempting to backchain from "solve the alignment problem/end the acute risk period" into "what should we actually be doing". This often does involve working on LessWrong, but not 100% of the time. As an example, we're currently in the middle of a two-week "alignment sprint", where we're spending most of our time diving into object-level research. To say that this style of work[3] benefits from co-location would be understating things.
Now, I do think that LessWrong is on the far end of the spectrum here, but I think this is substantially true for most alignment orgs, given that they tend to be smaller and working in a domain that's both extremely high context and also fairly pre-paradigmatic. In general, coordination and management capacity are severely constrained, and remote work is at its best when you need less coordination effort to achieve good outcomes.
Ones where I had some reasonable model of their theory of change, and where I expected I would be happy with day-to-day work itself.
Sort of. It's complicated.
Including the ability to pivot on relatively short notice.