A Confused Chemist's Review of AlphaFold 2

J Bostock

(This article was originally going to be titled "A Chemist's Review of AlphaFold 2")

Most of the protein chemists I know have a dismissive view of AlphaFold. Their criticisms generally come down to accusations of "pattern matching". I wanted to address these criticisms, and have found a couple of concerns of my own.

The main method for assessment of AlphaFold 2 has been the Critical Assessment of protein Structure Prediction (CASP). This is a competition based on a set of protein structures which have been determined by established experimental methods, but deliberately held back from publication. Entrant algorithms then attempt to predict each structure from the amino acid sequence alone. AlphaFold 2 did much better than any other entrant in 2020, scoring 244 compared to the second place entrant's 91 by CASP's scoring method.

The first thing that struck me during my investigation is how large AlphaFold is, in terms of disk space. On top of neural network weights, it has a 2.2 TB protein structure database. A model which does ab initio calculations, i.e. a simulation of the protein based on physical and chemical principles, will be much smaller. For example Rosetta, a leading ab initio software package, recommends 1 GB of working memory per processor in use while running, and gives no warnings at all about the file size of the program itself.

DeepMind has an explicit goal of replacing crystallography as a method for determining protein structure. Almost all crystallography is carried out on naturally occurring proteins isolated from organisms under study. This means the proteins are products of evolution, which generally conserves protein structure as a means of conserving function. Predicting the structure of an evolved protein is a subtly different problem to predicting the structure of a sequence of random amino acids. For this purpose AlphaFold 2 is doing an excellent job.

On the other hand, I have a few nagging doubts about how exactly DeepMind are going about solving the protein folding problem. Whether these are a result of my own biases or not is unclear to me. I am certainly sympathetic to the concerns of my peers that something is missing from AlphaFold 2.

Representations

One of the core elements is that the representation(s) of protein structure is fed through the same network(s) multiple times. This is referred to as "recycling" in the paper, and it makes sense. What's interesting is that there are multiple layers which seem to refine the structure in completely different ways.

Some of these updates act on the "pair representation", which is pretty much a bunch of distances between amino acid residues (amino acids in proteins are called residues). Well it's not that, but it's not not that. I think it's best thought of as a sort of "affinity" or "interaction" between residues, which is over time refined to be constrained to 3D space.
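To make the "bunch of distances" picture concrete, here is a minimal sketch of a pairwise distance matrix. This is only the simplest reading of the idea: as far as I can tell the real pair representation holds a learned feature vector for each pair of residues rather than a single distance. The coordinates and the function name are made up for illustration.

```python
import math

def pairwise_distances(coords):
    """Return an n x n matrix of Euclidean distances between 3D points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

# Hypothetical coordinates (in angstroms) for three residues; not real data.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
dist_matrix = pairwise_distances(toy_coords)

for row in dist_matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, and a structure with n residues gives n squared entries, which is why a per-pair representation gets large quickly.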

There is also a separate representation called the "multiple sequence alignment (MSA) representation" which is not a bunch of distances between residues.

The MSA representation (roughly) starts with finding a bunch of proteins with a similar amino acid sequence to the input sequence. The search space for this is the 2.2 TB of data. Then it comes up with some representation, with each residue of our input protein being assigned some number relating it to a protein which looks like the input protein. To be honest I don't really understand exactly what this representation encodes and I can't find a good explanation. I think it somehow encodes two things, although I can't confirm this as I don't know much about the actual data structure involved.

Thing 1 is that the input protein probably has a structure similar to these proteins. This is sort of a reasonable expectation in general, but makes even more sense from an evolutionary perspective. Mutations which disrupt the structure of a protein significantly usually break its function and die out.

Thing 2 is a sort of site correlation. If the proteins all have the same-ish structure, then we can look for correlations between residues. Imagine if we saw that when the 15th residue is positively charged, the 56th one is always negatively charged, and vice versa. This would give us information that they're close to one another.
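The site correlation idea can be sketched with a classical statistic: the mutual information between two columns of an alignment. I am not claiming this is what AlphaFold computes (it presumably learns something richer from the MSA); it is just the simplest way I know to put a number on "these two positions vary together". The toy alignment below is invented. K (lysine) is positively charged and D (aspartate) is negatively charged.

```python
import math
from collections import Counter

def column(msa, index):
    """Extract one alignment column as a list of residue letters."""
    return [seq[index] for seq in msa]

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), joint in count_ab.items():
        p_ab = joint / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Invented alignment: positions 0 and 3 always carry opposite charges,
# while position 2 varies independently of position 0.
toy_msa = [
    "KAGD",
    "DAGK",
    "KALD",
    "DALK",
]

coupled = mutual_information(column(toy_msa, 0), column(toy_msa, 3))
uncoupled = mutual_information(column(toy_msa, 0), column(toy_msa, 2))
print(coupled, uncoupled)
```

Positions 0 and 3 share one full bit of information, while positions 0 and 2 share none. A high score hints that two residues are in contact, although with so few sequences a real analysis would need corrections for sampling noise and shared ancestry.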

Evoformer Module

The first bunch of changes to the representations comes from the "evoformer", which seems to be the workhorse of AlphaFold. This is the part that sets it apart from ordinary simulations. A bunch of these modules sequentially update the representations.

The first few transformations are the two representations interacting to exchange information. This makes sense as something to do and I'm not particularly sure I can interpret it any more than "neural network magic". The MSA representation isn't modified any further and is passed forwards to the next evoformer run.

The pair representation is updated based on some outer product with the MSA representation then continues.
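My reading of the outer product step is roughly this: for a pair of residues i and j, take the feature vector at position i and the one at position j in each aligned sequence, form their outer product, and average over the sequences. Below is a minimal sketch under that assumption. It leaves out the learned linear projections that I believe sit before and after this step in the real network, and the feature values and function name are made up.

```python
def outer_product_mean(msa_features, i, j):
    """Average over sequences of the outer product of the feature
    vectors at residue positions i and j.

    msa_features is indexed as [sequence][residue][channel].
    """
    n_seq = len(msa_features)
    channels = len(msa_features[0][i])
    total = [[0.0] * channels for _ in range(channels)]
    for seq_features in msa_features:
        a = seq_features[i]
        b = seq_features[j]
        for p in range(channels):
            for q in range(channels):
                total[p][q] += a[p] * b[q]
    return [[value / n_seq for value in row] for row in total]

# Invented features: 2 sequences, 2 residues, 2 channels.
toy_features = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 1.0], [1.0, 0.0]],
]
update_01 = outer_product_mean(toy_features, 0, 1)
print(update_01)
```

The result is one small matrix per residue pair, which is the right shape to add to a per-pair representation. It also behaves like a correlation: it is large when the features at i and j tend to be large in the same sequences.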

The next stages are a few constraints relating to 3D Euclidean space being enforced on the pair representation. Again not much commentary here. All this stuff is applied 48 times in sequence, but there are no shared weights between the iterations of the evoformer. Only the overall structure is the same.
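The kind of constraint I have in mind is the triangle inequality: if the pair representation is going to correspond to points in 3D space, the distance from i to j can never exceed the distance from i to k plus the distance from k to j. The sketch below is a hard check of that rule on a distance matrix. The evoformer does nothing this literal as far as I can tell (its updates are learned and soft); this only illustrates the consistency being encouraged. Both matrices are invented.

```python
def triangle_violations(dist, tolerance=1e-9):
    """List (i, j, k) triples where dist[i][j] > dist[i][k] + dist[k][j]."""
    n = len(dist)
    violations = []
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k == i or k == j:
                    continue
                if dist[i][j] > dist[i][k] + dist[k][j] + tolerance:
                    violations.append((i, j, k))
    return violations

# Distances that three points in space could really have.
consistent = [
    [0.0, 5.0, 13.0],
    [5.0, 0.0, 12.0],
    [13.0, 12.0, 0.0],
]
# Residues 0 and 2 are claimed to be 20 apart, yet both are close to residue 1.
inconsistent = [
    [0.0, 5.0, 20.0],
    [5.0, 0.0, 12.0],
    [20.0, 12.0, 0.0],
]

print(triangle_violations(consistent))
print(triangle_violations(inconsistent))
```

Passing this check is necessary but not sufficient for a set of distances to fit in 3D space, which may be part of why a separate module is still needed to place the atoms.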

Structure Module and AMBER

This section explicitly considers the atoms in 3D space. One of the things it does is use a "residue gas" model which treats each residue as a free-floating molecule. This is an interesting way of doing things, as it allows all parts of the protein to be updated at once without dealing with loops in the structure. Then a later module applies a constraint that they have to be joined into a chain.

They also use the AMBER force field (which is a simulation of the atoms based on chemical principles) to "relax" the protein structure at some points. This does not improve accuracy by the atom-to-atom distance measures, but it does remove physically impossible overlaps between atoms. The authors describe these as "distracting" but strongly imply that the AMBER part isn't very important.
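To show what an "impossible" structure means in practice, here is a minimal sketch that flags pairs of atoms sitting too close together. It only detects overlaps; actual relaxation minimises an energy under a force field to fix them, which this does not attempt. The 2.0 angstrom cutoff is an arbitrary value I picked for illustration, and a real check would have to exclude covalently bonded atoms, which legitimately sit closer than that.

```python
import math
from itertools import combinations

def find_clashes(coords, min_distance=2.0):
    """Return index pairs of atoms closer than min_distance (angstroms).

    min_distance is an illustrative assumption, not a value from the paper.
    """
    clashes = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) < min_distance:
            clashes.append((i, j))
    return clashes

# Invented coordinates: atoms 0 and 1 nearly sit on top of each other.
toy_atoms = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (4.0, 0.0, 0.0)]
clashes = find_clashes(toy_atoms)
print(clashes)
```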

Attention

I think this is what gives AlphaFold a lot of its edge, and unfortunately I don't understand it all that well. It's very similar to human attention in that the network first does some computations to decide where to look, then does more computations to make changes in that region. This is much better (I think) than a "mechanical" model which simulates the protein atom by atom, and devotes equal amounts of computation to each step.
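The "decide where to look, then act on it" pattern can be sketched with generic scaled dot-product attention, the standard building block of transformer networks. This is the textbook version for a single query, not AlphaFold's own variants, which as I understand it add extra structure on top. The vectors are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Step 1: decide where to look, by scoring each key against the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Step 2: gather information, weighted by where we decided to look.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Invented example: the query matches the first key far better than the others.
toy_query = [1.0, 0.0]
toy_keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
toy_values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights, output = attend(toy_query, toy_keys, toy_values)
print(weights, output)
```

Nearly all of the weight lands on the first position, so the output is almost entirely the first value vector. The other positions are still computed but contribute very little, which is the sense in which the network concentrates on one region.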

Thoughts and Conclusions

The second placing team in CASP14 was the Baker group, who also used an approach based on neural networks and protein databases. Knowing this, it doesn't surprise me much that DeepMind were able to outperform them, given the differences in resources (both human and technical) available to them. Perhaps this is a corollary of the "bitter lesson": perhaps computation-specialized groups will always eventually outperform domain-specialized groups.

I do have two concerns though:

My first concern is that I strongly suspect that the database-heavy approach introduces certain biases, in the no-free-lunch sense of the word. The set of proteins available has been filtered in two ways: first by evolution, and second by ease of analysis.

Evolved proteins are not subject to random changes, only changes which allow the organism to survive. Mutations which significantly destabilize or change the structure of the protein are unlikely to be advantageous and outcompete the existing protein. CASP14 seems to be the prime source of validation for the model. This consists entirely of evolved proteins, so does not provide any evidence of performance against non-evolved (i.e. engineered) proteins. This strongly limits the usage of AlphaFold for protein engineering and design.

Secondly, not all proteins can be crystallized easily, or even at all. Also, some proteins only take on defined structure (and can only be crystallized) when bound to another molecule. DeepMind are working on functional predictions of small molecules binding to proteins, but including bound non-protein molecules in their structural predictions is outside the current scope of AlphaFold.

Both of these cases are much rarer than the typical case of protein crystallography, which is the main aim of AlphaFold. For most protein researchers, particularly medical researchers, I suspect that the trade-off of using the database approach is worth it.

My second concern is more nebulous, and relates to their usage of AMBER. This feels like they're outsourcing their low-level physical modelling.

This sort of thing is actually quite common when analysing crystallography results, which often require multiple rounds of refinement to get from crystallographic data to a sensible (read: physically possible, without atoms overlapping each other) structure. However the "first-draft" output of crystallographic data is basically just a fit to the output of a 3D Fourier transform on some potentially noisy x-ray scattering data.

This somehow still feels different to AlphaFold. If lots of their neural network layers are outputting "impossible" structures with atoms overlapping then it suggests those layers have failed to learn something about the physical properties of the world. I'm not sure whether or not this will turn out to be important in the end. It may be that I just have some irrational unease about models which learn in a very different way to human minds, learning complicated rules before simple ones.

AlphaFold will only grow in accuracy and scope. I suspect it will eventually overcome its current limitations.

Sources and Thanks:

Many thanks to the LessWrong mod team, and their feedback system. This could not have been written without the feedback I received on an early draft of the article.

The actual paper: https://www.nature.com/articles/s41586-021-03819-2

CASP14: https://predictioncenter.org/casp14/

OPIG who have a much more in-depth analysis of the mechanics of AlphaFold 2: https://www.blopig.com/blog/2021/07/alphafold-2-is-here-whats-behind-the-structure-prediction-miracle/
