[Linkpost] AlphaFold: a solution to a 50-year-old grand challenge in biology

by adamShimi · 1 min read · 30th Nov 2020 · 23 comments

Biology, Machine Learning, World Modeling, AI

To my eyes, this looks like the sort of useful advance in applying AI that doesn't really improve capabilities, and thus is just net positive, even judged by a safety mindset. But I'm curious to know if other people think differently.

(Note that it might be easier to discuss this after the paper is posted online)

I can't wait to see how it works when you apply it to orphan proteins that don't have any evolutionary relatives in the training dataset.  At least some of this efficacy probably comes from effectively encoding an ability to see very deep evolutionary homology hidden in sequences, and variations around ancient motifs.

I also wonder if the system can be reversed, such that you give it a 3-dimensional backbone arrangement and it dreams up a sequence to fold into it.

 

EDIT:  See this writeup https://www.blopig.com/blog/2020/12/casp14-what-google-deepminds-alphafold-2-really-achieved-and-what-it-means-for-protein-folding-biology-and-bioinformatics/

Cool ideas!

I'm especially curious about the second one. Is there any known situation where we have a 3-dimensional arrangement but not the sequence? I can think of two possibilities: either we have an existing protein for which we know the structure but not the sequence (which from my modest understanding looks improbable, because sequence seems easier) or we have an application (medical for example) that needs a specific kind of structure, and we want to know how to create such a protein.

Are these what you had in mind?

I would mostly be thinking of engineering novel proteins.  That field is pretty rudimentary (though the work of Michael Hecht at Princeton fascinates me).

This could also lead to an interesting race dynamic between biotech/pharma companies in the near future. If a novel disease target protein is identified and DeepMind decides to license their technology, all companies would in theory have access to the same structural information at the same time.  Then it becomes a matter of who can execute screening, medicinal chemistry optimization, and clinical evaluation the fastest.  Having a strategy in place to be the first to obtain patents for chemical matter modulating that target protein would also be a large advantage in this kind of situation.

See this writeup: https://www.blopig.com/blog/2020/12/casp14-what-google-deepminds-alphafold-2-really-achieved-and-what-it-means-for-protein-folding-biology-and-bioinformatics/

I don’t know if there are too many tech advances that are unqualified “good” from an X risk perspective. In this case, any advances in bioengineering might make it easier to create bioweapons, for example. Any advances in AI create more demand for AI...

Fair enough. My idea was focused on AI existential risk; from that perspective, it seems to me that this result doesn't directly increase the existential risk from AI in the way that GPT-3 does, for example. But the effect of pushing more people into the field is probably a real issue.

Prediction: this won't make much difference for either biology or medicine in general. The one big thing it will do is cause funding agencies to stop wasting so much money on protein structure studies (assuming that AlphaFold's results generalize beyond this particular challenge, which I'm uncertain about). The whole field of structural biology was 95% useless anyway.

It is an interesting result from the AI angle, though.

That's an interesting take.

Do you have a simple explanation of why you consider structural biology useless? My outside-view impression was that protein shape and folding were really important to understanding how proteins work. Isn't that useful in practice?

We mainly want to know (a) what reactions a protein is involved in, and (b) the rate constants on those reactions. In practice, protein shape tells us very little about either of those without extensive additional simulation. (It can give some hints as to what broad classes of reaction the protein might be involved in, but my understanding is that we can get most of those same hints from the sequence alone.)

In principle, folded protein structures could be used as an input to those sorts of simulations, but the simulation is expensive in much the same way as the folding problem itself, and as far as I know the cutting edge in simulation still can't provide precision or speed comparable to high-throughput assays (even given folded structures).

In gears terms: everything we care about in a high-dimensional protein structure is summarized by low-dimensional reaction rates, so proteins make really good gears. A practical consequence is that directly measuring reaction rates is way more efficient than simulating all the low-level activity. There are things that approach can't handle - e.g. we don't know how a change to the protein will change reaction rates - but even with protein folding "solved", simulation isn't at the point where it can make those predictions faster and more precisely than a new experiment.
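One way to make the "good gears" point concrete is a minimal sketch, assuming the textbook Michaelis-Menten rate law (the comment above does not name a specific rate law, so this choice is an assumption). The function name and the example numbers are hypothetical. The point it illustrates is that a downstream model only needs two measured constants per enzyme, not the 3-D structure.

```python
def michaelis_menten_rate(substrate: float, enzyme: float,
                          k_cat: float, k_m: float) -> float:
    """Reaction rate under the Michaelis-Menten rate law (assumed here).

    substrate, enzyme: concentrations (same units as k_m)
    k_cat: turnover number (1/time)
    k_m:   Michaelis constant (concentration)
    """
    return k_cat * enzyme * substrate / (k_m + substrate)


# Hypothetical constants, as if read off an assay rather than a structure.
K_CAT = 10.0
K_M = 2.0

# Sweep substrate concentration; the structure never enters the calculation.
for s in (0.5, 2.0, 20.0, 200.0):
    rate = michaelis_menten_rate(s, enzyme=1.0, k_cat=K_CAT, k_m=K_M)
    print(f"[S]={s:>6}: rate={rate:.3f}")
```

What the sketch cannot do is exactly the limitation noted above: if the protein is changed, `K_CAT` and `K_M` have to be measured or predicted again.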

Does knowing the structure of a protein help with simulating how it responds to any arbitrary/unknown protein/molecule/agonist/antagonist/superagonist? [it seems that even with all the protein structures that we do know well, that finding appropriate agonists of the protein with the desired action is still a huge unsolved problem]. Is simulation a much more difficult problem than "folding"?

This allows us to design "efficient" proteins. Proteins designed "intelligently" tend to be smaller and less "messy" and "bulky" than naturally evolved proteins [which also cross over at the most pedagogically unhelpful sites ever]. With protein folding solved, it may be easier to design proteins that are less complicated and more amenable to simulation than the natural set of proteins. It may even be possible to find a specific transferase that can precisely add a methyl or carboxyl group to any molecule at any location, or a lyase that can split a molecule at any arbitrary location. We may also be able to design them based on properties like how easy it is to introduce them into the cell via mRNA. The genes for many natural proteins are not easy to introduce into the cell via CRISPR or AAV, but because protein design space is so large, you can probably design another protein that carries out the same function and can be delivered into cells via mRNA or CMV-based vectors, without needing to force the corresponding gene into the right location in the cell's nucleus.

Anyhow, designing proteins for industrial chemistry (e.g. to properly degrade polyethylene plastics in the ocean), or proteins with a specific physical property rather than a very specific function, is a much easier problem than, say, figuring out how to make an extremely particular histone acetyltransferase, DNA methyltransferase, or chaperone enzyme localize/diffuse to the locations where it can precisely do the right things at {X} sites and not do the wrong things at the {Y} other sites. [Such enzymes often sit at the center of hub networks, and their messiness evolved out of the need for extremely precise interactions with other proteins that have also evolved into messy bloated behemoths.]

Also, this helps us develop a "periodic table of protein function", where you can design proteins that carry out function X if you change certain motifs in them, and it will turn out much cleaner, more organizable, and more predictable than the natural super-messy [and hard to organize] set of protein motifs we find in the wild. I think this is especially relevant for manufacturing and industrial chemistry - proteins that broadly carry out functions sort of similar to zymogen.

The whole field of structural biology was 95% useless anyway.

As long as it produces machine-interpretable output, it's useful for training new algorithms, even if the vast majority of humans are unable to properly interpret protein structure.

^Anyhow, this post was replying to the idealized version. Protein folding is still far from solved, as https://twitter.com/mctucsf/status/1333447404910112768 explains. It's an exciting advance to be sure. I think this allows us to better figure out what a stable system of ultrastructural scaffolds is first before figuring out what precise things can be built USING those ultrastructural scaffolds.

I disagree with your assessment that structural biology is useless.  Knowing the shape of a protein can be pretty important if you want to perturb the protein's function by, say, finding or creating a small molecule that binds to it.  Crystal structures or cryo-EM structures can shed a lot of light on how a molecule binds to its target, which in turn can suggest further modifications to try and make a tighter binder.  It's not clear to me yet how easy or hard it will be to simulate ligand-protein binding using AlphaFold.  I'd lean toward 'hard' but maybe molecular dynamics simulations would dovetail well with a structure determined by AlphaFold.  

If you have a protein, and you know it's designed to bind to something, but you don't know to what, then maybe running a lot of imprecise simulations (using its folded structure) will allow you to narrow down the list of candidates, and thereby significantly save the time and cost of experiments?

(Not an expert, just guessing)

That is the dream. The reality is harder, and the combinatorics are not friendly.
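To put a rough number on "the combinatorics are not friendly", here is a back-of-the-envelope sketch. Both constants are assumptions for illustration only: a round figure of 20,000 proteins (roughly the number of human protein-coding genes) and a hypothetical budget of 10,000 pairwise binding simulations per day. It counts only unordered pairs and ignores larger complexes, isoforms, and small-molecule partners, all of which make things worse.

```python
from math import comb


def candidate_pairs(n_proteins: int) -> int:
    """Number of unordered protein-protein pairs among n_proteins."""
    return comb(n_proteins, 2)


def screening_days(n_proteins: int, sims_per_day: int) -> float:
    """Days needed to simulate every pair once at a fixed daily budget."""
    return candidate_pairs(n_proteins) / sims_per_day


N_PROTEINS = 20_000    # assumption: round number of proteins to consider
SIMS_PER_DAY = 10_000  # assumption: hypothetical simulation throughput

pairs = candidate_pairs(N_PROTEINS)
days = screening_days(N_PROTEINS, SIMS_PER_DAY)
print(f"{pairs:,} candidate pairs")
print(f"{days:,.0f} days (~{days / 365:.0f} years) at {SIMS_PER_DAY:,} sims/day")
```

Narrowing down to one known protein against all others is far cheaper (about 20,000 simulations under the same assumptions), which is closer to the scenario in the question above.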

In practice, trying to "catch 2 proteins hanging out together" has usually been easier.


The main way we actually check to see if 2 proteins are interacting is... well, this metaphor is fun.

We try to work out which proteins are a couple, by trying to catch the proteins holding hands at the school dance. Either by freezing them, or sticking glue on their hands.

Sometimes even dragging one of them out of the school dance, and then checking to see if the other one tagged along.

Or if you already have a pretty good guess, try just grounding one of them and see if the other one starts acting weird.

I guess this turns the simulation method into "computer-modeling which people are likely to end up in a relationship together" which... seems to capture some of the right intuitions for how hard it is, and how much knowing "they were present in the same place at the same time" matters (whether they had an opportunity to meet in a cell type & cell compartment; something protein-shape doesn't tell you). Watching for hand-holding has typically been easier.


Un-metaphoring: there are multiple variants of this broad class of technique, and there's even a variant of it for DNA-DNA, DNA-protein, or RNA-protein interactions.

Here are some slightly de-metaphored executions:

  • Glue: A chimeric-protein with a sticky-end (and then isolating one of the proteins in a binding column, and checking what else tagged along).
  • Freeze: Chemicals that halt cellular processes and cause semi-random-binding (ideally reversible) of things that happen to be next to each other whenever you took the freeze-frame.
  • Grounding: Here that means either altering, removing, or silencing one protein, to see how it affects the behavior of another.

And of course, whenever you do this, you still have to do: isolating, sequencing, and identifying the batch of proteins you've nabbed.

Yes, I do know the physics involved on some level, and some about the computational methods.

I think that if deep learning can predict protein folding, then it should eventually be able to predict protein binding as well, since most of the physics is the same: it's just amino acids on two different peptide chains interacting, instead of amino acids on the same chain.

On the other hand, predicting which reaction an enzyme catalyzes involves more physics, so it could be much harder: but then again, maybe it isn't. Or maybe we can at least predict with which biomolecules a given protein is likely to react and do experimental work to find out the details.

That's the dream.

I find it hard to believe your prediction that this breakthrough will be insignificant, given what I've read in other reputable sources. I give a pretty high initial credence to the scientific claims of publications like Nature, which had this to say in their article on AlphaFold 2:

The ability to accurately predict protein structures from their amino-acid sequence would be a huge boon to life sciences and medicine. It would vastly accelerate efforts to understand the building blocks of cells and enable quicker and more advanced drug discovery.

reference

This naively seems like it should have a large positive impact on the valuation of gene-editing companies, no? Solving the protein-folding problem means that there is one less mountain standing between gene-editing and true nanotech.

From an AI safety viewpoint, this might greatly increase AI funding and drive talent into the field and so advance when we get a general artificial superintelligence.

Agreed. But that's true for any AI advance. At least this one doesn't seem to directly increase existential risk (from AI at least), and it provides some positive value in the world. So my point is more that if AI advances are unavoidable, I would prefer to see more like this one.

A nice way to see how hard protein folding is: https://i.imgur.com/sYpAQQr.png

For hands-on experience with the difficulties, try https://fold.it/portal/

It's a simple app in which you try manipulating a given protein to achieve the required shapes.

I don't think it's particularly impactful from an X-risk standpoint (at least in terms of first-order consequences), but in terms of timelines I think it represents another update in favor of shorter timelines, in a similar vein to AlphaGo/AlphaZero.