Why AlphaFold Has Not Solved the Protein-Folding Problem.
Paul Nelson
AlphaFold, together with its online database of predicted structures, represents an amazing breakthrough by any measure of the word “breakthrough.” Biology is a much stronger science today for having AlphaFold in its analytical armamentarium.
But the algorithm, powerful as it is, has NOT solved the protein-folding problem, if we take that problem to mean this:
predicting the three-dimensional conformation of a protein strictly from its primary (amino acid) sequence, ab initio.
An analogy to natural language may help. Suppose I give you a character string in English which you’ve never seen before, with no surrounding semantic context, and no corresponding lexicon or dictionary referents, even approximate. Here are two such words — these are words used weekly in Nelson family conversations for over 25 years:
googlimasha
mecky
My wife and daughters know EXACTLY what these words mean. Do you? Unless we’ve told you, almost certainly not. (Scroll down to the end for their meanings.) As far as you, the reader, are concerned, these words are singletons, and you can only guess at their meanings (their functional roles in English).
AlphaFold uses existing sequences and their known conformations (structures) to predict unknown structures. Under the natural language analogy, AlphaFold levers itself off the existing genetic and proteomic dictionaries. But if a sequence exists as a singleton, in an isolated region of sequence space, AlphaFold performs poorly, which means the protein-folding problem, in its original form, remains unsolved.
Yours to Discover
A new unpublished manuscript by Yves-Henri Sanejouand of the French National Centre for Scientific Research is worth your attention, in relation not only to the protein-folding problem but also to the high frequency of unique (singleton) proteins in eukaryotic species. See “On the unknown proteins of eukaryotic proteomes.” The fascinating implications of Sanejouand’s preliminary analysis are yours to discover.
But if one extends one’s scope to include ALL nucleic acid sequences on Earth (not just eukaryotes), things get really wild. In a new paper, in press at Environmental Microbiology, Eugene Koonin and colleagues argue that — given their sequence diversity — viruses on Earth must have many independent origins. See, “The global virome: how much diversity and how many independent origins?”
No Current Viable Theory
After you read Koonin et al.’s paper, reflect for a moment on its implications. The vast majority of nucleic acid diversity on this planet is unique, represented by singletons (emphasis added):
…we can also roughly estimate the size of the virus pangenome, in other words, the total number of genes in the virosphere. Large viruses encompass many poorly conserved, species-specific genes that obviously represent the bulk of the virus pangenome. Assuming 10 such unique genes per virus species, there would be 10^8 to 10^10 unique virus genes altogether, a vast gene repertoire, to put it modestly.
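The arithmetic behind that estimate is easy to check. The short sketch below is only a back-of-the-envelope illustration, not anything taken from the paper. The species range of 10^7 to 10^9 is an assumption, inferred here by dividing the quoted gene totals by the quoted figure of 10 unique genes per species, and the function and constant names are hypothetical.

```python
# Back-of-the-envelope check of the quoted virus pangenome estimate.
# Assumption: the species range (10^7 to 10^9) is inferred by dividing the
# quoted gene totals by the quoted 10 unique genes per virus species.
UNIQUE_GENES_PER_SPECIES = 10


def unique_gene_estimate(species_count: int) -> int:
    """Return the estimated number of unique genes for a given species count."""
    return species_count * UNIQUE_GENES_PER_SPECIES


low = unique_gene_estimate(10**7)    # lower bound on unique virus genes
high = unique_gene_estimate(10**9)   # upper bound on unique virus genes
print(f"{low:.0e} to {high:.0e} unique virus genes")
```

Changing the per-species figure or the species range rescales the total linearly, so the estimate is only as firm as those two inputs.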
All these sequences must have been processed through a ribosome, borrowed from a free-living cell. There is currently no viable theory for the replication of viral genomes without the simultaneous presence of organismal systems (basically, ribosomes) to be hijacked. Thus the evolutionary clock for the origin of 10^8 to 10^10 viral genes cannot start ticking until the origin of ribosomes.
This appears to be the Mother of All Waiting Times Problems.
Oh, and those words I mentioned earlier? “Googlimasha” is a noun. It means “what Paul made that afternoon for dinner, but doesn’t want to tell his daughters when he picks them up at the end of their school day, because they will complain that they’re not in the mood for pork chops, or whatever, and Paul — having just slaved over dinner prep — simply isn’t interested in their spoiled suburban bellyaching.”
As for “mecky,” it can be a noun but most often is an adjective. It describes the hybrid state of “heck” and “messy,” in other words, an awful situation getting steadily worse. In its noun form, it is a term of endearment for Paul himself, frequently used by his daughter who is now a high school science teacher in Yonkers, NY.