Clearcut: Publish the Claim, then Destroy the Data
Evolution News & Views January 16, 2013 5:55 PM
Most of the data to support claims about Darwin's Tree of Life over the last two decades have been "irrevocably lost," a survey found.
Bryan T. Drew, working in conjunction with the Open Tree of Life Project, decided to check the data supporting the growing construction of Darwin's tree. He reported the somewhat shocking results in Nature:
Of 6,193 papers we surveyed in more than 100 peer-reviewed journals, only 17% present accessible trees and alignments (used to infer relatedness). Contacting lead authors to procure data sets was only 19% successful. DNA sequences were deposited in GenBank for almost all these studies, but it is the actual character alignments that are pivotal for reproducing phylogenetic analyses. We estimate that more than 64% of existing alignments or trees are permanently lost. (Emphasis added.)
This implies that nearly 4,000 published papers cannot back up their claims with the actual evidence. Perhaps improved practices in the future can rectify the situation, but Drew believes it is serious:
This problem will increasingly hinder phylogenetic inference as the use of whole-genome data sets becomes common. Journals need to reinforce a policy of online data deposition, either as supplementary material or in repositories such as TreeBASE (http://treebase.org) or Dryad (http://datadryad.org) -- including for data sets based on previously published sequences. Ecologists, evolutionary biologists and others will then have access to rigorous phylogenetics for testing their hypotheses.
The "others" could reasonably include skeptics of the Tree of Life who want to evaluate whether the claims made are justified by the data or instead are held aloft by prior assumptions that Darwinian evolutionary common descent is true.
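As a quick check on the "nearly 4,000" figure above, here is a minimal Python sketch of the arithmetic. It applies Drew's 64% estimate to the 6,193 surveyed papers. The variable names are illustrative, and treating the 64% (which Drew states for alignments or trees) as a share of the surveyed papers is an assumption, not something the Nature letter spells out.

```python
# Figures quoted from Drew's letter in Nature.
papers_surveyed = 6193
lost_fraction = 0.64  # "more than 64%", so this is a lower bound

# Assumption: the lost fraction applies evenly across the surveyed papers.
papers_without_data = round(papers_surveyed * lost_fraction)
print(papers_without_data)  # 3964
```

Because 64% is stated as a minimum, the count of papers without recoverable data would be at least this large under the same assumption.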
The Open Tree of Life website adds more reason for concern. An article there headlined "The Glass Is Still Pretty Empty" warns that only a small portion of phylogenetic data is stored publicly:
Sometimes you wonder whether the glass is half full or half empty.
But when it is only filled for four percent -- the other 96 percent is just air -- there is only one conclusion: it is time for more.
Drew's estimate in Nature is that more than 64% of phylogenetic data are lost, but the actual number could be much higher. The Open Tree of Life article says that only a tiny portion of published trees can be checked against the data:
At least that is what some scientists in the phylogenetic community argue, because only about four percent of all published phylogenies are stored in places such as TreeBASE or Dryad. Their message is quite simple: it is time to bring together more databases with estimations on how species are possibly related to each other.
Several journals in the evolutionary biology field recently adopted policies that encourage or require contributors to make their data publicly available online. Yet, this only leads to the storage of a very small percentage of ten-thousands of phylogenies that have been constructed in the past few decades.
What this implies is that evolutionary phylogenies are largely baseless (if you'll pardon the pun). Sure, there may be ways to reconstruct some of them, but who would be willing to go to the trouble?
Of course, there are also other ways to receive data that are not stored on the Internet, but those alternatives are commonly not the most efficient routes. For instance, it is possible to send an email to a scientist who published a phylogenetic tree and "sometimes wait for six months to maybe get a response -- either with or without the data," says Keith Crandall, one of the Open Tree of Life investigators and the founding director of the Computational Biology Institute at George Washington University.
Multiply that by 4,000 papers and you can be sure nobody has that kind of time or patience. Evolutionary biologists are not necessarily being deceptive about their claims; it's just too much bother to upload all those alignments and trees when there are no standards. Nevertheless, this data loss hinders one of science's primary ideals: reproducibility. How many "Libraries of Alexandria" are figuratively being burned by negligence? (See "Science Can Perpetuate Myths.")
Only an overwhelming amount [of online data] provides scientists the opportunity to efficiently explore where prior studies are in agreement on how species are related, but also where there are conflicts that still need to be resolved.
Where replication has been tried, it has often failed:
A group of scientists from the Netherlands, United Kingdom, and United States recently published an article about current practices for storing datasets with tree estimates. They concluded that "most phylogenetic knowledge is not easily re-used due to a lack of archiving, lack of awareness of best practices, and lack of community-wide standards for formatting data, naming entities, and annotating data." As a result, "[m]ost attempts at data re-use seem to end in disappointment."
The article ends with optimism that things will get better. Until now, some participants have been "appropriately cautious" about promised solutions. They also want to see "some positive signs that the Open Tree of Life project will be successful" first. This implies that, to date, it has not been successful.
Maybe things will get better. Maybe someday inferences about Darwin's Tree will be justified by overwhelming data available for all to check. But for now, despite all the advances in genome sequencing, Darwin's Tree is a phantom, not a reality. Knock on the wood. It's 96% air.
Meanwhile, a recent paper in PNAS finds "widespread horizontal gene transfer of retrotransposons" in vertebrate genomes, including those of cattle, cats and koalas. This could further confound evolutionary inferences:
A phylogenetic tree built from BovB sequences from species in all of these groups does not conform to expected evolutionary relationships of the species, and our analysis indicates that at least nine HT [horizontal transfer] events are required to explain the observed topology. Our results provide compelling evidence for HT of genetic material that has transformed vertebrate genomes.
This is not the kind of "transformism" that Darwin hoped to document. It implies the documentation itself has been transformed, making it harder to see a tree.