The "Genome of Prokaryotes" project: a scientific startup

This project was conceived long ago. Five years ago I came to believe that many results in genomics can be obtained by people far from biology, which I certainly am. Of course, over this time I have picked up some terminology and learned a little about how working professionals think. But the more I learned about how the professionals work, the more rejection it provoked in me. I think they introduce a lot of undeserved complication, which turns a difficult field into an impassable one, while everything is actually fairly simple and can be done efficiently. And yes, I am trying to compete with them (only in one narrow area, of course), however naive that may look.

The problem with this project is that I am its only full-fledged participant. Of course, I have talked with many people over this time, and many of them have had a real influence on the project. I thank all of them. Clearly, a non-profit project could not count on much success. Indeed, any respectable scientific project means millions poured in and a team of serious scientists. We have neither; there is only humanity and enthusiasm.

So, first of all, I need advice from people who have experience starting such projects on a non-commercial basis. Second, the project needs a proper team of programmers (if necessary, I will free you from having to know biology :) ). And right now I would like to find enthusiasts who could build (modestly speaking) a home web page for the project (please write to me by e-mail at tac@inbox.lv or via private messages on Habr). And, of course, any other feedback and suggestions are important to me.

Below I will describe the idea and what the project lays claim to, as well as the current results, which are at worst comparable with those produced by professionals. I am quite self-critical, so I am always ready to listen to criticism, preferably aimed not at me but at the project.

From idea to computer experiments

I will not restate the raw idea: a lot of ground has been covered since then, and it was described in previous articles on Habr. [Still, I will add a few words, because many readers below complain that I started "right off the bat". The main idea and objective of the project is to understand how bacteria evolved and in what order their DNA changed, and to build a tree of species divergence and analyze it.] I will describe a new, so to speak "all-wheel drive" experiment. But first I need to introduce you to the problem area, so that you can then understand how to evaluate the results of the experiment.

Phylogenetic signal

Here let us discuss this term, which one biologist drew my attention to.

When animals descend by evolution from a common ancestor, it is believed possible to build a single tree-like hierarchical structure of the origin of species. It makes no fundamental difference which characters the tree is based on; the more genes are included in the analysis, the fewer poorly resolved regions remain in the tree. If, on the other hand, the objects being classified do not descend from a common ancestor, no single hierarchical tree structure exists. The classification of such objects either comes out fundamentally different for different sets of characters (genes) or fundamentally does not look tree-like.

Agreement between "trees" built on different characters is taken to indicate the presence of a "phylogenetic signal". The smaller the differences between trees built on different sets of genes, the stronger the "phylogenetic signal". Importantly, though, the reverse is not true.

It is often said that the signal really exists and is unambiguous. That is not the case, so I found one article that is somewhat more critical on this point.

First, its authors point out:

It is assumed that analyzing a set of genes can amplify the phylogenetic signal until it exceeds the noise, and so properly resolve conflicts between different genes. But
[a number of specific examples follow]

All of this suggests that current methods of reconstructing phylogeny from a large number of genes do not eliminate the artifacts known for single genes. The same factors are likely to be at work here: the assumptions of the evolutionary model, differences in the rate of evolution between species, errors in alignment and in the selection of orthologs, and insufficient taxonomic sampling. To eliminate artifacts in multigene phylogenetic analysis, data selection is proposed, which of course makes the analysis less formal. Thus the practice of modern phylogenomics shows that statistical support for phylogenetic reconstructions increases with the number of genes compared, but a high level of statistical support for the tree as a whole or for its individual nodes cannot serve as an indicator that the phylogenetic reconstruction is correct.

And second, they ask:

How does one find, for testing, a gene or nucleotide worthy of unlimited confidence? The shorter the geological period during which a stem group existed, the lower the probability that a randomly selected gene carries a synapomorphy that has not been subject to homoplasy or reversal. To be sure of getting the winning lottery ticket, there is one way: buy up the entire print run. Given the pace at which sequencing technology and computer processing are developing, for genomes this may not seem such a stupid idea in a few years. On the other hand, if the similarity between sister species is large, it will show up in many genes selected at random, and probably even in a single fairly long gene such as 18S or 28S rRNA.

This is what is called a classic of biology. Now let us try to think it over.

In previous articles I proposed the tRNA gene for the role of such a "trustworthy" gene and showed what happens when it is used. This gene is no worse than rRNA, which currently enjoys "unlimited confidence". In this article [and its continuation] I will show what happens if we "buy up the entire print run". But before that, we need to understand the bad option, in which the "unlimited confidence" goes to rRNA.

It turns out that the point is not the choice of one gene or nucleotide sequence over another. Those who dream (I do not) of comparison over a large set of genes are right about that. The point is the method. It is statistical in nature, and those who look at it a little more soberly acknowledge the problems, as in the article quoted above: "the assumptions of the evolutionary model, differences in the rate of evolution between species, errors in alignment and in the selection of orthologs, and insufficient taxonomic sampling".

Each of these factors on its own degrades the phylogenetic signal in one way or another. My main complaint is about alignment errors (I will not explain what alignment is; see Wikipedia). Because of alignment we have to deal with statistics and the errors that come with them. At present it is impossible to align properly, especially short sequences: alignment does not really take into account the conservation of certain fragments. To do so one would have to take into account the hydrogen bonds in the tertiary structure, which is usually not done during alignment.

rRNA, however, is first of all long. Second, although individual sequences contain many errors, statistically they still give a signal. How good that signal is we will see below, by comparing the trees built for 16S rRNA and 23S rRNA (the longest RNA sequences in the ribosome). Such trees were obtained in The All-Species Living Tree project. And third, plenty of articles on phylogenetic trees are being written now, yet a question such as "how far does the phylogenetic signal prevail over the noise" is for some reason not discussed.

What's the alternative?

The only way to answer criticism like the above ("a high level of statistical support for the tree as a whole or for its individual nodes cannot serve as an indicator that the phylogenetic reconstruction is correct") is to move from statistical reasoning, which common sense never trusts with 100% certainty, to conclusions that are deterministic in nature. For this we need to get rid of alignment and select nucleotide sequences that can be evaluated without it.

To my surprise, nobody proposes or even sees such alternatives, although this one at the very least gives more stable results. Why? Let us look into that.

After all, whatever tree I present in the end, it will be no more and no less credible than other trees. Except that those were built by professionals (such as The All-Species Living Tree project), while mine, people will say, was built by a "quack". And there will always be objections.

Likewise, any method is open to the criticism that its results cannot be trusted. Therefore we need a criterion for the correctness of results. The stability of the "phylogenetic signal" is a candidate for this criterion.

But before choosing it for this role, I would like the reader to understand why this signal may be unstable. There can be three reasons:

1. Evolution is not Darwinian, i.e. organisms simply have no common ancestor and never had one. Considering, first, that horizontal gene transfer is observed today and, second, that the RNA world hypothesis is largely speculative, so that separate organisms could have arisen independently of one another, Darwinian evolution really is an open question. So here we will simply agree that the human mind finds it natural to view the origin of species as a hierarchy, and that Darwinian evolution is for us just a convenient way of presenting information, much like drawing graphs instead of writing text.
2. Error of the method. For example, alignment, which I have expressed a lot of distrust in. It is misalignment that distorts the signal to a great extent.
3. A different number of examples in the sample.

When all three causes act at once, we cannot say with confidence whether the noise we observe has an objective or a subjective cause. That is, we cannot say whether the problem lies in our method, in how representative our sample is, or in evolution as a whole being not quite Darwinian.

Researchers can very easily say: "You know, our method works perfectly, the sample is wonderful, and those small errors you see are just how things are in nature." But, first, let us quantify the error. Second, let us replace the statistical approach with a deterministic one. Third, let us do the kind of analysis that only the deterministic approach makes possible.

The advantage of a deterministic approach

To demonstrate the advantage of the deterministic approach I will propose a thought experiment. It could actually be carried out, but the audience would tire of the dry presentation and, more importantly, we have known since Aristotle that an experiment proves nothing in absolute terms; it only lets you say "on these data we see this, but that does not mean it cannot be otherwise". And here we need to judge in absolute terms.

So, the thought experiment. Let us compare the statistical and the deterministic approach. In the statistical one, we analyze 1000 organisms by a single gene, 16S rRNA, about 1600 characters long (as is done in most studies). Suppose we have a reliable set of rRNA for all 1000 organisms. To build a phylogenetic tree we need to do an alignment. But before aligning, let us split each rRNA into two equal parts, and do the alignment and the subsequent tree building separately for the first and the second part.

Now the deterministic approach. Here we rely on genes that are identical in different organisms; such genes cannot be long, because anything long is more likely to have mutated. So instead of one gene of 1600 characters we have a set of 10-20 genes of 70-150 characters each. tRNA genes, for example, fit this description. Again, suppose we have a reliable set of these genes. Then the question is: if we split the tRNA sequences into two parts and build two separate trees, will they match or not? Answer: they match 100%. This is because, when the tree is built, each sequence is in effect replaced by an ID, and all further manipulations work only on combinations of genes. Therefore, if the genes were correctly identified from half of the sequence, no further distortion arises.
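The reasoning about IDs can be illustrated with a toy sketch. It is not the project's tree-building code, which is not shown in this article; the organism names, the sequences and the helper `to_ids` are made up for illustration. The sketch assumes that each half of a sequence still identifies its gene uniquely, which is exactly the "correctly identified" condition stated above.

```python
def to_ids(genomes, key=lambda seq: seq):
    """Replace every gene sequence by an integer ID.

    genomes: dict organism -> list of gene sequences (toy data here).
    key: which part of the sequence is used to recognise the gene.
    IDs are handed out in order of first appearance.
    """
    ids = {}
    profiles = {}
    for organism, genes in genomes.items():
        profiles[organism] = frozenset(
            ids.setdefault(key(gene), len(ids)) for gene in genes
        )
    return profiles


# Hypothetical toy data: three organisms, two short "genes" each.
genomes = {
    "org_a": ["ACGTACGT", "TTGACCAA"],
    "org_b": ["ACGTACGT", "GGCATTCA"],
    "org_c": ["GGCATTCA", "TTGACCAA"],
}

full = to_ids(genomes)
first_half = to_ids(genomes, key=lambda s: s[: len(s) // 2])
second_half = to_ids(genomes, key=lambda s: s[len(s) // 2:])

# Every later step would see only these ID sets, so identical ID sets
# mean identical input for tree building, whichever half was used.
print(full == first_half == second_half)  # True
```

If two different genes shared the same half, the ID sets would differ and the guarantee would no longer hold; the sketch only covers the ideal conditions of the thought experiment.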

That is, under ideal conditions and with the same sample, the deterministic approach has a clear advantage and has no errors of the 2nd kind (errors of the method).
Then we can talk about errors of the 3rd kind and how they affect the phylogenetic signal. But we must understand that the deterministic approach has only errors of the 3rd kind, whereas in the statistical approach, which is now accepted everywhere, we cannot separate the "noise" contributed by errors of the 2nd kind from that of the 3rd.

The experiment itself

No. 1. Comparing the 16S and 23S trees

So, we need to compare two trees, one built from the 23S rRNA gene and one built from the 16S rRNA gene; both are the final result of The All-Species Living Tree project.

But only comparable things can be compared. And here it is time to talk about how to measure the error of the 3rd kind, i.e. how the size of the sample and its composition affect the result. Specialists would have us carry out statistical studies of probability distributions, estimates of bias, variance and so on: murky indexes and coefficients that say nothing. We, in contrast, have to compare in such a way that every number makes it clear what it means.

First, the format of phylogenetic trees hides one important thing: it does not show ancestors explicitly, although an ancestor is there as the point where lines meet at the same level. In effect we need to convert formats, for example from Newick to .gml, i.e. to obtain a full tree in which the ancestors have names.
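A minimal sketch of the first half of that conversion, giving every ancestor a name, might look as follows. It is an illustration, not the code used in the project: the function name `parse_newick` and the generated labels `n1`, `n2`, ... are assumptions, branch lengths are simply dropped, quoted labels are not handled, and a generated label could clash with a label already present in the file. Writing the result out as .gml is left aside.

```python
import re


def parse_newick(text):
    """Parse a simple Newick string.

    Returns (root, parent), where parent maps every node name to the
    name of its ancestor. Unnamed ancestors get labels n1, n2, ...
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    punctuation = {"(", ")", ",", ";"}
    parent = {}
    pos = 0
    counter = 0

    def node():
        nonlocal pos, counter
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in punctuation:
            # "name:length" -> keep only the name part
            label = tokens[pos].split(":")[0].strip()
            pos += 1
        if not label:
            counter += 1
            label = f"n{counter}"
        for child in children:
            parent[child] = label
        return label

    root = node()
    return root, parent


root, parent = parse_newick("((A:0.1,B:0.2):0.3,(C,D)anc,E);")
print(root, parent["A"], parent["C"])  # n2 n1 anc
```

The parser is recursive, so a very deep tree may need an iterative rewrite or a higher recursion limit.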

Second, there is almost 10 times more data for the 16S gene. We need to remove the leaves that are in the 16S tree but not in the 23S tree, and vice versa. Only then do we get something that can be compared. But after such removal (pruning) of the "leaves" that we have nothing to compare with, their former ancestors may remain, and if these have no other "leaves", they must also be removed so that they do not clutter the tree.

Third, and most important, the pruning described above does not solve all the problems of bringing the trees to a common denominator. It may happen that an ancestor has only one leaf, that this ancestor is in turn the only child of its own ancestor, and so on several times over. That is, the resulting tree contains a "long thread". All these "single" ancestors prevent comparison with the other tree (23S), which does not contain them because it was built from a different, smaller sample; naturally, a large sample implies a larger number of ancestors, to depict the divergence of species more accurately. For the trees to be comparable, such "single" ancestors must be excluded and their leaves lifted to the level of an ancestor that has more than one child (i.e. where a real divergence occurs).

This process of "lifting leaves to the point of divergence" again leaves behind ancestors that can also be excluded, so steps 2 and 3 must be repeated until all unnecessary ancestors are eliminated.
A small sketch to make this clearer:

On the right is the tree before any manipulation. In the center, the leaf "Escherichia_albertii", which is absent from the compared tree, has been pruned. On the left, the unnecessary ancestor "n23" has been removed. In reality things are more serious: of 18,000 nodes only the 3000 that are needed remain. It may seem that important ancestors are being removed, but if they are kept the comparison only gets worse: the tree built from less data cannot contain these ancestors, so there is nothing to compare them with, and we still need to compare like with like.
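Steps 2 and 3 can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's code: the tree is given as a dict from each ancestor to the list of its children, the function name `restrict_tree` is made up, and the two leaves next to "Escherichia_albertii" are hypothetical. A single bottom-up pass gives the same result as repeating the two steps until nothing changes, because every ancestor is examined only after all of its descendants have been cleaned up.

```python
def restrict_tree(children, root, keep):
    """Keep only the leaves in `keep` and drop ancestors left without a fork.

    children: dict ancestor -> list of child names; leaves are not keys.
    Returns (new_root, new_children).
    """
    new_children = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:                          # a leaf
            return node if node in keep else None
        kept = [k for k in (visit(c) for c in kids) if k is not None]
        if not kept:                          # ancestor lost all its leaves
            return None
        if len(kept) == 1:                    # "single" ancestor: lift the child
            return kept[0]
        new_children[node] = kept             # a real divergence, keep it
        return node

    return visit(root), new_children


# Toy version of the figure; only n23 and Escherichia_albertii come from it.
tree = {
    "n22": ["n23", "Shigella_flexneri"],
    "n23": ["Escherichia_coli", "Escherichia_albertii"],
}
shared_leaves = {"Escherichia_coli", "Shigella_flexneri"}
new_root, pruned = restrict_tree(tree, "n22", shared_leaves)
print(new_root, pruned)
# n22 {'n22': ['Escherichia_coli', 'Shigella_flexneri']}
```

Applied to both trees with the set of leaves they share, this leaves two trees over the same leaves in which every remaining ancestor marks a real divergence.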

Now, to compare strictly: the trees match where leaves that share a parent in one tree also share a parent in the compared tree. We can count the number of such cases. But to assess how close the trees are, we also need a distribution of errors. The size of an error can be counted. If a pair of "leaves" shares a parent in one tree, then in the compared tree we find their lowest common ancestor (LCA) and count the intermediate ancestors from the first leaf to the LCA and from the second leaf to the LCA; the two numbers are added and the sum is plotted as a point on the error distribution.
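The counting rule can be written down as a short sketch. It is an illustration, not the project's code: both trees are assumed to be given as dicts from each node to its parent, over the same set of leaves, and the function names and the toy trees are made up. An error of 0 means the pair shares a parent in both trees.

```python
from collections import Counter
from itertools import combinations


def ancestors(parent, node):
    """Ancestors of `node`, nearest first, up to the root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def sibling_pairs(parent, leaves):
    """Pairs of leaves that share a parent."""
    groups = {}
    for leaf in leaves:
        groups.setdefault(parent[leaf], []).append(leaf)
    for group in groups.values():
        yield from combinations(sorted(group), 2)


def pair_error(parent, a, b):
    """Intermediate ancestors between a and the LCA plus between b and the LCA."""
    chain_a = ancestors(parent, a)
    chain_b = ancestors(parent, b)
    seen = set(chain_a)
    lca = next(x for x in chain_b if x in seen)  # assumes one common root
    return chain_a.index(lca) + chain_b.index(lca)


def error_distribution(parent_ref, parent_other, leaves):
    """Histogram: error size -> number of sibling pairs of the reference tree."""
    return Counter(
        pair_error(parent_other, a, b)
        for a, b in sibling_pairs(parent_ref, leaves)
    )


# Toy trees: ((A,B),(C,D),(E,F)) versus ((((A,C),B),D),(E,F))
ref = {"A": "x", "B": "x", "C": "y", "D": "y", "E": "z", "F": "z",
       "x": "r", "y": "r", "z": "r"}
other = {"A": "p", "C": "p", "p": "q", "B": "q", "q": "s", "D": "s",
         "s": "t", "E": "u", "F": "u", "u": "t"}
dist = error_distribution(ref, other, "ABCDEF")
print(sorted(dist.items()))  # [(0, 1), (1, 1), (2, 1)]
```

The share of pairs with error 0 is the fraction of matching cases; the rest of the histogram is the distribution of errors.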

In the end we get the following graph: about 50% of the cases are correct, and the remaining, erroneous cases do fade out as the error grows.

As you can see, the experts' result is far from ideal: it is about 50% noise, and although some regularity does show through beyond that, it is fragile. So there is room for improvement.

To be continued...

This has grown long, so I will put the results of the deterministic approach into a separate article. There we will look at how far the quality of the evolutionary tree (the phylogenetic signal) can be improved. The experiment is not fully finished, but I hope for the best :)

P.S. Update: there is a high probability that the issue with the site has been resolved. Thank you, good people :) Now the team needs a site editor / image maker, so to speak: someone able to correct the text both grammatically and semantically, so that my "sassy style" does not make specialists wince and at the same time stays understandable to ordinary people.

Article based on information from habrahabr.ru
