With correlated (also called autocorrelated) models, the rate distribution for a particular branch depends on the rates of the neighboring branches. Correlated models seem to be the preferred choice for large groups of slowly evolving species, for example mammals, where it has been demonstrated that some subgroups evolve faster than others.
However, the advantages and limitations of this large variety of models are still debated (Drummond et al.). All these models and methods have proven useful in a number of studies, but they are computationally intensive, making it virtually impossible to deal with the larger data sets available today, even when using sophisticated implementations and powerful computers (Ayres et al.). Typically, days of computation are required to analyze a few hundred taxa, although faster approaches based on more complex algorithms are available (Akerborg et al.).
Here we are interested in dating very large phylogenies, typically with a thousand tips or more, a need that is becoming increasingly common, for example, in molecular epidemiology. We propose distance-based algorithms to estimate rates and dates, a mathematical and computational framework that has proven to produce fast and fairly accurate tools in phylogenetics.
Several distance-based (as opposed to sequence-based, see above) dating methods have already been proposed.
Fast Dating Using Least-Squares Criteria and Algorithms.
Most of these methods deal with time calibration points, where the dates of certain ancestral nodes in the tree are known, possibly with uncertainty.
These methods input a rooted tree with time calibration points, and return a time-scaled, ultrametric tree. Examples include PATHd8 (Britton et al.) and the methods of Xia and Yang and of Sanderson. Xia and Yang's method assumes a SMC or two different local clocks, and achieves least-squares estimation under these assumptions. Sanderson's approach is based on a penalized-likelihood criterion to account for the autocorrelation of rates, combined with standard optimization techniques (see also TreePL, Smith and O'Meara). Based on computer simulations, these fast methods were shown to be accurate by their authors, producing time-scaled trees similar to those obtained using sequence-based approaches.
The focus of the present study is on serial phylogenies, where the tips of the tree have been sampled through time. Such phylogenies are common with fast-evolving organisms. Serial phylogenies are also used with ancient DNA (Lambert et al.).
Moreover, close relationships exist between the calibration-points and dated-tips approaches (Ronquist et al.).
Several methods have been proposed in this framework. One of the very first is root-to-tip regression (RTT; Shankarappa et al.). This method is very fast and can be extended to unrooted trees by searching among all tree branches for the best root position, according to some numerical criterion.
However, this method does not provide estimates for the dates of internal nodes, and thus does not output time-scaled trees.
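As an illustration of the root-to-tip idea described above, here is a minimal sketch for a rooted tree. It assumes the root-to-tip distances have already been computed; the function name and argument layout are hypothetical and do not come from any RTT implementation. It fits an ordinary regression line of distance against sampling date, reads the substitution rate from the slope, and takes the root date as the date at which the fitted distance is zero.

```python
def root_to_tip_regression(dates, distances):
    """Fit distance = rate * (date - root_date) by ordinary least squares.

    dates:     sampling dates of the tips (numbers, e.g. decimal years)
    distances: root-to-tip distances of the same tips (substitutions/site)
    Returns (rate, root_date). Names and layout are illustrative only.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx                      # slope of the regression line
    intercept = mean_d - rate * mean_t
    root_date = -intercept / rate         # date where fitted distance is 0
    return rate, root_date
```

With tips sampled in 2000, 2005 and 2010 at distances 0.02, 0.03 and 0.04 from the root, the fit gives a rate of 0.002 per year and a root date of 1990 (up to rounding). As noted above, nothing in this procedure dates the internal nodes.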
To obtain date estimates of the internal nodes, sUPGMA (Drummond and Rodrigo) first estimates the substitution rate by regression, then corrects the non-contemporaneous tips into contemporaneous tips, and finally uses UPGMA (Sokal and Michener) to compute the tree. Unlike the former approaches, Langley and Fitch's (LF) method uses an explicit model.
The LF method assumes a SMC with a constant substitution rate, and models the number of substitutions along each branch of the tree by a Poisson distribution. The estimates of the global substitution rate and of the internal node dates are then obtained by maximizing the likelihood of the input, rooted tree.
LF is implemented in r8s (Sanderson). In this article, we study a model analogous to LF's, but using a normal approximation that allows for a least-squares approach, and show that this model is robust to uncorrelated violations of the molecular clock. Using the recursive tree structure of the problem at hand, and the close relationships between least-squares and linear algebra, we propose very fast algorithms to estimate the substitution rate and the dates of all internal tree nodes.
With rooted trees, the time complexity is nearly linear.
The article is organized as follows: we first define the model and show its ability to handle uncorrelated rate variations among tree branches, as is commonly assumed with virus data. We then present our two main algorithms, distinguishing the unconstrained setting and the case where the temporal precedence constraints are taken into account. Last, we compare these algorithms to standard approaches using simulated data and a large influenza data set. Our algorithms take as input a binary phylogenetic tree with branch lengths, inferred by any tree building program, and sampling dates associated with the taxa.
As our algorithms are very fast, it is consistent to combine them with fast tree-building methods, for example distance-based methods. However, we shall see that results obtained with both approaches are close. The algorithms accept a rooted or unrooted tree, and for unrooted trees we propose a method to estimate the root position, though simulations show that the use of an outgroup is generally preferable.
Given a set of n serially dated sequences, let R be the input rooted binary phylogenetic tree on these sequences with known branch lengths. Node 1 corresponds to the root. The date of node i is denoted by t_i. For every internal node i, let s_1(i) and s_2(i) be the two direct descendants of i. Let b_i be the length of the branch (i, a(i)), where a(i) denotes the parent of i; b_i is an estimate of the number of substitutions per site that occurred along the branch from time t_a(i) to t_i.
With a SMC, the substitution rate i. The higher c is, the closer we are to equal variances, that is, ordinary least squares (OLS). To summarize, our model Eq. This corresponds to the default option in several programs e. We certainly do not claim that this model depicts all the complexity of sequence evolution, but it makes very efficient calculations possible with little loss in terms of estimation accuracy, as described later.
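The displayed equations of this passage did not survive in the present copy. As a reading aid only, a Gaussian least-squares model of the kind described (a normal approximation of the LF model, with weights w_i that reduce to OLS when set to 1) might be written as follows. The symbols \omega (substitution rate), \varepsilon_i and \sigma_i^2 are assumptions introduced here, and the exact variance formula involving the constant c is deliberately not reproduced.

```latex
% Sketch under stated assumptions; not the article's typeset equations.
% Model: each branch length is a noisy observation of rate times elapsed time.
b_i = \omega \,\bigl(t_i - t_{a(i)}\bigr) + \varepsilon_i ,
\qquad \varepsilon_i \sim \mathcal{N}\bigl(0, \sigma_i^2\bigr)

% Weighted least-squares criterion over all non-root nodes i,
% with w_i = 1/\sigma_i^2 (w_i = 1 gives the unweighted, OLS version).
\varphi(\omega, t) = \sum_{i \neq 1} w_i \,\bigl(b_i - \omega\,(t_i - t_{a(i)})\bigr)^2
```

Under this form, the tip dates are known, and the unknowns are the rate and the dates of the internal nodes.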
This is an obvious requirement, analogous to the positivity of branch lengths in phylogenetic trees. However, not all dating methods comply with this requirement. The reasons for this are mostly computational. Imposing positivity constraints has a computational cost, as we shall see below in our dating context. This function is a convex quadratic form (O'Meara) and has a unique minimum (see Proof in the Online Appendix).
Therefore, Equation 2 also has a unique minimum. We propose two different algorithms. One takes into account the temporal precedence constraints, while the other does not. We present the weighted versions in the following, as the unweighted versions are simply obtained by fixing the w i to 1.
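To make the unconstrained setting concrete, the following sketch solves a small weighted least-squares dating problem with a generic dense solver. It is not the fast recursive algorithm of the article. It assumes a criterion of the form sum of w_i (b_i - omega (t_i - t_a(i)))^2 and uses the change of variables u_i = omega * t_i for internal nodes, which makes every residual linear in the unknowns (omega and the u_i). The function names and the dictionary-based tree encoding are hypothetical.

```python
def solve_linear(m, rhs):
    """Solve m x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def least_squares_dates(parent, branch, tip_date, weight=None):
    """Unconstrained weighted least-squares rate and internal node dates.

    parent:   dict node -> parent node (the root is not a key)
    branch:   dict node -> length of the branch above that node
    tip_date: dict tip -> known sampling date
    weight:   optional dict node -> w_i (defaults to 1, i.e. unweighted)
    """
    internal = sorted(set(parent.values()))
    col = {v: k + 1 for k, v in enumerate(internal)}  # column 0 is the rate
    rows, rhs, w = [], [], []
    for i, a in parent.items():
        row = [0.0] * (len(internal) + 1)
        if i in tip_date:
            row[0] = tip_date[i]       # omega * t_i with t_i known
        else:
            row[col[i]] = 1.0          # u_i = omega * t_i, unknown
        row[col[a]] -= 1.0             # minus u_a(i)
        rows.append(row)
        rhs.append(branch[i])
        w.append(1.0 if weight is None else weight[i])
    k = len(internal) + 1
    # Normal equations (A^T W A) x = A^T W b.
    ata = [[sum(wr * r[p] * r[q] for r, wr in zip(rows, w))
            for q in range(k)] for p in range(k)]
    atb = [sum(wr * r[p] * y for r, y, wr in zip(rows, rhs, w))
           for p in range(k)]
    x = solve_linear(ata, atb)
    rate = x[0]
    dates = {v: x[col[v]] / rate for v in internal}  # back to t_i = u_i / omega
    return rate, dates


# Toy tree: root -> (n1 -> (A, B), C), with branch lengths generated
# exactly by a rate of 0.01 and node dates root = 0, n1 = 2.
parent = {"n1": "root", "A": "n1", "B": "n1", "C": "root"}
branch = {"n1": 0.02, "A": 0.03, "B": 0.04, "C": 0.04}
tip_date = {"A": 5.0, "B": 6.0, "C": 4.0}
rate, dates = least_squares_dates(parent, branch, tip_date)
```

On this noise-free toy tree the sketch recovers the rate 0.01 and the dates 0 and 2 up to rounding. As in the unconstrained setting discussed here, nothing in it prevents an estimated date from violating temporal precedence when the data are noisy.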
This algorithm can be extended to non-binary trees. However, nothing guarantees that the date estimates satisfy the temporal precedence constraints. This is why we designed the QPD (quadratic programming dating) algorithm, which we describe now. With strictly convex quadratic functions, this method is guaranteed to converge to the unique global minimum (Nocedal and Wright). Although Equation 2 does not comply with these requirements, a proof of QPD convergence to the unique minimum is provided in the Online Appendix.
The active-set method is especially efficient here, because we can find the stationary point of the Lagrange function Eq.
In our experiments (described below), QPD performs 3 iterations on average with simulated trees of taxa, and 69 iterations with an H1N1 influenza data set of taxa. Although it is difficult to extrapolate from these experiments, it seems that in practice f is much smaller than n, and thus the computing time of QPD appears to be nearly linear. We implemented a tree generator based on a simple birth-death model with periodic sampling times, mimicking typical intrahost studies with yearly sampling, or interhost epidemic surveillance through time.
Let us start with SMC.
This process is continued until we have individuals. Then we proceed with sampling and death: the evolution of a number of individuals e. The process continues with the nonculled and nonsampled individuals in our example, which are further divided using the same Yule-type rule until we again have individuals to be sampled, culled, or conserved for the next step.
The whole process is continued until we attain the desired number of sampling times. The final set of sampled individuals is exactly the taxon set or leaves of the final tree. This tree is then rescaled so that the time between the first and the last sampling time is 20 years, with the root date being zero.
An advantage of this scheme is that the time elapsed from one sampling time to the next one is constant, thus emulating the sampling of DNA sequences from an evolving population on a regular basis, as opposed to standard birth-death tree generators (Stadler). Moreover, with birth-death trees the divergence times vary among replicates, while here we use fixed divergence times for easy estimation of method accuracy and presentation of the results.
We generated two kinds of trees, intended to simulate interhost and intrahost HIV evolution (Volz et al.). For each, we used 3 sampling times separated by 10 years with 25 selected individuals at each time, and 11 sampling times separated by 2 years with 10 selected individuals at each time.
See Figure 1 for examples of trees. Additionally, we added one outgroup to simulate the search for the root position using the standard outgroup-based approach. The length of the branch from the ingroup root to the outgroup was three times the length from the ingroup root to the nearest ingroup leaf. With each combination of these parameters, trees were randomly generated.
Examples of simulated trees. Four examples of trees extracted from our simulated data sets. Trees a and b each have 3 sampling dates with 25 sampled strains each. Trees c and d each have 11 sampling dates with 10 sampled strains each. See text and Volz et al.
For this purpose, we reused the previous trees, but multiplied every branch length by a random variable following a lognormal distribution with mean 1 and standard deviation 0. This value is between the estimates we obtained for pol and env HIV genes (unpublished results). These parameter values are similar to estimates already observed with the env region of HIV (Posada and Crandall). To assess the accuracy of the distance-based dating methods, we inferred trees from these alignments.
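The branch-length perturbation described above (multiplying each branch by a lognormal variable of mean 1) can be sketched as follows. The standard deviation is truncated in this copy of the text, so it is left as a parameter `sd`; the function names are hypothetical. The only non-obvious step is converting the desired mean and standard deviation of the lognormal variable into the parameters of the underlying normal distribution.

```python
import math
import random


def lognormal_params(sd, mean=1.0):
    """Parameters (mu, sigma) of the underlying normal distribution such that
    the lognormal variable has the requested mean and standard deviation."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def relax_branch_lengths(lengths, sd, rng=None):
    """Multiply every branch length by an independent lognormal factor of
    mean 1 and standard deviation sd (uncorrelated rate variation)."""
    rng = rng or random.Random()
    mu, sigma = lognormal_params(sd)
    return [b * rng.lognormvariate(mu, sigma) for b in lengths]
```

Because the factors are drawn independently for each branch, this produces the uncorrelated kind of clock violation that the model is meant to tolerate. Passing a seeded `random.Random` makes replicates reproducible.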
All these trees were used in two ways: (i) the outgroup was used to produce rooted trees, from which the outgroup was deleted; (ii) we simply removed the outgroup to obtain unrooted trees.
All of our data sets model trees, alignments, distance matrices, inferred trees, etc. For RTT, we re-implemented the linear regression method, which takes both rooted and unrooted trees as input. Given unrooted trees, it estimates the position of the root by minimizing the sum of squared residuals.
For dozens of data sets, we checked that our implementation gives the same result as Path-O-Gen v1. Unlike other methods used here, RTT does not estimate the dates of internal nodes but only the root date and the substitution rate.
We used a SMC with an uninformative prior (the clock rate had a uniform distribution between 0 and 1). For the relaxed-clock data, we also used a lognormal relaxed-clock model.
These parameter values are standard and default options were used in all of our analyses. Additional runs with several alternative priors were also performed uniform prior in a much more narrow interval [0, 0. Moreover, other runs of BEAST were carried out to assess the accuracy of internal node date estimations. We then used the true rooted tree topology (otherwise date comparisons are meaningless) and forced it to be constant in BEAST, so that only the branch lengths were re-estimated, just as with PhyML (see above).
In all of our analyses, we used the meanRate estimator for rate estimations with BRMC, since it was more accurate than ucld. With simulated data, the true values of the parameters (substitution rate, root and node dates) are known.
We used standard quadratic error measures to compare the true and estimated values and assess the accuracy of the methods being compared.
An advantage of these measures is that they can be decomposed into variance and bias terms, thus indicating whether the estimation method shows some tendency to over- or underestimate the true parameter value, and whether the main source of errors is, or is not, the variance of the estimates.
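The decomposition mentioned above can be illustrated with a generic mean squared error. The exact error measures of the article are not reproduced in this copy, so this sketch only shows the standard identity MSE = variance + bias^2 over replicate estimates of one parameter. The function name is hypothetical.

```python
def error_decomposition(estimates, truth):
    """Return (mse, bias, variance) for replicate estimates of one parameter.

    The three quantities satisfy mse == variance + bias ** 2 (up to rounding),
    where the variance is taken around the mean of the estimates.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    mse = sum((e - truth) ** 2 for e in estimates) / n
    bias = mean_est - truth  # positive: overestimation, negative: underestimation
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    return mse, bias, variance
```

A bias that is large relative to the square root of the variance signals a systematic tendency of the method, whereas a small bias with large variance points to noisy but centered estimates.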
For a fair comparison, we also have to account for tree building, as BEAST infers both the tree and the dates. However, PhyML is much faster, requiring 8 min for the largest taxon trees. The computing time difference between distance-based approaches and BEAST is thus very large (see Online Appendix Supplementary Table S1 for details) but does not correspond to gains in estimation accuracy, as discussed below.
With SMC data Fig. As a general tendency Fig. Surprisingly, the accuracy of rate and root date estimations is not significantly affected by topological errors: although the FastME and PhyML trees contain a substantial number of erroneous branches, we see very little difference in accuracy between the results obtained with the true and inferred topologies.
This suggests using the much faster FastME rather than PhyML when the aim is not to obtain a fully correct tree topology but to quickly estimate rates and dates, or to perform bootstrap analyses.
Summary results with simulated data. Panels a and b show the relative error of the substitution rate estimates, panels c and d show the relative error of the root date estimates, and panels e and f show the average error (in years) of the date estimates of all tree nodes.
See text for the definitions of these measures. With RMC data Fig. Again, the topological errors have little impact on the accuracy of rate and date estimations, and cannot explain the differences among the various methods, especially with BEAST the topological accuracy of which is still slightly better than PhyML's Supplementary Table S4. As expected the main factor is root positioning, which has a high impact on root date estimations.
If the root is misplaced, the tree cannot be dated precisely.
Among the methods directly inferring the root position i. Moreover, the global average results Fig. Up until now, we mostly discussed average results over the four types of model trees Fig.
As expected, the accuracy of the various methods differs depending on the model tree Online Appendix Supplementary Figs. The accuracy of the estimates is better with the larger sample of dated sequences (11 sampling times with 10 sequences each) than with 75 sequences, and the impact is especially noticeable with date estimations since we have 11 sampling times every 2 years instead of 3 every 10 years.
However, the global properties and the ranking of the various methods remain similar compared to average analysis except with BEAST, see above. Most results in these simulations were expected. The main surprise comes from the results of BEAST, expected to be the best due to its sophisticated model, being identical or very close to the data model, but in fact the results on the data sets used here do not suggest this.
Let us conclude these simulations with practical guidelines. Tree rooting is a difficult task; thus, if possible, use an outgroup and compare the results with the direct ones, obtained by assuming some relaxed clock model.
ML trees are preferable to minimize topological errors, but fast distance-based trees provide nearly identical rate and date estimates. LD and QPD resp. To illustrate the results of our algorithms on large data sets, we used a set comprising strains of influenza A virus subtype H1N1pdm09, which caused the first human influenza pandemic of the 21st century. The first two cases were reported in children from southern California on 21 April. Soon after, other cases were reported, and by 11 June 27, cases of infection had been observed from 74 countries, including deaths.
On that date, the World Health Organization (WHO) declared a pandemic, and the end of the pandemic was declared in August (for details, see Christman et al.).
Molecular epidemiology studies on this virus were performed at an early stage of the epidemic, using strains collected between 30 March and 12 July (Lemey et al.). These studies indicated that this virus has a high evolutionary rate of 4. To our knowledge, no other molecular dating study has been published on a more comprehensive set of strains sampled over a longer time period.
The strains used here were collected worldwide between 13 March and 9 June (see Online Appendix Supplementary Table S5 for further details).
As many sequences were identical but collected at different time points, we retained for each set of identical sequences only one exemplar with a sampling date equal to the average of the dates of the corresponding strains.
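The grouping step described above can be sketched as follows, assuming each strain is given as a (name, sequence, date) record with a numeric date such as a decimal year. The record layout and function name are hypothetical. One exemplar is kept per set of identical sequences, and its date is the average of the dates in that set.

```python
from collections import defaultdict


def collapse_identical(records):
    """Keep one exemplar per distinct sequence, dated at the group average.

    records: iterable of (name, sequence, date) with numeric dates.
    Returns a list of (exemplar_name, sequence, mean_date), in order of
    first appearance of each distinct sequence.
    """
    groups = defaultdict(list)
    for name, seq, date in records:
        groups[seq].append((name, date))
    collapsed = []
    for seq, members in groups.items():
        mean_date = sum(d for _, d in members) / len(members)
        collapsed.append((members[0][0], seq, mean_date))
    return collapsed
```

The first strain encountered in each group is used as the exemplar name here; any other deterministic choice would serve equally well for the dating step.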
Note that grouping identical sequences does not impact phylogeny inference (identical sequences are separated by branches of length zero), but it accelerates the computations and is consistent with our dating model, which has difficulty in dealing with branches of length zero but different dates at both extremities.
However, this simplification was not used with BEAST, which handles such data thanks to its coalescent, population genetics model. To run our dating algorithms, we first have to infer a phylogenetic tree. For both methods we analyzed the ingroup sequences only and considered both the outgroup-based rooted tree, and the unrooted tree obtained by root removal.
To improve computational efficiency, the tree topology was kept constant and equal to the topology inferred using the original data set; only the branch lengths were re-estimated from the bootstrap samples. We compared the same methods as in the previous sections, using the same options. Two independent MCMC chains were used per model, with a minimum of million generations each, sampling every 10, generations.
BEAST unifies molecular phylogenetic reconstruction with complex discrete and continuous trait evolution, divergence-time dating, and coalescent demographic models in an efficient statistical inference engine using Markov chain Monte Carlo integration.
A convenient, cross-platform, graphical user interface allows the flexible construction of complex evolutionary analyses.
If you use this software, please cite: "Fast dating using least-squares criteria and algorithms", T-H. To, M. Jung, S. Lycett, O. Gascuel, Syst Biol. 65(1).
Keywords: active-set method, algorithms, computer simulations, dating, influenza (H1N1), least-squares, linear algebra, molecular clock, serial data.
It can read and analyse contemporaneous trees where all sequences have been collected at the same time and dated-tip trees where sequences have been collected at different dates. It is designed for analysing trees that have not been inferred under a molecular-clock assumption to see how valid this assumption may be. It can also root the tree at the position that is likely to be the most compatible with the assumption of the molecular clock.
TreeTime provides routines for ancestral sequence reconstruction and inference of molecular-clock phylogenies. The LSD software includes very fast dating algorithms, based on a Gaussian model closely related to the Langley-Fitch molecular-clock model. This model is robust to uncorrelated violations of the molecular clock. The algorithms apply to serial data, where the tips of the tree have been sampled through time.
They estimate the substitution rate and the dates of all ancestral nodes. When the input tree is unrooted, they can provide an estimate for the root position, thus representing a new, practical alternative to the standard rooting methods.