QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution

Bui Quang Minh; Cuong Cao Dang; Le Sy Vinh; Robert Lanfear

doi:10.1093/sysbio/syab010

. 2021 Feb 22;70(5):1046–1060. doi: 10.1093/sysbio/syab010

QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution

Bui Quang Minh ^1,², Cuong Cao Dang ³, Le Sy Vinh ^3,^✉, Robert Lanfear ^2,^✉

Editor: Bryan Carstens

PMCID: PMC8357343 PMID: 33616668

Abstract

Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this article, we propose QMaker, a new ML method to estimate a general time-reversible Inline graphic matrix from a large protein data set consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.[Amino acid replacement matrices; amino acid substitution models; maximum likelihood estimation; phylogenetic inferences.]

Amino acid substitution models are crucial for model-based phylogenetic analyses of protein sequences, including for maximum likelihood (ML) and Bayesian inference approaches. The most commonly used protein models are Markov processes summarized in a 20-by-20 replacement matrix, denoted as Inline graphic , which describes the rates of substitutions between pairs of amino acids. Because they have so many parameters, matrices are computationally very expensive to estimate. Together with the fear of model overfitting, they are not usually estimated during a phylogenetic analysis of a single amino-acid multiple sequence alignment (MSA). Instead, the best Inline graphic matrix for each locus in a multilocus MSA is usually selected from a set of pre-estimated matrices using model selection software such as ModelFinder (Kalyaanamoorthy et al. 2017), ModelTest (Darriba et al. 2019), or PartitionFinder (Lanfear et al. 2017). Estimating matrices from large collections of empirical MSAs, where one derives the so-called empirical Inline graphic matrix that jointly explains substitution patterns across all MSAs, remains challenging both because the task is computationally expensive, and because there is no user-friendly software implementation that facilitates the task. As a result, the publication of new empirical matrices remains infrequent, and empirical phylogeneticists rarely estimate their own Inline graphic matrix even in those cases where they have sufficient data.

The first empirical Inline graphic matrices, Dayhoff (Dayhoff et al. 1978) and JTT (Jones et al. 1992), were estimated using the Maximum Parsimony (MP) principle. This approach was based on counting the minimum number of amino-acid changes along a phylogeny required to explain the MSA. The MP approach was known for underestimation of the true number of multiple amino-acid substitutions on single branches of the tree (Whelan and Goldman 2001). Although Dayhoff et al. (1978) and Jones et al. (1992) employed some adjustments to reduce this limitation, the problem remained to some extent. The advent of ML methods led to further improvements in addressing this issue. In ML, one estimates the Inline graphic matrix that maximizes the joint likelihood of observing a large collection of MSAs given independently estimated tree topologies for each MSA. The most widely used matrices, WAG (Whelan and Goldman 2001) and LG (Le and Gascuel 2008), were estimated using the ML approach. These matrices substantially improved model fit on a range of MSAs compared with the older matrices. However, the methods used to estimate the LG and WAG matrices used several approximations to make the analyses computationally feasible. For example, Whelan and Goldman (2001) ignored rate heterogeneity across sites (RHAS), although this phenomenon is widely observed in empirical MSAs. Le and Gascuel (2008) later improved this method by incorporating RHAS with a discrete Gamma distribution (Yang 1994) but using a site-rate partition model instead of the originally designed mixture model. Moreover, the Pfam database (Bateman et al. 2002) used to estimate LG has now increased 8-fold (El-Gebali et al. 2019). As the most widely used Inline graphic matrices were estimated more than a decade ago, improvements in the available data and phylogenetic inference methods suggest that it might be possible to estimate improved matrices.

Recent attention has shifted towards mixture models (Le et al. 2008; Wang et al. 2008; Le and Gascuel 2010; Le et al. 2012). In standard (i.e., nonmixture) models, different Q matrices may be applied to different sites (often called ‘partitioning’), but the likelihood of each site in the alignment is only ever calculated under a single Q matrix. In contrast, with a mixture model, the likelihood of every site in the alignment is calculated under a preassigned range of different Q matrices (the mixture), and these likelihoods are combined into a joint likelihood of the site under this mixture. However, despite their attractive properties, mixture models are still employed far more rarely in empirical analyses than their single Q matrix counterparts, probably because only a limited number of phylogenetic software tools support mixture models as standard, and because mixture models remain computationally more intensive to use than single-matrix models. For example, the papers that introduced the three most popular single-matrix models (LG, WAG, and JTT) were cited a total of 916 times in 2019, while the papers that introduced the three most popular mixture models (LG4X, CAT60, and EX) were cited a total of 39 times in the same year according to Google Scholar. In this study, we focus on improving the methods used to estimate single Q matrix models. We then estimate a suite of new single Q matrices, which can be combined together to form mixture models using the IQ-TREE software. Our methodological improvements for estimating single-matrix models lay the groundwork for future improvements in the estimation of explicit mixture models such as the LG4X model.

Here, we present QMaker, an ML method and software implementation which estimates an empirical Inline graphic matrix for any set of protein MSAs. Figure 1 shows a schematic overview of the QMaker workflow (see Materials and Methods for full details). QMaker improves upon previously published ML procedures on a number of fronts (Table 1). These include the use of the efficient ML tree search algorithm of IQ-TREE (Nguyen et al. 2015), consideration of a distribution-free model of RHAS (Kalyaanamoorthy et al. 2017), full usage of the rate mixture model, support for multiple CPU cores, and an explicit separation of training and testing data. Furthermore, we provide an easy-to-use implementation of QMaker as part of the IQ-TREE software package (http://www.iqtree.org). We employed our new software to estimate and compare six new amino acid replacement matrices: one general matrix based on version 31 of the Pfam database, and five clade-specific matrices for mammals, birds, insects, yeasts, and plants. We show that the new matrices not only fit better to the data but also influence the topologies of inferred trees.

Figure 1. — Schematic overview of QMaker consisting of five main steps: 1) initialize the current best replacement matrix as LG and the set of candidate matrices as WAG, LG, and JTT; 2) for each alignment from the total alignments, find the best-fit matrix (chosen from the list of candidate matrices), the rate heterogeneity across sites model , the best-fit tree topology and the estimated branch lengths (not shown in the figure); 3) maximize the joint log-likelihood to obtain a new matrix that best explains all ; 4) if the Pearson correlation between and is higher than 0.999, return as the best matrix for the data; and 5) otherwise, replace by , extend the set of candidate matrices with and go back to Step 2.

Table 1.

Feature comparisons between QMaker and two previously published estimation procedures (Whelan and Goldman 2001; Le and Gascuel 2008)

Feature	Whelan and Goldman	Le and Gascuel	QMaker
Tree reconstruction	Neighbor-joining (Saitou and Nei 1987) with DayhoffF distances	PhyML (Guindon and Gascuel 2003) with WAG4 model	IQ-TREE(Nguyen et al. 2015) with best-fit model
Branch length estimation	Scaled on JTTF	ML estimate	ML estimate
Rate heterogeneity across sites	No	Gamma model (Yang 1994)	Gamma(Yang 1994), invariant site (Gu et al. 1995), and free rate model (Kalyaanamoorthy et al. 2017)
Algorithm to optimize the ML parameters	Expectationmaximization	Expectation maximization	Broyden–Fletcher–Goldfarb–Shanno (Fletcher 1987)
Multicore CPU support	No	No	Yes
Explicit separation of training and testing data	No	No	Yes

Open in a new tab

Materials and Methods

Data Sets Used for Training and Testing

We downloaded a total of 16,712 Pfam MSAs from version 31 of the database (El-Gebali et al. 2019), removed identical sequences from each MSA and only retained MSAs having between 5 and 1000 sequences and at least 50 sites. This leaves us with 13,308 remaining MSAs, denoted as the Pfam data set. We randomly divided the MSAs into two groups and used one half as the training set and the other half as the test set. Moreover, we downloaded five data sets for plant (Ran et al. 2018), bird (Jarvis et al. 2015), mammal (Wu et al. 2018), insect (Misof et al. 2014), and yeast (Shen et al. 2018). For each of these data sets, we randomly selected 1000 loci as the training set and used the remaining loci as the test set. For the bird and mammal data sets, we used two nonoverlapping training sets to examine the effect of random test-set selection. For the yeast training set, we additionally subsampled to 100 taxa from the alignment due to the excessive computational burden of estimating the Inline graphic matrix from the alignment containing all of the taxa in this data set. The training set was used to estimate and while the test set is used to compare the model fit between the estimated . Details of the data sets are summarized in Table 2. All data are available from the supplementary material available at https://doi.org/10.6084/m9.figshare.9768101.

Table 2.

Summary of the data sets used to estimate new amino-acid replacement matrices

Data set	References	Seqs	Sites	Loci	Training	Testing
Pfam	El-Gebali et al. (2019)	1,150,099	3,433,343	13,308	6654	6654
Plant	Ran et al. (2018)	38	432,014	1308	1000	308
Bird	Jarvis et al. (2015)	52	4,519,041	8295	1000 2	6295
Mammal	Wu et al. (2018)	90	3,050,199	5162	1000 2	3162
Insect	Misof et al. (2014)	144	595,033	2868	1000	1868
Yeast	Shen et al. (2018)	343	1,162,805	2408	1000 100 seqs	1408

Open in a new tab

For each data set, we randomly subsampled half (Pfam) or 1000 MSAs (others) as the training set and remaining loci as the test set. For bird and plant data sets, we used two nonoverlapping training sets to examine the effect of random subsampling. For the yeast data set, we additionally subsampled 100 sequences from the training set due to the excessive computational burden.

Model of Amino Acid Substitutions

The model of amino acid substitutions follows the continuous Markov process that is stationary, reversible, and homogeneous. This process is summarized in a 20-by-20 rate matrix Inline graphic , that describes the rate of change from one amino acid to another per time unit. Because of the reversibility assumption, entries of can be written as the product of the symmetric exchangeability rates and the amino-acid frequencies :

where the diagonal entries of Inline graphic are chosen such that its row sums equal 0. Before is used for tree inference, it is divided by the normalization factor (IQ-TREE will do this automatically):

where Inline graphic is the diagonal entries of , so that the total number of substitutions is 1. Because of this constraint and also , we have 189 free exchangeability parameters and 19 free amino-acid frequency parameters, which can only be reliably estimated from a large amount of data, such as the BRKALN (Jones D., unpublished data) and Pfam (El-Gebali et al. 2019) alignment databases. Quite often the amino acid frequencies can be empirically estimated from the data set at hand, denoted by “ Inline graphic F.” For example, the WAGF model uses the exchangeability rates defined by WAG but amino acid frequencies from the current MSA.

Model of Rate Heterogeneity across Sites

It is well known that MSA sites may have evolved at different rates. The so-called rate heterogeneity across sites (RHAS) has been typically modeled by a discrete Gamma distribution (Yang 1994) w/o a proportion of invariable sites (Gu et al. 1995). For example, LG Inline graphic I means that while all sites follow the LG exchangeability matrix, a fraction of sites is invariable (i.e., with zero evolutionary rates due to e.g. selective pressure) and the rates of the remaining variable sites follow a Gamma distribution.

Recently, it has been shown that a distribution-free rate model frequently provides a better fit than the Gamma model (Kalyaanamoorthy et al. 2017). The distribution-free rate model allows several site-rate categories, where the rates and proportions of each category are independent from one another and are estimated from the data. Hence, we do not assume any prior distribution of rates across sites.

Estimating a Joint Replacement Matrix from a Protein Data Set

We now introduce a new method to estimate a replacement matrix Inline graphic from a database of protein MSAs . Here, we want to find a single that best explains, in terms of ML, the pattern of amino acid substitutions for all MSAs. We denote by the set of candidate replacement matrices (Table 3).

Table 3.

Existing amino-acid replacement matrices

Matrix	Reference	Genomic regions
Blosum62	Henikoff and Henikoff (1992)	General
Dayhoff	Dayhoff et al. (1978)	General
JTT	Jones et al. (1992)	General
LG	Le and Gascuel (2008)	General
PMB	Veerassamy et al. (2003)	General
VT	Muller and Vingron (2000)	General
WAG	Whelan and Goldman (2001)	General
mtArt	Abascal et al. (2007)	Mitochondrial
mtMam	Yang et al. (1998)	Mitochondrial
mtRev	Adachi and Hasegawa (1996)	Mitochondrial
mtZoa	Rota-Stabelli et al. (2009)	Mitochondrial
mtMet	Le et al. (2017)	Mitochondrial
mtVer	Le et al. (2017)	Mitochondrial
mtInv	Le et al. (2017)	Mitochondrial
cpRev	Adachi et al. (2000)	Chloroplast
FLU	Cuong et al. (2010)	Viral
HIVb	Nickle et al. (2007)	Viral
HIVw	Nickle et al. (2007)	Viral
rtREV	Dimmic et al. (2002)	Viral

Open in a new tab

We first determine, for each Inline graphic , the best-fit matrix , the best RHAS model (e.g., I and R5) using ModelFinder (Kalyaanamoorthy et al. 2017), and the ML tree with the set of branch lengths using IQ-TREE (Nguyen et al. 2015). Next, we fix and to estimate the that maximizes the total log-likelihood across all MSAs in Inline graphic :

To estimate the 208 parameters of Inline graphic (189 exchangeabilities plus 19 frequencies ), we use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm, an iterative, hill-climbing method, specifically for solving unconstrained nonlinear optimization problem (Fletcher 1987). This optimization results in an estimate of a new rate matrix denoted as Inline graphic . Some MSAs may now show as the better-fit model. Therefore, we extend the set of candidate rate matrices by and repeat the procedure above to re-estimate

The overall workflow of QMaker is as follows (Fig. 1):

Initialize the set of candidate replacement matrices as and the current best matrix as . Here, we use three commonly used matrices to initialize , but in principle can be initialized with any set of matrices.
For each MSA , find the best-fit matrix , rate heterogeneity across sites model and estimate the ML tree with branch lengths based on and (note that an edge-linked model sharing the topology is used for the five clade-specific data sets estimated below). Step 2 is necessary to obtain the initial topology and branch lengths for every locus, with which the new matrix are then estimated in Step 3.
Given and , jointly estimate and that maximizes the log-likelihood function (1), resulting in a new replacement matrix . Specifically,
- (a) estimate given , and
- (b) estimate given , and
- (c) if the log-likelihood of and is increased more than 0.1 (compared to the log-likelihood of previous and , go to Step 3b, otherwise, go to Step 4.
Following Le and Gascuel, let be the Pearson correlation coefficient between and . If report as the best replacement matrix for the database and stop. Otherwise, go to Step 5.
Assign , add to the set of candidate matrices: , and go back to Step 2.

Comparisons with Previous Estimation Procedures

Compared with the ML procedures used to estimate WAG (Whelan and Goldman 2001) and LG (Le and Gascuel 2008), QMaker has a number of differences (Table 1). Among others, Whelan and Goldman (2001) omitted rate heterogeneity across sites and employed neighbor-joining trees for computational efficiency. Le and Gascuel (2008) improved this method by incorporating the Inline graphic model of rate heterogeneity and inferring the ML tree with PhyML (Guindon and Gascuel 2003). However, they did not use the original as a mixture model when estimating . Rather, they partitioned the sites in each into their most likely rate category resulting in four sub-MSAs for each Inline graphic , and essentially applied the method of Whelan and Goldman to derive the matrix from the expanded .

Here, QMaker improves both methods by i) additionally considering the free rate and invariant site mixture models; ii) inferring the ML tree with IQ-TREE, which has been shown to outperform PhyML (Guindon and Gascuel 2003) and other software (Zhou et al. 2018); and iii) directly optimizing the log-likelihood function (1) to obtain Q instead of aforementioned approximations.

We note that because log-likelihoods are summed across sites, larger genes may have more influence on the model parameters than smaller genes. However, given that Inline graphic matrices are typically estimated from large data sets comprising a large number of genes, it is unlikely that any single gene will have an undue influence on the model. Regardless, summing likelihood across sites ensures that every site in the analysis is treated with a weight proportional to the amount of information it contains.

Software Implementation

We provided an implementation of QMaker as part of the IQ-TREE software version 2.0-rc1. The entire training stage for the Pfam data set can be accomplished with just two command lines. The first one is

iqtree -S ALN_DIR -nt 48

to find the best-fit models and ML trees for all MSAs residing in the folder ALN_DIR; -nt option is to specify the number of CPU cores. Note that for this study, due to the excessive size of the Pfam training set, we additionally used two options: -mset LG,WAG,JTT to consider only these three models and -cmax 4 to restrict up to four categories for the rate heterogeneity across sites model. The second command line is

iqtree -S ALN_DIR.best_model.nex -nt 48 -te ALN_DIR.treefile –model-joint GTR20+FO

to perform Step 3 of estimating the replacement matrix (GTR20 for general time reversible model with 20-state data), given the best models (ALN_DIR.best_model.nex) and best trees (ALN_DIR.treefile) found above.

For the five clade-specific data sets, to count best-fit models we performed an edge-linked partition model, which assumes a single tree topology and rescales edge lengths across the loci. This model was shown to best balance between schemes of model parameterization (Duchene et al. 2019). For this purpose, the -S option is changed to -p.

To test the model fit of the trained Inline graphic matrices we ran ModelFinder (Kalyaanamoorthy et al. 2017) as implemented in IQ-TREE:

iqtree -S TEST_DIR -m MF -mset JTT,WAG,LG,Q.pfam,

Q.plant,Q.bird,Q.mammal,Q.insect,Q.yeast,

where TEST_DIR is a directory containing the testing MSAs.

Results

QMaker Outperforms Existing Estimation Methods

To establish whether the approach, we propose here improves upon previously suggested approaches, we compared QMaker (Fig. 1; Materials and Methods) to the method used to estimate the LG matrix (Le and Gascuel 2008) on the same training data. Because both methods use the same data to estimate a Inline graphic matrix, any differences between the matrices must be due solely to the estimation procedure.

To compare the two approaches, we first obtained the training set of 3412 Pfam MSAs originally used to estimate the LG matrix (http://www.atgc-montpellier.fr/models/index.php?model=lg) and then applied QMaker to estimate a new Inline graphic matrix from these data. We called the resulting matrix Q.LG to reflect the origin of the data set. QMaker took about 28 h wall-clock time using 36 CPU cores on a 2.3-GHz server to estimate Q.LG.

To compare the performance of the LG and Q.LG matrix, we asked how frequently each matrix was selected as the best-fit model for the 500 test MSAs originally used to test the LG matrix (Le and Gascuel 2008). To do this, we calculated the best-fit model for each MSA from a set of four candidate Inline graphic matrices comprised of the two comparator matrices LG and Q.LG, plus two other frequently used matrices: WAG (Whelan and Goldman 2001) and JTT (Jones et al. 1992). Q.LG was the most frequently selected matrix (246 MSAs), followed by LG (166 MSAs), WAG (55 MSAs), and JTT (33 MSAs). If we focus only on the relative fit of Q.LG and LG, we find that Q.LG is the best-fit model for 281 MSAs (56%) while LG is the best-fit model for 219 MSAs (44%), again confirming that the Q.LG model slightly outperforms the LG model on these data. Among the 246 MSAs that Q.LG is the best, 84 showed that Q.LG is significantly better the other three matrices (LG, WAG, and JTT) according to the approximately unbiased (AU, Inline graphic ) test (Shimodaira 2002). The AU test is appropriate here because all of the models being compared have the same number of parameters. The AU compares the site likelihoods calculated under each model to ask whether the likelihoods from the better model significantly better than those from the worse model. The AU test was done using consel package (Shimodaira and Hasegawa 2001; Shimodaira 2002).

To better understand which of the differences between the method used to estimate the LG matrix and QMaker most contributed to QMaker improved performance, we benchmarked three key differences between the two methods: i) tree reconstruction, ii) models of RHAS, and iii) parameter optimization technique (Table 1). To evaluate the contribution of each of these improvements, we replace each improvement with the approach previously used in the LG procedure. For example, to evaluate our improved approach to tree reconstruction, we changed the tree reconstruction method to use PhyML (Guindon and Gascuel 2003) instead of IQ-TREE while keeping the rest of the pipeline unchanged. This resulted in three new Q matrices, which we name for the component that they were estimated to benchmark: i) Q Inline graphic , ii) Q, and iii) Q. On the 500 test MSAs, Q.LG is better than Q, Q, and Q for 343 (69%), 331 (66%), and 294 (59%) MSAs, respectively. This reveals that the improvements in tree reconstruction and the modeling of rate heterogeneity across sites led to the largest improvements for QMaker, and that changing the optimization technique had the smallest but still nonnegligible influence.

These results demonstrate that the QMaker method substantially improves the fit between models and data compared to previous estimation procedures. For the sake of reproducibility, we provide the Q.LG matrix in the Supplementary material. We do not, however, intend for the Q.LG matrix to be widely used, as it is estimated from a now-outdated version of the Pfam database.

Larger Amino Acid Databases Improve Model Fit, but Primarily to Target Alignments

Given these improvements we applied QMaker to estimate a new amino-acid substitution matrix from the latest version of the Pfam database, and call the resulting matrix Q.pfam. We estimated Q.pfam from a training set of half of the MSAs (6654 MSAs in total) available in Pfam database version 31 (El-Gebali et al. 2019), reserving the remaining half of the database as a test set with which to compare Q.pfam to three previously estimated matrices (LG, WAG, and JTT).

Q.pfam outperformed other matrices on the test MSAs. Q.pfam was the most frequently selected matrix in 40.7% of the test MSAs, followed by LG (35.5%), JTT (14.7%), and WAG (9.1%).

We further tested the new matrix on a collection 13,041 single-locus MSAs from five recently published phylogenomic data sets (Table 2). To do this, we compared the fit of the same four models (Q.pfam, LG, WAG, and JTT) to each of the 13,041 empirical MSAs. Surprisingly, the most commonly selected matrix across all 13,041 MSAs was JTT (74.9%), followed by LG (5.9%), Q.pfam (3.0%), and WAG (2.0%). The JTT matrix was the most commonly selected matrix for three out of the five data sets (birds, plants, and mammals; Appendix Fig. 1), and the LG matrix was the most commonly selected matrix for the remaining two data sets (insects and yeasts; Appendix Fig. 1). This shows that amino acid models estimated from the Pfam database (Q.pfam, LG) often fail to provide the best fit to MSAs used for phylogenomic inference on commonly studied clades.

Five New Clade-Specific Q Matrices Improve Model Fit on Phylogenomic Data

The surprisingly poor fit of Pfam-based matrices (Q.pfam, LG) to the empirical MSAs of birds, mammals and plants, combined with the high variation in the identity of the best model in each data set, suggests that there may be substantial between-clade variation in the way that proteins used for phylogenetic inference evolve. If this is the case, then accounting for this by estimating independent Inline graphic matrices for each clade should improve model fit. To test this, we estimated a clade-specific matrix for each of the five phylogenomic data sets of nuclear sequences (Table 2): Q.plant, Q.bird, Q.mammal, Q.insect, and Q.yeast. For each data set, we used 1000 training MSAs to estimate the Inline graphic matrix, and the remaining MSAs from each data set as test sets (see Materials and Methods for more details). The time taken to estimate these matrices depended on the size of the loci and the number of taxa in the data set, but using 15 CPU cores on a 2.3 GHz server the times ranged from 68 h for the plant data set to 385 h for the insect data set.

Figure 2 shows the frequency with which each of the six new (Q.pfam, Q.plant, Q.bird, Q.mammal, Q.insect, and Q.yeast) and three existing (JTT, WAG, and LG) matrices were selected as the best-fit for the six test sets. As expected, the best fit Inline graphic matrix for each test set was the matrix estimated from the corresponding training set, although the strength of the association varied widely among test data sets. For example, Q.plant was the best model for 92.2% of plant test MSAs, with the next best model selected for fewer than 5% of test MSAs. But Q.pfam was only selected as the best model for 34.1% of the Pfam test MSAs, with the next best model (LG) selected for 24.2% of MSAs. These results are likely driven in part by the fact that the set of models we considered included many models that are similar to Q.pfam (e.g., LG and WAG), but few that are similar to Q.plant.

We then applied the approximately unbiased (AU, Inline graphic ) test (Shimodaira 2002) to count how frequently each clade-specific model is significantly better than Q.pfam on the corresponding test set. The results (Table 4) show that in the overwhelming majority of cases, the clade-specific model fits the data significantly better than the Q.pfam model. Remarkably, this is even the case for the Q.yeast matrix, which is highly similar to the Q.pfam matrix (Pearson correlation Inline graphic ) but has 135/190 exchangeabilities smaller than those of Q.pfam (Appendix Fig. 2). This reveals that even small differences in the Q matrix can lead to statistically significant differences in model fit.

Table 4.

Likelihood comparisons between Q.pfam, Q.plant, Q.bird, Q.mammal, Q.insect, and Q.yeast

Compared model	Q.pfam is better	Compared model is better
Q.plant	4 (1)	304 (288)
Q.bird	468 (110)	5827 (5106)
Q.mammal	94 (24)	3068 (2872)
Q.insect	222 (62)	1646 (1291)
Q.yeast	72 (25)	1336 (1227)

Open in a new tab

Note: The second column shows the number of alignments having higher likelihood with Q.pfam than the comparing model and the number of alignments with significantly higher likelihood in parentheses (AU test, Inline graphic ). The third column shows the number of alignments where the compared model has higher likelihood than Q.pfam, with the number of significant cases in parentheses.

Principle Components Analysis Reveal the Landscape of Amino Acid Models

We used principle components analysis (PCA) to compare the properties of the six new amino acid models presented here to 19 previously estimated models (Table 3). The PCA plot of the Inline graphic matrices (Fig. 3a) shows a clear separation between matrices inferred from the nuclear, mitochondrial, chloroplast and viral genomes and the clade-specific matrices, with the clade-specific matrices falling between the mitochondrial and viral matrices, and the two Pfam-based matrices (LG, Q.pfam) in close proximity. The PCA plot of the models’ amino acid frequencies (Fig. 3b) reveals that most of the variation among frequency vectors comes from differences between and within the viral and mitochondrial models, with more limited separation between the clade-specific matrices (Q.bird, Q.plant, Q.insect, Q.mammal, and Q.yeast) and the general-purpose matrices (LG, WAG, JTT, and Q.pfam).

Figure 3. — Principle component analysis (PCA) of all matrices with respect to a) the amino acid exchangeabilities and b) the amino acid frequencies.

Figure 4 shows visual comparison of exchangeabilities between LG and six new models. We found that, as expected, Q.pfam and LG are highly correlated (Pearson correlation Inline graphic ) but many exchangeabilities of Q.pfam are larger than those of LG (137/190). In particular, all exchangeabilities involving Cysteine (C) or Proline (P) are larger for Q.pfam while many entries involving Tryptophan (W) are a little larger for LG. Q.pfam is also highly correlated to Q.yeast (Pearson correlation Inline graphic ). Supplementary Table S1 shows correlation values between six new matrices and 20 existing matrices.

Incorporating the New Matrices into Model Selection Changes Locus-Tree Inference

To examine whether the six new matrices, we propose here affect the inference of phylogenetic trees, we asked how often the new matrices affected the phylogenetic tree when they were selected as the best model. For each single-locus MSA in each data set, if one of the new models was selected as best model, we inferred the ML tree using the new model which we denoted Inline graphic . We then compared this tree to the tree inferred for the same MSA using the best-fit model among JTT, WAG, and LG, which we denoted . Differences between and could come from two sources: the effects of using a different amino acid substitution model or the stochasticity in tree search. The stochasticity in tree search arises because each gene is short, hence the phylogenetic signal islow, therefore the estimate of the ML tree can be somewhat unstable such that two different runs may produce different trees. To decouple these two factors, we performed another independent tree search to infer Inline graphic in the same way as we inferred but using a different random-number seed. If is different from , then the difference is merely due to tree search stochasticity. For each data set, we then compared the distribution of normalized Robinson–Foulds (nRF) (Robinson and Foulds 1981) distances between Inline graphic and to the distribution of the nRF distances between and (Fig. 5, left-hand column). The extent to which nRF distances between and are larger than those between and indicates the extent to which the new model affects tree inference, independently of stochasticity in the tree search. We used normalized RF distance, which is RF divided by 2*( Inline graphic ), where is the number of taxa, because this procedure allows us to compare tree distances between data sets with different numbers of taxa. nRF distances always scale between 0 and 1, regardless of the number of taxa. To ensure that this procedure works as we expect, we also made the same comparisons in cases where the best-fit model was not one of the new matrices we infer here. In this case, Inline graphic , , and are all inferred from the same model such that all differences between the trees are due to stochasticity in the tree search. As a result, we expect in this case that the distribution of nRF distances between and should be the same as the distribution between and (Fig. 5, right-hand column).

Figure 5 confirms that the nRF distances between Inline graphic and are moderately higher than those between and , indicating that using the newly proposed (and better fit) models changes locus tree topologies in every data set. In fact, the two distributions are significantly different for all data sets ( from a Kolmogorov–Smirnov test comparing the two distributions in each data set), indicating that the new models of evolution affect a nontrivial number of single-locus tree topologies in every data set. Of note, Figure 5 shows that in some data sets the topologies of a large number of loci differ considerably depending only on the random-number seed. For example, in both the Bird and Insect data sets, a considerable fraction of the topologies differ by more than half of the splits in the tree (nRF Inline graphic ). Manual inspection of a subset of these loci revealed that they tend to be very short and uninformative loci, such that many splits in the gene trees are supported by no substitutions and so resolved randomly depending on the random-number seed.

We also looked at the tree lengths of Inline graphic and and found that is often longer than (Table 5). For example, on the Plant test set, is longer than in 293/308 cases. The average length of and T are 5.104 and 4.665, respectively, suggesting that the new matrices allow us to infer, on average, 9.4% more substitutions than the existing matrices. This increase in branch length results from the combination of the changes in the transition rates and the amino acid frequencies, because the new trees were estimated using both the new transition matrix and its associated empirical amino acid frequencies.

Table 5.

The average lengths of trees inferred using new Q matrices ( Inline graphic ) and that using existing matrices ()

Data set
Plant	5.104	4.665	293/308
Bird	2.341	2.276	4698/6295
Mammal	6.091	5.786	2509/3162
Insect	39.873	38.291	1198/1868
Yeast	86.761	83.606	1057/1408

Open in a new tab

Note: The last column shows the number of cases when Inline graphic is longer than / the total number of cases.

The cases we discuss here are those in which the information-theoretic approaches (here we use the BIC) suggest that the new models are a better fit to the data than the old models. Since, we should prefer the tree topologies and branch lengths estimated under the best model we have, to the extent that information-theoretic approaches to model selection in phylogenetics are accurate (Sullivan and Joyce 2005), our results suggest that the custom-matrices we infer with QMaker make meaningful improvements to both the topologies and branch lengths of a large number of inferred single-locus trees. These differences are likely to impact downstream analyses such as molecular dating and species-tree estimates. For example, species trees estimated with the Multi-Species Coalescent (MSC) model in software such as ASTRAL can be sensitive to relatively small changes in the gene trees used as input data (Sayyari et al. 2017). It is therefore plausible that using improved models of sequence evolution like those that we estimate in this study could have effects on the species trees and divergence dates estimated from amino acid data.

Discussion

In this study, we describe and implement QMaker, an easy-to-use tool to estimate an amino acid replacement matrix Q for any data set of one or more amino-acid alignments. Phylogenetic inference from amino acid alignments relies heavily on precomputed Inline graphic matrices. It is a little surprising, therefore, that new matrices are published relatively infrequently (e.g., Table 3), particularly in the age of phylogenomics when an increasing number of studies collect sufficient data to reliably estimate a matrix. We hope that the development of QMaker will democratize the inference of Inline graphic matrices, and that this will lead to concomitant improvements in phylogenetic inference and our understanding of molecular evolution. Similarly, we hope that the publication of five new matrices for some highly studied clades will improve phylogenetic inference for those clades.

The approach we implement in QMaker builds on previously described approaches (Whelan and Goldman 2001; Le and Gascuel 2008), and our analyses reveal that it improves on them in terms of model fit to the data. We applied QMaker to estimate one general-purpose Inline graphic matrix and five clade-specific matrices for mammals, plants, birds, insects, and yeasts. We showed that they not only improve the fit between the model and the data but also influence the tree topologies. We note that inferring topologies for some specific organisms is not the main focus of this article. Practitioners are encouraged to use QMaker to infer new Inline graphic matrices and phylogenies for their own data of interest. All of the new matrices are now implemented in IQ-TREE version 2 (Minh et al. 2020) and incorporated as part of the model selection procedure, and the data necessary to implement all matrices in other phylogenetic software packages are provided in Supplementary material.

When estimating a new empirical model of amino acid substitution, it is important to consider how much data is required. There is no single answer to this question that can apply across data sets because different data sets will contain different amounts of information. Nevertheless, we can ask whether a Q Matrix estimated from a data set is useful. An amino acid replacement matrix requires the estimation of 208 parameters, representing the transition rates between pairs of amino acids and the frequencies of each amino acid in the data. Thus, a general rule of thumb is that the data should contain enough estimated substitutions of each type to reliably estimate relative transition rates. A pragmatic approach to establishing whether this is the case is to follow the approach we outline in the methods of this article. Namely, first split the loci in the data set into training and testing sets, then apply QMaker to the training set to obtain a new replacement matrix. Next estimate the AICc or BIC scores of the new matrix and the standard precomputed matrices using IQ-TREE. If the training data contain sufficient information to estimate a useful amino-acid replacement matrix, this will be revealed by the new matrix being the best-fit matrix for a substantial fraction of the loci in the testing set.

The relationships among the 19 existing Inline graphic matrices and the 7 new matrices we present here (Fig. 3) reveal a number of interesting patterns. As expected, there is a clear distinction between matrices estimated from different genomes, with the general-purpose matrices estimated from large data sets of protein alignments from the nuclear genome tending to cluster tightly together (Fig. 3). More surprising is the observation that the five new clade-specific Inline graphic matrices we estimate here tend to be quite distinct from all other matrices and are also remarkably distinct from one another. This result, combined with our observations that the clade-specific matrices tend to improve model fit and affect tree inference, highlight the potential benefits of inferring a clade-specific Inline graphic matrix before inferring a phylogeny. The differences among the clade-specific matrices also hint at potentially significant differences between the molecular evolutionary processes driving protein evolution in different clades of organisms.

Our results underscore previous recommendations that software for estimating phylogenetic trees should be run multiple times to help ensure that the estimate of the ML tree is as accurate as possible (Zhou et al. 2018). As part of our approach to comparing matrices, we ran IQ-TREE twice on the each of thousands of single-locus alignments, using different random-number seeds. We routinely observed differences in the phylogenies estimated from these replicate runs, as is expected for any heuristic search algorithm that is provided with limited information (in this case, single-locus alignments) with which to search for the optimal solution among an astronomical number of potential solutions.

The sometimes substantial variation in best-fit model for different loci from a single data set (Fig. 2) confirms that there can also be substantial variation in molecular evolution among loci. Thus, although QMaker allows researchers to infer a single Inline graphic matrix from a collection of alignments, it still seems sensible to infer phylogenies in a framework that allows for different matrices to be applied to different loci, such as by using a partitioned model (Lanfear et al. 2012; Chernomor et al. 2015) or mixture models. This can be achieved in the most commonly used phylogenetic inference software, including in IQ-TREE (Nguyen et al. 2015), RaxML (Stamatakis 2014), and PhyML (Guindon and Gascuel 2003).

The QMaker framework opens new avenues of research by simplifying the process of inferring a single Inline graphic matrix but is currently limited to estimating a single reversible matrix from one or more amino acid alignments. In principle, both of these limitations could be relaxed, for example by extending the QMaker approach to infer nonreversible matrices (e.g., Minh et al. 2020) and/or mixtures of Inline graphic matrices from amino acid alignments, for example, as was done to infer the LG4M and LG4X mixtures of matrices (Le et al. 2012). Both of these approaches have the potential to further improve phylogenetic inference beyond the developments that we present here.

Appendix

Supplementary material

Supplementary materials are available online from: https://doi.org/10.6084/m9.figshare.9768101.

Funding

This research was funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED; [102.01.2019.06 to B.Q.M., C.C.D., and L.S.V.], an Australian National University Futures Grant to R.L., an Australian Research Council Discovery Grant [DP200103151 to R.L. and B.Q.M.], and a Chan-Zuckerberg Initiative Grant for Essential Open Source Software for Science to B.Q.M. and R.L.

References

Abascal F., Posada D., Zardoya R. 2007. MtArt: a new model of amino acid replacement for arthropoda. Mol. Biol. Evol. 24(1):1–5. [DOI] [PubMed] [Google Scholar]
Adachi J., Hasegawa M. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42(4):459–468. [DOI] [PubMed] [Google Scholar]
Adachi J., Waddell P.J., Martin W., Hasegawa M. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J. Mol. Evol. 50(4):348–358. [DOI] [PubMed] [Google Scholar]
Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R., Griffiths-Jones S., Howe K.L., Marshall M., Sonnhammer E.L. 2002. The Pfam protein families database. Nucleic Acids Res. 30(1):276–280. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chernomor O., Minh B.Q., von Haeseler A. 2015. Consequences of common topological rearrangements for partition trees in phylogenomic inference. J. Comput. Biol. 22(12):1129–1142. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cuong C.D., Le Q.S., Gascuel O., Vinh S.L. 2010. FLU, an amino acid substitution model for influenza proteins. BMC Evol. Biol. 10(99):1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
Darriba D., Posada D., Kozlov A.M., Stamatakis A., Morel B., Flouri T. 2019. ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Mol. Biol. Evol. 37(1):291–294. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dayhoff M.O., Schwartz R.M., Orcutt B.C. 1978. A model for evolutionary change in proteins. Atlas Protein Sequence Struct. 5:345–352. [Google Scholar]
Dimmic M.W., Rest J.S., Mindell D.P., Goldstein R.A. 2002. rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J. Mol. Evol. 55(1):65–73. [DOI] [PubMed] [Google Scholar]
Duchene D.A., Tong K.J., Foster C.S.P., Duchene S., Lanfear R., Ho S.Y.W. 2019. Linking branch lengths across sets of loci provides the highest statistical support for phylogenetic inference. Mol. Biol. Evol. 37(4):1202–1210. [DOI] [PubMed] [Google Scholar]
El-Gebali S., Mistry J., Bateman A., Eddy S.R., Luciani A., Potter S.C., Qureshi M., Richardson L.J., Salazar G.A., Smart A., Sonnhammer E.L.L., Hirsh L., Paladin L., Piovesan D., Tosatto S.C.E., Finn R.D. 2019. The Pfam protein families database in 2019. Nucleic Acids Res. 47(D1):D427-D432. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fletcher R. 1987. Practical methods of optimization. New York (NY): Wiley. [Google Scholar]
Gu X., Fu Y.X., Li W.H. 1995. Maximum-likelihood-estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12(4):546–557. [DOI] [PubMed] [Google Scholar]
Guindon S., Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52(5):696–704. [DOI] [PubMed] [Google Scholar]
Henikoff S., Henikoff J.G. 1992. Amino-acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89(22):10915–10919. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jarvis E.D., Mirarab S., Aberer A.J., Li B., Houde P., Li C., Ho S.Y.W., Faircloth B.C., Nabholz B., Howard J.T., Suh A., Weber C.C., da Fonseca R.R., Alfaro-Nunez A., Narula N., Liu L., Burt D., Ellegren H., Edwards S.V., Stamatakis A., Mindell D.P., Cracraft J., Braun E.L., Warnow T., Jun W., Gilbert M.T.P., Zhang G.J., The Avian Phylogenomics Consortium. 2015. Phylogenomic analyses data of the avian phylogenomics project. Gigascience 4(1):1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jones D.T., Taylor W.R., Thornton J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8(3):275–282. [DOI] [PubMed] [Google Scholar]
Kalyaanamoorthy S., Minh B.Q., Wong T.K.F., von Haeseler A., Jermiin L.S. 2017. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14(6):587–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lanfear R., Calcott B., Ho S.Y., Guindon S. 2012. PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol. Biol. Evol. 29(6):1695–1701. [DOI] [PubMed] [Google Scholar]
Lanfear R., Frandsen P.B., Wright A.M., Senfeld T., Calcott B. 2017. PartitionFinder 2: new methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses. Mol. Biol. Evol. 34(3):772–773. [DOI] [PubMed] [Google Scholar]
Le S.Q., Dang C.C., Gascuel O. 2012. Modeling protein evolution with several amino acid replacement matrices depending on site rates. Mol. Biol. Evol. 29(10):2921–2936. [DOI] [PubMed] [Google Scholar]
Le S.Q., Gascuel O. 2008. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25(7):1307–1320. [DOI] [PubMed] [Google Scholar]
Le S.Q., Gascuel O. 2010. Accounting for solvent accessibility and secondary structure in protein phylogenetics is clearly beneficial. Syst. Biol. 59(3):277–287. [DOI] [PubMed] [Google Scholar]
Le S.Q., Lartillot N., Gascuel O. 2008. Phylogenetic mixture models for proteins. Philos. Trans. R. Soc. Lond. B 363(1512):3965–3976. [DOI] [PMC free article] [PubMed] [Google Scholar]
Le V.S., Dang C.C., Le Q.S. 2017. Improved mitochondrial amino acid substitution models for metazoan evolutionary studies. BMC Evol. Biol. 17(136):1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
Minh B.Q., Schmidt H.A., Chernomor O., Schrempf D., Woodhams M.D., von Haeseler A., Lanfear R. 2020. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37(5):1530–1534. [DOI] [PMC free article] [PubMed] [Google Scholar]
Misof B., Liu S., Meusemann K., Peters R.S., Donath A., Mayer C., Frandsen P.B., Ware J., Flouri T., Beutel R.G., Niehuis O., Petersen M., Izquierdo-Carrasco F., Wappler T., Rust J., Aberer A.J., Aspock U., Aspock H., Bartel D., Blanke A., Berger S., Bohm A., Buckley T.R., Calcott B., Chen J., Friedrich F., Fukui M., Fujita M., Greve C., Grobe P., Gu S., Huang Y., Jermiin L.S., Kawahara A.Y., Krogmann L., Kubiak M., Lanfear R., Letsch H., Li Y., Li Z., Li J., Lu H., Machida R., Mashimo Y., Kapli P., McKenna D.D., Meng G., Nakagaki Y., Navarrete-Heredia J.L., Ott M., Ou Y., Pass G., Podsiadlowski L., Pohl H., von Reumont B.M., Schutte K., Sekiya K., Shimizu S., Slipinski A., Stamatakis A., Song W., Su X., Szucsich N.U., Tan M., Tan X., Tang M., Tang J., Timelthaler G., Tomizuka S., Trautwein M., Tong X., Uchifune T., Walzl M.G., Wiegmann B.M., Wilbrandt J., Wipfler B., Wong T.K., Wu Q., Wu G., Xie Y., Yang S., Yang Q., Yeates D.K., Yoshizawa K., Zhang Q., Zhang R., Zhang W., Zhang Y., Zhao J., Zhou C., Zhou L., Ziesmann T., Zou S., Li Y., Xu X., Zhang Y., Yang H., Wang J., Wang J., Kjer K.M., Zhou X. 2014. Phylogenomics resolves the timing and pattern of insect evolution. Science 346(6210):763–767. [DOI] [PubMed] [Google Scholar]
Muller T., Vingron M. 2000. Modeling amino acid replacement. J. Comput. Biol. 7(6):761–776. [DOI] [PubMed] [Google Scholar]
Nguyen L.T., Schmidt H.A., von Haeseler A., Minh B.Q. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32(1): 268–274. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nickle D.C., Heath L., Jensen M.A., Gilbert P.B., Mullins J.I., Pond S.L.K. 2007. HIV-specific probabilistic models of protein evolution. PLoS One 2(6):1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ran J.H., Shen T.T., Wang M.M., Wang X.Q. 2018. Phylogenomics resolves the deep phylogeny of seed plants and indicates partial convergent or homoplastic evolution between Gnetales and angiosperms. Proc. R. Soc. B 285(1881):1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Robinson D.F., Foulds L.R. 1981. Comparison of phylogenetic trees. Math. Biosci. 53(1–2):131–147. [Google Scholar]
Rota-Stabelli O., Yang Z.H., Telford M.J. 2009. MtZoa: a general mitochondrial amino acid substitutions model for animal evolutionary studies. Mol. Phylogenet. Evol. 52(1):268–272. [DOI] [PubMed] [Google Scholar]
Saitou N., Nei M. 1987. The neighbor-joining method - a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4):406–425. [DOI] [PubMed] [Google Scholar]
Sayyari E., Whitfield J.B., Mirarab S. 2017. Fragmentary gene sequences negatively impact gene tree and species tree reconstruction. Mol. Biol. Evol. 34(12):3279–3291. [DOI] [PubMed] [Google Scholar]
Shen X.X., Opulente D.A., Kominek J., Zhou X., Steenwyk J.L., Buh K.V., Haase M.A.B., Wisecaver J.H., Wang M., Doering D.T., Boudouris J.T., Schneider R.M., Langdon Q.K., Ohkuma M., Endoh R., Takashima M., Manabe R., Cadez N., Libkind D., Rosa C.A., DeVirgilio J., Hulfachor A.B., Groenewald M., Kurtzman C.P., Hittinger C.T., Rokas A. 2018. Tempo and mode of genome evolution in the budding yeast subphylum. Cell 175(6):1533. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shimodaira H. 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51(3):492–508. [DOI] [PubMed] [Google Scholar]
Shimodaira H., Hasegawa M. 2001. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 12:1246–1247. [DOI] [PubMed] [Google Scholar]
Sullivan J., Joyce P. 2005. Model selection in phylogenetics. Annu. Rev. Ecol. Evol. Syst. 36(1):445–466. [Google Scholar]
Veerassamy S., Smith A., Tillier E.R.M. 2003. A transition probability model for amino acid substitutions from blocks. J. Comput. Biol. 10(6):997–1010. [DOI] [PubMed] [Google Scholar]
Wang H.-C., Li K., Susko E., Roger A.J. 2008. A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC Evol. Biol. 8(1):331. [DOI] [PMC free article] [PubMed] [Google Scholar]
Whelan S., Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18(5):691–699. [DOI] [PubMed] [Google Scholar]
Wu S.Y., Edwards S., Liu L. 2018. Genome-scale DNA sequence data and the evolutionary history of placental mammals. Data Brief 18:1972–1975. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39(3):306–314. [DOI] [PubMed] [Google Scholar]
Yang Z.H., Nielsen R., Hasegawa M. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15(12):1600–1611. [DOI] [PubMed] [Google Scholar]
Zhou X.F., Shen X.X., Hittinger C.T., Rokas A. 2018. Evaluating fast maximum likelihood-based phylogenetic programs using empirical phylogenomic data sets. Mol. Biol. Evol. 35(2):486–503. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B1] Abascal F., Posada D., Zardoya R. 2007. MtArt: a new model of amino acid replacement for arthropoda. Mol. Biol. Evol. 24(1):1–5. [DOI] [PubMed] [Google Scholar]

[B2] Adachi J., Hasegawa M. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42(4):459–468. [DOI] [PubMed] [Google Scholar]

[B3] Adachi J., Waddell P.J., Martin W., Hasegawa M. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J. Mol. Evol. 50(4):348–358. [DOI] [PubMed] [Google Scholar]

[B4] Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R., Griffiths-Jones S., Howe K.L., Marshall M., Sonnhammer E.L. 2002. The Pfam protein families database. Nucleic Acids Res. 30(1):276–280. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5] Chernomor O., Minh B.Q., von Haeseler A. 2015. Consequences of common topological rearrangements for partition trees in phylogenomic inference. J. Comput. Biol. 22(12):1129–1142. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6] Cuong C.D., Le Q.S., Gascuel O., Vinh S.L. 2010. FLU, an amino acid substitution model for influenza proteins. BMC Evol. Biol. 10(99):1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B7] Darriba D., Posada D., Kozlov A.M., Stamatakis A., Morel B., Flouri T. 2019. ModelTest-NG: a new and scalable tool for the selection of DNA and protein evolutionary models. Mol. Biol. Evol. 37(1):291–294. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] Dayhoff M.O., Schwartz R.M., Orcutt B.C. 1978. A model for evolutionary change in proteins. Atlas Protein Sequence Struct. 5:345–352. [Google Scholar]

[B9] Dimmic M.W., Rest J.S., Mindell D.P., Goldstein R.A. 2002. rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J. Mol. Evol. 55(1):65–73. [DOI] [PubMed] [Google Scholar]

[B10] Duchene D.A., Tong K.J., Foster C.S.P., Duchene S., Lanfear R., Ho S.Y.W. 2019. Linking branch lengths across sets of loci provides the highest statistical support for phylogenetic inference. Mol. Biol. Evol. 37(4):1202–1210. [DOI] [PubMed] [Google Scholar]

[B11] El-Gebali S., Mistry J., Bateman A., Eddy S.R., Luciani A., Potter S.C., Qureshi M., Richardson L.J., Salazar G.A., Smart A., Sonnhammer E.L.L., Hirsh L., Paladin L., Piovesan D., Tosatto S.C.E., Finn R.D. 2019. The Pfam protein families database in 2019. Nucleic Acids Res. 47(D1):D427-D432. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] Fletcher R. 1987. Practical methods of optimization. New York (NY): Wiley. [Google Scholar]

[B13] Gu X., Fu Y.X., Li W.H. 1995. Maximum-likelihood-estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12(4):546–557. [DOI] [PubMed] [Google Scholar]

[B14] Guindon S., Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52(5):696–704. [DOI] [PubMed] [Google Scholar]

[B15] Henikoff S., Henikoff J.G. 1992. Amino-acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89(22):10915–10919. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] Jarvis E.D., Mirarab S., Aberer A.J., Li B., Houde P., Li C., Ho S.Y.W., Faircloth B.C., Nabholz B., Howard J.T., Suh A., Weber C.C., da Fonseca R.R., Alfaro-Nunez A., Narula N., Liu L., Burt D., Ellegren H., Edwards S.V., Stamatakis A., Mindell D.P., Cracraft J., Braun E.L., Warnow T., Jun W., Gilbert M.T.P., Zhang G.J., The Avian Phylogenomics Consortium. 2015. Phylogenomic analyses data of the avian phylogenomics project. Gigascience 4(1):1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] Jones D.T., Taylor W.R., Thornton J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8(3):275–282. [DOI] [PubMed] [Google Scholar]

[B18] Kalyaanamoorthy S., Minh B.Q., Wong T.K.F., von Haeseler A., Jermiin L.S. 2017. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14(6):587–589. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] Lanfear R., Calcott B., Ho S.Y., Guindon S. 2012. PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol. Biol. Evol. 29(6):1695–1701. [DOI] [PubMed] [Google Scholar]

[B20] Lanfear R., Frandsen P.B., Wright A.M., Senfeld T., Calcott B. 2017. PartitionFinder 2: new methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses. Mol. Biol. Evol. 34(3):772–773. [DOI] [PubMed] [Google Scholar]

[B21] Le S.Q., Dang C.C., Gascuel O. 2012. Modeling protein evolution with several amino acid replacement matrices depending on site rates. Mol. Biol. Evol. 29(10):2921–2936. [DOI] [PubMed] [Google Scholar]

[B22] Le S.Q., Gascuel O. 2008. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25(7):1307–1320. [DOI] [PubMed] [Google Scholar]

[B23] Le S.Q., Gascuel O. 2010. Accounting for solvent accessibility and secondary structure in protein phylogenetics is clearly beneficial. Syst. Biol. 59(3):277–287. [DOI] [PubMed] [Google Scholar]

[B24] Le S.Q., Lartillot N., Gascuel O. 2008. Phylogenetic mixture models for proteins. Philos. Trans. R. Soc. Lond. B 363(1512):3965–3976. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] Le V.S., Dang C.C., Le Q.S. 2017. Improved mitochondrial amino acid substitution models for metazoan evolutionary studies. BMC Evol. Biol. 17(136):1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] Minh B.Q., Schmidt H.A., Chernomor O., Schrempf D., Woodhams M.D., von Haeseler A., Lanfear R. 2020. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37(5):1530–1534. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B27] Misof B., Liu S., Meusemann K., Peters R.S., Donath A., Mayer C., Frandsen P.B., Ware J., Flouri T., Beutel R.G., Niehuis O., Petersen M., Izquierdo-Carrasco F., Wappler T., Rust J., Aberer A.J., Aspock U., Aspock H., Bartel D., Blanke A., Berger S., Bohm A., Buckley T.R., Calcott B., Chen J., Friedrich F., Fukui M., Fujita M., Greve C., Grobe P., Gu S., Huang Y., Jermiin L.S., Kawahara A.Y., Krogmann L., Kubiak M., Lanfear R., Letsch H., Li Y., Li Z., Li J., Lu H., Machida R., Mashimo Y., Kapli P., McKenna D.D., Meng G., Nakagaki Y., Navarrete-Heredia J.L., Ott M., Ou Y., Pass G., Podsiadlowski L., Pohl H., von Reumont B.M., Schutte K., Sekiya K., Shimizu S., Slipinski A., Stamatakis A., Song W., Su X., Szucsich N.U., Tan M., Tan X., Tang M., Tang J., Timelthaler G., Tomizuka S., Trautwein M., Tong X., Uchifune T., Walzl M.G., Wiegmann B.M., Wilbrandt J., Wipfler B., Wong T.K., Wu Q., Wu G., Xie Y., Yang S., Yang Q., Yeates D.K., Yoshizawa K., Zhang Q., Zhang R., Zhang W., Zhang Y., Zhao J., Zhou C., Zhou L., Ziesmann T., Zou S., Li Y., Xu X., Zhang Y., Yang H., Wang J., Wang J., Kjer K.M., Zhou X. 2014. Phylogenomics resolves the timing and pattern of insect evolution. Science 346(6210):763–767. [DOI] [PubMed] [Google Scholar]

[B28] Muller T., Vingron M. 2000. Modeling amino acid replacement. J. Comput. Biol. 7(6):761–776. [DOI] [PubMed] [Google Scholar]

[B29] Nguyen L.T., Schmidt H.A., von Haeseler A., Minh B.Q. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32(1): 268–274. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B30] Nickle D.C., Heath L., Jensen M.A., Gilbert P.B., Mullins J.I., Pond S.L.K. 2007. HIV-specific probabilistic models of protein evolution. PLoS One 2(6):1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B31] Ran J.H., Shen T.T., Wang M.M., Wang X.Q. 2018. Phylogenomics resolves the deep phylogeny of seed plants and indicates partial convergent or homoplastic evolution between Gnetales and angiosperms. Proc. R. Soc. B 285(1881):1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B32] Robinson D.F., Foulds L.R. 1981. Comparison of phylogenetic trees. Math. Biosci. 53(1–2):131–147. [Google Scholar]

[B33] Rota-Stabelli O., Yang Z.H., Telford M.J. 2009. MtZoa: a general mitochondrial amino acid substitutions model for animal evolutionary studies. Mol. Phylogenet. Evol. 52(1):268–272. [DOI] [PubMed] [Google Scholar]

[B34] Saitou N., Nei M. 1987. The neighbor-joining method - a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4):406–425. [DOI] [PubMed] [Google Scholar]

[B35] Sayyari E., Whitfield J.B., Mirarab S. 2017. Fragmentary gene sequences negatively impact gene tree and species tree reconstruction. Mol. Biol. Evol. 34(12):3279–3291. [DOI] [PubMed] [Google Scholar]

[B36] Shen X.X., Opulente D.A., Kominek J., Zhou X., Steenwyk J.L., Buh K.V., Haase M.A.B., Wisecaver J.H., Wang M., Doering D.T., Boudouris J.T., Schneider R.M., Langdon Q.K., Ohkuma M., Endoh R., Takashima M., Manabe R., Cadez N., Libkind D., Rosa C.A., DeVirgilio J., Hulfachor A.B., Groenewald M., Kurtzman C.P., Hittinger C.T., Rokas A. 2018. Tempo and mode of genome evolution in the budding yeast subphylum. Cell 175(6):1533. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B37] Shimodaira H. 2002. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51(3):492–508. [DOI] [PubMed] [Google Scholar]

[B38] Shimodaira H., Hasegawa M. 2001. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 12:1246–1247. [DOI] [PubMed] [Google Scholar]

[B39] Sullivan J., Joyce P. 2005. Model selection in phylogenetics. Annu. Rev. Ecol. Evol. Syst. 36(1):445–466. [Google Scholar]

[B40] Veerassamy S., Smith A., Tillier E.R.M. 2003. A transition probability model for amino acid substitutions from blocks. J. Comput. Biol. 10(6):997–1010. [DOI] [PubMed] [Google Scholar]

[B41] Wang H.-C., Li K., Susko E., Roger A.J. 2008. A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC Evol. Biol. 8(1):331. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B42] Whelan S., Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18(5):691–699. [DOI] [PubMed] [Google Scholar]

[B43] Wu S.Y., Edwards S., Liu L. 2018. Genome-scale DNA sequence data and the evolutionary history of placental mammals. Data Brief 18:1972–1975. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B44] Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39(3):306–314. [DOI] [PubMed] [Google Scholar]

[B45] Yang Z.H., Nielsen R., Hasegawa M. 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15(12):1600–1611. [DOI] [PubMed] [Google Scholar]

[B46] Zhou X.F., Shen X.X., Hittinger C.T., Rokas A. 2018. Evaluating fast maximum likelihood-based phylogenetic programs using empirical phylogenomic data sets. Mol. Biol. Evol. 35(2):486–503. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution

Bui Quang Minh

Cuong Cao Dang

Le Sy Vinh

Robert Lanfear

Roles

Abstract

Figure 1.

Table 1.

Materials and Methods

Data Sets Used for Training and Testing

Table 2.

Model of Amino Acid Substitutions

Model of Rate Heterogeneity across Sites

Estimating a Joint Replacement Matrix from a Protein Data Set

Table 3.

Comparisons with Previous Estimation Procedures

Software Implementation

Results

QMaker Outperforms Existing Estimation Methods

Larger Amino Acid Databases Improve Model Fit, but Primarily to Target Alignments

Five New Clade-Specific Q Matrices Improve Model Fit on Phylogenomic Data

Figure 2.

Table 4.

Principle Components Analysis Reveal the Landscape of Amino Acid Models

Figure 3.

Figure 4.

Incorporating the New Matrices into Model Selection Changes Locus-Tree Inference

Figure 5.

Table 5.

Discussion

Appendix

Figure 1.

Figure 2.

Supplementary material

Funding

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases