Program: K-Estimator v5.2

Author: Josep M. Comeron
Dep. Ecology & Evolution
Univ. of Chicago
1101 East 57th St.
Chicago, IL 60637
jcomeron@midway.uchicago.edu

GENERAL DESCRIPTION

K-Estimator is a computer program for estimating the number of nucleotide substitutions per site (synonymous [Ks] and nonsynonymous [Ka] for coding regions, and overall [K] for noncoding regions) and the confidence intervals of these estimates obtained by Monte-Carlo simulations.

References:

Comeron, J.M. (1999) K-Estimator: Calculation of the number of nucleotide substitutions per site and the confidence intervals. Bioinformatics (in press)

Comeron, J.M. (1995) A method for estimating the numbers of synonymous and nonsynonymous substitutions per site. J. Mol. Evol. 41: 1152-1159.

The program is written in Microsoft Visual Basic v.5.0 and it runs on any IBM-PC compatible computer under Windows 98/NT.

Kest52.exe is a self-extracting file that will auto launch the SetUp.

DETAILED DESCRIPTION

K-Estimator is a Windows program written in Visual Basic 5.0 (Microsoft (c)) and can run on any IBM compatible computer under Windows 95/98 or Windows NT.

The program accepts several multiple-sequence formats of already aligned nucleotide sequences (ASCII files):

  Clustal W (Thompson et al., 1994)
  PHYLIP (Felsenstein, 1993)
  MSF(PileUp)/GCG (Devereux et al., 1984)
  GDE (S. Smith, Harvard University Genome Center)
  MEGA (Kumar et al., 1994)
  NBRF/PIR (Sidman et al., 1988)
  LWL(91) (with or without spaces between codons)

There is no program limit to the maximum length or number of sequences to be compared. For both noncoding and coding region sequences, it is possible to analyze particular regions or to obtain results from a sliding window analysis.
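As an illustration of what a sliding window analysis computes, here is a minimal Python sketch. It is not the program's own code (K-Estimator is written in Visual Basic); the function names p_distance and sliding_window, the window and step parameters, and the choice to skip columns containing gaps or ambiguous bases are all assumptions made for this example. It reports only the uncorrected proportion of differing sites in each window.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap or an ambiguous base are
    skipped (an assumption of this sketch). Returns None when no
    column can be compared.
    """
    valid = "ACGT"
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differences += 1
    return differences / compared if compared else None


def sliding_window(seq_a, seq_b, window, step):
    """Yield (start, end, p_distance) for each full window of the alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    for start in range(0, len(seq_a) - window + 1, step):
        end = start + window
        yield start, end, p_distance(seq_a[start:end], seq_b[start:end])
```

Calling list(sliding_window(a, b, window=100, step=25)) on two aligned strings a and b would give one tuple per window, with 0-based, end-exclusive coordinates.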
Divergence Estimates:

For noncoding regions, the program can estimate the overall (K) number of nucleotide substitutions per site using several methods that correct for multiple hits at a site: Jukes and Cantor's 1-parameter (Jukes and Cantor, 1969), Kimura's 2-p (Kimura, 1980), Tajima and Nei (Tajima and Nei, 1984), and Tajima's 1-p, 2-p and 4-p (Tajima, 1993).

When coding regions are under analysis, K-Estimator 5.0 applies the method described in Comeron (1995) to estimate Ks and Ka. This method, a modification of the method of Li (1993) and Pamilo and Bianchi (1993) (LPB), better quantifies the actual number of transitions and transversions and reduces stochastic errors (see Comeron, 1995, for details and comparison to previous methods). Three genetic codes can be applied: Universal, Vertebrate mitochondrial, or Drosophila mitochondrial.

Furthermore, three different options can be applied to restrict the codons that are under analysis:

  1) Maximum one substitution per codon (analyzes only those codons with no or only one substitution).
  2) No three differences per codon (removes from the analysis those homologous codons that differ at all three positions).
  3) Only AAs Substitution (estimates Ks by analyzing only those homologous codons that code for different amino acids but do not differ at all three positions).

A file with a MEGA-compatible Distance matrix format (Lower-Left matrix) can be obtained for any estimated Divergence value.

Confidence Intervals:

K-Estimator obtains the Confidence Intervals (C.I.) of divergence estimates (K for noncoding regions, and Ks and Ka for coding regions) by Monte Carlo simulations (Comeron, 1995).
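Both the divergence estimates and the simulations rely on a correction for multiple hits at a site. For reference, the two simplest corrections named above can be sketched in Python from their standard published formulas: Jukes and Cantor, K = -(3/4) ln(1 - 4p/3), and Kimura 2-parameter, K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where p is the proportion of differing sites and P and Q are the proportions of transitional and transversional differences. The function names and the handling of gaps below are assumptions of this sketch, not the program's own code.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq_a, seq_b):
    """Return (sites, transitions, transversions) for two aligned sequences.

    Only columns with A, C, G or T in both sequences are counted
    (an assumption of this sketch).
    """
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return sites, transitions, transversions


def jukes_cantor(p):
    """Jukes-Cantor distance; None when the correction is not applicable."""
    arg = 1.0 - 4.0 * p / 3.0
    if arg <= 0:
        return None
    return -0.75 * math.log(arg)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance; None when the correction is not applicable."""
    arg1 = 1.0 - 2.0 * P - Q
    arg2 = 1.0 - 2.0 * Q
    if arg1 <= 0 or arg2 <= 0:
        return None
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)
```

With sites, ts, tv = count_differences(a, b), the two estimates are jukes_cantor((ts + tv) / sites) and kimura_2p(ts / sites, tv / sites). Both functions return None when the logarithm is undefined, which is the situation where a correction method is not applicable.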
Computer simulations take into account the following parameters:

  1) the Divergence Value (K, or Ks and Ka),
  2) the number of nucleotides or codons,
  3) the transition : transversion (alpha:beta) substitution ratio, and
  4) the G+C content for noncoding regions, and the amino acid composition and G+C content at the third position of codons for coding regions.

When alpha:beta differs from the ratio expected under random nucleotide substitution, the substitution pattern is biased accordingly to maintain the original G+C percentage. For all simulations, the number of substitutions applied in each replicate is a Poisson-distributed random number with a mean equal to the estimated number of substitutions (Divergence value x Number of analyzed sites). Substitutions are randomly distributed along the sequence.

Since most methods that correct for multiple hits can give slightly biased divergence estimates under some conditions, Monte Carlo simulations using a number of substitutions based on these estimates could give inaccurate C.I. because the average simulated divergence would itself be biased. To address this potential problem, K-Estimator first scans for the number of substitutions that gives the divergence average closest to the analyzed divergence value under the queried conditions, and then runs the final set of replicates.

Confidence intervals for Ks and Ka estimates are analyzed together and can only be obtained after estimating Ks and Ka with K-Estimator 5.1; the number of codons, the amino acid composition (average of the two compared sequences), the G+C content at the third position of codons, as well as the number of synonymous and nonsynonymous substitutions, are fixed from the analyzed sequences.

Confidence intervals are obtained directly from the null distribution of the divergence estimates from each replicate. The program can also calculate the exact probability of obtaining any particular divergence value (K for noncoding regions, and Ks, Ka, and Ka/Ks for coding regions).
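The general shape of such a simulation for a noncoding region can be sketched in Python as follows. This is a deliberately simplified illustration, not the program's algorithm: it assumes equal base frequencies and equal rates for all substitutions (so it ignores the alpha:beta ratio and the G+C content), uses the Jukes and Cantor correction, and omits the preliminary scan for the optimal number of substitutions. All function and parameter names are hypothetical.

```python
import math
import random

BASES = "ACGT"


def poisson(rng, mean):
    """Poisson variate: count unit-rate exponential arrivals before `mean`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def jukes_cantor(p):
    """Jukes-Cantor distance; None when the correction is not applicable."""
    arg = 1.0 - 4.0 * p / 3.0
    return -0.75 * math.log(arg) if arg > 0 else None


def simulate_k(k, n_sites, replicates=1000, seed=None, max_k=5.0):
    """Return (sorted estimates, failed replicates) for a divergence k."""
    rng = random.Random(seed)
    estimates, failed = [], 0
    for _ in range(replicates):
        original = [rng.choice(BASES) for _ in range(n_sites)]
        mutated = list(original)
        # Number of substitutions ~ Poisson(divergence x number of sites),
        # placed at random sites; repeated hits at one site are allowed.
        for _ in range(poisson(rng, k * n_sites)):
            site = rng.randrange(n_sites)
            mutated[site] = rng.choice([b for b in BASES if b != mutated[site]])
        p = sum(a != b for a, b in zip(original, mutated)) / n_sites
        estimate = jukes_cantor(p)
        if estimate is None or estimate > max_k:
            failed += 1
        else:
            estimates.append(estimate)
    return sorted(estimates), failed


def percentile_interval(sorted_values, level=0.95):
    """Central interval read directly from the simulated distribution."""
    if not sorted_values:
        raise ValueError("no successful replicates")
    tail = (1.0 - level) / 2.0
    last = len(sorted_values) - 1
    lower = sorted_values[math.floor(tail * last)]
    upper = sorted_values[math.ceil((1.0 - tail) * last)]
    return lower, upper


def upper_tail_probability(sorted_values, observed):
    """Fraction of replicates with an estimate at least as large as `observed`."""
    if not sorted_values:
        raise ValueError("no successful replicates")
    return sum(v >= observed for v in sorted_values) / len(sorted_values)
```

For example, estimates, failed = simulate_k(0.1, 500, replicates=1000, seed=1) followed by percentile_interval(estimates) gives an approximate 95% interval for K = 0.1 over 500 sites under these simplified assumptions. Passing a seed makes a run reproducible.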
The program shows the number, if any, of replicates (failed replicates) where the method for correcting multiple hits at a site was not applicable or the estimated number of nucleotide substitutions per site was greater than 5.0.

K-Estimator 5 also obtains the expected distribution of the ratio Ka/Ks, which has been classically applied to detect the action of positive selection (Ka/Ks>1). In particular, the program simulates a condition where the number of nonsynonymous substitutions per nonsynonymous site (Ka) is on average equal to the value estimated for Ks. This yields the null distribution for Ks, Ka, and, more importantly, the ratio Ka/Ks. Thus, the program can be used to test whether Ka/Ks is significantly higher than 1 under the null hypothesis of Ka=Ks. As in all simulations, the number of both synonymous and nonsynonymous substitutions is Poisson-distributed.

The results of both the divergence analyses and the confidence interval analyses can be printed and/or saved as independent text files.

Good luck and have fun,

Josep M
September 23, 1999