Calculation of posterior convergent and divergent substitution probabilities
and their expected counts, using posterior integration and amino-acid
substitution models: "codeML-ancestral"

A.P. Jason de Koning, 2009
David Pollock Lab, UC Denver School of Medicine
jason.de.koning@gmail.com

Software from the paper:
Castoe, de Koning et al 2009. "Evidence for an ancient adaptive episode of
convergent molecular evolution." PNAS v106(22): 8986-8991.
http://www.pnas.org/content/106/22/8986.abstract

----

With apologies to Ziheng Yang, codeml (from Yang's PAML4 package) has been
modified to calculate the expected number of posterior convergent and
divergent substitutions as described in our paper.

The posterior numbers of both convergent and divergent substitutions are
calculated for each pair of independent branches on the tree, and for every
site. Rate variation across sites is accommodated in the calculations.

These calculations can be thought of as substitution probabilities that have
been integrated over all possible ancestral states, weighted by their
posterior probabilities under the selected model of amino-acid substitution.
This approach therefore has great advantages over earlier methods for
estimating convergent and divergent changes, which were based on ML
assignment of unknown ancestral states (especially when using highly
diverged sequences).

----

TECHNICAL NOTES:

Technical details on the calculations can be found in the Methods and
Supplementary info from our paper.
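As a rough illustration only (this is NOT the modified codeml code, and the
paper remains the authority on the actual calculation), the idea of summing
over all ancestral states rather than fixing one ML reconstruction can be
sketched in Python. Everything named here is hypothetical: the functions
pair_probabilities() and expected_counts(), the 3-state toy alphabet, and the
posterior tables. The sketch assumes the usual definitions (convergent = both
branches change and end in the same state; parallel = convergent from the
same starting state; strictly convergent = convergent from different starting
states; divergent = both branches change and end in different states). It
also treats the two branches as independent and ignores rate variation across
sites.

```python
from itertools import product

CATEGORIES = ("convergent", "parallel", "strictly_convergent", "divergent")


def pair_probabilities(post1, post2):
    """Classify substitution pairs on two branches at a single site.

    post1[a][b] is a hypothetical joint posterior probability that branch 1
    starts in state a and ends in state b; post2[c][d] is the same for
    branch 2. ASSUMPTION: the two branches are treated as independent.
    """
    n = len(post1)
    probs = dict.fromkeys(CATEGORIES, 0.0)
    # Sum over every combination of start/end states on both branches,
    # instead of committing to one reconstructed ancestral state.
    for a, b, c, d in product(range(n), repeat=4):
        if a == b or c == d:
            continue  # a substitution is required on both branches
        p = post1[a][b] * post2[c][d]
        if b == d:
            probs["convergent"] += p
            probs["parallel" if a == c else "strictly_convergent"] += p
        else:
            probs["divergent"] += p
    return probs


def expected_counts(sites):
    """Sum per-site probabilities to get expected counts for a branch pair."""
    totals = dict.fromkeys(CATEGORIES, 0.0)
    for post1, post2 in sites:
        for name, p in pair_probabilities(post1, post2).items():
            totals[name] += p
    return totals


# Toy posteriors over 3 states (made-up numbers; each table sums to 1).
toy_branch1 = [[0.5, 0.3, 0.0],
               [0.0, 0.1, 0.0],
               [0.0, 0.1, 0.0]]
toy_branch2 = [[0.6, 0.0, 0.0],
               [0.0, 0.0, 0.2],
               [0.0, 0.2, 0.0]]

print(pair_probabilities(toy_branch1, toy_branch2))
print(expected_counts([(toy_branch1, toy_branch2)] * 2))
```

Because every state combination contributes in proportion to its posterior
probability, uncertain ancestral states are averaged over rather than
resolved to a single best guess.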

Changes that were made to the original PAML code:

* A number of modifications were made to the AncestralMarginal() function
  in treesub.c
* Other changes were made to support this function in both treesub.c and
  codeml.c:
   - the global data structure for tree nodes was modified to store
     "conP_part1", "prior", and "conP_byCat", which are components of the
     posterior substitution probability calculations
   - most other code changes are found in the PostProbNode() function

----

USAGE NOTES:

*** It is advised that you first estimate all model parameters in an
unadulterated version of PAML, and then fix those values for analysis
here. ***

Unarchive the tarball and run "make codeml". Modify the codeml.ctl file to
specify your input sequence data and tree (with branch lengths). Run
"./codeml" and get comfortable.

1) The code currently works only for amino-acid sequences (or codons-->AA)

2) The program is VERY SLOW. It will sit and crunch numbers for quite a
   while--probably for several hours for a modest dataset of 30-40
   species--even on a fast computer.

3) Be aware that because this version of codeml has been hacked up, it may
   not work fully for anything other than getting the convergent/divergent
   substitution probabilities

4) By default codeML-ancestral will output the substitution probabilities
   for convergent, strictly convergent, parallel and divergent substitutions
   for each pair of branches in separate lines of descent. Site-specific
   information is output to the "convergentSites.out" file.

   [For optional features: To get the asymptotic expected number of
   substitutions (analogous to the expectation from a posterior-predictive
   analysis) you will have to uncomment several code blocks in
   AncestralMarginal(). You should also comment out the block that makes
   the posterior calculations (also annotated in the same function) when
   doing this.]

----

At some point we will implement these functions in a user-friendly package,
but in the meantime you are invited to enjoy/suffer this code.
Please feel free to direct queries or comments to Jason.

- Jason de Koning, September 2009

----

For further information, visit the Pollock Lab on the web at
http://www.EvolutionaryGenomics.com or visit Jason at http://jasondk.org