Pollock laboratory abstracts

 
53: Journal of Molecular Biology, (2007) in press, online Aug. 29

Structural, biochemical, and in vivo characterization of the first virally encoded cyclophilin from the Mimivirus

Thai V, Renesto P, Fowler A, Brown D, Davis T, Gu W, Pollock DD, Kern D, Raoult D, and Eisenmesser E

Although multiple viruses utilize host cell cyclophilins, including SARS and HIV-1, their role in infection is poorly understood. To help elucidate these roles, we have characterized the first virally encoded cyclophilin (mimicyp) derived from the largest virus discovered to date (the Mimivirus) that is also a causative agent of pneumonia in humans. Mimicyp adopts a typical cyclophilin-fold, yet it also forms trimers unlike any previously characterized homologue. Strikingly, immunofluorescence assays reveal that mimicyp localizes to the surface of the mature virion, as recently proposed for several viruses that recruit host cell cyclophilins such as SARS and HIV-1. Additionally mimicyp lacks peptidyl-prolyl isomerase activity in contrast to human cyclophilins. Thus, this study suggests that cyclophilins, whether recruited from host cells (i.e. HIV-1 and SARS) or virally encoded (i.e. Mimivirus), are localized on viral surfaces for at least a subset of viruses.
 
52: in "Applications of Computational Intelligence in Biology: Current Trends and Open Problems", Smolinski, Milanova, and Aboul-Ella, eds, (2007) in press

Phylogenomics, protein family evolution, and the Tree of Life: an integrated approach between molecular evolution and computational intelligence

Nahum LA and Pereira SL

The massive amount of information generated by genomic technologies has opened new frontiers in science by bridging disciplines such as computational biology, molecular biology, molecular evolution, evolutionary biology, and ecology. Many tools and methods have been developed over the past several years to allow analysis of molecular sequences. Phylogenomics, the interpretation of genomic data to determine gene function and phylogenetic relationships of organisms, remains challenging nevertheless. Here, we focus on the application of phylogenomics to improve functional prediction of genes/products, to understand the evolution of protein families, and to resolve phylogenetic relationships of organisms. We point out areas that require further development, such as computational tools and methods to manipulate large and diverse data sets. The application of an integrated computational and biological approach may help to achieve a better system-based understanding of biological processes in different environments. This will help to fully access valuable information available from the evolution of genes, and genomes in the wide diversity of intact organisms and biological communities.
 
51: Journal of Molecular Evolution, (2007) in press

Coevolutionary patterns in cytochrome c oxidase subunit I depend on structure and functional context

Wang ZO and Pollock DD

The strength and pattern of coevolution between amino acid residues varies depending on their structural and functional environment. This context dependence, along with differences in analytical technique, is responsible for different results among coevolutionary analyses of different proteins. It is thus important to perform detailed study of individual proteins to gain better insight into how context dependence can affect coevolutionary patterns even within individual proteins, and to unravel the details of context dependence with respect to structure and function. Here, we extend our previous study by presenting further analysis of residue coevolution in cytochrome c oxidase subunit I sequences from 231 vertebrates using a statistically robust phylogeny-based maximum likelihood ratio method. As in previous studies, a strong overall coevolutionary signal was detected, and coevolution within structural regions was significantly related to the Cα distances between residues. While the strong selection for adjacent residues among predicted coevolving pairs in the surface region indicates that the statistical method is highly selective for biologically relevant interactions, the coevolutionary signal was strongest in the transmembrane region, although the distances between coevolving residues were greater. This indicates that coevolution may act to maintain more global structural and functional constraints in the transmembrane region. In the transmembrane region, sites that coevolved according to polarity and hydrophobicity rather than volume had a greater tendency to co-localize with just one of the predicted proton channels (channel H). Thus, the details of coevolution in cytochrome c oxidase subunit I depend greatly on domain structure and residue physicochemical characteristics, but proximity to function appears to play a critical role.
We hypothesize that the association of coevolutionary sites with channel H was caused by adaptive coevolution, and is indicative of a more important functional role for this channel.
 
50: BMC Evolutionary Biology, Jul 26;7:123 (2007)

Comparative mitochondrial genomics of snakes: substitution rate dynamics and functionality of the duplicate control region

Jiang ZJ*, Castoe TA*, Austin CC, Burbrink FT, Herron MD, McGuire JA, Parkinson CL, and Pollock DD

*contributed equally

BACKGROUND: The mitochondrial genomes of snakes are characterized by an overall evolutionary rate that appears to be one of the most accelerated among vertebrates. They also possess other unusual features, including short tRNAs and other genes, and a duplicated control region that has been stably maintained since it originated more than 70 million years ago. Here, we provide a detailed analysis of evolutionary dynamics in snake mitochondrial genomes to better understand the basis of these extreme characteristics, and to explore the relationship between mitochondrial genome molecular evolution, genome architecture, and molecular function. We sequenced complete mitochondrial genomes from Slowinski's corn snake (Pantherophis slowinskii) and two cottonmouths (Agkistrodon piscivorus) to complement previously existing mitochondrial genomes, and to provide an improved comparative view of how genome architecture affects molecular evolution at contrasting levels of divergence. RESULTS: We present a Bayesian genetic approach that suggests that the duplicated control region can function as an additional origin of heavy strand replication. The two control regions also appear to have different intra-specific versus inter-specific evolutionary dynamics that may be associated with complex modes of concerted evolution. We find that different genomic regions have experienced substantial accelerated evolution along early branches in snakes, with different genes having experienced dramatic accelerations along specific branches. Some of these accelerations appear to coincide with, or subsequent to, the shortening of various mitochondrial genes and the duplication of the control region and flanking tRNAs. CONCLUSION: Fluctuations in the strength and pattern of selection during snake evolution have had widely varying gene-specific effects on substitution rates, and these rate accelerations may have been functionally related to unusual changes in genomic architecture. 
The among-lineage and among-gene variation in rate dynamics observed in snakes is the most extreme thus far observed in animal genomes, and provides an important study system for further evaluating the biochemical and physiological basis of evolutionary pressures in vertebrate mitochondria.
 
49: Nature, 447(7141):167-77 (2007).

Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences

Mikkelsen TS, Wakefield MJ, Aken B, Amemiya CT, Chang JL, Duke S, Garber M, Gentles AJ, Goodstadt L, Heger A, Jurka J, Kamal M, Mauceli E, Searle SM, Sharpe T, Baker ML, Batzer MA, Benos PV, Belov K, Clamp M, Cook A, Cuff J, Das R, Davidow L, Deakin JE, Fazzari MJ, Glass JL, Grabherr M, Greally JM, Gu W, Hore TA, Huttley GA, Kleber M, Jirtle RL, Koina E, Lee JT, Mahony S, Marra MA, Miller RD, Nicholls RD, Oda M, Papenfuss AT, Parra ZE, Pollock DD, Ray DA, Schein JE, Speed TP, Thompson K, VandeBerg JL, Wade CM, Walker JA, Waters PD, Webber C, Weidman JR, Xie X, Zody MC; Broad Institute Genome Sequencing Platform, Broad Institute Whole Genome Assembly Team, Jaffe DB, Alvarez P, Brockman W, Butler J, Chin C, Gnerre S, MacCallum I, Graves JA, Ponting CP, Breen M, Samollow PB, Lander ES, and Lindblad-Toh K

We report a high-quality draft of the genome sequence of the grey, short-tailed opossum (Monodelphis domestica). As the first metatherian ('marsupial') species to be sequenced, the opossum provides a unique perspective on the organization and evolution of mammalian genomes. Distinctive features of the opossum chromosomes provide support for recent theories about genome evolution and function, including a strong influence of biased gene conversion on nucleotide sequence composition, and a relationship between chromosomal characteristics and X chromosome inactivation. Comparison of opossum and eutherian genomes also reveals a sharp difference in evolutionary innovation between protein-coding and non-coding functional elements. True innovation in protein-coding genes seems to be relatively rare, with lineage-specific differences being largely due to diversification and rapid turnover in gene families involved in environmental interactions. In contrast, about 20% of eutherian conserved non-coding elements (CNEs) are recent inventions that postdate the divergence of Eutheria and Metatheria. A substantial proportion of these eutherian-specific CNEs arose from sequence inserted by transposable elements, pointing to transposons as a major creative force in the evolution of mammalian gene regulation.
 
48: Genome Research, 17(7):992-1004 (2007). Epub May 10

Evolutionary dynamics of transposable elements in the short-tailed opossum Monodelphis domestica

Gentles AJ, Wakefield MJ, Kohany O, Gu W, Batzer MA, Pollock DD, and Jurka J

The genome of the gray short-tailed opossum Monodelphis domestica is notable for its large size (approximately 3.6 Gb). We characterized nearly 500 families of interspersed repeats from the Monodelphis. They cover approximately 52% of the genome, higher than in any other amniotic lineage studied to date, and may account for the unusually large genome size. In comparison to other mammals, Monodelphis is significantly rich in non-LTR retrotransposons from the LINE-1, CR1, and RTE families, with >29% of the genome sequence comprised of copies of these elements. Monodelphis has at least four families of RTE, and we report support for horizontal transfer of this non-LTR retrotransposon. In addition to short interspersed elements (SINEs) mobilized by L1, we found several families of SINEs that appear to use RTE elements for mobilization. In contrast to L1-mobilized SINEs, the RTE-mobilized SINEs in Monodelphis appear to shift from G+C-rich to G+C-low regions with time. Endogenous retroviruses have colonized approximately 10% of the opossum genome. We found that their density is enhanced in centromeric and/or telomeric regions of most Monodelphis chromosomes. We identified 83 new families of ancient repeats that are highly conserved across amniotic lineages, including 14 LINE-derived repeats; and a novel SINE element, MER131, that may have been exapted as a highly conserved functional noncoding RNA, and whose emergence dates back to approximately 300 million years ago. Many of these conserved repeats are also present in human, and are highly over-represented in predicted cis-regulatory modules. Seventy-six of the 83 families are present in chicken in addition to mammals.
 
47: PLoS Genetics, 3(5):e72 (2007). Epub Mar 21

Regional variation in the density of essential genes in mice

Hentges KE, Pollock DD, Liu B, and Justice MJ

In most species, and particularly in vertebrates, the percentage of genes absolutely required for survival, the essential genes, has not been estimated. To obtain this estimation, we used the mouse as an experimental model to carry out high-efficiency N-ethyl-N-nitrosourea (ENU) mutagenesis screens in two balancer chromosome regions, and compared our results to a third previously published screen. The number of essential genes in each region was predicted based on allele frequencies. We determined that the density of essential genes differs by up to an order of magnitude among genomic regions. This indicates that extrapolating from regional estimates to genome-wide estimates of essential genes has a huge variance. A particularly high density of essential genes on mouse Chromosome 11 coincides with a high degree of regional linkage conservation, providing a possible causal explanation for the density variation. This is the first demonstration of regional variation in essential gene density in the mouse genome.
 
46: Gene, 396(1):46-58 (2007). Epub Mar 19

SINEs, evolution and genome structure in the opossum

Gu W, Ray DA, Walker JA, Barnes EW, Gentles AJ, Samollow PB, Jurka J, Batzer MA, and Pollock DD

Short INterspersed Elements (SINEs) are non-autonomous retrotransposons, usually between 100 and 500 base pairs (bp) in length, which are ubiquitous components of eukaryotic genomes. Their activity, distribution, and evolution can be highly informative on genomic structure and evolutionary processes. To determine recent activity, we amplified more than one hundred SINE1 loci in a panel of 43 M. domestica individuals derived from five diverse geographic locations. The SINE1 family has expanded recently enough that many loci were polymorphic, and the SINE1 insertion-based genetic distances among populations reflected geographic distance. Genome-wide comparisons of SINE1 densities and GC content revealed that high SINE1 density is associated with high GC content in a few long and many short spans. Young SINE1s, whether fixed or polymorphic, showed an unbiased GC content preference for insertion, indicating that the GC preference accumulates over long time periods, possibly in periodic bursts. SINE1 evolution is thus broadly similar to human Alu evolution, although it has an independent origin. High GC content adjacent to SINE1s is strongly correlated with bias towards higher AT to GC substitutions and lower GC to AT substitutions. This is consistent with biased gene conversion, and also indicates that like chickens, but unlike eutherian mammals, GC content heterogeneity (isochore structure) is reinforced by substitution processes in the M. domestica genome. Nevertheless, both high and low GC content regions are apparently headed towards lower GC content equilibria, possibly due to a relative shift to lower recombination rates in the recent Monodelphis ancestral lineage. Like eutherians, metatherian (marsupial) mammals have evolved high CpG substitution rates, but this is apparently a convergence in process rather than a shared ancestral state.
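As a rough illustration of the kind of measurement this abstract refers to (GC content adjacent to an insertion), the following minimal sketch computes the GC fraction of the sequence flanking an element. It is not the authors' pipeline; the function names `gc_content` and `flanking_gc`, the flank size, and the toy sequence are all assumptions made for the example.

```python
def gc_content(seq):
    """Return the G+C fraction among unambiguous bases, or None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None  # only ambiguous characters (e.g. N), so GC is undefined
    return (seq.count("G") + seq.count("C")) / acgt


def flanking_gc(genome, start, end, flank=10):
    """GC fraction of the regions flanking an element at genome[start:end].

    `flank` is a hypothetical window size chosen for illustration only.
    """
    left = genome[max(0, start - flank):start]
    right = genome[end:end + flank]
    return gc_content(left + right)


# Toy example: an AT-only "element" at positions 4..8 with GC-only flanks.
toy_genome = "GGGGAAAACCCC"
toy_flank_gc = flanking_gc(toy_genome, 4, 8, flank=4)
```

In a real analysis the flank size and the handling of neighbouring repeats would matter; this sketch ignores both.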
 
45: in Ancestral Reconstruction, DA Liberles, ed. (2007)

Dealing with Uncertainty in Ancestral Sequence Reconstruction: Sampling from the Posterior Distribution

Pollock DD and Chang BS

Resurrection of ancestral proteins in the laboratory to investigate aspects of their function has provided an exciting opportunity to experimentally test theories concerning the evolution of protein structure and function. A potentially important pitfall of this approach, however, is that sequence and functional bias in ancestral reconstruction may affect results. In the worst-case scenario, the bias in reconstruction could lead to incorrect functional interpretation for reconstructed proteins. Inferring function or stability based on a single resurrected protein sequence may be a risky proposition without concurrent examination to determine if a bias in functional shifts indeed exists. If the evolutionary process can be modeled fairly well, an effective means to eliminate the reconstruction bias is to sample ancestral proteins from the posterior probability space. It is also important to incorporate uncertainty in the model of evolution and model variation across sites, and to consider the absence of rare variants. The question of how many reconstructed ancestral samples are sufficient to estimate probable ancestral function is an open one, and it may be specific to the variability in inferred function among likely ancestors. Given a reasonably accurate model of evolution, the sampling of even a few proteins from the posterior may provide a relatively unbiased estimate of ancestral function, and would allow evaluation of the variance in this functional estimate. We discuss the details of the problem, propose a simple experimental approach to solve it, and provide a program to sample ancestral sequences and to evaluate the tendency of maximum likelihood estimates to alter amino acid frequencies and under-sample rare (possibly slightly deleterious) variants in a protein.
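The contrast drawn above between a single maximum likelihood reconstruction and sampling from the posterior can be sketched in a few lines. This is not the program mentioned in the abstract. It assumes that per-site posterior probabilities are already available and that sites can be sampled independently, and the names `ml_sequence`, `sample_sequence`, and the toy posteriors are hypothetical.

```python
import random
from collections import Counter


def ml_sequence(posteriors):
    """Take the single most probable residue at each site."""
    return "".join(max(site, key=site.get) for site in posteriors)


def sample_sequence(posteriors, rng):
    """Draw one residue per site in proportion to its posterior probability.

    Assumption: sites are treated as independent given the posteriors.
    """
    seq = []
    for site in posteriors:
        residues = list(site)
        weights = [site[r] for r in residues]
        seq.append(rng.choices(residues, weights=weights, k=1)[0])
    return "".join(seq)


def residue_frequencies(sequences):
    """Pooled residue frequencies across a collection of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


# Hypothetical posteriors: at every site, L has probability 0.7 and W has 0.3.
posteriors = [{"L": 0.7, "W": 0.3} for _ in range(200)]
rng = random.Random(0)

ml = ml_sequence(posteriors)
samples = [sample_sequence(posteriors, rng) for _ in range(50)]

ml_freq = residue_frequencies([ml])           # W never appears
sampled_freq = residue_frequencies(samples)   # W appears at roughly 0.3
```

In this toy case the most probable sequence contains no W at all, while the pooled samples recover the rarer residue at about its posterior frequency, which is the under-sampling of rare variants that the abstract describes.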
 
44: BMC Bioinformatics, 7 Suppl 2:S7 (2006)

EGenBio: a data management system for evolutionary genomics and biodiversity

Nahum LA, Reynolds MT, Wang ZO, Faith JJ, Jonna R, Jiang ZJ, Meyer TJ, and Pollock DD

BACKGROUND: Evolutionary genomics requires management and filtering of large numbers of diverse genomic sequences for accurate analysis and inference on evolutionary processes of genomic and functional change. We developed Evolutionary Genomics and Biodiversity (EGenBio; http://egenbio.lsu.edu) to begin to address this. DESCRIPTION: EGenBio is a system for manipulation and filtering of large numbers of sequences, integrating curated sequence alignments and phylogenetic trees, managing evolutionary analyses, and visualizing their output. EGenBio is organized into three conceptual divisions, Evolution, Genomics, and Biodiversity. The Genomics division includes tools for selecting pre-aligned sequences from different genes and species, and for modifying and filtering these alignments for further analysis. Species searches are handled through queries that can be modified based on a tree-based navigation system and saved. The Biodiversity division contains tools for analyzing individual sequences or sequence alignments, whereas the Evolution division contains tools involving phylogenetic trees. Alignments are annotated with analytical results and modification history using our PRAED format. A miscellaneous Tools section and Help framework are also available. EGenBio was developed around our comparative genomic research and a prototype database of mtDNA genomes. It utilizes MySQL-relational databases and dynamic page generation, and calls numerous custom programs. CONCLUSION: EGenBio was designed to serve as a platform for tools and resources to ease combined analysis in evolution, genomics, and biodiversity.

 
43: Public Library of Science Computational Biology, 2(6):e69 (2006). Epub Jun 23

Assessing the accuracy of ancestral protein reconstruction methods

Williams PD, Pollock DD, Blackburne BP, and Goldstein RA

The phylogenetic inference of ancestral protein sequences is a powerful technique for the study of molecular evolution, but any conclusions drawn from such studies are only as good as the accuracy of the reconstruction method. Every inference method leads to errors in the ancestral protein sequence, resulting in potentially-misleading estimates of the ancestral protein’s properties. To better understand the conditions of the past, it is important to understand the accuracy of different methods and how the resulting errors affect the conclusions drawn. The Maximum Parsimony (MP) and Maximum Likelihood (ML) inference methods have been shown to misestimate ancestral nucleotide frequencies, revealing a consistent and incorrect bias, but little data for proteins exists, partially because of the difficulty of finding true ancestral sequences for comparison. To assess the accuracy of ancestral protein reconstruction methods, we perform computational population evolution simulations featuring speciation and divergence events using an off-lattice protein model where fitness depends on the ability to fold into a specified target structure. As we know the population of sequences at each step of the simulation, we can compare these known ancestral sequences and the resulting thermodynamic properties with those inferred by MP, ML, and Bayesian methods. We find that MP and, even more so, ML methods overestimate thermostability and that a Bayesian analysis, although it does not generate the most accurate sequences, is the most accurate and most unbiased in terms of resulting protein properties. This suggests that ancestral reconstruction studies performed using MP and ML may need to be re-evaluated.

42: Molecular Biology and Evolution, 23(7):1444-9 (2006). Epub May 11

Observations of amino acid gain and loss during protein evolution are explained by statistical bias

Goldstein RA and Pollock DD

In the scientific literature, and in molecular evolution in particular, extravagant claims are oftentimes given exceptional attention. This is true for unusual inferences of relationships among organisms, dating of organismal divergence times, and for reconstruction of function and properties of ancestral proteins. In all of these cases, misuse of statistics and ignorance of variation can lead to “phylogenetic optimism”, whereby confidence in the results is vastly overstated and important sources of bias ignored. As a case in point, the authors of a recent manuscript in Nature claim to have discovered “universal trends” of amino acid gain and loss in protein evolution. Such an inference of convergent evolution in the same direction in many different taxa should always be treated with extreme caution, since inferential bias is a likely explanation for such a trend. Here, we show that the “universal trend” in amino acid evolution can be explained by a bias in common methods for inferring evolutionary trends in proteins. Trends can be more accurately detected using phylogeny-based Bayesian methods, but the currently available dataset does not contain sufficient taxa to make definitive assertions, and previous assertions are almost certainly unfounded. Variation in amino acid replacement rates among proteins, among positions within proteins, and over time currently overwhelms our ability to make sound claims about such trends.
 
41: International Journal of Modern Physics C, 17(1): 75-90 (2006)

Selective advantage of recombination in evolving protein populations: A lattice model study

Williams PD, Pollock DD, and Goldstein RA

Recent research has attempted to clarify the contributions of several mutational processes, such as substitutions or homologous recombination. Simplistic, tractable protein models, which determine the compact native structure phenotype from the sequence genotype, are well-suited to such studies. In this paper, we use a lattice-protein model to examine the effects of point mutation and homologous recombination on evolving populations of proteins. We find that while the majority of mutation and recombination events are neutral or deleterious, recombination is far more likely to be beneficial. This results in a faster increase in fitness during evolution, although the final fitness level is not significantly changed. This transient advantage provides an evolutionary advantage to subpopulations that undergo recombination, allowing fixation of recombination to occur in the population.
 
40: Evolutionary Bioinformatics Online, 2 (2006)

Functionality and the evolution of marginal stability in proteins: inferences from lattice simulations

Williams PD, Pollock DD, and Goldstein RA

It has been known for some time that many proteins are marginally stable. This has inspired several explanations. Having noted that the functionality of many enzymes is correlated with subunit motion, flexibility, or general disorder, some have suggested that marginally stable proteins should have an evolutionary advantage over proteins of differing stability. Others have suggested that stability and functionality are contradictory qualities, and that selection for both criteria results in marginally stable proteins, optimised to satisfy the competing design pressures. While these explanations are plausible, recent research simulating the evolution of model proteins has shown that selection for stability, ignoring any aspects of functionality, can result in marginally stable proteins because of the underlying makeup of protein sequence-space. We extend this research by simulating the evolution of proteins, using a computational protein model that equates functionality with binding and catalysis. In the model, marginal stability is not required for ligand-binding functionality and we observe no competing design pressures. The resulting proteins are marginally stable, again demonstrating that neutral evolution is sufficient for explaining marginal stability in observed proteins.
 
39: Human Genomics, 2(3): 158-67 (2005)

Divergence, recombination, and retention of functionality during protein evolution

Xu YO, Hall RW, Goldstein RA, Pollock DD.

Protein structure and function are not easily predictable from primary sequence, and because of this we have only a vague idea exactly how protein sequences evolve in the context of structure and function. Thanks to increasing biodiversity in genomic studies, progress is being made in detecting context-dependent variation in substitution processes, but it remains unclear exactly what features of the evolutionary process we should be looking for. To address this, our laboratories have been developing a system for simulating protein evolution in the context of structure and function using lattice models of proteins and ligands (or substrates). This system includes both thermodynamic features of protein stability and population dynamics; we refer to this approach as ab initio evolution to emphasize that the equilibrium details of variant fitnesses arise from the physical principles of the system, and not from any pre-conceived notions or arbitrary mathematical distributions. Here, we discuss the relevance of the system to evolutionary genomics and the choices that must be made in trying to reproduce essential biological features in the face of immense computational burdens. We present new results on the coevolution during the divergence process and retention of functionality in homologous recombinants following population divergence. The designability, or sequence space available to a structure, plays a key role in divergence and recombinant function. These results have implications for understanding viral evolution, speciation, and directed evolutionary experiments. We also show that the results of our analysis of the divergence process can guide improved methods for accurately approximating folding probabilities in more complex systems that would otherwise be beyond computational feasibility.
 
38: Molecular Biology and Evolution, 23(3): 499-512 (2006). Epub 2005 Nov 16

Sequences and protein structures are congruent with functional and fitness differences among Colias phosphoglucose isomerase genotypes

Wheat CW, Watt WB, Pollock DD, Schulte PM

The enzyme phosphoglucose isomerase, PGI, of Colias butterflies (Lepidoptera, Pieridae) displays a widespread allozyme polymorphism. Many studies on the biochemical function, organismal performance, and fitness effects of Colias PGI genotypes have given evidence of strong natural selection in the wild to maintain this polymorphism. Here we begin to study the mechanism underlying this adaptive polymorphism at the level of molecular sequence and structure. The common electrophoretically-detectable alleles differ at multiple amino acid positions, and also show some cryptic charge-neutral amino acid variation hidden within the electrophoretic allele classes. Structural modeling shows that all changes are at or near PGI’s surface, and several naturally abundant variants that distinguish these alleles are so placed as potentially to alter subunit interaction and catalytic center geometry. There is a large excess of intraspecific variation, both synonymous and nonsynonymous, compared to interspecific fixation: there are no fixed synonymous differences between species, and only two fixed nonsynonymous differences. The fixed differences may be due to positive selection, but sliding window analysis of synonymous nucleotide diversity and Tajima’s D shows that the amino acid sites predicted to be foci of selection based on structural and functional considerations also coincide with the regions of highest synonymous diversity. They are thus the most likely targets of balancing selection based on both genetic and biochemical considerations. Colias' PGI gene, with 1668 bp of cDNA, is divided into 12 exons, spread over ~11 kb of chromosomal DNA, and intragenic recombination has been active over much of the gene. Our results show that the relaxation of constraint against amino acid variation, as one moves from the interior cores of proteins to their surface, allows adaptive, as well as neutral, natural variation to occur near or at those surfaces.
This case study of persistent polymorphism now offers the integration of the genomic and molecular-structural bases of natural variation with its consequences for metabolic and organismal performance, thence for fitness, in wild populations.
 
37: NHGRI White Paper 2005

Proposal to sequence the first reptilian genome: the Green Anole Lizard, Anolis carolinensis

J. Losos, E. Braun, D. Brown, S. Clifton, S. Edwards, J. Gibson-Brown, T. Glenn, L. Guillette, D. Main, P. Minx, W. Modi, M. Pfrender, D. Pollock, D. Ray, A. Shedlock, and W. Warren

No abstract available.
 
36: Genome Research, 15(5):665-73 (2005)

Evolution of base substitution gradients in primate mitochondrial genomes

Raina SZ, Faith JJ, Seligmann H, Disotell T, Stewart C-B, and Pollock DD

Substitution patterns among nucleotides are often assumed to be constant in phylogenetic analyses. Although variation in the average rate of substitution among sites is commonly accounted for, variation in the relative rates of specific types of substitution are not. Here, we review details of methodologies used for detecting and analyzing differences in substitution processes among predefined groups of sites. We describe how such analyses can be performed using existing phylogenetic tools, and discuss how new phylogenetic analysis tools we have recently developed can be used to provide more detailed and sensitive analyses, including study of the evolution of mutation and substitution processes. As an example we consider the mitochondrial genome, for which two types of transition deaminations (C=>T and A=>G) are strongly affected by single-strandedness during replication, resulting in an asymmetric mutation process. Since time spent single-stranded varies along the mitochondrial genome, their differential mutational response results in very different substitution patterns in different regions of the genome.
 
35: Mycological Research, 109:261-5 (2005); see News and Views: T. Boekhout "Biodiversity: gut feeling for yeasts" Nature 434: 449-450 (2005)

The beetle gut: a hyperdiverse source of novel yeasts

Suh S-O, McHugh JV, Pollock DD, Blackwell M

We isolated over 650 yeasts over a three year period from the gut of a variety of beetles and characterized them on the basis of LSU rDNA sequences and morphological and metabolic traits. Of these, at least 200 were undescribed taxa, a number equivalent to almost 30% of all currently recognized yeast species. A Bayesian analysis of species discovery rates predicts further sampling of previously sampled habitats could easily produce another 100 species. The sampled habitat is, thereby, estimated to contain well over half as many more species as are currently known worldwide. The beetle gut yeasts occur in 45 independent lineages scattered across the yeast phylogenetic tree, often in clusters. The distribution suggests that some of the yeasts diversified by a process of horizontal transmission in the habitats and subsequent specialization in association with insect hosts. Evidence of specialization comes from consistent association over time and broad geographical ranges of certain yeasts and beetle species. The discovery of high yeast diversity in a previously unexplored habitat is a first step toward investigating the basis of the interactions and their impact in relation to ecology and evolution.
 
34: Encyclopedia of Genomics, Proteomics and Bioinformatics 2005; Dunn, Jorde, Little, and Subramaniam, eds. September 2005

Modeling protein evolution

Pollock DD and Goldstein RA

Modeling protein evolution has been frustratingly simplistic in the past, but new methodologies and approaches have been rapidly changing this situation. Increased computational power, improved phylogeny-based maximum likelihood and Bayesian statistics, larger data sets, and better protein structure prediction methods are jointly improving the outlook and allowing researchers to improve the biological realism of protein models. They are also allowing more detailed analysis of differences in processes among sequence positions over space and time, of selection and adaptation, coevolution, and functional divergence, and of ancestral changes in function. The future is expected to bring improved integration of models of protein evolution with protein structure prediction, with the potential to dramatically improve the accuracy and power of both.
 
33: Methods in Enzymology, 395:779-790 (2005)

Context dependence and coevolution among amino acid residues in proteins

Wang ZO and Pollock DD

As complete genomes accumulate, and the generation of genomic biodiversity proceeds at an accelerating pace, the need to understand the interaction between sequence evolution and protein structure and function rises in prominence. The pattern and pace of substitutions in proteins can provide important clues to functional importance, functional divergence, and adaptive response. Coevolution between amino acid residues and the context-dependence of the evolutionary process are often ignored, however, due to their complexity; but they are of critical importance for the accurate interpretation of reconstructed evolutionary events. Since residues interact with one another, and because the effect of substitutions can depend on the structural and physiological environment in which they occur, an accurate science of evolutionary functional genomics and a complete understanding of selection in proteins requires a better understanding of how context dependence affects protein evolution. Here, we present new evidence from vertebrate cytochrome oxidase sequences that pairwise coevolutionary interactions between protein residues are highly dependent on tertiary and secondary structure. We also discuss theoretical predictions that impinge on our expectations of how protein residues may interact over long distances due to their shared need to maintain protein stability.
 
32: Biological Procedures Online 2004; 6(1): 180-188

Analysis of among-site variation in substitution patterns

Krishnan NM, Raina SZ, and Pollock DD

Substitution patterns among nucleotides are often assumed to be constant in phylogenetic analyses. Although variation in the average rate of substitution among sites is commonly accounted for, variation in the relative rates of specific types of substitution is not. Here, we review details of methodologies used for detecting and analyzing differences in substitution processes among predefined groups of sites. We describe how such analyses can be performed using existing phylogenetic tools, and discuss how new phylogenetic analysis tools we have recently developed can be used to provide more detailed and sensitive analyses, including study of the evolution of mutation and substitution processes. As an example we consider the mitochondrial genome, for which two types of transition deaminations (C=>T and A=>G) are strongly affected by single-strandedness during replication, resulting in an asymmetric mutation process. Since time spent single-stranded varies along the mitochondrial genome, their differential mutational response results in very different substitution patterns in different regions of the genome.
 
31: DNA and Cell Biology 2004; 23:707-714

Detecting gradients of asymmetry in site-specific substitutions in mitochondrial genomes

Krishnan NM, Seligmann H, Raina SZ, and Pollock DD

During mitochondrial replication, spontaneous mutations occur and accumulate asymmetrically during the time spent single-stranded by the heavy strand (DssH). The predominant mutations appear to be deaminations from adenine to hypoxanthine (A=>H, which leads to an A=>G substitution) and cytosine to thymine (C=>T). Previous findings indicated that C=>T substitutions accumulate rapidly and then saturate at high DssH, suggesting protection or repair, whereas A=>G accumulates linearly with DssH. We describe here the implementation of a simple hidden Markov model (HMM) of among-site rate correlations to provide an almost continuous profile of the asymmetry in substitution response for any particular substitution type. We implement this model using a phylogeny-based Bayesian Markov chain Monte Carlo (MCMC) approach. We compare and contrast the relative asymmetries in all twelve possible substitution types, and find that the observed transition substitution responses determined using our new method agree quite well with previous predictions of a saturating curve for C=>T transition substitutions and a linear accumulation of A=>G transitions. The patterns seen in transversion substitutions show much lower among-site variation and are non-linear and more complex than those seen in transitions. We also find that, after accounting for the principal linear effect, some of the residual variation in A=>G/G=>A response ratios is explained by the average predicted nucleic acid secondary structure propensity at a site, possibly due to protection from mutation when secondary structure forms.
 
30: DNA and Cell Biology 2004; 23:701-705

The ambush hypothesis: Hidden stop codons prevent off-frame gene reading

Seligmann H and Pollock DD

Coding sequences lack stop codons, but many stops appear off-frame. Off-frame stops (stops in -1 and +1 shifted reading frames, termed hidden stops) terminate frameshifted translation, potentially decreasing energy and resource waste on non-functional proteins. Benefits may include reduced waste elimination costs and avoidance of potentially cytotoxic frame-shifted products. Our “ambush” hypothesis suggests that hidden stops are sometimes selected for. Codons of many amino acids can contribute to hidden stops, depending on the synonymous position state and adjacent codons. In vertebrate mitochondria, 31.75% of all amino acid combinations can form hidden stops. Codons with more potential to form hidden stops have greater usage frequency and bias in their favor among synonymous codons. Among primates, predicted mitochondrial rRNA secondary structure stability correlates negatively with the number of hidden stops in the mitochondrial genome. The taxonomic distribution of genetic codes suggests that +1 frameshifts might be more frequent than –1 frameshifts. This is confirmed by analyses of primate mitochondrial genomes: species with unstable rRNAs have more +1 stops, but the correlation is weak for -1 stops. High hidden stop density seems to be an adaptation in species with slippage prone ribosomes (unstable rRNAs). Hidden stops may thus compensate for reduced efficiency of some parts of the biosynthetic machinery. Some experimental data confirm our hypothesis: gene expression increases with the experimentally manipulated number of stops in the promoter region of a gene, suggesting biotechnological applications.
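
The counting of hidden stops described in this abstract can be illustrated with a short sketch. This is not the authors' code: the function name `count_hidden_stops` and its default stop set (the standard genetic code) are assumptions made for illustration, and the stop set should be replaced for other genetic codes such as the vertebrate mitochondrial one.

```python
def count_hidden_stops(seq, stops=("TAA", "TAG", "TGA")):
    """Count stop codons in the +1 and -1 shifted reading frames.

    seq is assumed to be an in-frame coding sequence (frame 0 starts at
    index 0). The default stop set is the standard genetic code; this is
    an illustrative assumption, not taken from the paper.
    """
    seq = seq.upper()
    counts = {}
    # A +1 shift reads codons starting at index 1; a -1 shift is
    # equivalent to reading codons starting at index 2 (-1 modulo 3).
    for label, offset in (("+1", 1), ("-1", 2)):
        counts[label] = sum(
            seq[i:i + 3] in stops for i in range(offset, len(seq) - 2, 3)
        )
    return counts


# Usage: the same sequence under two different stop sets.
standard = count_hidden_stops("ATGATAAGCTGA")
vert_mito = count_hidden_stops("ATGATAAGCTGA", stops=("TAA", "TAG", "AGA", "AGG"))
```

Passing the stop set as a parameter keeps the sketch usable across genetic codes, which matters here because the abstract concerns mitochondrial genomes.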
 
29: Molecular Biology and Evolution 2004; 21(10): 1871-1883
 
 Ancestral sequence reconstruction in primate mitochondrial DNA: compositional bias and effect on functional inference

Krishnan NM, Seligmann H, Stewart C-B, de Koning APJ, and Pollock DD

Reconstruction of ancestral DNA and amino acid sequences is an important means of inferring information about past evolutionary events. Such reconstructions suggest changes in molecular function and evolutionary processes over the course of evolution, and are used to infer adaptation and convergence. Maximum likelihood (ML) is generally thought to provide relatively accurate reconstructed sequences compared to parsimony, but both methods lead to the inference of multiple directional changes in nucleotide frequencies in primate mitochondrial DNA (mtDNA). To better understand this surprising result, as well as to better understand how parsimony and ML differ, we constructed a series of computationally simple “conditional pathway” methods that differed in the number of substitutions allowed per site along each branch, and also evaluated the entire Bayesian posterior frequency distribution of reconstructed ancestral states. We analyzed primate mitochondrial cytochrome b (Cyt-b) and cytochrome oxidase subunit I (COI) genes and found that ML reconstructs ancestral frequencies that are often more different from tip sequences than are parsimony reconstructions. In contrast, frequency reconstructions based on the posterior ensemble more closely resemble extant nucleotide frequencies. Simulations indicate that these differences in ancestral sequence inference are probably due to deterministic bias caused by high uncertainty in the optimization-based ancestral reconstruction methods (parsimony, ML, Bayesian maximum a posteriori). In contrast, ancestral nucleotide frequencies based on an average of the Bayesian set of credible ancestral sequences are much less biased. 
The methods involving simpler conditional pathway calculations have slightly reduced likelihood values compared to full likelihood calculations, but can provide fairly unbiased nucleotide reconstructions and may be useful in more complex phylogenetic analyses than considered here due to their speed and flexibility. To determine whether biased reconstructions using optimization methods might affect inferences of functional properties, ancestral primate mitochondrial tRNA sequences were inferred and helix-forming propensities for conserved pairs were evaluated in silico. For ambiguously reconstructed nucleotides at sites with high base composition variability, ancestral tRNA sequences from Bayesian analyses were more compatible with canonical base pairing than were those inferred by other methods. Thus, nucleotide bias in reconstructed sequences apparently can lead to serious bias and inaccuracies in functional predictions.
 
28: Genetics 2004; 168(1): 489-502
 
 Estimating the degree of saturation in mutant screens

Pollock DD and Larkin J

Large-scale screens for loss-of-function mutants have played a significant role in recent advances in developmental biology and other fields. In such mutant screens, it is desirable to estimate the degree of “saturation” of the screen (i.e., what fraction of the possible target genes have been identified). We applied Bayesian and maximum likelihood methods for estimating the number of loci remaining undetected in large-scale screens, and produce credibility intervals to assess the uncertainty of these estimates. Since different loci may mutate to alleles with detectable phenotypes at different rates, we also incorporated variation in the degree of mutability among genes, using either gamma-distributed mutation rates or multiple discrete mutation rate classes. We examined eight published data sets from large-scale mutant screens and find that credibility intervals are much broader than implied by previous assumptions about the degree of saturation of screens. The likelihood methods presented here are a significantly better fit to data from published experiments than estimates based on the Poisson distribution, which implicitly assumes a single mutation rate for all loci. The results are reasonably robust to different models of variation in the mutability of genes. We tested our methods against mutant allele data from a region of the Drosophila melanogaster genome for which there is an independent genomics-based estimate of the number of undetected loci, and found that the number of such loci falls within the predicted credibility interval for our models. The methods we have developed may also be useful for estimating the degree of saturation in other types of genetic screens in addition to classical screens for simple loss-of-function mutants, including genetic modifier screens and screens for protein-protein interactions using the yeast two-hybrid method.
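
For orientation, the single-rate Poisson baseline that this abstract argues against can be sketched as follows. This is only the simple baseline, not the Bayesian or gamma-distributed methods of the paper; the function names and the bisection solver are assumptions made for illustration. Given the number of alleles recovered at each detected locus, the sketch fits a zero-truncated Poisson rate and extrapolates the number of loci with zero alleles.

```python
import math


def truncated_poisson_rate(mean_hits, tol=1e-12):
    """Solve lam / (1 - exp(-lam)) = mean_hits for lam by bisection.

    mean_hits is the mean number of alleles per DETECTED locus and must
    exceed 1, otherwise no positive rate exists.
    """
    if mean_hits <= 1.0:
        raise ValueError("mean_hits must be greater than 1")
    lo, hi = 1e-9, mean_hits  # the root lies in (0, mean_hits)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mid / (1.0 - math.exp(-mid)) < mean_hits:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def estimate_undetected_loci(alleles_per_locus):
    """Estimate the number of loci with zero alleles under one shared rate."""
    n_detected = len(alleles_per_locus)
    lam = truncated_poisson_rate(sum(alleles_per_locus) / n_detected)
    p_zero = math.exp(-lam)  # Poisson probability of no hits at a locus
    return n_detected * p_zero / (1.0 - p_zero)


# Usage with hypothetical allele counts for four detected loci.
missing = estimate_undetected_loci([1, 2, 3, 2])
```

Because this baseline assumes every locus mutates at the same rate, it returns a point estimate only and tends to overstate saturation when mutability actually varies among genes, which is the situation the abstract addresses.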
 
27: Human Genomics 2004; 1(2): 85
 
 Human genomics and the role of evolutionary genomics

Pollock DD

Human Genomics has, from its outset, included a great deal of evolutionary analysis. The structure of the editorial board has representation from many evolution-based disciplines, including population and quantitative genetics, and of course, evolutionary genomics. This inclusion is the result of an obvious trend in the field of genomics to incorporate more and more evolutionary analysis, not just as an extra frill, but as a central component of the field. The world now has over one hundred complete bacterial genomes, and with human, roundworm, multiple fruitflies, mosquito, rice, Arabidopsis, pufferfish, mouse, rat, dog, chimpanzee, chicken, and a growing number of other multicellular organisms either sequenced or imminent, comparative genomics is coming into its own. Still, one might argue that a journal of Human Genomics should focus on its main target, Homo sapiens, and leave aside mucking about with the multitude of other species on the planet, most of which many self-respecting Homo sapiens individuals might rather target with the bottom of their shoe rather than with a multimillion dollar sequencing project. As the evolutionary genomics editor, it seems necessary to provide some explanation and justification.
 
26: Genetics 2003; 165(2): 735-745
 
 Likelihood analysis of asymmetrical mutation bias gradients in vertebrate mitochondrial genomes

Faith JJ and Pollock DD

Protein-coding genes in mitochondrial genomes have varying degrees of asymmetric skew in base frequencies at the third codon position. The variation in skew among genes appears to be caused by varying durations of time that the heavy strand spends in the mutagenic single-stranded state during replication (DssH). The primary data used to study skew has been the gene-by-gene base frequencies in individual taxa, which provides little information on exactly what kinds of mutations are responsible for the base frequency skew. To assess the contribution of individual mutation components to the ancestral vertebrate substitution pattern, here we analyze a large data set of complete vertebrate mitochondrial genomes in a phylogeny-based likelihood context. This also allows us to evaluate the change in skew continuously along the mitochondrial genome, and to directly estimate relative substitution rates. Our results indicate that different types of mutation respond differently to the gradient. A primary role for hydrolytic deamination of cytosines in creating variance in skew among genes was not supported, but rather linearly increasing rates of mutation from adenine to hypoxanthine with DssH appear to drive regional differences in skew. Substitutions due to hydrolytic deamination of cytosines, although common, appear to quickly saturate, possibly due to stabilization by the mitochondrial DNA single strand binding protein. These results should form the basis of more realistic models of DNA and protein evolution in mitochondria.
 
25: NHGRI White Paper 2003
 
Proposal for complete sequencing of the genome of a Marsupial, the gray, short-tailed opossum, Monodelphis domestica

Amemiya CT, Greally JM, Jirtle RL, Lander ES, Lindblad-Toh K, Miller RD, Pollock DD, Samollow PB, Springer MS, and Wilson RK

Metatherian (“marsupial”) mammals are phylogenetically distinct from current mammalian biomedical models, all of which are eutherian (“placental”) species. However, marsupials and eutherians are more closely related to one another than to any other vertebrate model species (i.e., birds, amphibians, fishes). Fossil evidence establishes a minimum date of 125 million years (MY) for the separation of eutherian and metatherian mammals (JI et al. 2002), while analyses of nuclear gene sequences suggest that metatherian / eutherian divergence may be as old as 173-190 MY (KUMAR and HEDGES 1998; WOODBURNE et al. 2003). To place this in context, the evolutionary gulf between mammals and the next most closely related group of non-mammalian research models, i.e., birds (chicken), is approximately 300–350 MY. Thus, the marsupial – eutherian relationship represents a unique midpoint in age relative to existing mammalian and non-mammalian vertebrate models. As a legacy of their common ancestry, marsupials and eutherians share basic genetic mechanisms and molecular processes that represent fundamental (ancient) mammalian characteristics. Nevertheless, since their divergence, eutherian and marsupial mammals have evolved many distinctive morphologic, physiologic, and genetic variations on these elemental mammalian designs. These phylogenetically restricted differences can be used as comparative tools for examining the underlying molecular and genetic processes that are common to all mammalian species, and thereby help to reveal how variations in these mechanisms lead to differences in gene regulation, expression, and function. As the closest sister group to eutherian mammals, marsupials are also the most appropriate “outgroup” for assessing the relative antiquity or novelty of the molecular and genetic changes that have occurred among the many eutherian species (including ourselves) presently used in biomedical and evolutionary research.
 
24: Journal of Molecular Evolution 2003; 56(4): 375-376
 
 The Zuckerkandl Prize: Structure and Evolution

Pollock DD

Guest Editorial: The Zuckerkandl Prize, established by Springer-Verlag in 2002 to honor Emile Zuckerkandl and his contributions to molecular evolution, goes this year to Gustavo Caetano-Anollés for his paper on “Evolved RNA Secondary Structure and the rooting of the Universal Tree of Life” (Caetano-Anollés 2002). The editors of the Journal of Molecular Evolution have judged this to be the best paper in the journal last year due to its creative use of structure, and the evolution of structure, to reconstruct deep phylogenies.
 
23: Systematic Biology 2003; 52(1):124-6
 
 Is sparse taxon sampling a problem for phylogenetic inference?

Hillis DM, Pollock DD, McGuire JA, and Zwickl DJ

No abstract: ...There is no simple answer to the question posed in the heading of this section; the answer will depend on the particular situation being examined (the scope of the problem, the number of taxa already sequenced, the number of characters already collected, and the quantity and the availability of additional relevant taxa to include). We disagree with the assertion of Rosenberg and Kumar (2002) that more characters per taxon is necessarily a better strategy than more taxa for the same characters. Rosenberg and Kumar (2002) put their argument in terms of the current genome sequencing studies, in which many genes (or complete genomes) are examined from very few taxa. Rosenberg and Kumar (2002) argued that their conclusions "mesh well" with this scattered genome approach. In contrast, we propose that this approach will likely result in poorly estimated evolutionary models, poorly estimated evolutionary trees, and a poor overall view of evolutionary history. If one is interested in inferring the evolutionary history of life, a much broader sample of taxa (perhaps sequenced for far less than full genomes) will result in a much more accurate estimate of phylogeny than will complete genomes of only a small sample of taxa.
 
22: Systematic Biology 2002; 51(4):664-71
 
 Increased taxon sampling is advantageous for phylogenetic inference

Pollock DD, Zwickl DJ, McGuire JA, and Hillis DM

Until recently, it was believed that complex phylogenies might be extremely difficult to reconstruct due to the phenomenal rate of increase in the number of possible phylogenies as the number of taxa increases. However, Hillis (1996) showed through simulation that, for at least one complex phylogeny of angiosperms with 228 taxa, reconstruction was far more accurate than expected, even with relatively modest amounts of DNA sequence data. This led to a flurry of papers on the subject of taxon sampling and phylogenetic reconstruction, with focus quickly shifting from the question of whether complex phylogenies can be reconstructed to whether and how much an existing phylogeny can be improved through increased taxon sampling (Hillis, 1998; Kim, 1998; Poe, 1998; Poe and Swofford, 1999; Pollock and Bruno, 2000; Rannala et al., 1998; Yang, 1998). Although a statistician might intuitively believe that it is generally better (or at least no worse) to increase the amount of data to resolve a question in statistical inference, the benefits of taxon addition for phylogenetic inference remain controversial. ...A recent paper on the subject of taxon addition (Rosenberg and Kumar, 2001) concludes that increased taxon sampling is of little benefit to phylogenetic inference when compared to increasing sequence length. We disagree with their interpretation and believe that their data support the importance of increased taxon sampling. In addition, some of their data were simulated under extreme conditions (i.e., substitution rates that were very high or low, or sequences that were unreasonably short). Large error values and non-linear relationships at these extremes make it difficult to interpret effects for the majority of the range, and averaging across the entire range is inappropriate. Moreover, we do not believe that Rosenberg and Kumar (2001) used the most appropriate metric to measure the relative effect of taxon addition. 
Our reanalysis of their simulated data indicates that increased taxon sampling is highly beneficial for phylogenetic inference.
 
21: Applied Bioinformatics 2002; 1(2): 81-92
 
 Genomic biodiversity, phylogenetics, and coevolution in proteins

Pollock DD.

Comprehensive sampling of genomic biodiversity is fast becoming a reality for some genomic regions and complete organelle genomes. Genomic biodiversity is defined as large genomic sequences from many species, and here some recent work is reviewed that demonstrates the potential benefits of genomic biodiversity for molecular evolutionary analysis and phylogenetic reconstruction. This work shows that, using likelihood-based approaches, taxon addition can dramatically improve phylogenetic reconstruction. Features, or dynamics, of the evolutionary process are much more easily inferred with large numbers of taxa, and large numbers are essential for discriminating differences in evolutionary patterns between sites. Accurate prediction of site-specific patterns can improve phylogenetic reconstruction by an amount equivalent to quadrupling sequence length. Genomic biodiversity is particularly central to research relating patterns of evolution, adaptation, and coevolution to structural and functional features of proteins. Research on detecting coevolution between amino acid residues in proteins is reviewed that demonstrates a clear need for much greater numbers of closely related taxa to better discriminate site-specific patterns of interaction, and to allow more detailed analysis of coevolutionary interactions between subunits in protein complexes. It is argued that parsing out coevolutionary and other context-dependent substitution probabilities is essential for discriminating between coevolution and adaptation, and for more realistically modeling the evolution of proteins. Research is also reviewed that argues for increasing the efficiency of acquiring genomic biodiversity, and suggests that this might be done by simultaneously shotgun cloning and sequencing genomic mixtures from many species. 
Increased efficiency is a prerequisite if genomic biodiversity levels are to rapidly increase by orders of magnitude, and thus lead to dramatically improved understanding of interactions between protein structure, function, and sequence evolution.
 
20: Pac Symp Biocomp 2002; tutorial
 
Molecular evolution and phylogenetic analysis

Pollock DD and Goldstein RA

All of biology is based on evolution. Evolution is the organizing principle for understanding the shared history of all biological organisms. Evolution describes the similarities between different organisms, as well as explaining how differences emerged. In addition to answering basic questions about the history of life, evolutionary perspectives and information drawn from evolutionary analyses can provide information highly relevant to many biological, biotechnological, and biomedical problems. There is also growing interest in mimicking evolution in the test tube in order to develop RNA, proteins, and organisms with specified properties.
 
19: J Mol Graph Model 2001;19(1):150-6
 
 Evolution of functionality in lattice proteins

Williams PD, Pollock DD, and Goldstein RA

We study the evolution of protein functionality using a two-dimensional lattice model. The characteristics particular to evolution, such as population dynamics and early evolutionary trajectories, have a large effect on the distribution of observed structures. Only subtle differences are observed between the distribution of structures evolved for function and those evolved for their ability to form compact structures.
 
18: Pac Symp Biocomp 2001 13:164-166
 
Structures, phylogenies, and genomes: The integrated study of protein evolution

Goldstein RA, Pollock DD, and Thorne JL

For the past decades, evolutionary biologists have tried to reconstruct evolutionary histories, to piece together phylogenetic trees, and to understand the network of hereditary relationships. Such approaches (whether it is admitted or not) are based on models of the evolutionary process. These tasks would be easier if reality would better match the simplest models. Unfortunately for these scientists, evolution takes place in a complicated web of constraints, with changes in the DNA sometimes but not always translating to changes in amino acids which may or may not result in significant changes in the properties of these expressed proteins. All of this occurs in a complicated and interconnected fitness landscape, where different locations in the protein may be under radically different selective pressure. This situation has led a number of investigators to bring more of the biological and biochemical complexity into these evolutionary models, to develop approaches with a closer fidelity to biological reality with the hope that more accurate pictures of biological history will result.
 
17: Mol Biol Evol 2000 Dec;17(12):1854-8
 
Assessing an unknown evolutionary process: effect of increasing site-specific knowledge through taxon addition

Pollock DD, Bruno WJ.

Assessment of the evolutionary process is crucial for understanding the effect of protein structure and function on sequence evolution and for many other analyses in molecular evolution. Here, we used simulations to study how taxon sampling affects accuracy of parameter estimation and topological inference in the absence of branch length asymmetry. With maximum-likelihood analysis, we find that adding taxa dramatically improves both support for the evolutionary model and accurate assessment of its parameters when compared with increasing the sequence length. Using a method we call "doppelganger trees," we distinguish the contributions of two sources of improved topological inference: greater knowledge about internal nodes and greater knowledge of site-specific rate parameters. Surprisingly, highly significant support for the correct general model does not lead directly to improved topological inference. Instead, substantial improvement occurs only with accurate assessment of the evolutionary process at individual sites. Although these results are based on a simplified model of the evolutionary process, they indicate that in general, assuming processes are not independent and identically distributed among sites, more extensive sampling of taxonomic biodiversity will greatly improve analytical results in many current sequence data sets with moderate sequence lengths.
 
16: Mol Biol Evol 2000 Dec;17(12):1776-88

A case for evolutionary genomics and the comprehensive examination of sequence biodiversity

Pollock DD, Eisen JA, Doggett NA, Cummings MP.

Comparative analysis is one of the most powerful methods available for understanding the diverse and complex systems found in biology, but it is often limited by a lack of comprehensive taxonomic sampling. Despite the recent development of powerful genome technologies capable of producing sequence data in large quantities (witness the recently completed first draft of the human genome), there has been relatively little change in how evolutionary studies are conducted. The application of genomic methods to evolutionary biology is a challenge, in part because gene segments from different organisms are manipulated separately, requiring individual purification, cloning, and sequencing. We suggest that a feasible approach to collecting genome-scale data sets for evolutionary biology (i.e., evolutionary genomics) may consist of combination of DNA samples prior to cloning and sequencing, followed by computational reconstruction of the original sequences. This approach will allow the full benefit of automated protocols developed by genome projects to be realized; taxon sampling levels can easily increase to thousands for targeted genomes and genomic regions. Sequence diversity at this level will dramatically improve the quality and accuracy of phylogenetic inference, as well as the accuracy and resolution of comparative evolutionary studies. In particular, it will be possible to make accurate estimates of normal evolution in the context of constant structural and functional constraints (i.e., site-specific substitution probabilities), along with accurate estimates of changes in evolutionary patterns, including pairwise coevolution between sites, adaptive bursts, and changes in selective constraints. These estimates can then be used to understand and predict the effects of protein structure and function on sequence evolution and to predict unknown details of protein structure, function, and functional divergence. 
In order to demonstrate the practicality of these ideas and the potential benefit for functional genomic analysis, we describe a pilot project we are conducting to simultaneously sequence large numbers of vertebrate mitochondrial genomes.
 
15: Pac Symp Biocomp 2000; 12:3-5
 
Protein Evolution and Structural Genomics

Frishman D, Goldstein RA, Pollock DD.

The genomic data available to computational biologists represents the product of the complex processes of evolution. In particular, the forces of mutation, duplication, and selection have acted to sculpt modern protein sequence and structure in the context of changing functional requirements. Just as crystallographers are able to determine protein structures through an analysis of X-ray diffraction patterns, scientists are learning to read the evolutionary history of proteins in order to infer and explain both structure and function. This pursuit depends on the development of new computational approaches in order to make optimal use of genomic data, and requires interaction with experiment for comparison and verification of computational results.
 
14: Comp Chem 2000; 24(1):133-134
 
RECOMB98: Computational molecular biology: pre- and post-genomics

Pollock DD and Heringa J

Meeting review.
 
13: J Mol Biol 1999 Mar 19;287(1):187-98
 
Coevolving protein residues: maximum likelihood identification and relationship to structure

Pollock DD, Taylor WR, and Goldman N

The identification of protein sites undergoing correlated evolution (coevolution) is of great interest due to the possibility that these pairs will tend to be adjacent in the three-dimensional structure. Identification of such pairs should provide useful information for understanding the evolutionary process, predicting the effects of site-directed substitution, and potentially for predicting protein structure. Here, we develop and apply a maximum likelihood method with the aim of improving detection of coevolution. Unlike previous methods which have had limited success, this method allows for correlations induced by phylogenetic relationships and for variation in rate of evolution along branches, and does not rely on accurate reconstruction of ancestral nodes. In order to reduce the complexity of coevolutionary relationships and identify the primary component of pairwise coevolution between two sites, we reduce the data to a two-state system at each site, regardless of the actual number of residues observed at that site. Simulations show that this strategy is good at identifying simple correlations and at recognizing cases in which the data are insufficient to distinguish between coevolution and spurious correlations. The new method was tested by using size and charge characteristics to group the residues at each site, and then evaluating coevolution in myoglobin sequences. Grouping based on physicochemical characteristics allows categorization of coevolving sites into positive and negative coevolution, depending on the correlation between equilibrium state frequencies. We detected a striking excess of negative coevolution (corresponding to charge) at sites brought into proximity by the periodicity of the alpha-helix, and there was also a tendency for sites with significant likelihood ratios to be close in the three-dimensional structure. 
Sites on the surface of the protein appear to coevolve both when they are close in the structure, and when they are distant, implying a role for folding and/or avoidance of quaternary structure in the coevolution process. Copyright 1998 Academic Press.
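
The two-state reduction described in this abstract can be illustrated with a minimal sketch. It covers only the data-reduction step and a naive joint tabulation for a pair of alignment columns; it does not implement the maximum likelihood method and ignores phylogeny entirely. The grouping (charged versus uncharged residues) and the names `CHARGED`, `to_two_state`, and `pair_counts` are assumptions made for illustration, not taken from the paper.

```python
# Hypothetical grouping for illustration: charged vs. uncharged residues.
CHARGED = set("DEKR")

def to_two_state(column, group=CHARGED):
    """Map each residue in an alignment column to 1 (in group) or 0 (not in group)."""
    return [1 if residue in group else 0 for residue in column]

def pair_counts(col_a, col_b, group=CHARGED):
    """Tabulate joint two-state counts for two sites; keys are (state_a, state_b)."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(to_two_state(col_a, group), to_two_state(col_b, group)):
        counts[(a, b)] += 1
    return counts

# Two toy alignment columns, one residue per sequence.
site_a = "DAKL"
site_b = "ELRA"
table = pair_counts(site_a, site_b)
```

A table like this only summarizes co-occurrence across sequences; as the abstract stresses, correlations of this kind can be induced by phylogenetic relatedness alone, which is what the likelihood method is designed to account for.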
 
Myoglobin data from this manuscript can be found here.
 
12: Theor Popul Biol 1998 Aug; 54(1):78-90
 
Increased accuracy in analytical molecular distance estimation

Pollock DD

Analytical molecular distance estimates can be inaccurate and biased estimates of the total number of substitutions not only when the model of evolution they are based on is incorrect, but also when the method of estimating the total is too simple. This comes about because when there are different types of substitutions occurring simultaneously, it can become extremely difficult to estimate the number of the more quickly evolving type, and the variance of this larger number can overwhelm the total estimate. In this paper, in an extension of earlier work with a simple two-parameter model of evolution, more accurate analytical distances are derived for models appropriate to a variety of known DNA types using generalized least squares principles of noise reduction. It is shown that the new estimates can be applied to achieve more accurate results for site-to-site rate variation, regions with biased nucleotide frequencies, and synonymous sites in protein-coding regions. This study also includes a methodology to obtain accurate distance estimates for large numbers of sequence regions evolving in different manners. Copyright 1998 Academic Press.
 
11: Theor Popul Biol 1998 Jun; 53(3):256-71
 
Microsatellite behavior with range constraints: parameter estimation and improved distances for use in phylogenetic reconstruction

Pollock DD, Bergman A, Feldman MW, Goldstein DB

A symmetric stepwise mutation model with reflecting boundaries is employed to evaluate microsatellite evolution under range constraints. Methods of estimating range constraints and mutation rates under the assumptions of the model are developed. Least squares procedures are employed to improve molecular distance estimation for use in phylogenetic reconstruction in the case where range constraints and mutation rates vary across loci. The bias and accuracy of these methods are evaluated using computer simulations, and they are compared to previously existing methods which do not assume range constraints. Range constraints are seen to have a substantial impact on phylogenetic conclusions based on molecular distances, particularly for more divergent taxa. Results indicate that if range constraints are in effect, the methods developed here should be used in both the preliminary planning and final analysis of phylogenetic studies employing microsatellites. It is also seen that in order to make accurate phylogenetic inferences under range constraints, a larger number of loci are required than in their absence.
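
As a rough illustration of the model named in this abstract, the sketch below simulates a symmetric stepwise mutation process with reflecting boundaries for a single allele. The boundary rule used here (a step that would leave the allowed range is turned back into it) is one common reading of "reflecting" and is an assumption, as are the function names `mutate` and `simulate` and the parameter names `lo` and `hi`.

```python
import random

def mutate(size, lo, hi, rng):
    """Apply one symmetric stepwise mutation (+1 or -1 repeat) within [lo, hi].

    Assumed boundary rule: a step that would leave the range is reflected,
    so the allele moves one repeat back into the range instead.
    """
    if not lo < hi:
        raise ValueError("range must contain at least two allele sizes")
    step = rng.choice((-1, 1))
    new_size = size + step
    if new_size < lo or new_size > hi:
        new_size = size - step  # reflect off the boundary
    return new_size

def simulate(size, lo, hi, n_mutations, seed=0):
    """Return the allele size after n_mutations successive mutations."""
    rng = random.Random(seed)
    for _ in range(n_mutations):
        size = mutate(size, lo, hi, rng)
    return size

final_size = simulate(size=10, lo=5, hi=15, n_mutations=1000)
```

Under this rule the allele size can never leave the range, which is why distances based on allele size stop growing linearly with time for more divergent taxa.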
 
10: Annals Ent Soc America 1998; 91(5):524-531
 
Molecular phylogeny for Colias butterflies and their relatives (Lepidoptera: Pieridae)

Pollock DD, Watt WB, Rashbrook VK, Iyengar EV

The sulfur butterflies, Colias spp., and their relatives in the family Pieridae have been the subjects of diverse behavioral, ecological, and evolutionary studies. However, their phylogeny is uncertain in many respects. We used DNA sequences from 2 mitochondrial gene blocks, 333 bp of the cytochrome oxidase I subunit (CO I) and 1,261 bp from the 2 ribosomal genes and the tRNA between them (rDNA), as character sources to test existing phylogenetic hypotheses and begin to infer others. The rDNA block resolves better at deeper nodes of the phylogeny, and the CO I block at shallower nodes. Our results support sister status for subfamilies Coliadinae and Pierinae within Pieridae; independent tribal status for Euchloini and Pierini within Pierinae; status as sister genera for Colias and Zerene within Coliadinae; and monophyly within subgenus C. (Eucolias) of all North American Colias studied. Our results suggest that the Neotropical coliad genus Eurema may warrant splitting, as some early workers proposed, but do not support the recently proposed splitting of Eurasian C. erate from subgenus C. (Eriocolias) into the separate subgenus C. (Neocolias).
 
9: J Hered 1997 Sep-Oct; 88(5):335-42

Launching microsatellites: a review of mutation processes and methods of phylogenetic interference

Goldstein DB, Pollock DD.
 
8: Protein Eng 1997 Jun; 10(6):647-57
 
Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution

Pollock DD, Taylor WR.

Various methods for detecting correlation between sites were evaluated by ascertaining their ability to discriminate positively correlated sites from background correlation at randomly evolved sites. A model for generating pairwise correlations of different degrees is also described. An assortment of physicochemical vectors and similarity and difference matrices were used to discriminate correlated change. There was little difference in effectiveness between the different matrices, but there were significant differences between the matrices and the physicochemical vectors. It is shown that all methods investigated exhibit significant inability to screen out background correlation, particularly in the presence of phylogenetic relatedness between the sequences. Methods using the matrices are unable to distinguish positively correlated from negatively correlated, or compensatory, replacements.
 
7: Genetics 1997 Jan; 145(1):207-16
 
Microsatellite genetic distances with range constraints: analytic description and problems of estimation

Feldman MW, Bergman A, Pollock DD, Goldstein DB.

Statistical properties of the symmetric stepwise-mutation model for microsatellite evolution are studied under the assumption that the number of repeats is strictly bounded above and below. An exact analytic expression is found for the expected products of the frequencies of alleles separated by k repeats. This permits characterization of the asymptotic behavior of our distances D1 and (delta mu)^2 under range constraints. Based on this characterization we develop transformations that partially restore linearity when allele size is restricted. We show that the appropriate transformation cannot be applied in the case of varying mutation rates (beta) and range constraints (R) because of statistical difficulties. In the special case of no variation in beta and R across loci, however, the transformation simplifies to a usable form and results in a distance much more linear with time than distances developed for an infinite range. Although analytically incorrect in the case of variation in beta and R, the simpler transformation is surprisingly insensitive to variation in these parameters, suggesting that it may have considerable utility in phylogenetic studies.
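
For readers unfamiliar with the delta-mu-squared distance mentioned in this abstract, the sketch below computes it in its usual untransformed form: the squared difference in mean repeat number between two populations, averaged over loci. It does not include the range-constraint transformation developed in the paper, which the abstract does not specify. The function name `delta_mu_squared` and the input layout (one list of repeat counts per locus) are assumptions made for illustration.

```python
def delta_mu_squared(pop_a, pop_b):
    """Average over loci of the squared difference in mean repeat number.

    pop_a, pop_b: lists of loci in the same order; each locus is a
    non-empty list of allele sizes (repeat counts) sampled from that population.
    """
    if len(pop_a) != len(pop_b) or not pop_a:
        raise ValueError("both populations need the same non-zero number of loci")
    per_locus = []
    for alleles_a, alleles_b in zip(pop_a, pop_b):
        mean_a = sum(alleles_a) / len(alleles_a)
        mean_b = sum(alleles_b) / len(alleles_b)
        per_locus.append((mean_a - mean_b) ** 2)
    return sum(per_locus) / len(per_locus)

# Toy data: two loci, a few sampled alleles per population.
population_a = [[10, 12], [20, 20]]
population_b = [[14, 16], [21, 23]]
distance = delta_mu_squared(population_a, population_b)
```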
 
6: Mol Biol Evol 1995 Jul; 12(4):713-7