Structural, biochemical, and in vivo characterization of the first virally encoded cyclophilin from the Mimivirus
Thai V, Renesto P, Fowler A, Brown D, Davis T, Gu W, Pollock DD, Kern D, Raoult D, and Eisenmesser E
Although multiple viruses, including SARS and HIV-1, utilize host cell cyclophilins, the role of cyclophilins in infection is poorly understood. To help elucidate this role, we have characterized the first virally encoded cyclophilin (mimicyp), derived from the largest virus discovered to date (the Mimivirus), which is also a causative agent of pneumonia in humans. Mimicyp adopts a typical cyclophilin fold, yet it also forms trimers unlike any previously characterized homologue. Strikingly, immunofluorescence assays reveal that mimicyp localizes to the surface of the mature virion, as recently proposed for several viruses that recruit host cell cyclophilins, such as SARS and HIV-1. Additionally, mimicyp lacks peptidyl-prolyl isomerase activity, in contrast to human cyclophilins. Thus, this study suggests that cyclophilins, whether recruited from host cells (e.g., HIV-1 and SARS) or virally encoded (e.g., Mimivirus), are localized on viral surfaces for at least a subset of viruses.
Phylogenomics, protein family evolution, and the Tree of Life: an integrated approach between molecular evolution and computational intelligence
Nahum LA and Pereira SL
The massive amount of information generated by genomic technologies has opened new frontiers in science by bridging disciplines such as computational biology, molecular biology, molecular evolution, evolutionary biology, and ecology. Many tools and methods have been developed over the past several years to allow analysis of molecular sequences. Phylogenomics, the interpretation of genomic data to determine gene function and phylogenetic relationships of organisms, remains challenging nevertheless. Here, we focus on the application of phylogenomics to improve functional prediction of genes/products, to understand the evolution of protein families, and to resolve phylogenetic relationships of organisms. We point out areas that require further development, such as computational tools and methods to manipulate large and diverse data sets. The application of an integrated computational and biological approach may help to achieve a better system-based understanding of biological processes in different environments. This will help to fully access valuable information available from the evolution of genes and genomes in the wide diversity of intact organisms and biological communities.
Coevolutionary patterns in cytochrome c oxidase
subunit I depend on structure and functional context
Wang ZO and Pollock DD
The strength and pattern of coevolution between amino acid residues varies depending on their structural and functional environment. This context dependence, along with differences in analytical technique, is responsible for different results among coevolutionary analyses of different proteins. It is thus important to perform detailed study of individual proteins to gain better insight into how context dependence can affect coevolutionary patterns even within individual proteins, and to unravel the details of context dependence with respect to structure and function. Here, we extend our previous study by presenting further analysis of residue coevolution in cytochrome c oxidase subunit I sequences from 231 vertebrates using a statistically robust phylogeny-based maximum likelihood ratio method. As in previous studies, a strong overall coevolutionary signal was detected, and coevolution within structural regions was significantly related to the Cα distances between residues. While the strong selection for adjacent residues among predicted coevolving pairs in the surface region indicates that the statistical method is highly selective for biologically relevant interactions, the coevolutionary signal was strongest in the transmembrane region, although the distances between coevolving residues were greater. This indicates that coevolution may act to maintain more global structural and functional constraints in the transmembrane region. In the transmembrane region, sites that coevolved according to polarity and hydrophobicity rather than volume had a greater tendency to co-localize with just one of the predicted proton channels (channel H). Thus, the details of coevolution in cytochrome c oxidase subunit I depend greatly on domain structure and residue physicochemical characteristics, but proximity to function appears to play a critical role.
We hypothesize that the association of coevolutionary sites with channel H was caused by adaptive coevolution, and is indicative of a more important functional role for this channel.
BACKGROUND: The mitochondrial genomes of snakes are characterized by an overall evolutionary rate that appears to be one of the most accelerated among vertebrates. They also possess other unusual features, including short tRNAs and other genes, and a duplicated control region that has been stably maintained since it originated more than 70 million years ago. Here, we provide a detailed analysis of evolutionary dynamics in snake mitochondrial genomes to better understand the basis of these extreme characteristics, and to explore the relationship between mitochondrial genome molecular evolution, genome architecture, and molecular function. We sequenced complete mitochondrial genomes from Slowinski's corn snake (Pantherophis slowinskii) and two cottonmouths (Agkistrodon piscivorus) to complement previously existing mitochondrial genomes, and to provide an improved comparative view of how genome architecture affects molecular evolution at contrasting levels of divergence. RESULTS: We present a Bayesian genetic approach that suggests that the duplicated control region can function as an additional origin of heavy strand replication. The two control regions also appear to have different intra-specific versus inter-specific evolutionary dynamics that may be associated with complex modes of concerted evolution. We find that different genomic regions have experienced substantial accelerated evolution along early branches in snakes, with different genes having experienced dramatic accelerations along specific branches. Some of these accelerations appear to coincide with, or to follow, the shortening of various mitochondrial genes and the duplication of the control region and flanking tRNAs. CONCLUSION: Fluctuations in the strength and pattern of selection during snake evolution have had widely varying gene-specific effects on substitution rates, and these rate accelerations may have been functionally related to unusual changes in genomic architecture.
The among-lineage and among-gene variation in rate dynamics observed in snakes is the most extreme thus far observed in animal genomes, and provides an important study system for further evaluating the biochemical and physiological basis of evolutionary pressures in vertebrate mitochondria.
Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences
Mikkelsen TS, Wakefield MJ, Aken B, Amemiya CT, Chang JL, Duke S, Garber M, Gentles AJ, Goodstadt L, Heger A, Jurka J, Kamal M, Mauceli E, Searle SM, Sharpe T, Baker ML, Batzer MA, Benos PV, Belov K, Clamp M, Cook A, Cuff J, Das R, Davidow L, Deakin JE, Fazzari MJ, Glass JL, Grabherr M, Greally JM, Gu W, Hore TA, Huttley GA, Kleber M, Jirtle RL, Koina E, Lee JT, Mahony S, Marra MA, Miller RD, Nicholls RD, Oda M, Papenfuss AT, Parra ZE, Pollock DD, Ray DA, Schein JE, Speed TP, Thompson K, VandeBerg JL, Wade CM, Walker JA, Waters PD, Webber C, Weidman JR, Xie X, Zody MC; Broad Institute Genome Sequencing Platform, Broad Institute Whole Genome Assembly Team, Jaffe DB, Alvarez P, Brockman W, Butler J, Chin C, Gnerre S, MacCallum I, Graves JA, Ponting CP, Breen M, Samollow PB, Lander ES, and Lindblad-Toh K
We report a high-quality draft of the genome sequence of the grey, short-tailed opossum (Monodelphis domestica). As the first metatherian ('marsupial') species to be sequenced, the opossum provides a unique perspective on the organization and evolution of mammalian genomes. Distinctive features of the opossum chromosomes provide support for recent theories about genome evolution and function, including a strong influence of biased gene conversion on nucleotide sequence composition, and a relationship between chromosomal characteristics and X chromosome inactivation. Comparison of opossum and eutherian genomes also reveals a sharp difference in evolutionary innovation between protein-coding and non-coding functional elements. True innovation in protein-coding genes seems to be relatively rare, with lineage-specific differences being largely due to diversification and rapid turnover in gene families involved in environmental interactions. In contrast, about 20% of eutherian conserved non-coding elements (CNEs) are recent inventions that postdate the divergence of Eutheria and Metatheria. A substantial proportion of these eutherian-specific CNEs arose from sequence inserted by transposable elements, pointing to transposons as a major creative force in the evolution of mammalian gene regulation.
Evolutionary dynamics of transposable elements in the short-tailed opossum Monodelphis domestica
Gentles AJ, Wakefield MJ, Kohany O, Gu W, Batzer MA, Pollock DD, and Jurka J
The genome of the gray short-tailed opossum Monodelphis domestica is notable for its large size (approximately 3.6 Gb). We characterized nearly 500 families of interspersed repeats from Monodelphis. They cover approximately 52% of the genome, higher than in any other amniotic lineage studied to date, and may account for the unusually large genome size. In comparison to other mammals, Monodelphis is significantly rich in non-LTR retrotransposons from the LINE-1, CR1, and RTE families, with >29% of the genome sequence comprised of copies of these elements. Monodelphis has at least four families of RTE, and we report support for horizontal transfer of this non-LTR retrotransposon. In addition to short interspersed elements (SINEs) mobilized by L1, we found several families of SINEs that appear to use RTE elements for mobilization. In contrast to L1-mobilized SINEs, the RTE-mobilized SINEs in Monodelphis appear to shift from G+C-rich to G+C-low regions with time. Endogenous retroviruses have colonized approximately 10% of the opossum genome. We found that their density is enhanced in centromeric and/or telomeric regions of most Monodelphis chromosomes. We identified 83 new families of ancient repeats that are highly conserved across amniotic lineages, including 14 LINE-derived repeats; and a novel SINE element, MER131, that may have been exapted as a highly conserved functional noncoding RNA, and whose emergence dates back to approximately 300 million years ago. Many of these conserved repeats are also present in human, and are highly over-represented in predicted cis-regulatory modules. Seventy-six of the 83 families are present in chicken in addition to mammals.
Regional variation in the density of essential genes in mice
Hentges KE, Pollock DD, Liu B, and Justice MJ
In most species, and particularly in vertebrates, the percentage of genes absolutely required for survival, the essential genes, has not been estimated. To obtain this estimation, we used the mouse as an experimental model to carry out high-efficiency N-ethyl-N-nitrosourea (ENU) mutagenesis screens in two balancer chromosome regions, and compared our results to a third previously published screen. The number of essential genes in each region was predicted based on allele frequencies. We determined that the density of essential genes differs by up to an order of magnitude among genomic regions. This indicates that extrapolating from regional estimates to genome-wide estimates of essential genes has a huge variance. A particularly high density of essential genes on mouse Chromosome 11 coincides with a high degree of regional linkage conservation, providing a possible causal explanation for the density variation. This is the first demonstration of regional variation in essential gene density in the mouse genome.
SINEs, evolution and genome structure in the opossum
Gu W, Ray DA, Walker JA, Barnes EW, Gentles AJ, Samollow PB, Jurka J, Batzer MA, and Pollock DD
Short INterspersed Elements (SINEs) are non-autonomous retrotransposons, usually between 100 and 500 base pairs (bp) in length, which are ubiquitous components of eukaryotic genomes. Their activity, distribution, and evolution can be highly informative on genomic structure and evolutionary processes. To determine recent activity, we amplified more than one hundred SINE1 loci in a panel of 43 M. domestica individuals derived from five diverse geographic locations. The SINE1 family has expanded recently enough that many loci were polymorphic, and the SINE1 insertion-based genetic distances among populations reflected geographic distance. Genome-wide comparisons of SINE1 densities and GC content revealed that high SINE1 density is associated with high GC content in a few long and many short spans. Young SINE1s, whether fixed or polymorphic, showed an unbiased GC content preference for insertion, indicating that the GC preference accumulates over long time periods, possibly in periodic bursts. SINE1 evolution is thus broadly similar to human Alu evolution, although it has an independent origin. High GC content adjacent to SINE1s is strongly correlated with bias towards higher AT to GC substitutions and lower GC to AT substitutions. This is consistent with biased gene conversion, and also indicates that like chickens, but unlike eutherian mammals, GC content heterogeneity (isochore structure) is reinforced by substitution processes in the M. domestica genome. Nevertheless, both high and low GC content regions are apparently headed towards lower GC content equilibria, possibly due to a relative shift to lower recombination rates in the recent Monodelphis ancestral lineage. Like eutherians, metatherian (marsupial) mammals have evolved high CpG substitution rates, but this is apparently a convergence in process rather than a shared ancestral state.
Dealing with Uncertainty in Ancestral Sequence Reconstruction: Sampling from the Posterior Distribution
Pollock DD and Chang BS
Resurrection of ancestral proteins in the laboratory to investigate
aspects of their function has provided an exciting opportunity
to experimentally test theories concerning the evolution of
protein structure and function. A potentially important pitfall
of this approach, however, is that sequence and functional bias
in ancestral reconstruction may affect results. In the worst-case
scenario, the bias in reconstruction could lead to incorrect
functional interpretation for reconstructed proteins. Inferring
function or stability based on a single resurrected protein
sequence may be a risky proposition without concurrent examination
to determine if a bias in functional shifts indeed exists. If
the evolutionary process can be modeled fairly well, an effective
means to eliminate the reconstruction bias is to sample ancestral
proteins from the posterior probability space. It is also important
to incorporate uncertainty in the model of evolution and model
variation across sites, and to consider the absence of rare
variants. The question of how many reconstructed ancestral samples
are sufficient to estimate probable ancestral function is an
open one, and it may be specific to the variability in inferred
function among likely ancestors. Given a reasonably accurate
model of evolution, the sampling of even a few proteins from
the posterior may provide a relatively unbiased estimate of
ancestral function, and would allow evaluation of the variance
in this functional estimate. We discuss the details of the problem,
propose a simple experimental approach to solve it, and provide
a program to sample ancestral sequences and to evaluate the
tendency of maximum likelihood estimates to alter amino acid
frequencies and under-sample rare (possibly slightly deleterious)
variants in a protein.
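The contrast drawn above between a single maximum likelihood reconstruction and sampling from the posterior can be illustrated with a minimal Python sketch. It is not the authors' program: it assumes that per-site posterior probabilities have already been computed and that sites are independent, and the `posteriors` values, the function names, and the three-site example are all hypothetical.

```python
import random
from collections import Counter

def ml_sequence(posteriors):
    """Point estimate: take the most probable residue at every site."""
    return "".join(max(site, key=site.get) for site in posteriors)

def sample_sequence(posteriors, rng):
    """Draw one ancestral sequence by sampling each site from its posterior."""
    residues = []
    for site in posteriors:
        states = list(site)
        weights = [site[s] for s in states]
        residues.append(rng.choices(states, weights=weights, k=1)[0])
    return "".join(residues)

def residue_frequencies(sequences):
    """Overall residue frequencies across a collection of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# Hypothetical posteriors for three sites (assumed values, for illustration only).
# "M" is never the most probable residue, so the point estimate drops it entirely.
posteriors = [
    {"L": 0.7, "M": 0.3},
    {"L": 0.6, "M": 0.4},
    {"L": 0.9, "M": 0.1},
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_sequence(posteriors, rng) for _ in range(2000)]

ml = ml_sequence(posteriors)
sampled_freqs = residue_frequencies(samples)

print("point estimate:", ml)
print("residue frequencies in point estimate:", residue_frequencies([ml]))
print("residue frequencies in posterior samples:", sampled_freqs)
```

In this toy case the point estimate contains no "M" at all, whereas the sampled sequences carry it at roughly its average posterior probability, which is the kind of altered residue frequency and under-sampling of rare variants that the abstract warns about.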
EGenBio: a data management system
for evolutionary genomics and biodiversity
Nahum LA, Reynolds MT, Wang ZO, Faith JJ, Jonna R, Jiang ZJ, Meyer TJ, and Pollock DD
BACKGROUND: Evolutionary genomics requires management and filtering of large numbers of diverse genomic sequences for accurate analysis and inference on evolutionary processes of genomic and functional change. We developed Evolutionary Genomics and Biodiversity (EGenBio; http://egenbio.lsu.edu) to begin to address this. DESCRIPTION: EGenBio is a system for manipulation and filtering of large numbers of sequences, integrating curated sequence alignments and phylogenetic trees, managing evolutionary analyses, and visualizing their output. EGenBio is organized into three conceptual divisions, Evolution, Genomics, and Biodiversity. The Genomics division includes tools for selecting pre-aligned sequences from different genes and species, and for modifying and filtering these alignments for further analysis. Species searches are handled through queries that can be modified based on a tree-based navigation system and saved. The Biodiversity division contains tools for analyzing individual sequences or sequence alignments, whereas the Evolution division contains tools involving phylogenetic trees. Alignments are annotated with analytical results and modification history using our PRAED format. A miscellaneous Tools section and Help framework are also available. EGenBio was developed around our comparative genomic research and a prototype database of mtDNA genomes. It utilizes MySQL relational databases and dynamic page generation, and calls numerous custom programs. CONCLUSION: EGenBio was designed to serve as a platform for tools and resources to ease combined analysis in evolution, genomics, and biodiversity.
Assessing the accuracy of ancestral protein
reconstruction methods
Williams PD, Pollock DD, Blackburne BP, and Goldstein RA
The phylogenetic inference of ancestral protein sequences is
a powerful technique for the study of molecular evolution, but
any conclusions drawn from such studies are only as good as
the accuracy of the reconstruction method. Every inference method
leads to errors in the ancestral protein sequence, resulting
in potentially misleading estimates of the ancestral protein's
properties. To better understand the conditions of the past,
it is important to understand the accuracy of different methods
and how the resulting errors affect the conclusions drawn. The
Maximum Parsimony (MP) and Maximum Likelihood (ML) inference
methods have been shown to misestimate ancestral nucleotide
frequencies, revealing a consistent and incorrect bias, but
little data for proteins exists, partially because of the difficulty
of finding true ancestral sequences for comparison. To assess
the accuracy of ancestral protein reconstruction methods, we
perform computational population evolution simulations featuring
speciation and divergence events using an off-lattice protein
model where fitness depends on the ability to fold into a specified
target structure. As we know the population of sequences at
each step of the simulation, we can compare these known ancestral
sequences and the resulting thermodynamic properties with those
inferred by MP, ML, and Bayesian methods. We find that MP and,
even more so, ML methods overestimate thermostability and that
a Bayesian analysis, although it does not generate the most
accurate sequences, is the most accurate and most unbiased in
terms of resulting protein properties. This suggests that ancestral
reconstruction studies performed using MP and ML may need to
be re-evaluated.
Observations of amino acid gain and loss
during protein evolution are explained by statistical bias
Goldstein RA and Pollock DD
In the scientific literature, and in molecular evolution in
particular, extravagant claims are oftentimes given exceptional
attention. This is true for unusual inferences of relationships
among organisms, dating of organismal divergence times, and
for reconstruction of function and properties of ancestral proteins.
In all of these cases, misuse of statistics and ignorance of
variation can lead to phylogenetic optimism, whereby
confidence in the results is vastly overstated and important
sources of bias ignored. As a case in point, the authors of
a recent manuscript in Nature claim to have discovered universal
trends of amino acid gain and loss in protein evolution.
Such an inference of convergent evolution in the same direction
in many different taxa should always be treated with extreme
caution, since inferential bias is a likely explanation for
such a trend. Here, we show that the universal trend
in amino acid evolution can be explained by a bias in common
methods for inferring evolutionary trends in proteins. Trends
can be more accurately detected using phylogeny-based Bayesian
methods, but the currently available dataset does not contain
sufficient taxa to make definitive assertions, and previous
assertions are almost certainly unfounded. Variation in amino
acid replacement rates among proteins, among positions within
proteins, and over time currently overwhelms our ability to
make sound claims about such trends.
Selective advantage of recombination in evolving
protein populations: A lattice model study
Williams PD, Pollock DD, and Goldstein RA
Recent research has attempted to clarify the contributions of
several mutational processes, such as substitutions or homologous
recombination. Simplistic, tractable protein models, which determine
the compact native structure phenotype from the sequence genotype,
are well-suited to such studies. In this paper, we use a lattice-protein
model to examine the effects of point mutation and homologous
recombination on evolving populations of proteins. We find that
while the majority of mutation and recombination events are
neutral or deleterious, recombination is far more likely to
be beneficial. This results in a faster increase in fitness
during evolution, although the final fitness level is not significantly
changed. This transient advantage provides an evolutionary advantage
to subpopulations that undergo recombination, allowing fixation
of recombination to occur in the population.
Functionality and the evolution of marginal
stability in proteins: inferences from lattice simulations
Williams PD, Pollock DD, and Goldstein RA
It has been known for some time that many proteins are marginally
stable. This has inspired several explanations. Having noted
that the functionality of many enzymes is correlated with subunit
motion, flexibility, or general disorder, some have suggested
that marginally stable proteins should have an evolutionary
advantage over proteins of differing stability. Others have
suggested that stability and functionality are contradictory
qualities, and that selection for both criteria results in marginally
stable proteins, optimised to satisfy the competing design pressures.
While these explanations are plausible, recent research simulating
the evolution of model proteins has shown that selection for
stability, ignoring any aspects of functionality, can result
in marginally stable proteins because of the underlying makeup
of protein sequence-space. We extend this research by simulating
the evolution of proteins, using a computational protein model
that equates functionality with binding and catalysis. In the
model, marginal stability is not required for ligand-binding
functionality and we observe no competing design pressures.
The resulting proteins are marginally stable, again demonstrating
that neutral evolution is sufficient for explaining marginal
stability in observed proteins.
Divergence, recombination,
and retention of functionality during protein evolution
Xu YO, Hall RW, Goldstein RA, and Pollock DD
Protein structure and function are not easily predictable from
primary sequence, and because of this we have only a vague idea
of exactly how protein sequences evolve in the context of structure
and function. Thanks to increasing biodiversity in genomic studies,
progress is being made in detecting context-dependent variation
in substitution processes, but it remains unclear exactly what
features of the evolutionary process we should be looking for.
To address this, our laboratories have been developing a system
for simulating protein evolution in the context of structure
and function using lattice models of proteins and ligands (or
substrates). This system includes both thermodynamic features
of protein stability and population dynamics; we refer to this
approach as ab initio evolution to emphasize that the equilibrium
details of variant fitnesses arise from the physical principles
of the system, and not from any pre-conceived notions or arbitrary
mathematical distributions. Here, we discuss the relevance of
the system to evolutionary genomics and the choices that must
be made in trying to reproduce essential biological features
in the face of immense computational burdens. We present new
results on the coevolution during the divergence process and
retention of functionality in homologous recombinants following
population divergence. The designability, or sequence space
available to a structure, plays a key role in divergence and
recombinant function. These results have implications for understanding
viral evolution, speciation, and directed evolutionary experiments.
We also show that the results of our analysis of the divergence
process can guide improved methods for accurately approximating
folding probabilities in more complex systems that would otherwise
be beyond computational feasibility.
Sequences and protein
structures are congruent with functional and fitness differences
among Colias phosphoglucose isomerase genotypes
Wheat CW, Watt WB, Pollock DD, and Schulte PM
The enzyme phosphoglucose isomerase, PGI, of Colias butterflies
(Lepidoptera, Pieridae) displays a widespread allozyme polymorphism.
Many studies on the biochemical function, organismal performance,
and fitness effects of Colias PGI genotypes have given evidence
of strong natural selection in the wild to maintain this polymorphism.
Here we begin to study the mechanism underlying this adaptive
polymorphism at the level of molecular sequence and structure.
The common electrophoretically-detectable alleles differ at
multiple amino acid positions, and also show some cryptic charge-neutral
amino acid variation hidden within the electrophoretic allele
classes. Structural modeling shows that all changes are at or
near PGI's surface, and several naturally abundant variants
that distinguish these alleles are so placed as potentially
to alter subunit interaction and catalytic center geometry.
There is a large excess of intraspecific variation, both synonymous
and nonsynonymous, compared to interspecific fixation: there
are no fixed synonymous differences between species, and only
two fixed nonsynonymous differences. The fixed differences may
be due to positive selection, but sliding window analysis of
synonymous nucleotide diversity and Tajima's D shows that
the amino acid sites predicted to be foci of selection
based on structural and functional considerations also coincide
with the regions of highest synonymous diversity. They are thus
the most likely targets of balancing selection based on both
genetic and biochemical considerations. Colias' PGI gene, with
1668 bp of cDNA, is divided into 12 exons, spread over ~11 kb
of chromosomal DNA, and intragenic recombination has been active
over much of the gene. Our results show that the relaxation
of constraint against amino acid variation, as one moves from
the interior cores of proteins to their surface, allows adaptive,
as well as neutral, natural variation to occur near or at those
surfaces. This case study of persistent polymorphism now offers
the integration of the genomic and molecular-structural bases
of natural variation with its consequences for metabolic and
organismal performance, thence for fitness, in wild populations.
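The sliding window analysis of Tajima's D mentioned above can be sketched in a few lines of Python. This is only an illustration of the standard statistic (Tajima 1989) applied to windows of an alignment, not the authors' analysis: it counts all alignment columns rather than synonymous sites only, assumes equal-length aligned sequences with no gaps, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences; None if undefined."""
    n = len(seqs)
    if n < 4:
        return None  # the variance term is zero for fewer than four sequences
    # S: number of segregating (variable) alignment columns
    S = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if S == 0:
        return None
    # pi: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Standard constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    if variance <= 0:
        return None
    return (pi - S / a1) / sqrt(variance)

def sliding_tajimas_d(seqs, window, step):
    """Return (window start, D) pairs along the alignment."""
    length = len(seqs[0])
    return [
        (start, tajimas_d([s[start:start + window] for s in seqs]))
        for start in range(0, length - window + 1, step)
    ]

# Toy alignments (hypothetical data)
singletons = ["AAAA", "TAAA", "ATAA", "AATA"]  # only rare variants
balanced = ["AA", "AA", "TT", "TT"]            # intermediate-frequency variants

print(tajimas_d(singletons))
print(tajimas_d(balanced))
print(sliding_tajimas_d(singletons, window=2, step=2))
```

An excess of intermediate-frequency variants, as in the `balanced` toy data, pushes D positive, which is the signature usually associated with balancing selection; an excess of rare variants pushes it negative.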
NHGRI White Paper 2005
Proposal to sequence the first reptilian
genome: the Green Anole Lizard, Anolis carolinensis
J. Losos, E. Braun, D. Brown, S. Clifton, S. Edwards, J.
Gibson-Brown, T. Glenn, L. Guillette, D. Main, P. Minx, W. Modi,
M. Pfrender, D. Pollock, D. Ray, A. Shedlock, and W. Warren
Evolution
of base substitution gradients in primate mitochondrial genomes
Raina SZ, Faith JJ, Seligmann H, Disotell T, Stewart C-B,
and Pollock DD
Substitution patterns among nucleotides are often assumed to
be constant in phylogenetic analyses. Although variation in
the average rate of substitution among sites is commonly accounted
for, variation in the relative rates of specific types of substitution
is not. Here, we review details of methodologies used for detecting
and analyzing differences in substitution processes among predefined
groups of sites. We describe how such analyses can be performed
using existing phylogenetic tools, and discuss how new phylogenetic
analysis tools we have recently developed can be used to provide
more detailed and sensitive analyses, including study of the
evolution of mutation and substitution processes. As an example
we consider the mitochondrial genome, for which two types of
transition deaminations (C=>T and A=>G) are strongly
affected by single-strandedness during replication, resulting
in an asymmetric mutation process. Since time spent single-stranded
varies along the mitochondrial genome, their differential mutational
response results in very different substitution patterns in
different regions of the genome.
The beetle gut: a hyperdiverse
source of novel yeasts
Suh S-O, McHugh JV, Pollock DD, and Blackwell M
We isolated over 650 yeasts over a three-year period from the
gut of a variety of beetles and characterized them on the basis
of LSU rDNA sequences and morphological and metabolic traits.
Of these, at least 200 were undescribed taxa, a number equivalent
to almost 30% of all currently recognized yeast species. A Bayesian
analysis of species discovery rates predicts further sampling
of previously sampled habitats could easily produce another
100 species. The sampled habitat is, thereby, estimated to contain
well over half as many more species as are currently known worldwide.
The beetle gut yeasts occur in 45 independent lineages scattered
across the yeast phylogenetic tree, often in clusters. The distribution
suggests that some of the yeasts diversified by a process of
horizontal transmission in the habitats and subsequent specialization
in association with insect hosts. Evidence of specialization
comes from consistent association over time and broad geographical
ranges of certain yeasts and beetle species. The discovery of
high yeast diversity in a previously unexplored habitat is a
first step toward investigating the basis of the interactions
and their impact in relation to ecology and evolution.
Modeling protein evolution has been frustratingly simplistic
in the past, but new methodologies and approaches have been
rapidly changing this situation. Increased computational power,
improved phylogeny-based maximum likelihood and Bayesian statistics,
larger data sets, and better protein structure prediction methods
are jointly improving the outlook and allowing researchers to
improve the biological realism of protein models. They are also
allowing more detailed analysis of differences in processes
among sequence positions over space and time, of selection and
adaptation, coevolution, and functional divergence, and of ancestral
changes in function. The future is expected to bring improved
integration of models of protein evolution with protein structure
prediction, with the potential to dramatically improve the accuracy
and power of both.
Context dependence and
coevolution among amino acid residues in proteins
Wang ZO and Pollock DD
As complete genomes accumulate, and the generation of genomic
biodiversity proceeds at an accelerating pace, the need to understand
the interaction between sequence evolution and protein structure
and function rises in prominence. The pattern and pace of substitutions
in proteins can provide important clues to functional importance,
functional divergence, and adaptive response. Coevolution between
amino acid residues and the context-dependence of the evolutionary
process are often ignored, however, due to their complexity;
but they are of critical importance for the accurate interpretation
of reconstructed evolutionary events. Since residues interact
with one another, and because the effect of substitutions can
depend on the structural and physiological environment in which
they occur, an accurate science of evolutionary functional genomics
and a complete understanding of selection in proteins require
a better understanding of how context dependence affects protein
evolution. Here, we present new evidence from vertebrate cytochrome
oxidase sequences that pairwise coevolutionary interactions
between protein residues are highly dependent on tertiary and
secondary structure. We also discuss theoretical predictions
that impinge on our expectations of how protein residues may
interact over long distances due to their shared need to maintain
protein stability.
Analysis of among-site
variation in substitution patterns
Krishnan NM, Raina SZ, and Pollock DD
Substitution patterns among nucleotides are often assumed to
be constant in phylogenetic analyses. Although variation in
the average rate of substitution among sites is commonly accounted
for, variation in the relative rates of specific types of substitution
is not. Here, we review details of methodologies used for detecting
and analyzing differences in substitution processes among predefined
groups of sites. We describe how such analyses can be performed
using existing phylogenetic tools, and discuss how new phylogenetic
analysis tools we have recently developed can be used to provide
more detailed and sensitive analyses, including study of the
evolution of mutation and substitution processes. As an example
we consider the mitochondrial genome, for which two types of
transition deaminations (C=>T and A=>G) are strongly
affected by single-strandedness during replication, resulting
in an asymmetric mutation process. Since time spent single-stranded
varies along the mitochondrial genome, their differential mutational
response results in very different substitution patterns in
different regions of the genome.
Detecting gradients
of asymmetry in site-specific substitutions in mitochondrial
genomes
Krishnan NM, Seligmann H, Raina SZ, and Pollock DD
During mitochondrial replication, spontaneous mutations occur
and accumulate asymmetrically during the time spent single-stranded
by the heavy strand (DssH). The predominant mutations appear
to be deaminations from adenine to hypoxanthine (A=>H, which
leads to an A=>G substitution) and cytosine to thymine (C=>T).
Previous findings indicated that C=>T substitutions accumulate
rapidly and then saturate at high DssH, suggesting protection
or repair, whereas A=>G accumulates linearly with DssH. We
describe here the implementation of a simple hidden Markov model
(HMM) of among-site rate correlations to provide an almost continuous
profile of the asymmetry in substitution response for any particular
substitution type. We implement this model using a phylogeny-based
Bayesian Markov chain Monte Carlo (MCMC) approach. We compare
and contrast the relative asymmetries in all twelve possible
substitution types, and find that the observed transition substitution
responses determined using our new method agree quite well with
previous predictions of a saturating curve for C=>T transition
substitutions and a linear accumulation of A=>G transitions.
The patterns seen in transversion substitutions show much lower
among-site variation and are non-linear and more complex than
those seen in transitions. We also find that, after accounting
for the principal linear effect, some of the residual variation
in A=>G/G=>A response ratios is explained by the average
predicted nucleic acid secondary structure propensity at a site,
possibly due to protection from mutation when secondary structure
forms.
The ambush hypothesis:
Hidden stop codons prevent off-frame gene reading
Seligmann H and Pollock DD
Coding sequences lack stop codons, but many stops appear off-frame.
Off-frame stops (stops in -1 and +1 shifted reading frames,
termed hidden stops) terminate frameshifted translation, potentially
decreasing energy and resource waste on non-functional proteins.
Benefits may include reduced waste elimination costs and avoidance
of potentially cytotoxic frame-shifted products. Our ambush
hypothesis suggests that hidden stops are sometimes selected
for. Codons of many amino acids can contribute to hidden stops,
depending on the synonymous position state and adjacent codons.
In vertebrate mitochondria, 31.75% of all amino acid combinations
can form hidden stops. Codons with more potential to form hidden
stops have greater usage frequency and bias in their favor among
synonymous codons. Among primates, predicted mitochondrial rRNA
secondary structure stability correlates negatively with the
number of hidden stops in the mitochondrial genome. The taxonomic
distribution of genetic codes suggests that +1 frameshifts might
be more frequent than -1 frameshifts. This is confirmed
by analyses of primate mitochondrial genomes: species with unstable
rRNAs have more +1 stops, but the correlation is weak for -1
stops. High hidden stop density seems to be an adaptation in
species with slippage-prone ribosomes (unstable rRNAs). Hidden
stops may thus compensate for reduced efficiency of some parts
of the biosynthetic machinery. Some experimental data confirm
our hypothesis: gene expression increases with the experimentally
manipulated number of stops in the promoter region of a gene,
suggesting biotechnological applications.
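The counting of hidden stops described above can be illustrated with a minimal sketch. It is not the authors' code: the function name, the frame convention (+1 read from the second nucleotide, -1 from the third), and the two stop-codon sets are assumptions made for illustration. The vertebrate mitochondrial set follows the standard genetic code table for that code.

```python
# Stop codons in the standard code and in the vertebrate mitochondrial code.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
VERT_MITO_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def count_hidden_stops(cds, stops=STANDARD_STOPS):
    """Count stop codons in the +1 and -1 shifted frames of a coding sequence.

    `cds` is assumed to start at a codon boundary. The +1 frame is read
    from offset 1 and the -1 frame from offset 2 (equivalent to -1 mod 3).
    """
    seq = cds.upper()
    counts = {}
    for name, offset in (("+1", 1), ("-1", 2)):
        counts[name] = sum(
            1
            for i in range(offset, len(seq) - 2, 3)
            if seq[i:i + 3] in stops
        )
    return counts


if __name__ == "__main__":
    # In frame: GCT GAC CTA GCA (no stop); off-frame stops are hidden.
    print(count_hidden_stops("GCTGACCTAGCA"))
    print(count_hidden_stops("GCTGACCTAGCA", VERT_MITO_STOPS))
```

Dividing each count by the number of codons scanned would give a hidden stop density that can be compared across genes or species.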
Ancestral sequence reconstruction
in primate mitochondrial DNA: compositional bias and effect
on functional inference
Krishnan NM, Seligmann H, Stewart C-B, de Koning APJ, and Pollock
DD
Reconstruction of ancestral DNA and amino acid sequences is
an important means of inferring information about past evolutionary
events. Such reconstructions suggest changes in molecular function
and evolutionary processes over the course of evolution, and
are used to infer adaptation and convergence. Maximum likelihood
(ML) is generally thought to provide relatively accurate reconstructed
sequences compared to parsimony, but both methods lead to the
inference of multiple directional changes in nucleotide frequencies
in primate mitochondrial DNA (mtDNA). To better understand this
surprising result, as well as to better understand how parsimony
and ML differ, we constructed a series of computationally simple
conditional pathway methods that differed in the
number of substitutions allowed per site along each branch,
and also evaluated the entire Bayesian posterior frequency distribution
of reconstructed ancestral states. We analyzed primate mitochondrial
cytochrome b (Cyt-b) and cytochrome oxidase subunit I (COI)
genes and found that ML reconstructs ancestral frequencies that
are often more different from tip sequences than are parsimony
reconstructions. In contrast, frequency reconstructions based
on the posterior ensemble more closely resemble extant nucleotide
frequencies. Simulations indicate that these differences in
ancestral sequence inference are probably due to deterministic
bias caused by high uncertainty in the optimization-based ancestral
reconstruction methods (parsimony, ML, Bayesian maximum a posteriori).
In contrast, ancestral nucleotide frequencies based on an average
of the Bayesian set of credible ancestral sequences are much
less biased. The methods involving simpler conditional pathway
calculations have slightly reduced likelihood values compared
to full likelihood calculations, but can provide fairly unbiased
nucleotide reconstructions and may be useful in more complex
phylogenetic analyses than considered here due to their speed
and flexibility. To determine whether biased reconstructions
using optimization methods might affect inferences of functional
properties, ancestral primate mitochondrial tRNA sequences were
inferred and helix-forming propensities for conserved pairs
were evaluated in silico. For ambiguously reconstructed nucleotides
at sites with high base composition variability, ancestral tRNA
sequences from Bayesian analyses were more compatible with canonical
base pairing than were those inferred by other methods. Thus,
nucleotide bias in reconstructed sequences apparently can lead
to serious bias and inaccuracies in functional predictions.
Estimating the degree
of saturation in mutant screens
Pollock DD and Larkin J
Large-scale screens for loss-of-function mutants have played
a significant role in recent advances in developmental biology
and other fields. In such mutant screens, it is desirable to
estimate the degree of saturation of the screen
(i.e., what fraction of the possible target genes have been
identified). We applied Bayesian and maximum likelihood methods
for estimating the number of loci remaining undetected in large-scale
screens, and produced credibility intervals to assess the uncertainty
of these estimates. Since different loci may mutate to alleles
with detectable phenotypes at different rates, we also incorporated
variation in the degree of mutability among genes, using either
gamma-distributed mutation rates or multiple discrete mutation
rate classes. We examined eight published data sets from large-scale
mutant screens and found that credibility intervals are much
broader than implied by previous assumptions about the degree
of saturation of screens. The likelihood methods presented here
are a significantly better fit to data from published experiments
than estimates based on the Poisson distribution, which implicitly
assumes a single mutation rate for all loci. The results are
reasonably robust to different models of variation in the mutability
of genes. We tested our methods against mutant allele data from
a region of the Drosophila melanogaster genome for which there
is an independent genomics-based estimate of the number of undetected
loci, and found that the number of such loci falls within the
predicted credibility interval for our models. The methods we
have developed may also be useful for estimating the degree
of saturation in other types of genetic screens in addition
to classical screens for simple loss-of-function mutants, including
genetic modifier screens and screens for protein-protein interactions
using the yeast two-hybrid method.
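For comparison, the single-rate Poisson baseline mentioned above can be sketched as follows. This is only the simple model that the abstract argues against, not the authors' Bayesian or gamma-rate methods. The function name and input format are hypothetical, and the estimator is the standard zero-truncated Poisson maximum likelihood solution, found here by bisection.

```python
import math


def poisson_undetected(hits_per_locus):
    """Estimate the number of undetected loci under a single-rate Poisson model.

    `hits_per_locus` lists the number of mutant alleles (>= 1) recovered
    for each detected locus. Returns (lam, undetected), where lam is the
    estimated mean number of hits per locus.
    """
    n = len(hits_per_locus)
    mean = sum(hits_per_locus) / n
    if mean <= 1.0:
        raise ValueError("all loci hit once: the estimate is unbounded")
    # Zero-truncated Poisson MLE: solve lam / (1 - exp(-lam)) = mean.
    # The left side increases with lam and the root lies in (0, mean).
    lo, hi = 1e-9, mean
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    p0 = math.exp(-lam)  # probability that a locus receives no hits
    return lam, n * p0 / (1 - p0)


if __name__ == "__main__":
    lam, missing = poisson_undetected([1, 1, 2, 3, 1, 2, 4, 2])
    print(f"lambda = {lam:.3f}, estimated undetected loci = {missing:.2f}")
```

Because this model assumes one mutation rate for all loci, it tends to understate uncertainty when mutability varies among genes, which is the point made in the abstract.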
27: Human Genomics 2004; 1(2): 85
Human genomics and the
role of evolutionary genomics
Pollock DD
Human Genomics has, from its outset, included a great deal of
evolutionary analysis. The structure of the editorial board
has representation from many evolution-based disciplines, including
population and quantitative genetics, and of course, evolutionary
genomics. This inclusion is the result of an obvious trend in
the field of genomics to incorporate more and more evolutionary
analysis, not just as an extra frill, but as a central component
of the field. The world now has over one hundred complete bacterial
genomes, and with human, roundworm, multiple fruitflies, mosquito,
rice, Arabidopsis, pufferfish, mouse, rat, dog, chimpanzee,
chicken, and a growing number of other multicellular organisms
either sequenced or imminent, comparative genomics is coming
into its own. Still, one might argue that a journal of Human
Genomics should focus on its main target, Homo sapiens, and
leave aside mucking about with the multitude of other species
on the planet, most of which many self-respecting Homo sapiens
individuals might rather target with the bottom of their shoe
rather than with a multimillion dollar sequencing project. As
the evolutionary genomics editor, it seems necessary to provide
some explanation and justification.
Likelihood analysis of
asymmetrical mutation bias gradients in vertebrate mitochondrial
genomes
Faith JJ and Pollock DD
Protein-coding genes in mitochondrial genomes have varying degrees
of asymmetric skew in base frequencies at the third codon position.
The variation in skew among genes appears to be caused by varying
durations of time that the heavy strand spends in the mutagenic
single stranded state during replication (DssH). The primary data
used to study skew has been the gene-by-gene base frequencies
in individual taxa, which provides little information on exactly
what kinds of mutations are responsible for the base frequency
skew. To assess the contribution of individual mutation components
to the ancestral vertebrate substitution pattern, here we analyze
a large data set of complete vertebrate mitochondrial genomes
in a phylogeny-based likelihood context. This also allows us to
evaluate the change in skew continuously along the mitochondrial
genome, and to directly estimate relative substitution rates.
Our results indicate that different types of mutation respond
differently to the gradient. A primary role for hydrolytic deamination
of cytosines in creating variance in skew among genes was not
supported, but rather linearly increasing rates of mutation from
adenine to hypoxanthine with DssH appear to drive regional differences
in skew. Substitutions due to hydrolytic deamination of cytosines,
although common, appear to quickly saturate, possibly due to stabilization
by the mitochondrial DNA single strand binding protein. These
results should form the basis of more realistic models of DNA
and protein evolution in mitochondria.
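The gene-by-gene base frequency skew referred to above can be computed with a short sketch. The abstract does not give a formula, so the definitions used here, AT skew = (A - T)/(A + T) and GC skew = (G - C)/(G + C), are an assumption based on common usage, and the function name is hypothetical.

```python
from collections import Counter


def third_position_skew(cds):
    """Return AT and GC skew at third codon positions of a coding sequence.

    `cds` is assumed to start at a codon boundary. A skew is NaN when
    neither base of the pair occurs at a third position.
    """
    third = Counter(cds.upper()[2::3])  # every third base, starting at index 2

    def skew(x, y):
        total = third[x] + third[y]
        return (third[x] - third[y]) / total if total else float("nan")

    return {"AT": skew("A", "T"), "GC": skew("G", "C")}


if __name__ == "__main__":
    # Third positions of GCA GCA GCT GCG GCC are A, A, T, G, C.
    print(third_position_skew("GCAGCAGCTGCGGCC"))
```

Computing these values for each gene and plotting them against the gene's estimated DssH would show the kind of gradient the abstract describes, although it cannot separate the individual mutation types the way the likelihood analysis does.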
25: NHGRI White Paper 2003
Proposal for complete sequencing of the genome
of a Marsupial, the gray, short-tailed opossum, Monodelphis
domestica
Amemiya CT, Greally JM, Jirtle RL, Lander ES, Lindblad-Toh
K, Miller RD, Pollock DD, Samollow PB, Springer MS, and Wilson RK
Metatherian (marsupial) mammals are phylogenetically
distinct from current mammalian biomedical models, all of which
are eutherian (placental) species. However, marsupials
and eutherians are more closely related to one another than to
any other vertebrate model species (i.e., birds, amphibians, fishes).
Fossil evidence establishes a minimum date of 125 million years
(MY) for the separation of eutherian and metatherian mammals (JI
et al. 2002), while analyses of nuclear gene sequences suggest
that metatherian / eutherian divergence may be as old as 173-190
MY (KUMAR and HEDGES 1998; WOODBURNE et al. 2003). To place this
in context, the evolutionary gulf between mammals and the next
most closely related group of non-mammalian research models, i.e.,
birds (chicken), is approximately 300-350 MY. Thus, the
marsupial-eutherian relationship represents a unique midpoint
in age relative to existing mammalian and non-mammalian vertebrate
models. As a legacy of their common ancestry, marsupials and eutherians
share basic genetic mechanisms and molecular processes that represent
fundamental (ancient) mammalian characteristics. Nevertheless,
since their divergence, eutherian and marsupial mammals have evolved
many distinctive morphologic, physiologic, and genetic variations
on these elemental mammalian designs. These phylogenetically restricted
differences can be used as comparative tools for examining the
underlying molecular and genetic processes that are common to
all mammalian species, and thereby help to reveal how variations
in these mechanisms lead to differences in gene regulation, expression,
and function. As the closest sister group to eutherian mammals,
marsupials are also the most appropriate outgroup
for assessing the relative antiquity or novelty of the molecular
and genetic changes that have occurred among the many eutherian
species (including ourselves) presently used in biomedical and
evolutionary research.
24: Journal of Molecular Evolution 2003; 56(4): 375-376
The Zuckerkandl Prize:
Structure and Evolution
Pollock DD
Guest Editorial: The Zuckerkandl Prize, established by Springer-Verlag
in 2002 to honor Emile Zuckerkandl and his contributions to molecular
evolution, goes this year to Gustavo Caetano-Anollés for
his paper on "Evolved RNA Secondary Structure and the rooting
of the Universal Tree of Life" (Caetano-Anollés 2002).
The editors of the Journal of Molecular Evolution have judged
this to be the best paper in the journal last year due to its
creative use of structure, and the evolution of structure, to
reconstruct deep phylogenies.
Is sparse taxon sampling
a problem for phylogenetic inference?
Hillis DM, Pollock DD, McGuire JA, and Zwickl DJ
No abstract: ...There is no simple answer to the question posed
in the heading of this section; the answer will depend on the
particular situation being examined (the scope of the problem,
the number of taxa already sequenced, the number of characters
already collected, and the quantity and the availability of
additional relevant taxa to include). We disagree with the assertion
of Rosenberg and Kumar (2002) that more characters per taxon
is necessarily a better strategy than more taxa for the same
characters. Rosenberg and Kumar (2002) put their argument in
terms of the current genome sequencing studies, in which many
genes (or complete genomes) are examined from very few taxa.
Rosenberg and Kumar (2002) argued that their conclusions "mesh
well" with this scattered genome approach. In contrast,
we propose that this approach will likely result in poorly estimated
evolutionary models, poorly estimated evolutionary trees, and
a poor overall view of evolutionary history. If one is interested
in inferring the evolutionary history of life, a much broader
sample of taxa (perhaps sequenced for far less than full genomes)
will result in a much more accurate estimate of phylogeny than
will complete genomes of only a small sample of taxa.
Increased taxon sampling
is advantageous for phylogenetic inference
Pollock DD, Zwickl DJ, McGuire JA, and Hillis DM
Until recently, it was believed that complex phylogenies might
be extremely difficult to reconstruct due to the phenomenal rate
of increase in the number of possible phylogenies as the number
of taxa increases. However, Hillis (1996) showed through simulation
that, for at least one complex phylogeny of angiosperms with 228
taxa, reconstruction was far more accurate than expected, even
with relatively modest amounts of DNA sequence data. This led
to a flurry of papers on the subject of taxon sampling and phylogenetic
reconstruction, with focus quickly shifting from the question
of whether complex phylogenies can be reconstructed to whether
and how much an existing phylogeny can be improved through increased
taxon sampling (Hillis, 1998; Kim, 1998; Poe, 1998; Poe and Swofford,
1999; Pollock and Bruno, 2000; Rannala et al., 1998; Yang, 1998).
Although a statistician might intuitively believe that it is generally
better (or at least no worse) to increase the amount of data to
resolve a question in statistical inference, the benefits of taxon
addition for phylogenetic inference remain controversial. ...A
recent paper on the subject of taxon addition (Rosenberg and Kumar,
2001) concludes that increased taxon sampling is of little benefit
to phylogenetic inference when compared to increasing sequence
length. We disagree with their interpretation and believe that
their data support the importance of increased taxon sampling.
In addition, some of their data were simulated under extreme conditions
(i.e., substitution rates that were very high or low, or sequences
that were unreasonably short). Large error values and non-linear
relationships at these extremes make it difficult to interpret
effects for the majority of the range, and averaging across the
entire range is inappropriate. Moreover, we do not believe that
Rosenberg and Kumar (2001) used the most appropriate metric to
measure the relative effect of taxon addition. Our reanalysis
of their simulated data indicates that increased taxon sampling
is highly beneficial for phylogenetic inference.
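The rate of increase in the number of possible phylogenies mentioned at the start of this abstract can be made concrete with a small sketch. It uses the standard result that n taxa admit (2n - 5)!! distinct unrooted bifurcating topologies, that is, the product 3 x 5 x ... x (2n - 5). The function name is hypothetical.

```python
def num_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating tree topologies for n_taxa >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 228):
        digits = len(str(num_unrooted_trees(n)))
        print(f"{n} taxa: a {digits}-digit number of topologies")
```

For the 228-taxon angiosperm phylogeny cited above, the count runs to hundreds of digits, which is why accurate reconstruction from modest amounts of sequence was unexpected.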
Genomic biodiversity,
phylogenetics, and coevolution in proteins
Pollock DD.
Comprehensive sampling of genomic biodiversity is fast becoming
a reality for some genomic regions and complete organelle genomes.
Genomic biodiversity is defined as large genomic sequences from
many species, and here some recent work is reviewed that demonstrates
the potential benefits of genomic biodiversity for molecular evolutionary
analysis and phylogenetic reconstruction. This work shows that,
using likelihood-based approaches, taxon addition can dramatically
improve phylogenetic reconstruction. Features, or dynamics, of
the evolutionary process are much more easily inferred with large
numbers of taxa, and large numbers are essential for discriminating
differences in evolutionary patterns between sites. Accurate prediction
of site-specific patterns can improve phylogenetic reconstruction
by an amount equivalent to quadrupling sequence length. Genomic
biodiversity is particularly central to research relating patterns
of evolution, adaptation, and coevolution to structural and functional
features of proteins. Research on detecting coevolution between
amino acid residues in proteins is reviewed that demonstrates
a clear need for much greater numbers of closely related taxa
to better discriminate site-specific patterns of interaction,
and to allow more detailed analysis of coevolutionary interactions
between subunits in protein complexes. It is argued that parsing
out coevolutionary and other context-dependent substitution probabilities
is essential for discriminating between coevolution and adaptation,
and for more realistically modeling the evolution of proteins.
Research is also reviewed that argues for increasing the efficiency
of acquiring genomic biodiversity, and suggests that this might
be done by simultaneously shotgun cloning and sequencing genomic
mixtures from many species. Increased efficiency is a prerequisite
if genomic biodiversity levels are to rapidly increase by orders
of magnitude, and thus lead to dramatically improved understanding
of interactions between protein structure, function, and sequence
evolution.
All of biology is based on evolution. Evolution is the organizing
principle for understanding the shared history of all biological
organisms. Evolution describes the similarities between different
organisms, as well as explaining how differences emerged. In addition
to answering basic questions about the history of life, evolutionary
perspectives and information drawn from evolutionary analyses
can provide information highly relevant to many biological, biotechnological,
and biomedical problems. There is also growing interest in mimicking
evolution in the test tube in order to develop RNA, proteins,
and organisms with specified properties.
We study the evolution of protein functionality using a two-dimensional
lattice model. The characteristics particular to evolution, such
as population dynamics and early evolutionary trajectories, have
a large effect on the distribution of observed structures. Only
subtle differences are observed between the distribution of structures
evolved for function and those evolved for their ability to form
compact structures.
Structures, phylogenies, and genomes: The integrated
study of protein evolution
Goldstein RA, Pollock DD, and Thorne JL
For the past decades, evolutionary biologists have tried to reconstruct
evolutionary histories, to piece together phylogenetic trees,
and to understand the network of hereditary relationships. Such
approaches (whether it is admitted or not) are based on models
of the evolutionary process. These tasks would be easier if reality
would better match the simplest models. Unfortunately for these
scientists, evolution takes place in a complicated web of constraints,
with changes in the DNA sometimes but not always translating to
changes in amino acids which may or may not result in significant
changes in the properties of these expressed proteins. All of
this occurs in a complicated and interconnected fitness landscape,
where different locations in the protein may be under radically
different selective pressure. This situation has led a number
of investigators to bring more of the biological and biochemical
complexity into these evolutionary models, to develop approaches
with a closer fidelity to biological reality with the hope that
more accurate pictures of biological history will result.
Assessing an unknown
evolutionary process: effect of increasing site-specific knowledge
through taxon addition
Pollock DD, Bruno WJ.
Assessment of the evolutionary process is crucial for understanding
the effect of protein structure and function on sequence evolution
and for many other analyses in molecular evolution. Here, we used
simulations to study how taxon sampling affects accuracy of parameter
estimation and topological inference in the absence of branch
length asymmetry. With maximum-likelihood analysis, we find that
adding taxa dramatically improves both support for the evolutionary
model and accurate assessment of its parameters when compared
with increasing the sequence length. Using a method we call "doppelganger
trees," we distinguish the contributions of two sources of
improved topological inference: greater knowledge about internal
nodes and greater knowledge of site-specific rate parameters.
Surprisingly, highly significant support for the correct general
model does not lead directly to improved topological inference.
Instead, substantial improvement occurs only with accurate assessment
of the evolutionary process at individual sites. Although these
results are based on a simplified model of the evolutionary process,
they indicate that in general, assuming processes are not independent
and identically distributed among sites, more extensive sampling
of taxonomic biodiversity will greatly improve analytical results
in many current sequence data sets with moderate sequence lengths.
A case for evolutionary
genomics and the comprehensive examination of sequence biodiversity
Pollock DD, Eisen JA, Doggett NA, Cummings MP.
Comparative analysis is one of the most powerful methods available
for understanding the diverse and complex systems found in biology,
but it is often limited by a lack of comprehensive taxonomic sampling.
Despite the recent development of powerful genome technologies
capable of producing sequence data in large quantities (witness
the recently completed first draft of the human genome), there
has been relatively little change in how evolutionary studies
are conducted. The application of genomic methods to evolutionary
biology is a challenge, in part because gene segments from different
organisms are manipulated separately, requiring individual purification,
cloning, and sequencing. We suggest that a feasible approach to
collecting genome-scale data sets for evolutionary biology (i.e.,
evolutionary genomics) may consist of combining DNA samples
prior to cloning and sequencing, followed by computational reconstruction
of the original sequences. This approach will allow the full benefit
of automated protocols developed by genome projects to be realized;
taxon sampling levels can easily increase to thousands for targeted
genomes and genomic regions. Sequence diversity at this level
will dramatically improve the quality and accuracy of phylogenetic
inference, as well as the accuracy and resolution of comparative
evolutionary studies. In particular, it will be possible to make
accurate estimates of normal evolution in the context of constant
structural and functional constraints (i.e., site-specific substitution
probabilities), along with accurate estimates of changes in evolutionary
patterns, including pairwise coevolution between sites, adaptive
bursts, and changes in selective constraints. These estimates
can then be used to understand and predict the effects of protein
structure and function on sequence evolution and to predict unknown
details of protein structure, function, and functional divergence.
In order to demonstrate the practicality of these ideas and the
potential benefit for functional genomic analysis, we describe
a pilot project we are conducting to simultaneously sequence large
numbers of vertebrate mitochondrial genomes.
The genomic data available to computational biologists represent
the product of the complex processes of evolution. In particular,
the forces of mutation, duplication, and selection have acted
to sculpt modern protein sequence and structure in the context
of changing functional requirements. Just as crystallographers
are able to determine protein structures through an analysis of
X-ray diffraction patterns, scientists are learning to read the
evolutionary history of proteins in order to infer and explain
both structure and function. This pursuit depends on the development
of new computational approaches in order to make optimal use of
genomic data, and requires interaction with experiment for comparison
and verification of computational results.
Coevolving protein
residues: maximum likelihood identification and relationship to
structure
Pollock DD, Taylor WR, and Goldman N
The identification of protein sites undergoing correlated evolution
(coevolution) is of great interest due to the possibility that
these pairs will tend to be adjacent in the three-dimensional
structure. Identification of such pairs should provide useful
information for understanding the evolutionary process, predicting
the effects of site-directed substitution, and potentially for
predicting protein structure. Here, we develop and apply a maximum
likelihood method with the aim of improving detection of coevolution.
Unlike previous methods which have had limited success, this method
allows for correlations induced by phylogenetic relationships
and for variation in rate of evolution along branches, and does
not rely on accurate reconstruction of ancestral nodes. In order
to reduce the complexity of coevolutionary relationships and identify
the primary component of pairwise coevolution between two sites,
we reduce the data to a two-state system at each site, regardless
of the actual number of residues observed at that site. Simulations
show that this strategy is good at identifying simple correlations
and at recognizing cases in which the data are insufficient to
distinguish between coevolution and spurious correlations. The
new method was tested by using size and charge characteristics
to group the residues at each site, and then evaluating coevolution
in myoglobin sequences. Grouping based on physicochemical characteristics
allows categorization of coevolving sites into positive and negative
coevolution, depending on the correlation between equilibrium
state frequencies. We detected a striking excess of negative coevolution
(corresponding to charge) at sites brought into proximity by the
periodicity of the alpha-helix, and there was also a tendency
for sites with significant likelihood ratios to be close in the
three-dimensional structure. Sites on the surface of the protein
appear to coevolve both when they are close in the structure,
and when they are distant, implying a role for folding and/or
avoidance of quaternary structure in the coevolution process.
Copyright 1998 Academic Press.
Increased accuracy in
analytical molecular distance estimation
Pollock DD
Analytical molecular distance estimates can be inaccurate and
biased estimates of the total number of substitutions not only
when the model of evolution they are based on is incorrect, but
also when the method of estimating the total is too simple. This
comes about because when there are different types of substitutions
occurring simultaneously, it can become extremely difficult to
estimate the number of the more quickly evolving type, and the
variance of this larger number can overwhelm the total estimate.
In this paper, in an extension of earlier work with a simple two-parameter
model of evolution, more accurate analytical distances are derived
for models appropriate to a variety of known DNA types using generalized
least squares principles of noise reduction. It is shown that
the new estimates can be applied to achieve more accurate results
for site-to-site rate variation, regions with biased nucleotide
frequencies, and synonymous sites in protein-coding regions. This
study also includes a methodology to obtain accurate distance
estimates for large numbers of sequence regions evolving in different
manners. Copyright 1998 Academic Press.
Microsatellite behavior with range
constraints: parameter estimation and improved distances for use
in phylogenetic reconstruction
Pollock DD, Bergman A, Feldman MW, Goldstein DB
A symmetric stepwise mutation model with reflecting boundaries
is employed to evaluate microsatellite evolution under range constraints.
Methods of estimating range constraints and mutation rates under
the assumptions of the model are developed. Least squares procedures
are employed to improve molecular distance estimation for use
in phylogenetic reconstruction in the case where range constraints
and mutation rates vary across loci. The bias and accuracy of
these methods are evaluated using computer simulations, and they
are compared to previously existing methods which do not assume
range constraints. Range constraints are seen to have a substantial
impact on phylogenetic conclusions based on molecular distances,
particularly for more divergent taxa. Results indicate that if
range constraints are in effect, the methods developed here should
be used in both the preliminary planning and final analysis of
phylogenetic studies employing microsatellites. It is also seen
that in order to make accurate phylogenetic inferences under range
constraints, a larger number of loci are required than in their
absence.
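A symmetric stepwise mutation model with reflecting boundaries can be sketched as a bounded random walk on repeat number. The following is a minimal simulation under stated assumptions, not the authors' implementation: each mutation adds or removes one repeat with equal probability, and a step that would leave the allowed range is bounced back inside it. The function names, bounds, and seed are hypothetical.

```python
import random


def step_with_reflection(repeats, lower, upper, rng):
    """Apply one symmetric +/-1 mutation; a step past a bound is reflected."""
    proposal = repeats + rng.choice((-1, 1))
    if proposal < lower:
        return lower + 1  # assumed reflection rule at the lower bound
    if proposal > upper:
        return upper - 1  # assumed reflection rule at the upper bound
    return proposal


def simulate_locus(start, lower, upper, n_mutations, seed=0):
    """Return the repeat count after each of n_mutations mutations."""
    if not lower < upper or not lower <= start <= upper:
        raise ValueError("need lower < upper and start within [lower, upper]")
    rng = random.Random(seed)
    repeats = start
    history = [repeats]
    for _ in range(n_mutations):
        repeats = step_with_reflection(repeats, lower, upper, rng)
        history.append(repeats)
    return history


history = simulate_locus(start=10, lower=5, upper=15, n_mutations=1000)
print(min(history), max(history))
```

Because the walk cannot leave the range, the difference in repeat number between two lineages saturates over time, which is why distances that assume an unbounded range become unreliable for more divergent taxa.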
Molecular phylogeny for Colias
butterflies and their relatives (Lepidoptera: Pieridae)
Pollock DD, Watt WB, Rashbrook VK, Iyengar EV
The sulfur butterflies, Colias spp., and their relatives in the
family Pieridae have been the subjects of diverse behavioral,
ecological, and evolutionary studies. However, their phylogeny
is uncertain in many respects. We used DNA sequences from 2 mitochondrial
gene blocks, 333 bp of the cytochrome oxidase I subunit (CO I)
and 1,261 bp from the 2 ribosomal genes and the tRNA between them
(rDNA), as character sources to test existing phylogenetic hypotheses
and begin to infer others. The rDNA block resolves better at deeper
nodes of the phylogeny, and the CO I block at shallower nodes.
Our results support sister status for subfamilies Coliadinae and
Pierinae within Pieridae; independent tribal status for Euchloini
and Pierini within Pierinae; status as sister genera for Colias
and Zerene within Coliadinae; and monophyly within subgenus C.
(Euoolias) of all North American Colias studied. Our results suggest
that the Neotropical coliad genus Eurema may warrant splitting,
as some early workers proposed, but do not support the recently
proposed splitting of Eurasian C. erate from subgenus C. (Eriocolias)
into the separate subgenus C. (Neocolias).
Effectiveness of correlation
analysis in identifying protein residues undergoing correlated
evolution
Pollock DD, Taylor WR.
Various methods for detecting correlation between sites were evaluated
by ascertaining their ability to discriminate positively correlated
sites from background correlation at randomly evolved sites. A
model for generating pairwise correlations of different degrees
is also described. An assortment of physicochemical vectors and
similarity and difference matrices were used to discriminate correlated
change. There was little difference in effectiveness between the
different matrices, but there were significant differences between
the matrices and the physicochemical vectors. It is shown that
all methods investigated exhibit significant inability to screen
out background correlation, particularly in the presence of phylogenetic
relatedness between the sequences. Methods using the matrices
are unable to distinguish positively correlated from negatively
correlated, or compensatory, replacements.
Microsatellite genetic distances
with range constraints: analytic description
and problems of estimation
Feldman MW, Bergman A, Pollock DD, Goldstein DB.
Statistical properties of the symmetric stepwise-mutation model
for microsatellite evolution are studied under the assumption
that the number of repeats is strictly bounded above and below.
An exact analytic expression is found for the expected products
of the frequencies of alleles separated by k repeats. This permits
characterization of the asymptotic behavior of our distances D1
and (delta mu)2 under range constraints. Based on this characterization
we develop transformations that partially restore linearity when
allele size is restricted. We show that the appropriate transformation
cannot be applied in the case of varying mutation rates (beta)
and range constraints (R) because of statistical difficulties.
In the special case of no variation in beta and R across loci,
however, the transformation simplifies to a usable form and results
in a distance much more linear with time than distances developed
for an infinite range. Although analytically incorrect in the
case of variation in beta and R, the simpler transformation is
surprisingly insensitive to variation in these parameters, suggesting
that it may have considerable utility in phylogenetic studies.
A comparison of two
methods for constructing evolutionary distances from a weighted
contribution of transition and transversion differences
Pollock DD, Goldstein DB.
Since the initial work of Jukes and Cantor (1969), a number of
procedures have been developed to estimate the expected number
of nucleotide substitutions corresponding to a given observed
level of nucleotide differentiation assuming particular evolutionary
models. Unlike the proportion of different sites, the expected
number of substitutions that would have occurred grows linearly
with time and therefore has had great appeal as an evolutionary
distance. Recently, however, a number of authors have tried to
develop improved statistical approaches for generating and evaluating
evolutionary distances (Schoniger and von Haeseler 1993; Goldstein
and Pollock 1994; Tajima and Takezaki 1994). These studies clearly
show that the estimated number of nucleotide substitutions is
generally not the best estimator for use in reconstruction of
phylogenetic relationships. The reason for this is that there
is often a large error associated with the estimation of this
number. Therefore, even though its expectation is correct (i.e.,
on average the expected number of substitutions is proportional
to time--but see Tajima 1993), it is not expected to be as useful
as estimators designed to have a lower variance.
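The variance problem can be made concrete with the Jukes-Cantor correction itself. The sketch below uses the standard one-parameter formula d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites, together with its usual large-sample variance approximation p(1 - p) / (n (1 - 4p/3)^2) for n sites. The function names and example values are illustrative assumptions.

```python
import math


def jc69_distance(p):
    """Expected substitutions per site under Jukes-Cantor, from proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def jc69_variance(p, n_sites):
    """Large-sample variance approximation of the Jukes-Cantor distance."""
    return p * (1 - p) / (n_sites * (1 - 4 * p / 3) ** 2)


# Illustrative values: the variance grows rapidly as p approaches 0.75.
for p in (0.1, 0.3, 0.5, 0.7):
    print(p, round(jc69_distance(p), 4), round(jc69_variance(p, 1000), 6))
```

The corrected distance is linear with time in expectation, but its variance increases sharply as sequences approach saturation, which is the trade-off the abstract describes.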
Evolutionary relations among vertebrate muscle-type
lactate dehydrogenases
Quattro JM, Pollock DD, Powell M, Woods HA, Powers DA.
Gene duplication has produced two lactate dehydrogenase (LDH)
isozymes, LDH-A and LDH-B, that are found in essentially all vertebrates.
On the basis of the biochemical properties of the LDH-A and LDH-B
isozymes, it has been suggested that each locus is orthologous
among all vertebrates. However, phylogenetic studies have not
supported a common evolutionary history among the LDH-A isozymes,
particularly when those from lower vertebrates are examined. We
present here the sequence of a muscle-type LDH from Fundulus heteroclitus,
a teleost fish for which the LDH-B sequence has been determined
and shown to be unrelated phylogenetically to tetrapod LDH-A isozymes.
Although the sequence of the teleost muscle LDH shares certain
features with the LDH-A of tetrapods, phylogenetic analyses do
not support an orthologous relation among the LDH-A isozymes of
teleost fish and tetrapod vertebrates.
Least squares estimation
of molecular distance--noise abatement in phylogenetic reconstruction
Goldstein DB, Pollock DD.
Zuckerkandl and Pauling (1962, "Horizons in Biochemistry,"
pp. 189-225, Academic Press, New York) first noticed that the
degree of sequence similarity between the proteins of different
species could be used to estimate their phylogenetic relationship.
Since then models have been developed to improve the accuracy
of phylogenetic inferences based on amino acid or DNA sequences.
Most of these models were designed to yield distance measures
that are linear with time, on average. The reliability of phylogenetic
reconstruction, however, depends on the variance of the distance
measure in addition to its expectation. In this paper we show
how the method of generalized least squares can be used to combine
data types, each most informative at different points in time,
into a single distance measure. This measure reconstructs phylogenies
more accurately than existing non-likelihood distance measures.
We illustrate the approach for a two-rate mutation model and demonstrate
that its application provides more accurate phylogenetic reconstruction
than do currently available analytical distance measures.
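The idea of combining data types by generalized least squares can be sketched for the simplest case: two unbiased estimates of the same distance with known variances and covariance. The minimum-variance linear combination weights each estimate by how much noise the other carries. This is a generic illustration under those assumptions, not the estimator derived in the paper; the function name and input values are hypothetical.

```python
def combine_two_estimates(d1, d2, var1, var2, cov=0.0):
    """Minimum-variance unbiased linear combination of two estimates.

    Returns (combined_estimate, combined_variance). With cov == 0 this
    reduces to ordinary inverse-variance weighting.
    """
    denom = var1 + var2 - 2 * cov
    if denom <= 0:
        raise ValueError("var1 + var2 - 2*cov must be positive")
    w1 = (var2 - cov) / denom
    w2 = (var1 - cov) / denom  # w1 + w2 == 1, so the result stays unbiased
    combined = w1 * d1 + w2 * d2
    combined_var = (var1 * var2 - cov ** 2) / denom
    return combined, combined_var


# Made-up example: a precise estimate and a noisier one of the same distance.
estimate, variance = combine_two_estimates(d1=1.0, d2=3.0, var1=1.0, var2=3.0)
print(estimate, variance)
```

The combined variance is never larger than the smaller of the two input variances, so the noisier data type can only help once it is weighted appropriately.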
3: Cytog. Cell. Genet 1991; 58(1-4): 1930
Chromosomal localization of the calbindin gene
Modi WS, Dean M, Pollock DD, Seuanez HN, and Christakos S
2: Cytog. Cell. Genet 1991; 58(1-4): 1870
Regional localization of the human glutaminase
(GLS) and interleukin-9 (IL9) genes by in situ
hybridization
Modi WS, Pollock DD, Mock BA, Banner C, Renauld JC, and Van Snick J
Phosphate-activated glutaminase is found in the mammalian small
intestine, brain, and kidney, but not in liver. The enzyme initiates
the catabolism of glutamine as the principal respiratory fuel
in the small intestine, may synthesize the neurotransmitter glutamate
in the brain, and functions in the kidney to help maintain systemic
pH homeostasis. Interleukin-9 (IL9) is a relatively new cytokine
that supports the growth of the helper T-cell clones, mast cells,
and megakaryoblastic leukemia cells. cDNA clones have recently
been obtained for each of these genes. The human loci for phosphate-activated
glutaminase (GLS) and IL9 have previously been mapped to chromosomes
2 and 5, respectively, by analysis of somatic cell hybrid DNAs.
By using chromosomal in situ hybridization, we have regionally
mapped GLS to 2q32→q34 and IL9 to 5q31→q35.
Manuscripts in revision or review
54: in review
Adaptive evolution and functional redesign of core metabolic proteins in snakes
Castoe TA, Jiang ZJ, Gu W, Wang ZO, and Pollock DD
Abstract pending.
55: in revision
Dissection of the human genome using repeat
probability clouds
Gu W, Castoe TA, Hedges D, Batzer MA, and Pollock DD
Abstract pending.
56: in revision
Reconciling the biochemical and genetic data
on replication of mitochondrial genomes