
Friday 4 January 2019

Phylogenetic

Molecular Phylogenetics: An introduction to computational methods and tools for analyzing evolutionary relationships

Karen Dowell
Math 500, Fall 2008

Abstract

Molecular phylogenetics applies a combination of molecular and statistical techniques to infer evolutionary relationships among organisms or genes. This review paper provides a general introduction to phylogenetics and phylogenetic trees, describes some of the most common computational methods used to infer phylogenetic information from molecular data, and provides an overview of some of the many different online tools available for phylogenetic analysis. In addition, several phylogenetic case studies are summarized to illustrate how researchers in various biological disciplines are applying molecular phylogenetics in their work.

Introduction to Molecular Phylogenetics

The similarity of biological functions and molecular mechanisms in living organisms strongly suggests that species descended from a common ancestor. Molecular phylogenetics uses the structure and function of molecules, and how they change over time, to infer these evolutionary relationships. This branch of study emerged in the early 20th century but didn't begin in earnest until the 1960s, with the advent of protein sequencing, PCR, electrophoresis, and other molecular biology techniques. Over the past 30 years, as computers have become more powerful and more generally accessible, and computer algorithms more sophisticated, researchers have been able to tackle more effectively the immensely complicated stochastic and probabilistic problems that define evolution at the molecular level.
Within the past decade, this field has been further reenergized and redefined as whole genome sequencing for complex organisms has become faster and less expensive. As mounds of genomic data become publicly available, molecular phylogenetics is continuing to grow and find new applications. [4, 10, 17, 20, 22]

The primary objective of molecular phylogenetic studies is to recover the order of evolutionary events and represent them in evolutionary trees that graphically depict relationships among species or genes over time. This is an extremely complex process, further complicated by the fact that there is no one right way to approach all phylogenetic problems. Phylogenetic datasets can consist of hundreds of different species, each of which may have varying mutation rates and patterns that influence evolutionary change. Consequently, there are numerous different evolutionary models and stochastic methods available. The optimal methods for a phylogenetic analysis depend on the nature of the study and the data used. [5, 19, 20]

Molecular Evolution: Beyond Darwin

Evolution is a process by which the traits of a population change from one generation to another. In On the Origin of Species by Means of Natural Selection, Darwin concluded that, given overwhelming evidence from his extensive comparative analysis of living specimens and fossils, all living organisms descended from a common ancestor. The book's only illustration (see Figure 1) is a tree-like structure that suggests how slow and successive modifications could lead to the extreme variations seen in species today. [11, 27]

Figure 1. Evolution Defined Graphically.
The sole illustration in Darwin's Origin of Species uses a tree-like structure to describe evolution. This drawing shows ancestors at the limbs and branches of the tree, more recent ancestors at its twigs, and contemporary organisms at its buds. [34]

Darwin's theory of evolution is based on three underlying principles: variation in traits exists among individuals within a population, these variations can be passed from one generation to the next via inheritance, and some forms of inherited traits provide individuals a higher chance of survival and reproduction than others. [11]

Although Darwin developed his theory of evolution without any knowledge of the molecular basis of life, it has since been determined that evolution is actually a molecular process based on genetic information, encoded in DNA, RNA, and proteins. At a molecular level, evolution is driven by the same types of mechanisms Darwin observed at the species level. One molecule undergoes diversification into many variations. One or more of those variants can be selected to be reproduced or amplified throughout a population over many generations. Such variations at the molecular level can be caused by mutations, such as deletions, insertions, inversions, or substitutions at the nucleotide level, which in turn affect protein structure and biological function. [11, 22]

What is a Phylogeny?

According to modern evolutionary theory, all organisms on earth have descended from a common ancestor, which means that any set of species, extant or extinct, is related. This relationship is called a phylogeny, and is represented by phylogenetic trees, which graphically represent the evolutionary history related to the species of interest (see Figure 2).
Phylogenetics infers trees from observations about existing organisms using morphological, physiological, and molecular characteristics.

Figure 2. Phylogeny of Mammalia. This phylogenetic tree shows the evolutionary relationships among six orders of mammal species (taxa). Taxa listed in grey are extinct.

The tree of life represents a phylogeny of all organisms, living and extinct. Other, more specialized species and molecular phylogenies are used to support comparative studies, test biogeographic hypotheses, evaluate the mode and timing of speciation, infer the amino acid sequences of extinct proteins, track the evolution of diseases, and even provide evidence in criminal cases. [19]

Understanding Phylogenetic Trees

Before exploring statistical and bioinformatic methods for estimating phylogenetic trees from molecular data, it's important to have a basic familiarity with the terms and elements common to these types of trees. (See Figure 3.)

Figure 3. Basic elements of a phylogenetic tree.

Phylogenetic trees are composed of branches, also known as edges, that connect and terminate at nodes. Branches and nodes can be internal or external (terminal). The terminal nodes at the tips of trees represent operational taxonomic units (OTUs). OTUs correspond to the molecular sequences or taxa (species) from which the tree was inferred. Internal nodes represent the last common ancestor (LCA) of all nodes that arise from that point. Trees can be made of a single gene from many taxa (a species tree) or multi-gene families (gene trees). [1, 10]

A tree is considered to be rooted if there is a particular node or outgroup (an external point of reference) from which all OTUs in the tree arise. The root is the oldest point in the tree and the common ancestor of all taxa in the analysis.
In the absence of a known outgroup, the root can be placed in the middle of the tree or an unrooted tree may be generated. Branches of a tree can be grouped together in different ways. (See Figure 4.)

Figure 4. Groups and associations of taxonomic units in trees.

A monophyletic group consists of an internal LCA node and all OTUs arising from it. All members within the group are derived from a common ancestor and have inherited a set of unique common traits. A paraphyletic group excludes some of its descendants (for example, all mammals except the Marsupialia taxa). A polyphyletic group can be a collection of distantly related OTUs that are associated by a similar characteristic or phenotype, but are not directly descended from a common ancestor. [1, 17]

Trees and Homology

Evolution is shaped by homology, which refers to any similarity due to common ancestry. Similarly, phylogenetic trees are defined by homologous relationships. Paralogs are homologous sequences separated by a gene duplication event. Orthologs are homologous sequences separated by a speciation event (when one species diverges into two). Homologs can be either paralogs or orthologs. [1, 11, 22] Molecular phylogenetic trees are drawn so that branch length corresponds to the amount of evolution (the percent difference in molecular sequences) between nodes. [1, 19]

Figure 5. Understanding paralogs and orthologs.

Paralogs are created by gene duplication events. (See Figure 5.) Once a gene has been duplicated, all subsequent species in the phylogeny will inherit both copies of the gene, creating orthologs. Interestingly, evolutionary divergence of different species may result in many variations of a protein, all with similar structures and functions, but with very different amino acid sequences.
Phylogenetic studies can trace the origin of such proteins to an ancestral protein family or gene. [1, 22]

Figure 6. Mirror Phylogenies. Gene A and Gene A1 are paralogs, whereas all instances of Gene A are orthologs of each other in different canine species.

One way to ensure that paralogs and orthologs are appropriately referenced in a phylogenetic tree, and to guard against misrepresentation due to missing or incomplete taxonomic information, is to generate mirror phylogenies (see Figure 6) in which paralogs serve as each other's outgroup. [1, 4, 19, 22]

Estimating Molecular Phylogenetic Trees

Molecular phylogenetic trees are generated from character datasets that provide evolutionary content and context. Character data may consist of biomolecular sequence alignments of DNA, RNA, or amino acids; molecular markers, such as single nucleotide polymorphisms (SNPs) or restriction fragment length polymorphisms (RFLPs); morphology data; or information on gene order and content. Evolution is modeled as a process that changes the state of a character, such as the type of nucleotide (A, G, T, or C) at a specific location in a DNA sequence; each character is a function that maps a set of taxa to distinct states. [1, 19] Note that most of the examples in this paper use DNA sequences as character data, but trees can be accurately estimated from many different types of molecular data.

Figure 7. Evolution of a DNA Sequence

Figure 7 illustrates how a molecular sequence might evolve over time as a result of multiple mutations that produce small, but evolutionarily important, changes in a nucleotide sequence. At the protein level, these changes may not initially affect protein structure or function, but over time, they may eventually shape a new purpose for a protein within divergent species.
[10, 19, 22] OTUs can be used to build an unrooted phylogenetic tree that clearly depicts a path of evolutionary change.

Steps in Phylogenetic Analysis

Although the nature and scope of phylogenetic studies may vary significantly and require different datasets and computational methods, the basic steps in any phylogenetic analysis remain the same: assemble and align a dataset, build (estimate) phylogenetic trees from sequences using computational methods and stochastic models, and statistically test and assess the estimated trees. [4, 19, 20]

Assemble and Align Datasets

The first step is to identify a protein or DNA sequence of interest and assemble a dataset consisting of other related sequences. For example, to explore relationships among different members of the Notch family of proteins, one might select DNA sequences for Notch1 through Notch4 in different species, such as human, dog, rat, and mouse, then perform a multiple sequence alignment to identify homologies. [1, 10, 13, 19, 20]

There are a number of free, online tools available to simplify and streamline this process. DNA sequences of interest can be retrieved using NCBI BLAST or similar search tools. When evaluating a set of related sequences retrieved in a BLAST search, pay close attention to the score and E-value. A high score indicates that the subject sequence retrieved is closely related to the sequence used to initiate the query. The smaller the E-value, the higher the probability that the homology reflects a true evolutionary relationship, as opposed to sequence similarity due to chance. As a general rule, sequences with E-values less than 10^-5 are treated as homologs of a query sequence. [10]

Once sequences are selected and retrieved, a multiple sequence alignment is created. This involves arranging a set of sequences in a matrix to identify regions of homology.
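To make the idea of arranging sequences against each other concrete, the following is a minimal sketch of a pairwise global alignment using the Needleman-Wunsch dynamic programming scheme. It is only an illustration: the tools named in this paper align many sequences at once and use more elaborate scoring, and the function name and the match, mismatch, and gap values here are assumptions chosen for readability. Gaps are written as "-".

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b.

    Returns (aligned_a, aligned_b, score). The scoring values are
    illustrative assumptions, not taken from any particular tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


if __name__ == "__main__":
    top, bottom, total = needleman_wunsch("GATTACA", "GATCA")
    print(top)
    print(bottom)
    print("score:", total)
```

Several alignments can share the best score, so the exact placement of the gaps depends on the tie-breaking order in the traceback; only the score is unique.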
Typically, gaps (one or more spaces in the alignment) are introduced in one or more sequences to represent insertions or deletions in the molecular code that may have occurred over time. Effective multiple sequence alignment hinges on gap analysis: determining where to insert gaps and how large to make them. There are many web sites and software programs, such as ClustalW, MSA, MAFFT, and T-Coffee, designed to perform multiple sequence alignment on a given set of molecular data. ClustalW is currently the most mature and most widely used. [1, 10, 19]

Building Phylogenetic Trees

To build phylogenetic trees, statistical methods are applied to determine the tree topology and calculate the branch lengths that best describe the phylogenetic relationships of the aligned sequences in a dataset. Many different methods for building trees exist, and no single method performs well for all types of trees and datasets. The most common computational methods applied include distance-matrix methods and discrete data methods, such as maximum parsimony and maximum likelihood. [4, 17, 20] There are several software packages, such as Paup*, PAML, and PHYLIP, that apply the most popular methods. [4]

Paup* is a commercially available program that implements a wide variety of methods for phylogenetic inference, including maximum likelihood analysis for DNA data using different models. Paup* also includes a set of exact and heuristic methods for searching for optimal trees.

PAML (Phylogenetic Analysis by Maximum Likelihood) is an open-access set of programs for phylogenetic analysis and evolutionary model comparison. PAML includes many advanced models: DNA- and AA-based models as well as codon-based models that can be used to detect positive selection. Many of the programs in PAML can model heterogeneity of evolutionary rates among sequence sites using gamma (Γ)
distributions, and evolutionary dynamics of different sequence regions (concatenated gene sequences).

PHYLIP is another large suite of open-access programs for phylogenetic inference that estimates trees using numerous methods, including pairwise distance, maximum parsimony, and maximum likelihood. The maximum likelihood programs can handle a few simple stochastic models and have good tree searching capabilities. PHYLIP is often considered good educational software for novice phylogeneticists.

Distance-Matrix Methods

Distance-matrix methods compute a matrix of pairwise distances between sequences that approximate evolutionary distance. Distance-based methods tend to run in polynomial time and are quite fast in practice. These methods use clustering techniques to compute evolutionary distances, such as the number of nucleotide or amino acid substitutions between sequences, for all pairs of taxa. They then produce phylogenetic trees using algorithms based on functional relationships among distance values. There are several different distance-matrix methods, including the Unweighted Pair-Group Method with Arithmetic Mean (UPGMA), which uses a sequential clustering algorithm; the Transformed Distance Method, which uses an outgroup as a reference, then applies UPGMA; the Neighbor-Relations Method, which applies the four-point condition to adjust the distance matrix, then applies UPGMA; and the Neighbor-Joining Method, which arranges OTUs in a star, then finds neighbors sequentially to minimize the total length of the tree. [4, 17] The following section on the UPGMA method provides a more detailed example of how distance-matrix methods work.

UPGMA Method

UPGMA produces rooted trees for which the edge lengths can be viewed as times measured by a molecular clock with a constant rate.
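As a concrete illustration of a distance-matrix method, the sketch below computes a simple pairwise distance (the fraction of differing sites, an assumption made here for brevity) and then clusters OTUs with UPGMA. The function names, the nested-tuple tree representation, and the five-OTU distances are hypothetical; the distances were chosen so that the merge order mirrors the five-sequence example used in this section. Only the root height is returned, so per-branch lengths are left out.

```python
from itertools import combinations


def p_distance(s1, s2):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    return sum(c1 != c2 for c1, c2 in zip(s1, s2)) / len(s1)


def upgma(names, dist):
    """Cluster OTUs with UPGMA.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (tree, root_height), where tree is a nested tuple of labels.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of OTUs in it
    d = dict(dist)
    height = 0.0
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2  # the new node sits at half the distance
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean, i.e. the average over all cross-cluster pairs.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    (tree,) = sizes
    return tree, height


if __name__ == "__main__":
    names = ["s1", "s2", "s3", "s4", "s5"]
    # Hypothetical distances: any pair not listed is 10 apart.
    close = {("s1", "s2"): 2, ("s4", "s5"): 4, ("s3", "s4"): 6, ("s3", "s5"): 6}
    dist = {frozenset(p): close.get(p, 10) for p in combinations(names, 2)}
    print(upgma(names, dist))
    # prints (('s1', 's2'), ('s3', ('s4', 's5'))) with root height 5.0
```

In practice the distance dictionary would be filled from an alignment, for example with `p_distance` or a model-corrected distance, rather than typed in by hand.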
UPGMA uses a sequential clustering algorithm to identify the two OTUs that are most similar (meaning they have the shortest evolutionary distance and are most similar in sequence) and treats them as a single new composite OTU. This process is repeated iteratively until only two OTUs remain. The algorithm defines the distance (d) between two clusters Ci and Cj as the average distance between pairs of sequences from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

where |Ci| and |Cj| are the numbers of sequences in clusters i and j.

This sequential clustering process is visually described in Figure 8. In this example, the two most homologous sequences are 1 and 2. They are clustered into a new composite parent node (6), and the branch lengths (t1 and t2) are defined as $\frac{1}{2} d_{1,2}$. The next step is to search for the closest pair among the remaining sequences and node 6. Pair 4 and 5 are identified and clustered into a new parent node (7), and the branch length for t4 and t5 is calculated. [4, 17]

Figure 8. Sequential clustering of sequences using the UPGMA method. [17]

In this iterative process, parent node 8 is created from pairs 7 and 3, and parent node 9 is created by clustering nodes 6 and 8. [4, 17] Thus, all sequences are clustered into a single evolutionary tree. The total time (t9) is calculated from the distance between clusters 6 and 8:

$$d_{6,8} = \frac{1}{6}\left(d_{1,3} + d_{1,4} + d_{1,5} + d_{2,3} + d_{2,4} + d_{2,5}\right)$$

Discrete Data Methods

Discrete data methods examine each column of a multiple sequence alignment dataset separately and search for the tree that best represents all this information. Although distance-based methods tend to be much faster than discrete data methods, they typically yield little information beyond the basic tree structure. Discrete data analyses, on the other hand, are information rich. These methods produce a separate tree for each column in the alignment, so it is possible to trace the evolution of specific elements within a given sequence, such as catalytic sites or regulatory regions.
[10, 17, 19, 20] Commonly used discrete data methods include maximum parsimony, which searches for the most parsimonious tree, the one that requires the fewest evolutionary changes to explain the differences observed; maximum likelihood, which requires a probabilistic model for the process of nucleotide substitution; and Bayesian MCMC, which also requires a stochastic model of evolution, but creates a probability distribution on a set of trees or aspects of evolutionary history. [17, 19, 20]

Discrete data methods are generally considered to produce the best estimates of evolutionary history. However, these methods can be computationally expensive, and it can take weeks or months to reach a reasonable level of accuracy for moderate to large datasets with 100 or more OTUs. [19]

Maximum Parsimony

Among the most widely used tree-estimation techniques, maximum parsimony applies a set of algorithms to search for the tree that requires the minimum number of evolutionary changes observed among the OTUs in the study. For example, Figure 9 lists four sample sequences from which phylogenetic trees could be inferred using maximum parsimony.

Site   Seq 1   Seq 2   Seq 3   Seq 4
1      A       A       A       A
2      A       G       G       G
3      G       C       A       A
4      A       C       T       G
5      G       G       A       A
6      T       T       T       T
7      G       G       C       C
8      C       C       C       C
9      A       G       A       G

Figure 9. Sample sequences for a maximum parsimony study. [17]

Maximum parsimony algorithms identify phylogenetically informative sites, meaning sites that favor some trees over others. Consider the sequences in Figure 9. Site 1 is not informative, because all sequences at that site (row 1 of Figure 9) are A (adenine), and no change in state is required to match any one sequence (1-4) to another. Similarly, Site 2 is not informative because all three trees require one change and there is no reason to favor one tree over another. Site 3 is not informative because all three trees require two changes. (See Figure 10.)

Figure 10. Site 3 trees all require two evolutionary changes.
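The per-site counts in this walkthrough can be checked with a short script. The sketch below is an illustration only: it uses Fitch-style set counting, a standard small-parsimony technique that this paper does not name, and it hard-codes the three possible unrooted trees for four sequences. The function and variable names are hypothetical; the site data are transcribed from Figure 9.

```python
# The three unrooted topologies for four sequences, written as two pairs
# of sequence indices (0-based): ((1,2),(3,4)), ((1,3),(2,4)), ((1,4),(2,3)).
TOPOLOGIES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

# Sites 1-9 from Figure 9; each string lists the states of Seq 1 to Seq 4.
SITES = ["AAAA", "AGGG", "GCAA", "ACTG", "GGAA", "TTTT", "GGCC", "CCCC", "AGAG"]


def fitch_changes(site, topology):
    """Minimum number of state changes for one site on one 4-taxon tree."""
    changes = 0
    pair_sets = []
    for left, right in topology:
        s1, s2 = {site[left]}, {site[right]}
        if s1 & s2:
            pair_sets.append(s1 & s2)      # the pair agrees: no change needed
        else:
            pair_sets.append(s1 | s2)      # the pair disagrees: one change
            changes += 1
    if not (pair_sets[0] & pair_sets[1]):
        changes += 1                       # a change on the internal branch
    return changes


def site_costs(site):
    """Changes required by each of the three topologies for one site."""
    return [fitch_changes(site, t) for t in TOPOLOGIES]


def is_informative(site):
    """A site is informative if the topologies do not all cost the same."""
    return len(set(site_costs(site))) > 1


if __name__ == "__main__":
    for number, site in enumerate(SITES, start=1):
        print(number, site, site_costs(site), is_informative(site))
    totals = [sum(fitch_changes(s, t) for s in SITES) for t in TOPOLOGIES]
    print("total changes per topology:", totals)
```

For the Figure 9 data this reports sites 5, 7, and 9 as informative, and the tree grouping sequences 1 and 2 against 3 and 4 needs the fewest changes overall (10, versus 11 and 12 for the other two).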
[17]

Site 4 is not informative because all three trees require three changes. No one tree can be identified as parsimonious. (See Figure 11.)

Figure 11. Site 4 trees all require three evolutionary changes. [17]

Site 5 is informative because one tree requires only one nucleotide change, whereas the other two trees require two changes. In Figure 12, the first tree on the left, which requires only one nucleotide change, is identified as the maximum parsimony tree.

Figure 12. Site 5 trees vary in the number of evolutionary changes required. [17]

Maximum Likelihood

The maximum likelihood method requires a probabilistic model of evolution for estimating nucleotide substitution. This method evaluates competing hypotheses (trees and parameters) by selecting those with the highest likelihood, meaning those that render the observed data most plausible. The likelihood of a hypothesis is defined as the probability of the data given that hypothesis. In phylogeny reconstruction, the hypotheses are the evolutionary tree (its topology and branch lengths) and any other parameters of the evolutionary model. [17, 20]

The likelihood calculations required for evolutionary trees are far from straightforward and usually require complex computations that must allow for all possible unobserved sequences at the LCA nodes of hypothesized trees. This method specifies the transition probability from one nucleotide state to another in a time interval in each branch. For example, for a one-parameter model with rate of substitution λ
per site per unit time, the probability that a site occupied by nucleotide i at time 0 is still occupied by i at time t is

$$P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\lambda t/3}$$

The probability that the nucleotide at time t is j (with j ≠ i) is

$$P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\lambda t/3}$$

where λ is the substitution rate per site per unit time, so that each of the three possible changes occurs at rate λ/3.

To set up a likelihood function, given x as the ancestral node and y and z as internal nodes, the probability of observing nucleotides i, j, k, l at the tips of the tree is computed as

$$P_{xl}(t_1+t_2+t_3)\, P_{xy}(t_1)\, P_{yk}(t_2+t_3)\, P_{yz}(t_2)\, P_{zi}(t_3)\, P_{zj}(t_3)$$

For the ancestral node (root) x, the probability of having nucleotide l in sequence 4 is calculated as $P_{xl}(t_1+t_2+t_3)$. Because x, y, and z can each be any one of four nucleotides (A, C, G, T), it is necessary to sum over all possibilities to obtain the probability of observing the configuration of nucleotides i, j, k, l in sequences 1, 2, 3, 4 for a given hypothetical tree (see Figure 13). This likelihood is calculated as

$$h(i,j,k,l) = \sum_{x} g_x\, P_{xl}(t_1+t_2+t_3) \sum_{y} P_{xy}(t_1)\, P_{yk}(t_2+t_3) \sum_{z} P_{yz}(t_2)\, P_{zi}(t_3)\, P_{zj}(t_3)$$

where $g_x$ is the prior probability that the root has nucleotide x. The appropriate likelihood function depends on the hypothetical tree and the evolutionary model used. (See Figure 13.) [17]

Figure 13. Different types of model trees for the derivation of the maximum likelihood function. [17]

Stochastic Models of Evolution

Evolutionary changes in molecular sequences result from mutations, some of which occur by chance, others by natural selection. Rates of change can also differ among OTUs, depending on several factors ranging from GC content to genome size. To accurately estimate phylogenetic trees, assumptions must be made about the substitution process, and those assumptions must be stated in the form of a stochastic evolutionary model. These probabilistic models are used to rank trees according to likelihood P(data | tree). From a Bayesian perspective, they rank trees according to a posterior probability P(tree | data). [17, 20]

The goal of probabilistic models is to find the likelihood or posterior probability of a particular taxonomic feature, then define and compute $P(x^{\bullet} \mid T, t^{\bullet})$, where $x^{\bullet}$
is $x^{j}$ for $j = 1, \dots, n$, T is a tree with n leaves with sequence j at leaf j, and $t^{\bullet}$ are the tree edge lengths. [17]

A few popular stochastic models of evolution include the single-parameter Jukes-Cantor (JC) method, Kimura 2-parameter (K2P), Hasegawa-Kishino-Yano (HKY), and Equal-Input. Some software programs, such as Paup*, will automatically use a default model for the tree estimation method chosen. The JC method is the easiest one to understand, because it assumes that if a site changes its state, it changes with equal probability to each of the other states. This is not very realistic, however, as some sites are known to evolve more rapidly than others, and some sites may be invariable and not allowed to change at all. Determining how best to select the appropriate model is a topic for another paper (or papers), as there is no one model that incorporates all mutation rules and patterns across different species and macromolecules. [4, 17, 20]

Hidden Markov Models

Profile hidden Markov models (HMMs) are a form of Bayesian network that provides statistical models of the consensus structure of a sequence family. Gary Churchill at The Jackson Laboratory was the first evolutionary geneticist to propose using profile HMMs to model rates of evolution. Many software packages and web services now apply HMMs to estimate phylogenetic relationships. [8]

In the HMM format, each position in the model corresponds to a site in the sequence alignment. For each position, there are a number of possible states, each of which corresponds to a different rate of evolution. In addition, there are transitions between all possible rate-states at adjacent positions. Transition probabilities capture any tendency for patterns of rates to occur in consecutive sites. [2, 4]

Assessing Trees

Tree estimating algorithms generate one or more optimal trees. This set of possible trees is subjected to a series of statistical tests to evaluate whether one tree is better than another and whether the proposed phylogeny is reasonable.
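One widely used assessment, covered under Bootstrap Analysis in this section, is the non-parametric bootstrap: alignment columns are resampled with replacement and a tree is re-estimated from each pseudo-replicate. The sketch below illustrates only the resampling loop. The function names are hypothetical, `build_tree` stands for whatever tree-estimation routine is in use, and the toy `closest_pair` estimator (which merely reports the two most similar sequences) is an assumption made to keep the example self-contained. Real analyses score support per clade and build a consensus tree rather than comparing whole results for equality.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: alignment columns sampled with replacement."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, build_tree, n_replicates=100, seed=0):
    """Fraction of replicates whose result equals the result on the original data."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    reference = build_tree(alignment)
    hits = sum(
        build_tree(bootstrap_alignment(alignment, rng)) == reference
        for _ in range(n_replicates)
    )
    return reference, hits / n_replicates


def closest_pair(alignment):
    """Toy stand-in for a tree estimator: indices of the two most similar sequences."""
    def differences(i, j):
        return sum(a != b for a, b in zip(alignment[i], alignment[j]))
    return min(combinations(range(len(alignment)), 2),
               key=lambda pair: differences(*pair))


if __name__ == "__main__":
    # The four sequences of Figure 9, read across sites 1-9.
    alignment = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
    pair, support = bootstrap_support(alignment, closest_pair, n_replicates=200)
    print("closest pair:", pair, "bootstrap support:", support)
```

Swapping `closest_pair` for a real estimator, such as a distance-based or parsimony routine, leaves the resampling logic unchanged.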
Common methods for assessing trees include the bootstrap and jackknife resampling methods, and analytical methods, such as parsimony, distance, and likelihood. To illustrate how these methods are used, consider the steps involved in a bootstrap analysis.

Bootstrap Analysis

A bootstrap is a statistical method for assessing trees that takes its name from the fact that it can pull itself up by its bootstraps and generate meaningful statistical distributions from almost nothing. Using bootstrap analysis, distributions that would otherwise be difficult to calculate exactly are estimated by repeated creation and analysis of artificial datasets. In a non-parametric bootstrap, artificial datasets are generated by resampling from the original data. In a parametric bootstrap, data is simulated according to the hypothesis being tested. The objective of any bootstrap analysis is to test whether the whole dataset supports the tree. [1, 4, 17]

Figure 14 illustrates the basic steps in any bootstrap analysis. Sample datasets are automatically generated from an original dataset. Trees are then estimated from each sample dataset. The results are compiled and compared to determine a bootstrap consensus tree.

Figure 14. Steps in a phylogenetic tree bootstrap analysis. [1]

Phylogenetic Analysis Tools

There are several good online tools and databases that can be used for phylogenetic analysis. These include PANTHER, P-POD, Pfam, TreeFam, and the PhyloFacts structural phylogenomic encyclopedia. Each of these databases uses different algorithms and draws on different sources for sequence information, and therefore the trees estimated by PANTHER, for example, may differ significantly from those generated by P-POD or Pfam.
As with all bioinformatics tools of this type, it is important to test different methods, compare the results, then determine which database works best (according to consensus results, not researcher bias) for studies involving different types of datasets. In addition to the phylogenetic programs already mentioned in this paper, a comprehensive list of more than 350 software packages, web services, and other resources can be found here: http://evolution.genetics.washington.edu/phylip/software.html.

PANTHER (pantherdb.org)

Protein ANalysis THrough Evolutionary Relationships, known by its acronym PANTHER, is a library of protein families and subfamilies indexed by function. PANTHER version 6.1 contains 5547 protein families. It categorizes proteins into evolutionarily related proteins (families) and related proteins with the same function (subfamilies). [8, 21, 26]

PANTHER is composed of both a library and an index. The library is a collection of books, each of which represents a protein family as a collection of multiple sequence alignments, HMMs, and a family phylogenetic tree. Functional divergence within the tree is represented by dividing the parent tree into child trees and HMMs based on shared functions. These subfamilies enable database curators to more accurately capture functional divergence of protein sequences as inferred from genomic DNA. [25, 26]

PANTHER database entries are annotated to molecular function, biological process, and pathway with a proprietary PANTHER/X ontology system, which is intended to be easier to understand than the more global standard Gene Ontology (GO). Database entries in PANTHER are generated through clustering of the UniProt database using a BLAST-based similarity score.
Trees are automatically generated based on multiple sequence alignments and parameters of the protein family HMMs using the Tree Inferred from Profile Score (TIPS) clustering algorithm. Scientific curators review all family trees, annotate each tree, and determine how best to divide them into subtrees using a tree-attribute viewer that tabulates annotations for sequences in a tree. In addition, trees and subfamilies are manually cross-checked and validated by curators. [25, 26]

P-POD (ortholog.princeton.edu)

The Princeton Protein Orthology Database (P-POD) combines results from multiple comparative methods with curated information culled from the literature. Designed to be a resource for experimental biologists seeking evolutionary information on genes of interest, P-POD employs a modular architecture based on the Generic Model Organism Database (GMOD). P-POD can be accessed from its web service or downloaded to run on local computer systems. [12]

P-POD accepts FASTA-formatted protein sequences as input, and performs comparative genomic analyses on those sequences using OrthoMCL and Jaccard clustering methods. The P-POD database contains both phylogenetic information and manually curated experimental results. The site also provides many links to sites rich in human disease and gene information. This tool may be especially helpful for bioinformaticists and statisticians developing comparative genomic database tools and resources.

Pfam (pfam.sanger.ac.uk/)

Pfam is a collection of protein families represented by multiple sequence alignments and HMMs. It contains models of protein clans, families, domains, and motifs, and uses HMMs representing conserved functional and structural domains. It is a large, widely used, actively curated, mature database that has been available online since 1995. Pfam can be used to retrieve the domain architectures for a specific protein by conducting a search using a protein sequence against the Pfam library of HMMs.
This database is also helpful for proteome and protein domain architecture analysis. [6, 8, 24]

There are two versions of the Pfam database. Pfam-B is generated automatically from ProDom, using PSI-BLAST, an open access bioinformatics tool available through NCBI for identifying weak but biologically relevant sequence similarities. Pfam-A is hand-curated from custom multiple sequence alignments. Pfam protein domain families are clustered with Mkdom2 and aligned with ProDomAlign. ProDom is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases. Mkdom2 is a ProDom program used to make ProDom family clusters. Protein domain families in ProDom were aligned using an improved parallelized program called ProDomAlign, developed in C++ using OpenMP. ProDomAlign is based on MultAlign, a program well suited for aligning very large sequence families with thousands of associated sequences.

As of early 2008, Pfam matched 72 percent of known protein sequences, and 95 percent of proteins for which there is a known structure. Within the Pfam database, 75 percent of sequences will have one match to Pfam-A, and 19 percent to Pfam-B. There are also two versions of Pfam-A and Pfam-B: Pfam-ls handles global alignments, and Pfam-fs is optimized for local alignments.

Interestingly, Pfam entries can be classified as unknown, but that doesn't mean the protein is undocumented. Unknown entries can be proteins for which some information is known, but which have not been fully researched or cannot be adequately annotated. For example, Pfam entry PF01816 is a Leucine-Rich Repeat Variant (LRV), which has a known structure (1LRV) available in the Protein Data Bank (pdb.org).
LRV repeat regions, which are found in many different proteins, are often involved in cell adhesion, DNA repair, and hormone reception, but identification of an LRV within a sequence encoding a protein doesn't specifically determine the protein's function.

For studies involving a large number of protein searches, it may be more convenient to run Pfam locally on a client machine. The standalone Pfam system requires the HMMER2 software, the Pfam HMM libraries and a couple of additional files from the Pfam website to be installed on the client machine. (HMMER is a freely distributable implementation of profile HMM software for protein sequence analysis.) Once the initial search is complete, researchers can go to the Pfam website to further analyze a select number of sequences using additional features on the website. [6, 8, 24]

TreeFam (TreeFam.org)

TreeFam is a curated database of phylogenetic trees and orthology predictions for all animal gene families that focuses on gene sets from animals with completely sequenced genomes. Orthologs and paralogs are inferred from the phylogenetic tree of a gene family. Release 4 contains curated trees for 1314 families and automatically generated trees for another 14351 families. [16, 23]

Like Pfam, TreeFam is a two-part database: TreeFam-B contains automatically generated trees, and TreeFam-A consists of manually curated trees. To automatically generate trees, an algorithm selects clusters of genes to create TreeFam-B seeds from core species with high-quality reference genome sequences, first using BLAST to rapidly assemble an initial list of possible matches, then HMMER to expand and filter probable sequence matches for each TreeFam-B seed family. The filtered alignment is fed into a neighbor-joining algorithm and a tree is constructed based on amino acid mismatch distances.
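The final step described above, building a neighbor-joining tree from mismatch distances, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TreeFam's actual pipeline: the mismatch distance is assumed here to be the plain fraction of differing ungapped positions (the p-distance), and the function names and the internal node labels (node1, node2, ...) are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Map each unordered pair of names to its p-distance.

    `alignment` is a dict of {name: aligned sequence}.
    """
    return {
        frozenset((x, y)): p_distance(alignment[x], alignment[y])
        for x, y in combinations(alignment, 2)
    }


def neighbor_joining(names, dist):
    """Return the tree as a list of edges (node, neighbor, branch_length).

    `dist` maps frozenset({a, b}) to the distance between a and b.
    Internal nodes get the illustrative labels node1, node2, ...
    """
    if len(names) < 2:
        raise ValueError("need at least two taxa")
    dist = dict(dist)      # work on a copy; new nodes are added to it
    active = list(names)
    edges = []
    next_id = 1

    def d(x, y):
        return dist[frozenset((x, y))]

    while len(active) > 2:
        n = len(active)
        # Sum of distances from each active node to all the others.
        total = {x: sum(d(x, y) for y in active if y != x) for x in active}
        # Pick the pair that minimizes the neighbor-joining criterion Q.
        a, b = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d(p[0], p[1]) - total[p[0]] - total[p[1]],
        )
        parent = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to their new parent node.
        len_a = d(a, b) / 2 + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d(a, b) - len_a
        edges += [(a, parent, len_a), (b, parent, len_b)]
        # Distances from the new parent to every remaining node.
        for k in active:
            if k not in (a, b):
                dist[frozenset((parent, k))] = (d(a, k) + d(b, k) - d(a, b)) / 2
        active = [k for k in active if k not in (a, b)] + [parent]

    x, y = active
    edges.append((x, y, d(x, y)))  # connect the last two nodes
    return edges
```

A typical call would be `neighbor_joining(list(alignment), distance_matrix(alignment))`. The result is an unrooted tree; the sketch omits the bootstrapping and tree cleaning that the text describes for TreeFam, and production work would normally rely on an established phylogenetics package rather than hand-written code.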
For TreeFam version 4, the most current release, five clean family trees were built for each TreeFam-B seed: two using a maximum likelihood tree generated using PHYML (one based on the protein alignment, the other on the codon alignment), and three using a neighbor-joining tree with different distance measurements based on codon alignments. [16, 23]

Scientific curators then manually correct any errors (based on information in the literature) in automatically generated TreeFam-B trees. Curated TreeFam-B trees then become seeds for TreeFam-A trees. Clean TreeFam-A trees are built using three merging algorithms and bootstrapping to find the consensus tree of seven trees: two constrained maximum likelihood trees based on protein and codon alignments, and five unconstrained neighbor-joining trees generated using different distance measurements based on codon alignments. For both TreeFam-B and TreeFam-A families, orthologs and paralogs are inferred only from clean trees using the Duplication/Loss Inference (DLI) algorithm, which requires a species tree (the NCBI taxonomy tree). [16, 23]

PhyloFacts (phylogenomics.berkeley.edu/phylofacts)

PhyloFacts is an online phylogenomic encyclopedia for protein functional and structural classification. It contains more than 57,000 books for protein superfamilies and structural domains. Each book contains heterogeneous data for protein families, including multiple sequence alignments, one or more phylogenetic trees, predicted 3-D protein structures, predicted functional subfamilies, taxonomic distributions, GO annotations, and Pfam domains. HMMs constructed for each family and subfamily permit novel sequences to be classified to different functional classes. [14]

Unlike other databases mentioned in this paper, PhyloFacts seeks to correct and eliminate annotation errors associated with computational methods for predicting protein function based on sequence homology. It uses a consensus approach that integrates many different prediction methods and sources of experimental data over an evolutionary tree. By applying evolutionary and structural clustering of proteins, PhyloFacts is able to analyze disparate datasets using multiple methods, identify potential errors in database annotations, and provide a mechanism for improving the accuracy of functional annotation in general. [14]

PhyloFacts can be used to search for protein structure prediction or functional classification for a particular protein sequence. Researchers may also browse through protein family books and multiple sequence alignments, phylogenetic trees, HMMs and other pertinent information for proteins of interest. This web service also provides many links to literature and other information sources. [14]

Applied Molecular Phylogenetics

Molecular phylogenetic studies have many diverse applications. As the amount of publicly available molecular sequence data grows and methods for modeling evolution become more sophisticated and accessible, more and more biologists are incorporating phylogenetic analyses into their research strategy. Here is a sampling of how molecular phylogenetics might be applied.

Tracing the evolution of man

In one case study, molecular phylogenetic techniques were used to compare and analyze variation in DNA sequences using modern human and Neanderthal mitochondrial DNA (mtDNA). For this study, 206 modern human mtDNAs and parts of two Neanderthal mtDNA sequences derived from skeletal remains were used to generate an initial dataset.
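The genetic distances in this case study come from two standard substitution models, Jukes-Cantor and Kimura 2-parameter. As a rough sketch of what those corrections compute (the software actually used in the study is not stated here, and the function names below are hypothetical), the usual closed-form distances for a pair of aligned DNA sequences are d = -(3/4) ln(1 - 4p/3) for Jukes-Cantor, where p is the fraction of differing sites, and d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)) for Kimura 2-parameter, where P and Q are the fractions of transitions and transversions.

```python
import math

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def count_differences(seq_a, seq_b):
    """Return (sites, transitions, transversions) for two aligned DNA sequences.

    Positions where either sequence has a gap or an ambiguous base are skipped.
    """
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # purine to purine, or pyrimidine to pyrimidine
        else:
            transversions += 1    # purine to pyrimidine, or the reverse
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, transitions, transversions


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: one rate shared by all substitution types."""
    sites, ts, tv = count_differences(seq_a, seq_b)
    p = (ts + tv) / sites
    arg = 1 - 4 * p / 3
    if arg <= 0:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(arg)


def kimura_2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance: separate transition and transversion rates."""
    sites, ts, tv = count_differences(seq_a, seq_b)
    p_ts, q_tv = ts / sites, tv / sites
    arg1 = 1 - 2 * p_ts - q_tv
    arg2 = 1 - 2 * q_tv
    if arg1 <= 0 or arg2 <= 0:
        raise ValueError("sequences too divergent for the Kimura correction")
    # Equivalent to -0.5 * ln(arg1 * sqrt(arg2)).
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)
```

Both functions assume the two sequences are already aligned and of equal length. The corrected distances are always at least as large as the raw fraction of differing sites, because they account for multiple substitutions at the same position.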
Genetic distance was first estimated using the Jukes-Cantor single-parameter model. Then the Kimura 2-parameter model was used to distinguish between transition (replacement of one purine with another purine or one pyrimidine with another pyrimidine) and transversion (replacement of a purine with a pyrimidine or vice versa) probabilities. A phylogenetic tree representing primate evolution was generated using pairwise genetic distances between primate hypervariable regions I and II of mtDNA. [3]

Chasing an epidemic: SARS

Using publicly available genomic data, it is possible to reconstruct the progress of the SARS epidemic over time and geographically. To conduct this phylogenetic analysis, researchers used the neighbor-joining method to construct a phylogenetic tree of spike proteins in various coronaviruses and identify the viral host (a Himalayan palm civet). They then obtained 13 SARS genome sequences with documented information on the date and location of the sample. The neighbor-joining method and a distance matrix based on the Jukes-Cantor model were used to generate an epidemic tree, from which it was possible to identify the origin (date and location) of the virus by observing the progression of mutations over time. [3]

Barking up the right tree

Phylogenetics is increasingly integrated into biological and biomedical research papers. When the canine genome was published, researchers used sequence data to estimate a comprehensive phylogeny of the canid family.

Figure 15. Phylogenetic Tree of the Canid family

This canid family phylogenetic tree is based on 15 kb of coding DNA and intron sequence. It was constructed using the maximum parsimony method and represents the single most parsimonious tree. A good example of how phylogenies are reported in the literature, this tree includes bootstrap values and Bayesian posterior probability values listed above and below internodes, respectively. Dashes indicate bootstrap values below 50%. In addition, divergence time in millions of years (Myr) is indicated for three nodes. [18]

Seeing the Forest from the Trees

Molecular phylogenetics is a broad, diverse field with many applications, supported by multiple computational and statistical methods. The sheer volume of genomic data currently available (and rapidly growing) renders molecular phylogenetics a key component of much biological research. Genome-scale studies on gene content, conserved gene order, gene expression, regulatory networks, metabolic pathways, and functional genome annotation can all be enriched by evolutionary studies based on phylogenetic statistical analyses. [19, 25-27]

Molecular phylogenies have fast become an integral part of biological research, pharmaceutical drug design, and bioinformatics techniques for protein structure prediction and multiple sequence alignment. Although not all molecular biologists and bioinformaticians may be familiar with the techniques described in this paper, this is a rapidly growing and expanding field and there is ongoing need for novel algorithms to solve complex phylogeny reconstruction problems.

References

1. Baldauf, SL (2003) Phylogeny for the faint of heart: a tutorial. Trends in Genetics, 19(6):345-351.
2. Brown, D, K Sjolander (2006) Functional Classification Using Phylogenomic Inference. PLoS Computational Biology, 2(6):0479-0483.
3. Cristianini, N, and M Hahn (2007) Introduction to Computational Genomics: A Case Studies Approach. Cambridge University Press: Cambridge.
4. Durbin, R, S Eddy, A Krogh, G Mitchison (1998) Biological Sequence Analysis. Cambridge University Press: Cambridge.
5. Ewens, WJ, R Grant (2005) Statistical Methods in Bioinformatics. Springer Science and Business Media: New York.
6. Finn, RD, J Tate, J Mistry, PC Coggill, SJ Sammut, HR Hotz, G Ceric, K Forslund, SR Eddy, ELL Sonnhammer, A Bateman (2008) The Pfam protein families database. Nucleic Acids Research, 36:D281-288.
7. Gabaldon, T (2008) Large-scale assignment of orthology: back to phylogenetics? Genome Biology, 9:235.1-235.6.
8. Gollery, M (2008) Handbook of Hidden Markov Models in Bioinformatics. CRC Press, Taylor & Francis Group: London.
9. Goodstadt, L, CP Ponting (2006) Phylogenetic Reconstruction of Orthology, Paralogy, and Conserved Synteny for Dog and Human. PLoS Computational Biology, 2(9):1134-1150.
10. Hall, BG (2004) Phylogenetic Trees Made Easy: A How-To Manual, 2nd ed. Sinauer Associates, Inc.: Sunderland, MA.
11. Hartwell, LH, L Hood, ML Goldberg, AE Reynolds, LM Silver, RC Veres (2008) Genetics: From Genes to Genomes, 3rd ed. McGraw-Hill: New York.
12. Heinicke, S, MS Livstone, C Lu, R Oughtred, F Kang, SV Angiuoli, O White, D Botstein, K Dolinski (2007) The Princeton Protein Orthology Database (P-POD): A Comparative Genomics Analysis Tool for Biologists. PLoS ONE, 8:e766.1-15.
13. Kortschak, RD, R Tamme (2001) Evolutionary analysis of vertebrate Notch genes. Dev Genes Evol, 211:350-354.
14. Krishnamurthy, N, DP Brown, D Kirshner, K Sjolander (2006) PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biology, 7:R83. -13.
15. Kuzniar, A, RCHJ van Ham, S Pongor, JAM Leunissen (2008) The quest for orthologs: finding the corresponding gene across genomes. Trends in Genetics, 24(11):539-551.
16. Li, H, A Coghlan, J Ruan, LJ Coin, JK Heriche, L Osmotherly, R Li, T Liu, Z Zhang, L Bolund, GKS Wong, W Zheng, P Dehal, J Wang, R Durbin (2006) TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Research, 34:D573-580.
17. Li, WH (1997) Molecular Evolution. Sinauer Associates: Sunderland, MA.
18. Lindblad-Toh, K, CM Wade, TS Mikkelsen, EK Karlsson, DB Jaffe, M Kamal, M Clamp, JL Chang, EJ Kulbokas III, MC Zody, E Mauceli, X Xie, M Breen, RK Wayne, EA Ostrander, CP Ponting, F Galibert, DR Smith, PJ deJong, E Kirkness, P Alvarez, T Biagi, W Brockman, J Butler, C Chin, A Cook, J Cuff, MJ Daly, D DeCaprio, S Gnerre, M Grabherr, M Kellis, M Kleber, C Bardeleben, L Goodstadt, A Heger, C Hitte, L Kim, KP Koepfli, HG Parker, JP Pollinger, SMJ Searle, NB Sutter, R Thomas, C Webber, ES Lander (2005) Genome Sequence, Comparative Analysis and Haplotype Structure of the Domestic Dog. Nature, 438:803-819.
19. Linder, CR, T Warnow (2005) An overview of phylogeny reconstruction. In the Handbook of Computational Molecular Biology, Chapman and Hall/CRC Computer & Information Science.
20. Lio, P, N Goldman (1998) Models of Molecular Evolution and Phylogeny. Genome Research, 8:1233-1244.
21. Mi, H, N Guo, A Kejariwal, PD Thomas (2007) PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Research, 35:D247-252.
22. Patthy, Laszlo (1999) Protein Evolution. Blackwell Science, Ltd: Malden, MA.
23. Ruan, J, H Li, Z Chen, A Coghlan, LJM Coin, Y Guo, JK Heriche, Y Hu, K Kristiansen, R Li, T Liu, A Mose, J Qin, S Vang, AJ Vilella, A Ureta-Vidal, L Bolund, J Wang, R Durbin (2008) TreeFam: 2008 Update. Nucleic Acids Research, 36:D735-740.
24. Sammut, SJ, RD Finn, A Bateman (2008) Pfam 10 years on: 10000 families and still growing. Briefings in Bioinformatics, 9(3):210-219.
25. Thomas, PD, A Kejariwal, N Guo, H Mi, MJ Campbell, A Muruganujan, B Lazareva-Ulitsky (2006) Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP scoring tools. Nucleic Acids Research, 34:W645-650.
26. Thomas, PD, MJ Campbell, A Kejariwal, H Mi, B Karlak, R Daverman, K Diemer, A Muruganujan, A Narechania (2003) PANTHER: A library of Protein Families and Subfamilies Indexed by Function. Genome Research, 13:2129-2141.
27. Warnow, T (2004) Computational Methods in Phylogenetics. Computational Systems Biology Conference, Stanford, CA.
28. Whelan, S, P Lio, N Goldman (2001) Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends in Genetics, 17(5):262-272.

Appendix: Website Resources

Phylogeny Programs. A University of Washington site formerly supported by the National Science Foundation. http://www.evolution.genetics.washington.edu/phylip/software.html
TreeFam: Tree Families Database. http://www.treefam.org
Protein Analysis Through Evolutionary Relationships (PANTHER) Classification System. http://www.pantherdb.org
29. Pfam: Database of Protein Families. http://pfam.sanger.ac.uk
30. Princeton Protein Orthology Database (P-POD). http://ppod.princeton.edu
31. Wikipedia. http://en.wikipedia.org/wiki/Tree_of_life(science)

Cover Page

The cover image is from a phylogeny of canid species that appeared in Lindblad-Toh et al, 2005. [18]
