BMS8110 Review (6): Lecture 6 - Molecular Phylogeny and Evolution

Outline

Introduction to molecular evolution
Principles of molecular phylogeny and evolution 分子系统发育与进化原理
- Goals; historical background; molecular clock hypothesis
- positive and negative selection; neutral theory of evolution
Molecular phylogeny: properties of trees
- Topologies and branch lengths of trees
- Tree roots
- Enumerating(穷举) trees and selecting search strategies
Types of trees (species trees vs. gene/protein trees; DNA or protein)
Five stages of phylogenetic analysis
- Stage 1: sequence acquisition
- Stage 2: multiple sequence alignment
- Stage 3: models of DNA and amino acid substitution
- Stage 4: tree-building methods (distance-based; maximum parsimony(最大简约法); maximum likelihood; Bayesian methods)
- Stage 5: evaluating trees

At the molecular level, evolution is a process of mutation with selection.

Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life.

Phylogeny is the inference of evolutionary relationships.

Traditionally, phylogeny relied on the comparison of morphological(形态特征) features between organisms.
Today, molecular sequence data are also used for phylogenetic analyses.

Phylogeny can answer questions such as:

Is my favorite gene under selective pressure?
Was the extinct quagga(伯切尔氏斑马, 已绝种) more like a zebra or a horse?
Was Darwin correct that humans are closest to chimps and gorillas?
How related are whales, dolphins & porpoises to cows?
Where and when did HIV originate?
What is the history of life on earth?

Molecular clock hypothesis: for any given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages.
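
If the rate is approximately constant across lineages, the divergence between two sequences can be converted into a rate or a divergence time. The sketch below assumes the commonly used relation r = K / (2T), where K is the number of substitutions per site between two sequences and T is the time since they diverged; the function names and the numbers are hypothetical and only illustrate the arithmetic.

```python
def substitution_rate(k_subs_per_site, divergence_time_years):
    """Estimate r = K / (2T).

    The factor 2 reflects that substitutions accumulate along both
    lineages after the split from the common ancestor.
    """
    if divergence_time_years <= 0:
        raise ValueError("divergence time must be positive")
    return k_subs_per_site / (2.0 * divergence_time_years)


def divergence_time(k_subs_per_site, rate_per_site_per_year):
    """Invert the same relation: T = K / (2r)."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return k_subs_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
rate = substitution_rate(0.1, 5e7)   # K = 0.1, T = 50 million years
time = divergence_time(0.2, rate)    # a second pair with K = 0.2
print(rate, time)
```

The estimate is only as good as the clock assumption: if lineages evolve at different rates, T computed this way will be biased.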

Positive and negative selection:

Positive selection occurs when a sequence undergoes a significantly increased rate of substitution, while negative (purifying) selection occurs when a sequence changes slowly.
Otherwise, the sequence is evolving neutrally.

Molecular phylogeny: nomenclature(术语) of trees

There are two main kinds of information inherent to any tree: topology and branch lengths.

The root of a phylogenetic tree represents the common ancestor of the sequences. Some trees are unrooted, and thus do not specify the common ancestor.

A tree can be rooted using an outgroup (that is, a taxon known to be distantly related to all other OTUs, or operational taxonomic units).

Finding optimal trees: branch swapping

Bisect(二等分) a branch to form two subtrees
Reconnect via one branch from each subtree; evaluate each bisection
Identify the optimal tree(s)
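
Heuristic searches such as branch swapping are used because exhaustively enumerating every tree quickly becomes infeasible. The sketch below assumes strictly bifurcating trees, for which the standard counts are (2n-5)!! unrooted trees and (2n-3)!! rooted trees for n taxa; the function names are illustrative.

```python
def double_factorial(k):
    """Return k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3 labelled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return double_factorial(2 * n_taxa - 5)


def num_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2 labelled taxa."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
# 3 1 3
# 4 3 15
# 5 15 105
# 10 2027025 34459425
```

With only 10 taxa there are already more than two million unrooted topologies, so search strategies that evaluate a subset of trees are needed for realistic data sets.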

Species trees versus gene/protein trees

Molecular evolutionary studies can be complicated by the fact that both species and genes evolve. Speciation usually occurs when a species becomes reproductively isolated. In a species tree, each internal node represents a speciation event.
Genes (and proteins) may duplicate or otherwise evolve before or after any given speciation(物种形成) event. The topology of a gene (or protein) based tree may therefore differ from the topology of a species tree.
A gene (e.g. a globin, 球蛋白) may duplicate before or after two species diverge!

Stage 1: sequence acquisition

For phylogeny, DNA can be more informative than protein.
Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, and coincidental substitutions.
Additional mutational events can be inferred by analysis of ancestral sequences.

Stage 2: multiple sequence alignment

The fundamental basis of a phylogenetic tree is a multiple sequence alignment.
If there is a misalignment, or if a nonhomologous sequence is included in the alignment, it will still be possible to generate a tree, but that tree will not be biologically meaningful.
Confirm that all sequences are homologous(同源的)
Adjust gap creation and extension penalties as needed to optimize the alignment
Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data)
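
The last recommendation, deleting columns with incomplete data, can be sketched as follows. This is a minimal illustration, assuming the alignment is held as a dict of equal-length strings and that `-` and `?` mark missing data; the function name and the toy sequences are hypothetical.

```python
def drop_incomplete_columns(alignment, missing="-?"):
    """Keep only the alignment columns in which every taxon has a residue.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    missing:   characters treated as gaps or missing data.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    # Indices of columns with no gap or missing character in any sequence.
    keep = [
        i for i in range(len(seqs[0]))
        if all(seq[i] not in missing for seq in seqs)
    ]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in zip(names, seqs)
    }


toy = {"human": "ATG-CA", "chimp": "ATGACA", "mouse": "A-GACT"}
print(drop_incomplete_columns(toy))
# {'human': 'AGCA', 'chimp': 'AGCA', 'mouse': 'AGCT'}
```

Dropping columns discards some signal, so in practice the trade-off between completeness and alignment length is worth checking before building a tree.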

Stage 3: models of DNA and amino acid substitution

The simplest approach to measuring distances between sequences is to align pairs of sequences and then count the number of differences.
This degree of divergence(发散度) is called the Hamming distance.
But observed differences do not equal genetic distance! Genetic distance involves mutations that are not observed directly.
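
The gap between observed differences and genetic distance can be made concrete with a small sketch. It computes the Hamming distance, the proportion of differing sites (p-distance), and a corrected distance. The correction used here is the Jukes-Cantor model, which is an assumption for illustration only (the notes do not name a specific model); it also assumes gap-free DNA sequences of equal length.

```python
import math


def hamming_distance(a, b):
    """Count the aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def p_distance(a, b):
    """Proportion of sites that differ (observed differences per site)."""
    return hamming_distance(a, b) / len(a)


def jukes_cantor_distance(a, b):
    """Estimated substitutions per site under the Jukes-Cantor model.

    d = -(3/4) * ln(1 - (4/3) * p); undefined once p reaches 0.75,
    where the sequences are saturated with changes.
    """
    p = p_distance(a, b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


seq1 = "ACGTACGTAC"
seq2 = "ACGTTCGTAA"
print(hamming_distance(seq1, seq2))       # 2
print(p_distance(seq1, seq2))             # 0.2
print(jukes_cantor_distance(seq1, seq2))  # slightly larger than 0.2
```

The corrected value exceeds the observed proportion because the model accounts for multiple substitutions at the same site, which a simple count cannot see.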

Stage 4: tree-building methods (distance-based; maximum parsimony(最大简约法); maximum likelihood; Bayesian methods)

UPGMA: distance-based
Neighbor-joining: distance-based
Maximum parsimony: character-based
Maximum likelihood: character-based (model-based). Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated from the probability of each residue in an alignment, based upon some model of the substitution process.
Bayesian: character-based (model-based). Bayesian inference is extremely popular for phylogenetic analyses (as is maximum likelihood); this approach requires you to specify prior assumptions about the model of evolution.
Distance-based methods use a distance metric calculated from pairwise alignments: if two sequences are closely related, they are placed next to each other on the tree.
Character-based methods include maximum parsimony and maximum likelihood; they identify positions that best describe how characters (e.g. amino acids) are derived from common ancestors.
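
To illustrate the distance-based idea, here is a minimal sketch of UPGMA clustering on a small distance matrix. It is a teaching sketch under stated assumptions, not a production implementation: the matrix values are hypothetical, branch lengths are omitted, and ties are broken by whichever pair is encountered first.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Cluster taxa with UPGMA and return a Newick string (topology only).

    labels: list of taxon names.
    matrix: symmetric list of lists; matrix[i][j] is the distance
            between labels[i] and labels[j].
    """
    # Each cluster id maps to (subtree as Newick text, number of taxa in it).
    clusters = {i: (label, 1) for i, label in enumerate(labels)}
    dist = {
        frozenset((i, j)): matrix[i][j]
        for i, j in combinations(range(len(labels)), 2)
    }
    next_id = len(labels)

    while len(clusters) > 1:
        # Find the pair of current clusters with the smallest distance.
        pair = min(
            (frozenset(p) for p in combinations(clusters, 2)),
            key=lambda p: dist[p],
        )
        i, j = sorted(pair)
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)

        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

        clusters[next_id] = ("(" + tree_i + "," + tree_j + ")", n_i + n_j)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree + ";"


labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(labels, matrix))  # (D,(C,(A,B)));
```

A and B are joined first because they have the smallest pairwise distance, which is exactly the "put related sequences next to each other" rule described above. UPGMA implicitly assumes a constant rate of evolution in all lineages; neighbor-joining relaxes that assumption.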

Stage 5: evaluating trees

The main criteria by which the accuracy of a phylogenetic tree is assessed are consistency, efficiency, and robustness.
Evaluation of accuracy can refer to an approach or to a particular tree.
Bootstrapping is a commonly used approach to measuring the robustness of a tree topology.
Given a branching order, how consistently does an algorithm find that branching order in randomly resampled versions of the original data set (alignment columns drawn with replacement)?
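
The resampling step can be sketched as below. Bootstrapping draws alignment columns with replacement to build pseudo-replicate alignments of the original length, then asks how often a grouping is recovered. To keep the example short, tree building is replaced by a toy stand-in that only reports which two taxa are most similar; this stand-in, the function names, and the sequences are all assumptions for illustration.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def closest_pair(alignment):
    """Toy stand-in for tree building: the two taxa with the fewest differences."""
    names = sorted(alignment)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = sum(x != y for x, y in zip(alignment[a], alignment[b]))
            if best is None or d < best[0]:
                best = (d, frozenset((a, b)))
    return best[1]


def bootstrap_support(alignment, pair, n_replicates=100, seed=0):
    """Fraction of replicates in which `pair` is recovered as the closest pair."""
    rng = random.Random(seed)
    target = frozenset(pair)
    hits = sum(
        closest_pair(bootstrap_replicate(alignment, rng)) == target
        for _ in range(n_replicates)
    )
    return hits / n_replicates


toy = {
    "human": "ACGTACGTACGTACGA",
    "chimp": "ACGTACGTACGTACGT",
    "mouse": "ACGAACTTACCTAGGT",
}
print(bootstrap_support(toy, ("human", "chimp")))
```

In a real analysis each replicate would be passed to the same tree-building method used for the original alignment, and the support for a clade would be the percentage of replicate trees that contain it.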