Lecture 5——DNA-seq-2_Bioinformatics and Statistical Topics

The figures in this article come from the course video: Next-Generation Sequencing Data Analysis, Lecture 5, DNA-seq2_Bioinformatics and Statistical Topics

Sequence mappability

Human genome
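What is the minimum read length (number of nucleotides) that can be uniquely mapped back to the human genome?
In theory, reads with 16 bases or more can be uniquely mapped back to the human genome (4^16 is about 4.3 billion, which exceeds the roughly 3 billion bases of the genome)
In practice longer reads are needed, because about half of the human genome is repetitive DNA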
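The 16-base figure can be checked with a short calculation. This is a minimal sketch that assumes a haploid genome size of about 3.1 billion bases (a round figure, not taken from the lecture) and a random sequence model; the function name is illustrative.

```python
GENOME_SIZE = 3.1e9  # assumed approximate haploid human genome size, in bases


def min_unique_length(genome_size: float) -> int:
    """Smallest k for which the number of distinct k-mers (4**k) exceeds the genome size."""
    k = 1
    while 4 ** k <= genome_size:
        k += 1
    return k


k = min_unique_length(GENOME_SIZE)
print(k, 4 ** k)
```

The random-sequence assumption is exactly what repetitive DNA violates, which is why this is only a theoretical lower bound.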
Mappability tracks are available for download at the UCSC Genome Browser; they display how uniquely k-mer sequences align to a region of the genome (k = 24, 36, 40, 50, 75, and 100)
S = 1/(number of matches found in the genome), with up to 2 mismatches allowed; e.g. S = 1 means the k-mer is unique
Generate your own mappability track
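As a rough illustration of what generating such a track involves, the sketch below computes S = 1/(number of matches) for every k-mer of a toy sequence. It is a simplification and not the UCSC procedure: it counts exact matches only (no mismatches allowed), looks at the forward strand only, and uses a made-up genome string.

```python
from collections import Counter


def mappability_track(genome: str, k: int) -> list:
    """Return S = 1 / (number of exact occurrences) for the k-mer starting at each position."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


toy_genome = "ACGTACGTTTGACCA"  # hypothetical sequence; ACGT occurs twice
track = mappability_track(toy_genome, k=4)
for position, score in enumerate(track):
    print(position, toy_genome[position:position + 4], score)
```

A real track would need an aligner or a k-mer index to handle a 3 Gb genome and mismatches; the dictionary count here only works for small inputs.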
Mappability is determined by multiple factors
the alignment algorithm (Koehler et al. Bioinformatics, 2010)
the regions where mapping occurs
the biochemistry assay
Regions can be slightly different
Refined alignment
Sequence alignment
Number of mismatching bases minimized across one read
Sequence refined alignment
Number of mismatching bases minimized across all the reads
What went wrong for the initial alignment?
A large percentage of the regions requiring local realignment are due to the presence of an insertion or deletion (indel) in the individual’s genome with respect to the reference genome
The aligner prefers one or two mismatches over a 4 bp insertion
Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are mistaken as SNPs
Initial mapping treats each read independently
Even when some of the reads are correctly mapped with indels, reads that cover the indel only near their start or end are often misaligned.
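A toy case shows how this happens. In the sketch below the sample carries a 4 bp insertion (AGAG) that is absent from the reference, and the read ends two bases into that insertion. All sequences are made up for illustration, and the "aligner" is just a gapless search for the placement with the fewest mismatches.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_gapless(read: str, ref: str):
    """Return (offset, mismatches) of the gapless placement with the fewest mismatches."""
    offsets = range(len(ref) - len(read) + 1)
    best = min(offsets, key=lambda i: hamming(read, ref[i:i + len(read)]))
    return best, hamming(read, ref[best:best + len(read)])


ref = "GATTACAGGCT" + "CCATGGAA"   # reference: no insertion
read = "TACAGGCT" + "AG"           # read ends with the first 2 bases of the sample's AGAG insertion

offset, mismatches = best_gapless(read, ref)
print(offset, mismatches)
```

The read is placed at the correct locus, but its last two bases show up as two adjacent mismatches against the reference, which a variant caller could mistake for clustered SNPs.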
Local realignment
Transform regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches.
Two steps
1: Determining (small) suspicious intervals which are likely in need of realignment
2: Running the realigner over those intervals
Step 1
De novo indels in initial alignment
If one or more reads contain an indel (and are aligned correctly), one would want to make sure that all the indel-containing reads in the pileup are aligned correctly
No indels identified in the initial alignment
With base calls: clustered SNP calls are suspicious and are often caused by indels
Without base calls: detect clustered loci with high entropy (i.e. lots of mismatches)
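The entropy criterion can be sketched as follows. This assumes each pileup column is given as a string of observed bases, and the threshold, clustering distance, and function names are arbitrary choices for illustration, not values from the lecture or from any specific tool.

```python
import math
from collections import Counter


def column_entropy(bases: str) -> float:
    """Shannon entropy (bits) of the base composition of one pileup column."""
    total = len(bases)
    return -sum((c / total) * math.log2(c / total) for c in Counter(bases).values())


def suspicious_intervals(pileup, threshold=0.5, max_gap=2, min_sites=2):
    """Merge nearby high-entropy columns into (start, end) intervals."""
    flagged = [i for i, col in enumerate(pileup) if col and column_entropy(col) > threshold]
    intervals, cluster = [], []
    for pos in flagged:
        # Close the current cluster when the next flagged column is too far away.
        if cluster and pos - cluster[-1] > max_gap:
            if len(cluster) >= min_sites:
                intervals.append((cluster[0], cluster[-1]))
            cluster = []
        cluster.append(pos)
    if len(cluster) >= min_sites:
        intervals.append((cluster[0], cluster[-1]))
    return intervals


toy_pileup = ["AAAAAA", "CCCCCC", "GGGTTT", "AAACCC", "TTTTTT", "GGGGGG", "CCCCCC"]
print(suspicious_intervals(toy_pileup))
```

Requiring at least two flagged columns per interval reflects the idea that an isolated mismatching column is more likely a SNP, while a cluster hints at a misaligned indel.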
For known indels in dbSNP
Step 2
Construct all possible haplotypes by integrating
Reference genome
Known gaps
De novo gaps (by BFAST/BWA, or Smith-Waterman)
Conduct gapless alignment against all possible haplotypes, and calculate the likelihood of each haplotype
Key information on realignment
Minimizing mismatches for one read vs. multiple reads
Realignment process:
Enumerate potential haplotype candidates
Conduct gapless alignment on all haplotypes
Calculate likelihood for each haplotype
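The three-step realignment process can be sketched on toy data. This is a simplified illustration, not the actual realigner: it assumes only two candidate haplotypes (the reference and the reference plus one 4 bp insertion), a single fixed per-base error rate instead of base qualities, and made-up sequences.

```python
import math


def min_mismatches(read: str, hap: str) -> int:
    """Fewest mismatches over all gapless placements of the read on the haplotype."""
    return min(
        sum(a != b for a, b in zip(read, hap[i:i + len(read)]))
        for i in range(len(hap) - len(read) + 1)
    )


def haplotype_log_likelihood(reads, hap, error_rate=0.01):
    """Sum over reads of log P(read | haplotype), treating bases as independent."""
    total = 0.0
    for read in reads:
        mm = min_mismatches(read, hap)
        total += mm * math.log(error_rate) + (len(read) - mm) * math.log(1 - error_rate)
    return total


# Step 1: enumerate candidate haplotypes (hypothetical sequences).
haplotypes = {
    "reference": "GATTACAGGCTCCATGGAA",
    "with_insertion": "GATTACAGGCT" + "AGAG" + "CCATGGAA",
}
reads = ["TACAGGCTAG", "GGCTAGAGCC", "AGAGCCATGG", "ACAGGCTAGA"]

# Steps 2 and 3: gapless alignment against each haplotype, then compare likelihoods.
scores = {name: haplotype_log_likelihood(reads, hap) for name, hap in haplotypes.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because the score is summed over all reads, a read that overlaps the insertion by only two bases still contributes evidence for the insertion haplotype, which is the point of minimizing mismatches across multiple reads rather than one read at a time.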
USE WITH CAUTION
Major assumption: consistency in the inferred haplotypes among all individuals
Doesn’t work for somatic SNP and indel calling (semi-random process)
Needs to show a significant improvement to replace the initial alignment. May not work for:
pooled sequencing experiments, if only a small portion of the individuals have the indel
RNA-seq experiments, if the lowly expressed allele contains the indel

Quality and recalibration

Available covariates
CycleCovariate (machine cycle for this base), DinucCovariate, HomopolymerCovariate, MappingQualityCovariate, MinimumNQSCovariate, PositionCovariate, PrimerRoundCovariate, QualityScoreCovariate, ReadGroupCovariate
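The notes list the covariates without showing how they are used, so the following is only a sketch of the general idea under stated assumptions: bases are binned by covariate values (here just reported quality and machine cycle), mismatches against the reference are counted per bin at sites not known to be variant, and an empirical Phred quality is derived from the observed mismatch rate. The add-one smoothing and all names are assumptions of this sketch, not a description of any tool's implementation.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """observations: iterable of (reported_q, cycle, is_mismatch) tuples.

    Returns {(reported_q, cycle): empirical Phred quality} for each covariate bin.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for reported_q, cycle, is_mismatch in observations:
        bin_counts = counts[(reported_q, cycle)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    table = {}
    for key, (mismatches, total) in counts.items():
        error_rate = (mismatches + 1) / (total + 2)  # add-one smoothing (assumed)
        table[key] = -10 * math.log10(error_rate)
    return table


# Toy data: bases reported as Q30 at cycle 1, with 9 mismatches out of 98 bases.
toy = [(30, 1, True)] * 9 + [(30, 1, False)] * 89
print(empirical_qualities(toy))
```

In this toy bin the machine reports Q30 but the observed mismatch rate corresponds to roughly Q10, so a recalibrated quality for bases in that bin would be lowered accordingly.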

Variant identification

Diploid genome, Multiple individuals, Cancer genome/pooled sequencing
So it is not that straightforward
Sequencing error should be considered
How to call a variant
Factors to be considered
Number of reads supporting each genotype (10G/1A vs. 5G/6A)
Base quality for each nucleotide
Alignment quality for each read
Sequence depth
Sequencing error —— machine related
Output: probability for each genotype (AA, A/G, or GG)
Bayesian approach
Bayesian inference is a method of statistical inference in which evidence is used to update the uncertainty of parameters and predictions in a probability model
One locus, n reads
k reads support A
n-k reads support G
Three possible genotypes
AA: observing n-k errors (G) in n reads
A/G: binomial model. In theory, we should have half A reads and half G reads
GG: observing k errors (A) in n reads
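The likelihoods for the three genotypes (AA, A/G, GG) can be written out explicitly. The following is one common simplification and is an assumption here, since the original formulas did not survive in these notes: a single per-base error rate ε (a symbol introduced for this sketch), errors that always convert one allele into the other, and no error term in the heterozygous case.

```latex
\begin{aligned}
p(D \mid AA)  &= \binom{n}{k}\,(1-\varepsilon)^{k}\,\varepsilon^{\,n-k} \\
p(D \mid A/G) &= \binom{n}{k}\,\left(\tfrac{1}{2}\right)^{n} \\
p(D \mid GG)  &= \binom{n}{k}\,\varepsilon^{\,k}\,(1-\varepsilon)^{\,n-k}
\end{aligned}
```

Here D is the observed data at the locus: k reads supporting A and n-k reads supporting G.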
Variant Quality
Prior of genotypes:
P(AA) = P(GG) = (1-r)/2
P(A/G) = r
r is pre-defined probability of observing heterozygotes
r = 0.2 for known SNP loci
r = 0.001 for unknown loci
Posterior probability p(g | D)
p(g | D) = p(g) * p(D | g) / p(D)
p(D) = p(D | AA) p(AA) + p(D | A/G) p(A/G) + p(D | GG) p(GG)
Variant quality:
Q = -10log₁₀(1-p(g| D))
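Putting the prior, posterior, and quality formulas together gives a small genotype caller. This is a minimal sketch: the binomial likelihoods with a fixed per-base error rate of 0.01 are an assumption (the notes do not give the error model), and the function names are illustrative. The two calls at the end use the read counts from the earlier example (10G/1A and 5G/6A).

```python
import math


def genotype_posteriors(k, n, error_rate=0.01, r=0.001):
    """Posterior p(g | D) for g in AA, A/G, GG, given k A-reads out of n reads."""
    e = error_rate
    c = math.comb(n, k)
    likelihood = {
        "AA": c * (1 - e) ** k * e ** (n - k),   # the n-k G reads are errors
        "A/G": c * 0.5 ** n,                     # each read is A or G with probability 1/2
        "GG": c * e ** k * (1 - e) ** (n - k),   # the k A reads are errors
    }
    prior = {"AA": (1 - r) / 2, "A/G": r, "GG": (1 - r) / 2}
    evidence = sum(likelihood[g] * prior[g] for g in likelihood)  # p(D)
    return {g: likelihood[g] * prior[g] / evidence for g in likelihood}


def call_with_quality(k, n, **kwargs):
    """Return (best genotype, Q) with Q = -10 * log10(1 - p(g | D))."""
    post = genotype_posteriors(k, n, **kwargs)
    best = max(post, key=post.get)
    # 1 - p(best | D) is computed as the sum of the other posteriors for numerical stability.
    error = sum(p for g, p in post.items() if g != best)
    return best, -10 * math.log10(error)


print(call_with_quality(k=1, n=11))  # 10 G reads, 1 A read
print(call_with_quality(k=6, n=11))  # 5 G reads, 6 A reads
```

With the unknown-locus prior r = 0.001, a single A read among 11 is explained as a sequencing error (genotype GG), while a 5G/6A split overcomes the low heterozygote prior and is called A/G.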
Additional comments
This method only works for diploid genomes
Requires substantial coverage at each genomic locus (>20x coverage)
This is not always the case
For low to moderate sequence coverage, this won’t work
Leads to under-calling of heterozygotes
One assumption: independence among reads
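The under-calling of heterozygotes at low coverage can be quantified with a short simulation-free calculation. The sketch below assumes the same simplified model as the notes (binomial read counts, independent reads) plus a fixed error rate of 0.01 and the prior r = 0.001; both values and the function names are assumptions for illustration. For a truly heterozygous site it sums the probability of every read split that would not be called A/G.

```python
import math


def call_genotype(k, n, error_rate=0.01, r=0.001):
    """Most probable genotype given k A-reads out of n (binomial coefficient cancels out)."""
    e = error_rate
    weighted = {
        "AA": (1 - r) / 2 * (1 - e) ** k * e ** (n - k),
        "A/G": r * 0.5 ** n,
        "GG": (1 - r) / 2 * e ** k * (1 - e) ** (n - k),
    }
    return max(weighted, key=weighted.get)


def het_miss_rate(n, **kwargs):
    """Probability that a true A/G site at depth n is NOT called A/G."""
    return sum(
        math.comb(n, k) * 0.5 ** n
        for k in range(n + 1)
        if call_genotype(k, n, **kwargs) != "A/G"
    )


for depth in (4, 10, 20, 30):
    print(depth, het_miss_rate(depth))
```

Under these assumptions the miss rate is high at a depth of 4 and falls sharply as depth increases, which is consistent with the >20x coverage requirement above.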
Genotype based on multiple individuals
Variant calls based on multiple individuals dramatically increase the accuracy
Nielsen et al. Nature Reviews Genetics, 2010