The figures in this post come from the course video: Next-Generation Sequencing Data Analysis, Lecture 2
Lecture 2: Basics of data processing
Review Lecture 1
Outline
Data analysis workflow
Sequence quality evaluation
Phred scores
NGS error rates
Alignment
Smith-Waterman algorithm
Theories on short reads alignment
Suffix tree, indexing, and Burrows-Wheeler transformation
Comparison of different aligners
Data formats
FASTQ, SAM, pileups, VCF
Data visualization
Genome Browsers, IGV,…
Data analysis workflow
HiSeq 2000 200G run
Image data: 32TB
Intensity Data: 2TB
Base call/quality score data: 250GB
Alignment output: 6TB(3TB), 1.2TB after intermediate files removed
Major steps for secondary analysis
Raw data → QC filter → Alignment → Annotation
Sequence quality
Base quality
For every nucleotide
Reported by the sequencer
Mapping quality (alignment quality)
For every read
Reported by the aligners
Consensus quality (variant call quality)
For every genomic locus
Reported by the variant callers
Quality scores
Phred scores
Published in 1998
Initially developed for human genome project
Widely used to characterize the quality of DNA sequence
Q = -10 log10(P), where P is the probability that the base call is wrong
Q = 10; P = 0.1; acc = 90%
Q = 20; P = 0.01; acc = 99%
…
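The relationship between Q and P above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the quality-string decoder assumes the Phred+33 ASCII encoding, which the notes do not specify.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    The ASCII offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in qual]

# Reproduce the table above for a few common scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q = {q}; P = {p:g}; acc = {100 * (1 - p):g}%")
```

Each increase of 10 in Q corresponds to a tenfold drop in the error probability.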
Sequence alignment
A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity
Helps in inferring functional, structural, or evolutionary relationships between the sequences
Goal: find out the best matching sequences
Global vs Local
Alignment theories
Scoring matrix, or penalty scheme
Protein: PAM and BLOSUM
DNA/RNA
Match = 1
Mismatch = 0
Gap
d = 3 (gap opening)
e = 0.1 (gap extension)
Global alignment
must account for all characters of each sequence
Needleman-Wunsch algorithm
Local alignment
accounts for only a continuous portion of each sequence
Smith-Waterman algorithm
Searching can start/end anywhere
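To make the local alignment idea concrete, here is a minimal Smith-Waterman sketch. It uses a simple linear gap penalty rather than the gap opening/extension scheme listed above, and the default scores (match = 1, mismatch = -1, gap = -2) are illustrative assumptions, not the lecture's values.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scores and the linear gap penalty are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment start anywhere (the "local" part).
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

Dropping the `0` from the `max` and initialising the first row and column with cumulative gap penalties would turn this into a global (Needleman-Wunsch style) alignment.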
Fast alignment for short reads
Short reads aligner
Major challenge: running dynamic programming 1 trillion times (once per read) is not practical
Strategy: making a dictionary (index)
Problem: a 50-nt index is far too large: 4^50 ≈ 1.3 × 10^30
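The size of a full 50-nt dictionary can be verified directly. In this small sketch, `kmer_space` is a hypothetical helper name, and the genome length of about 3.1 × 10^9 bases is an assumed round figure for a human-sized genome, used only for comparison.

```python
def kmer_space(k, alphabet_size=4):
    """Number of distinct sequences of length k over the DNA alphabet."""
    return alphabet_size ** k

all_50mers = kmer_space(50)
print(f"4^50 = {all_50mers:.1e}")

# Assumed genome length, for comparison only.
genome_length = 3_100_000_000
# A genome contains at most (length - k + 1) distinct k-mers.
present = genome_length - 50 + 1
print(f"50-mers that can actually occur in the genome: at most {present:.1e}")
```

The gap between the two numbers is why aligners index the k-mers that occur in the reads or the reference rather than every possible k-mer.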
Things to consider
Features: reads are short and massive in number
Cost
Speed, Resources required (memory)
Alignment quality
Gaps allowed?
Information considered
Base sequence quality considered?
Accuracy
Short reads aligner strategies
Three common strategies
Hash table
Seed-extend paradigm
Space allowance
Suffix/Prefix tree
Suffix array
Burrows-Wheeler transformation
Merge sorting (not commonly used)
Hash table - based algorithm
Algorithms
Hashing reads
Eland, MAQ, Mosaik…
Hashing reference genome
BFAST, Mosaik, SOAP
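The "hashing the reference genome" variant can be sketched as follows. This is a toy illustration of the seed-extend idea under simplifying assumptions (exact matching only, the seed is the first k bases of the read); it is not how any of the tools named above are actually implemented.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Hash table from every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read_exact(read, reference, index, k):
    """Seed with the read's first k-mer, then extend by direct comparison."""
    hits = []
    for pos in index.get(read[:k], []):
        # Extension step: check the whole read at the candidate position.
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
print(index["ACGT"])                                    # seed candidates
print(map_read_exact("ACGTTA", reference, index, k=4))  # verified hits
```

The seed lookup is a constant-time hash access; the cost moves into verifying each candidate position.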
Hash table - space allowance
Perfect match is straightforward, but not useful to identify genetic variants
Solution: using multiple indices that allow mismatches
More than one way to build mask
Allow 1-nt mismatch
m: seed length
w: weight (number of counted nt)
k: number of allowed mismatches
n: number of indices
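A small sketch of how multiple masked indices tolerate a mismatch, using the parameters defined above. The masks here are the simplest possible choice (m = 8, w = 4, k = 1, n = 2: the two halves of the seed), picked for illustration only; they are not the optimised designs discussed in the lecture. With one mismatch allowed, at least one half of the seed must still match exactly.

```python
def apply_mask(seq, mask):
    """Keep only the characters of seq at positions where mask is '1'."""
    return "".join(c for c, bit in zip(seq, mask) if bit == "1")

def build_masked_indices(reference, masks):
    """Build one hash index per mask over all length-m reference windows."""
    m = len(masks[0])
    indices = []
    for mask in masks:
        index = {}
        for pos in range(len(reference) - m + 1):
            key = apply_mask(reference[pos:pos + m], mask)
            index.setdefault(key, []).append(pos)
        indices.append(index)
    return indices

def find_candidates(read, masks, indices):
    """Positions where the read matches under at least one mask."""
    hits = set()
    for mask, index in zip(masks, indices):
        hits.update(index.get(apply_mask(read, mask), []))
    return sorted(hits)

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_with_mismatches(read, reference, masks, indices, k):
    """Verify candidates and keep those with at most k mismatches."""
    return [pos for pos in find_candidates(read, masks, indices)
            if hamming(read, reference[pos:pos + len(read)]) <= k]

masks = ["11110000", "00001111"]   # illustrative masks, weight w = 4
reference = "ACGTACGTTAGCCA"
indices = build_masked_indices(reference, masks)
read = "ACGTTAGA"                  # reference[4:12] with its last base changed
print(find_candidates(read, masks, indices))
print(map_with_mismatches(read, reference, masks, indices, k=1))
```

The candidate list includes a false positive that the verification step removes, which is the trade-off behind choosing the seed weight.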
What's the best mask design?
Seed weight w too small: too many false positives, which slow down the mapping process
Seed weight w too high: more seeds are needed to achieve full sensitivity, which requires more memory
Optimal mask design: Lin et al. Bioinformatics (2008): ZOOM! Zillions of oligos mapped
mismatch k = 2
Suffix/prefix tree
Problems of hash table based strategy:
Alignment to multiple identical copies of a substring in the reference must be performed for each copy.
When the genome contains many repetitive sequences, this slows alignment down.
Suffix/prefix tree (trie) can handle this well
Fast query: O(n), where n is the length of the query sequence.
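The O(n) query can be illustrated with a naive suffix trie. This is a teaching sketch only: the class and function names are hypothetical, and building the trie this way takes time and memory quadratic in the reference length, so it only works on toy sequences.

```python
class Node:
    def __init__(self):
        self.children = {}   # next base -> child Node
        self.positions = []  # start positions of suffixes passing through here

def build_suffix_trie(text):
    """Insert every suffix of text into a trie (naive, quadratic)."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.positions.append(start)
    return root

def find_occurrences(root, query):
    """Walk one path of len(query) edges, regardless of how many copies exist."""
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.positions

trie = build_suffix_trie("ACGTACGTTAGC")
print(find_occurrences(trie, "ACGT"))  # both copies found with one walk
print(find_occurrences(trie, "TTAG"))
```

Identical copies of a substring share a single path in the trie, so a repeated sequence is matched once and all of its positions are read off the final node.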